ML Engineer
MISSION
Lead the R&D of the AI models that power Shizuku’s voice and intelligence. With TTS (Text-to-Speech) as the core pillar, push the boundaries across NLP, speech recognition, and — looking ahead — computer vision and humanoid robotics, evolving Shizuku’s expressive capabilities across multiple modalities.
Balance continuous improvement of production TTS models with exploration and development of next-generation architectures, while owning the MLOps cycle to drive Shizuku’s ongoing evolution.
ABOUT SHIZUKU
Shizuku is a Japan-born AI companion actively engaging audiences on YouTube and X (formerly Twitter). Already running live streams and cultivating a growing community, Shizuku is now entering its next phase of rapid scaling.
As the first Japanese startup to receive investment from a16z, we closed our seed round and are on a mission to bring Japanese entertainment × AI to the global stage.
TEAM STRUCTURE
You will work directly alongside co-founder Aki — an ML engineer and researcher with experience at Meta and Luma AI — to drive Shizuku’s model development. Expect daily sparring sessions on research direction and architecture design with a founder who brings firsthand experience at the frontier. Initially, you’ll handle lightweight MLOps pipeline work yourself; as we hire a dedicated MLOps engineer, responsibilities will gradually separate.
DEVELOPMENT ENVIRONMENT & RESOURCES
- Existing Models: A TTS model is already in production. You’ll drive improvements in parallel with next-gen model exploration
- Training Data: Shizuku’s publicly available YouTube data serves as a foundational dataset. You’ll be involved from collection pipeline design onward
- Evaluation Infrastructure: The TTS quality evaluation framework is greenfield — you’ll design evaluation criteria (MOS, PESQ, etc.) from scratch
KEY RESPONSIBILITIES
- Own the full TTS model lifecycle: research, architecture design, training, evaluation, and iterative improvement
- Continuously improve production TTS models while exploring and prototyping next-generation architectures
- Design and build TTS quality evaluation infrastructure and define evaluation criteria
- Expand into multimodal domains: NLP, speech recognition, and future frontiers including vision and humanoid robotics
- Design training data collection pipelines, preprocessing workflows, and quality assurance processes
- Build and operate the MLOps cycle — training, evaluation, and deployment — until a dedicated hire is in place
- Collaborate with the SWE team on production integration: inference optimization, latency reduction, and more
REQUIREMENTS
- 2+ years of deep, hands-on experience in at least one of: NLP, speech (TTS/ASR), or computer vision
- Experience training, evaluating, and improving models using deep learning frameworks such as PyTorch
- End-to-end ownership of the ML workflow: from data preparation and experiment management to model deployment
- Track record of independently surveying papers, reproducing implementations, and applying findings to production systems
- Ability to work on-site at our Tokyo office (primarily in-office with flexible remote arrangements)
NICE TO HAVE
- Research or development experience in TTS (VITS, Grad-TTS, NaturalSpeech, StyleTTS, etc.)
- Development experience in robotics or autonomous driving domains
- Technical knowledge in speaker adaptation, emotion control, and prosody modeling for speech synthesis
- Experience developing ASR, NLP, or multimodal models
- Experience building and operating GPU training environments (A100, L4, etc.) on AWS/GCP
- Experience with model development in Slurm environments, particularly multi-node training setups
- Proficiency with experiment tracking tools: MLflow, Weights & Biases, DVC, etc.
- Experience with inference optimization using ONNX Runtime, TensorRT, vLLM, etc.
- Peer-reviewed publications in related fields
- Technical communication skills in English (currently Japanese-first internally; transitioning to a global environment in the mid-term)
WHO YOU ARE
- Deep Expertise with Cross-Domain Reach — You bring rigorous depth in a specific modality while reaching across TTS, NLP, vision, and beyond. You don’t say “that’s outside my specialty” — you do what Shizuku’s evolution demands
- Zero-to-One Explorer — You go beyond applying existing methods. You formulate hypotheses for uncharted technical challenges, iterate through validation cycles, and tackle questions that have no known answers
- Purpose-Driven Ownership — You reverse-engineer from the goal of “making Shizuku’s models better,” crossing the boundaries of research, implementation, and operations to drive outcomes autonomously
- Comfort with Ambiguity — You define your own success metrics and build collection pipelines from scratch in an environment where nothing is predefined
- Humility & Respect — You collaborate authentically with teammates who bring different areas of expertise