ML Engineer
MISSION
Lead the R&D of the AI models that power Shizuku’s voice and intelligence. With TTS (Text-to-Speech) as the core pillar, push the boundaries across NLP, speech recognition, and — looking ahead — computer vision and humanoid robotics, evolving Shizuku’s expressive capabilities across multiple modalities.
Balance continuous improvement of production TTS models with exploration and development of next-generation architectures, while owning the MLOps cycle to drive Shizuku’s ongoing evolution.
ABOUT SHIZUKU
Shizuku is a Japan-born AI companion actively engaging audiences on YouTube and X (formerly Twitter). Already running live streams and cultivating a growing community, Shizuku is now entering its next phase of rapid scaling.
As the first Japanese startup to receive investment from a16z, we closed our seed round and are on a mission to bring Japanese entertainment × AI to the global stage.
TEAM STRUCTURE
You will work directly alongside co-founder Aki — an ML engineer and researcher with experience at Meta and Luma AI — to drive Shizuku’s model development. Expect daily sparring sessions on research direction and architecture design with a founder who brings firsthand experience at the frontier. Initially, you’ll handle lightweight MLOps pipeline work yourself; as we hire a dedicated MLOps engineer, responsibilities will gradually separate.
DEVELOPMENT ENVIRONMENT & RESOURCES
- Existing Models: A TTS model is already in production. You’ll drive improvements in parallel with next-gen model exploration
- Training Data: Shizuku’s publicly available YouTube data serves as a foundational dataset. You’ll be involved from collection pipeline design onward
- Evaluation Infrastructure: The TTS quality evaluation framework is greenfield — you’ll design evaluation criteria (MOS, PESQ, etc.) from scratch
KEY RESPONSIBILITIES
- Own the full TTS model lifecycle: research, architecture design, training, evaluation, and iterative improvement
- Continuously improve production TTS models while exploring and prototyping next-generation architectures
- Design and build TTS quality evaluation infrastructure and define evaluation criteria
- Expand into multimodal domains: NLP, speech recognition, and future frontiers including vision and humanoid robotics
- Design training data collection pipelines, preprocessing workflows, and quality assurance processes
- Build and operate the MLOps cycle — training, evaluation, and deployment — until a dedicated hire is in place
- Collaborate with the SWE team on production integration: inference optimization, latency reduction, and more
REQUIREMENTS
- 2+ years of deep, hands-on experience in at least one of: NLP, speech (TTS/ASR), or computer vision
- Experience training, evaluating, and improving models using deep learning frameworks such as PyTorch
- End-to-end ownership of the ML workflow: from data preparation and experiment management to model deployment
- Track record of independently surveying papers, reproducing implementations, and applying findings to production systems
- Ability to work on-site at our Tokyo office (primarily in-office with flexible remote arrangements)
NICE TO HAVE
- Research or development experience in TTS (VITS, Grad-TTS, NaturalSpeech, StyleTTS, etc.)
- Development experience in robotics or autonomous driving domains
- Technical knowledge in speaker adaptation, emotion control, and prosody modeling for speech synthesis
- Experience developing ASR, NLP, or multimodal models
- Experience building and operating GPU training environments (A100, L4, etc.) on AWS/GCP
- Experience with model development in Slurm environments, particularly multi-node training setups
- Proficiency with experiment tracking tools: MLflow, Weights & Biases, DVC, etc.
- Experience with inference optimization using ONNX Runtime, TensorRT, vLLM, etc.
- Peer-reviewed publications in related fields
- Technical communication skills in English (currently Japanese-first internally; transitioning to a global environment in the mid-term)
WHO YOU ARE
- Deep Expertise with Cross-Domain Reach — You bring rigorous depth in a specific modality while reaching across TTS, NLP, vision, and beyond. You don’t say “that’s outside my specialty” — you do what Shizuku’s evolution demands
- Zero-to-One Explorer — You go beyond applying existing methods. You formulate hypotheses for uncharted technical challenges, iterate through validation cycles, and tackle questions that have no known answers
- Purpose-Driven Ownership — You reverse-engineer from the goal of “making Shizuku’s models better,” crossing the boundaries of research, implementation, and operations to drive outcomes autonomously
- Comfort with Ambiguity — You define your own success metrics and build collection pipelines from scratch in an environment where nothing is predefined
- Humility & Respect — You collaborate authentically with teammates who bring different areas of expertise