ML Engineer

Tokyo, Japan | Full Time | Posted April 15, 2026

MISSION

Lead the R&D of the AI models that power Shizuku’s voice and intelligence. With TTS (Text-to-Speech) as the core pillar, push the boundaries across NLP, speech recognition, and — looking ahead — computer vision and humanoid robotics, evolving Shizuku’s expressive capabilities across multiple modalities.

Balance continuous improvement of production TTS models with exploration and development of next-generation architectures, while owning the MLOps cycle to drive Shizuku’s ongoing evolution.

 

ABOUT SHIZUKU

Shizuku is a Japan-born AI companion actively engaging audiences on YouTube and X (formerly Twitter). Already running live streams and cultivating a growing community, Shizuku is now entering its next phase of rapid scale.

As the first Japanese startup to receive investment from a16z, we closed our seed round and are on a mission to bring Japanese entertainment × AI to the global stage.

 

TEAM STRUCTURE

You will work directly alongside co-founder Aki — an ML engineer and researcher with experience at Meta and Luma AI — to drive Shizuku’s model development. Expect daily sparring sessions on research direction and architecture design with a founder who brings firsthand experience at the frontier. Initially, you’ll handle lightweight MLOps pipeline work yourself; as we hire a dedicated MLOps engineer, responsibilities will gradually separate.

 

DEVELOPMENT ENVIRONMENT & RESOURCES

  • Existing Models: A TTS model is already in production. You’ll drive improvements in parallel with next-gen model exploration
  • Training Data: Shizuku’s publicly available YouTube data serves as a foundational dataset. You’ll be involved from collection pipeline design onward
  • Evaluation Infrastructure: The TTS quality evaluation framework is greenfield — you’ll design evaluation criteria (MOS, PESQ, etc.) from scratch

 

KEY RESPONSIBILITIES

  • Own the full TTS model lifecycle: research, architecture design, training, evaluation, and iterative improvement
  • Continuously improve production TTS models while exploring and prototyping next-generation architectures
  • Design and build TTS quality evaluation infrastructure and define evaluation criteria
  • Expand into multimodal domains: NLP, speech recognition, and future frontiers including vision and humanoid robotics
  • Design training data collection pipelines, preprocessing workflows, and quality assurance processes
  • Build and operate the MLOps cycle — training, evaluation, and deployment — until a dedicated hire is in place
  • Collaborate with the SWE team on production integration: inference optimization, latency reduction, and more

 

REQUIREMENTS

  • 2+ years of deep expertise and hands-on experience in at least one of: NLP, speech (TTS/ASR), or computer vision
  • Experience training, evaluating, and improving models using deep learning frameworks such as PyTorch
  • End-to-end ownership of the ML workflow: from data preparation and experiment management to model deployment
  • Track record of independently surveying papers, reproducing implementations, and applying findings to production systems
  • Ability to work on-site at our Tokyo office (primarily in-office with flexible remote arrangements)

 

NICE TO HAVE

  • Research or development experience in TTS (VITS, Grad-TTS, NaturalSpeech, StyleTTS, etc.)
  • Development experience in robotics or autonomous driving domains
  • Technical knowledge in speaker adaptation, emotion control, and prosody modeling for speech synthesis
  • Experience developing ASR, NLP, or multimodal models
  • Experience building and operating GPU training environments (A100, L4, etc.) on AWS/GCP
  • Experience with model development in Slurm environments, particularly multi-node training setups
  • Proficiency with experiment tracking tools: MLflow, Weights & Biases, DVC, etc.
  • Experience with inference optimization using ONNX Runtime, TensorRT, vLLM, etc.
  • Peer-reviewed publications in related fields
  • Technical communication skills in English (currently Japanese-first internally; transitioning to a global environment in the mid-term)

 

WHO YOU ARE

  • Deep Expertise with Cross-Domain Reach — You bring rigorous depth in a specific modality while reaching across TTS, NLP, vision, and beyond. You don’t say “that’s outside my specialty” — you do what Shizuku’s evolution demands
  • Zero-to-One Explorer — You go beyond applying existing methods. You formulate hypotheses for uncharted technical challenges, iterate through validation cycles, and tackle questions that have no known answers
  • Purpose-Driven Ownership — You reverse-engineer from the goal of “making Shizuku’s models better,” crossing the boundaries of research, implementation, and operations to drive outcomes autonomously
  • Comfort with Ambiguity — You define your own success metrics and build collection pipelines from scratch in an environment where nothing is predefined
  • Humility & Respect — You collaborate authentically with teammates who bring different areas of expertise