Summary
Apple Services Engineering (ASE) designs, builds, and operates the cloud infrastructure, server systems, and platform technologies that power many of Apple's most beloved experiences.
Within ASE, the Storage Platforms organization develops the systems that store, protect, and serve Apple's data at massive scale, with a mission to deliver storage that is durable, secure, highly available, and operated with excellence. Engineers on this team have the rare opportunity to work on low-level storage optimized for the underlying devices, large-scale distributed systems, and high-performance IO stacks operating at mission-critical levels of availability and durability.
Each component is being built from the ground up on first principles to unlock optimization opportunities at every layer of the stack. Being part of the Apple Services Engineering organization opens the door to cross-functional influence and broader organizational impact.
If you are passionate about large-scale distributed systems, operational excellence, and creating resilient platforms that enable innovation across Apple, we would love to hear from you.
Description
We are seeking a highly skilled, collaborative, and pragmatic Storage Site Reliability Engineer to join our team. In this role, you will help build and operate reliable, scalable storage infrastructure that supports rapidly growing platform needs. You will partner with cross-functional teams across software engineering, compute, networking, and infrastructure to design and implement automation, improve observability, strengthen incident response, and enhance the overall reliability of the platform.
The team contributes to all major aspects of storage deployment infrastructure, including maintenance automation, backup and recovery services, monitoring and alerting tooling, dashboards, deployment architecture, and database improvements focused on stability, performance, and scale. You will also play an important role in shaping the evolution of the platform as it scales by orders of magnitude.
Success in this role requires a passion for large-scale distributed systems, strong problem-solving ability, excellent communication, and a customer-focused mindset when working with internal platform users. Experience working effectively in a distributed team environment is highly valued.
Minimum Qualifications
3+ years of experience in Site Reliability Engineering or infrastructure engineering
Strong analytical and problem-solving skills, with careful attention to detail
Experience designing, building, or operating storage systems
2+ years of programming experience in one or more of the following languages: Rust, C++, Java, or C#
Experience with scripting languages such as Bash, Python, or Perl
Strong understanding of operating systems fundamentals, including multithreading, memory management, networking, storage, performance, and scalability
Bachelor’s degree in Computer Science, a related engineering field, or equivalent practical experience
Preferred Qualifications
Excellent knowledge of software testing methodologies and practices
Deep understanding of core computer science concepts, including data structures, algorithms, and concurrency
Solid grasp of distributed systems fundamentals such as fault tolerance, consistency, and distributed rate limiting
Experience designing and operating large-scale distributed systems such as databases or storage platforms
Proficient with UNIX/Linux