Principal Engineer - Systems for ML Inference and Training Optimization, Deep Science for Systems and Services

Job date: 11 November 2025

Location: 72 Tübingen; Baden-Württemberg

Employer: Amazon

Job details

Description

We are seeking an exceptional Principal Engineer specializing in ML Systems, training, and inference optimization to lead our technical strategy and implementation for next-generation AI performance at scale. This role requires deep expertise in performance engineering, distributed systems architecture, low-level systems optimization, and the ability to drive technical excellence across multiple teams. You will set the technical direction for kernel-level optimizations, define architectural strategies for heterogeneous compute platforms, architect multi-GPU and multi-node training systems, and lead the delivery of solutions that fundamentally change how AWS serves ML training and inference workloads.
As a Principal Engineer in DS3, you will be a key technical leader responsible for organization-level architecture and performance strategy spanning the entire ML lifecycle, from distributed training of frontier models to high-throughput inference serving. You will work at the lowest levels of the software stack: defining standards for CUDA kernel development, optimizing assembly-level code (e.g., NVIDIA PTX code), architecting cross-platform acceleration strategies including GPUs and AWS Neuron, designing efficient multi-node communication patterns, and inventing novel approaches to achieve 10× or greater performance improvements. Your work will directly influence AWS's competitive position in AI infrastructure and set the standard for ML systems engineering across the industry.

Utility Computing (UC)
AWS Utility Computing (UC) provides product innovations, from foundational services such as Amazon’s Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2) to consistently released new product innovations that continue to set AWS’s services and features apart in the industry. As a member of the UC organization, you’ll support the development and management of Compute, Database, Storage, Internet of Things (IoT), Platform, and Productivity Apps services in AWS, including support for customers who require specialized security solutions for their cloud services.

Key job responsibilities
Technical Strategy & Vision: Define and drive the technical strategy and architectural roadmap for ML inference and training optimization across multiple teams within your organization. Bring systems architecture and performance engineering context to strategic business decisions.
Cross-Platform Performance Leadership: Lead the design and architecture of kernel-level optimizations spanning NVIDIA GPUs, AWS Inferentia/Trainium, and emerging AI accelerators. Establish standards and best practices for low-level optimization across the organization.
Intrinsically Hard Problems: Tackle the most difficult performance challenges—endemic bottlenecks, architectural complexity that prevents innovation, and critical business/technical problems requiring order-of-magnitude improvements (10× or greater).
Systems-Level Innovation: Drive the design, implementation, and delivery of performance solutions at the program level that address significantly large or endemic customer and business problems across your organization and potentially others.
Hardware-Software Co-Design: Establish deep understanding of new SoCs, GPUs, and AI accelerators; derive guidelines for optimal utilization and influence hardware selection decisions based on fact-based analysis and resource budgeting.
Technical Excellence & Force Multiplication: Set the standard for engineering excellence in your organization. Create mechanisms, tools, and processes that enable performance measurement, analysis, and optimization at scale across multiple teams.
Organization-Level Influence: Align teams across your organization toward coherent performance strategies and architectural decisions. Drive adoption of new optimization approaches, concepts, and paradigms. Lead the most important and complex technical reviews.
Team Development: Guide the career growth of senior engineers in your organization. Mentor and develop the next generation of performance engineering leaders. Participate in Principal promotion assessments and help grow the Principal Engineering community.
Hands-On Technical Leadership: Remain a practitioner—personally writing critical-path code, designing zero-overhead portable libraries, and prototyping solutions that inform technical direction for your organization.

About the team
Deep Science for Systems and Services (DS3) is a science organization within AWS Compute & ML Services focused on advancing AI/ML technologies at the systems level. Our team works at the intersection of machine learning and high-performance computing, developing optimizations for large model inference across diverse hardware platforms. We push the boundaries of what's possible in ML inference performance, working directly with CUDA, AWS Neuron, and other low-level compute abstractions to deliver order-of-magnitude performance improvements and industry-leading cost-performance for AWS customers deploying AI at scale.

About AWS
Diverse Experiences
Amazon values diverse experiences. Even if you do not meet all of the preferred qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying.

Why AWS
Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform. We pioneered cloud computing and never stopped innovating — that’s why customers from the most successful startups to Global 500 companies trust our robust suite of products and services to power their businesses.
Work/Life Balance
We value work-life harmony. Achieving success at work should never come at the expense of sacrifices at home, which is why we strive for flexibility as part of our working culture. When we feel supported in the workplace and at home, there’s nothing we can’t achieve in the cloud.
Inclusive Team Culture
Here at AWS, it’s in our nature to learn and be curious. Our employee-led affinity groups foster a culture of inclusion that empowers us to be proud of our differences. Ongoing events and learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (diversity) conferences, inspire us to never stop embracing our uniqueness.
Mentorship and Career Growth
We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional.

Basic qualifications

10+ years of software development experience with demonstrated progression in technical leadership and impact.
Expert-level proficiency in C/C++ and low-level systems programming with a proven track record of delivering order-of-magnitude (10× or greater) performance improvements in production systems.
Extensive experience with CUDA programming, GPU architecture, assembly-level optimization (e.g., NVIDIA PTX), and kernel development across multiple hardware platforms.
Demonstrated ability to lead organization-level technical initiatives spanning multiple teams, building consensus on contentious technical decisions and driving architectural strategy.
Experience defining technical roadmaps, conducting performance analysis and resource budgeting, and translating system analysis into strategic development plans.

Preferred qualifications

Master's degree (or higher) in Computer Science, Computer Engineering, or related technical field with 15+ years of performance engineering experience.
Experience optimizing ML inference and/or training workloads (LLMs, Transformers, CNNs) across diverse hardware: GPUs, AWS Neuron/Inferentia, and other accelerators.
Deep expertise across multiple hardware architectures and platforms (x86, ARM, multiple GPU generations, SoCs, custom accelerators) with the ability to quickly master new hardware platforms.
Track record of developing portable, high-performance libraries, tools, or frameworks used across engineering organizations or open-source projects with significant adoption.
Experience leading large-scale optimization initiatives or coordinating performance engineering efforts across multiple teams and organizations.
Proven ability to establish deep understanding of complex systems and create performance measurement/analysis tools that provide critical insights for organization-wide use.
Entrepreneurial experience including startup founding, CTO role, or driving technical vision in product development environments.

Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify and build. Protecting your privacy and the security of your data is a longstanding top priority for Amazon. Please consult our Privacy Notice (https://www.amazon.jobs/en/privacy_page) to know more about how we collect, use and transfer the personal data of our candidates.

m/f/d

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

Application information

Job ID:
connecticum Job-1827693

Fields:

Electrical engineering, computer science, engineering
Engineering: General engineering
Computer science: Computer science, computer engineering
