Sr. Director - Backend Engineering
Coupang
Key Skills and Role Responsibilities:
This role is for a strategic and technical leader to define, build, and operate the infrastructure orchestration systems that power our organization's cutting-edge Artificial Intelligence (AI) initiatives. The Senior Director will lead a team responsible for ensuring a robust, scalable, cost-efficient, and high-performance platform for all stages of the AI lifecycle, from experimentation and training to deployment and inference.
Strategy and Leadership
-
Define and execute the long-term vision and roadmap for the company’s AI infrastructure Network Services, aligning it with overall business and AI Services goals.
-
Lead, mentor, and grow a high-performing engineering and operations team focused on AI infrastructure and platform engineering.
-
Manage budget and resource allocation for AI infrastructure Network Services deliverables.
-
Act as a key liaison between AI infrastructure and other service owners and consumers, core engineering, Cloud infrastructure, and executive leadership.
AI Infra Development and Operations
-
Oversee the design, implementation, and maintenance of the core network orchestration platforms for large-scale AI model training (e.g., distributed training, hyperparameter tuning) and deployment (e.g., containerization, serverless functions, edge deployment).
-
Ensure reliability, security, and compliance of the AI infrastructure, meeting strict standards for data governance and model integrity.
-
Establish Service Level Objectives (SLOs) and Key Performance Indicators (KPIs) for the AI platform services and lead efforts for continuous optimization and performance tuning.
Technology and Architecture
-
Evaluate, select, and integrate the core technologies required for the AI stack (e.g., cloud overlay/underlay networking, InfiniBand, load balancers, DNS, core networking, Kubernetes, Ray, GPU/accelerator management, distributed file systems).
-
Champion infrastructure-as-code (IaC) principles to manage and provision AI resources consistently and at scale.
Qualifications
Required
-
Education: Bachelor's or Master’s degree in Computer Science, Engineering, or a related technical field.
-
Experience:
-
15+ years of progressive experience in software engineering, infrastructure, or platform operations.
-
5+ years of experience leading and managing technical teams, ideally at the Director or Sr. Director level or in an equivalent capacity.
-
Deep, hands-on experience designing and operating large-scale distributed systems and cloud-native network architectures.
-
Proven experience specifically with AI infrastructure orchestration (e.g., using Kubernetes) and managing accelerated compute resources (GPUs, TPUs, etc.).
-
15+ years of experience in cloud backend engineering, cloud design, deployment, and DevOps.
-
15+ years of experience leading system design and architecture on private clouds and on AWS, Azure, and/or GCP.
-
10+ years of demonstrable experience building and operating infrastructure as code and infrastructure automation, plus comfort with various flavors of Linux.
-
15+ years of experience in building high-performance, highly available, and scalable distributed systems in the cloud.
-
15+ years of experience in building and managing high-performance, highly available, and scalable Hybrid Cloud environments.
-
Excellent cross-group collaboration and outstanding verbal and written communication skills.
-
Skills:
-
Expert-level knowledge of containerization and orchestration (Docker, Kubernetes).
-
Software-defined cloud networking.
-
Strong background in DevOps and MLOps principles and tooling.
-
Proficiency in at least one modern programming language (e.g., Python, Go).
-
Exceptional strategic planning, organizational, and written/verbal communication skills.
Preferred
-
Prior experience managing infrastructure for training and inference of large language models (LLMs) or foundation models.
-
Experience in a regulated industry with strict compliance requirements.
-
Experience building and operating an AI private cloud.
Success Metrics
A successful Senior Director - AI Infrastructure Orchestration will be measured by:
-
The time-to-market for AI infrastructure build, scale, and operation.
-
The resource utilization rate and cost efficiency of the AI compute infrastructure.
-
The reliability and uptime of the core AI platform services.
-
The talent retention and development within the AI Infrastructure team.