SENIOR SITE RELIABILITY ENGINEER
Company: NVIDIA
Location: Santa Clara
Posted on: October 28, 2024
Job Description:
Joining NVIDIA's AI Efficiency Team means contributing to the
infrastructure that powers our innovative AI research. This team
focuses on optimizing efficiency and resiliency of AI workloads, as
well as developing scalable AI and Data infrastructure tools and
services. Our objective is to deliver a stable, scalable
environment for AI researchers, providing them with the necessary
resources and scale to foster innovation. We are seeking a Senior
Site Reliability Engineer (SRE) to join our team. You'll be
instrumental in designing, building, and maintaining cloud services
that enable large-scale AI training and inferencing. The
responsibilities include implementing software and systems
engineering practices to ensure high efficiency and availability of
the platform, as well as applying SRE principles to improve
production systems and optimize service SLOs. Additionally, the role
involves collaborating with our customers to plan and implement
changes to the existing system while monitoring capacity, latency,
and performance.

As a Senior SRE at NVIDIA, you will
have the opportunity to work on innovative technologies that power
the future of AI and data science, and be part of a dynamic and
supportive team that values learning and growth. The role provides
the autonomy to work on meaningful projects with the support and
mentorship needed to succeed, and contributes to a culture of
blameless postmortems, iterative improvement, and risk-taking. If
you are seeking an exciting and rewarding career that makes a
difference, we invite you to apply now!

What you'll be doing:
- Develop software solutions to ensure reliability and operability of
  large-scale systems supporting mission-critical use cases.
- Gain a deep understanding of our system operations, scalability,
  interactions, and failures to identify improvement opportunities
  and risks.
- Create tools and automation to reduce operational overhead and
  eliminate manual tasks.
- Establish frameworks, processes, and standard methodologies to
  enhance operational maturity and team efficiency and to accelerate
  innovation.
- Define meaningful and actionable reliability metrics to track and
  improve system and service reliability.
- Oversee capacity and performance management to facilitate
  infrastructure scaling across public and private clouds globally.
- Build tools to improve our service observability for faster issue
  resolution.
- Practice sustainable incident response and blameless postmortems.
- Skilled in problem-solving, root cause analysis, and optimization.

What we need to see:
- Minimum of 8 years of experience in SRE, Cloud platforms, or
  DevOps with large-scale microservices in production environments.
- Bachelor's degree or equivalent experience.
- Strong understanding of SRE principles, including error budgets,
  SLOs, and SLAs.
- Experience with AI training and inferencing and data
  infrastructure services.
- Expertise in building and operating large-scale observability
  platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
- Proficiency in programming languages such as Python, Go, and
  scripting languages.
- Hands-on experience with scaling distributed systems in public,
  private, or hybrid cloud environments.
- Experience in deploying, supporting, and supervising services,
  platforms, and application stacks.
- Knowledge of CI/CD systems, such as GitLab, and familiarity with
  Infrastructure as Code (IaC) methodologies and tools.
- Excellent communication and collaboration skills, and a culture of
  diversity, intellectual curiosity, problem solving, and openness
  are essential.

Ways to stand out from the crowd:
- Extensive experience with the Slurm workload manager and K8s.
- Good understanding of DL frameworks and orchestrators such as
  PyTorch, TensorFlow, JAX, and Ray.
- Strong background in software design and development.
- Experience operating large-scale distributed systems with strong
  SLAs.
- Extensive experience in operating data platforms with proficiency
  in incident, change, and problem management processes.

NVIDIA leads the way in groundbreaking
developments in Artificial Intelligence, High-Performance
Computing, and Visualization. The GPU, our invention, serves as the
visual cortex of modern computers and is at the heart of our
products and services. Our work opens up new universes to explore,
enables amazing creativity and discovery, and powers what were once
science fiction inventions, from artificial intelligence to
autonomous cars. NVIDIA is looking for exceptional people like you
to help us accelerate the next wave of artificial intelligence.

The base salary range is 180,000 USD - 339,250 USD. Your base salary
will be determined based on your location, experience, and the pay
of employees in similar positions.

You will also be eligible for equity and benefits
(https://www.nvidia.com/en-us/benefits/). NVIDIA accepts applications
on an ongoing basis.

NVIDIA is committed
to fostering a diverse work environment and proud to be an equal
opportunity employer. As we highly value diversity in our current
and future employees, we do not discriminate (including in our
hiring and promotion practices) on the basis of race, religion,
color, national origin, gender, gender expression, sexual
orientation, age, marital status, veteran status, disability status
or any other characteristic protected by law.
Keywords: NVIDIA, Berkeley, SENIOR SITE RELIABILITY ENGINEER, Professions, Santa Clara, California