Senior Software Engineer, Vertex AI Platform SRE
Company: Google
Location: Sunnyvale
Posted on: April 1, 2026
Job Description:
Minimum qualifications:
Bachelor’s degree in Computer Science, a related field, or equivalent practical experience.
5 years of experience with software development in one or more programming languages.
3 years of experience in designing, analyzing, and troubleshooting large-scale distributed systems.
2 years of experience leading projects and providing technical leadership.
Experience with cloud compute platforms (e.g., Kubernetes, Google Cloud Functions).
Experience in site reliability engineering, system design, and distributed computing.

Preferred qualifications:
Master's degree in Computer Science or Engineering.
Experience with Kubernetes, Google Cloud Platform (GCP), GKE Networking, and Istio.
Ability to demonstrate passion for technology and apply technical depth to uncover root causes of technical problems and provide guidance on solving them.

About the job
Site Reliability
Engineering (SRE) combines software and systems engineering to
build and run large-scale, massively distributed, fault-tolerant
systems. SRE ensures that Google Cloud's services, both our
internally critical and our externally visible systems, have
reliability and uptime appropriate to customers' needs and a fast
rate of improvement. Additionally, SREs keep an ever-watchful eye
on our systems' capacity and performance. Much of our software
development focuses on optimizing existing systems, building
infrastructure and eliminating work through automation. On the SRE
team, you’ll have the opportunity to manage the issues of scale
which are unique to Google Cloud, while using your expertise in
coding, algorithms, complexity analysis and large-scale system
design. SRE's culture of intellectual curiosity, problem solving
and openness is key to its success. Our organization brings
together people with a wide variety of backgrounds, experiences and
perspectives. We encourage them to collaborate, think big and take
risks in a blame-free environment. We promote self-direction to
work on meaningful projects, while we also strive to create an
environment that provides the support and mentorship needed to
learn and grow. In this role, you will help the Vertex Third-Party
(3P) AI Platform Site Reliability Engineering (SRE) team build and
run the infrastructure for third-party AI workloads on Google
Cloud's Vertex AI. You will enable a seamless experience for
third-party AI workloads on Vertex AI, focusing on reliability,
performance, scalability, observability, and efficiency.

Behind everything our users see online is the architecture built by
the Technical Infrastructure team to keep it running. From developing
and maintaining our data centers to building the next generation of
Google platforms, we make Google's product portfolio possible.
We're proud to be our engineers' engineers and love voiding
warranties by taking things apart so we can rebuild them. We keep
our networks up and running, ensuring our users have the best and
fastest experience possible. The US base salary range for this
full-time position is $174,000-$252,000 + bonus + equity + benefits. Our
salary ranges are determined by role, level, and location. Within
the range, individual pay is determined by work location and
additional factors, including job-related skills, experience, and
relevant education or training. Your recruiter can share more about
the specific salary range for your preferred location during the
hiring process. Please note that the compensation details listed in
US role postings reflect the base salary only, and do not include
bonus, equity, or benefits. Learn more about benefits at Google.

Responsibilities
Manage and scale Google Kubernetes Engine (GKE) fleets (approximately 60 thousand clusters and growing), including specialized mega-clusters designed for AI/ML models.
Optimize low-latency and high-throughput model serving across all layers of the platform stack to maximize performance for AI/ML inferences.
Architect and maintain connectivity and interactions between the internal control planes and the external Google Cloud Platform (GCP) based data planes in a hybrid environment.
Implement and configure essential cloud technologies, including Istio for service mesh management, advanced load balancing, Google Compute Engine (GCE), and various GCP networking and security components.
Keywords: Google, Redwood City, Senior Software Engineer, Vertex AI Platform SRE, IT / Software / Systems, Sunnyvale, California