Bayside Solutions

SRE/DevOps

in Cupertino, California

  • Req ID

    25097_1767830135

  • Job Category

    IT

  • Job Type

    Contract

  • Job Location

    Cupertino, California
    United States

Overview

SRE/DevOps

W2 Contract

Salary Range: $114,400 - $135,200 per year

Location: Cupertino, CA - Remote Role

Job Summary:

We are preparing a global-scale service rollout, beginning with the US East and US West regions and then expanding to five additional regions worldwide. We need a highly technical Site Reliability Engineer (SRE) / DevOps Engineer who can own infrastructure, tooling, and incident response for this mission-critical launch.

Duties and Responsibilities:

  • Global Rollout Execution

      • Lead deployment and operations in US East / West; extend to 5+ regions globally.

      • Maintain consistent environment setup and infrastructure standards across regions.

  • Core Systems Ownership

      • Operate and scale Amazon RDS, EKS clusters, and Ray distributed compute environments.

      • Optimize cluster performance, scheduling, and workload orchestration.

  • Operational Reliability

      • Serve as primary SRE for launch; manage incident triage, escalation, and postmortems.

      • Implement and enforce vac-pairing schedules for global on-call coverage.

      • Use AWS Systems Manager Incident Manager for automation and resolution.

  • Tooling & Observability

      • Build and manage monitoring dashboards with Grafana; integrate logs and traces in Splunk.

      • Develop automation and alerting pipelines for proactive incident detection.

      • Ensure instrumentation is consistent across multi-region services.

  • High-Severity Incident Management

      • Respond to Sev-0/Sev-1 incidents with immediate root-cause analysis.

      • Conduct resilience testing, chaos drills, and failover validation.

      • Protect our brand by meeting zero-downtime objectives during launch phases.

Requirements and Qualifications:

  • AWS Expertise - Deep hands-on experience with Amazon RDS, EKS, IAM, VPC; proven track record in multi-region deployments, HA/DR, and failover strategies at enterprise scale.

  • Kubernetes / EKS - 5+ years operating and scaling multi-cluster environments; advanced debugging and tuning at the cluster, pod, and network layers; Helm proficiency.

  • Incident Response & Ops - Expert in Sev-0/1 incident triage and recovery; experience with PagerDuty, OpsGenie, or AWS Incident Manager (any one of these tools); and large-scale runbook automation.

  • Observability & Monitoring - Strong in Grafana, Splunk, Prometheus, and tracing systems; ability to design end-to-end observability pipelines across global workloads.

  • Infrastructure as Code (IaC) - Production-grade Terraform, Crossplane, AWS CDK, or Ansible; ability to enforce parity across multiple AWS regions.

  • Programming & Automation - Intermediate-level experience in Python, Go, or Bash for scripting, automation, and tooling development.

  • Operational Rigor - Demonstrated ability to thrive in high-pressure, high-visibility environments; experience supporting global-scale product launches with strict zero-downtime objectives.

Preferred Qualifications:

  • Machine Learning Infrastructure - Familiarity with Amazon SageMaker (deployment, monitoring), feature stores, and ML pipeline operations.

  • Caching & Distributed Storage - Experience with Redis/ElastiCache and caching strategies for large-scale, high-throughput systems.

  • Data Lake & Governance - Hands-on experience with AWS Lake Formation, Glue, or similar tools for secure, governed multi-region data access.

  • Distributed Systems (Ray or equivalent) - Workload profiling, scheduling, and distributed compute optimization.

  • Chaos Engineering - Background in resilience testing, chaos drills, and automated failover validation.

  • Security & Compliance - Knowledge of multi-region security, compliance, and data protection frameworks for enterprise cloud workloads.

  • AI/ML Ops - Experience operationalizing ML in production: monitoring drift, scaling inference endpoints, and integrating ML workloads into SRE practices.

Bayside Solutions, Inc. is not able to sponsor any candidates at this time. Additionally, candidates for this position must qualify as W2 candidates.

Bayside Solutions, Inc. may collect your personal information during the position application process. Please reference Bayside Solutions, Inc.'s CCPA Privacy Policy at www.baysidesolutions.com.

    © 2026 Bayside Solutions. All Rights Reserved. Privacy Policy. Powered by Adverto Inc.