Bayside Solutions

Infrastructure Development Engineer

in Austin, Texas

  • Req ID

    24043_1761594905

  • Job Category

    IT

  • Job Type

    Contract

  • Job Location

    Austin, Texas
    United States

Overview

Infrastructure Development Engineer

W2 Contract

Salary Range: $124,800 - $145,600 per year

Location: Austin, TX - Remote Role

Duties and Responsibilities:

Platform Reliability & Operations

  • Own end-to-end reliability for our AI Agent Platform across all environments (Dev, Staging, Production).
  • Maintain and optimize EKS clusters, databases, and LangGraph/LangSmith environments.
  • Implement and manage proactive monitoring, alerting, and tracing systems across platform components.
  • Drive root-cause analysis (RCA) and implement incident prevention automations.

Observability & Tooling

  • Deliver a unified observability strategy across services using logging, metrics, and distributed tracing.
  • Lead the migration from DataDog to Mosaic for dashboards and alerting.
  • Develop self-healing automation and smoke tests to validate post-deployment system health.
  • Ensure visibility into latency, availability, and error budgets (SLOs/SLIs).

Support & Incident Management

  • Own the AI platform support channel: triage issues, answer platform questions, and guide onboarding.
  • Provide L1/L2 triage during business hours; coordinate after-hours escalation with the core team.
  • Establish structured runbooks, escalation policies, and post-incident review processes.

Deployment & Environment Consistency

  • Standardize infrastructure and CI/CD practices across environments.
  • Partner with platform and ML engineers to streamline release pipelines, security policies, and service configurations.
  • Ensure consistent rollout of new features and agent services with minimal downtime.

Automation & Continuous Improvement

  • Develop Python or Go utilities to automate deployment, monitoring, and maintenance tasks.
  • Build tooling for alert correlation, system diagnostics, and capacity forecasting.
  • Continuously evaluate new tools and frameworks to improve operational efficiency.

Requirements and Qualifications:

  • 4+ years of experience as an SRE, DevOps Engineer, or Platform Engineer in cloud environments.
  • Deep expertise with Kubernetes (EKS/GKE), CI/CD pipelines, and infrastructure automation.
  • Proficiency with observability tools such as Grafana, Prometheus, DataDog, Splunk, or OpenTelemetry.
  • Experience in at least one modern programming language (Python, Go, or Rust).
  • Strong understanding of incident management, SLAs/SLOs, and post-mortem practices.
  • Excellent communication and collaboration skills; ability to work across platform, AI, and data teams.

Preferred Qualifications:

  • Experience operating AI/ML workloads (LangGraph, LangChain, or distributed compute systems like Ray).
  • Familiarity with LLM-based infrastructure and AI observability tooling.
  • Prior experience in managed service transitions or vendor-to-product operating model shifts.
  • Exposure to Azure or AWS cloud ecosystems, Terraform, and GitOps workflows (ArgoCD/Flux).

Bayside Solutions, Inc. is not able to sponsor any candidates at this time. Additionally, candidates for this position must qualify as a W2 candidate.

Bayside Solutions, Inc. may collect your personal information during the position application process. Please reference Bayside Solutions, Inc.'s CCPA Privacy Policy at www.baysidesolutions.com.

    © 2025 Bayside Solutions. All Rights Reserved.