AI Applied SRE


What you'll do

From Application to Infrastructure
Own reliability across every layer: application, data pipelines, AI model serving, and cloud infrastructure

Practice AI Native DevOps
Unify delivery and operations through CI/CD pipelines, infrastructure-as-code, and automated remediation

Build Observability
SLO definition, monitoring stack design, and incident response leadership, including AI model inference quality and latency monitoring

Capacity Planning
Traffic forecasting, resource planning, and scaling strategy to ensure systems meet demand at optimal cost

On-call Design & Incident Response
Design on-call rotations, define escalation policies, and lead incident response across production systems

Production Readiness Review
Evaluate reliability risks from an SRE perspective at the design stage, before any release reaches production

What we're looking for

Required Experience & Skills

Problem Framing
The ability to transform a report like “the system is slow” into a structural question: “Which component at which layer is becoming a bottleneck under which conditions?” Digging beneath surface-level symptoms to uncover systemic issues, and defining the priority and blast radius of the problems to address. Identifying essential risks before they manifest and catching the early signs of failure

Problem Solving
The ability to identify critical issues with limited information and time during incidents, and recover with minimal impact. Not “waiting for complete root cause analysis” but “stopping the bleeding first while pursuing permanent fixes in parallel.” Relentlessly pursuing root causes in post-mortems and embedding prevention measures into the system

Communication
Discussing architectural improvements with engineers, explaining incident impact and countermeasures to management, and reporting service level status to clients — adjusting technical depth and expression for each audience. Calm escalation during incidents and sharing operational knowledge to raise the team’s overall operational quality

Linux & Networking
Deep understanding of Linux fundamentals — kernel behavior, systemd, cgroups, and namespaces. Solid grasp of TCP/IP, DNS, and HTTP/HTTPS, with the ability to design and troubleshoot load balancing and CDN configurations. Writing operational automation in Bash and Python

Cloud Platforms
Production operations on AWS (EKS, SageMaker, Bedrock, Lambda, EC2, RDS, S3), GCP (GKE, Vertex AI, Cloud Run), or Azure (AKS, Azure OpenAI). Building and operating reproducible infrastructure with Terraform or CloudFormation

Observability & Incident Response
Experience designing and operating monitoring stacks with Prometheus, Grafana, Datadog, or New Relic. System state visualization through OpenTelemetry and distributed tracing. Driving SLO-based alert design, using error budgets as a decision-making tool for balancing reliability and velocity. Incident response leadership and post-mortem culture — learning from failures and evolving the system

Databases & Storage
Production experience with performance tuning, replication, and failover design for relational databases (MySQL, PostgreSQL). Understanding of the characteristics of cloud-native stores such as DynamoDB and S3. Backup and disaster recovery strategy design

Automation & CI/CD
Designing and operating CI/CD pipelines with GitHub Actions or GitLab CI. Rigorous Infrastructure as Code practices, building automated remediation mechanisms, and validating fault tolerance through chaos engineering (Chaos Monkey, Litmus)

Nice to Have

Container Orchestration
Production Kubernetes operations (EKS, GKE, AKS) and GitOps-based deployment management with Helm and Argo CD

Security
Secrets management (Vault, cloud-native solutions). Experience designing and implementing zero-trust architecture

ML/AI Fundamentals
Understanding of the ML model lifecycle — training, evaluation, deployment, and monitoring concepts. Ability to reason about GPU resource requirements, model versioning, A/B testing infrastructure, and performance characteristics of ML inference workloads to make informed infrastructure decisions

AI Model Serving
Production LLM deployment experience with vLLM, TGI, or Triton Inference Server. Optimizing inference latency and throughput, and managing GPU resources efficiently

Data Platform
Streaming infrastructure experience with Kafka or similar, or operational experience with Snowflake, Databricks, or equivalent cloud data platforms
