AI Applied SRE
In the age of AI, a product launch is just the beginning. Applications, infrastructure, and AI models are all in scope. Through AI Native DevOps practices, you will continuously optimize reliability, cost, and resilience to protect production.
What you'll do
- From Application to Infrastructure: Own reliability across every layer: application, data pipelines, AI model serving, and cloud infrastructure
- Practice AI Native DevOps: Unify delivery and operations through CI/CD pipelines, infrastructure-as-code, and automated remediation
- Build Observability: SLO definition, monitoring stack design, and incident response leadership, including AI model inference quality and latency monitoring
- Capacity Planning: Traffic forecasting, resource planning, and scaling strategy to ensure systems meet demand at optimal cost
- On-call Design & Incident Response: Design on-call rotations, define escalation policies, and lead incident response across production systems
- Production Readiness Review: Evaluate reliability risks from an SRE perspective at the design stage, before any release reaches production
What we're looking for
Required Experience & Skills
Problem Framing
The ability to transform a report like “the system is slow” into a structural question: “Which component at which layer is becoming a bottleneck under which conditions?” Digging beneath surface-level symptoms to uncover systemic issues, and defining the priority and blast radius of the problems to address. Identifying essential risks before they manifest and catching the early signs of failure.
Problem Solving
The ability to identify critical issues with limited information and time during incidents, and recover with minimal impact. Not “waiting for complete root cause analysis” but “stopping the bleeding first while pursuing permanent fixes in parallel.” Relentlessly pursuing root causes in post-mortems and embedding prevention measures into the system
Communication
Discussing architectural improvements with engineers, explaining incident impact and countermeasures to management, and reporting service level status to clients — adjusting technical depth and expression for each audience. Calm escalation during incidents and sharing operational knowledge to raise the team’s overall operational quality
Linux & Networking
Deep understanding of Linux fundamentals — kernel behavior, systemd, cgroups, and namespaces. Solid grasp of TCP/IP, DNS, and HTTP/HTTPS, with the ability to design and troubleshoot load balancing and CDN configurations. Writing operational automation in Bash and Python
Cloud Platforms
Production operations on AWS (EKS, SageMaker, Bedrock, Lambda, EC2, RDS, S3), GCP (GKE, Vertex AI, Cloud Run), or Azure (AKS, Azure OpenAI). Building and operating reproducible infrastructure with Terraform or CloudFormation
Observability & Incident Response
Experience designing and operating monitoring stacks with Prometheus, Grafana, Datadog, or New Relic. System state visualization through OpenTelemetry and distributed tracing. Driving SLO-based alert design, using error budgets as a decision-making tool for balancing reliability and velocity. Incident response leadership and post-mortem culture — learning from failures and evolving the system
Databases & Storage
Production experience with performance tuning, replication, and failover design for relational databases (MySQL, PostgreSQL). Understanding of the characteristics of cloud-native stores such as DynamoDB and S3. Backup and disaster recovery strategy design.
Automation & CI/CD
Designing and operating CI/CD pipelines with GitHub Actions or GitLab CI. Rigorous Infrastructure as Code practices, building automated remediation mechanisms, and validating fault tolerance through chaos engineering (Chaos Monkey, Litmus)
Nice to Have
Container Orchestration
Production Kubernetes operations (EKS, GKE, AKS) and GitOps-based deployment management with Helm and Argo CD
Security
Secrets management (Vault, cloud-native solutions). Experience designing and implementing zero-trust architecture
ML/AI Fundamentals
Understanding of the ML model lifecycle — training, evaluation, deployment, and monitoring concepts. Ability to reason about GPU resource requirements, model versioning, A/B testing infrastructure, and performance characteristics of ML inference workloads to make informed infrastructure decisions
AI Model Serving
Production LLM deployment experience with vLLM, TGI, or Triton Inference Server. Optimizing inference latency and throughput, and managing GPU resources efficiently.
Data Platform
Streaming infrastructure experience with Kafka or similar, or operational experience with Snowflake, Databricks, or equivalent cloud data platforms
Interested in this role?
Tell us what drives you and what you want to build.