About
Ritesh Sonawane
Site Reliability Engineer / Platform Engineer
Hi there! I’m Ritesh, a Site Reliability Engineer with 2+ years of experience managing production Kubernetes clusters across AWS, GCP, Azure, and bare-metal environments. I’ve worked in DevOps, SRE, and Platform Engineering roles across multiple projects, supporting multi-cloud infrastructure for 17+ companies.
My expertise spans cloud-native infrastructure, observability, backend development in Golang, and infrastructure automation using GitOps practices. I specialize in building production-grade Kubernetes platforms across multi-cloud environments, implementing comprehensive observability stacks, and developing custom tools that solve complex infrastructure challenges.
Throughout my career, I’ve worked with clients across telecom, AI/ML, fintech, and enterprise SaaS, delivering 100TB+ data migrations and GPU cloud platforms while cutting infrastructure costs and deployment times.
Kubernetes Homelab
To validate ideas before production, I run a multi-node k3s cluster at pidoku.co.in where I experiment with cutting-edge cloud-native technologies and test infrastructure patterns. Want to know more about my setup? Drop me an email at contact@riteshsonawane.com.
Experience
Site Reliability Engineer - CloudRaft
April 2024 - March 2026 • Remote
Owned infrastructure architecture and platform engineering across 10+ client environments spanning telecom, AI/ML, fintech, and enterprise SaaS.
EnkryptAI - AI/ML Infrastructure & Production Operations
- Owned complete AWS EKS infrastructure and VPC-level architecture, serving as DevOps Lead for enterprise client deployments
- Architected and developed 3 cloud-agnostic Helm charts (EnkryptAI Stack, Platform Stack, Platform Core) to orchestrate 70+ microservices in fully air-gapped multi-cloud deployments (AWS, GCP, Azure) with GPU node support and VRAM-based scheduling
- Onboarded 2 enterprise clients on AKS and EKS environments
- Re-architected red-teaming job orchestration by migrating from Celery-based execution to a Kubernetes-native stack using Argo Workflows, Argo Events, and NATS
- Created a Golang-based container entrypoint script, reducing Kubernetes pod startup time from 5 minutes to 30 seconds for a Next.js application
- Managed on-call rotation using incident.io for guardrails and red-teaming services, ensuring production uptime and incident response
- Deployed production guardrails application on NVIDIA A30/H100 GPUs using the NVIDIA GPU Operator
- Resolved critical compatibility and deployment issues with NVIDIA GPU Operator on Azure Kubernetes Service (AKS)
- Deployed on-prem OpenFGA integrated with CloudNativePG (CNPG) as the backing PostgreSQL database to implement fine-grained, scalable RBAC across the platform
- Migrated Elasticsearch to OpenSearch using the OpenSearch Kubernetes Operator and successfully transitioned Kibana dashboards to OpenSearch Dashboards, ensuring seamless continuity of audit logging
- Customized and deployed the Supabase Helm chart with CloudNativePG (CNPG) to migrate Supabase from managed cloud to on-prem VPC deployment, enabling full functionality within air-gapped enterprise environments
- Implemented DevSecOps practices by integrating SBOM generation and Grype-based vulnerability scanning into CI/CD pipelines, with automated Slack alerts for high-severity security findings
- Configured DevSpace for developers, enabling rapid deployment to AWS GPU nodes
- Supported SOC 2 and ISO 27001 compliance initiatives
Composio - Enterprise SaaS Platform & Customer Onboarding
- Implemented Replicated as the enterprise portal for customer onboarding, reducing onboarding time from days to 1 hour
- Migrated existing clients to the Replicated platform, enabling centralized release management and automated updates
- Wrote preflight and postflight validation checks for installation reliability
- Built multi-channel CI/CD pipeline with GitHub Actions supporting Unstable, Stable, and Nightly releases
- Led client onboarding for 6+ production environments across AWS EKS and GCP GKE
- Customized Temporal Helm charts for client-specific requirements
- Automated deployment of unstable Replicated releases to GKE for continuous testing
Skyswitch - 100TB Monitoring Migration (Telecom)
- Led migration of 100TB+ InfluxDB data to Grafana Mimir with zero downtime
- Built custom Golang conversion tool to transform InfluxDB data to OpenMetrics format for Grafana Mimir
- Migrated 1000+ Grafana dashboards from InfluxQL to PromQL
- Re-architected single-tenant Mimir to multi-tenant model, reducing compactor startup time from hours to 5 minutes
- Reduced infrastructure compute and memory usage by 30%
NeevCloud - GPU Cloud Platform
- Built production-grade GPU cloud platform on bare-metal Kubernetes (kubeadm) with KubeVirt and NVIDIA GPUs
- Developed Golang-based VM provisioning API with GPU-aware scheduling and inventory validation logic
- Implemented custom x-api-key authentication middleware for vLLM model endpoints using Traefik
- Configured Cilium CNI to assign public IPs to KubeVirt VMs
- Resolved GPU passthrough issues and fixed VM network persistence using KubeMacPool
- Enforced AWS-style resource quotas to prevent over-provisioning
- Achieved 99.9%+ uptime for commercial GPU workloads with CI/CD via GitHub Actions, ArgoCD, and Harbor
Rezolve.ai - Enterprise Observability & Cost Optimization
- Designed and deployed an observability stack across 4 production AKS clusters using Prometheus, Grafana, OpenTelemetry Operator, Loki, and Tempo
- Enabled RED metrics and service graphs for production services
- Identified infrastructure inefficiencies using Steampipe/Powerpipe, delivering $36,000 annual cost savings
- Led Kubernetes version upgrades to v1.31 across all clusters
- Planned and executed Jenkins server upgrade
Ditto - Multi-Cloud BYOC Platform
- Provisioned kubeadm clusters in client environments across AWS (CAPA), GCP (CAPG), and Azure (CAPZ) using Cluster API
- Developed custom Golang migration script for the transition from MachinePool to MachineDeployment on AWS- and GCP-based Kubernetes clusters
- Automated deployment of Velero, Cluster Autoscaler, and Node Problem Detector using ArgoCD ApplicationSets across multi-cloud environments
Clika - AI Job Orchestration Platform
- Built Golang-based Kubernetes job scheduler with credit-based billing and a ledger system
- Implemented S3-compatible storage integration for ML model outputs with access controls
- Integrated VictoriaMetrics and VictoriaLogs for real-time job observability with custom API endpoints
- Provisioned GKE infrastructure via Terraform with CI/CD automation using ArgoCD and GitHub Actions
IFF - Enterprise Authentication Integration (Fortune 500)
- Implemented organization-wide SAML-based authentication for ArgoCD, enhancing secure access management across enterprise deployments
DrDroid - CI/CD Modernization
- Modernized bash-based deployments to a production-grade CI/CD pipeline using GitHub Actions and ArgoCD with Image Updater
- Set up Metabase for database analytics and insights
DevOps Engineer - Makerble
October 2023 - April 2024 • Remote
- Migrated complete staging, pre-production, and production infrastructure from AWS to Azure using Terraform, achieving 40% cost reduction
- Optimized AWS workloads, reducing costs by 17% through request/limit tuning, affinity/toleration rules, and efficient node placement
- Built CI/CD pipelines in Tekton and automated Testsigma execution with Slack notifications post-production sync
- Configured Ingress Controller with custom error pages, integrated Rollbar with ArgoCD hooks for deployment tracking, and implemented Robusta for Kubernetes monitoring
- Deployed Uptime Kuma for uptime/downtime alerts, Redis Insight via Ingress for Redis monitoring, and Botkube for infrastructure logs on Slack
- Resolved AWS IP exhaustion by adding a secondary subnet for EC2
- Created custom GitHub runners on Oracle Cloud for GitHub Actions
- Deployed a WireGuard-based VPN for secure internal access and implemented Passbolt as a self-hosted password manager
- Designed Azure Kubernetes architecture, documented cluster and CI/CD workflows, and integrated DeepSource, Snyk, and PerfectScale for code quality, security scanning, and cost monitoring
- Implemented liveness, readiness, and startup probes to eliminate downtime in staging/pre-production environments
- Configured Prometheus and Grafana for monitoring and deployed EFK stack for centralized logging
- Automated Azure VM start for developers via Logic Apps