About

Ritesh Sonawane

Site Reliability Engineer / Platform Engineer


Hi there! I’m Ritesh, a Site Reliability Engineer with 2+ years of experience managing production Kubernetes clusters across AWS, GCP, Azure, and bare-metal environments. I’ve worked as a DevOps, SRE, and Platform Engineer across multiple projects, supporting infrastructure for 17+ companies in multi-cloud environments.

My expertise spans cloud-native infrastructure, observability, backend development in Golang, and infrastructure automation using GitOps practices. I specialize in building production-grade Kubernetes platforms across multi-cloud environments, implementing comprehensive observability stacks, and developing custom tools that solve complex infrastructure challenges.

Throughout my career, I’ve worked with clients across telecom, AI/ML, fintech, and enterprise SaaS, delivering solutions that range from 100TB+ data migrations and GPU cloud platforms to significant cost optimizations and shorter deployment times.

Kubernetes Homelab

To validate ideas before production, I run a multi-node k3s cluster at pidoku.co.in, where I experiment with new cloud-native technologies and test infrastructure patterns. Want to know more about my setup? Email me at contact@riteshsonawane.com.

Experience

Site Reliability Engineer - CloudRaft

April 2024 - March 2026 • Remote

Owned infrastructure architecture and platform engineering across 10+ client environments spanning telecom, AI/ML, fintech, and enterprise SaaS.

EnkryptAI - AI/ML Infrastructure & Production Operations

  • Owned complete AWS EKS infrastructure and VPC-level architecture, serving as DevOps Lead for Enterprise client deployments
  • Architected and developed 3 cloud-agnostic Helm charts (EnkryptAI Stack, Platform Stack, Platform Core) to orchestrate the deployment of 70+ microservices in fully air-gapped multi-cloud environments (AWS, GCP, Azure) with GPU node support and VRAM-based scheduling
  • Onboarded 2 enterprise clients on AKS and EKS environments
  • Re-architected red-teaming job orchestration by migrating from Celery-based execution to a Kubernetes-native stack using Argo Workflows, Argo Events, and NATS
  • Created a Golang-based container entrypoint, reducing Kubernetes pod startup time from 5 minutes to 30 seconds for a Next.js application
  • Managed on-call rotation using incident.io for guardrails and red-teaming services, ensuring production uptime and incident response
  • Deployed the production guardrails application on NVIDIA A30/H100 GPUs using the NVIDIA GPU Operator
  • Resolved critical compatibility and deployment issues with NVIDIA GPU Operator on Azure Kubernetes Service (AKS)
  • Deployed on-prem OpenFGA integrated with CloudNativePG (CNPG) as the backing PostgreSQL database to implement fine-grained, scalable RBAC across the platform
  • Migrated Elasticsearch to OpenSearch using the OpenSearch Kubernetes Operator and successfully transitioned Kibana dashboards to OpenSearch Dashboards, ensuring seamless continuity of audit logging
  • Customized and deployed the Supabase Helm chart with CloudNativePG (CNPG) to migrate Supabase from managed cloud to an on-prem VPC deployment, enabling full functionality within air-gapped enterprise environments
  • Implemented DevSecOps practices by integrating SBOM generation and Grype-based vulnerability scanning into CI/CD pipelines, with automated Slack alerts for high-severity security findings
  • Configured DevSpace for developers, enabling rapid deployment to AWS GPU nodes
  • Supported SOC 2 and ISO 27001 compliance initiatives

Composio - Enterprise SaaS Platform & Customer Onboarding

  • Implemented Replicated as the enterprise portal for customer onboarding, reducing onboarding time from days to 1 hour
  • Migrated existing clients to the Replicated platform, enabling centralized release management and automated updates
  • Wrote preflight and postflight validation checks for installation reliability
  • Built multi-channel CI/CD pipeline with GitHub Actions supporting Unstable, Stable, and Nightly releases
  • Led client onboarding for 6+ production environments across AWS EKS and GCP GKE
  • Customized Temporal Helm charts for client-specific requirements
  • Automated deployment of unstable Replicated releases to GKE for continuous testing

Skyswitch - 100TB Monitoring Migration (Telecom)

  • Led migration of 100TB+ InfluxDB data to Grafana Mimir with zero downtime
  • Built a custom Golang conversion tool to transform InfluxDB data to OpenMetrics format for Grafana Mimir
  • Migrated 1000+ Grafana dashboards from InfluxQL to PromQL
  • Re-architected single-tenant Mimir to a multi-tenant model, reducing compactor startup time from hours to 5 minutes
  • Reduced infrastructure compute and memory usage by 30%

NeevCloud - GPU Cloud Platform

  • Built a production-grade GPU cloud platform on bare-metal Kubernetes (kubeadm) with KubeVirt and NVIDIA GPUs
  • Developed a Golang-based VM provisioning API with GPU-aware scheduling and inventory validation logic
  • Implemented custom x-api-key authentication middleware for vLLM model endpoints using Traefik
  • Configured Cilium CNI to assign public IPs to KubeVirt VMs
  • Resolved GPU passthrough issues and fixed VM network persistence using KubeMacPool
  • Enforced AWS-style resource quotas to prevent over-provisioning
  • Achieved 99.9%+ uptime for commercial GPU workloads with CI/CD via GitHub Actions, ArgoCD, and Harbor

Rezolve.ai - Enterprise Observability & Cost Optimization

  • Designed and deployed an observability stack across 4 production AKS clusters using Prometheus, Grafana, OpenTelemetry Operator, Loki, and Tempo
  • Enabled RED metrics and service graphs for production services
  • Identified infrastructure inefficiencies using Steampipe/Powerpipe, delivering $36,000 annual cost savings
  • Led Kubernetes version upgrades to v1.31 across all clusters
  • Planned and executed the Jenkins server upgrade

Ditto - Multi-Cloud BYOC Platform

  • Provisioned kubeadm clusters in client environments across AWS (CAPA), GCP (CAPG), and Azure (CAPZ) using Cluster API
  • Developed a custom Golang migration script to transition AWS- and GCP-based Kubernetes clusters from MachinePool to MachineDeployment
  • Automated deployment of Velero, Cluster Autoscaler, and Node Problem Detector using ArgoCD ApplicationSets across multi-cloud environments

Clika - AI Job Orchestration Platform

  • Built a Golang-based Kubernetes job scheduler with credit-based billing and a ledger system
  • Implemented S3-compatible storage integration for ML model outputs with access controls
  • Integrated VictoriaMetrics and VictoriaLogs for real-time job observability with custom API endpoints
  • Provisioned GKE infrastructure via Terraform with CI/CD automation using ArgoCD and GitHub Actions

IFF - Enterprise Authentication Integration (Fortune 500)

  • Implemented organization-wide SAML-based authentication for ArgoCD, enhancing secure access management across enterprise deployments

DrDroid - CI/CD Modernization

  • Replaced Bash-based deployments with a production-grade CI/CD pipeline using GitHub Actions and ArgoCD with Image Updater
  • Set up Metabase for database analytics and insights

DevOps Engineer - Makerble

October 2023 - April 2024 • Remote

  • Migrated the complete staging, pre-production, and production infrastructure from AWS to Azure using Terraform, achieving a 40% cost reduction
  • Optimized AWS workloads, reducing costs by 17% through request/limit tuning, affinity/toleration rules, and efficient node placement
  • Built CI/CD pipelines in Tekton and automated Testsigma execution after each production sync, with results sent to Slack
  • Configured the Ingress Controller with custom error pages, integrated Rollbar with ArgoCD hooks for deployment tracking, and implemented Robusta for Kubernetes monitoring
  • Deployed Uptime Kuma for uptime/downtime alerts, Redis Insight via Ingress for Redis monitoring, and Botkube for infrastructure logs on Slack
  • Resolved AWS IP exhaustion by adding a secondary subnet for EC2
  • Created custom GitHub runners on Oracle Cloud for GitHub Actions
  • Deployed a WireGuard-based VPN for secure internal access and implemented Passbolt as a self-hosted password manager
  • Designed the Azure Kubernetes architecture, documented cluster and CI/CD workflows, and integrated DeepSource, Snyk, and PerfectScale for code quality, security scanning, and cost monitoring
  • Implemented liveness, readiness, and startup probes to eliminate downtime in staging/pre-production environments
  • Configured Prometheus and Grafana for monitoring and deployed the EFK stack for centralized logging
  • Automated Azure VM startup for developers via Logic Apps