About

Ritesh Sonawane

Site Reliability Engineer / Platform Engineer

Hi there! I’m Ritesh, a Site Reliability Engineer with 2+ years of experience managing production Kubernetes clusters across AWS, GCP, Azure, and bare-metal environments. I’ve worked as a DevOps Engineer, SRE, and Platform Engineer across multiple projects, supporting infrastructure for 17+ companies in multi-cloud environments.

My expertise spans cloud-native infrastructure, observability, backend development in Golang, and infrastructure automation using GitOps practices. I specialize in building production-grade Kubernetes platforms across multi-cloud environments, implementing comprehensive observability stacks, and developing custom tools that solve complex infrastructure challenges.

Throughout my career, I’ve worked with multiple clients across telecom, AI/ML, fintech, and enterprise SaaS, delivering 100TB+ data migrations and GPU cloud platforms, cutting infrastructure costs, and reducing deployment times.

Kubernetes Homelab

To validate ideas before production, I run a multi-node k3s cluster at pidoku.co.in where I experiment with cutting-edge cloud-native technologies and test infrastructure patterns. Want to know more about my setup? Drop me an email at contact@riteshsonawane.com

Experience

Site Reliability Engineer - CloudRaft

April 2024 - March 2026 • Remote

Owned infrastructure architecture and platform engineering across 10+ client environments spanning telecom, AI/ML, fintech, and enterprise SaaS.

EnkryptAI - AI/ML Infrastructure & Production Operations

Owned the complete AWS EKS infrastructure and VPC-level architecture, serving as DevOps Lead for enterprise client deployments
Architected and developed 3 cloud-agnostic Helm charts (EnkryptAI Stack, Platform Stack, Platform Core) to orchestrate the deployment of 70+ microservices in fully air-gapped multi-cloud deployments (AWS, GCP, Azure) with GPU node support and VRAM-based scheduling
Onboarded 2 enterprise clients on AKS and EKS environments
Re-architected red-teaming job orchestration by migrating from Celery-based execution to a Kubernetes-native stack using Argo Workflows, Argo Events, and NATS
Created a Golang-based container entrypoint program, reducing Kubernetes pod startup time from 5 minutes to 30 seconds for a Next.js application
Managed on-call rotation using incident.io for guardrails and red-teaming services, ensuring production uptime and incident response
Deployed the production guardrails application on NVIDIA A30/H100 GPUs using the NVIDIA GPU Operator
Resolved critical compatibility and deployment issues with NVIDIA GPU Operator on Azure Kubernetes Service (AKS)
Deployed on-prem OpenFGA integrated with CloudNativePG (CNPG) as the backing PostgreSQL database to implement fine-grained, scalable RBAC across the platform
Migrated Elasticsearch to OpenSearch using the OpenSearch Kubernetes Operator and successfully transitioned Kibana dashboards to OpenSearch Dashboards, ensuring seamless continuity of audit logging
Customized and deployed Supabase Helm Chart with CloudNativePG (CNPG) to migrate Supabase from managed cloud to on-prem VPC deployment, enabling full functionality within air-gapped enterprise environments
Implemented DevSecOps practices by integrating SBOM generation and Grype-based vulnerability scanning into CI/CD pipelines, with automated Slack alerts for high-severity security findings
Configured DevSpace for developers, enabling rapid deployment to AWS GPU nodes
Supported SOC 2 and ISO 27001 compliance initiatives

Composio - Enterprise SaaS Platform & Customer Onboarding

Implemented Replicated as the enterprise portal for customer onboarding, reducing onboarding time from days to 1 hour
Migrated existing clients to Replicated platform, enabling centralized release management and automated updates
Wrote preflight and postflight validation checks for installation reliability
Built multi-channel CI/CD pipeline with GitHub Actions supporting Unstable, Stable, and Nightly releases
Led client onboarding for 6+ production environments across AWS EKS and GCP GKE
Customized Temporal Helm charts for client-specific requirements
Automated deployment of unstable Replicated releases to GKE for continuous testing

Skyswitch - 100TB Monitoring Migration (Telecom)

Led migration of 100TB+ InfluxDB data to Grafana Mimir with zero downtime
Built custom Golang conversion tool to transform InfluxDB data to OpenMetrics format for Grafana Mimir
Migrated 1000+ Grafana dashboards from InfluxQL to PromQL
Re-architected single-tenant Mimir to multi-tenant model, reducing compactor startup time from hours to 5 minutes
Reduced infrastructure compute and memory usage by 30%

NeevCloud - GPU Cloud Platform

Built production-grade GPU cloud platform on bare-metal Kubernetes (kubeadm) with KubeVirt and NVIDIA GPUs
Developed Golang-based VM provisioning API with GPU-aware scheduling and inventory validation logic
Implemented custom x-api-key authentication middleware for vLLM model endpoints using Traefik
Configured Cilium CNI to assign public IPs to KubeVirt VMs
Resolved GPU passthrough issues and fixed VM network persistence using KubeMacPool
Enforced AWS-style resource quotas to prevent over-provisioning
Achieved 99.9%+ uptime for commercial GPU workloads with CI/CD via GitHub Actions, ArgoCD, and Harbor

Rezolve.ai - Enterprise Observability & Cost Optimization

Designed and deployed an observability stack across 4 production AKS clusters using Prometheus, Grafana, OpenTelemetry Operator, Loki, and Tempo
Enabled RED metrics and service graphs for production services
Identified infrastructure inefficiencies using Steampipe/Powerpipe, delivering $36,000 annual cost savings
Led Kubernetes version upgrades to v1.31 across all clusters
Planned and executed Jenkins server upgrade

Ditto - Multi-Cloud BYOC Platform

Using Cluster API, provisioned kubeadm clusters on client environments across AWS (CAPA), GCP (CAPG), and Azure (CAPZ)
Developed a custom Golang migration script for the transition from MachinePool to MachineDeployment on AWS- and GCP-based Kubernetes clusters
Automated deployment of Velero, Cluster Autoscaler, and Node Problem Detector using ArgoCD ApplicationSets across multi-cloud environments

Clika - AI Job Orchestration Platform

Built Golang-based Kubernetes job scheduler with credit-based billing and a ledger system
Implemented S3-compatible storage integration for ML model outputs with access controls
Integrated VictoriaMetrics and VictoriaLogs for real-time job observability with custom API endpoints
Provisioned GKE infrastructure via Terraform with CI/CD automation using ArgoCD and GitHub Actions

IFF - Enterprise Authentication Integration (Fortune 500)

Implemented organization-wide SAML-based authentication for ArgoCD, enhancing secure access management across enterprise deployments

DrDroid - CI/CD Modernization

Modernized Bash-based deployments into a production-grade CI/CD pipeline using GitHub Actions and ArgoCD with Image Updater
Set up Metabase for database analytics and insights

DevOps Engineer - Makerble

October 2023 - April 2024 • Remote

Migrated complete staging, pre-production, and production infrastructure from AWS to Azure using Terraform, achieving 40% cost reduction
Optimized AWS workloads, reducing costs by 17% through request/limit tuning, affinity/toleration rules, and efficient node placement
Built CI/CD pipelines in Tekton and automated Testsigma execution with Slack notifications post-production sync
Configured Ingress Controller with custom error pages, integrated Rollbar with ArgoCD hooks for deployment tracking, and implemented Robusta for Kubernetes monitoring
Deployed Uptime Kuma for uptime/downtime alerts, Redis Insight via Ingress for Redis monitoring, and BotKube for infrastructure logs on Slack
Resolved AWS IP exhaustion by adding a secondary subnet for EC2
Created custom GitHub runners on Oracle Cloud for GitHub Actions
Deployed a WireGuard-based VPN for secure internal access and implemented Passbolt as a self-hosted password manager
Designed Azure Kubernetes architecture, documented cluster and CI/CD workflows, and integrated DeepSource, Snyk, and PerfectScale for code quality, security scanning, and cost monitoring
Implemented liveness, readiness, and startup probes to eliminate downtime in staging/pre-production environments
Configured Prometheus and Grafana for monitoring and deployed the EFK stack for centralized logging
Automated Azure VM start for developers via Logic Apps