I own production — and make it more reliable, observable & cost-efficient.
Platform & Cloud Reliability Engineer with 4.7+ years owning the production estate on GCP. I built Blue-Green zero-downtime deploys, scaled databases with read replicas & PgBouncer, rolled out Datadog SLO observability, automated backups and cloud-cost governance — and kept 62+ services running with zero major outages.
About Me
I own production systems — and I make them more reliable, more observable, and cheaper to run.
I'm a Platform / DevOps & Cloud Reliability Engineer with 4.7+ years owning cloud-native infrastructure on Google Cloud Platform. I don't just keep the lights on: I own the production estate for key.AI (Blueprint X Labs), which spans 46+ VMs, 62+ Cloud Run services and multi-cluster GKE across dev, stage and production. I'm also the person who improves it, with safer deploys, deeper observability, scalable databases and a cloud bill that reflects real demand.
My work is reliability-first and automation-led: Blue-Green frontend deployments for zero-downtime releases, Helm rollout validation with automatic rollback, Read Replica + PgBouncer database scaling, Datadog observability with SLO-style alerting, and GitOps CI/CD on standardized, self-healing GitHub Actions runners.
I treat cloud cost as an engineering problem. I built billing-anomaly detection on BigQuery billing exports, automated monthly cost reporting for stakeholders, and chased down a recurring Cloud Monitoring cost leak rooted in noisy Ops Agent metric collection. Earlier, at Hexanika, I laid the Terraform foundations that cut provisioning time from 2+ days to under 30 minutes and eliminated drift across three environments.
DevOps / Platform Engineer · Blueprint X Labs Inc (key.AI)
Jun 2024 – Present · Ahmedabad, India
- Own and operate production infrastructure across dev, stage and production — 46+ VMs, 62+ Cloud Run services and multi-cluster GKE with zero major outages.
- Built Blue-Green frontend deployments and Helm rollout validation with auto-rollback, delivering zero-downtime releases and 30% fewer deployment failures on a core platform service.
- Scaled the database tier with read replicas and PgBouncer pooling, eliminating connection-exhaustion errors under load.
- Drove cloud cost governance: BigQuery billing-anomaly detection, automated monthly reporting, and remediation of a recurring Cloud Monitoring cost leak.
DevOps Engineer · Hexanika Research Pvt. Ltd.
May 2021 – Oct 2023 · Pune, India
- Provisioned multi-environment GCP infrastructure with Terraform, reducing provisioning time from 2+ days to under 30 minutes.
- Authored reusable Terraform modules for networking, compute and storage, eliminating drift across 3 environments.
- Built end-to-end CI/CD pipelines on Azure DevOps with automated testing gates.
- Automated SQL Server stored procedures and optimized Azure Data Factory pipelines, cutting manual data effort by ~50%.
Areas of Expertise
- Production ownership & reliability engineering on GCP at scale
- Zero-downtime & Blue-Green deployment architecture
- Database scaling — read replicas, PgBouncer connection pooling, backups
- Observability & SLOs with Datadog and Cloud Monitoring
- Cloud cost governance — anomaly detection, reporting, leak remediation
- Platform automation — Terraform IaC, GitOps CI/CD, Kubernetes (GKE)
Skills
The tools and platforms I reach for to build reliable, automated, observable infrastructure.
Cloud
Kubernetes & Containers
CI/CD & GitOps
Observability
Databases
Security & Networking
Cost & FinOps
IaC & Automation
Production Impact
The production systems I own and the initiatives I've shipped to make them more reliable, automated and cost-efficient — not infrastructure I merely managed.
Zero-Downtime Frontend Deployments
Re-architected frontend releases behind a Blue-Green model so users never hit a cold or half-deployed environment during a rollout.
Blue-Green Deployment Implementation
Stood up parallel Blue/Green environments behind the load balancer with health-gated traffic switching and instant rollback.
Read Replica Rollout
Introduced read replicas to offload read-heavy traffic from the primary, removing it as a single point of contention.
PgBouncer Connection Pooling
Deployed PgBouncer to pool and reuse database connections, eliminating connection-exhaustion errors under load.
Production Monitoring & Observability
Datadog APM + custom DogStatsD metrics on long-lived VM services (workers, Kafka consumers), with monitors and dashboards as code; Cloud Monitoring for serverless and CI.
GitHub Actions Runner Standardization
Standardized self-hosted runners with consistent images and auto-recovery, ending the flaky, one-off runner failures that stalled pipelines.
Billing Anomaly Detection Automation
Automated spend monitoring on top of BigQuery billing exports with threshold alerts, so cost spikes surface in hours, not at month-end.
Monthly Cloud Billing Reporting
Automated recurring billing reports that give stakeholders clear, on-time visibility into cloud spend and trends.
Cloud Cost Leak Investigation
Traced a recurring Cloud Monitoring overspend to noisy Ops Agent metric collection and remediated it, recovering ongoing waste.
Infrastructure Automation
Replaced manual, error-prone operational steps with repeatable automation across provisioning, deploys and reporting.
Monitoring Coverage Expansion
Extended monitoring to previously-blind services and hosts, closing detection gaps that delayed incident response.
Database Backup Automation
Automated scheduled backups with retention and restore validation, turning disaster recovery from hope into a tested procedure.
Kubernetes Platform Improvements
Hardened GKE workloads with namespace isolation, Helm-driven deploys and rollout validation to shrink blast radius.
Reliability Engineering Initiatives
Drove proactive reliability work — health checks, auto-rollback and SLO alerting — that kept production outage-free at scale.
Engineering Case Studies
Nine production initiatives, each written up end-to-end — problem, existing architecture, investigation, solution, rollout, validation, results and lessons learned.
Executive Summary
Re-architected frontend releases onto a Blue-Green model behind the load balancer, turning risky in-place deploys into health-gated, instantly-reversible traffic switches — and removing user-facing downtime from the release path.
Problem
Frontend deployments updated the live environment in place. During a rollout users could hit a half-deployed or cold environment, and a bad release meant a scramble to roll forward while customers saw errors.
Existing Architecture
- Single live frontend environment behind the load balancer
- In-place deploys that mutated the serving environment directly
- No warm standby to fall back to; rollback meant redeploying the previous build
Investigation
- Mapped exactly when users saw errors during a deploy window
- Identified the gap between 'new version starting' and 'new version healthy' as the downtime source
- Confirmed the load balancer could shift traffic between backends near-instantly
Challenges
- Keeping the old version fully serving until the new one is proven healthy
- Avoiding paying for two full environments any longer than necessary
- Making rollback a single, fast, low-risk action under pressure
Solution Architecture
- Parallel Blue (current) and Green (next) environments behind one HTTP(S) load balancer
- Deploy lands on the idle environment while the active one keeps serving 100% of traffic
- Health checks gate the cutover; traffic only switches once Green is verified healthy
- Instant rollback by pointing traffic back at the still-warm previous environment
Implementation
- Templated environment provisioning so Blue and Green are configured identically
- Wired deploy automation to target the idle color, run health checks, then flip the LB
- Added readiness gating so a failed health check aborts the cutover automatically
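The cutover flow in the steps above can be sketched in a few lines. This is a minimal illustration, not the production automation: `Router`, `cutover`, `rollback`, and the `deploy` / `is_healthy` callables are hypothetical names standing in for the load balancer and deploy tooling, and the number of consecutive health checks is an assumed parameter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Router:
    """Hypothetical stand-in for the load balancer's active backend."""
    active: str = "blue"

    def idle(self) -> str:
        # The color that is not currently serving traffic.
        return "green" if self.active == "blue" else "blue"


def cutover(
    router: Router,
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    checks: int = 3,  # assumed: consecutive passing checks required
) -> bool:
    """Deploy to the idle color and switch traffic only if it is healthy."""
    target = router.idle()
    deploy(target)  # the active color keeps serving 100% of traffic
    for _ in range(checks):
        if not is_healthy(target):
            return False  # abort: traffic never left the active color
    router.active = target  # flip the load balancer
    return True


def rollback(router: Router) -> None:
    """Point traffic back at the still-warm previous environment."""
    router.active = router.idle()
```

Because a failed check returns before the switch, an unhealthy build never receives traffic, and rollback is the same single assignment as a cutover in the opposite direction.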
Rollout Strategy
- Piloted on a lower-risk frontend surface to prove the switch + rollback flow
- Ran old and new paths side by side before making Blue-Green the default release path
- Documented the cutover/rollback runbook so any engineer can execute it
Validation
- Verified zero error spikes in monitoring across deploy windows
- Rehearsed rollback to confirm sub-minute reversion to the previous version
- Confirmed health gates correctly block an intentionally-broken build from taking traffic
Results
- Eliminated user-facing downtime from frontend releases
- Rollback became a single fast traffic switch instead of a redeploy
- Teams ship frontend changes during business hours with confidence
Lessons Learned
- Downtime usually hides in the gap between 'started' and 'healthy' — gate the cutover on health, not on time.
- A warm previous environment is the cheapest insurance policy you can buy for a release.
Architecture Gallery
The systems behind the case studies, drawn out.
New versions deploy to the idle environment and only receive traffic once health checks pass — rollback is an instant switch back to the warm previous version.
Operational Excellence
The business outcomes behind the engineering: less toil, earlier detection, safer releases, standardized infrastructure and governed cloud spend.
Automation Over Toil
Reduced manual operational effort across the lifecycle
- Automated billing reports, backups and provisioning that were once manual
- GitOps pipelines removed hand-run release steps
- Self-healing runners eliminated repetitive CI babysitting
Proactive Monitoring & Alerting
Issues surface before users feel them
- Datadog APM + custom-metric alerting on stateful VM services
- Billing anomaly alerts catch cost spikes in hours
- Cloud Monitoring → Slack for serverless and CI signals
Faster Detection & Resolution
Lower MTTD and calmer incident response
- Single pane of glass for fast triage
- Automatic Helm rollback on failed health checks
- Namespace isolation contains failures to one domain
Deployment Safety
Releases that don't scare anyone
- Blue-Green frontend deploys with zero downtime
- Health-gated cutovers and instant rollback
- 30% fewer deployment failures on a core platform service
Infrastructure Standardization
One paved path for every team
- Reusable Terraform modules, drift-free across 3 environments
- Standardized Helm charts and CI runners
- Secret Manager-backed secrets, never baked into images
Cloud Cost Governance
Spend that reflects real demand
- Billing anomaly detection and automated monthly reporting
- Recovered a recurring Cloud Monitoring cost leak
- Right-sized compute and scale-to-zero where it fits
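As a rough illustration of threshold-based billing anomaly detection, the sketch below flags a service when its latest daily cost jumps well above its trailing average. It assumes daily per-service costs have already been queried out of the billing export into a plain dictionary. The function name, the `ratio` and `min_delta` thresholds, and the sample figures are all hypothetical, not the values used in production.

```python
from statistics import mean


def find_anomalies(
    daily_cost: dict[str, list[float]],
    ratio: float = 1.5,       # assumed: latest day must exceed 1.5x baseline
    min_delta: float = 10.0,  # assumed: ignore jumps smaller than this amount
) -> list[str]:
    """Return services whose latest daily cost spikes above their baseline."""
    flagged = []
    for service, costs in daily_cost.items():
        if len(costs) < 2:
            continue  # not enough history to form a baseline
        baseline = mean(costs[:-1])  # trailing average, excluding latest day
        latest = costs[-1]
        if latest > baseline * ratio and latest - baseline >= min_delta:
            flagged.append(service)
    return sorted(flagged)


# Made-up sample data: one genuine spike, one steady service, one tiny jump.
sample = {
    "Cloud Run": [100.0, 105.0, 98.0, 102.0, 240.0],
    "Cloud Monitoring": [20.0, 21.0, 19.0, 20.0, 22.0],
    "GKE": [1.0, 1.0, 1.0, 1.0, 5.0],
}
print(find_anomalies(sample))
```

Pairing a relative threshold with an absolute floor keeps small services from raising alerts over changes that are large in percentage terms but negligible in spend.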
Certifications
Credentials and education backing the hands-on production experience.
Certifications
Google Cloud Fundamentals: Core Infrastructure
Google Cloud
Infrastructure Automation with Terraform
HashiCorp / Coursera
Deploying Apps with Docker & ECS
Cloud Training
Industry Relevant AWS Training
AWS
Education
Master of Computer Applications (MCA)
MIT World Peace University, Pune
2021 · 81.2%
BSc Information Technology
Saurashtra University, Junagadh
2018 · 69.7%
Blog
Notes from the field on reliability, GCP networking and infrastructure automation.
Contact
Hiring for a DevOps, Platform or SRE role? Let's talk about keeping your infrastructure fast, reliable and secure.