Platform · DevOps · Cloud Reliability Engineer

I own production — and make it more reliable, observable & cost-efficient.

Platform & Cloud Reliability Engineer with 4.7+ years owning the production estate on GCP. I built Blue-Green zero-downtime deploys, scaled databases with read replicas & PgBouncer, rolled out Datadog SLO observability, automated backups and cloud-cost governance — and kept 62+ services running with zero major outages.

Production Ownership · Zero-Downtime Deploys · Observability & SLOs · Database Scaling · Cloud Cost Governance
Ahmedabad, India · 4.7+ yrs production ownership · Open to Platform / SRE / Cloud roles
manmohansingh@gcp:~/production $
46+ Servers Monitored: VM fleet across dev · stage · prod
62+ Production Services: Cloud Run + multi-cluster GKE
30% Fewer Deploy Failures: Helm rollout validation + auto-rollback
15+ Zero-Downtime Deploys / Day: Near-zero manual intervention
01 ~/about-me

About Me

I own production systems — and I make them more reliable, more observable, and cheaper to run.

I'm a Platform / DevOps & Cloud Reliability Engineer with 4.7+ years owning cloud-native infrastructure on Google Cloud Platform. I don't just keep the lights on: I own the production estate for key.AI (Blueprint X Labs), which spans 46+ VMs, 62+ Cloud Run services and multi-cluster GKE across dev, stage and production, and I'm the person who improves it with safer deploys, deeper observability, scalable databases and a cloud bill that reflects real demand.

My work is reliability-first and automation-led: Blue-Green frontend deployments for zero-downtime releases, Helm rollout validation with automatic rollback, Read Replica + PgBouncer database scaling, Datadog observability with SLO-style alerting, and GitOps CI/CD on standardized, self-healing GitHub Actions runners.

I treat cloud cost as an engineering problem. I built billing-anomaly detection on BigQuery billing exports, automated monthly cost reporting for stakeholders, and chased down a recurring Cloud Monitoring cost leak rooted in noisy Ops Agent metric collection. Earlier, at Hexanika, I laid the Terraform foundations that cut provisioning from 2+ days to under 30 minutes and eliminated drift across three environments.

DevOps / Platform Engineer · Blueprint X Labs Inc (key.AI)

Jun 2024 – Present

Ahmedabad, India

  • Own and operate production infrastructure across dev, stage and production — 46+ VMs, 62+ Cloud Run services and multi-cluster GKE with zero major outages.
  • Built Blue-Green frontend deployments and Helm rollout validation with auto-rollback — zero-downtime releases and 30% fewer deploy failures.
  • Scaled the database tier with read replicas and PgBouncer pooling, eliminating connection-exhaustion errors under load.
  • Drove cloud cost governance: BigQuery billing-anomaly detection, automated monthly reporting, and remediation of a recurring Cloud Monitoring cost leak.

DevOps Engineer · Hexanika Research Pvt. Ltd.

May 2021 – Oct 2023

Pune, India

  • Provisioned multi-environment GCP infrastructure with Terraform — reducing provisioning from 2+ days to under 30 minutes.
  • Authored reusable Terraform modules for networking, compute and storage, eliminating drift across 3 environments.
  • Built end-to-end CI/CD pipelines on Azure DevOps with automated testing gates.
  • Automated SQL Server stored procedures and optimized Azure Data Factory pipelines, cutting manual data effort by ~50%.

~/areas-of-expertise

  • Production ownership & reliability engineering on GCP at scale
  • Zero-downtime & Blue-Green deployment architecture
  • Database scaling — read replicas, PgBouncer connection pooling, backups
  • Observability & SLOs with Datadog and Cloud Monitoring
  • Cloud cost governance — anomaly detection, reporting, leak remediation
  • Platform automation — Terraform IaC, GitOps CI/CD, Kubernetes (GKE)
02 ~/skills

Skills

The tools and platforms I reach for to build reliable, automated, observable infrastructure.

Cloud

Google Cloud Platform
Compute Engine
Cloud Run
Cloud SQL
BigQuery (billing)
AWS (ECS, S3)

Kubernetes & Containers

GKE (multi-cluster)
Helm
Docker
Namespace isolation
Rollout validation

CI/CD & GitOps

GitHub Actions
Self-hosted runners
Azure DevOps
Jenkins
Blue-Green deploys
Auto-rollback

Observability

Datadog (APM, dashboards, alerts)
SLOs & error budgets
GCP Cloud Monitoring
Uptime monitors

Databases

Cloud SQL (Postgres)
Read replicas
PgBouncer pooling
Automated backups & restore
SQL Server

Security & Networking

VPC & private subnets
IAM & RBAC
Secret Manager
SSL/TLS termination
Load balancing

Cost & FinOps

Billing anomaly detection
BigQuery billing export
Automated cost reporting
Right-sizing
Leak remediation

IaC & Automation

Terraform (modules, state, workspaces)
Bash
Python automation
Linux (Ubuntu/Debian)
03 ~/production-impact

Production Impact

The production systems I own and the initiatives I've shipped to make them more reliable, automated and cost-efficient — not infrastructure I merely managed.

Zero · Major Production Outages
46+ · Servers Monitored
62+ · Production Services Supported
Fleet-wide · GitHub Runners Standardized
Automated · Database Backup Coverage
Multi-project · GCP Estate Operated
30% · Reduction in Deploy Failures
Recurring · Cloud Cost Leaks Remediated

Deployment Safety

Zero-Downtime Frontend Deployments

Re-architected frontend releases behind a Blue-Green model so users never hit a cold or half-deployed environment during a rollout.

Deployment Safety

Blue-Green Deployment Implementation

Stood up parallel Blue/Green environments behind the load balancer with health-gated traffic switching and instant rollback.

Database Scaling

Read Replica Rollout

Introduced read replicas to offload read-heavy traffic from the primary, removing it as a single point of contention.
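
A minimal Python sketch of the read/write split behind a replica rollout. It is an illustration under assumptions, not the production setup: the `QueryRouter` class, the host names and the "starts with SELECT" rule are all hypothetical.

```python
import itertools


class QueryRouter:
    """Send writes to the primary and spread reads across replicas (hypothetical helper)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, reads simply fall back to the primary.
        self._read_targets = itertools.cycle(replicas or [primary])

    def target_for(self, sql):
        # Naive classification: only statements that start with SELECT count as reads.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._read_targets) if is_read else self.primary


router = QueryRouter("primary", ["replica-1", "replica-2"])
print(router.target_for("SELECT * FROM orders"))     # a replica
print(router.target_for("UPDATE orders SET paid=1"))  # the primary
```

A real router also has to keep locking reads (for example `SELECT ... FOR UPDATE`) and reads that must see a just-committed write on the primary, because replicas can lag behind it.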

Database Scaling

PgBouncer Connection Pooling

Deployed PgBouncer to pool and reuse database connections, eliminating connection-exhaustion errors under load.
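
PgBouncer is a separate proxy process, so the Python below is not how it is configured. It is only a small in-process sketch of the pooling idea it relies on: a fixed set of connections is opened once and reused, and callers wait for a free one instead of opening more. The `ConnectionPool` class and its parameters are hypothetical.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are opened once and handed out repeatedly."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty after `timeout` seconds.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always return the connection so the next caller can reuse it.
            self._idle.put(conn)
```

Because the pool size is fixed, the database sees at most `size` connections no matter how many callers queue up, which is the property that prevents connection exhaustion.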

Observability

Production Monitoring & Observability

Datadog APM + custom DogStatsD metrics on long-lived VM services (workers, Kafka consumers), with monitors and dashboards as code; Cloud Monitoring for serverless and CI.

CI/CD Reliability

GitHub Actions Runner Standardization

Standardized self-hosted runners with consistent images and auto-recovery, ending the flaky, one-off runner failures that stalled pipelines.

Cost Governance

Billing Anomaly Detection Automation

Automated spend monitoring on top of BigQuery billing exports with threshold alerts, so cost spikes surface in hours, not at month-end.
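
One way a threshold alert like this can work, sketched in Python. It assumes daily cost totals have already been queried out of the billing export. The rolling-baseline rule, the function name and the default values are illustrative assumptions, not the exact rule used in production.

```python
from statistics import mean, pstdev


def find_cost_anomalies(daily_costs, window=7, sigma=3.0, min_delta=10.0):
    """Return (index, cost) pairs where a day's cost exceeds its trailing baseline.

    A day is flagged when it is above the trailing mean by more than
    max(sigma * stddev, min_delta); min_delta avoids alerts on tiny wobbles.
    """
    anomalies = []
    for i in range(window, len(daily_costs)):
        baseline = daily_costs[i - window:i]
        threshold = mean(baseline) + max(sigma * pstdev(baseline), min_delta)
        if daily_costs[i] > threshold:
            anomalies.append((i, daily_costs[i]))
    return anomalies


costs = [100, 102, 98, 101, 99, 100, 103, 180, 101]
print(find_cost_anomalies(costs))  # [(7, 180)]
```

Running a check like this on a schedule is what lets a spike surface within hours rather than at month-end.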

Cost Governance

Monthly Cloud Billing Reporting

Automated recurring billing reports that give stakeholders clear, on-time visibility into cloud spend and trends.

Cost Optimization

Cloud Cost Leak Investigation

Traced a recurring Cloud Monitoring overspend to noisy Ops Agent metric collection and remediated it, recovering ongoing waste.

Platform Engineering

Infrastructure Automation

Replaced manual, error-prone operational steps with repeatable automation across provisioning, deploys and reporting.

Observability

Monitoring Coverage Expansion

Extended monitoring to previously blind services and hosts, closing detection gaps that delayed incident response.

Reliability

Database Backup Automation

Automated scheduled backups with retention and restore validation, turning disaster recovery from hope into a tested procedure.
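
A retention rule can be sketched in a few lines of Python. The window length, the safety floor and the function name are hypothetical values chosen for illustration, not the production policy.

```python
from datetime import date, timedelta


def backups_to_delete(backup_dates, today, keep_days=7, keep_minimum=3):
    """Pick backups older than the retention window, never dropping below keep_minimum."""
    newest_first = sorted(backup_dates, reverse=True)
    cutoff = today - timedelta(days=keep_days)
    # The newest few backups are protected even if backups stopped running for a while.
    protected = set(newest_first[:keep_minimum])
    return [d for d in newest_first if d < cutoff and d not in protected]


today = date(2024, 6, 30)
backups = [today - timedelta(days=n) for n in range(10)]
print(backups_to_delete(backups, today))  # the two backups older than 7 days
```

The `keep_minimum` floor matters when the backup job itself has been failing: pruning by age alone would otherwise delete the last good copies.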

Platform Engineering

Kubernetes Platform Improvements

Hardened GKE workloads with namespace isolation, Helm-driven deploys and rollout validation to shrink blast radius.

Reliability

Reliability Engineering Initiatives

Drove proactive reliability work — health checks, auto-rollback and SLO alerting — that kept production outage-free at scale.
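
SLO-style alerting rests on error-budget arithmetic, sketched here in Python. The function name and the example numbers are illustrative assumptions, not figures from production.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


# A 99.9% target over 1,000,000 requests allows about 1,000 failures,
# so 250 failures leaves roughly three quarters of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%}")
```

Alerting on how fast this number falls, rather than on every individual error, is what keeps pages tied to user impact.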

04 ~/engineering-case-studies

Engineering Case Studies

Nine production initiatives, each written up end-to-end — problem, existing architecture, investigation, solution, rollout, validation, results and lessons learned.

Executive Summary

Re-architected frontend releases onto a Blue-Green model behind the load balancer, turning risky in-place deploys into health-gated, instantly-reversible traffic switches — and removing user-facing downtime from the release path.

Problem

Frontend deployments updated the live environment in place. During a rollout users could hit a half-deployed or cold environment, and a bad release meant a scramble to roll forward while customers saw errors.

Existing Architecture

  • Single live frontend environment behind the load balancer
  • In-place deploys that mutated the serving environment directly
  • No warm standby to fall back to; rollback meant redeploying the previous build

Investigation

  • Mapped exactly when users saw errors during a deploy window
  • Identified the gap between 'new version starting' and 'new version healthy' as the downtime source
  • Confirmed the load balancer could shift traffic between backends near-instantly

Challenges

  • Keeping the old version fully serving until the new one is proven healthy
  • Avoiding double-charging for two full environments any longer than necessary
  • Making rollback a single, fast, low-risk action under pressure

Solution Architecture

  • Parallel Blue (current) and Green (next) environments behind one HTTP(S) load balancer
  • Deploy lands on the idle environment while the active one keeps serving 100% of traffic
  • Health checks gate the cutover; traffic only switches once Green is verified healthy
  • Instant rollback by pointing traffic back at the still-warm previous environment

Implementation

  • Templated environment provisioning so Blue/Green are byte-for-byte identical
  • Wired deploy automation to target the idle color, run health checks, then flip the LB
  • Added readiness gating so a failed health check aborts the cutover automatically
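
The cutover flow described above can be sketched in Python. The `deploy`, `is_healthy` and `switch_traffic` callables are hypothetical stand-ins for the real deploy step, health probe and load-balancer update, and the number of probes is an assumed default.

```python
def blue_green_cutover(active, deploy, is_healthy, switch_traffic, checks=3):
    """Deploy to the idle color and switch traffic only if every health check passes.

    Returns the color that is serving traffic afterwards.
    """
    idle = "green" if active == "blue" else "blue"
    deploy(idle)
    # Gate on health, not on elapsed time: the first failed probe aborts the cutover.
    if not all(is_healthy(idle) for _ in range(checks)):
        return active  # the old environment never stopped serving
    switch_traffic(idle)
    return idle


serving = blue_green_cutover(
    "blue",
    deploy=lambda color: print(f"deploying to {color}"),
    is_healthy=lambda color: True,
    switch_traffic=lambda color: print(f"traffic -> {color}"),
)
print(serving)  # green
```

Rollback is the same `switch_traffic` call pointed back at the previous color, which is why it stays fast under pressure.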

Rollout Strategy

  • Piloted on a lower-risk frontend surface to prove the switch + rollback flow
  • Ran old and new paths side by side before making Blue-Green the default release path
  • Documented the cutover/rollback runbook so any engineer can execute it

Validation

  • Verified zero error spikes in monitoring across deploy windows
  • Rehearsed rollback to confirm sub-minute reversion to the previous version
  • Confirmed health gates correctly block an intentionally broken build from taking traffic

Results

  • Eliminated user-facing downtime from frontend releases
  • Rollback became a single fast traffic switch instead of a redeploy
  • Teams ship frontend changes during business hours with confidence

Zero · Deploy-window downtime
Instant · Rollback (traffic switch)
Health-gated · Cutover

Lessons Learned

  • Downtime usually hides in the gap between 'started' and 'healthy' — gate the cutover on health, not on time.
  • A warm previous environment is the cheapest insurance policy you can buy for a release.

GCP Load Balancer · Blue-Green · Health Checks · GitHub Actions · Cloud Run

05 ~/architecture-gallery

Architecture Gallery

The systems behind the case studies, drawn out. Pick a diagram to see how the pieces fit together.

blue-green-deployment-architecture.mmd

New versions deploy to the idle environment and only receive traffic once health checks pass — rollback is an instant switch back to the warm previous version.

06 ~/operational-excellence

Operational Excellence

The business outcomes behind the engineering: less toil, earlier detection, safer releases, standardized infrastructure and governed cloud spend.

Automation Over Toil

Reduced manual operational effort across the lifecycle

  • Automated billing reports, backups and provisioning that were once manual
  • GitOps pipelines removed hand-run release steps
  • Self-healing runners eliminated repetitive CI babysitting

Proactive Monitoring & Alerting

Issues surface before users feel them

  • Datadog APM + custom-metric alerting on stateful VM services
  • Billing anomaly alerts catch cost spikes in hours
  • Cloud Monitoring → Slack for serverless and CI signals

Faster Detection & Resolution

Lower MTTD and calmer incident response

  • Single pane of glass for fast triage
  • Automatic Helm rollback on failed health checks
  • Namespace isolation contains failures to one domain

Deployment Safety

Releases that don't scare anyone

  • Blue-Green frontend deploys with zero downtime
  • Health-gated cutovers and instant rollback
  • 30% fewer deployment failures on a core platform service

Infrastructure Standardization

One paved path for every team

  • Reusable Terraform modules, drift-free across 3 environments
  • Standardized Helm charts and CI runners
  • Secret Manager-backed secrets, never baked into images

Cloud Cost Governance

Spend that reflects real demand

  • Billing anomaly detection and automated monthly reporting
  • Recovered a recurring Cloud Monitoring cost leak
  • Right-sized compute and scale-to-zero where it fits
07 ~/certifications

Certifications

Credentials and education backing the hands-on production experience.

Certifications

Google Cloud Fundamentals: Core Infrastructure

Google Cloud

Infrastructure Automation with Terraform

HashiCorp / Coursera

Deploying Apps with Docker & ECS

Cloud Training

Industry Relevant AWS Training

AWS

Education

Master of Computer Applications (MCA)

MIT World Peace University, Pune

2021 · 81.2%

BSc Information Technology

Saurashtra University, Junagadh

2018 · 69.7%

09 ~/contact

Contact

Hiring for a DevOps, Platform or SRE role? Let's talk about keeping your infrastructure fast, reliable and secure.

$ contact --init

Let's build reliable infrastructure together.

I'm open to DevOps / Platform / SRE opportunities and infrastructure consulting. The fastest way to reach me is email — I usually reply within a day.

Ahmedabad, India · Open to remote