Platform · DevOps · Cloud Reliability Engineer

I own production — and make it more reliable, observable & cost-efficient.

Platform & Cloud Reliability Engineer with 4.7+ years owning the production estate on GCP. I built Blue-Green zero-downtime deploys, scaled databases with read replicas & PgBouncer, rolled out Datadog SLO observability, automated backups and cloud-cost governance — and kept 62+ services running with zero major outages.

Production Ownership · Zero-Downtime Deploys · Observability & SLOs · Database Scaling · Cloud Cost Governance


Ahmedabad, India · 4.7+ yrs production ownership · Open to Platform / SRE / Cloud roles

manmohansingh@gcp:~/production

46+ Servers Monitored: VM fleet across dev · stage · prod
62+ Production Services: Cloud Run + multi-cluster GKE
30% Fewer Deploy Failures: Helm rollout validation + auto-rollback
15+ Zero-Downtime Deploys / Day: near-zero manual intervention

01 ~/about-me

About Me

I own production systems — and I make them more reliable, more observable, and cheaper to run.

I'm a Platform / DevOps & Cloud Reliability Engineer with 4.7+ years owning cloud-native infrastructure on Google Cloud Platform. I don't just keep the lights on: I own the production estate for key.AI (Blueprint X Labs), which spans 46+ VMs, 62+ Cloud Run services and multi-cluster GKE across dev, stage and production. I'm also the person who improves it, with safer deploys, deeper observability, scalable databases and a cloud bill that reflects real demand.

My work is reliability-first and automation-led: Blue-Green frontend deployments for zero-downtime releases, Helm rollout validation with automatic rollback, Read Replica + PgBouncer database scaling, Datadog observability with SLO-style alerting, and GitOps CI/CD on standardized, self-healing GitHub Actions runners.

I treat cloud cost as an engineering problem. I built billing-anomaly detection on BigQuery billing exports, automated monthly cost reporting for stakeholders, and traced a recurring Cloud Monitoring overspend to noisy Ops Agent metric collection, then remediated it. Earlier, at Hexanika, I laid the Terraform foundations that cut provisioning from 2+ days to under 30 minutes and eliminated drift across three environments.

DevOps / Platform Engineer · Blueprint X Labs Inc (key.AI)

Jun 2024 – Present

Ahmedabad, India

Own and operate production infrastructure across dev, stage and production — 46+ VMs, 62+ Cloud Run services and multi-cluster GKE with zero major outages.
Built Blue-Green frontend deployments and Helm rollout validation with auto-rollback — zero-downtime releases and 30% fewer deploy failures.
Scaled the database tier with read replicas and PgBouncer pooling, eliminating connection-exhaustion errors under load.
Drove cloud cost governance: BigQuery billing-anomaly detection, automated monthly reporting, and remediation of a recurring Cloud Monitoring cost leak.

DevOps Engineer · Hexanika Research Pvt. Ltd.

May 2021 – Oct 2023

Pune, India

Provisioned multi-environment GCP infrastructure with Terraform — reducing provisioning from 2+ days to under 30 minutes.
Authored reusable Terraform modules for networking, compute and storage, eliminating drift across 3 environments.
Built end-to-end CI/CD pipelines on Azure DevOps with automated testing gates.
Automated SQL Server stored procedures and optimised Azure Data Factory pipelines, cutting manual data effort by ~50%.

~/areas-of-expertise

Production ownership & reliability engineering on GCP at scale
Zero-downtime & Blue-Green deployment architecture
Database scaling — read replicas, PgBouncer connection pooling, backups
Observability & SLOs with Datadog and Cloud Monitoring
Cloud cost governance — anomaly detection, reporting, leak remediation
Platform automation — Terraform IaC, GitOps CI/CD, Kubernetes (GKE)

02 ~/skills

Skills

The tools and platforms I reach for to build reliable, automated, observable infrastructure.

Cloud: Google Cloud Platform · Compute Engine · Cloud Run · Cloud SQL · BigQuery (billing) · AWS (ECS, S3)

Kubernetes & Containers: GKE (multi-cluster) · Helm · Docker · Namespace isolation · Rollout validation

CI/CD & GitOps: GitHub Actions · Self-hosted runners · Azure DevOps · Jenkins · Blue-Green deploys · Auto-rollback

Observability: Datadog (APM, dashboards, alerts) · SLOs & error budgets · GCP Cloud Monitoring · Uptime monitors

Databases: Cloud SQL (Postgres) · Read replicas · PgBouncer pooling · Automated backups & restore · SQL Server

Security & Networking: VPC & private subnets · IAM & RBAC · Secret Manager · SSL/TLS termination · Load balancing

Cost & FinOps: Billing anomaly detection · BigQuery billing export · Automated cost reporting · Right-sizing · Leak remediation

IaC & Automation: Terraform (modules, state, workspaces) · Bash · Python automation · Linux (Ubuntu/Debian)

03 ~/production-impact

Production Impact

The production systems I own and the initiatives I've shipped to make them more reliable, automated and cost-efficient — not infrastructure I merely managed.

Major Production Outages: Zero
Servers Monitored: 46+
Production Services Supported: 62+
GitHub Runners Standardized: Fleet-wide
Database Backup Coverage: Automated
GCP Estate Operated: Multi-project
Reduction in Deploy Failures: 30%
Cloud Cost Leaks Remediated: Recurring

Deployment Safety

Zero-Downtime Frontend Deployments

Re-architected frontend releases behind a Blue-Green model so users never hit a cold or half-deployed environment during a rollout.

Deployment Safety

Blue-Green Deployment Implementation

Stood up parallel Blue/Green environments behind the load balancer with health-gated traffic switching and instant rollback.

Database Scaling

Read Replica Rollout

Introduced read replicas to offload read-heavy traffic from the primary, removing it as a single point of contention.

Database Scaling

PgBouncer Connection Pooling

Deployed PgBouncer to pool and reuse database connections, eliminating connection-exhaustion errors under load.
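
To make the connection-exhaustion point concrete, here is a back-of-the-envelope sketch in Python. It assumes PgBouncer runs in transaction pooling mode, where the number of server connections is bounded by `default_pool_size` per database/user pair (a real PgBouncer setting). Every number below is hypothetical and chosen only for illustration; none of them are the production values, and the helper names are made up for this example.

```python
def direct_backend_connections(app_instances: int, conns_per_instance: int) -> int:
    """Postgres backends opened when every app instance connects directly."""
    return app_instances * conns_per_instance


def pooled_backend_connections(db_user_pairs: int, default_pool_size: int) -> int:
    """Rough upper bound on backends PgBouncer opens: one pool per database/user pair.

    Ignores reserve pools and admin connections to keep the arithmetic simple.
    """
    return db_user_pairs * default_pool_size


def exceeds_limit(backends: int, max_connections: int) -> bool:
    """True when the demand would exhaust the Postgres connection limit."""
    return backends > max_connections


# Hypothetical figures, for illustration only.
MAX_CONNECTIONS = 200
direct = direct_backend_connections(app_instances=60, conns_per_instance=10)
pooled = pooled_backend_connections(db_user_pairs=4, default_pool_size=20)

direct_exhausts = exceeds_limit(direct, MAX_CONNECTIONS)
pooled_exhausts = exceeds_limit(pooled, MAX_CONNECTIONS)
```

With direct connections, demand grows with the number of application instances. With pooling, it is capped by pool configuration, which is why scaling the application tier stops translating into connection errors on the database.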

Observability

Production Monitoring & Observability

Datadog APM + custom DogStatsD metrics on long-lived VM services (workers, Kafka consumers), with monitors and dashboards as code; Cloud Monitoring for serverless and CI.

CI/CD Reliability

GitHub Actions Runner Standardization

Standardized self-hosted runners with consistent images and auto-recovery, ending the flaky, one-off runner failures that stalled pipelines.

Cost Governance

Billing Anomaly Detection Automation

Automated spend monitoring on top of BigQuery billing exports with threshold alerts, so cost spikes surface in hours, not at month-end.
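
The threshold idea can be sketched in a few lines of Python. This is a minimal illustration, not the production job: it assumes daily cost totals have already been queried from the billing export into a list of `(date, cost)` pairs, and the function name, the 7-day window and the 1.5x multiplier are all hypothetical choices made for the example.

```python
from statistics import mean


def find_cost_anomalies(daily_costs, window=7, threshold=1.5):
    """Flag days whose cost exceeds `threshold` times the trailing-window average.

    daily_costs: list of (date_string, cost) tuples, ordered oldest to newest.
    Returns a list of (date_string, cost, baseline) for each flagged day.
    """
    anomalies = []
    for i in range(window, len(daily_costs)):
        # Baseline is the average of the `window` days before the current one.
        baseline = mean(cost for _, cost in daily_costs[i - window:i])
        day, cost = daily_costs[i]
        if baseline > 0 and cost > threshold * baseline:
            anomalies.append((day, cost, baseline))
    return anomalies


# Hypothetical data: eight flat days, then a spike.
sample = [(f"day-{n}", 100) for n in range(1, 9)] + [("day-9", 200)]
flagged = find_cost_anomalies(sample)
```

Comparing each day against a trailing average rather than a fixed budget keeps the alert meaningful as baseline spend drifts, at the cost of being slow to notice gradual growth.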

Cost Governance

Monthly Cloud Billing Reporting

Automated recurring billing reports that give stakeholders clear, on-time visibility into cloud spend and trends.

Cost Optimization

Cloud Cost Leak Investigation

Traced a recurring Cloud Monitoring overspend to noisy Ops Agent metric collection and remediated it, recovering ongoing waste.

Platform Engineering

Infrastructure Automation

Replaced manual, error-prone operational steps with repeatable automation across provisioning, deploys and reporting.

Observability

Monitoring Coverage Expansion

Extended monitoring to previously-blind services and hosts, closing detection gaps that delayed incident response.

Reliability

Database Backup Automation

Automated scheduled backups with retention and restore validation, turning disaster recovery from hope into a tested procedure.

Platform Engineering

Kubernetes Platform Improvements

Hardened GKE workloads with namespace isolation, Helm-driven deploys and rollout validation to shrink blast radius.

Reliability

Reliability Engineering Initiatives

Drove proactive reliability work — health checks, auto-rollback and SLO alerting — that kept production outage-free at scale.

04 ~/engineering-case-studies

Engineering Case Studies

Nine production initiatives, each written up end-to-end — problem, existing architecture, investigation, solution, rollout, validation, results and lessons learned.

Executive Summary

Re-architected frontend releases onto a Blue-Green model behind the load balancer, turning risky in-place deploys into health-gated, instantly-reversible traffic switches — and removing user-facing downtime from the release path.

Problem

Frontend deployments updated the live environment in place. During a rollout users could hit a half-deployed or cold environment, and a bad release meant a scramble to roll forward while customers saw errors.

Existing Architecture

Single live frontend environment behind the load balancer
In-place deploys that mutated the serving environment directly
No warm standby to fall back to; rollback meant redeploying the previous build

Investigation

Mapped exactly when users saw errors during a deploy window
Identified the gap between 'new version starting' and 'new version healthy' as the downtime source
Confirmed the load balancer could shift traffic between backends near-instantly

Challenges

Keeping the old version fully serving until the new one is proven healthy
Avoiding the cost of running two full environments any longer than necessary
Making rollback a single, fast, low-risk action under pressure

Solution Architecture

Parallel Blue (current) and Green (next) environments behind one HTTP(S) load balancer
Deploy lands on the idle environment while the active one keeps serving 100% of traffic
Health checks gate the cutover; traffic only switches once Green is verified healthy
Instant rollback by pointing traffic back at the still-warm previous environment

Implementation

Templated environment provisioning so Blue and Green are built from the same definition and differ only in the deployed version
Wired deploy automation to target the idle color, run health checks, then flip the LB
Added readiness gating so a failed health check aborts the cutover automatically
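
The deploy, check, flip sequence above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the actual pipeline: `deploy`, `is_healthy` and `switch_traffic` are hypothetical callables standing in for the real deploy step, health probe and load balancer update, and the `required_passes` count is an assumed parameter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CutoverResult:
    active: str     # color serving traffic after the attempt
    switched: bool  # True only if traffic moved to the new color
    reason: str


def idle_color(active: str) -> str:
    """Return the environment that is not currently serving traffic."""
    if active not in ("blue", "green"):
        raise ValueError(f"unknown color: {active!r}")
    return "green" if active == "blue" else "blue"


def blue_green_deploy(
    active: str,
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    switch_traffic: Callable[[str], None],
    required_passes: int = 3,
) -> CutoverResult:
    """Deploy to the idle color and flip traffic only if every health check passes."""
    target = idle_color(active)
    deploy(target)  # the active color keeps serving while this runs
    for attempt in range(1, required_passes + 1):
        if not is_healthy(target):
            # Abort: traffic never left the active color, so nothing to roll back.
            return CutoverResult(active, False, f"health check {attempt} failed on {target}")
    switch_traffic(target)
    return CutoverResult(target, True, f"{target} passed {required_passes} health checks")
```

Because the previous color stays warm after a successful cutover, rolling back is the same `switch_traffic` call pointed at the old color; a failed gate needs no rollback at all.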

Rollout Strategy

Piloted on a lower-risk frontend surface to prove the switch + rollback flow
Ran old and new paths side by side before making Blue-Green the default release path
Documented the cutover/rollback runbook so any engineer can execute it

Validation

Verified zero error spikes in monitoring across deploy windows
Rehearsed rollback to confirm sub-minute reversion to the previous version
Confirmed health gates correctly block an intentionally-broken build from taking traffic

Results

Eliminated user-facing downtime from frontend releases
Rollback became a single fast traffic switch instead of a redeploy
Teams ship frontend changes during business hours with confidence

Deploy-window downtime: Zero
Rollback (traffic switch): Instant
Cutover: Health-gated

Lessons Learned

Downtime usually hides in the gap between 'started' and 'healthy' — gate the cutover on health, not on time.
A warm previous environment is the cheapest insurance policy you can buy for a release.

GCP Load Balancer · Blue-Green · Health Checks · GitHub Actions · Cloud Run

05 ~/architecture-gallery

Architecture Gallery

The systems behind the case studies, drawn out. Pick a diagram to see how the pieces fit together.

blue-green-deployment-architecture.mmd

New versions deploy to the idle environment and only receive traffic once health checks pass — rollback is an instant switch back to the warm previous version.

06 ~/operational-excellence