CODITECT Cloud Backend - Staging Deployment Troubleshooting Guide
Document Status: Production
Last Updated: December 1, 2025
Environment: Google Kubernetes Engine (GKE) Staging
Purpose: Comprehensive troubleshooting guide for common deployment issues
Table of Contents
- Issue 1: GCR Deprecation (403 Forbidden)
- Issue 2: Multi-Platform Docker Build
- Issue 3: Dockerfile User Permissions
- Issue 4: Cloud SQL SSL Certificate Requirement
- Issue 5: Database User Authentication
- Issue 6: Django ALLOWED_HOSTS Rejection
- Issue 7: Health Probe HTTPS/HTTP Mismatch
- Quick Reference
- Related Documentation
Overview
This guide documents 7 critical issues encountered and resolved during the initial staging deployment of CODITECT Cloud Backend to Google Kubernetes Engine. Each issue includes:
- Error symptoms - What you'll see when this happens
- Root cause analysis - Why it happened
- Complete solution - How to fix it permanently
- Verification steps - How to confirm it's resolved
- Prevention guidance - How to avoid in future deployments
Deployment Context:
- Platform: Google Kubernetes Engine (GKE) Standard
- Region: us-central1
- Cluster: coditect-staging-cluster
- Namespace: coditect-staging
- Image Registry: Artifact Registry (us-central1-docker.pkg.dev)
- Database: Cloud SQL PostgreSQL 15
- Framework: Django 5.2.8 with Python 3.12.12
Issue 1: GCR Deprecation (403 Forbidden)
Error Symptoms
Failed to pull image "gcr.io/coditect-cloud-infra/coditect-cloud-backend:v1.0.0-staging":
rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/...":
failed to resolve reference "gcr.io/...":
pull access denied, repository does not exist or may require authorization:
server message: insufficient_scope: authorization failed
Pod Status: ImagePullBackOff or ErrImagePull
Timeline: After March 18, 2025 (GCR shutdown date)
Root Cause Analysis
Primary Cause: Google Container Registry (gcr.io) was deprecated and shut down on March 18, 2025.
Background:
- Google announced GCR deprecation in 2023
- Customer-project GCR URLs (`gcr.io/PROJECT_ID/...`, `us.gcr.io/...`, etc.) now return 403 Forbidden
- Google requires migration to Artifact Registry (pkg.dev)
- Existing images in GCR were migrated automatically, but new pushes fail
Why This Happened:
- Deployment manifests used legacy `gcr.io/PROJECT_ID/...` URLs
- GKE service account had the `storage.objectViewer` role (for GCR)
- Artifact Registry API enablement and permissions were missing
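The manifest change in Step 4 below is mechanical; if many manifests need updating, it can be scripted. A throwaway sketch (function name and defaults are illustrative, not part of the project):

```python
def to_artifact_registry(gcr_url: str,
                         region: str = "us-central1",
                         repository: str = "coditect-backend") -> str:
    """Rewrite a legacy gcr.io image reference into the Artifact Registry
    form REGION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG."""
    host, project, image = gcr_url.split("/", 2)
    if not host.endswith("gcr.io"):
        raise ValueError(f"not a GCR reference: {gcr_url}")
    return f"{region}-docker.pkg.dev/{project}/{repository}/{image}"
```

Running it on the old staging reference yields exactly the new URL used throughout this guide.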
Complete Solution
Step 1: Enable Artifact Registry API
gcloud services enable artifactregistry.googleapis.com \
--project=coditect-cloud-infra
Step 2: Create Artifact Registry Repository
gcloud artifacts repositories create coditect-backend \
--repository-format=docker \
--location=us-central1 \
--description="CODITECT Cloud Backend Docker Images" \
--project=coditect-cloud-infra
Verify:
gcloud artifacts repositories list \
--location=us-central1 \
--project=coditect-cloud-infra
Expected output:
REPOSITORY FORMAT MODE DESCRIPTION
coditect-backend DOCKER STANDARD_REPOSITORY CODITECT Cloud Backend Docker Images
Step 3: Grant GKE Service Account Pull Access
# Get GKE service account
GKE_SA=$(gcloud container clusters describe coditect-staging-cluster \
--region=us-central1 \
--format="value(nodeConfig.serviceAccount)")
# Grant Artifact Registry Reader role
gcloud artifacts repositories add-iam-policy-binding coditect-backend \
--location=us-central1 \
--member="serviceAccount:${GKE_SA}" \
--role="roles/artifactregistry.reader" \
--project=coditect-cloud-infra
Step 4: Update Deployment Manifests
File: deployment/kubernetes/staging/backend-deployment.yaml
Change:
# OLD (GCR - deprecated)
image: gcr.io/coditect-cloud-infra/coditect-cloud-backend:v1.0.0-staging
# NEW (Artifact Registry)
image: us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging
Step 5: Build and Push to Artifact Registry
# Configure Docker authentication
gcloud auth configure-docker us-central1-docker.pkg.dev
# Build for correct platform (see Issue 2)
docker buildx build \
--platform linux/amd64 \
--tag us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
--push \
.
Step 6: Apply Updated Deployment
kubectl apply -f deployment/kubernetes/staging/backend-deployment.yaml
Verification Steps
1. Check image pull status:
kubectl describe pod -n coditect-staging -l app=coditect-backend | grep -A 5 "Events:"
Success indicators:
- No `ImagePullBackOff` events
- `Successfully pulled image` message
- Pod status: `Running`
2. Verify image in Artifact Registry:
gcloud artifacts docker images list \
us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend
3. Test image pull manually:
docker pull us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging
Prevention Guidance
For Future Deployments:
- Always use Artifact Registry for new projects
  - Never use `gcr.io` URLs
  - Standard format: `REGION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG`
- Update CI/CD pipelines
  - GitHub Actions: Use `google-github-actions/auth@v2` with Artifact Registry
  - Replace all `gcr.io` references in workflow files
- Document registry locations
  - Add Artifact Registry URLs to deployment documentation
  - Include in `.env.example` files
- IAM least privilege
  - Grant only `roles/artifactregistry.reader` for pulling
  - Grant `roles/artifactregistry.writer` only to CI/CD service accounts
Related ADRs:
- ADR-0001: Google Cloud Platform as Primary Cloud Provider
- ADR-0007: Docker Container Strategy
Issue 2: Multi-Platform Docker Build
Error Symptoms
Error response from daemon:
no match for platform in manifest:
not found
Or:
exec /usr/local/bin/python: exec format error
Pod Status: CrashLoopBackOff with exit code 1
Timeline: After successfully pulling image from Artifact Registry
Root Cause Analysis
Primary Cause: Docker image built on macOS (arm64/Apple Silicon) incompatible with GKE nodes (linux/amd64).
Technical Details:
- macOS with Apple Silicon uses ARM64 architecture
- GKE Standard nodes use x86-64 (AMD64) architecture
- Default `docker build` creates a single-platform image for the host architecture
- Kubernetes pulls the image but cannot execute ARM64 binaries on AMD64 nodes
Why This Happened:
- Built image locally on MacBook (ARM64)
- Pushed to Artifact Registry without platform specification
- GKE attempted to run ARM64 image on AMD64 nodes
- Binary format mismatch caused exec errors
Complete Solution
Step 1: Install Docker Buildx (if not already installed)
# Verify buildx availability
docker buildx version
# If missing, install
docker buildx install
Step 2: Create Multi-Platform Builder
# Create builder instance
docker buildx create --name multiplatform --use
# Verify builder
docker buildx inspect multiplatform --bootstrap
Step 3: Build for Correct Platform
# Build for linux/amd64 (GKE platform)
docker buildx build \
--platform linux/amd64 \
--tag us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
--push \
.
Important Flags:
- `--platform linux/amd64` - Build for x86-64 (GKE nodes)
- `--push` - Push directly to registry (required for multi-platform)
- `.` - Dockerfile location
Step 4: Verify Image Manifest
# Inspect image architecture
docker buildx imagetools inspect \
us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging
Expected Output:
Name: us-central1-docker.pkg.dev/.../coditect-cloud-backend:v1.0.0-staging
MediaType: application/vnd.docker.distribution.manifest.v2+json
Digest: sha256:...
Manifests:
Name: us-central1-docker.pkg.dev/.../coditect-cloud-backend:v1.0.0-staging@sha256:...
MediaType: application/vnd.docker.distribution.manifest.v2+json
Platform: linux/amd64 <-- VERIFY THIS
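That check can also be scripted: `docker buildx imagetools inspect --raw IMAGE` emits the manifest list as JSON. A small parser sketch (the sample JSON below is fabricated for illustration):

```python
import json

def manifest_platforms(raw: str) -> list[str]:
    """Extract os/arch pairs from a Docker manifest list, as printed by
    `docker buildx imagetools inspect --raw IMAGE`."""
    doc = json.loads(raw)
    return [
        f"{m['platform']['os']}/{m['platform']['architecture']}"
        for m in doc.get("manifests", [])
        if "platform" in m
    ]

# Fabricated sample shaped like a Docker manifest list:
sample = '{"manifests": [{"digest": "sha256:abc", "platform": {"os": "linux", "architecture": "amd64"}}]}'
```

A pre-deploy script can fail fast if `"linux/amd64"` is missing from the result.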
Verification Steps
1. Check pod startup:
kubectl logs -n coditect-staging -l app=coditect-backend --tail=50
Success indicators:
- Django/Gunicorn startup messages
- No "exec format error"
- Application listening on port 8000
2. Verify architecture in running container:
kubectl exec -n coditect-staging deployment/coditect-backend -- uname -m
Expected: x86_64 (not aarch64)
3. Test image locally (optional):
# Pull and run locally with platform specification
docker run --platform linux/amd64 \
us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
python --version
Prevention Guidance
For Future Deployments:
- Always specify platform in builds
  - Add `--platform linux/amd64` to all production builds
  - Document in deployment runbooks
- CI/CD standardization
  - GitHub Actions runners are linux/amd64 by default (correct)
  - If building locally, create a shell alias:
    alias docker-build-gke='docker buildx build --platform linux/amd64'
- Multi-platform support (optional)
  - For broader compatibility, build multi-platform:
    docker buildx build \
      --platform linux/amd64,linux/arm64 \
      --tag IMAGE \
      --push .
  - Supports both AMD64 (GKE) and ARM64 (future ARM nodes)
- Makefile integration
  - Create a `Makefile` with standardized build commands:
    .PHONY: build-staging
    build-staging:
        docker buildx build \
            --platform linux/amd64 \
            --tag us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
            --push \
            .
Related ADRs:
- ADR-0007: Docker Container Strategy
Issue 3: Dockerfile User Permissions
Error Symptoms
Traceback (most recent call last):
File "/app/manage.py", line 11, in <module>
from django.core.management import execute_from_command_line
ModuleNotFoundError: No module named 'django'
Pod Logs:
/usr/local/bin/python: can't open file '/app/manage.py': [Errno 13] Permission denied
Or:
[Errno 13] Permission denied: '/app/staticfiles'
Pod Status: CrashLoopBackOff with exit code 1
Root Cause Analysis
Primary Cause: Python packages installed to /root/.local in builder stage, but runtime container runs as non-root user django (UID 1000) without access to /root.
Technical Details:
- Multi-stage Dockerfile uses `FROM python:3.12.12-slim-bookworm as builder`
- Builder stage runs as root and installs packages to `/root/.local`
- Runtime stage creates the user with `useradd -m -u 1000 django`
- `COPY --from=builder /root/.local` copies the packages into the runtime image, still owned by root
- The `USER django` directive switches to the non-root user
- User `django` cannot read `/root/.local` (permission denied)
Why This Happened:
- Security best practice: Containers should run as non-root
- GKE enforces `runAsNonRoot: true` in the pod security policy
- Incorrect assumption that copying files changes ownership
Complete Solution
File: Dockerfile
Original (Broken):
FROM python:3.12.12-slim-bookworm as builder
# ... build steps ...
RUN pip install --no-cache-dir --user -r requirements.txt # Installs to /root/.local
FROM python:3.12.12-slim-bookworm
RUN useradd -m -u 1000 django
COPY --from=builder /root/.local /root/.local # Root-owned files
USER django # Cannot access /root
Fixed:
FROM python:3.12.12-slim-bookworm as builder
WORKDIR /app
# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc g++ make postgresql-client libpq-dev \
&& rm -rf /var/lib/apt/lists/*
# Install Python packages to /root/.local
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
# Copy application code
COPY . .
# Collect static files
ENV DJANGO_SETTINGS_MODULE=license_platform.settings.production
ENV DJANGO_SECRET_KEY=temp-build-key
ENV DB_NAME=build DB_USER=build DB_PASSWORD=build DB_HOST=localhost
RUN python manage.py collectstatic --noinput --clear
# Stage 2: Runtime
FROM python:3.12.12-slim-bookworm
WORKDIR /app
# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
postgresql-client libpq-dev curl \
&& rm -rf /var/lib/apt/lists/*
# Create non-root user FIRST
RUN useradd -m -u 1000 django
# Copy Python packages to USER HOME directory (not /root)
COPY --from=builder /root/.local /home/django/.local
# Copy application code
COPY --from=builder /app /app
# Change ownership to django user
RUN chown -R django:django /app /home/django/.local
# Add .local/bin to PATH so django user can find installed packages
ENV PATH=/home/django/.local/bin:$PATH
# Switch to non-root user
USER django
EXPOSE 8000
CMD ["gunicorn", ...]
Key Changes:
- Create user first: `RUN useradd -m -u 1000 django` before copying files
- Copy to the user's home directory: `COPY --from=builder /root/.local /home/django/.local`
- Fix ownership: `RUN chown -R django:django /app /home/django/.local`
- Update PATH: `ENV PATH=/home/django/.local/bin:$PATH`
- Switch user last: `USER django` after ownership changes
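The permission failure above comes down to Unix read bits and ownership. This standalone sketch (function name is illustrative) approximates the check a process makes, and shows why a root-owned `0600` file is unreadable to uid 1000:

```python
import os
import stat

def readable_by(path: str, uid: int, gid: int) -> bool:
    """Approximate: can a process with this uid/gid read `path`?
    Mirrors the owner/group/other read bits involved in Issue 3."""
    st = os.stat(path)
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IRUSR)  # owner read bit
    if st.st_gid == gid:
        return bool(st.st_mode & stat.S_IRGRP)  # group read bit
    return bool(st.st_mode & stat.S_IROTH)      # other read bit
```

A file under `/root/.local` with mode `0600` passes this check only for its owner, which is exactly why `chown -R django:django` (or `COPY --chown`) is required.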
Verification Steps
1. Rebuild and push image:
docker buildx build \
--platform linux/amd64 \
--tag us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
--push \
.
2. Check pod startup:
kubectl logs -n coditect-staging -l app=coditect-backend --tail=50
Success indicators:
[INFO] Booting worker with pid: 7
[INFO] Listening at: http://0.0.0.0:8000
3. Verify user in running container:
kubectl exec -n coditect-staging deployment/coditect-backend -- whoami
# Expected: django
kubectl exec -n coditect-staging deployment/coditect-backend -- id
# Expected: uid=1000(django) gid=1000(django) groups=1000(django)
4. Verify package access:
kubectl exec -n coditect-staging deployment/coditect-backend -- python -c "import django; print(django.__version__)"
# Expected: 5.2.8
Prevention Guidance
For Future Dockerfiles:
-
Always create user before copying files
RUN useradd -m -u 1000 appuser
COPY --from=builder /root/.local /home/appuser/.local
RUN chown -R appuser:appuser /app /home/appuser/.local
USER appuser -
Use explicit ownership in COPY
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local -
Verify permissions in build
RUN ls -la /home/appuser/.local/lib/python*/site-packages/ | head -10 -
Test as non-root locally
docker run --rm -it --user 1000:1000 IMAGE /bin/bash
python -c "import django"
Related ADRs:
- ADR-0007: Docker Container Strategy
- ADR-0008: Security Best Practices
Issue 4: Cloud SQL SSL Certificate Requirement
Error Symptoms
django.db.utils.OperationalError:
connection to server at "10.41.64.3", port 5432 failed:
FATAL: connection requires a valid client certificate
Pod Logs:
psycopg2.OperationalError:
FATAL: pg_hba.conf rejects connection for host "10.56.1.10",
user "coditect_app", database "coditect_db", no encryption
Pod Status: CrashLoopBackOff during database connection
Root Cause Analysis
Primary Cause: Cloud SQL instance configured with requireSsl: true, but application not providing client SSL certificates.
Technical Details:
- Cloud SQL instance created with default security settings
- `settings.requireSsl: true` enforces TLS for all connections
- Django `DATABASES` configuration missing SSL parameters
- PostgreSQL server rejects non-SSL connections per `pg_hba.conf` rules
Why This Happened:
- Security best practice: Cloud SQL requires SSL by default
- Application not configured for SSL connections
- Missing Cloud SQL proxy configuration OR client certificates
Security Tradeoffs:
- Production: MUST use SSL with client certificates (high security)
- Staging: MAY disable SSL if on private VPC (convenience vs. security)
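For reference, the "missing SSL parameters" live in the `OPTIONS` dict of Django's `DATABASES` setting, which libpq/psycopg2 reads directly. A hedged sketch (env var names mirror this deployment; the `sslmode` default of `require` is an assumption, not the project's actual settings file):

```python
import os

# Sketch of a Django DATABASES entry that passes SSL parameters through
# to the PostgreSQL driver. "require" encrypts without verifying the server
# cert; "verify-ca"/"verify-full" additionally need a CA bundle.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DB_NAME", "coditect_db"),
        "USER": os.environ.get("DB_USER", "coditect_app"),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": os.environ.get("DB_HOST", "127.0.0.1"),
        "PORT": os.environ.get("DB_PORT", "5432"),
        "OPTIONS": {
            # libpq understands sslmode directly; no Django-specific key needed
            "sslmode": os.environ.get("DB_SSLMODE", "require"),
        },
    }
}
```

With the Cloud SQL Proxy sidecar (Option B below), `sslmode` can stay at its default, since the proxy handles TLS on the pod's loopback connection.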
Complete Solution
Option A: Disable SSL Requirement (Staging Only)
WARNING: Only for staging environments on private networks. NEVER for production.
gcloud sql instances patch coditect-db \
--no-require-ssl \
--project=coditect-cloud-infra
Verification:
gcloud sql instances describe coditect-db \
--project=coditect-cloud-infra \
--format="get(settings.ipConfiguration.requireSsl)"
Expected: False
Pros:
- Simple configuration
- No certificate management
- Faster debugging cycles
Cons:
- Unencrypted database connections (private VPC only)
- Not production-ready
- Security compliance risk
Option B: Use Cloud SQL Proxy (Production Recommended)
Step 1: Add Cloud SQL Proxy Sidecar
File: deployment/kubernetes/staging/backend-deployment.yaml
spec:
template:
spec:
containers:
# Main application container
- name: backend
image: us-central1-docker.pkg.dev/.../coditect-cloud-backend:v1.0.0-staging
env:
- name: DB_HOST
value: "127.0.0.1" # Connect to proxy sidecar
- name: DB_PORT
value: "5432"
# Cloud SQL Proxy sidecar
- name: cloud-sql-proxy
image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0
args:
- "--private-ip"
- "coditect-cloud-infra:us-central1:coditect-db"
- "--port=5432"
securityContext:
runAsNonRoot: true
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "200m"
Step 2: Grant Cloud SQL Client Role
gcloud projects add-iam-policy-binding coditect-cloud-infra \
--member="serviceAccount:coditect-cloud-backend@coditect-cloud-infra.iam.gserviceaccount.com" \
--role="roles/cloudsql.client"
Step 3: Apply Updated Deployment
kubectl apply -f deployment/kubernetes/staging/backend-deployment.yaml
Pros:
- Automatic SSL with Google-managed certificates
- No certificate file management
- Automatic IAM authentication support
- Production-ready security
Cons:
- Additional sidecar container overhead
- More complex configuration
Option C: Client SSL Certificates (Manual)
Not recommended - Use Cloud SQL Proxy instead.
Verification Steps
1. Test database connection from pod:
kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py dbshell -- --command "SELECT version();"
Success indicator: PostgreSQL version output
2. Check application startup:
kubectl logs -n coditect-staging -l app=coditect-backend --tail=50 | grep -i database
Success indicators:
- No SSL errors
- "Applying migration..." messages
3. Verify SSL status (if using proxy):
kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py dbshell -- --command "SELECT ssl_is_used();"
Expected (with proxy): t (true)
Expected (without SSL): f (false)
Prevention Guidance
For Future Deployments:
- Production: Always use Cloud SQL Proxy
  - Deploy proxy as sidecar container
  - Enable IAM authentication where possible
  - Never disable SSL requirement
- Staging: Document security tradeoffs
  - If disabling SSL, document it in the deployment README
  - Add a comment in Terraform/IaC files
  - Set a reminder to re-enable for production
- Infrastructure as Code
  - Terraform: Set `require_ssl = true` for production
  - Document SSL configuration in `terraform/variables.tf`
- Connection testing
  - Add a health check that verifies database SSL
  - Monitor SSL connection metrics in production
Related ADRs:
- ADR-0009: Database Architecture and Management
- ADR-0008: Security Best Practices
Issue 5: Database User Authentication
Error Symptoms
django.db.utils.OperationalError:
connection to server at "10.41.64.3", port 5432 failed:
FATAL: password authentication failed for user "coditect_app"
Pod Logs:
psycopg2.OperationalError:
FATAL: role "coditect_app" does not exist
Pod Status: CrashLoopBackOff during database connection initialization
Root Cause Analysis
Primary Cause: Database user coditect_app either doesn't exist in Cloud SQL instance or has incorrect password.
Technical Details:
- Cloud SQL instance created, but database users not provisioned
- Django `DATABASES` configuration references `DB_USER=coditect_app`
- Kubernetes secret `backend-secrets` may have the wrong password
- PostgreSQL rejects authentication for non-existent or mismatched credentials
Why This Happened:
- Database infrastructure (instance) created separately from application resources
- User creation not automated in Terraform/IaC
- Manual user creation step missed
- Password mismatch between gcloud command and Kubernetes secret
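If credentials are generated from a provisioning script rather than the shell, Python's `secrets` module gives an equivalent of the `openssl rand -base64 32` step used below (function name is illustrative):

```python
import secrets

def make_db_password(nbytes: int = 32) -> str:
    """Generate a URL-safe random password with `nbytes` of entropy,
    comparable to `openssl rand -base64 32`."""
    return secrets.token_urlsafe(nbytes)
```

Generating the password once in the script and feeding the same variable to both `gcloud sql users create` and `kubectl create secret` avoids the password-mismatch failure mode entirely.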
Complete Solution
Step 1: Verify Database Instance is Running
gcloud sql instances describe coditect-db \
--project=coditect-cloud-infra \
--format="get(state)"
Expected: RUNNABLE
Step 2: Create Database User
# Generate secure password (save this!)
DB_PASSWORD=$(openssl rand -base64 32)
echo "Database Password: $DB_PASSWORD"
# Create user in Cloud SQL
gcloud sql users create coditect_app \
--instance=coditect-db \
--password="$DB_PASSWORD" \
--project=coditect-cloud-infra
Verify:
gcloud sql users list \
--instance=coditect-db \
--project=coditect-cloud-infra
Expected Output:
NAME HOST
coditect_app %
postgres %
Step 3: Create Database (if not exists)
gcloud sql databases create coditect_db \
--instance=coditect-db \
--project=coditect-cloud-infra
Verify:
gcloud sql databases list \
--instance=coditect-db \
--project=coditect-cloud-infra
Step 4: Create Kubernetes Secret
# Delete old secret if exists
kubectl delete secret backend-secrets -n coditect-staging --ignore-not-found
# Create new secret with correct password
kubectl create secret generic backend-secrets \
-n coditect-staging \
--from-literal=django-secret-key="$(openssl rand -base64 64)" \
--from-literal=db-name="coditect_db" \
--from-literal=db-user="coditect_app" \
--from-literal=db-password="$DB_PASSWORD" \
--from-literal=db-host="10.41.64.3" \
--from-literal=redis-host="10.41.65.4"
Verify:
kubectl describe secret backend-secrets -n coditect-staging
Step 5: Restart Deployment
kubectl rollout restart deployment/coditect-backend -n coditect-staging
Step 6: Grant Database Permissions
# Connect to database
gcloud sql connect coditect-db \
--user=postgres \
--quiet \
--project=coditect-cloud-infra
# In psql prompt:
GRANT ALL PRIVILEGES ON DATABASE coditect_db TO coditect_app;
GRANT ALL PRIVILEGES ON SCHEMA public TO coditect_app;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON TABLES TO coditect_app;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON SEQUENCES TO coditect_app;
\q
Verification Steps
1. Test authentication from pod:
kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py check --database default
Success indicator: System check identified no issues (0 silenced).
2. Run database migrations:
kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py migrate --noinput
Success indicators:
- "Applying migrations..." messages
- No authentication errors
3. Test database query:
kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py dbshell --command "SELECT current_user, current_database();"
Expected Output:
current_user | current_database
--------------+------------------
coditect_app | coditect_db
4. Check application logs:
kubectl logs -n coditect-staging -l app=coditect-backend --tail=50 | grep -E "(database|migration)"
Success indicators:
- No authentication errors
- Migration success messages
Prevention Guidance
For Future Deployments:
- Automate user creation in Terraform
  resource "google_sql_user" "app_user" {
    name     = "coditect_app"
    instance = google_sql_database_instance.main.name
    password = random_password.db_password.result
  }
  resource "random_password" "db_password" {
    length  = 32
    special = true
  }
  resource "google_secret_manager_secret_version" "db_password" {
    secret      = google_secret_manager_secret.db_password.id
    secret_data = random_password.db_password.result
  }
- Use GCP Secret Manager (not Kubernetes secrets)
  - Store passwords in Secret Manager
  - Mount as environment variables in pods
  - Automatic rotation support
- Database initialization job
  - Create a Kubernetes Job to run migrations
  - Verify database connectivity before deployment
  - Example: deployment/kubernetes/staging/migrate-job.yaml
- Document database credentials
  - Store the password in a password manager (1Password, etc.)
  - Document user creation in the runbook
  - Add to disaster recovery procedures
Related ADRs:
- ADR-0009: Database Architecture and Management
- ADR-0008: Security Best Practices
Issue 6: Django ALLOWED_HOSTS Rejection
Error Symptoms
Invalid HTTP_HOST header: '10.56.2.20:8000'.
You may need to add '10.56.2.20' to ALLOWED_HOSTS.
Pod Logs:
DisallowedHost at /api/v1/health/live
Invalid HTTP_HOST header: '10.56.1.10:8000'.
The domain name provided is not valid according to RFC 1034/1035.
HTTP Response: 400 Bad Request
Health Probes: Failing with HTTP 400
Root Cause Analysis
Primary Cause: Django ALLOWED_HOSTS setting doesn't include Kubernetes pod IP addresses, and Django doesn't support CIDR notation natively.
Technical Details:
- Kubernetes assigns dynamic pod IPs from the cluster CIDR (e.g., `10.56.0.0/16`)
- Django `ALLOWED_HOSTS` requires an explicit hostname/IP list
- Health probes send requests with a `Host: POD_IP:8000` header
- Django rejects requests whose Host header is not in `ALLOWED_HOSTS`
- CIDR notation (`10.56.0.0/16`) is not supported by Django
Why This Happened:
- Security feature: Django prevents HTTP Host header attacks
- Production settings enforce strict `ALLOWED_HOSTS` validation
- Kubernetes pod IPs change on every deployment/restart
- Cannot predict exact pod IPs in advance
Security Tradeoffs:
- Production: Strict `ALLOWED_HOSTS` with explicit domains (high security)
- Staging: Relaxed `ALLOWED_HOSTS` for debugging (convenience vs. security)
Complete Solution
Option A: Wildcard for Staging (Quick Fix)
File: deployment/kubernetes/staging/backend-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: backend-config
namespace: coditect-staging
labels:
app: coditect-backend
environment: staging
data:
DJANGO_ALLOWED_HOSTS: "*" # Allow all hosts (staging only!)
Pros:
- Simple configuration
- Works with any pod IP
- No CIDR parsing needed
Cons:
- Disables Host header validation
- Vulnerable to Host header attacks
- NOT for production
Option B: Include Cluster CIDR + Service DNS (Balanced)
File: deployment/kubernetes/staging/backend-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: backend-config
namespace: coditect-staging
data:
# Comma-separated list (Django native format)
DJANGO_ALLOWED_HOSTS: "10.56.0.0/16,coditect-backend.coditect-staging.svc.cluster.local,*.coditect.com,localhost"
Note: Django doesn't natively support CIDR, so this requires custom middleware.
Step 1: Create Custom Middleware
File: src/license_platform/middleware/allowed_hosts.py
import ipaddress
from django.core.exceptions import DisallowedHost
from django.http import HttpResponseBadRequest
class CIDRAwareAllowedHostsMiddleware:
"""
Custom middleware to support CIDR notation in ALLOWED_HOSTS.
Checks if request Host header matches:
1. Exact hostnames (existing Django behavior)
2. CIDR ranges (custom logic)
"""
def __init__(self, get_response):
self.get_response = get_response
    def __call__(self, request):
        try:
            host = request.get_host().split(':')[0]  # Remove port
        except DisallowedHost:
            # get_host() itself enforces ALLOWED_HOSTS exact entries; fall
            # back to the raw header so CIDR entries below get a chance
            host = request.META.get('HTTP_HOST', '').split(':')[0]
        if not self.is_allowed_host(host):
            return HttpResponseBadRequest(f"Invalid Host: {host}")
        return self.get_response(request)

    def is_allowed_host(self, host):
        from django.conf import settings
        allowed_hosts = settings.ALLOWED_HOSTS
        # Wildcard allows everything
        if '*' in allowed_hosts:
            return True
        # Exact matches
        if host in allowed_hosts:
            return True
        # Subdomain wildcards (e.g. "*.coditect.com")
        for allowed in allowed_hosts:
            if allowed.startswith('*.') and host.endswith(allowed[1:]):
                return True
        # CIDR ranges
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            return False  # Hostname, already checked above
        for allowed in allowed_hosts:
            if '/' in allowed:  # CIDR notation
                network = ipaddress.ip_network(allowed, strict=False)
                if ip in network:
                    return True
        return False
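The CIDR matching at the heart of that middleware can be exercised on its own; a minimal standalone sketch of the same `ipaddress` logic (function name is illustrative):

```python
import ipaddress

def host_in_cidrs(host: str, allowed: list[str]) -> bool:
    """True if `host` parses as an IP that falls inside any CIDR entry
    of `allowed`. Hostnames return False; those are handled by the
    exact-match and wildcard checks instead."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False
    return any(
        "/" in entry and ip in ipaddress.ip_network(entry, strict=False)
        for entry in allowed
    )
```

For example, `host_in_cidrs("10.56.1.10", ["10.56.0.0/16"])` accepts the pod IP from the error symptoms above, while an address outside the cluster CIDR is rejected.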
Step 2: Register Middleware
File: src/license_platform/settings/production.py
MIDDLEWARE = [
'django.middleware.security.SecurityMiddleware',
'license_platform.middleware.allowed_hosts.CIDRAwareAllowedHostsMiddleware', # Add this
# ... rest of middleware
]
Step 3: Update ConfigMap
data:
DJANGO_ALLOWED_HOSTS: "10.56.0.0/16,coditect-backend.coditect-staging.svc.cluster.local,*.coditect.com,localhost"
Pros:
- Maintains security validation
- Supports dynamic pod IPs
- Production-ready with proper domains
Cons:
- Custom middleware maintenance
- Slight performance overhead
Option C: Use Service DNS Only (Recommended for Production)
File: deployment/kubernetes/staging/backend-deployment.yaml
livenessProbe:
httpGet:
path: /api/v1/health/live
port: http
httpHeaders:
- name: Host
value: coditect-backend.coditect-staging.svc.cluster.local
File: deployment/kubernetes/staging/backend-config.yaml
data:
DJANGO_ALLOWED_HOSTS: "coditect-backend.coditect-staging.svc.cluster.local,api.coditect.com"
Pros:
- Most secure (explicit hosts only)
- No custom middleware
- Best for production
Cons:
- Requires updating probe configuration
- Less flexible for debugging
Verification Steps
1. Check ConfigMap applied:
kubectl get configmap backend-config -n coditect-staging -o yaml
2. Restart deployment to pick up ConfigMap:
kubectl rollout restart deployment/coditect-backend -n coditect-staging
3. Test health endpoint:
kubectl exec -n coditect-staging deployment/coditect-backend -- \
curl -H "Host: 10.56.1.10:8000" http://localhost:8000/api/v1/health/live
Success indicator: HTTP 200 response
4. Check liveness probe status:
kubectl describe pod -n coditect-staging -l app=coditect-backend | grep -A 5 "Liveness:"
Success indicators:
- `Liveness: ... #success=...` (no failures)
- No "Liveness probe failed" events
5. Test from external request:
curl -H "Host: api.coditect.com" http://LOAD_BALANCER_IP/api/v1/health/live
Prevention Guidance
For Future Deployments:
- Production: Use explicit domains only
  ALLOWED_HOSTS = [
      'api.coditect.com',
      'api-staging.coditect.com',
      'coditect-backend.coditect-staging.svc.cluster.local',
  ]
- Staging: Use wildcard OR CIDR middleware
  - Wildcard for quick debugging
  - CIDR middleware for production-like testing
- Health probes: Use Service DNS
  livenessProbe:
    httpGet:
      httpHeaders:
      - name: Host
        value: service.namespace.svc.cluster.local
- Document tradeoffs in settings
  # settings/production.py
  # SECURITY WARNING: Wildcard ('*') disables Host header validation
  # Only use in staging/dev, NEVER in production
  # (filter out empty strings so an unset env var doesn't yield [''])
  ALLOWED_HOSTS = [h for h in os.environ.get('DJANGO_ALLOWED_HOSTS', '').split(',') if h]
Related ADRs:
- ADR-0008: Security Best Practices
Issue 7: Health Probe HTTPS/HTTP Mismatch
Error Symptoms
Liveness probe failed:
Get "https://10.56.1.10:8000/api/v1/health/live":
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Pod Logs:
No errors (application running correctly)
Pod Status: Not Ready despite application running
Health Probe Status: Failing with timeout
Root Cause Analysis
Primary Cause: Kubernetes health probes issuing HTTPS requests against an application that serves HTTP only. Production Django settings also redirect HTTP to HTTPS, compounding the mismatch.

Technical Details:
- Django production settings: `SECURE_SSL_REDIRECT = True`
- Application listens on HTTP port 8000 (no TLS termination)
- The `livenessProbe` ran with `scheme: HTTPS` (visible in the error: `Get "https://10.56.1.10:8000/..."`)
- The probe's TLS handshake against the plaintext port never completes
- The probe times out waiting for an HTTPS response

Why This Happened:
- Security best practice: Django enforces HTTPS in production
- TLS termination expected at the load balancer, not the application
- Probe `scheme` set to HTTPS even though the pods serve plain HTTP
- Note: the Kubernetes API defaults `scheme` to HTTP; setting it explicitly keeps the intent visible in the manifest
Architecture Context:
- Load Balancer: Terminates TLS, forwards HTTP to backend
- Backend: Serves HTTP only (trusts private network)
- Health Probes: Direct pod access (bypasses load balancer)
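As background for the probe behavior: per Kubernetes probe semantics, an httpGet probe counts any status from 200 up to (but not including) 400 as success, so the failures here were TLS handshake timeouts rather than bad status codes. A one-line sketch of that rule:

```python
def http_probe_succeeds(status_code: int) -> bool:
    """kubelet httpGet probes: any status in [200, 400) counts as success."""
    return 200 <= status_code < 400
```

Note that even a 301 redirect (as produced by `SECURE_SSL_REDIRECT`) would pass this check; the scheme mismatch fails earlier, at the connection layer.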
Complete Solution
File: deployment/kubernetes/staging/backend-deployment.yaml
Original (Broken):
livenessProbe:
  httpGet:
    path: /api/v1/health/live
    port: http # Named port (8000)
    scheme: HTTPS # Wrong: the pods serve plain HTTP on this port
  initialDelaySeconds: 30
  periodSeconds: 10
Fixed:
livenessProbe:
httpGet:
path: /api/v1/health/live
port: http
scheme: HTTP # Explicitly specify HTTP
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /api/v1/health/ready
port: http
scheme: HTTP # Also fix readiness probe
initialDelaySeconds: 20
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
Key Changes:
- Add `scheme: HTTP` to both liveness and readiness probes
- Explicit timeouts for clarity
- Consistent configuration across all probes
Step 1: Update Deployment Manifest
# Edit file
vim deployment/kubernetes/staging/backend-deployment.yaml
# Apply changes
kubectl apply -f deployment/kubernetes/staging/backend-deployment.yaml
Step 2: Verify Probe Configuration
kubectl get pod -n coditect-staging -l app=coditect-backend -o yaml | grep -A 10 "livenessProbe:"
Expected Output:
livenessProbe:
httpGet:
path: /api/v1/health/live
port: http
scheme: HTTP # Verify this line
Verification Steps
1. Watch pod status during rollout:
kubectl rollout status deployment/coditect-backend -n coditect-staging
Success indicator: deployment "coditect-backend" successfully rolled out
2. Check probe success:
kubectl describe pod -n coditect-staging -l app=coditect-backend | grep -A 10 "Liveness:"
Success indicators:
- Liveness: http-get http://:8000/api/v1/health/live
- No failure events
3. Manual probe test:
POD_IP=$(kubectl get pod -n coditect-staging -l app=coditect-backend -o jsonpath='{.items[0].status.podIP}')
curl -v http://$POD_IP:8000/api/v1/health/live
Expected: HTTP 200 response
4. Check probe failure history:
kubectl get events -n coditect-staging --field-selector involvedObject.name=coditect-backend --sort-by='.lastTimestamp' | tail -20
Success indicator: No recent "Liveness probe failed" events
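The manual probe test can also be reproduced entirely locally. The snippet below is a self-contained sketch: a stub handler stands in for the Django health view, and the client issues a probe-style GET that, like the kubelet, does not follow redirects.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stub health endpoint standing in for the Django app.
class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/v1/health/live":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# http.client does not follow redirects, so a 301/302 here would surface
# directly - just as it does for the kubelet probe.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/api/v1/health/live")
status = conn.getresponse().status
server.shutdown()
print(status)  # 200
```

If this pattern returns 301/302 against the real application, the SSL redirect is still intercepting probe traffic.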
Prevention Guidance
For Future Deployments:
1. Always specify the probe scheme explicitly

# Don't rely on defaults
livenessProbe:
  httpGet:
    scheme: HTTP  # Explicit is better than implicit

2. Document the TLS termination architecture

Client --HTTPS--> Load Balancer --HTTP--> Backend Pods
         (TLS termination)       (HTTP only)

3. Test probe endpoints independently

# Before deploying, test the probe endpoint
curl http://localhost:8000/api/v1/health/live

4. Monitor probe metrics

- Prometheus: prober_probe_success
- Grafana dashboard for health probe success rate

5. Production considerations

- If the backend must serve HTTPS directly, self-signed certificates are sufficient: Kubernetes HTTPS probes do not verify certificates
- If the backend stays on HTTP and Django is configured with SECURE_PROXY_SSL_HEADER, the probe can send the proxy header so the request is treated as already secure:

livenessProbe:
  httpGet:
    scheme: HTTP
    httpHeaders:
    - name: X-Forwarded-Proto
      value: https
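The explicit-scheme rule can also be enforced before deployment with a small manifest check. This is an illustrative sketch operating on a parsed pod spec dict (in practice the manifest YAML would be loaded first; the helper name is made up):

```python
# Hypothetical pre-deployment lint: walk a pod spec and flag httpGet
# probes that omit an explicit scheme.
def find_probes_without_scheme(pod_spec: dict) -> list[str]:
    findings = []
    for container in pod_spec.get("containers", []):
        for probe_name in ("livenessProbe", "readinessProbe", "startupProbe"):
            http_get = container.get(probe_name, {}).get("httpGet")
            if http_get is not None and "scheme" not in http_get:
                findings.append(f"{container['name']}: {probe_name} has no explicit scheme")
    return findings

# Inline example mirroring the broken/fixed probes from this issue:
pod_spec = {
    "containers": [{
        "name": "backend",
        "livenessProbe": {"httpGet": {"path": "/api/v1/health/live", "port": "http"}},
        "readinessProbe": {"httpGet": {"path": "/api/v1/health/ready", "port": "http", "scheme": "HTTP"}},
    }]
}
print(find_probes_without_scheme(pod_spec))
# ['backend: livenessProbe has no explicit scheme']
```

Wiring a check like this into CI would catch the regression before kubectl apply.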
Related ADRs:
- ADR-0008: Security Best Practices
- ADR-0005: Kubernetes Deployment Strategy
Quick Reference
Common Commands
# Check pod status
kubectl get pods -n coditect-staging -l app=coditect-backend
# View pod logs
kubectl logs -n coditect-staging -l app=coditect-backend --tail=100 -f
# Describe pod (includes events)
kubectl describe pod -n coditect-staging -l app=coditect-backend
# Execute command in pod
kubectl exec -n coditect-staging deployment/coditect-backend -- COMMAND
# Restart deployment
kubectl rollout restart deployment/coditect-backend -n coditect-staging
# Check rollout status
kubectl rollout status deployment/coditect-backend -n coditect-staging
# View ConfigMap
kubectl get configmap backend-config -n coditect-staging -o yaml
# View Secrets (keys only, not values)
kubectl describe secret backend-secrets -n coditect-staging
# Check service endpoints
kubectl get endpoints coditect-backend -n coditect-staging
Issue Decision Tree
Pod Status: ImagePullBackOff
└─> Issue 1: GCR Deprecation OR Issue 2: Platform Mismatch
Pod Status: CrashLoopBackOff
├─> Logs: "ModuleNotFoundError" → Issue 3: User Permissions
├─> Logs: "password authentication failed" → Issue 5: Database User
├─> Logs: "connection requires a valid client certificate" → Issue 4: SSL Requirement
└─> Logs: "DisallowedHost" → Issue 6: ALLOWED_HOSTS
Pod Status: Not Ready
├─> Liveness probe failed: HTTPS timeout → Issue 7: Probe HTTPS/HTTP
└─> Readiness probe failed: DB connection → Issue 4 or 5
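The decision tree above can be sketched as a small log-triage helper. The patterns mirror the error strings documented in Issues 3-6; the mapping is illustrative, not exhaustive:

```python
import re

# Map documented error signatures to the issues in this guide.
TRIAGE_RULES = [
    (r"ModuleNotFoundError", "Issue 3: User Permissions"),
    (r"password authentication failed", "Issue 5: Database User"),
    (r"connection requires a valid client certificate", "Issue 4: SSL Requirement"),
    (r"DisallowedHost", "Issue 6: ALLOWED_HOSTS"),
]

def triage(log_line: str) -> str:
    for pattern, issue in TRIAGE_RULES:
        if re.search(pattern, log_line):
            return issue
    return "Unmatched - check pod events and the decision tree"

print(triage("django.core.exceptions.DisallowedHost: Invalid HTTP_HOST header"))
# Issue 6: ALLOWED_HOSTS
```

Such a helper could be run against `kubectl logs` output during an incident to jump straight to the relevant section.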
Environment Variables Checklist
Required in backend-secrets:
- ✅ django-secret-key - Random 64-byte string
- ✅ db-name - Database name (coditect_db)
- ✅ db-user - Database user (coditect_app)
- ✅ db-password - Database password (32+ chars)
- ✅ db-host - Cloud SQL private IP
- ✅ redis-host - Redis private IP
Required in backend-config:
- ✅ DJANGO_ALLOWED_HOSTS - Comma-separated hosts
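One way to make this checklist actionable is a fail-fast guard at application startup that reports every missing required variable at once. This is a hypothetical helper; the environment variable names follow the deployment manifest in the appendix:

```python
import os

# Environment variables the backend expects, per the deployment manifest.
REQUIRED_ENV = [
    "DJANGO_SECRET_KEY", "DB_NAME", "DB_USER",
    "DB_PASSWORD", "DB_HOST", "REDIS_HOST", "DJANGO_ALLOWED_HOSTS",
]

def missing_env(environ=os.environ) -> list[str]:
    # Treat unset and empty values the same: both break startup later anyway.
    return [name for name in REQUIRED_ENV if not environ.get(name)]

# Example with a deliberately incomplete environment:
fake_env = {"DJANGO_SECRET_KEY": "x", "DB_NAME": "coditect_db"}
print(missing_env(fake_env))
# ['DB_USER', 'DB_PASSWORD', 'DB_HOST', 'REDIS_HOST', 'DJANGO_ALLOWED_HOSTS']
```

Calling a guard like this early in the settings module turns a CrashLoopBackOff mystery into a single explicit error message.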
Pre-Deployment Checklist
Before deploying to staging:
- Artifact Registry repository created
- GKE service account has the artifactregistry.reader role
- Image built with --platform linux/amd64
- Image pushed to Artifact Registry (not GCR)
- Cloud SQL instance running
- Cloud SQL user created (coditect_app)
- Cloud SQL database created (coditect_db)
- Kubernetes secrets created
- ConfigMap updated with correct ALLOWED_HOSTS
- Deployment manifest specifies scheme: HTTP in probes
- SSL requirement disabled (staging) OR Cloud SQL Proxy configured (production)
Related Documentation
Internal Documentation
- README.md - Deployment overview
- CLAUDE.md - AI agent configuration
- ADR-0005: Kubernetes Deployment Strategy
- ADR-0007: Docker Container Strategy
- ADR-0008: Security Best Practices
- ADR-0009: Database Architecture
External References
- Google Artifact Registry Migration
- Docker Buildx Multi-Platform
- Cloud SQL Proxy
- Django ALLOWED_HOSTS
- Kubernetes Health Probes
Appendix: Full Working Configuration
Dockerfile (Final)
File: Dockerfile
# CODITECT Cloud Backend - Production Dockerfile
# Django 5.2.8 Backend with Multi-Stage Build
# Python 3.12.12 for protobuf compatibility
# Stage 1: Builder - Install dependencies and collect static files
FROM python:3.12.12-slim-bookworm AS builder
LABEL maintainer="AZ1.AI INC <engineering@az1.ai>"
LABEL description="CODITECT Cloud Backend - Django 5.2.8 API Server"
WORKDIR /app
# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
g++ \
make \
postgresql-client \
libpq-dev \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
# Copy application code
COPY . .
# Collect static files (production settings expect environment variables)
ENV DJANGO_SETTINGS_MODULE=license_platform.settings.production
ENV DJANGO_SECRET_KEY=temp-build-key
ENV DB_NAME=build
ENV DB_USER=build
ENV DB_PASSWORD=build
ENV DB_HOST=localhost
RUN python manage.py collectstatic --noinput --clear
# Stage 2: Runtime - Minimal production image
FROM python:3.12.12-slim-bookworm
WORKDIR /app
# Install runtime dependencies only
RUN apt-get update && apt-get install -y --no-install-recommends \
postgresql-client \
libpq-dev \
curl \
&& rm -rf /var/lib/apt/lists/*
# Create non-root user first
RUN useradd -m -u 1000 django
# Copy Python packages from builder to user directory
COPY --from=builder /root/.local /home/django/.local
# Copy application code from builder
COPY --from=builder /app /app
# Set ownership
RUN chown -R django:django /app /home/django/.local
# Make sure scripts in .local are usable
ENV PATH=/home/django/.local/bin:$PATH
USER django
# Expose port
EXPOSE 8000
# Health check endpoint
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
CMD curl -f http://localhost:8000/api/v1/health/ || exit 1
# Run with gunicorn for production
CMD ["gunicorn", \
"--bind", "0.0.0.0:8000", \
"--workers", "4", \
"--worker-class", "sync", \
"--timeout", "60", \
"--access-logfile", "-", \
"--error-logfile", "-", \
"--log-level", "info", \
"license_platform.wsgi:application"]
Deployment Manifest (Final)
File: deployment/kubernetes/staging/backend-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: coditect-backend
namespace: coditect-staging
labels:
app: coditect-backend
environment: staging
version: v1.0.0
spec:
replicas: 2
revisionHistoryLimit: 3
selector:
matchLabels:
app: coditect-backend
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
template:
metadata:
labels:
app: coditect-backend
environment: staging
version: v1.0.0
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8000"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: coditect-cloud-backend
imagePullSecrets:
- name: gcr-json-key
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
containers:
- name: backend
image: us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging
imagePullPolicy: Always
ports:
- name: http
containerPort: 8000
protocol: TCP
env:
- name: DJANGO_SETTINGS_MODULE
value: "license_platform.settings.production"
- name: DJANGO_SECRET_KEY
valueFrom:
secretKeyRef:
name: backend-secrets
key: django-secret-key
- name: DB_NAME
valueFrom:
secretKeyRef:
name: backend-secrets
key: db-name
- name: DB_USER
valueFrom:
secretKeyRef:
name: backend-secrets
key: db-user
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: backend-secrets
key: db-password
- name: DB_HOST
valueFrom:
secretKeyRef:
name: backend-secrets
key: db-host
- name: DB_PORT
value: "5432"
- name: REDIS_HOST
valueFrom:
secretKeyRef:
name: backend-secrets
key: redis-host
- name: REDIS_PORT
value: "6379"
- name: CLOUD_KMS_PROJECT_ID
value: "coditect-prod-563272"
- name: CLOUD_KMS_LOCATION
value: "us-central1"
- name: CLOUD_KMS_KEYRING
value: "license-signing-keyring"
- name: CLOUD_KMS_KEY
value: "license-signing-key"
- name: GCP_PROJECT_ID
value: "coditect-prod-563272"
- name: ENVIRONMENT
value: "staging"
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /api/v1/health/live
port: http
scheme: HTTP # Critical: Explicit HTTP scheme
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /api/v1/health/ready
port: http
scheme: HTTP # Critical: Explicit HTTP scheme
initialDelaySeconds: 20
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: false
capabilities:
drop:
- ALL
terminationGracePeriodSeconds: 30
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- coditect-backend
topologyKey: kubernetes.io/hostname
ConfigMap (Final)
File: deployment/kubernetes/staging/backend-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: backend-config
namespace: coditect-staging
labels:
app: coditect-backend
environment: staging
data:
# Staging: Wildcard for convenience (NOT for production!)
DJANGO_ALLOWED_HOSTS: "*"
# Alternative (production-ready):
# DJANGO_ALLOWED_HOSTS: "10.56.0.0/16,coditect-backend.coditect-staging.svc.cluster.local,*.coditect.com,localhost"
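For reference, a settings module typically turns the comma-separated value into Django's ALLOWED_HOSTS list along these lines (a sketch; the project's actual settings file may differ). Note that Django matches exact hostnames and *.domain wildcards, not CIDR ranges, so an entry like 10.56.0.0/16 would never match a request:

```python
# Sketch of parsing DJANGO_ALLOWED_HOSTS from the environment into the
# list Django expects; tolerates whitespace and trailing commas.
def parse_allowed_hosts(raw: str) -> list[str]:
    return [host.strip() for host in raw.split(",") if host.strip()]

print(parse_allowed_hosts("*"))
# ['*']
print(parse_allowed_hosts("coditect-backend.coditect-staging.svc.cluster.local, *.coditect.com,"))
# ['coditect-backend.coditect-staging.svc.cluster.local', '*.coditect.com']
```

With this parsing in place, the staging wildcard and a production host list can share one ConfigMap key.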
Build and Deploy Script
File: scripts/deploy-staging.sh
#!/bin/bash
set -euo pipefail
PROJECT_ID="coditect-cloud-infra"
REGION="us-central1"
IMAGE_REPO="coditect-backend"
IMAGE_NAME="coditect-cloud-backend"
VERSION="${1:-v1.0.0-staging}"
echo "Building image for linux/amd64..."
docker buildx build \
--platform linux/amd64 \
--tag ${REGION}-docker.pkg.dev/${PROJECT_ID}/${IMAGE_REPO}/${IMAGE_NAME}:${VERSION} \
--push \
.
echo "Applying Kubernetes manifests..."
kubectl apply -f deployment/kubernetes/staging/namespace.yaml
kubectl apply -f deployment/kubernetes/staging/backend-config.yaml
kubectl apply -f deployment/kubernetes/staging/backend-deployment.yaml
kubectl apply -f deployment/kubernetes/staging/backend-service.yaml
echo "Waiting for rollout..."
kubectl rollout status deployment/coditect-backend -n coditect-staging
echo "Deployment complete!"
kubectl get pods -n coditect-staging -l app=coditect-backend
Document Status: Complete Last Validated: December 1, 2025 Next Review: January 1, 2026 Owner: AZ1.AI INC Engineering Team