
CODITECT Cloud Backend - Staging Deployment Troubleshooting Guide

Document Status: Production
Last Updated: December 1, 2025
Environment: Google Kubernetes Engine (GKE) Staging
Purpose: Comprehensive troubleshooting guide for common deployment issues


Table of Contents

  1. Issue 1: GCR Deprecation (403 Forbidden)
  2. Issue 2: Multi-Platform Docker Build
  3. Issue 3: Dockerfile User Permissions
  4. Issue 4: Cloud SQL SSL Certificate Requirement
  5. Issue 5: Database User Authentication
  6. Issue 6: Django ALLOWED_HOSTS Rejection
  7. Issue 7: Health Probe HTTPS/HTTP Mismatch
  8. Quick Reference
  9. Related Documentation

Overview

This guide documents 7 critical issues encountered and resolved during the initial staging deployment of CODITECT Cloud Backend to Google Kubernetes Engine. Each issue includes:

  • Error symptoms - What you'll see when this happens
  • Root cause analysis - Why it happened
  • Complete solution - How to fix it permanently
  • Verification steps - How to confirm it's resolved
  • Prevention guidance - How to avoid in future deployments

Deployment Context:

  • Platform: Google Kubernetes Engine (GKE) Standard
  • Region: us-central1
  • Cluster: coditect-staging-cluster
  • Namespace: coditect-staging
  • Image Registry: Artifact Registry (us-central1-docker.pkg.dev)
  • Database: Cloud SQL PostgreSQL 15
  • Framework: Django 5.2.8 with Python 3.12.12

Issue 1: GCR Deprecation (403 Forbidden)

Error Symptoms

Failed to pull image "gcr.io/coditect-cloud-infra/coditect-cloud-backend:v1.0.0-staging":
rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/...":
failed to resolve reference "gcr.io/...":
pull access denied, repository does not exist or may require authorization:
server message: insufficient_scope: authorization failed

Pod Status: ImagePullBackOff or ErrImagePull

Timeline: After March 18, 2025 (GCR shutdown date)

Root Cause Analysis

Primary Cause: Google Container Registry (gcr.io) was deprecated and shut down on March 18, 2025.

Background:

  • Google announced GCR deprecation in 2023
  • All GCR URLs (gcr.io, us.gcr.io, etc.) now return 403 Forbidden
  • Google requires migration to Artifact Registry (pkg.dev)
  • Existing images in GCR were migrated automatically, but new pushes fail

Why This Happened:

  • Deployment manifests used legacy gcr.io/PROJECT_ID/... URLs
  • GKE service account had storage.objectViewer role (for GCR)
  • Missing Artifact Registry API enablement and permissions

Complete Solution

Step 1: Enable Artifact Registry API

gcloud services enable artifactregistry.googleapis.com \
--project=coditect-cloud-infra

Step 2: Create Artifact Registry Repository

gcloud artifacts repositories create coditect-backend \
--repository-format=docker \
--location=us-central1 \
--description="CODITECT Cloud Backend Docker Images" \
--project=coditect-cloud-infra

Verify:

gcloud artifacts repositories list \
--location=us-central1 \
--project=coditect-cloud-infra

Expected output:

REPOSITORY        FORMAT  MODE                 DESCRIPTION
coditect-backend  DOCKER  STANDARD_REPOSITORY  CODITECT Cloud Backend Docker Images

Step 3: Grant GKE Service Account Pull Access

# Get GKE service account
GKE_SA=$(gcloud container clusters describe coditect-staging-cluster \
--region=us-central1 \
--format="value(nodeConfig.serviceAccount)")

# Grant Artifact Registry Reader role
gcloud artifacts repositories add-iam-policy-binding coditect-backend \
--location=us-central1 \
--member="serviceAccount:${GKE_SA}" \
--role="roles/artifactregistry.reader" \
--project=coditect-cloud-infra

Step 4: Update Deployment Manifests

File: deployment/kubernetes/staging/backend-deployment.yaml

Change:

# OLD (GCR - deprecated)
image: gcr.io/coditect-cloud-infra/coditect-cloud-backend:v1.0.0-staging

# NEW (Artifact Registry)
image: us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging

Step 5: Build and Push to Artifact Registry

# Configure Docker authentication
gcloud auth configure-docker us-central1-docker.pkg.dev

# Build for correct platform (see Issue 2)
docker buildx build \
--platform linux/amd64 \
--tag us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
--push \
.

Step 6: Apply Updated Deployment

kubectl apply -f deployment/kubernetes/staging/backend-deployment.yaml

Verification Steps

1. Check image pull status:

kubectl describe pod -n coditect-staging -l app=coditect-backend | grep -A 5 "Events:"

Success indicators:

  • No ImagePullBackOff events
  • Successfully pulled image message
  • Pod status: Running

2. Verify image in Artifact Registry:

gcloud artifacts docker images list \
us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend

3. Test image pull manually:

docker pull us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging

Prevention Guidance

For Future Deployments:

  1. Always use Artifact Registry for new projects

    • Never use gcr.io URLs
    • Standard format: REGION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG
  2. Update CI/CD pipelines

    • GitHub Actions: Use google-github-actions/auth@v2 with Artifact Registry
    • Replace all gcr.io references in workflow files
  3. Document registry locations

    • Add Artifact Registry URLs to deployment documentation
    • Include in .env.example files
  4. IAM least privilege

    • Grant only roles/artifactregistry.reader for pulling
    • Grant roles/artifactregistry.writer only to CI/CD service accounts
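As a lightweight guardrail for the points above, a small shell helper (a sketch; the function names are illustrative, the project/repository names are the ones used in this guide) can assemble the Artifact Registry reference and reject legacy gcr.io URLs before they reach a manifest:

```shell
# Assemble REGION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG
ar_image() {
  region="$1"; project="$2"; repo="$3"; image="$4"; tag="$5"
  echo "${region}-docker.pkg.dev/${project}/${repo}/${image}:${tag}"
}

# Fail on legacy GCR references (gcr.io, us.gcr.io, ...)
assert_not_gcr() {
  case "$1" in
    gcr.io/* | *.gcr.io/*)
      echo "ERROR: legacy GCR reference: $1" >&2
      return 1 ;;
  esac
}

IMG="$(ar_image us-central1 coditect-cloud-infra coditect-backend coditect-cloud-backend v1.0.0-staging)"
assert_not_gcr "$IMG" && echo "$IMG"
```

A check like assert_not_gcr can run in CI before any docker push, failing the pipeline if a gcr.io reference slips back in.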

Related ADRs:

  • ADR-0001: Google Cloud Platform as Primary Cloud Provider
  • ADR-0007: Docker Container Strategy

Issue 2: Multi-Platform Docker Build

Error Symptoms

Error response from daemon:
no match for platform in manifest:
not found

Or:

exec /usr/local/bin/python: exec format error

Pod Status: CrashLoopBackOff with exit code 1

Timeline: After successfully pulling image from Artifact Registry

Root Cause Analysis

Primary Cause: Docker image built on macOS (arm64/Apple Silicon) incompatible with GKE nodes (linux/amd64).

Technical Details:

  • macOS with Apple Silicon uses ARM64 architecture
  • GKE Standard nodes use x86-64 (AMD64) architecture
  • Default docker build creates single-platform image for host architecture
  • Kubernetes pulls image but cannot execute ARM64 binaries on AMD64 nodes

Why This Happened:

  • Built image locally on MacBook (ARM64)
  • Pushed to Artifact Registry without platform specification
  • GKE attempted to run ARM64 image on AMD64 nodes
  • Binary format mismatch caused exec errors
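The mismatch can be caught before pushing by mapping the build host's architecture to a Docker platform string. A pre-build check sketch (the helper name is illustrative):

```shell
# Map `uname -m` output to a Docker --platform string
map_platform() {
  case "$1" in
    x86_64)          echo "linux/amd64" ;;
    aarch64 | arm64) echo "linux/arm64" ;;
    *)               echo "unknown" ;;
  esac
}

TARGET="linux/amd64"  # GKE Standard nodes
HOST="$(map_platform "$(uname -m)")"
if [ "$HOST" != "$TARGET" ]; then
  echo "WARNING: build host is $HOST; pass --platform $TARGET to docker buildx build"
fi
```

On an Apple Silicon machine this prints the warning; on an amd64 CI runner it stays silent.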

Complete Solution

Step 1: Install Docker Buildx (if not already installed)

# Verify buildx availability (ships with recent Docker releases)
docker buildx version

# Optional: make plain `docker build` use buildx by default
docker buildx install

Step 2: Create Multi-Platform Builder

# Create builder instance
docker buildx create --name multiplatform --use

# Verify builder
docker buildx inspect multiplatform --bootstrap

Step 3: Build for Correct Platform

# Build for linux/amd64 (GKE platform)
docker buildx build \
--platform linux/amd64 \
--tag us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
--push \
.

Important Flags:

  • --platform linux/amd64 - Build for x86-64 (GKE nodes)
  • --push - Push directly to registry (required for multi-platform)
  • . - Dockerfile location

Step 4: Verify Image Manifest

# Inspect image architecture
docker buildx imagetools inspect \
us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging

Expected Output:

Name:      us-central1-docker.pkg.dev/.../coditect-cloud-backend:v1.0.0-staging
MediaType: application/vnd.docker.distribution.manifest.v2+json
Digest:    sha256:...

Manifests:
  Name:      us-central1-docker.pkg.dev/.../coditect-cloud-backend:v1.0.0-staging@sha256:...
  MediaType: application/vnd.docker.distribution.manifest.v2+json
  Platform:  linux/amd64   <-- VERIFY THIS

Verification Steps

1. Check pod startup:

kubectl logs -n coditect-staging -l app=coditect-backend --tail=50

Success indicators:

  • Django/Gunicorn startup messages
  • No "exec format error"
  • Application listening on port 8000

2. Verify architecture in running container:

kubectl exec -n coditect-staging deployment/coditect-backend -- uname -m

Expected: x86_64 (not aarch64)

3. Test image locally (optional):

# Pull and run locally with platform specification
docker run --platform linux/amd64 \
us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
python --version

Prevention Guidance

For Future Deployments:

  1. Always specify platform in builds

    • Add --platform linux/amd64 to all production builds
    • Document in deployment runbooks
  2. CI/CD standardization

    • GitHub Actions runners are linux/amd64 by default (correct)
    • If building locally, create shell alias:
      alias docker-build-gke='docker buildx build --platform linux/amd64'
  3. Multi-platform support (optional)

    • For broader compatibility, build multi-platform:
      docker buildx build \
      --platform linux/amd64,linux/arm64 \
      --tag IMAGE \
      --push .
    • Supports both AMD64 (GKE) and ARM64 (future ARM nodes)
  4. Makefile integration

    • Create Makefile with standardized build commands:
    .PHONY: build-staging
    build-staging:
    	docker buildx build \
    		--platform linux/amd64 \
    		--tag us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
    		--push \
    		.

Related ADRs:

  • ADR-0007: Docker Container Strategy

Issue 3: Dockerfile User Permissions

Error Symptoms

Traceback (most recent call last):
File "/app/manage.py", line 11, in <module>
from django.core.management import execute_from_command_line
ModuleNotFoundError: No module named 'django'

Pod Logs:

/usr/local/bin/python: can't open file '/app/manage.py': [Errno 13] Permission denied

Or:

[Errno 13] Permission denied: '/app/staticfiles'

Pod Status: CrashLoopBackOff with exit code 1

Root Cause Analysis

Primary Cause: Python packages installed to /root/.local in builder stage, but runtime container runs as non-root user django (UID 1000) without access to /root.

Technical Details:

  • Multi-stage Dockerfile uses FROM python:3.12.12-slim-bookworm as builder
  • Builder stage runs as root, installs packages to /root/.local
  • Runtime stage creates useradd -m -u 1000 django
  • COPY --from=builder /root/.local copied to runtime, but still owned by root
  • USER django directive switches to non-root user
  • User django cannot read /root/.local (permission denied)

Why This Happened:

  • Security best practice: Containers should run as non-root
  • GKE enforces runAsNonRoot: true in PodSecurityPolicy
  • Incorrect assumption that copying files changes ownership

Complete Solution

File: Dockerfile

Original (Broken):

FROM python:3.12.12-slim-bookworm as builder
# ... build steps ...
RUN pip install --no-cache-dir --user -r requirements.txt # Installs to /root/.local

FROM python:3.12.12-slim-bookworm
RUN useradd -m -u 1000 django
COPY --from=builder /root/.local /root/.local # Root-owned files
USER django # Cannot access /root

Fixed:

FROM python:3.12.12-slim-bookworm as builder
WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc g++ make postgresql-client libpq-dev \
&& rm -rf /var/lib/apt/lists/*

# Install Python packages to /root/.local
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Copy application code
COPY . .

# Collect static files
ENV DJANGO_SETTINGS_MODULE=license_platform.settings.production
ENV DJANGO_SECRET_KEY=temp-build-key
ENV DB_NAME=build DB_USER=build DB_PASSWORD=build DB_HOST=localhost
RUN python manage.py collectstatic --noinput --clear

# Stage 2: Runtime
FROM python:3.12.12-slim-bookworm
WORKDIR /app

# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
postgresql-client libpq-dev curl \
&& rm -rf /var/lib/apt/lists/*

# Create non-root user FIRST
RUN useradd -m -u 1000 django

# Copy Python packages to USER HOME directory (not /root)
COPY --from=builder /root/.local /home/django/.local

# Copy application code
COPY --from=builder /app /app

# Change ownership to django user
RUN chown -R django:django /app /home/django/.local

# Add .local/bin to PATH so django user can find installed packages
ENV PATH=/home/django/.local/bin:$PATH

# Switch to non-root user
USER django

EXPOSE 8000
CMD ["gunicorn", ...]

Key Changes:

  1. Create user first: RUN useradd -m -u 1000 django before copying files
  2. Copy to user directory: COPY --from=builder /root/.local /home/django/.local
  3. Fix ownership: RUN chown -R django:django /app /home/django/.local
  4. Update PATH: ENV PATH=/home/django/.local/bin:$PATH
  5. Switch user last: USER django after ownership changes
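Change 4 (PATH) is easy to overlook: pip install --user places console scripts such as gunicorn under ~/.local/bin, which is invisible to the shell until that directory is on PATH. A minimal simulation of that lookup with Python's shutil.which (the directory layout mimics /home/django/.local/bin):

```python
import os
import shutil
import tempfile

# Simulate a user-level install dir like /home/django/.local/bin
with tempfile.TemporaryDirectory() as home:
    bindir = os.path.join(home, ".local", "bin")
    os.makedirs(bindir)
    launcher = os.path.join(bindir, "gunicorn")
    with open(launcher, "w") as f:
        f.write("#!/bin/sh\necho gunicorn\n")
    os.chmod(launcher, 0o755)

    # Without the directory on PATH: not found ("command not found" in a container)
    assert shutil.which("gunicorn", path="/nonexistent") is None

    # With ENV PATH=/home/django/.local/bin:$PATH: found
    found = shutil.which("gunicorn", path=bindir)
    assert found == launcher

print("PATH lookup demo ok")
```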

Verification Steps

1. Rebuild and push image:

docker buildx build \
--platform linux/amd64 \
--tag us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
--push \
.

2. Check pod startup:

kubectl logs -n coditect-staging -l app=coditect-backend --tail=50

Success indicators:

[INFO] Booting worker with pid: 7
[INFO] Listening at: http://0.0.0.0:8000

3. Verify user in running container:

kubectl exec -n coditect-staging deployment/coditect-backend -- whoami
# Expected: django

kubectl exec -n coditect-staging deployment/coditect-backend -- id
# Expected: uid=1000(django) gid=1000(django) groups=1000(django)

4. Verify package access:

kubectl exec -n coditect-staging deployment/coditect-backend -- python -c "import django; print(django.__version__)"
# Expected: 5.2.8

Prevention Guidance

For Future Dockerfiles:

  1. Always create user before copying files

    RUN useradd -m -u 1000 appuser
    COPY --from=builder /root/.local /home/appuser/.local
    RUN chown -R appuser:appuser /app /home/appuser/.local
    USER appuser
  2. Use explicit ownership in COPY

    COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
  3. Verify permissions in build

    RUN ls -la /home/appuser/.local/lib/python*/site-packages/ | head -10
  4. Test as non-root locally

    docker run --rm -it --user 1000:1000 IMAGE /bin/bash
    python -c "import django"

Related ADRs:

  • ADR-0007: Docker Container Strategy
  • ADR-0008: Security Best Practices

Issue 4: Cloud SQL SSL Certificate Requirement

Error Symptoms

django.db.utils.OperationalError:
connection to server at "10.41.64.3", port 5432 failed:
FATAL: connection requires a valid client certificate

Pod Logs:

psycopg2.OperationalError:
FATAL: pg_hba.conf rejects connection for host "10.56.1.10",
user "coditect_app", database "coditect_db", no encryption

Pod Status: CrashLoopBackOff during database connection

Root Cause Analysis

Primary Cause: Cloud SQL instance configured with requireSsl: true, but application not providing client SSL certificates.

Technical Details:

  • Cloud SQL instance created with default security settings
  • settings.requireSsl: true enforces TLS for all connections
  • Django DATABASES configuration missing SSL parameters
  • PostgreSQL server rejects non-SSL connections per pg_hba.conf rules

Why This Happened:

  • Security best practice: Cloud SQL requires SSL by default
  • Application not configured for SSL connections
  • Missing Cloud SQL proxy configuration OR client certificates

Security Tradeoffs:

  • Production: MUST use SSL with client certificates (high security)
  • Staging: MAY disable SSL if on private VPC (convenience vs. security)

Complete Solution

Option A: Disable SSL Requirement (Staging Only)

WARNING: Only for staging environments on private networks. NEVER for production.

gcloud sql instances patch coditect-db \
--no-require-ssl \
--project=coditect-cloud-infra

Verification:

gcloud sql instances describe coditect-db \
--project=coditect-cloud-infra \
--format="get(settings.ipConfiguration.requireSsl)"

Expected: False

Pros:

  • Simple configuration
  • No certificate management
  • Faster debugging cycles

Cons:

  • Unencrypted database connections (private VPC only)
  • Not production-ready
  • Security compliance risk

Option B: Use Cloud SQL Proxy (Production Recommended)

Step 1: Add Cloud SQL Proxy Sidecar

File: deployment/kubernetes/staging/backend-deployment.yaml

spec:
  template:
    spec:
      containers:
      # Main application container
      - name: backend
        image: us-central1-docker.pkg.dev/.../coditect-cloud-backend:v1.0.0-staging
        env:
        - name: DB_HOST
          value: "127.0.0.1"  # Connect to proxy sidecar
        - name: DB_PORT
          value: "5432"

      # Cloud SQL Proxy sidecar
      - name: cloud-sql-proxy
        image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0
        args:
        - "--private-ip"
        - "coditect-cloud-infra:us-central1:coditect-db"
        - "--port=5432"
        securityContext:
          runAsNonRoot: true
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
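With the sidecar in place, Django connects to the proxy on localhost and needs no SSL options of its own (the proxy handles TLS to Cloud SQL). A sketch of the corresponding DATABASES wiring, using the environment variable names from the deployment above (the helper function is illustrative):

```python
import os

def database_config(env=None):
    """Build a Django DATABASES dict pointing at the Cloud SQL proxy sidecar."""
    env = os.environ if env is None else env
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": env.get("DB_NAME", "coditect_db"),
            "USER": env.get("DB_USER", "coditect_app"),
            "PASSWORD": env.get("DB_PASSWORD", ""),
            "HOST": env.get("DB_HOST", "127.0.0.1"),  # proxy sidecar, not the DB IP
            "PORT": env.get("DB_PORT", "5432"),
        }
    }

cfg = database_config({"DB_PASSWORD": "secret"})
assert cfg["default"]["HOST"] == "127.0.0.1"
```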

Step 2: Grant Cloud SQL Client Role

gcloud projects add-iam-policy-binding coditect-cloud-infra \
--member="serviceAccount:coditect-cloud-backend@coditect-cloud-infra.iam.gserviceaccount.com" \
--role="roles/cloudsql.client"

Step 3: Apply Updated Deployment

kubectl apply -f deployment/kubernetes/staging/backend-deployment.yaml

Pros:

  • Automatic SSL with Google-managed certificates
  • No certificate file management
  • Automatic IAM authentication support
  • Production-ready security

Cons:

  • Additional sidecar container overhead
  • More complex configuration

Option C: Client SSL Certificates (Manual)

Not recommended - Use Cloud SQL Proxy instead.

Verification Steps

1. Test database connection from pod:

kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py dbshell -- -c "SELECT version();"

Success indicator: PostgreSQL version output

2. Check application startup:

kubectl logs -n coditect-staging -l app=coditect-backend --tail=50 | grep -i database

Success indicators:

  • No SSL errors
  • "Applying migration..." messages

3. Verify SSL status (if using proxy):

kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py dbshell -- -c "SELECT ssl_is_used();"

Expected (with proxy): t (true)
Expected (without SSL): f (false)

Note: ssl_is_used() comes from PostgreSQL's sslinfo extension (CREATE EXTENSION IF NOT EXISTS sslinfo;).

Prevention Guidance

For Future Deployments:

  1. Production: Always use Cloud SQL Proxy

    • Deploy proxy as sidecar container
    • Enable IAM authentication where possible
    • Never disable SSL requirement
  2. Staging: Document security tradeoffs

    • If disabling SSL, document in deployment README
    • Add comment in Terraform/IaC files
    • Set reminder to re-enable for production
  3. Infrastructure as Code

    • Terraform: Set require_ssl = true for production
    • Document SSL configuration in terraform/variables.tf
  4. Connection testing

    • Add health check that verifies database SSL
    • Monitor SSL connection metrics in production

Related ADRs:

  • ADR-0009: Database Architecture and Management
  • ADR-0008: Security Best Practices

Issue 5: Database User Authentication

Error Symptoms

django.db.utils.OperationalError:
connection to server at "10.41.64.3", port 5432 failed:
FATAL: password authentication failed for user "coditect_app"

Pod Logs:

psycopg2.OperationalError:
FATAL: role "coditect_app" does not exist

Pod Status: CrashLoopBackOff during database connection initialization

Root Cause Analysis

Primary Cause: Database user coditect_app either doesn't exist in Cloud SQL instance or has incorrect password.

Technical Details:

  • Cloud SQL instance created, but database users not provisioned
  • Django DATABASES configuration references DB_USER=coditect_app
  • Kubernetes secret backend-secrets may have wrong password
  • PostgreSQL rejects authentication for non-existent or mismatched credentials

Why This Happened:

  • Database infrastructure (instance) created separately from application resources
  • User creation not automated in Terraform/IaC
  • Manual user creation step missed
  • Password mismatch between gcloud command and Kubernetes secret

Complete Solution

Step 1: Verify Database Instance is Running

gcloud sql instances describe coditect-db \
--project=coditect-cloud-infra \
--format="get(state)"

Expected: RUNNABLE

Step 2: Create Database User

# Generate secure password (save this!)
DB_PASSWORD=$(openssl rand -base64 32)
echo "Database Password: $DB_PASSWORD"

# Create user in Cloud SQL
gcloud sql users create coditect_app \
--instance=coditect-db \
--password="$DB_PASSWORD" \
--project=coditect-cloud-infra
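If openssl is not available, Python's secrets module generates an equivalent credential. token_urlsafe avoids the + and / characters that `openssl rand -base64` can emit and that often need quoting in connection strings:

```python
import secrets

# Roughly equivalent to `openssl rand -base64 32`, but URL-safe
password = secrets.token_urlsafe(32)

assert len(password) >= 32
assert "+" not in password and "/" not in password
print(password)
```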

Verify:

gcloud sql users list \
--instance=coditect-db \
--project=coditect-cloud-infra

Expected Output:

NAME          HOST
coditect_app  %
postgres      %

Step 3: Create Database (if not exists)

gcloud sql databases create coditect_db \
--instance=coditect-db \
--project=coditect-cloud-infra

Verify:

gcloud sql databases list \
--instance=coditect-db \
--project=coditect-cloud-infra

Step 4: Create Kubernetes Secret

# Delete old secret if exists
kubectl delete secret backend-secrets -n coditect-staging --ignore-not-found

# Create new secret with correct password
kubectl create secret generic backend-secrets \
-n coditect-staging \
--from-literal=django-secret-key="$(openssl rand -base64 64)" \
--from-literal=db-name="coditect_db" \
--from-literal=db-user="coditect_app" \
--from-literal=db-password="$DB_PASSWORD" \
--from-literal=db-host="10.41.64.3" \
--from-literal=redis-host="10.41.65.4"

Verify:

kubectl describe secret backend-secrets -n coditect-staging

Step 5: Restart Deployment

kubectl rollout restart deployment/coditect-backend -n coditect-staging

Step 6: Grant Database Permissions

# Connect to database
gcloud sql connect coditect-db \
--user=postgres \
--quiet \
--project=coditect-cloud-infra

# In psql prompt:
GRANT ALL PRIVILEGES ON DATABASE coditect_db TO coditect_app;
GRANT ALL PRIVILEGES ON SCHEMA public TO coditect_app;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON TABLES TO coditect_app;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON SEQUENCES TO coditect_app;
\q

Verification Steps

1. Test authentication from pod:

kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py check --database default

Success indicator: System check identified no issues (0 silenced).

2. Run database migrations:

kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py migrate --noinput

Success indicators:

  • "Applying migrations..." messages
  • No authentication errors

3. Test database query:

kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py dbshell -- -c "SELECT current_user, current_database();"

Expected Output:

 current_user | current_database
--------------+------------------
coditect_app | coditect_db

4. Check application logs:

kubectl logs -n coditect-staging -l app=coditect-backend --tail=50 | grep -E "(database|migration)"

Success indicators:

  • No authentication errors
  • Migration success messages

Prevention Guidance

For Future Deployments:

  1. Automate user creation in Terraform

    resource "google_sql_user" "app_user" {
      name     = "coditect_app"
      instance = google_sql_database_instance.main.name
      password = random_password.db_password.result
    }

    resource "random_password" "db_password" {
      length  = 32
      special = true
    }

    resource "google_secret_manager_secret_version" "db_password" {
      secret      = google_secret_manager_secret.db_password.id
      secret_data = random_password.db_password.result
    }
  2. Use GCP Secret Manager (not Kubernetes secrets)

    • Store passwords in Secret Manager
    • Mount as environment variables in pods
    • Automatic rotation support
  3. Database initialization job

    • Create Kubernetes Job to run migrations
    • Verify database connectivity before deployment
    • Example: deployment/kubernetes/staging/migrate-job.yaml
  4. Document database credentials

    • Store password in password manager (1Password, etc.)
    • Document user creation in runbook
    • Add to disaster recovery procedures

Related ADRs:

  • ADR-0009: Database Architecture and Management
  • ADR-0008: Security Best Practices

Issue 6: Django ALLOWED_HOSTS Rejection

Error Symptoms

Invalid HTTP_HOST header: '10.56.2.20:8000'.
You may need to add '10.56.2.20' to ALLOWED_HOSTS.

Pod Logs:

DisallowedHost at /api/v1/health/live
Invalid HTTP_HOST header: '10.56.1.10:8000'.
The domain name provided is not valid according to RFC 1034/1035.

HTTP Response: 400 Bad Request

Health Probes: Failing with HTTP 400

Root Cause Analysis

Primary Cause: Django ALLOWED_HOSTS setting doesn't include Kubernetes pod IP addresses, and Django doesn't support CIDR notation natively.

Technical Details:

  • Kubernetes assigns dynamic pod IPs from cluster CIDR (e.g., 10.56.0.0/16)
  • Django ALLOWED_HOSTS requires explicit hostname/IP list
  • Health probes send requests with Host: POD_IP:8000 header
  • Django rejects requests with Host header not in ALLOWED_HOSTS
  • CIDR notation (10.56.0.0/16) not supported by Django

Why This Happened:

  • Security feature: Django prevents HTTP Host header attacks
  • Production settings enforce strict ALLOWED_HOSTS validation
  • Kubernetes pod IPs change on every deployment/restart
  • Cannot predict exact pod IPs in advance

Security Tradeoffs:

  • Production: Strict ALLOWED_HOSTS with explicit domains (high security)
  • Staging: Relaxed ALLOWED_HOSTS for debugging (convenience vs. security)

Complete Solution

Option A: Wildcard for Staging (Quick Fix)

File: deployment/kubernetes/staging/backend-config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: backend-config
  namespace: coditect-staging
  labels:
    app: coditect-backend
    environment: staging
data:
  DJANGO_ALLOWED_HOSTS: "*"  # Allow all hosts (staging only!)

Pros:

  • Simple configuration
  • Works with any pod IP
  • No CIDR parsing needed

Cons:

  • Disables Host header validation
  • Vulnerable to Host header attacks
  • NOT for production

Option B: Include Cluster CIDR + Service DNS (Balanced)

File: deployment/kubernetes/staging/backend-config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: backend-config
  namespace: coditect-staging
data:
  # Comma-separated list (Django native format; a leading dot matches subdomains)
  DJANGO_ALLOWED_HOSTS: "10.56.0.0/16,coditect-backend.coditect-staging.svc.cluster.local,.coditect.com,localhost"

Note: Django doesn't natively support CIDR, so this requires custom middleware.

Step 1: Create Custom Middleware

File: src/license_platform/middleware/allowed_hosts.py

import ipaddress

from django.conf import settings
from django.core.exceptions import DisallowedHost
from django.http import HttpResponseBadRequest
from django.http.request import validate_host


class CIDRAwareAllowedHostsMiddleware:
    """
    Custom middleware to support CIDR notation in ALLOWED_HOSTS.

    Checks if the request Host header matches:
    1. Exact hostnames, '*', and '.example.com' subdomain patterns
       (via Django's validate_host)
    2. CIDR ranges such as 10.56.0.0/16 (custom logic)

    Note: Django's own request.get_host() also validates against
    ALLOWED_HOSTS, so ALLOWED_HOSTS must include '*' for this
    middleware to get a chance to check CIDR ranges; the real host
    list is then enforced here.
    """

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        try:
            host = request.get_host().split(':')[0]  # Remove port
        except DisallowedHost:
            return HttpResponseBadRequest("Invalid Host header")

        if not self.is_allowed_host(host):
            return HttpResponseBadRequest(f"Invalid Host: {host}")

        return self.get_response(request)

    def is_allowed_host(self, host):
        allowed_hosts = settings.ALLOWED_HOSTS

        cidr_entries = [h for h in allowed_hosts if '/' in h]
        plain_entries = [h for h in allowed_hosts if '/' not in h]

        # Exact, wildcard, and subdomain matches (Django semantics)
        if validate_host(host, plain_entries):
            return True

        # CIDR range matches
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            return False  # Not an IP address

        return any(
            ip in ipaddress.ip_network(entry, strict=False)
            for entry in cidr_entries
        )
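The CIDR check at the core of the middleware can be exercised on its own (the helper name is illustrative; the addresses come from the cluster CIDR above):

```python
import ipaddress

def host_in_cidrs(host, patterns):
    """True if host parses as an IP and falls inside any CIDR pattern."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # hostnames never match CIDR entries
    return any(
        ip in ipaddress.ip_network(p, strict=False)
        for p in patterns
        if "/" in p
    )

assert host_in_cidrs("10.56.1.10", ["10.56.0.0/16"])            # pod IP in cluster CIDR
assert not host_in_cidrs("192.168.1.1", ["10.56.0.0/16"])       # outside the range
assert not host_in_cidrs("api.coditect.com", ["10.56.0.0/16"])  # not an IP
```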

Step 2: Register Middleware

File: src/license_platform/settings/production.py

MIDDLEWARE = [
    'django.middleware.security.SecurityMiddleware',
    'license_platform.middleware.allowed_hosts.CIDRAwareAllowedHostsMiddleware',  # Add this
    # ... rest of middleware
]

Step 3: Update ConfigMap

data:
  DJANGO_ALLOWED_HOSTS: "10.56.0.0/16,coditect-backend.coditect-staging.svc.cluster.local,.coditect.com,localhost"

Pros:

  • Maintains security validation
  • Supports dynamic pod IPs
  • Production-ready with proper domains

Cons:

  • Custom middleware maintenance
  • Slight performance overhead

Option C: Use Service DNS Only (Recommended for Production)

File: deployment/kubernetes/staging/backend-deployment.yaml

livenessProbe:
  httpGet:
    path: /api/v1/health/live
    port: http
    httpHeaders:
    - name: Host
      value: coditect-backend.coditect-staging.svc.cluster.local

File: deployment/kubernetes/staging/backend-config.yaml

data:
  DJANGO_ALLOWED_HOSTS: "coditect-backend.coditect-staging.svc.cluster.local,api.coditect.com"

Pros:

  • Most secure (explicit hosts only)
  • No custom middleware
  • Best for production

Cons:

  • Requires updating probe configuration
  • Less flexible for debugging

Verification Steps

1. Check ConfigMap applied:

kubectl get configmap backend-config -n coditect-staging -o yaml

2. Restart deployment to pick up ConfigMap:

kubectl rollout restart deployment/coditect-backend -n coditect-staging

3. Test health endpoint:

kubectl exec -n coditect-staging deployment/coditect-backend -- \
curl -H "Host: 10.56.1.10:8000" http://localhost:8000/api/v1/health/live

Success indicator: HTTP 200 response

4. Check liveness probe status:

kubectl describe pod -n coditect-staging -l app=coditect-backend | grep -A 5 "Liveness:"

Success indicators:

  • Liveness: ... #success=... (no failures)
  • No "Liveness probe failed" events

5. Test from external request:

curl -H "Host: api.coditect.com" http://LOAD_BALANCER_IP/api/v1/health/live

Prevention Guidance

For Future Deployments:

  1. Production: Use explicit domains only

    ALLOWED_HOSTS = [
        'api.coditect.com',
        'api-staging.coditect.com',
        'coditect-backend.coditect-staging.svc.cluster.local',
    ]
  2. Staging: Use wildcard OR CIDR middleware

    • Wildcard for quick debugging
    • CIDR middleware for production-like testing
  3. Health probes: Use Service DNS

    livenessProbe:
      httpGet:
        httpHeaders:
        - name: Host
          value: service.namespace.svc.cluster.local
  4. Document tradeoffs in settings

    # settings/production.py
    # SECURITY WARNING: Wildcard ('*') disables Host header validation.
    # Only use in staging/dev, NEVER in production.
    ALLOWED_HOSTS = [h.strip() for h in os.environ.get('DJANGO_ALLOWED_HOSTS', '').split(',') if h.strip()]
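When reading DJANGO_ALLOWED_HOSTS from the environment, note that ''.split(',') yields [''], not []; filtering empty entries avoids an accidental empty host pattern. A standalone sketch of the parse (the helper name is illustrative):

```python
def parse_allowed_hosts(raw):
    """Split a comma-separated host list, dropping empty or blank entries."""
    return [h.strip() for h in raw.split(",") if h.strip()]

assert parse_allowed_hosts("") == []
assert parse_allowed_hosts("api.coditect.com, localhost") == ["api.coditect.com", "localhost"]
```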

Related ADRs:

  • ADR-0008: Security Best Practices

Issue 7: Health Probe HTTPS/HTTP Mismatch

Error Symptoms

Liveness probe failed:
Get "https://10.56.1.10:8000/api/v1/health/live":
context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Pod Logs:

No errors (application running correctly)

Pod Status: Not Ready despite application running

Health Probe Status: Failing with timeout

Root Cause Analysis

Primary Cause: Production Django settings redirect HTTP to HTTPS (SECURE_SSL_REDIRECT = True), but the application serves plain HTTP only. The kubelet's HTTP probe follows the redirect into a TLS handshake against a non-TLS port and times out.

Technical Details:

  • Django production settings: SECURE_SSL_REDIRECT = True
  • Application listens on HTTP port 8000 (no TLS termination)
  • Kubernetes httpGet probes default to scheme: HTTP, and the kubelet follows same-host redirects
  • Django answers the probe's HTTP request with a redirect to https://POD_IP:8000/...
  • The kubelet follows the redirect and attempts a TLS handshake against the plain-HTTP port
  • The handshake never completes, so the probe times out

Why This Happened:

  • Security best practice: Django enforces HTTPS in production
  • TLS termination expected at load balancer, not application
  • Health probes hit the pod directly, bypassing the load balancer's TLS termination
  • Probe configuration did not account for the application-level redirect

Architecture Context:

  • Load Balancer: Terminates TLS, forwards HTTP to backend
  • Backend: Serves HTTP only (trusts private network)
  • Health Probes: Direct pod access (bypasses load balancer)

Complete Solution

File: deployment/kubernetes/staging/backend-deployment.yaml

Original (Broken):

livenessProbe:
  httpGet:
    path: /api/v1/health/live
    port: http  # Named port (8000)
    # scheme was not specified here; the probe ran over HTTPS
  initialDelaySeconds: 30
  periodSeconds: 10

Fixed:

livenessProbe:
  httpGet:
    path: /api/v1/health/live
    port: http
    scheme: HTTP  # Explicitly specify HTTP
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /api/v1/health/ready
    port: http
    scheme: HTTP  # Also fix readiness probe
  initialDelaySeconds: 20
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

Key Changes:

  1. Add scheme: HTTP to both liveness and readiness probes
  2. Explicit timeouts for clarity
  3. Consistent configuration across all probes
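
One way to enforce these key changes before deploying is a small manifest check. The sketch below is a hypothetical pre-deploy lint over a parsed pod spec (e.g. loaded from YAML into a dict), not part of the project tooling.

```python
def probes_missing_scheme(pod_spec):
    """Report container probes whose httpGet omits an explicit scheme."""
    findings = []
    for container in pod_spec.get("containers", []):
        for probe_name in ("livenessProbe", "readinessProbe", "startupProbe"):
            http_get = container.get(probe_name, {}).get("httpGet")
            if http_get is not None and "scheme" not in http_get:
                findings.append(container["name"] + "." + probe_name)
    return findings

spec = {
    "containers": [{
        "name": "backend",
        "livenessProbe": {"httpGet": {"path": "/api/v1/health/live",
                                      "port": "http"}},
        "readinessProbe": {"httpGet": {"path": "/api/v1/health/ready",
                                       "port": "http", "scheme": "HTTP"}},
    }]
}
# Only the liveness probe is flagged: it lacks an explicit scheme
assert probes_missing_scheme(spec) == ["backend.livenessProbe"]
```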

Step 1: Update Deployment Manifest

# Edit file
vim deployment/kubernetes/staging/backend-deployment.yaml

# Apply changes
kubectl apply -f deployment/kubernetes/staging/backend-deployment.yaml

Step 2: Verify Probe Configuration

kubectl get pod -n coditect-staging -l app=coditect-backend -o yaml | grep -A 10 "livenessProbe:"

Expected Output:

livenessProbe:
  httpGet:
    path: /api/v1/health/live
    port: http
    scheme: HTTP  # Verify this line

Verification Steps

1. Watch pod status during rollout:

kubectl rollout status deployment/coditect-backend -n coditect-staging

Success indicator: deployment "coditect-backend" successfully rolled out

2. Check probe success:

kubectl describe pod -n coditect-staging -l app=coditect-backend | grep -A 10 "Liveness:"

Success indicators:

  • Liveness: http-get http://:8000/api/v1/health/live
  • No failure events

3. Manual probe test:

POD_IP=$(kubectl get pod -n coditect-staging -l app=coditect-backend -o jsonpath='{.items[0].status.podIP}')

curl -v http://$POD_IP:8000/api/v1/health/live

Expected: HTTP 200 response

4. Check probe failure history:

kubectl get events -n coditect-staging --field-selector involvedObject.name=coditect-backend --sort-by='.lastTimestamp' | tail -20

Success indicator: No recent "Liveness probe failed" events

Prevention Guidance

For Future Deployments:

  1. Always specify probe scheme explicitly

    # Don't rely on defaults
    livenessProbe:
      httpGet:
        path: /api/v1/health/live
        port: http
        scheme: HTTP  # Explicit is better than implicit
  2. Document TLS termination architecture

    Client --HTTPS--> Load Balancer --HTTP--> Backend Pods
                      (TLS termination)      (HTTP only)
  3. Test probes independently

    # Before deploying, test probe endpoint
    curl http://localhost:8000/api/v1/health/live
  4. Monitor probe metrics

    • Prometheus: kubelet prober_probe_total (probe results by type and outcome)
    • Grafana dashboard for health probe success rate
  5. Production considerations

    • If the backend must serve HTTPS directly, a self-signed certificate is sufficient for probes: the kubelet does not verify certificates on HTTPS probes
    • Alternatively, keep the probe on HTTP and mark it as already-terminated TLS so Django's SSL redirect does not fire (requires SECURE_PROXY_SSL_HEADER in Django settings):
      livenessProbe:
        httpGet:
          path: /api/v1/health/live
          port: http
          scheme: HTTP
          httpHeaders:
            - name: X-Forwarded-Proto
              value: https
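
The X-Forwarded-Proto approach works because Django can be told to trust a proxy header when deciding whether a request is secure (the SECURE_PROXY_SSL_HEADER setting). The snippet below is a simplified model of that decision, not Django's source; is_secure here is a stand-in for the framework's internal check.

```python
# Simplified model of Django's SECURE_PROXY_SSL_HEADER handling.
SECURE_PROXY_SSL_HEADER = ("HTTP_X_FORWARDED_PROTO", "https")

def is_secure(request_meta, wire_scheme):
    """Treat a request as secure if the trusted proxy header vouches for TLS."""
    header, value = SECURE_PROXY_SSL_HEADER
    if request_meta.get(header) == value:
        return True  # proxy (or probe) asserts TLS was terminated upstream
    return wire_scheme == "https"

# A plain-HTTP probe sending "X-Forwarded-Proto: https" is treated as
# secure, so SECURE_SSL_REDIRECT does not bounce it to HTTPS.
assert is_secure({"HTTP_X_FORWARDED_PROTO": "https"}, "http") is True
assert is_secure({}, "http") is False
```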

Related ADRs:

  • ADR-0008: Security Best Practices
  • ADR-0005: Kubernetes Deployment Strategy

Quick Reference

Common Commands

# Check pod status
kubectl get pods -n coditect-staging -l app=coditect-backend

# View pod logs
kubectl logs -n coditect-staging -l app=coditect-backend --tail=100 -f

# Describe pod (includes events)
kubectl describe pod -n coditect-staging -l app=coditect-backend

# Execute command in pod
kubectl exec -n coditect-staging deployment/coditect-backend -- COMMAND

# Restart deployment
kubectl rollout restart deployment/coditect-backend -n coditect-staging

# Check rollout status
kubectl rollout status deployment/coditect-backend -n coditect-staging

# View ConfigMap
kubectl get configmap backend-config -n coditect-staging -o yaml

# View Secrets (keys only, not values)
kubectl describe secret backend-secrets -n coditect-staging

# Check service endpoints
kubectl get endpoints coditect-backend -n coditect-staging

Issue Decision Tree

Pod Status: ImagePullBackOff
└─> Issue 1: GCR Deprecation OR Issue 2: Platform Mismatch

Pod Status: CrashLoopBackOff
├─> Logs: "ModuleNotFoundError" → Issue 3: User Permissions
├─> Logs: "password authentication failed" → Issue 5: Database User
├─> Logs: "connection requires a valid client certificate" → Issue 4: SSL Requirement
└─> Logs: "DisallowedHost" → Issue 6: ALLOWED_HOSTS

Pod Status: Not Ready
├─> Liveness probe failed: HTTPS timeout → Issue 7: Probe HTTPS/HTTP
└─> Readiness probe failed: DB connection → Issue 4 or 5

Environment Variables Checklist

Required in backend-secrets:

  • django-secret-key - Random 64-byte string
  • db-name - Database name (coditect_db)
  • db-user - Database user (coditect_app)
  • db-password - Database password (32+ chars)
  • db-host - Cloud SQL private IP
  • redis-host - Redis private IP

Required in backend-config:

  • DJANGO_ALLOWED_HOSTS - Comma-separated hosts
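
Assuming the constraints in the checklist above, suitable secret values can be generated with Python's standard secrets module; the variable names are illustrative only.

```python
import secrets

# Generate values satisfying the checklist constraints:
# a secret key derived from 64 random bytes, a password of 32+ chars.
django_secret_key = secrets.token_urlsafe(64)
db_password = secrets.token_urlsafe(32)

assert len(django_secret_key) >= 64
assert len(db_password) >= 32
```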

Pre-Deployment Checklist

Before deploying to staging:

  • Artifact Registry repository created
  • GKE service account has artifactregistry.reader role
  • Image built with --platform linux/amd64
  • Image pushed to Artifact Registry (not GCR)
  • Cloud SQL instance running
  • Cloud SQL user created (coditect_app)
  • Cloud SQL database created (coditect_db)
  • Kubernetes secrets created
  • ConfigMap updated with correct ALLOWED_HOSTS
  • Deployment manifest specifies scheme: HTTP in probes
  • SSL requirement disabled (staging) OR Cloud SQL Proxy configured (production)

Internal Documentation

External References


Appendix: Full Working Configuration

Dockerfile (Final)

File: Dockerfile

# CODITECT Cloud Backend - Production Dockerfile
# Django 5.2.8 Backend with Multi-Stage Build
# Python 3.12.12 for protobuf compatibility

# Stage 1: Builder - Install dependencies and collect static files
FROM python:3.12.12-slim-bookworm AS builder

LABEL maintainer="AZ1.AI INC <engineering@az1.ai>"
LABEL description="CODITECT Cloud Backend - Django 5.2.8 API Server"

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc \
        g++ \
        make \
        postgresql-client \
        libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Copy application code
COPY . .

# Collect static files (production settings expect environment variables)
ENV DJANGO_SETTINGS_MODULE=license_platform.settings.production
ENV DJANGO_SECRET_KEY=temp-build-key
ENV DB_NAME=build
ENV DB_USER=build
ENV DB_PASSWORD=build
ENV DB_HOST=localhost
RUN python manage.py collectstatic --noinput --clear

# Stage 2: Runtime - Minimal production image
FROM python:3.12.12-slim-bookworm

WORKDIR /app

# Install runtime dependencies only
RUN apt-get update && apt-get install -y --no-install-recommends \
        postgresql-client \
        libpq-dev \
        curl \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user first
RUN useradd -m -u 1000 django

# Copy Python packages from builder to user directory
COPY --from=builder /root/.local /home/django/.local

# Copy application code from builder
COPY --from=builder /app /app

# Set ownership
RUN chown -R django:django /app /home/django/.local

# Make sure scripts in .local are usable
ENV PATH=/home/django/.local/bin:$PATH

USER django

# Expose port
EXPOSE 8000

# Health check endpoint
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD curl -f http://localhost:8000/api/v1/health/ || exit 1

# Run with gunicorn for production
CMD ["gunicorn", \
     "--bind", "0.0.0.0:8000", \
     "--workers", "4", \
     "--worker-class", "sync", \
     "--timeout", "60", \
     "--access-logfile", "-", \
     "--error-logfile", "-", \
     "--log-level", "info", \
     "license_platform.wsgi:application"]

Deployment Manifest (Final)

File: deployment/kubernetes/staging/backend-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: coditect-backend
  namespace: coditect-staging
  labels:
    app: coditect-backend
    environment: staging
    version: v1.0.0
spec:
  replicas: 2
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: coditect-backend
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: coditect-backend
        environment: staging
        version: v1.0.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: coditect-cloud-backend
      imagePullSecrets:
        - name: gcr-json-key
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000

      containers:
        - name: backend
          image: us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging
          imagePullPolicy: Always

          ports:
            - name: http
              containerPort: 8000
              protocol: TCP

          env:
            - name: DJANGO_SETTINGS_MODULE
              value: "license_platform.settings.production"

            - name: DJANGO_SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: backend-secrets
                  key: django-secret-key

            - name: DB_NAME
              valueFrom:
                secretKeyRef:
                  name: backend-secrets
                  key: db-name

            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: backend-secrets
                  key: db-user

            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: backend-secrets
                  key: db-password

            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: backend-secrets
                  key: db-host

            - name: DB_PORT
              value: "5432"

            - name: REDIS_HOST
              valueFrom:
                secretKeyRef:
                  name: backend-secrets
                  key: redis-host

            - name: REDIS_PORT
              value: "6379"

            - name: CLOUD_KMS_PROJECT_ID
              value: "coditect-prod-563272"

            - name: CLOUD_KMS_LOCATION
              value: "us-central1"

            - name: CLOUD_KMS_KEYRING
              value: "license-signing-keyring"

            - name: CLOUD_KMS_KEY
              value: "license-signing-key"

            - name: GCP_PROJECT_ID
              value: "coditect-prod-563272"

            - name: ENVIRONMENT
              value: "staging"

          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"

          livenessProbe:
            httpGet:
              path: /api/v1/health/live
              port: http
              scheme: HTTP  # Critical: Explicit HTTP scheme
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3

          readinessProbe:
            httpGet:
              path: /api/v1/health/ready
              port: http
              scheme: HTTP  # Critical: Explicit HTTP scheme
            initialDelaySeconds: 20
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3

          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: false
            capabilities:
              drop:
                - ALL

      terminationGracePeriodSeconds: 30

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - coditect-backend
                topologyKey: kubernetes.io/hostname

ConfigMap (Final)

File: deployment/kubernetes/staging/backend-config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: backend-config
  namespace: coditect-staging
  labels:
    app: coditect-backend
    environment: staging
data:
  # Staging: Wildcard for convenience (NOT for production!)
  DJANGO_ALLOWED_HOSTS: "*"

  # Alternative (production-ready):
  # DJANGO_ALLOWED_HOSTS: "10.56.0.0/16,coditect-backend.coditect-staging.svc.cluster.local,*.coditect.com,localhost"
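
Note that stock Django ALLOWED_HOSTS does not understand CIDR notation such as 10.56.0.0/16, so the production-ready alternative implies a custom middleware (as discussed under Issue 6). Below is a minimal sketch of the range check such a middleware would perform, using the standard ipaddress module; host_allowed and POD_CIDR are hypothetical names.

```python
import ipaddress

POD_CIDR = ipaddress.ip_network("10.56.0.0/16")

def host_allowed(host, explicit_hosts):
    """Allow explicit hostnames, plus any IP literal inside the pod CIDR."""
    if host in explicit_hosts:
        return True
    try:
        return ipaddress.ip_address(host) in POD_CIDR
    except ValueError:  # not an IP literal (e.g. a DNS name)
        return False

allowed = {"coditect-backend.coditect-staging.svc.cluster.local"}
assert host_allowed("10.56.1.10", allowed) is True   # probe from pod network
assert host_allowed("203.0.113.9", allowed) is False
```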

Build and Deploy Script

File: scripts/deploy-staging.sh

#!/bin/bash
set -euo pipefail

PROJECT_ID="coditect-cloud-infra"
REGION="us-central1"
IMAGE_REPO="coditect-backend"
IMAGE_NAME="coditect-cloud-backend"
VERSION="${1:-v1.0.0-staging}"

echo "Building image for linux/amd64..."
docker buildx build \
    --platform linux/amd64 \
    --tag ${REGION}-docker.pkg.dev/${PROJECT_ID}/${IMAGE_REPO}/${IMAGE_NAME}:${VERSION} \
    --push \
    .

echo "Applying Kubernetes manifests..."
kubectl apply -f deployment/kubernetes/staging/namespace.yaml
kubectl apply -f deployment/kubernetes/staging/backend-config.yaml
kubectl apply -f deployment/kubernetes/staging/backend-deployment.yaml
kubectl apply -f deployment/kubernetes/staging/backend-service.yaml

echo "Waiting for rollout..."
kubectl rollout status deployment/coditect-backend -n coditect-staging

echo "Deployment complete!"
kubectl get pods -n coditect-staging -l app=coditect-backend

Document Status: Complete Last Validated: December 1, 2025 Next Review: January 1, 2026 Owner: AZ1.AI INC Engineering Team