CODITECT Cloud Backend - Staging Deployment Troubleshooting Guide
Document Status: Production
Last Updated: December 1, 2025
Environment: Google Kubernetes Engine (GKE) Staging
Purpose: Comprehensive troubleshooting guide for common deployment issues
Table of Contents
- Issue 1: GCR Deprecation (403 Forbidden)
- Issue 2: Multi-Platform Docker Build
- Issue 3: Dockerfile User Permissions
- Issue 4: Cloud SQL SSL Certificate Requirement
- Issue 5: Database User Authentication
- Issue 6: Django ALLOWED_HOSTS Rejection
- Issue 7: Health Probe HTTPS/HTTP Mismatch
- Quick Reference
- Related Documentation
Overview
This guide documents 7 critical issues encountered and resolved during the initial staging deployment of CODITECT Cloud Backend to Google Kubernetes Engine. Each issue includes:
- Error symptoms - What you'll see when this happens
- Root cause analysis - Why it happened
- Complete solution - How to fix it permanently
- Verification steps - How to confirm it's resolved
- Prevention guidance - How to avoid in future deployments
Deployment Context:
- Platform: Google Kubernetes Engine (GKE) Standard
- Region: us-central1
- Cluster: coditect-staging-cluster
- Namespace: coditect-staging
- Image Registry: Artifact Registry (us-central1-docker.pkg.dev)
- Database: Cloud SQL PostgreSQL 15
- Framework: Django 5.2.8 with Python 3.12.12
Issue 1: GCR Deprecation (403 Forbidden)
Error Symptoms
Failed to pull image "gcr.io/coditect-cloud-infra/coditect-cloud-backend:v1.0.0-staging":
rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/...":
failed to resolve reference "gcr.io/...":
pull access denied, repository does not exist or may require authorization:
server message: insufficient_scope: authorization failed
Pod Status: ImagePullBackOff or ErrImagePull
Timeline: After March 18, 2025 (GCR shutdown date)
Root Cause Analysis
Primary Cause: Google Container Registry (gcr.io) was deprecated and shut down on March 18, 2025.
Background:
- Google announced GCR deprecation in 2023
- Customer-project GCR URLs (`gcr.io/PROJECT_ID/...`, `us.gcr.io/...`, etc.) now return 403 Forbidden
- Google requires migration to Artifact Registry (pkg.dev)
- Existing images in GCR were migrated automatically, but new pushes fail
Why This Happened:
- Deployment manifests used legacy `gcr.io/PROJECT_ID/...` URLs
- GKE service account had the `storage.objectViewer` role (for GCR)
- Artifact Registry API enablement and permissions were missing
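The manifest change in Step 4 below is mechanical; if many manifests need updating, it can be scripted. A throwaway sketch (function name and defaults are illustrative, not part of the project):

```python
def to_artifact_registry(gcr_url: str,
                         region: str = "us-central1",
                         repository: str = "coditect-backend") -> str:
    """Rewrite a legacy gcr.io image reference into the Artifact Registry
    form REGION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG."""
    host, project, image = gcr_url.split("/", 2)
    if not host.endswith("gcr.io"):
        raise ValueError(f"not a GCR reference: {gcr_url}")
    return f"{region}-docker.pkg.dev/{project}/{repository}/{image}"
```

Running it on the old staging reference yields exactly the new URL used throughout this guide.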
Complete Solution
Step 1: Enable Artifact Registry API
gcloud services enable artifactregistry.googleapis.com \
--project=coditect-cloud-infra
Step 2: Create Artifact Registry Repository
gcloud artifacts repositories create coditect-backend \
--repository-format=docker \
--location=us-central1 \
--description="CODITECT Cloud Backend Docker Images" \
--project=coditect-cloud-infra
Verify:
gcloud artifacts repositories list \
--location=us-central1 \
--project=coditect-cloud-infra
Expected output:
REPOSITORY FORMAT MODE DESCRIPTION
coditect-backend DOCKER STANDARD_REPOSITORY CODITECT Cloud Backend Docker Images
Step 3: Grant GKE Service Account Pull Access
# Get GKE service account
GKE_SA=$(gcloud container clusters describe coditect-staging-cluster \
--region=us-central1 \
--format="value(nodeConfig.serviceAccount)")
# Grant Artifact Registry Reader role
gcloud artifacts repositories add-iam-policy-binding coditect-backend \
--location=us-central1 \
--member="serviceAccount:${GKE_SA}" \
--role="roles/artifactregistry.reader" \
--project=coditect-cloud-infra
Step 4: Update Deployment Manifests
File: deployment/kubernetes/staging/backend-deployment.yaml
Change:
# OLD (GCR - deprecated)
image: gcr.io/coditect-cloud-infra/coditect-cloud-backend:v1.0.0-staging
# NEW (Artifact Registry)
image: us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging
Step 5: Build and Push to Artifact Registry
# Configure Docker authentication
gcloud auth configure-docker us-central1-docker.pkg.dev
# Build for correct platform (see Issue 2)
docker buildx build \
--platform linux/amd64 \
--tag us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
--push \
.
Step 6: Apply Updated Deployment
kubectl apply -f deployment/kubernetes/staging/backend-deployment.yaml
Verification Steps
1. Check image pull status:
kubectl describe pod -n coditect-staging -l app=coditect-backend | grep -A 5 "Events:"
Success indicators:
- No `ImagePullBackOff` events
- `Successfully pulled image` message
- Pod status: `Running`
2. Verify image in Artifact Registry:
gcloud artifacts docker images list \
us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend
3. Test image pull manually:
docker pull us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging
Prevention Guidance
For Future Deployments:
- Always use Artifact Registry for new projects
  - Never use `gcr.io` URLs
  - Standard format: `REGION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG`
- Update CI/CD pipelines
  - GitHub Actions: Use `google-github-actions/auth@v2` with Artifact Registry
  - Replace all `gcr.io` references in workflow files
- Document registry locations
  - Add Artifact Registry URLs to deployment documentation
  - Include in `.env.example` files
- IAM least privilege
  - Grant only `roles/artifactregistry.reader` for pulling
  - Grant `roles/artifactregistry.writer` only to CI/CD service accounts
Related ADRs:
- ADR-0001: Google Cloud Platform as Primary Cloud Provider
- ADR-0007: Docker Container Strategy
Issue 2: Multi-Platform Docker Build
Error Symptoms
Error response from daemon:
no match for platform in manifest:
not found
Or:
exec /usr/local/bin/python: exec format error
Pod Status: CrashLoopBackOff with exit code 1
Timeline: After successfully pulling image from Artifact Registry
Root Cause Analysis
Primary Cause: Docker image built on macOS (arm64/Apple Silicon) incompatible with GKE nodes (linux/amd64).
Technical Details:
- macOS with Apple Silicon uses ARM64 architecture
- GKE Standard nodes use x86-64 (AMD64) architecture
- Default `docker build` creates a single-platform image for the host architecture
- Kubernetes pulls the image but cannot execute ARM64 binaries on AMD64 nodes
Why This Happened:
- Built image locally on MacBook (ARM64)
- Pushed to Artifact Registry without platform specification
- GKE attempted to run ARM64 image on AMD64 nodes
- Binary format mismatch caused exec errors
Complete Solution
Step 1: Install Docker Buildx (if not already installed)
# Verify buildx availability
docker buildx version
# If missing, install
docker buildx install
Step 2: Create Multi-Platform Builder
# Create builder instance
docker buildx create --name multiplatform --use
# Verify builder
docker buildx inspect multiplatform --bootstrap
Step 3: Build for Correct Platform
# Build for linux/amd64 (GKE platform)
docker buildx build \
--platform linux/amd64 \
--tag us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
--push \
.
Important Flags:
- `--platform linux/amd64` - Build for x86-64 (GKE nodes)
- `--push` - Push directly to registry (required for multi-platform)
- `.` - Dockerfile location
Step 4: Verify Image Manifest
# Inspect image architecture
docker buildx imagetools inspect \
us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging
Expected Output:
Name: us-central1-docker.pkg.dev/.../coditect-cloud-backend:v1.0.0-staging
MediaType: application/vnd.docker.distribution.manifest.v2+json
Digest: sha256:...
Manifests:
Name: us-central1-docker.pkg.dev/.../coditect-cloud-backend:v1.0.0-staging@sha256:...
MediaType: application/vnd.docker.distribution.manifest.v2+json
Platform: linux/amd64 <-- VERIFY THIS
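That check can also be scripted: `docker buildx imagetools inspect --raw IMAGE` emits the manifest list as JSON. A small parser sketch (the sample JSON below is fabricated for illustration):

```python
import json

def manifest_platforms(raw: str) -> list[str]:
    """Extract os/arch pairs from a Docker manifest list, as printed by
    `docker buildx imagetools inspect --raw IMAGE`."""
    doc = json.loads(raw)
    return [
        f"{m['platform']['os']}/{m['platform']['architecture']}"
        for m in doc.get("manifests", [])
        if "platform" in m
    ]

# Fabricated sample shaped like a Docker manifest list:
sample = '{"manifests": [{"digest": "sha256:abc", "platform": {"os": "linux", "architecture": "amd64"}}]}'
```

A pre-deploy script can fail fast if `"linux/amd64"` is missing from the result.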
Verification Steps
1. Check pod startup:
kubectl logs -n coditect-staging -l app=coditect-backend --tail=50
Success indicators:
- Django/Gunicorn startup messages
- No "exec format error"
- Application listening on port 8000
2. Verify architecture in running container:
kubectl exec -n coditect-staging deployment/coditect-backend -- uname -m
Expected: x86_64 (not aarch64)
3. Test image locally (optional):
# Pull and run locally with platform specification
docker run --platform linux/amd64 \
us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
python --version
Prevention Guidance
For Future Deployments:
- Always specify platform in builds
  - Add `--platform linux/amd64` to all production builds
  - Document in deployment runbooks
- CI/CD standardization
  - GitHub Actions runners are linux/amd64 by default (correct)
  - If building locally, create a shell alias:
    alias docker-build-gke='docker buildx build --platform linux/amd64'
- Multi-platform support (optional)
  - For broader compatibility, build multi-platform:
    docker buildx build \
      --platform linux/amd64,linux/arm64 \
      --tag IMAGE \
      --push .
  - Supports both AMD64 (GKE) and ARM64 (future ARM nodes)
- Makefile integration
  - Create a `Makefile` with standardized build commands:
    .PHONY: build-staging
    build-staging:
        docker buildx build \
            --platform linux/amd64 \
            --tag us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
            --push \
            .
Related ADRs:
- ADR-0007: Docker Container Strategy
Issue 3: Dockerfile User Permissions
Error Symptoms
Traceback (most recent call last):
File "/app/manage.py", line 11, in <module>
from django.core.management import execute_from_command_line
ModuleNotFoundError: No module named 'django'
Pod Logs:
/usr/local/bin/python: can't open file '/app/manage.py': [Errno 13] Permission denied
Or:
[Errno 13] Permission denied: '/app/staticfiles'
Pod Status: CrashLoopBackOff with exit code 1
Root Cause Analysis
Primary Cause: Python packages installed to /root/.local in builder stage, but runtime container runs as non-root user django (UID 1000) without access to /root.
Technical Details:
- Multi-stage Dockerfile uses `FROM python:3.12.12-slim-bookworm as builder`
- Builder stage runs as root and installs packages to `/root/.local`
- Runtime stage creates the user with `useradd -m -u 1000 django`
- `COPY --from=builder /root/.local` copies the packages into the runtime image, still owned by root
- The `USER django` directive switches to the non-root user
- User `django` cannot read `/root/.local` (permission denied)
Why This Happened:
- Security best practice: Containers should run as non-root
- GKE enforces `runAsNonRoot: true` in the pod security policy
- Incorrect assumption that copying files changes ownership
Complete Solution
File: Dockerfile
Original (Broken):
FROM python:3.12.12-slim-bookworm as builder
# ... build steps ...
RUN pip install --no-cache-dir --user -r requirements.txt # Installs to /root/.local
FROM python:3.12.12-slim-bookworm
RUN useradd -m -u 1000 django
COPY --from=builder /root/.local /root/.local # Root-owned files
USER django # Cannot access /root
Fixed:
FROM python:3.12.12-slim-bookworm as builder
WORKDIR /app
# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc g++ make postgresql-client libpq-dev \
&& rm -rf /var/lib/apt/lists/*
# Install Python packages to /root/.local
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
# Copy application code
COPY . .
# Collect static files
ENV DJANGO_SETTINGS_MODULE=license_platform.settings.production
ENV DJANGO_SECRET_KEY=temp-build-key
ENV DB_NAME=build DB_USER=build DB_PASSWORD=build DB_HOST=localhost
RUN python manage.py collectstatic --noinput --clear
# Stage 2: Runtime
FROM python:3.12.12-slim-bookworm
WORKDIR /app
# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
postgresql-client libpq-dev curl \
&& rm -rf /var/lib/apt/lists/*
# Create non-root user FIRST
RUN useradd -m -u 1000 django
# Copy Python packages to USER HOME directory (not /root)
COPY --from=builder /root/.local /home/django/.local
# Copy application code
COPY --from=builder /app /app
# Change ownership to django user
RUN chown -R django:django /app /home/django/.local
# Add .local/bin to PATH so django user can find installed packages
ENV PATH=/home/django/.local/bin:$PATH
# Switch to non-root user
USER django
EXPOSE 8000
CMD ["gunicorn", ...]
Key Changes:
- Create user first: `RUN useradd -m -u 1000 django` before copying files
- Copy to the user's home directory: `COPY --from=builder /root/.local /home/django/.local`
- Fix ownership: `RUN chown -R django:django /app /home/django/.local`
- Update PATH: `ENV PATH=/home/django/.local/bin:$PATH`
- Switch user last: `USER django` after ownership changes
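The permission failure above comes down to Unix read bits and ownership. This standalone sketch (function name is illustrative) approximates the check a process makes, and shows why a root-owned `0600` file is unreadable to uid 1000:

```python
import os
import stat

def readable_by(path: str, uid: int, gid: int) -> bool:
    """Approximate: can a process with this uid/gid read `path`?
    Mirrors the owner/group/other read bits involved in Issue 3."""
    st = os.stat(path)
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IRUSR)  # owner read bit
    if st.st_gid == gid:
        return bool(st.st_mode & stat.S_IRGRP)  # group read bit
    return bool(st.st_mode & stat.S_IROTH)      # other read bit
```

A file under `/root/.local` with mode `0600` passes this check only for its owner, which is exactly why `chown -R django:django` (or `COPY --chown`) is required.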
Verification Steps
1. Rebuild and push image:
docker buildx build \
--platform linux/amd64 \
--tag us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging \
--push \
.
2. Check pod startup:
kubectl logs -n coditect-staging -l app=coditect-backend --tail=50
Success indicators:
[INFO] Booting worker with pid: 7
[INFO] Listening at: http://0.0.0.0:8000
3. Verify user in running container:
kubectl exec -n coditect-staging deployment/coditect-backend -- whoami
# Expected: django
kubectl exec -n coditect-staging deployment/coditect-backend -- id
# Expected: uid=1000(django) gid=1000(django) groups=1000(django)
4. Verify package access:
kubectl exec -n coditect-staging deployment/coditect-backend -- python -c "import django; print(django.__version__)"
# Expected: 5.2.8
Prevention Guidance
For Future Dockerfiles:
-
Always create user before copying files
RUN useradd -m -u 1000 appuser
COPY --from=builder /root/.local /home/appuser/.local
RUN chown -R appuser:appuser /app /home/appuser/.local
USER appuser -
Use explicit ownership in COPY
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local -
Verify permissions in build
RUN ls -la /home/appuser/.local/lib/python*/site-packages/ | head -10 -
Test as non-root locally
docker run --rm -it --user 1000:1000 IMAGE /bin/bash
python -c "import django"
Related ADRs:
- ADR-0007: Docker Container Strategy
- ADR-0008: Security Best Practices
Issue 4: Cloud SQL SSL Certificate Requirement
Error Symptoms
django.db.utils.OperationalError:
connection to server at "10.41.64.3", port 5432 failed:
FATAL: connection requires a valid client certificate
Pod Logs:
psycopg2.OperationalError:
FATAL: pg_hba.conf rejects connection for host "10.56.1.10",
user "coditect_app", database "coditect_db", no encryption
Pod Status: CrashLoopBackOff during database connection
Root Cause Analysis
Primary Cause: Cloud SQL instance configured with requireSsl: true, but application not providing client SSL certificates.
Technical Details:
- Cloud SQL instance created with default security settings
- `settings.requireSsl: true` enforces TLS for all connections
- Django `DATABASES` configuration missing SSL parameters
- PostgreSQL server rejects non-SSL connections per `pg_hba.conf` rules
Why This Happened:
- Security best practice: Cloud SQL requires SSL by default
- Application not configured for SSL connections
- Missing Cloud SQL proxy configuration OR client certificates
Security Tradeoffs:
- Production: MUST use SSL with client certificates (high security)
- Staging: MAY disable SSL if on private VPC (convenience vs. security)
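For reference, the "missing SSL parameters" live in the `OPTIONS` dict of Django's `DATABASES` setting, which libpq/psycopg2 reads directly. A hedged sketch (env var names mirror this deployment; the `sslmode` default of `require` is an assumption, not the project's actual settings file):

```python
import os

# Sketch of a Django DATABASES entry that passes SSL parameters through
# to the PostgreSQL driver. "require" encrypts without verifying the server
# cert; "verify-ca"/"verify-full" additionally need a CA bundle.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DB_NAME", "coditect_db"),
        "USER": os.environ.get("DB_USER", "coditect_app"),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": os.environ.get("DB_HOST", "127.0.0.1"),
        "PORT": os.environ.get("DB_PORT", "5432"),
        "OPTIONS": {
            # libpq understands sslmode directly; no Django-specific key needed
            "sslmode": os.environ.get("DB_SSLMODE", "require"),
        },
    }
}
```

With the Cloud SQL Proxy sidecar (Option B below), `sslmode` can stay at its default, since the proxy handles TLS on the pod's loopback connection.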
Complete Solution
Option A: Disable SSL Requirement (Staging Only)
WARNING: Only for staging environments on private networks. NEVER for production.
gcloud sql instances patch coditect-db \
--no-require-ssl \
--project=coditect-cloud-infra
Verification:
gcloud sql instances describe coditect-db \
--project=coditect-cloud-infra \
--format="get(settings.ipConfiguration.requireSsl)"
Expected: False
Pros:
- Simple configuration
- No certificate management
- Faster debugging cycles
Cons:
- Unencrypted database connections (private VPC only)
- Not production-ready
- Security compliance risk
Option B: Use Cloud SQL Proxy (Production Recommended)
Step 1: Add Cloud SQL Proxy Sidecar
File: deployment/kubernetes/staging/backend-deployment.yaml
spec:
template:
spec:
containers:
# Main application container
- name: backend
image: us-central1-docker.pkg.dev/.../coditect-cloud-backend:v1.0.0-staging
env:
- name: DB_HOST
value: "127.0.0.1" # Connect to proxy sidecar
- name: DB_PORT
value: "5432"
# Cloud SQL Proxy sidecar
- name: cloud-sql-proxy
image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0
args:
- "--private-ip"
- "coditect-cloud-infra:us-central1:coditect-db"
- "--port=5432"
securityContext:
runAsNonRoot: true
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "200m"
Step 2: Grant Cloud SQL Client Role
gcloud projects add-iam-policy-binding coditect-cloud-infra \
--member="serviceAccount:coditect-cloud-backend@coditect-cloud-infra.iam.gserviceaccount.com" \
--role="roles/cloudsql.client"
Step 3: Apply Updated Deployment
kubectl apply -f deployment/kubernetes/staging/backend-deployment.yaml
Pros:
- Automatic SSL with Google-managed certificates
- No certificate file management
- Automatic IAM authentication support
- Production-ready security
Cons:
- Additional sidecar container overhead
- More complex configuration
Option C: Client SSL Certificates (Manual)
Not recommended - Use Cloud SQL Proxy instead.
Verification Steps
1. Test database connection from pod:
kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py dbshell -- --command "SELECT version();"
Success indicator: PostgreSQL version output
2. Check application startup:
kubectl logs -n coditect-staging -l app=coditect-backend --tail=50 | grep -i database
Success indicators:
- No SSL errors
- "Applying migration..." messages
3. Verify SSL status (if using proxy):
kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py dbshell -- --command "SELECT ssl_is_used();"
Expected (with proxy): t (true)
Expected (without SSL): f (false)
Prevention Guidance
For Future Deployments:
- Production: Always use Cloud SQL Proxy
  - Deploy proxy as sidecar container
  - Enable IAM authentication where possible
  - Never disable SSL requirement
- Staging: Document security tradeoffs
  - If disabling SSL, document it in the deployment README
  - Add a comment in Terraform/IaC files
  - Set a reminder to re-enable for production
- Infrastructure as Code
  - Terraform: Set `require_ssl = true` for production
  - Document SSL configuration in `terraform/variables.tf`
- Connection testing
  - Add a health check that verifies database SSL
  - Monitor SSL connection metrics in production
Related ADRs:
- ADR-0009: Database Architecture and Management
- ADR-0008: Security Best Practices
Issue 5: Database User Authentication
Error Symptoms
django.db.utils.OperationalError:
connection to server at "10.41.64.3", port 5432 failed:
FATAL: password authentication failed for user "coditect_app"
Pod Logs:
psycopg2.OperationalError:
FATAL: role "coditect_app" does not exist
Pod Status: CrashLoopBackOff during database connection initialization
Root Cause Analysis
Primary Cause: Database user coditect_app either doesn't exist in Cloud SQL instance or has incorrect password.
Technical Details:
- Cloud SQL instance created, but database users not provisioned
- Django `DATABASES` configuration references `DB_USER=coditect_app`
- Kubernetes secret `backend-secrets` may have the wrong password
- PostgreSQL rejects authentication for non-existent or mismatched credentials
Why This Happened:
- Database infrastructure (instance) created separately from application resources
- User creation not automated in Terraform/IaC
- Manual user creation step missed
- Password mismatch between gcloud command and Kubernetes secret
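If credentials are generated from a provisioning script rather than the shell, Python's `secrets` module gives an equivalent of the `openssl rand -base64 32` step used below (function name is illustrative):

```python
import secrets

def make_db_password(nbytes: int = 32) -> str:
    """Generate a URL-safe random password with `nbytes` of entropy,
    comparable to `openssl rand -base64 32`."""
    return secrets.token_urlsafe(nbytes)
```

Generating the password once in the script and feeding the same variable to both `gcloud sql users create` and `kubectl create secret` avoids the password-mismatch failure mode entirely.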
Complete Solution
Step 1: Verify Database Instance is Running
gcloud sql instances describe coditect-db \
--project=coditect-cloud-infra \
--format="get(state)"
Expected: RUNNABLE
Step 2: Create Database User
# Generate secure password (save this!)
DB_PASSWORD=$(openssl rand -base64 32)
echo "Database Password: $DB_PASSWORD"
# Create user in Cloud SQL
gcloud sql users create coditect_app \
--instance=coditect-db \
--password="$DB_PASSWORD" \
--project=coditect-cloud-infra
Verify:
gcloud sql users list \
--instance=coditect-db \
--project=coditect-cloud-infra
Expected Output:
NAME HOST
coditect_app %
postgres %
Step 3: Create Database (if not exists)
gcloud sql databases create coditect_db \
--instance=coditect-db \
--project=coditect-cloud-infra
Verify:
gcloud sql databases list \
--instance=coditect-db \
--project=coditect-cloud-infra
Step 4: Create Kubernetes Secret
# Delete old secret if exists
kubectl delete secret backend-secrets -n coditect-staging --ignore-not-found
# Create new secret with correct password
kubectl create secret generic backend-secrets \
-n coditect-staging \
--from-literal=django-secret-key="$(openssl rand -base64 64)" \
--from-literal=db-name="coditect_db" \
--from-literal=db-user="coditect_app" \
--from-literal=db-password="$DB_PASSWORD" \
--from-literal=db-host="10.41.64.3" \
--from-literal=redis-host="10.41.65.4"
Verify:
kubectl describe secret backend-secrets -n coditect-staging
Step 5: Restart Deployment
kubectl rollout restart deployment/coditect-backend -n coditect-staging
Step 6: Grant Database Permissions
# Connect to database
gcloud sql connect coditect-db \
--user=postgres \
--quiet \
--project=coditect-cloud-infra
# In psql prompt:
GRANT ALL PRIVILEGES ON DATABASE coditect_db TO coditect_app;
GRANT ALL PRIVILEGES ON SCHEMA public TO coditect_app;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON TABLES TO coditect_app;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON SEQUENCES TO coditect_app;
\q
Verification Steps
1. Test authentication from pod:
kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py check --database default
Success indicator: System check identified no issues (0 silenced).
2. Run database migrations:
kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py migrate --noinput
Success indicators:
- "Applying migrations..." messages
- No authentication errors
3. Test database query:
kubectl exec -n coditect-staging deployment/coditect-backend -- \
python manage.py dbshell --command "SELECT current_user, current_database();"
Expected Output:
current_user | current_database
--------------+------------------
coditect_app | coditect_db
4. Check application logs:
kubectl logs -n coditect-staging -l app=coditect-backend --tail=50 | grep -E "(database|migration)"
Success indicators:
- No authentication errors
- Migration success messages
Prevention Guidance
For Future Deployments:
- Automate user creation in Terraform
  resource "google_sql_user" "app_user" {
    name     = "coditect_app"
    instance = google_sql_database_instance.main.name
    password = random_password.db_password.result
  }
  resource "random_password" "db_password" {
    length  = 32
    special = true
  }
  resource "google_secret_manager_secret_version" "db_password" {
    secret      = google_secret_manager_secret.db_password.id
    secret_data = random_password.db_password.result
  }
- Use GCP Secret Manager (not Kubernetes secrets)
  - Store passwords in Secret Manager
  - Mount as environment variables in pods
  - Automatic rotation support
- Database initialization job
  - Create a Kubernetes Job to run migrations
  - Verify database connectivity before deployment
  - Example: deployment/kubernetes/staging/migrate-job.yaml
- Document database credentials
  - Store the password in a password manager (1Password, etc.)
  - Document user creation in the runbook
  - Add to disaster recovery procedures
Related ADRs:
- ADR-0009: Database Architecture and Management
- ADR-0008: Security Best Practices
Issue 6: Django ALLOWED_HOSTS Rejection
Error Symptoms
Invalid HTTP_HOST header: '10.56.2.20:8000'.
You may need to add '10.56.2.20' to ALLOWED_HOSTS.
Pod Logs:
DisallowedHost at /api/v1/health/live
Invalid HTTP_HOST header: '10.56.1.10:8000'.
The domain name provided is not valid according to RFC 1034/1035.
HTTP Response: 400 Bad Request
Health Probes: Failing with HTTP 400
Root Cause Analysis
Primary Cause: Django ALLOWED_HOSTS setting doesn't include Kubernetes pod IP addresses, and Django doesn't support CIDR notation natively.
Technical Details:
- Kubernetes assigns dynamic pod IPs from the cluster CIDR (e.g., `10.56.0.0/16`)
- Django `ALLOWED_HOSTS` requires an explicit hostname/IP list
- Health probes send requests with a `Host: POD_IP:8000` header
- Django rejects requests whose Host header is not in `ALLOWED_HOSTS`
- CIDR notation (`10.56.0.0/16`) is not supported by Django
Why This Happened:
- Security feature: Django prevents HTTP Host header attacks
- Production settings enforce strict `ALLOWED_HOSTS` validation
- Kubernetes pod IPs change on every deployment/restart
- Cannot predict exact pod IPs in advance
Security Tradeoffs:
- Production: Strict `ALLOWED_HOSTS` with explicit domains (high security)
- Staging: Relaxed `ALLOWED_HOSTS` for debugging (convenience vs. security)
Complete Solution
Option A: Wildcard for Staging (Quick Fix)
File: deployment/kubernetes/staging/backend-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: backend-config
namespace: coditect-staging
labels:
app: coditect-backend
environment: staging
data:
DJANGO_ALLOWED_HOSTS: "*" # Allow all hosts (staging only!)
Pros:
- Simple configuration
- Works with any pod IP
- No CIDR parsing needed
Cons:
- Disables Host header validation
- Vulnerable to Host header attacks
- NOT for production
Option B: Include Cluster CIDR + Service DNS (Balanced)
File: deployment/kubernetes/staging/backend-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: backend-config
namespace: coditect-staging
data:
# Comma-separated list (Django native format)
DJANGO_ALLOWED_HOSTS: "10.56.0.0/16,coditect-backend.coditect-staging.svc.cluster.local,*.coditect.com,localhost"
Note: Django doesn't natively support CIDR, so this requires custom middleware.
Step 1: Create Custom Middleware
File: src/license_platform/middleware/allowed_hosts.py
import ipaddress
from django.core.exceptions import DisallowedHost
from django.http import HttpResponseBadRequest
class CIDRAwareAllowedHostsMiddleware:
"""
Custom middleware to support CIDR notation in ALLOWED_HOSTS.
Checks if request Host header matches:
1. Exact hostnames (existing Django behavior)
2. CIDR ranges (custom logic)
"""
def __init__(self, get_response):
self.get_response = get_response
    def __call__(self, request):
        try:
            host = request.get_host().split(':')[0]  # Remove port
        except DisallowedHost:
            # get_host() itself enforces ALLOWED_HOSTS exact entries; fall
            # back to the raw header so CIDR entries below get a chance
            host = request.META.get('HTTP_HOST', '').split(':')[0]
        if not self.is_allowed_host(host):
            return HttpResponseBadRequest(f"Invalid Host: {host}")
        return self.get_response(request)

    def is_allowed_host(self, host):
        from django.conf import settings
        allowed_hosts = settings.ALLOWED_HOSTS
        # Wildcard allows everything
        if '*' in allowed_hosts:
            return True
        # Exact matches
        if host in allowed_hosts:
            return True
        # Subdomain wildcards (e.g. "*.coditect.com")
        for allowed in allowed_hosts:
            if allowed.startswith('*.') and host.endswith(allowed[1:]):
                return True
        # CIDR ranges
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            return False  # Hostname, already checked above
        for allowed in allowed_hosts:
            if '/' in allowed:  # CIDR notation
                network = ipaddress.ip_network(allowed, strict=False)
                if ip in network:
                    return True
        return False
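The CIDR matching at the heart of that middleware can be exercised on its own; a minimal standalone sketch of the same `ipaddress` logic (function name is illustrative):

```python
import ipaddress

def host_in_cidrs(host: str, allowed: list[str]) -> bool:
    """True if `host` parses as an IP that falls inside any CIDR entry
    of `allowed`. Hostnames return False; those are handled by the
    exact-match and wildcard checks instead."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False
    return any(
        "/" in entry and ip in ipaddress.ip_network(entry, strict=False)
        for entry in allowed
    )
```

For example, `host_in_cidrs("10.56.1.10", ["10.56.0.0/16"])` accepts the pod IP from the error symptoms above, while an address outside the cluster CIDR is rejected.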
Step 2: Register Middleware
File: src/license_platform/settings/production.py
MIDDLEWARE = [
'django.middleware.security.SecurityMiddleware',
'license_platform.middleware.allowed_hosts.CIDRAwareAllowedHostsMiddleware', # Add this
# ... rest of middleware
]
Step 3: Update ConfigMap
data:
DJANGO_ALLOWED_HOSTS: "10.56.0.0/16,coditect-backend.coditect-staging.svc.cluster.local,*.coditect.com,localhost"
Pros:
- Maintains security validation
- Supports dynamic pod IPs
- Production-ready with proper domains
Cons:
- Custom middleware maintenance
- Slight performance overhead
Option C: Use Service DNS Only (Recommended for Production)
File: deployment/kubernetes/staging/backend-deployment.yaml
livenessProbe:
httpGet:
path: /api/v1/health/live
port: http
httpHeaders:
- name: Host
value: coditect-backend.coditect-staging.svc.cluster.local
File: deployment/kubernetes/staging/backend-config.yaml
data:
DJANGO_ALLOWED_HOSTS: "coditect-backend.coditect-staging.svc.cluster.local,api.coditect.com"
Pros:
- Most secure (explicit hosts only)
- No custom middleware
- Best for production
Cons:
- Requires updating probe configuration
- Less flexible for debugging
Verification Steps
1. Check ConfigMap applied:
kubectl get configmap backend-config -n coditect-staging -o yaml
2. Restart deployment to pick up ConfigMap:
kubectl rollout restart deployment/coditect-backend -n coditect-staging
3. Test health endpoint:
kubectl exec -n coditect-staging deployment/coditect-backend -- \
curl -H "Host: 10.56.1.10:8000" http://localhost:8000/api/v1/health/live
Success indicator: HTTP 200 response
4. Check liveness probe status:
kubectl describe pod -n coditect-staging -l app=coditect-backend | grep -A 5 "Liveness:"
Success indicators:
- `Liveness: ... #success=...` (no failures)
- No "Liveness probe failed" events
5. Test from external request:
curl -H "Host: api.coditect.com" http://LOAD_BALANCER_IP/api/v1/health/live
Prevention Guidance
For Future Deployments:
- Production: Use explicit domains only
  ALLOWED_HOSTS = [
      'api.coditect.com',
      'api-staging.coditect.com',
      'coditect-backend.coditect-staging.svc.cluster.local',
  ]
- Staging: Use wildcard OR CIDR middleware
  - Wildcard for quick debugging
  - CIDR middleware for production-like testing
- Health probes: Use Service DNS
  livenessProbe:
    httpGet:
      httpHeaders:
      - name: Host
        value: service.namespace.svc.cluster.local
- Document tradeoffs in settings
  # settings/production.py
  # SECURITY WARNING: Wildcard ('*') disables Host header validation
  # Only use in staging/dev, NEVER in production
  # (filter out empty strings so an unset env var doesn't yield [''])
  ALLOWED_HOSTS = [h for h in os.environ.get('DJANGO_ALLOWED_HOSTS', '').split(',') if h]
Related ADRs:
- ADR-0008: Security Best Practices
Issue 7: Health Probe HTTPS/HTTP Mismatch
Error Symptoms
Liveness probe failed:
Get "https://10.56.1.10:8000/api/v1/health/live":
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Pod Logs:
No errors (application running correctly)
Pod Status: Not Ready despite application running
Health Probe Status: Failing with timeout
Root Cause Analysis
Primary Cause: Kubernetes health probes issuing HTTPS requests against an application that serves HTTP only. Production Django settings also redirect HTTP to HTTPS, compounding the mismatch.

Technical Details:
- Django production settings: `SECURE_SSL_REDIRECT = True`
- Application listens on HTTP port 8000 (no TLS termination)
- The `livenessProbe` ran with `scheme: HTTPS` (visible in the error: `Get "https://10.56.1.10:8000/..."`)
- The probe's TLS handshake against the plaintext port never completes
- The probe times out waiting for an HTTPS response

Why This Happened:
- Security best practice: Django enforces HTTPS in production
- TLS termination expected at the load balancer, not the application
- Probe `scheme` set to HTTPS even though the pods serve plain HTTP
- Note: the Kubernetes API defaults `scheme` to HTTP; setting it explicitly keeps the intent visible in the manifest
Architecture Context:
- Load Balancer: Terminates TLS, forwards HTTP to backend
- Backend: Serves HTTP only (trusts private network)
- Health Probes: Direct pod access (bypasses load balancer)
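As background for the probe behavior: per Kubernetes probe semantics, an httpGet probe counts any status from 200 up to (but not including) 400 as success, so the failures here were TLS handshake timeouts rather than bad status codes. A one-line sketch of that rule:

```python
def http_probe_succeeds(status_code: int) -> bool:
    """kubelet httpGet probes: any status in [200, 400) counts as success."""
    return 200 <= status_code < 400
```

Note that even a 301 redirect (as produced by `SECURE_SSL_REDIRECT`) would pass this check; the scheme mismatch fails earlier, at the connection layer.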
Complete Solution
File: deployment/kubernetes/staging/backend-deployment.yaml
Original (Broken):
livenessProbe:
  httpGet:
    path: /api/v1/health/live
    port: http # Named port (8000)
    scheme: HTTPS # Wrong: the pods serve plain HTTP on this port
  initialDelaySeconds: 30
  periodSeconds: 10
Fixed:
livenessProbe:
httpGet:
path: /api/v1/health/live
port: http
scheme: HTTP # Explicitly specify HTTP
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /api/v1/health/ready
port: http
scheme: HTTP # Also fix readiness probe
initialDelaySeconds: 20
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
Key Changes:
- Add `scheme: HTTP` to both liveness and readiness probes
- Explicit timeouts for clarity
- Consistent configuration across all probes
Step 1: Update Deployment Manifest
# Edit file
vim deployment/kubernetes/staging/backend-deployment.yaml
# Apply changes
kubectl apply -f deployment/kubernetes/staging/backend-deployment.yaml
Step 2: Verify Probe Configuration
kubectl get pod -n coditect-staging -l app=coditect-backend -o yaml | grep -A 10 "livenessProbe:"
Expected Output:
livenessProbe:
httpGet:
path: /api/v1/health/live
port: http
scheme: HTTP # Verify this line
Verification Steps
1. Watch pod status during rollout:
kubectl rollout status deployment/coditect-backend -n coditect-staging
Success indicator: deployment "coditect-backend" successfully rolled out
2. Check probe success:
kubectl describe pod -n coditect-staging -l app=coditect-backend | grep -A 10 "Liveness:"
Success indicators:
- Liveness: http-get http://:8000/api/v1/health/live
- No failure events
3. Manual probe test:
POD_IP=$(kubectl get pod -n coditect-staging -l app=coditect-backend -o jsonpath='{.items[0].status.podIP}')
curl -v http://$POD_IP:8000/api/v1/health/live
Expected: HTTP 200 response
4. Check probe failure history:
kubectl get events -n coditect-staging --field-selector involvedObject.name=coditect-backend --sort-by='.lastTimestamp' | tail -20
Success indicator: No recent "Liveness probe failed" events
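The manual probe test can also be reproduced entirely locally. The snippet below is a self-contained sketch: a stub handler stands in for the Django health view, and the client issues a probe-style GET that, like the kubelet, does not follow redirects.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stub health endpoint standing in for the Django app.
class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/v1/health/live":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# http.client does not follow redirects, so a 301/302 here would surface
# directly - just as it does for the kubelet probe.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/api/v1/health/live")
status = conn.getresponse().status
server.shutdown()
print(status)  # 200
```

If this pattern returns 301/302 against the real application, the SSL redirect is still intercepting probe traffic.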
Prevention Guidance
For Future Deployments:
1. Always specify the probe scheme explicitly

# Don't rely on defaults
livenessProbe:
  httpGet:
    scheme: HTTP  # Explicit is better than implicit

2. Document the TLS termination architecture

Client --HTTPS--> Load Balancer --HTTP--> Backend Pods
         (TLS termination)       (HTTP only)

3. Test probe endpoints independently

# Before deploying, test the probe endpoint
curl http://localhost:8000/api/v1/health/live

4. Monitor probe metrics

- Prometheus: prober_probe_success
- Grafana dashboard for health probe success rate

5. Production considerations

- If the backend must serve HTTPS directly, self-signed certificates are sufficient: Kubernetes HTTPS probes do not verify certificates
- If the backend stays on HTTP and Django is configured with SECURE_PROXY_SSL_HEADER, the probe can send the proxy header so the request is treated as already secure:

livenessProbe:
  httpGet:
    scheme: HTTP
    httpHeaders:
    - name: X-Forwarded-Proto
      value: https
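The explicit-scheme rule can also be enforced before deployment with a small manifest check. This is an illustrative sketch operating on a parsed pod spec dict (in practice the manifest YAML would be loaded first; the helper name is made up):

```python
# Hypothetical pre-deployment lint: walk a pod spec and flag httpGet
# probes that omit an explicit scheme.
def find_probes_without_scheme(pod_spec: dict) -> list[str]:
    findings = []
    for container in pod_spec.get("containers", []):
        for probe_name in ("livenessProbe", "readinessProbe", "startupProbe"):
            http_get = container.get(probe_name, {}).get("httpGet")
            if http_get is not None and "scheme" not in http_get:
                findings.append(f"{container['name']}: {probe_name} has no explicit scheme")
    return findings

# Inline example mirroring the broken/fixed probes from this issue:
pod_spec = {
    "containers": [{
        "name": "backend",
        "livenessProbe": {"httpGet": {"path": "/api/v1/health/live", "port": "http"}},
        "readinessProbe": {"httpGet": {"path": "/api/v1/health/ready", "port": "http", "scheme": "HTTP"}},
    }]
}
print(find_probes_without_scheme(pod_spec))
# ['backend: livenessProbe has no explicit scheme']
```

Wiring a check like this into CI would catch the regression before kubectl apply.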
Related ADRs:
- ADR-0008: Security Best Practices
- ADR-0005: Kubernetes Deployment Strategy
Quick Reference
Common Commands
# Check pod status
kubectl get pods -n coditect-staging -l app=coditect-backend
# View pod logs
kubectl logs -n coditect-staging -l app=coditect-backend --tail=100 -f
# Describe pod (includes events)
kubectl describe pod -n coditect-staging -l app=coditect-backend
# Execute command in pod
kubectl exec -n coditect-staging deployment/coditect-backend -- COMMAND
# Restart deployment
kubectl rollout restart deployment/coditect-backend -n coditect-staging
# Check rollout status
kubectl rollout status deployment/coditect-backend -n coditect-staging
# View ConfigMap
kubectl get configmap backend-config -n coditect-staging -o yaml
# View Secrets (keys only, not values)
kubectl describe secret backend-secrets -n coditect-staging
# Check service endpoints
kubectl get endpoints coditect-backend -n coditect-staging
Issue Decision Tree
Pod Status: ImagePullBackOff
└─> Issue 1: GCR Deprecation OR Issue 2: Platform Mismatch
Pod Status: CrashLoopBackOff
├─> Logs: "ModuleNotFoundError" → Issue 3: User Permissions
├─> Logs: "password authentication failed" → Issue 5: Database User
├─> Logs: "connection requires a valid client certificate" → Issue 4: SSL Requirement
└─> Logs: "DisallowedHost" → Issue 6: ALLOWED_HOSTS
Pod Status: Not Ready
├─> Liveness probe failed: HTTPS timeout → Issue 7: Probe HTTPS/HTTP
└─> Readiness probe failed: DB connection → Issue 4 or 5
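The decision tree above can be sketched as a small log-triage helper. The patterns mirror the error strings documented in Issues 3-6; the mapping is illustrative, not exhaustive:

```python
import re

# Map documented error signatures to the issues in this guide.
TRIAGE_RULES = [
    (r"ModuleNotFoundError", "Issue 3: User Permissions"),
    (r"password authentication failed", "Issue 5: Database User"),
    (r"connection requires a valid client certificate", "Issue 4: SSL Requirement"),
    (r"DisallowedHost", "Issue 6: ALLOWED_HOSTS"),
]

def triage(log_line: str) -> str:
    for pattern, issue in TRIAGE_RULES:
        if re.search(pattern, log_line):
            return issue
    return "Unmatched - check pod events and the decision tree"

print(triage("django.core.exceptions.DisallowedHost: Invalid HTTP_HOST header"))
# Issue 6: ALLOWED_HOSTS
```

Such a helper could be run against `kubectl logs` output during an incident to jump straight to the relevant section.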
Environment Variables Checklist
Required in backend-secrets:
- ✅ django-secret-key - Random 64-byte string
- ✅ db-name - Database name (coditect_db)
- ✅ db-user - Database user (coditect_app)
- ✅ db-password - Database password (32+ chars)
- ✅ db-host - Cloud SQL private IP
- ✅ redis-host - Redis private IP
Required in backend-config:
- ✅ DJANGO_ALLOWED_HOSTS - Comma-separated hosts
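One way to make this checklist actionable is a fail-fast guard at application startup that reports every missing required variable at once. This is a hypothetical helper; the environment variable names follow the deployment manifest in the appendix:

```python
import os

# Environment variables the backend expects, per the deployment manifest.
REQUIRED_ENV = [
    "DJANGO_SECRET_KEY", "DB_NAME", "DB_USER",
    "DB_PASSWORD", "DB_HOST", "REDIS_HOST", "DJANGO_ALLOWED_HOSTS",
]

def missing_env(environ=os.environ) -> list[str]:
    # Treat unset and empty values the same: both break startup later anyway.
    return [name for name in REQUIRED_ENV if not environ.get(name)]

# Example with a deliberately incomplete environment:
fake_env = {"DJANGO_SECRET_KEY": "x", "DB_NAME": "coditect_db"}
print(missing_env(fake_env))
# ['DB_USER', 'DB_PASSWORD', 'DB_HOST', 'REDIS_HOST', 'DJANGO_ALLOWED_HOSTS']
```

Calling a guard like this early in the settings module turns a CrashLoopBackOff mystery into a single explicit error message.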
Pre-Deployment Checklist
Before deploying to staging:
- Artifact Registry repository created
- GKE service account has the artifactregistry.reader role
- Image built with --platform linux/amd64
- Image pushed to Artifact Registry (not GCR)
- Cloud SQL instance running
- Cloud SQL user created (coditect_app)
- Cloud SQL database created (coditect_db)
- Kubernetes secrets created
- ConfigMap updated with correct ALLOWED_HOSTS
- Deployment manifest specifies scheme: HTTP in probes
- SSL requirement disabled (staging) OR Cloud SQL Proxy configured (production)
Related Documentation
Internal Documentation
- README.md - Deployment overview
- CLAUDE.md - AI agent configuration
- ADR-0005: Kubernetes Deployment Strategy
- ADR-0007: Docker Container Strategy
- ADR-0008: Security Best Practices
- ADR-0009: Database Architecture
External References
- Google Artifact Registry Migration
- Docker Buildx Multi-Platform
- Cloud SQL Proxy
- Django ALLOWED_HOSTS
- Kubernetes Health Probes
Appendix: Full Working Configuration
Dockerfile (Final)
File: Dockerfile
# CODITECT Cloud Backend - Production Dockerfile
# Django 5.2.8 Backend with Multi-Stage Build
# Python 3.12.12 for protobuf compatibility
# Stage 1: Builder - Install dependencies and collect static files
FROM python:3.12.12-slim-bookworm AS builder
LABEL maintainer="AZ1.AI INC <engineering@az1.ai>"
LABEL description="CODITECT Cloud Backend - Django 5.2.8 API Server"
WORKDIR /app
# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
g++ \
make \
postgresql-client \
libpq-dev \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
# Copy application code
COPY . .
# Collect static files (production settings expect environment variables)
ENV DJANGO_SETTINGS_MODULE=license_platform.settings.production
ENV DJANGO_SECRET_KEY=temp-build-key
ENV DB_NAME=build
ENV DB_USER=build
ENV DB_PASSWORD=build
ENV DB_HOST=localhost
RUN python manage.py collectstatic --noinput --clear
# Stage 2: Runtime - Minimal production image
FROM python:3.12.12-slim-bookworm
WORKDIR /app
# Install runtime dependencies only
RUN apt-get update && apt-get install -y --no-install-recommends \
postgresql-client \
libpq-dev \
curl \
&& rm -rf /var/lib/apt/lists/*
# Create non-root user first
RUN useradd -m -u 1000 django
# Copy Python packages from builder to user directory
COPY --from=builder /root/.local /home/django/.local
# Copy application code from builder
COPY --from=builder /app /app
# Set ownership
RUN chown -R django:django /app /home/django/.local
# Make sure scripts in .local are usable
ENV PATH=/home/django/.local/bin:$PATH
USER django
# Expose port
EXPOSE 8000
# Health check endpoint
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
CMD curl -f http://localhost:8000/api/v1/health/ || exit 1
# Run with gunicorn for production
CMD ["gunicorn", \
"--bind", "0.0.0.0:8000", \
"--workers", "4", \
"--worker-class", "sync", \
"--timeout", "60", \
"--access-logfile", "-", \
"--error-logfile", "-", \
"--log-level", "info", \
"license_platform.wsgi:application"]
Deployment Manifest (Final)
File: deployment/kubernetes/staging/backend-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: coditect-backend
namespace: coditect-staging
labels:
app: coditect-backend
environment: staging
version: v1.0.0
spec:
replicas: 2
revisionHistoryLimit: 3
selector:
matchLabels:
app: coditect-backend
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
template:
metadata:
labels:
app: coditect-backend
environment: staging
version: v1.0.0
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8000"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: coditect-cloud-backend
imagePullSecrets:
- name: gcr-json-key
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
containers:
- name: backend
image: us-central1-docker.pkg.dev/coditect-cloud-infra/coditect-backend/coditect-cloud-backend:v1.0.0-staging
imagePullPolicy: Always
ports:
- name: http
containerPort: 8000
protocol: TCP
env:
- name: DJANGO_SETTINGS_MODULE
value: "license_platform.settings.production"
- name: DJANGO_SECRET_KEY
valueFrom:
secretKeyRef:
name: backend-secrets
key: django-secret-key
- name: DB_NAME
valueFrom:
secretKeyRef:
name: backend-secrets
key: db-name
- name: DB_USER
valueFrom:
secretKeyRef:
name: backend-secrets
key: db-user
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: backend-secrets
key: db-password
- name: DB_HOST
valueFrom:
secretKeyRef:
name: backend-secrets
key: db-host
- name: DB_PORT
value: "5432"
- name: REDIS_HOST
valueFrom:
secretKeyRef:
name: backend-secrets
key: redis-host
- name: REDIS_PORT
value: "6379"
- name: CLOUD_KMS_PROJECT_ID
value: "coditect-prod-563272"
- name: CLOUD_KMS_LOCATION
value: "us-central1"
- name: CLOUD_KMS_KEYRING
value: "license-signing-keyring"
- name: CLOUD_KMS_KEY
value: "license-signing-key"
- name: GCP_PROJECT_ID
value: "coditect-prod-563272"
- name: ENVIRONMENT
value: "staging"
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /api/v1/health/live
port: http
scheme: HTTP # Critical: Explicit HTTP scheme
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /api/v1/health/ready
port: http
scheme: HTTP # Critical: Explicit HTTP scheme
initialDelaySeconds: 20
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: false
capabilities:
drop:
- ALL
terminationGracePeriodSeconds: 30
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- coditect-backend
topologyKey: kubernetes.io/hostname
ConfigMap (Final)
File: deployment/kubernetes/staging/backend-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: backend-config
namespace: coditect-staging
labels:
app: coditect-backend
environment: staging
data:
# Staging: Wildcard for convenience (NOT for production!)
DJANGO_ALLOWED_HOSTS: "*"
# Alternative (production-ready):
# DJANGO_ALLOWED_HOSTS: "10.56.0.0/16,coditect-backend.coditect-staging.svc.cluster.local,*.coditect.com,localhost"
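For reference, a settings module typically turns the comma-separated value into Django's ALLOWED_HOSTS list along these lines (a sketch; the project's actual settings file may differ). Note that Django matches exact hostnames and *.domain wildcards, not CIDR ranges, so an entry like 10.56.0.0/16 would never match a request:

```python
# Sketch of parsing DJANGO_ALLOWED_HOSTS from the environment into the
# list Django expects; tolerates whitespace and trailing commas.
def parse_allowed_hosts(raw: str) -> list[str]:
    return [host.strip() for host in raw.split(",") if host.strip()]

print(parse_allowed_hosts("*"))
# ['*']
print(parse_allowed_hosts("coditect-backend.coditect-staging.svc.cluster.local, *.coditect.com,"))
# ['coditect-backend.coditect-staging.svc.cluster.local', '*.coditect.com']
```

With this parsing in place, the staging wildcard and a production host list can share one ConfigMap key.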
Build and Deploy Script
File: scripts/deploy-staging.sh
#!/bin/bash
set -euo pipefail
PROJECT_ID="coditect-cloud-infra"
REGION="us-central1"
IMAGE_REPO="coditect-backend"
IMAGE_NAME="coditect-cloud-backend"
VERSION="${1:-v1.0.0-staging}"
echo "Building image for linux/amd64..."
docker buildx build \
--platform linux/amd64 \
--tag ${REGION}-docker.pkg.dev/${PROJECT_ID}/${IMAGE_REPO}/${IMAGE_NAME}:${VERSION} \
--push \
.
echo "Applying Kubernetes manifests..."
kubectl apply -f deployment/kubernetes/staging/namespace.yaml
kubectl apply -f deployment/kubernetes/staging/backend-config.yaml
kubectl apply -f deployment/kubernetes/staging/backend-deployment.yaml
kubectl apply -f deployment/kubernetes/staging/backend-service.yaml
echo "Waiting for rollout..."
kubectl rollout status deployment/coditect-backend -n coditect-staging
echo "Deployment complete!"
kubectl get pods -n coditect-staging -l app=coditect-backend
Document Status: Complete Last Validated: December 1, 2025 Next Review: January 1, 2026 Owner: AZ1.AI INC Engineering Team