
The Checklist Before You Deploy

Pilots don't skip the preflight checklist just because they've flown a thousand times. The checklist exists because human memory fails under pressure: the more routine something becomes, the more likely you are to miss something obvious.

LLM deployments deserve the same discipline. Here's the checklist that separates smooth launches from 3am incidents.

The Pre-Deploy Checklist

class PreDeploymentChecklist:
    """
    Go through every item before production deploy
    """

    infrastructure = [
        {
            "item": "1. Memory headroom verified",
            "check": "KV cache won't OOM at max concurrent requests",
            "how": "Load test at 2x expected peak",
            "failure_mode": "OOM crash under load",
        },
        {
            "item": "2. GPU memory limits set",
            "check": "gpu_memory_utilization < 95%",
            "how": "Leave headroom for spikes",
            "failure_mode": "Fragmentation causes OOM",
        },
        {
            "item": "3. Health checks configured",
            "check": "Endpoint returns unhealthy before OOM",
            "how": "Test health check under memory pressure",
            "failure_mode": "Load balancer routes to dying instance",
        },
        {
            "item": "4. Autoscaling tested",
            "check": "New instances come up before overload",
            "how": "Simulate traffic spike, measure time to scale",
            "failure_mode": "Cascade failure during traffic spike",
        },
    ]

    reliability = [
        {
            "item": "5. Timeouts configured",
            "check": "Request timeout < user patience",
            "how": "Typically 30-60s for generation",
            "failure_mode": "Stuck requests consume resources",
        },
        {
            "item": "6. Rate limiting enabled",
            "check": "Per-user and per-endpoint limits",
            "how": "Test that limits trigger correctly",
            "failure_mode": "One user DOSes everyone",
        },
        {
            "item": "7. Retry logic has backoff",
            "check": "Exponential backoff, max retries",
            "how": "Simulate failures, verify backoff",
            "failure_mode": "Retry storms amplify outages",
        },
        {
            "item": "8. Graceful degradation configured",
            "check": "Fallback behavior when overloaded",
            "how": "Test what happens at 2x capacity",
            "failure_mode": "Complete failure instead of degraded",
        },
    ]

    observability = [
        {
            "item": "9. Latency metrics enabled",
            "check": "TTFT, generation time, P99 tracked",
            "how": "Verify metrics appear in dashboard",
            "failure_mode": "Blind to performance problems",
        },
        {
            "item": "10. Error tracking configured",
            "check": "Errors captured with context",
            "how": "Trigger an error, verify capture",
            "failure_mode": "Don't know what's failing",
        },
        {
            "item": "11. Alerts set up",
            "check": "P99 latency, error rate, memory alerts",
            "how": "Test alerts fire correctly",
            "failure_mode": "Problems discovered by users",
        },
        {
            "item": "12. Cost tracking enabled",
            "check": "Token usage per user/feature tracked",
            "how": "Verify cost attribution works",
            "failure_mode": "Surprise bills, can't optimize",
        },
    ]
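
If item 7 sounds abstract, here is roughly what the check looks like in code. A minimal sketch, not a prescription: call_model is a stand-in for whatever zero-argument callable hits your inference endpoint, and the delay defaults are illustrative.

import random
import time


def call_with_backoff(call_model, max_retries=4, base_delay=0.5, max_delay=8.0):
    """Retry a transiently failing call with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call_model()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the error instead of looping forever
            # Exponential backoff with full jitter so clients don't retry in lockstep
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

The jitter is the part that matters under load: without it, thousands of clients retry in sync and a brief blip becomes exactly the retry storm the failure mode warns about.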

The Detailed Checks

def memory_check_details():
    """
    Item 1: Memory headroom verified
    """
    return {
        "test_procedure": """
        1. Calculate max concurrent requests for SLA
        2. Estimate KV cache per request (2 [K and V] × layers × kv heads × head dim × tokens × bytes per element)
        3. Add 20% safety margin
        4. Verify model weights + total KV cache < available GPU memory × 0.9
        5. Load test to confirm
        """,

        "common_mistakes": [
            "Calculating for average, not max tokens",
            "Forgetting about model weights in memory",
            "Not testing with production-like prompts",
        ],

        "verification": """
        nvidia-smi during load test should show:
        - Memory usage stable (not growing)
        - No OOM errors in logs
        - Response times stable at load
        """,
    }
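
To make step 2 concrete, here is the same estimate as a short script. The model dimensions and concurrency are placeholders for a roughly 7B-class model in fp16; substitute your own config before trusting the numbers.

def kv_cache_bytes_per_request(max_tokens, num_layers, num_kv_heads, head_dim,
                               bytes_per_elem=2):
    """2 (K and V) × layers × kv heads × head dim × tokens × bytes per element."""
    return 2 * num_layers * num_kv_heads * head_dim * max_tokens * bytes_per_elem


# Illustrative dimensions only -- plug in your model's real config.
per_request = kv_cache_bytes_per_request(
    max_tokens=4096, num_layers=32, num_kv_heads=32, head_dim=128
)
max_concurrent = 16
total_gib = per_request * max_concurrent * 1.2 / 2**30  # 20% safety margin
print(f"~{per_request / 2**30:.1f} GiB per request, "
      f"~{total_gib:.0f} GiB of KV cache at {max_concurrent} concurrent")

Add the model weights on top of that total before comparing against the 0.9 × GPU memory budget.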


def timeout_check_details():
    """
    Item 5: Timeouts configured
    """
    return {
        "layers_to_check": [
            "Load balancer timeout",
            "Reverse proxy timeout",
            "API gateway timeout",
            "Application timeout",
            "Model inference timeout",
        ],

        "common_mistake": """
        Setting app timeout to 60s but load balancer to 30s.
        Result: LB kills request, app keeps processing.
        All timeouts should be coordinated.
        """,

        "recommended_values": {
            "load_balancer": 65,  # seconds
            "api_gateway": 60,
            "application": 55,
            "inference": 50,
        },
    }
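
One way to keep the layers coordinated is to assert the ordering at startup instead of trusting four separate configs to stay in sync. A minimal sketch using the recommended values above; the layer names are just keys in a config dict, not any particular framework's settings.

def validate_timeouts(timeouts: dict) -> None:
    """Fail fast if any inner layer can outlive the layer that wraps it.

    Layers are listed outermost first; each inner timeout must be strictly
    smaller, so inner layers time out and return a clean error before the
    outer layer severs the connection.
    """
    order = ["load_balancer", "api_gateway", "application", "inference"]
    for outer, inner in zip(order, order[1:]):
        if timeouts[inner] >= timeouts[outer]:
            raise ValueError(
                f"{inner} timeout ({timeouts[inner]}s) must be shorter "
                f"than {outer} timeout ({timeouts[outer]}s)"
            )


validate_timeouts({"load_balancer": 65, "api_gateway": 60,
                   "application": 55, "inference": 50})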

The One-Page Summary

def quick_checklist() -> list:
    """
    Return the items; print them and check off each one before deploy
    """
    return [
        "[ ] 1. Memory: KV cache fits at max concurrent",
        "[ ] 2. Memory: GPU utilization < 95%",
        "[ ] 3. Health: Check fails before OOM",
        "[ ] 4. Scale: Autoscaling tested under load",
        "[ ] 5. Timeout: All layers configured, coordinated",
        "[ ] 6. Limits: Rate limiting per user enabled",
        "[ ] 7. Retry: Exponential backoff configured",
        "[ ] 8. Fallback: Graceful degradation tested",
        "[ ] 9. Metrics: Latency percentiles tracked",
        "[ ] 10. Errors: Errors captured with context",
        "[ ] 11. Alerts: P99, error rate, memory alerts set",
        "[ ] 12. Cost: Token usage tracked by user/feature",
    ]

The Common Failures

def failure_case_studies():
    return {
        "case_1": {
            "symptom": "Works fine, then OOM crash",
            "root_cause": "Didn't test with long outputs",
            "checklist_item": "#1 - Memory headroom",
            "fix": "Test with max_tokens, not average",
        },
        "case_2": {
            "symptom": "Latency spikes, no alerts",
            "root_cause": "Only monitoring average, not P99",
            "checklist_item": "#9, #11 - Latency metrics, alerts",
            "fix": "Alert on P99 latency threshold",
        },
        "case_3": {
            "symptom": "One user's bug DOSed the service",
            "root_cause": "No rate limiting",
            "checklist_item": "#6 - Rate limiting",
            "fix": "Per-user token limits",
        },
        "case_4": {
            "symptom": "$50K bill from runaway feature",
            "root_cause": "No per-feature cost tracking",
            "checklist_item": "#12 - Cost tracking",
            "fix": "Tag requests with feature_id",
        },
    }
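
Case 4 is the cheapest one to prevent. A sketch of per-feature cost attribution, assuming a hypothetical feature_id tag on every request and placeholder per-token prices; swap in your provider's real rates and your own storage backend.

from collections import defaultdict

# Placeholder prices in USD per 1K tokens -- substitute your provider's rates.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}

usage_by_feature = defaultdict(lambda: {"input": 0, "output": 0})


def record_usage(feature_id: str, input_tokens: int, output_tokens: int) -> None:
    """Tag every request with a feature_id and accumulate its token counts."""
    usage_by_feature[feature_id]["input"] += input_tokens
    usage_by_feature[feature_id]["output"] += output_tokens


def cost_report() -> dict:
    """Roll token counts up into dollars per feature."""
    return {
        feature: sum(tokens / 1000 * PRICE_PER_1K[kind]
                     for kind, tokens in counts.items())
        for feature, counts in usage_by_feature.items()
    }


record_usage("search_summary", input_tokens=1200, output_tokens=300)
record_usage("autocomplete", input_tokens=80, output_tokens=20)
print(cost_report())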

The Deployment Ritual

def deployment_ritual():
    """
    The process, not just the checklist
    """
    return {
        "before": [
            "Review checklist with another engineer",
            "Run load test in staging",
            "Verify rollback procedure works",
            "Check on-call schedule",
        ],
        "during": [
            "Deploy to 1% traffic first (canary)",
            "Watch metrics for 15 minutes",
            "Compare to baseline",
            "Expand to 10%, then 50%, then 100%",
        ],
        "after": [
            "Monitor for 1 hour",
            "Check error rates, latency, cost",
            "Document any issues found",
            "Update checklist if gaps found",
        ],
    }
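
The canary steps are easier to respect when the expand-or-roll-back decision is mechanical. A minimal gate, assuming you can pull P99 latency and error rate for both canary and baseline from your metrics store; the thresholds are placeholders to tune, not recommendations.

def canary_ok(canary: dict, baseline: dict,
              max_latency_ratio=1.2, max_error_rate_delta=0.005) -> bool:
    """Return True if the canary is close enough to baseline to expand.

    Illustrative thresholds: allow up to 20% worse P99 latency and at most
    0.5 percentage points more errors than the baseline.
    """
    latency_ok = canary["p99_latency_s"] <= baseline["p99_latency_s"] * max_latency_ratio
    errors_ok = canary["error_rate"] <= baseline["error_rate"] + max_error_rate_delta
    return latency_ok and errors_ok


# Expand 1% -> 10% -> 50% -> 100% only while the gate stays green.
if not canary_ok({"p99_latency_s": 2.4, "error_rate": 0.004},
                 {"p99_latency_s": 2.1, "error_rate": 0.002}):
    print("Roll back the canary")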

The checklist isn't overhead—it's insurance. Each item exists because someone learned the hard way. Spend 30 minutes on the checklist, save 8 hours of incident response.