Treating Evals as Non-Negotiable Constraints
Bridges have load limits. The sign says 10 tons, not "10 tons unless you're in a hurry" or "10 tons but we trust your judgment." The limit exists because exceeding it risks collapse. No negotiation, no exceptions.
Evals should work the same way. If your model fails a safety eval, you don't ship. If your optimization breaks a quality threshold, you revert. Evals aren't aspirational targets. They're constraints that define what's acceptable.
Evals as System Invariants
class EvalInvariant:
"""
Evals that must never be violated
"""
def __init__(self, name: str, threshold: float, critical: bool = True):
self.name = name
self.threshold = threshold
self.critical = critical
        self.no_exceptions = critical  # critical invariants admit no overrides
def check(self, score: float) -> dict:
passed = score >= self.threshold
if not passed and self.critical:
return {
"passed": False,
"action": "BLOCK_DEPLOYMENT",
"message": f"{self.name} failed: {score} < {self.threshold}",
"negotiable": False,
}
return {"passed": passed, "score": score}
# Define your invariants
INVARIANTS = [
EvalInvariant("safety_refusal", 0.99, critical=True),
EvalInvariant("no_pii_generation", 1.00, critical=True),
EvalInvariant("factual_accuracy", 0.90, critical=True),
EvalInvariant("instruction_following", 0.95, critical=True),
]
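Wiring these invariants into a release gate is then mechanical. Here is a minimal sketch, assuming eval scores arrive as a name-to-score mapping (the gate_deployment helper and the example scores are illustrative, not a fixed API):

def gate_deployment(eval_scores: dict) -> bool:
    """Return True only if every invariant passes."""
    for invariant in INVARIANTS:
        result = invariant.check(eval_scores[invariant.name])
        if not result["passed"]:
            # One failure blocks the whole release; no weighing, no averaging
            print(result.get("message", f"{invariant.name} below threshold"))
            return False
    return True

# A release that fails the factual_accuracy invariant is blocked outright
scores = {
    "safety_refusal": 0.995,
    "no_pii_generation": 1.00,
    "factual_accuracy": 0.88,  # below the 0.90 floor
    "instruction_following": 0.97,
}
assert gate_deployment(scores) is False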
Why Flexibility Is Dangerous
def case_study_flexible_evals():
"""
What happens when evals become negotiable
"""
return {
"scenario": """
Team ships optimization that improves latency 20%
but drops quality eval from 92% to 88%.
Manager says: "It's only 4 points, and latency matters."
Team ships anyway.
""",
"week_1": "Users complain about answer quality.",
"week_2": "Support tickets up 30%.",
"week_3": "Another team ships similar optimization, quality drops to 85%.",
"week_4": "Quality is now 'known issue', threshold adjusted to 85%.",
"month_3": "Quality at 80%, new normal.",
"lesson": """
Each exception makes the next easier.
Thresholds erode incrementally.
        By the time you notice, quality is far below the original standard.
""",
}
The Invariant Mindset
def invariant_mindset():
return {
"definition": """
An invariant is a property that must always be true.
Not 'usually true' or 'true when convenient'.
Always.
""",
"examples_in_code": {
"null_safety": "Function never returns null",
"type_safety": "Variable always has expected type",
"thread_safety": "Shared state always consistent",
},
"examples_in_ml": {
"safety": "Model never provides harmful instructions",
"accuracy": "Factual claims are verifiable",
"reliability": "Same input produces consistent output",
},
"enforcement": """
class InvariantEnforcer:
def check_before_deploy(self, model):
for invariant in self.invariants:
result = invariant.check(model)
if not result.passed:
# Not 'log and continue'
# Not 'alert and maybe fix later'
raise DeploymentBlockedError(result.message)
return DeploymentApproved()
""",
}
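The enforcement snippet above is deliberately schematic. A runnable version against the EvalInvariant class from earlier might look like this (DeploymentBlockedError is carried over from the sketch; it is not a standard exception):

class DeploymentBlockedError(Exception):
    """Raised when a critical invariant fails. Not caught and ignored."""

class InvariantEnforcer:
    def __init__(self, invariants: list):
        self.invariants = invariants

    def check_before_deploy(self, eval_scores: dict) -> bool:
        for invariant in self.invariants:
            result = invariant.check(eval_scores[invariant.name])
            if not result["passed"]:
                # Not 'log and continue'. Not 'alert and maybe fix later'.
                raise DeploymentBlockedError(
                    result.get("message", f"{invariant.name} failed")
                )
        return True  # deployment approved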
Setting the Right Thresholds
def threshold_setting():
return {
"too_high": {
"problem": "Nothing ever ships",
"symptom": "Teams always requesting exceptions",
"solution": "Threshold should be achievable by good work",
},
"too_low": {
"problem": "Bad quality ships regularly",
"symptom": "Evals always pass, users still complain",
"solution": "Threshold should catch real problems",
},
"process": """
1. Collect baseline: What does current production score?
2. Set floor: What's the minimum acceptable?
3. Set target: What should we aim for?
4. Invariant = floor, not target
Example:
- Current production: 93%
- Target: 95%
- Invariant floor: 90%
Changes scoring < 90% are blocked.
Changes scoring 90-95% ship with monitoring.
Changes scoring > 95% are celebrated.
""",
"periodic_review": """
Review thresholds quarterly:
- Is the floor still appropriate?
- Has the baseline improved?
- Should floor rise with baseline?
""",
}
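The floor/target split translates into a three-way decision. A sketch using the example numbers above (the tier names are illustrative):

FLOOR = 0.90   # the invariant: below this, a change cannot ship
TARGET = 0.95  # the aspiration: at or above this, celebrate

def classify_change(score: float) -> str:
    if score < FLOOR:
        return "BLOCKED"  # invariant violated
    if score < TARGET:
        return "SHIP_WITH_MONITORING"  # acceptable, watched closely
    return "CELEBRATED"  # above target

assert classify_change(0.88) == "BLOCKED"
assert classify_change(0.93) == "SHIP_WITH_MONITORING"
assert classify_change(0.96) == "CELEBRATED"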
Handling Conflicts
def handling_conflicts():
return {
"optimization_vs_quality": {
"situation": "Faster model fails quality eval",
"wrong": "Lower quality threshold to ship speed",
"right": "Find optimization that maintains quality",
"principle": "Quality is the constraint, optimize within it",
},
"cost_vs_quality": {
"situation": "Cheaper model fails quality eval",
"wrong": "Accept lower quality for cost savings",
"right": "Either pay more or find different cost reduction",
"principle": "Cost is a goal, quality is a constraint",
},
"deadline_vs_quality": {
"situation": "Ship date approaching, eval still failing",
"wrong": "Ship anyway, fix later",
"right": "Delay ship date or reduce scope",
"principle": "Deadlines can move, invariants cannot",
},
"implementation": """
def decide_conflict(change, eval_result, constraint):
if not eval_result.passed:
# No negotiation
return Decision(
action="BLOCKED",
message=f"Cannot ship: {constraint.name} failed",
alternatives=[
"Improve change to pass eval",
"Reduce scope to maintain quality",
"Delay until eval passes",
]
)
""",
}
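The decide_conflict snippet assumes a Decision type; a self-contained version can be this small (the APPROVED branch is added here for completeness):

from dataclasses import dataclass, field

@dataclass
class Decision:
    action: str
    message: str
    alternatives: list = field(default_factory=list)

def decide_conflict(eval_result: dict, constraint: EvalInvariant) -> Decision:
    if not eval_result["passed"]:
        # No negotiation: every alternative leaves the invariant intact
        return Decision(
            action="BLOCKED",
            message=f"Cannot ship: {constraint.name} failed",
            alternatives=[
                "Improve change to pass eval",
                "Reduce scope to maintain quality",
                "Delay until eval passes",
            ],
        )
    return Decision(action="APPROVED", message="All constraints satisfied")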
Building Organizational Discipline
def organizational_discipline():
return {
"leadership_commitment": [
"No exceptions granted from above",
"No pressure to ship failing changes",
"Public support when teams delay for quality",
],
"team_norms": [
"Write evals before features",
"Run evals before PRs",
"Review eval failures, not just code",
],
"process_enforcement": [
"CI blocks merge on eval failure",
"No manual override without documented justification",
"Override audit trail reviewed weekly",
],
"cultural_reinforcement": """
When someone ships and eval fails:
- Roll back immediately
- Root cause analysis
- Process improvement
- No blame, but no acceptance either
When someone catches a near-miss:
- Public recognition
- Share learnings
- Strengthen evals based on the case
""",
}
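"CI blocks merge on eval failure" can be a script whose exit code fails the pipeline. A sketch, assuming an earlier CI step writes scores to a JSON file (the eval_results.json path is hypothetical) and reusing the INVARIANTS list from the top of the post:

import json
import sys

def main() -> int:
    with open("eval_results.json") as f:
        scores = json.load(f)  # e.g. {"safety_refusal": 0.995, ...}
    failures = [
        inv.name
        for inv in INVARIANTS
        if not inv.check(scores[inv.name])["passed"]
    ]
    if failures:
        print(f"Merge blocked by invariant failures: {failures}")
        return 1  # nonzero exit fails the CI job, blocking the merge
    return 0

if __name__ == "__main__":
    sys.exit(main())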
Monitoring Invariant Health
class InvariantMonitor:
"""
Track how close we are to our constraints
"""
def health_check(self) -> dict:
results = {}
for invariant in self.invariants:
current_score = self.measure_current(invariant)
margin = current_score - invariant.threshold
results[invariant.name] = {
"current": current_score,
"threshold": invariant.threshold,
"margin": margin,
"health": self.classify_health(margin),
}
return results
def classify_health(self, margin: float) -> str:
if margin > 0.10:
return "healthy" # Plenty of room
elif margin > 0.05:
return "acceptable" # Comfortable
elif margin > 0.02:
return "warning" # Getting close
else:
return "critical" # Near violation
def trend_analysis(self, window_days: int = 30) -> dict:
"""Detect if we're drifting toward invariant violation"""
history = self.get_history(window_days)
trends = {}
for invariant in self.invariants:
scores = [h[invariant.name] for h in history]
            slope = self.calculate_trend(scores)  # helper defined below
if slope < -0.001: # Declining
days_to_violation = (scores[-1] - invariant.threshold) / abs(slope)
trends[invariant.name] = {
"direction": "declining",
"slope": slope,
"days_to_violation": days_to_violation,
"alert": days_to_violation < 30,
}
return trends
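calculate_trend is left undefined above. One reasonable choice is an ordinary least-squares slope over the score series, written here as a method that slots into InvariantMonitor; assuming one history entry per day, the slope comes out in score units per day, which is what days_to_violation expects:

    def calculate_trend(self, scores: list) -> float:
        """Least-squares slope of score vs. day index (score units per day)."""
        n = len(scores)
        if n < 2:
            return 0.0
        mean_x = (n - 1) / 2  # mean of the day indices 0..n-1
        mean_y = sum(scores) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
        var = sum((x - mean_x) ** 2 for x in range(n))
        return cov / var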
Exception Process (When Necessary)
def exception_process():
"""
Exceptions should exist but be painful
"""
return {
"when_allowed": [
"Critical security fix that temporarily breaks other eval",
"External dependency change requiring immediate response",
"Never for convenience or deadlines",
],
"requirements": [
"Written justification from senior engineer",
"Explicit time limit (< 1 week)",
"Mitigation plan to restore compliance",
"Sign-off from eval owner",
],
"tracking": """
Every exception is:
- Logged with full context
- Reviewed at weekly team meeting
- Counted against team metrics
- Closed with resolution documented
""",
"warning_signs": [
"> 2 exceptions per month",
"Same invariant excepted repeatedly",
"Exceptions extending beyond time limit",
"Exceptions becoming routine",
],
}
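To make exceptions auditable rather than informal, each one can be a record with a hard expiry. A minimal sketch (the field names are illustrative):

from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class InvariantException:
    invariant_name: str
    justification: str     # written justification from a senior engineer
    approved_by: str       # sign-off from the eval owner
    granted: date
    max_days: int = 7      # explicit time limit (< 1 week)

    @property
    def expires(self) -> date:
        return self.granted + timedelta(days=self.max_days)

    def is_expired(self, today: Optional[date] = None) -> bool:
        # An exception still open past its expiry is itself a warning sign
        return (today or date.today()) > self.expires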
Invariants are the difference between quality as aspiration and quality as guarantee. When evals are negotiable, quality erodes. When evals are invariants, quality is structural. The constraint isn't comfortable, but that discomfort is what maintains standards.