Tensor vs Pipeline Parallelism: When Each Wins
Assembly lines and parallel workstations solve different problems. An assembly line processes items in sequence through specialized stages. Parallel workstations run the same complete process simultaneously. Both increase total output, but they optimize for different constraints.
Tensor parallelism is the parallel workstation: split each operation across GPUs so it completes faster. Pipeline parallelism is the assembly line: split the model into stages so more requests process concurrently. Your bottleneck determines which strategy wins.
The Two Strategies
def parallelism_strategies():
    return {
        "tensor_parallelism": {
            "how_it_works": "Split each layer horizontally across GPUs",
            "example": "Attention heads 0-31 on GPU0, 32-63 on GPU1",
            "communication": "All-reduce after every layer",
            "latency": "Lower (same request processes faster)",
            "throughput": "Higher tokens/sec for a single request",
            "memory": "Weights and KV cache split across GPUs",
        },
        "pipeline_parallelism": {
            "how_it_works": "Split model vertically: layers 0-39 on GPU0, 40-79 on GPU1",
            "example": "First half of model on GPU0, second half on GPU1",
            "communication": "Point-to-point between stages",
            "latency": "Roughly the same as a single GPU (plus inter-stage transfers and bubble time)",
            "throughput": "Higher when pipeline is filled",
            "memory": "Each GPU holds a subset of layers",
        },
    }
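To make the two splits concrete, here is a toy single-process sketch in which numpy array shards stand in for GPUs; no real communication happens, but the sum and the stage hand-off mark where the all-reduce and the point-to-point transfer would sit.

import numpy as np

# Toy 2-layer MLP; numpy "shards" stand in for GPUs (illustrative, no real comm).
rng = np.random.default_rng(0)
d = 512
x = rng.normal(size=(1, d))
w1 = rng.normal(size=(d, d))
w2 = rng.normal(size=(d, d))

# Tensor parallelism (TP=2): split each layer's weights across both "GPUs".
w1_a, w1_b = np.hsplit(w1, 2)        # layer 1 column-split (like heads 0-31 vs 32-63)
w2_a, w2_b = np.vsplit(w2, 2)        # layer 2 row-split to match
h_a = np.maximum(x @ w1_a, 0)        # GPU0 computes its half of layer 1
h_b = np.maximum(x @ w1_b, 0)        # GPU1 computes its half of layer 1
y_tp = h_a @ w2_a + h_b @ w2_b       # the "+" is where the all-reduce happens

# Pipeline parallelism (PP=2): whole layers live on different "GPUs".
h = np.maximum(x @ w1, 0)            # stage 0 (GPU0): all of layer 1
y_pp = h @ w2                        # stage 1 (GPU1): all of layer 2, after a p2p send

assert np.allclose(y_tp, y_pp)       # same math, different partitioning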
When Tensor Parallelism Wins
def tensor_parallelism_advantages():
    return {
        "best_for": [
            "Latency-sensitive applications",
            "Interactive chat",
            "Real-time inference",
            "Single requests that need to be fast",
        ],
        "why_it_wins": {
            "lower_per_request_latency": """
                Each layer computes in parallel across GPUs.
                A layer that takes 10ms on 1 GPU ideally takes 5ms on 2;
                the per-layer all-reduce claws part of that back.
                Total latency scales down with GPU count, sub-linearly.
            """,
            "simpler_batching": """
                Batch processes together, same as single GPU.
                No pipeline bubbles or micro-batching needed.
            """,
        },
        "requirements": {
            "interconnect": "Fast interconnect (NVLink) essential",
            "gpu_count": "Usually <= 8 GPUs (within single node)",
            "batch_size": "Works with any batch size",
        },
        "example_latencies": """
            70B model, single request, 100 output tokens:
            1 GPU: 250ms
            2 GPU (TP=2): 145ms (1.7x faster)
            4 GPU (TP=4): 85ms (2.9x faster)
            8 GPU (TP=8): 55ms (4.5x faster)
        """,
    }
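The sub-linear scaling above falls out of a simple model: compute time shrinks with the TP degree while every layer still pays a fixed all-reduce cost. A back-of-the-envelope sketch, with illustrative constants (assumed values, not benchmarks):

def estimate_tp_latency_ms(tp: int,
                           base_ms_per_token: float = 2.5,   # assumed single-GPU decode cost
                           n_layers: int = 80,               # 70B-class model
                           allreduce_base_us: float = 2.0,   # assumed NVLink all-reduce cost
                           output_tokens: int = 100) -> float:
    compute = base_ms_per_token / tp
    # Megatron-style TP does two all-reduces per layer; ring all-reduce volume
    # scales with (tp - 1) / tp, so communication grows slightly with TP degree.
    comm = n_layers * 2 * allreduce_base_us * (tp - 1) / tp / 1000 if tp > 1 else 0.0
    return (compute + comm) * output_tokens

for tp in (1, 2, 4, 8):
    print(f"TP={tp}: ~{estimate_tp_latency_ms(tp):.0f}ms for 100 tokens")
# Prints roughly 250 / 141 / 86 / 59 ms: the same sub-linear curve as the example
# latencies above, because fixed per-layer communication keeps 8 GPUs from being 8x faster.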
When Pipeline Parallelism Wins
def pipeline_parallelism_advantages():
    return {
        "best_for": [
            "Throughput-maximizing batch processing",
            "Very large models (100B+)",
            "Multi-node deployments",
            "Cost efficiency at scale",
        ],
        "why_it_wins": {
            "higher_throughput": """
                Multiple requests in flight simultaneously.
                While GPU1 processes request A's second half,
                GPU0 processes request B's first half.
            """,
            "lower_communication": """
                Only point-to-point between stages.
                No all-reduce across all GPUs.
                Works better over slower interconnects.
            """,
            "scales_across_nodes": """
                Stages can be on different machines.
                Only inter-stage communication crosses the network.
            """,
        },
        "requirements": {
            "micro_batching": "Need multiple requests to fill the pipeline",
            "pipeline_bubbles": "Some inefficiency during ramp-up/down",
            "careful_balancing": "Stages should have equal compute",
        },
        "example_throughput": """
            70B model, continuous stream of requests:
            1 GPU: 50 tokens/sec
            2 GPU (PP=2): 90 tokens/sec (1.8x)
            4 GPU (PP=4): 160 tokens/sec (3.2x)
            Note: individual request latency is unchanged or slightly worse,
            but total system throughput is much higher.
        """,
    }
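The fill requirement and the bubble overhead can be quantified with the standard GPipe-style bubble formula: with p pipeline stages and m micro-batches in flight, useful work is m / (m + p - 1) of the ideal. A quick sketch:

def pipeline_efficiency(stages: int, micro_batches: int) -> float:
    # Fraction of ideal throughput once ramp-up/ramp-down bubbles are counted.
    return micro_batches / (micro_batches + stages - 1)

for m in (1, 4, 8, 32):
    print(f"PP=4 with {m:>2} micro-batches: {pipeline_efficiency(4, m):.0%} of ideal")
# 1 micro-batch   -> 25% (the pipeline never fills; see "Common Mistakes" below)
# 8 micro-batches -> 73%
# 32 micro-batches -> 91% (bubbles amortize as the request stream deepens)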
The Hybrid Approach
def hybrid_parallelism():
    """
    Combine TP and PP for the best of both
    """
    return {
        "strategy": "TP within node, PP across nodes",
        "example_8gpu_single_node": {
            "tp_only": "TP=8, all layers split across 8 GPUs",
            "pp_only": "PP=8, 10 layers per GPU (80-layer model)",
            "hybrid": "TP=4 x PP=2: two stages, each tensor-parallel",
        },
        "example_multi_node": {
            "setup": "2 nodes, 8 GPUs each",
            "approach": "TP=8 within each node, PP=2 across nodes",
            "why": "NVLink within node, network between nodes",
        },
        "decision_framework": """
            Same node: Prefer tensor parallelism (fast NVLink)
            Across nodes: Use pipeline parallelism (network bottleneck)
            Very large models: Hybrid TP + PP
            For 70B on 8 GPUs (single node):
            - TP=8: Lowest latency
            - PP=8: Highest throughput with large batches
            - TP=4 x PP=2: Balanced
            For 400B on 32 GPUs (4 nodes):
            - TP=8 per node, PP=4 across nodes
        """,
    }
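Here is a minimal helper encoding that decision framework; the degrees it returns correspond to the tensor/pipeline knobs serving frameworks expose (for example vLLM's --tensor-parallel-size and --pipeline-parallel-size). The node counts below are illustrative.

def hybrid_split(nodes: int, gpus_per_node: int = 8) -> dict:
    # TP fills a node (all-reduce stays on NVLink); PP spans nodes, so only
    # point-to-point traffic crosses the slower inter-node network.
    return {
        "tp_degree": gpus_per_node,
        "pp_degree": nodes,
        "total_gpus": nodes * gpus_per_node,
    }

print(hybrid_split(nodes=1))   # 70B on a single node   -> TP=8, PP=1
print(hybrid_split(nodes=4))   # 400B across four nodes -> TP=8, PP=4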
Memory Implications
def memory_comparison():
    return {
        "tensor_parallel": {
            "model_weights": "Split evenly across GPUs",
            "kv_cache": "Split evenly (heads distributed)",
            "activations": "Partially replicated (need intermediate results)",
            "total_capacity": "~0.9x linear with GPU count",
        },
        "pipeline_parallel": {
            "model_weights": "Each GPU holds its stage's full weights",
            "kv_cache": "Each GPU holds cache for its layers",
            "activations": "Only current micro-batch activations",
            "total_capacity": "Linear with GPU count",
        },
        "practical_difference": """
            70B model (140GB in FP16):
            TP=4 (4x 80GB GPUs):
            - Weights: 35GB each
            - Activations: ~10GB each (replicated)
            - KV cache: ~25GB each
            - Aggregate KV cache: ~100GB (communication buffers and replication eat the rest)
            PP=4 (4x 80GB GPUs):
            - Weights: 35GB each (20 layers per GPU)
            - Activations: ~5GB each (current micro-batch only)
            - KV cache: ~35GB each
            - Aggregate KV cache: ~140GB
            PP gives more KV cache capacity but worse latency.
        """,
    }
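The per-GPU arithmetic above reduces to one line: whatever remains after the weight shard, activations, and a reserve for communication buffers becomes KV cache. The activation and reserve values below are assumptions chosen to mirror the 70B on 4x 80GB breakdown.

def kv_headroom_gb(model_gb: float, gpus: int, gpu_gb: float,
                   activation_gb: float, reserve_gb: float) -> float:
    # Per-GPU KV-cache headroom: GPU memory minus weight shard, activations,
    # and a reserve for NCCL buffers, CUDA context, and allocator fragmentation.
    return gpu_gb - model_gb / gpus - activation_gb - reserve_gb

# 70B model (140GB FP16) on 4x 80GB GPUs; reserve values are assumed:
tp = kv_headroom_gb(140, 4, 80, activation_gb=10, reserve_gb=10)  # replicated activations
pp = kv_headroom_gb(140, 4, 80, activation_gb=5, reserve_gb=5)    # one micro-batch only
print(f"TP=4: ~{tp:.0f}GB KV cache per GPU")   # ~25GB
print(f"PP=4: ~{pp:.0f}GB KV cache per GPU")   # ~35GB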
Implementation Patterns
class ParallelismSelector:
    """
    Choose parallelism strategy based on requirements
    """

    def select_strategy(self, requirements: dict) -> dict:
        model_size = requirements["model_size_gb"]
        gpus_available = requirements["gpu_count"]
        priority = requirements["priority"]          # "latency" or "throughput"
        interconnect = requirements["interconnect"]  # "nvlink" or "pcie"

        # Single GPU if the model fits with some headroom (80GB-class GPU assumed)
        if model_size <= 75:
            return {"strategy": "single_gpu", "reason": "Model fits on one GPU"}

        # Multi-node: hybrid TP within each node, PP across nodes.
        # Checked first; otherwise the single-node branches below always match.
        if gpus_available > 8:
            return {
                "strategy": "hybrid",
                "tp_degree": 8,                     # TP stays inside the NVLink domain
                "pp_degree": gpus_available // 8,   # only PP traffic crosses the network
                "reason": "Multi-node deployment",
            }

        # Tensor parallel for latency on a fast interconnect
        if priority == "latency" and interconnect == "nvlink":
            return {
                "strategy": "tensor_parallel",
                "tp_degree": min(gpus_available, 8),
                "reason": "Latency priority with NVLink",
            }

        # Pipeline parallel for throughput or a slow interconnect
        if priority == "throughput" or interconnect == "pcie":
            return {
                "strategy": "pipeline_parallel",
                "pp_degree": gpus_available,
                "reason": "Throughput priority or PCIe interconnect",
            }

        # Fallback so the method always returns a strategy
        return {
            "strategy": "tensor_parallel",
            "tp_degree": min(gpus_available, 8),
            "reason": "Default within-node strategy",
        }
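Example use of the selector (the requirement values are illustrative):

selector = ParallelismSelector()
print(selector.select_strategy({
    "model_size_gb": 140,        # 70B in FP16
    "gpu_count": 8,
    "priority": "latency",
    "interconnect": "nvlink",
}))
# -> {'strategy': 'tensor_parallel', 'tp_degree': 8, 'reason': 'Latency priority with NVLink'}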
Common Mistakes
def parallelism_mistakes():
    return {
        "tp_over_pcie": {
            "mistake": "Tensor parallel with PCIe interconnect",
            "why_bad": "All-reduce over PCIe is 10-20x slower than NVLink",
            "result": "Communication dominates, scaling < 1.5x",
            "fix": "Use pipeline parallel instead, or get NVLink",
        },
        "pp_with_small_batches": {
            "mistake": "Pipeline parallel with batch size 1",
            "why_bad": "Pipeline never fills, most GPUs idle",
            "result": "Worse than single GPU",
            "fix": "Use micro-batching or switch to tensor parallel",
        },
        "ignoring_bubbles": {
            "mistake": "Expecting linear scaling from pipeline parallel",
            "why_bad": "Pipeline bubbles during ramp-up and ramp-down",
            "result": "Efficiency 60-80% of theoretical",
            "fix": "Account for bubble overhead in capacity planning",
        },
        "unbalanced_stages": {
            "mistake": "Pipeline stages with unequal compute",
            "why_bad": "Slowest stage becomes bottleneck",
            "result": "Other GPUs wait for the slowest",
            "fix": "Split layers to balance compute time per stage",
        },
    }
Tensor parallelism makes individual requests faster. Pipeline parallelism makes the overall system process more requests. Choose tensor for latency-sensitive applications, pipeline for throughput-maximizing batch processing, and hybrid when you're spanning multiple nodes or need both.