Tensor vs Pipeline Parallelism: When Each Wins
Tensor parallelism cuts latency by splitting layers across GPUs. Pipeline parallelism increases throughput by splitting the model into stages. Choose based on your constraint.
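A back-of-envelope cost model makes the trade-off concrete. This is a sketch with made-up coefficients (the latencies, layer count, and bubble fraction below are all hypothetical, not measurements from any particular setup):

```python
# Hypothetical cost model contrasting the two strategies.
# All numbers are illustrative assumptions, not benchmarks.

def tp_latency_ms(base_latency_ms, n_gpus, allreduce_ms_per_layer, n_layers):
    """Tensor parallelism: each layer's matmuls shrink roughly n_gpus-fold,
    but every layer pays for an all-reduce across the GPUs."""
    return base_latency_ms / n_gpus + allreduce_ms_per_layer * n_layers

def pp_throughput_tps(base_throughput_tps, n_stages, bubble_fraction):
    """Pipeline parallelism: per-token latency stays roughly flat, but
    n_stages micro-batches are in flight at once, minus pipeline-bubble waste."""
    return base_throughput_tps * n_stages * (1.0 - bubble_fraction)

# Latency-bound: 80-layer model, 50 ms single-GPU step, 0.05 ms all-reduce/layer.
print(tp_latency_ms(50.0, 4, 0.05, 80))    # TP cuts latency despite comm cost
# Throughput-bound: 4 stages, 100 tok/s per GPU, 10% pipeline bubble.
print(pp_throughput_tps(100.0, 4, 0.10))   # PP multiplies throughput
```

If the all-reduce term grows (slow interconnect, many layers), the tensor-parallel latency win shrinks fast, which is exactly why the constraint should drive the choice.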
Four GPUs don't give you 4x throughput. Communication overhead, load imbalance, and synchronization eat into gains. Know the scaling curve before you buy.
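One way to sanity-check expectations before buying is a crude strong-scaling model. The coefficients here (5% serialized communication, 3% load imbalance) are illustrative assumptions, not measured values:

```python
def scaling_efficiency(n_gpus, comm_fraction=0.05, imbalance=0.03):
    """Crude Amdahl-style strong-scaling model: a comm_fraction of each
    step is serialized communication, and load imbalance idles the
    faster GPUs. Coefficients are illustrative, not measured."""
    speedup = 1.0 / (comm_fraction + (1.0 - comm_fraction) / n_gpus)
    speedup *= (1.0 - imbalance)
    return speedup / n_gpus  # fraction of ideal linear scaling achieved

for n in (1, 2, 4, 8):
    print(n, round(scaling_efficiency(n), 2))
```

Even with these optimistic coefficients, efficiency drops well below 1.0 by 4 GPUs and keeps falling, so a 4-GPU box delivers meaningfully less than 4x.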
Four GPUs don't give you 4x the usable KV cache memory. Communication buffers, activation memory, and fragmentation eat into the gains. Plan accordingly.
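The sharded part of the budget is easy to compute; the overheads are what spoil the 4x. A minimal accounting sketch, assuming a hypothetical 32-layer fp16 model with 32 KV heads of dimension 128 (none of these numbers come from a real deployment):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch,
                tp_degree, dtype_bytes=2):
    """KV cache size for a decoder model: 2 tensors (K and V) per layer.
    Tensor parallelism shards the heads, so the cache divides by
    tp_degree -- but per-GPU activations and comm buffers do not."""
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes
    return total_bytes / tp_degree / 1e9

# Hypothetical model: 32 layers, 32 KV heads x 128 dims, 4096-token
# context, batch 8, fp16 (2 bytes per element).
print(kv_cache_gb(32, 32, 128, 4096, 8, tp_degree=1))  # whole cache on 1 GPU
print(kv_cache_gb(32, 32, 128, 4096, 8, tp_degree=4))  # per-GPU share at TP=4
```

The cache itself divides cleanly by four; the fixed per-GPU costs (activations for in-flight requests, communication workspace, allocator fragmentation) are replicated on every GPU, which is why the usable headroom grows by less than 4x.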