Why DeepSeek is cheap at scale but expensive to run locally
Why is DeepSeek-V3 supposedly fast and cheap to serve at scale, but too slow and expensive to run locally? Why are some AI models slow to respond but fast once they get going?
AI inference providers often talk about a fundamental tradeoff between throughput and latency: for any given model, you can either serve it at high-throughput, high-latency or at low-throughput, low-latency. In fact, some models are so naturally GPU-inefficient that in practice they must be served at high latency to have any workable throughput at all.
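The excerpt breaks off before the mechanism is spelled out, but the usual source of this tradeoff is GPU batching: requests queue up until a batch is full, then the whole batch runs in one forward pass. Here is a minimal sketch of that dynamic with made-up numbers (`FIXED_OVERHEAD_MS`, `PER_REQUEST_MS`, and `ARRIVAL_GAP_MS` are illustrative assumptions, not measurements of DeepSeek-V3 or any real serving stack):

```python
# Toy model of batched GPU inference: larger batches amortize fixed
# per-pass overhead (higher throughput) but force early requests to
# wait for the batch to fill (higher latency).

FIXED_OVERHEAD_MS = 50.0   # assumed fixed cost per forward pass
PER_REQUEST_MS = 5.0       # assumed marginal cost per extra sequence in the batch
ARRIVAL_GAP_MS = 20.0      # assumed gap between incoming requests

def serve(batch_size: int) -> tuple[float, float]:
    """Return (throughput in requests/sec, worst-case latency in ms)."""
    # The first request in the batch waits for the rest to arrive.
    queue_wait_ms = (batch_size - 1) * ARRIVAL_GAP_MS
    # One forward pass over the whole batch shares the fixed overhead.
    compute_ms = FIXED_OVERHEAD_MS + batch_size * PER_REQUEST_MS
    throughput = batch_size / (compute_ms / 1000.0)
    latency_ms = queue_wait_ms + compute_ms
    return throughput, latency_ms

for b in (1, 8, 64):
    tput, lat = serve(b)
    print(f"batch={b:3d}  throughput={tput:7.1f} req/s  latency={lat:7.1f} ms")
```

With these toy numbers, going from batch size 1 to 64 raises throughput roughly tenfold while pushing worst-case latency from ~55 ms to well over a second, which is the shape of the tradeoff the article is describing.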