"Cold-Start Latency Comparison: Fly Outperforms Replicate in Serverless GPU Hosting for Semantic Search Engine Development"

Replicate & Fly cold-start latency

Replicate has been my default serverless GPU choice in the past, and I’ve been trying to use it to set up some embedding models, like SPLADE and a Q&A-optimized bi-encoder. On the other hand, I’m a huge fan of Fly for hosting, and they’ve just announced GPUs.

How does cold & warm latency compare between these providers?

Problem Context

I’m building a semantic search engine, and I’m most interested in minimizing query-time latency, and trying out multiple different models.

One constraint: I don’t wan...