Replicate & Fly cold-start latency
Replicate has been my default serverless GPU choice in the past, and I’ve been trying to use it to set up some embedding models, like SPLADE and a Q&A-optimized bi-encoder. On the other hand, I’m a huge fan of Fly for hosting, and they’ve just announced GPUs.

How does cold & warm latency compare between these providers?

Problem Context

I’m building a semantic search engine, and I’m most interested in minimizing query-time latency, and trying out multiple different models.

One constraint: I don’t wan...
Read more at venki.dev