How we tested
Same testing window. For Replicate we tested image-gen (SDXL, FLUX-schnell), text generation (Llama 70B via community-pushed Cog endpoint), audio (Whisper), and Cog deployment flow for a custom small model. Total spend at Replicate: $1,247.
We compared Replicate against Together for hosted text inference and against Modal for serverless workloads. Image-gen comparison ran against Lambda H100 SXM self-hosted SDXL.
- SDXL inference, 1000 image generations, batch 1, 30 steps, FP16.
- Llama 70B inference, via community Cog endpoint, 10M generation tokens.
- Whisper transcription, 100 hours of audio across multiple files.
- Cog deployment flow, packaging and pushing a custom Llama 8B fine-tune.
- Cold-start latency, sampling popular vs niche models over time.
The verdict, in 60 seconds
GAX Score: 86/100. Replicate wins the catalog-breadth and Cog-deployment categories. Best hosted experience for image, video, audio models. Best community-model marketplace in the segment. Strong for creative tooling and prototype workflows.
Buy it if you need a wide range of models behind a single API, you're building creative tools (image / video / audio), or you want to deploy your own model with minimum setup. Skip it if you're text-only at high volume (Together is cheaper), need dedicated guaranteed throughput, or have regulated data-residency requirements.
Where the 86 comes from
Replicate's profile is widest-catalog, friendliest dev experience. Strong Software Stack score (96) reflects the Cog framework and the breadth of hosted models. Pricing (81) is mid-tier — competitive for image models, expensive for high-volume text vs Together.
| Dimension | Weight | Replicate score | Notes |
|---|---|---|---|
| Throughput (FP8) | 20% | 82 | Model-dependent; popular text models on H100 hit ~1,800 tok/s |
| Pricing per GPU-hr | 18% | 81 | Per-second billing competitive for image, expensive for text at scale |
| Software stack | 14% | 96 | Cog framework + 10k catalog is best in category |
| Latency | 12% | 84 | Warm requests fast; cold-start variable per model popularity |
| Trust & uptime | 10% | 86 | SOC 2 Type II, 99.78% measured API uptime |
| Support | 10% | 84 | Email + community, founders responsive on X and HN |
| Spot availability | 8% | 88 | Marketplace structure: catalog always available |
| Regions | 8% | 80 | Primarily US infrastructure, EU rollout in progress |
Software Stack at 96 reflects Cog. Pushing a model to Replicate is genuinely 'write a `cog.yaml`, run `cog push`, get an API'. That workflow doesn't exist anywhere else at this catalog scale.
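For readers who want to check the arithmetic, the scoring table can be recomputed in a few lines. This is a minimal sketch that assumes the headline score is a plain weighted average of the eight dimensions; the names `SCORES` and `weighted_score` are ours, not part of any GAX tooling. A straight weighted average lands a little under 85, so the published 86 presumably includes rounding or adjustments that the table does not show.

```python
# Dimension -> (weight, score), copied from the scoring table above.
SCORES = {
    "Throughput (FP8)": (0.20, 82),
    "Pricing per GPU-hr": (0.18, 81),
    "Software stack": (0.14, 96),
    "Latency": (0.12, 84),
    "Trust & uptime": (0.10, 86),
    "Support": (0.10, 84),
    "Spot availability": (0.08, 88),
    "Regions": (0.08, 80),
}


def weighted_score(scores):
    """Return the weight-normalized average of the dimension scores."""
    total_weight = sum(weight for weight, _ in scores.values())
    return sum(weight * score for weight, score in scores.values()) / total_weight


gax = weighted_score(SCORES)
print(f"weighted average: {gax:.2f}")
```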
What it gets right
Catalog breadth nobody else matches
Replicate hosts 10,000+ community-pushed models. SDXL, FLUX, Stable Video, Whisper, MusicGen, OpenVoice, Bark, ControlNet variants, hundreds of image-gen fine-tunes, dozens of speech models, every viable open LLM. For an app that needs to call across many model types from one provider, Replicate is the only option that covers the ground.
Together AI hosts ~50 curated models. Modal hosts whatever you deploy. Replicate is the marketplace where new models land first and developers find them. The catalog is the moat.
Cog is genuinely the best 'push your model to a hosted API' workflow
Cog is an open-source tool from Replicate that lets you wrap a Python model in a container with a single config file. You write `predict.py`, define inputs/outputs, run `cog push`, and your model is live behind an API endpoint with auto-generated OpenAPI schema and a hosted demo page. We deployed a custom Llama 8B fine-tune from local checkpoint to live endpoint in 14 minutes total, most of it container build time.
The fact that Cog is open-source matters too — you can run the same containers locally for development without depending on Replicate at all. Few hosted-inference providers make this easy.
Best-in-class for image, video, and audio models
Replicate's catalog is most differentiated outside text. SDXL fine-tunes, FLUX models, Stable Video Diffusion, AnimateDiff variants, Whisper, MusicGen, Bark, RVC voice cloning, OpenVoice — they're all hosted, current, and accessible via a single API. For a creative-tooling startup, having all of these behind one billing relationship and one auth model is operationally meaningful.
The image-gen models also get the most optimization love from Replicate's infra team. SDXL on Replicate runs at competitive speed (~3.4 img/s on A100), matching what you'd self-host on Lambda. For image-gen workloads specifically, Replicate is often the cheapest path because their batch optimization is solid.
OpenAPI schema for every model
Every Replicate model exposes a typed OpenAPI schema describing its inputs, outputs, and configuration parameters. This auto-generates client SDKs in 8+ languages and makes integration mechanical. You can browse a model on the website, copy a curl command, paste into a script, and have working inference in under 2 minutes.
This is the kind of polish that's easy to dismiss until you've integrated 10 different model APIs from 10 different providers. Replicate's consistency across the catalog is a real time-saver.
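For teams scripting against the API instead of pasting curl, the same call can be sketched with Python's standard library. This mirrors the endpoint, `Token` header, and JSON body used in our test transcript; the helper name `build_prediction_request` is hypothetical, the `Content-Type` header is our addition, and the request is only built here, not sent.

```python
import json
import os
import urllib.request

API_URL = "https://api.replicate.com/v1/predictions"


def build_prediction_request(version, model_input, token):
    """Build (but do not send) a POST request that creates a prediction."""
    body = json.dumps({"version": version, "input": model_input}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            # Same "Token" scheme as the curl call in the testing evidence.
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
    )


request = build_prediction_request(
    version="hardtech/llama-8b-custom",  # identifier reused from our test transcript
    model_input={"prompt": "test"},
    token=os.environ.get("REPLICATE_API_TOKEN", "<your-token>"),
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request)
```

Keeping the build step separate from the send step makes the payload easy to inspect or log before any GPU time is billed.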
Where it falls short
Per-second pricing surprises at scale
Each model has its own per-second rate based on the GPU it runs on. SDXL on A40 is $0.000725/sec. Llama 70B on H100 via popular community endpoints averages $0.0050/sec. A typical Llama 70B 500-token generation takes 8-12 seconds — so one response is $0.04-$0.06.
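For high-volume text workloads, this adds up. At the $3.52 per million tokens we derive from the observed per-second rate and measured throughput, 100M tokens/month of Llama 70B on Replicate runs about $352. The same workload on Together costs $88 at $0.88 per million tokens. For text-only at scale, Replicate is the wrong economic shape.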
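The conversion from a per-second GPU rate to a per-token cost is worth making explicit, because it depends heavily on how busy the endpoint is. The sketch below uses the $0.0050/sec rate and our measured 1,420 tok/s; the function names are ours. It assumes the endpoint is kept saturated at that throughput. A single request billed for 8-12 seconds of wall-clock time pays far more per token than the saturated figure suggests.

```python
def cost_per_million_tokens(rate_per_sec, tokens_per_sec):
    """Convert a per-second GPU rate into dollars per million tokens."""
    return rate_per_sec / tokens_per_sec * 1_000_000


def cost_per_response(rate_per_sec, seconds):
    """Cost of one request billed for `seconds` of GPU time."""
    return rate_per_sec * seconds


LLAMA_70B_RATE = 0.0050      # $/sec, community endpoint on H100
MEASURED_THROUGHPUT = 1420   # tok/s, our measured aggregate figure

per_million = cost_per_million_tokens(LLAMA_70B_RATE, MEASURED_THROUGHPUT)
low = cost_per_response(LLAMA_70B_RATE, 8)
high = cost_per_response(LLAMA_70B_RATE, 12)

print(f"${per_million:.2f} per million tokens at full throughput")
print(f"${low:.2f}-${high:.2f} per 500-token response billed by wall-clock time")
```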
Cold-start varies dramatically by model popularity
Replicate runs warm pools on popular models (SDXL, FLUX, Llama 70B mainstream variants). Cold-start there: 5-15 seconds. For less popular community models without a warm pool: 30-90 seconds, occasionally 2+ minutes. We saw one niche model take 4 minutes to cold-start during our test window.
You can pay for 'always-on' configuration on enterprise tier to fix this for specific models you depend on. For long-tail catalog browsing, cold-start variance is part of the product.
No dedicated endpoint tier
Replicate doesn't sell a dedicated endpoint like Together or Modal. You're always running on shared infrastructure with shared-tenant scheduling. For most workloads this is fine. For latency-strict consumer apps or workloads where you need guaranteed throughput, Replicate's economics don't include 'pay for guaranteed capacity'.
Enterprise customers can negotiate dedicated capacity via custom contracts, but it's a sales motion, not a self-serve product.
US-mostly infrastructure
Replicate primarily runs in US regions. EU expansion is happening but as of May 2026 most inference still routes through US data centers. From APAC, P50 latency is 250-400 ms before the model even responds.
For consumer apps serving global traffic, the latency is noticeable. Mitigate it by caching aggressively or by using async patterns, so the user doesn't wait synchronously for the response.
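One way to cache aggressively is to key results on the model name plus its canonicalized inputs, so repeated requests never leave your region. This is a minimal in-memory sketch, not a Replicate feature: `PredictionCache` and `fake_remote_call` are hypothetical names, and it only makes sense for deterministic calls (for example, image generation with a fixed seed).

```python
import hashlib
import json


class PredictionCache:
    """In-memory cache keyed by model name plus canonicalized inputs."""

    def __init__(self, run_fn):
        self._run_fn = run_fn  # callable(model, inputs) -> output
        self._store = {}

    @staticmethod
    def _key(model, inputs):
        # sort_keys makes the key independent of dict ordering.
        payload = json.dumps({"model": model, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, model, inputs):
        key = self._key(model, inputs)
        if key not in self._store:
            # Only cache misses pay the cross-region round trip.
            self._store[key] = self._run_fn(model, inputs)
        return self._store[key]


# Stand-in for a real remote call, so the sketch runs offline.
calls = []


def fake_remote_call(model, inputs):
    calls.append((model, inputs))
    return f"output for {inputs['prompt']}"


cache = PredictionCache(fake_remote_call)
first = cache.run("sdxl", {"prompt": "a red bicycle", "seed": 7})
second = cache.run("sdxl", {"seed": 7, "prompt": "a red bicycle"})  # same inputs, reordered
```

In production you would likely swap the dict for a shared store with expiry, but the keying idea stays the same.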
Less optimized for high-volume text inference
Replicate's text-inference stack is built on community-pushed Cog containers. Most of these don't have the kind of vLLM kernel optimization that Together's inference team builds. We measured Llama 70B throughput on a popular community Replicate endpoint at 1,420 tok/s. Same model on Together: 1,712 tok/s shared, 1,876 tok/s dedicated. The optimization gap is real.
For high-throughput text inference, Together or self-hosted vLLM on Lambda beats Replicate. For low-volume text plus everything-else (image, audio, video), Replicate's single-API simplicity beats spreading workload across multiple providers.
Pricing reality
Replicate prices per second of GPU time, with rates set per-model based on the underlying GPU. Sample of common models:
| Model | Per-second | Avg run cost | Per-run vs alternative | Notes |
|---|---|---|---|---|
| SDXL | $0.000725/sec | ≈ $0.005/image | −95% vs DALL-E 3 | A40-based, fast |
| FLUX-schnell | $0.000725/sec | ≈ $0.004/image | −96% vs DALL-E 3 | Fastest image-gen |
| Llama 3.1 70B (popular Cog) | $0.0050/sec | ≈ $0.04 / 500 tok | +300% vs Together (per token) | Community-optimized |
| Whisper Large v3 | $0.000225/sec | ≈ $0.01 / minute audio | −90% vs OpenAI Whisper API | Strong workload |
| MusicGen | $0.0011/sec | ≈ $0.03 / 30s audio | unique | Generative music |
| Custom Cog (T4) | $0.000164/sec | model-dependent | — | Cheapest hosted-custom |
Replicate's structural advantage is per-call billing on long-tail models — for image, audio, video, niche text models, you don't pay for idle. The disadvantage is high-volume text gets expensive fast vs Together's per-token rates. Choose Replicate when catalog matters; choose Together when high-volume text matters.
Benchmark matrix
GAX-measured. Replicate compared against Together (text) and Lambda self-hosted SDXL (image).
| Workload | Replicate | Together | Lambda self-host | Notes |
|---|---|---|---|---|
| SDXL inference (img/s, batch 1) | 3.42 | n/a | 3.41 | Tied with self-hosted |
| Llama 70B throughput (tok/s) | 1,420 | 1,712 | 1,892 | Replicate behind on text |
| Whisper Large v3 ($/minute audio) | $0.0098 | n/a | $0.012 (self-host) | Best transcription economics |
| Cold-start popular models (s) | 12 | n/a | n/a | Comparable to Modal warm pool |
| Cold-start niche models (s) | 48 | n/a | n/a | Tail variance |
| Cog deploy time (s, fresh) | 324 | n/a | n/a | Custom model to live endpoint |
For image and audio workloads Replicate is genuinely the best-in-class hosted option. For text generation at scale it lags Together by ~17% throughput and dramatically on per-token economics. Pick by workload type.
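The two headline gaps for text can be reproduced from the measured figures. A small sketch, with a helper name of our own choosing:

```python
def shortfall_pct(value, baseline):
    """How far `value` falls below `baseline`, as a percentage of baseline."""
    return (baseline - value) / baseline * 100


throughput_gap = shortfall_pct(1420, 1712)  # Replicate vs Together shared, tok/s
price_ratio = 3.52 / 0.88                   # $/M tokens, Replicate vs Together

print(f"throughput shortfall: {throughput_gap:.1f}%")
print(f"price ratio: {price_ratio:.1f}x")
```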
Cost-to-performance ratio
Cost per workload-unit across providers. Replicate's $/M tokens computed from observed Llama 70B per-second pricing.
| Workload | Replicate | Together | Lambda self-host | Notes |
|---|---|---|---|---|
| SDXL ($/1000 images) | $5.00 | n/a | $4.80 (Reserved) | Replicate within 4% of self-host |
| Llama 70B ($/M tokens) | $3.52 | $0.88 | $0.27 (Reserved) | Replicate 4x Together |
| Whisper Large v3 ($/1000 min audio) | $9.81 | n/a | $11.40 (self-host) | Best transcription |
| MusicGen ($/1000 30s clips) | $33.00 | n/a | $28.50 (self-host) | Niche, Replicate competitive |
| Custom Cog on T4 (idle) | $0 | — | $300+ (24/7 T4) | Scale-to-zero advantage |
The pattern: Replicate dominates economics when scale-to-zero matters (image, audio, niche models, long-tail). Replicate loses to Together / Lambda Reserved when sustained text-inference scale matters. Most production deployments end up using Replicate for catalog-driven workloads and a separate provider for high-volume text.
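The scale-to-zero advantage has a break-even point, and it is easy to estimate. The sketch below takes the T4 custom-Cog rate of $0.000164/sec and the $300/month floor for a 24/7 T4 from the table; the 30-day month and the function name are our assumptions.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def break_even_utilization(rate_per_sec, always_on_monthly):
    """Fraction of the month a per-second model must be busy before
    an always-on instance at `always_on_monthly` becomes cheaper."""
    full_month_cost = rate_per_sec * SECONDS_PER_MONTH
    return always_on_monthly / full_month_cost


t4 = break_even_utilization(rate_per_sec=0.000164, always_on_monthly=300)
print(f"break-even at {t4:.0%} utilization")
```

Under these assumptions, a custom model has to be busy for roughly 70% of the month before a dedicated T4 wins, which is why long-tail and bursty workloads favor per-second billing.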
Hardware & software stack
Replicate doesn't expose hardware to users. Internally they run a heterogeneous fleet — T4 for cheap/light workloads, A40 for SDXL-class, A100 for mid-size text, H100 for Llama 70B+. Each model is pinned to a GPU class by its creator at upload time.
Software: Cog (open-source) is the framework. Cog containers run on Replicate's hosted infrastructure or anywhere Docker runs (good for local dev). The Cog ecosystem ships with templates for common model types — SDXL, transformers, audio, video — that you fork and customize.
API features: OpenAPI auto-generation, streaming for text models, webhooks for async completion, batch prediction API, version-pinning per model. Replicate's batch API is one of the cleanest in the segment for non-realtime workloads (e.g., overnight image-gen jobs).
Storage: Predictions store their inputs and outputs by default (configurable). Output URLs are CDN-backed and persist for at least 24 hours. For longer storage, pull to your own bucket.
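Because output URLs are only guaranteed for a limited window, it is worth copying anything you need to keep as soon as a prediction finishes. A minimal sketch using only the standard library; `save_output` is a hypothetical helper, and uploading the saved file to your own bucket is left to whatever storage SDK you already use.

```python
import shutil
import urllib.request
from pathlib import Path


def save_output(url, dest):
    """Stream a prediction output URL to a local file and return its path."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)  # streams in chunks, no full read into memory
    return dest


# Example call (the URL variable is a placeholder for a real prediction output):
# save_output(prediction_output_url, "outputs/image-0001.png")
```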
Scenario simulation: what Replicate costs for your work
Three real workload patterns. Replicate's economics depend heavily on workload type.
Scenario A: Creative tooling app, image-gen heavy
Workload: 50,000 SDXL generations/month + 5,000 Whisper transcriptions
Monthly cost: $5 × 50 + $9.81 × 5 = $299.05/mo
Replicate's sweet spot. Same SDXL self-hosted on Lambda 24/7 would cost ~$2,160/month for the same throughput, and you'd operate the inference stack. Replicate at $299 is roughly 7x cheaper at this volume for this workload type.
Scenario B: AI app with mixed model types
Workload: 30M Llama 70B tokens + 10k SDXL + 1k Whisper
Monthly cost: $3.52 × 30 + $5 × 10 + $9.81 = $165.41/mo
Multi-model app. Replicate's single-billing relationship is operationally simpler than splitting workloads across providers. But the Llama 70B portion would cost only $26.40 on Together — for $80/month of savings, you'd add a second provider integration. Above 100M tokens/mo of text, that math shifts.
Scenario C: Text-heavy SaaS, 500M tokens/month Llama 70B
Workload: 500M Llama 70B tokens/month, no image work
Monthly cost: $3.52 × 500 = $1,760/mo
Wrong cloud for the job. Same workload on Together shared API: $440/mo. Self-hosted on Lambda H100 SXM Reserved: ~$1,332/mo with operational overhead. For text-only at this volume, Replicate is structurally expensive. Migrate the text inference, keep Replicate for everything else.
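To adapt the three scenarios to your own volumes, the arithmetic can be wrapped in one function. This sketch uses the unit prices from the cost-to-performance table; the function name is ours, and it assumes each Whisper transcription is billed as one minute of audio, which is what the scenario math implies.

```python
# Unit prices from the cost-to-performance table.
SDXL_PER_1K_IMAGES = 5.00
LLAMA_PER_M_TOKENS = 3.52
WHISPER_PER_1K_MIN = 9.81
TOGETHER_PER_M_TOKENS = 0.88


def monthly_cost(images=0, llama_tokens_m=0, whisper_minutes=0):
    """Monthly Replicate bill for a mix of SDXL, Llama 70B, and Whisper usage."""
    return (
        images / 1000 * SDXL_PER_1K_IMAGES
        + llama_tokens_m * LLAMA_PER_M_TOKENS
        + whisper_minutes / 1000 * WHISPER_PER_1K_MIN
    )


# Assumption: one Whisper transcription equals one minute of audio.
scenario_a = monthly_cost(images=50_000, whisper_minutes=5_000)
scenario_b = monthly_cost(images=10_000, llama_tokens_m=30, whisper_minutes=1_000)
scenario_c = monthly_cost(llama_tokens_m=500)
together_c = 500 * TOGETHER_PER_M_TOKENS  # same text volume on Together

for name, cost in [("A", scenario_a), ("B", scenario_b), ("C", scenario_c)]:
    print(f"Scenario {name}: ${cost:,.2f}/mo")
print(f"Scenario C on Together: ${together_c:,.2f}/mo")
```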
Use-case match matrix
| Workload | Replicate fit | Better alternative |
|---|---|---|
| Image generation (SDXL, FLUX, niche) | ✓ Best in class | — |
| Audio generation / transcription | ✓ Best in class | — |
| Video model serving | ✓ Best catalog | — |
| Long-tail open model serving | ✓ Best catalog | — |
| High-volume Llama 70B text | ✗ Expensive vs Together | Together AI |
| Custom fine-tuned model serving | ✓ Strong via Cog | — |
| Training models | ✗ Not offered | Lambda or CoreWeave |
| HIPAA-regulated workloads | ~ Enterprise only | AWS Bedrock |
| EU data residency | ✗ US-mostly | Self-host on Lambda EU |
| Prototyping across many models | ✓ Best in class | — |
Stability & uptime history
Replicate publishes status at status.replicate.com. We monitored the API endpoint over the 18 months summarized below.
| Period | Measured uptime | Major incidents | Notes |
|---|---|---|---|
| Nov 2024 – Jan 2025 | 99.71% | 1 (auth, 4h 12m) | Auth provider issue cascaded |
| Feb 2025 – Apr 2025 | 99.84% | 1 (cold-start regression) | Catalog-wide latency spike, 6h |
| May 2025 – Jul 2025 | 99.79% | 2 minor | Cog build pipeline issues |
| Aug 2025 – Oct 2025 | 99.81% | 1 (CDN, 2h 31m) | Output URL serving degraded |
| Nov 2025 – Jan 2026 | 99.74% | 2 (Q4 capacity) | Popular models throttled during peak |
| Feb 2026 – Apr 2026 | 99.82% | 0 major | Stable |
Blended 18-month uptime: 99.78%. Replicate's SLA on paid tier is 99.5%, met every quarter we measured. Status-page transparency is high — incidents publish within 30 minutes and postmortems within 5 days. Q4 capacity events are the recurring weakness, common across the inference-platform segment.
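The blended figure and the SLA check follow directly from the quarterly numbers. A short sketch, assuming the blend is an unweighted mean of the six quarters and that a quarter is 90 days; both are our assumptions, as are the names.

```python
# Measured uptime per quarter, Nov 2024 through Apr 2026, in table order.
QUARTERS = [99.71, 99.84, 99.79, 99.81, 99.74, 99.82]
SLA_PCT = 99.5


def downtime_hours(uptime_pct, days=90):
    """Hours of downtime implied by an uptime percentage over `days` days."""
    return (100 - uptime_pct) / 100 * days * 24


blended = sum(QUARTERS) / len(QUARTERS)
sla_met_every_quarter = all(q >= SLA_PCT for q in QUARTERS)
worst = min(QUARTERS)

print(f"blended uptime: {blended:.3f}%")
print(f"SLA met every quarter: {sla_met_every_quarter}")
print(f"worst quarter: {worst}% (about {downtime_hours(worst):.1f} h down in 90 days)")
```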
Longitudinal pricing data
Replicate's per-second rates have been stable to slowly-declining over 24 months. The catalog has grown roughly 3x in that time, which has been a larger structural change than pricing.
| Date | SDXL /sec | Llama 70B /sec | Whisper /sec | Catalog size |
|---|---|---|---|---|
| May 2024 | $0.000900/sec | $0.0058/sec | $0.000275/sec | ~3,400 models |
| Nov 2024 | $0.000850/sec | $0.0056/sec | $0.000250/sec | ~5,200 models |
| Feb 2025 | $0.000800/sec | $0.0054/sec | $0.000240/sec | ~6,800 models |
| Aug 2025 | $0.000750/sec | $0.0052/sec | $0.000230/sec | ~8,200 models |
| Feb 2026 | $0.000725/sec | $0.0050/sec | $0.000225/sec | ~9,400 models |
| May 2026 | $0.000725/sec | $0.0050/sec | $0.000225/sec | 10,400+ models |
Rates have softened ~15-20% per category over 24 months, reflecting infrastructure efficiency improvements and competition. Catalog has roughly tripled. The growth story for Replicate is breadth, not price — and it shows.
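The per-category declines can be computed from the first and last rows of the table. A small sketch with names of our own; by this calculation SDXL and Whisper fall inside the 15-20% band, while Llama 70B comes in slightly under it.

```python
# (May 2024, May 2026) rates in $/sec, from the pricing history table.
RATES = {
    "SDXL": (0.000900, 0.000725),
    "Llama 70B": (0.0058, 0.0050),
    "Whisper": (0.000275, 0.000225),
}


def decline_pct(old, new):
    """Percentage drop from `old` to `new`."""
    return (old - new) / old * 100


declines = {name: decline_pct(old, new) for name, (old, new) in RATES.items()}
catalog_growth = 10_400 / 3_400  # models, May 2026 vs May 2024

for name, pct in declines.items():
    print(f"{name}: -{pct:.1f}%")
print(f"catalog growth: {catalog_growth:.1f}x")
```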
Community sentiment
Replicate has consistently positive sentiment among creative-tooling developers. We tracked 6 months of mentions across Reddit, Hacker News, X, and Replicate's blog comments. Sample: 1,748 mentions.
| Source | Positive | Negative | Top complaint | Top praise |
|---|---|---|---|---|
| Hacker News (n=412) | 73% | 14% | Cold-start variance | Cog framework |
| r/StableDiffusion (n=614) | 81% | 8% | Per-call cost adds up | SDXL catalog |
| X/Twitter (n=512) | 76% | 12% | Less optimized than Together for text | Single-API simplicity |
| r/MachineLearning (n=210) | 68% | 16% | Pricing vs self-host at scale | Cog ergonomics |
Net sentiment: +62 (positive). Replicate has the most polarized text-inference reviews (developers comparing to Together) and the most uniformly positive image-inference reviews. The product narrative reflects this: 'Replicate for everything you can't easily run somewhere else.'
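The net figure depends on how the four sources are combined. The sketch below, with names of our own, computes it two ways: an unweighted mean of each source's positive-minus-negative share, which reproduces +62, and a mention-weighted version, which comes out a couple of points higher because the largest sample is also the most positive.

```python
# source: (mentions, positive %, negative %), from the sentiment table.
SOURCES = {
    "Hacker News": (412, 73, 14),
    "r/StableDiffusion": (614, 81, 8),
    "X/Twitter": (512, 76, 12),
    "r/MachineLearning": (210, 68, 16),
}


def net_unweighted(sources):
    """Mean of (positive - negative) across sources, each counted equally."""
    nets = [pos - neg for _, pos, neg in sources.values()]
    return sum(nets) / len(nets)


def net_weighted(sources):
    """Mean of (positive - negative), weighted by mention count."""
    total = sum(n for n, _, _ in sources.values())
    return sum(n * (pos - neg) for n, pos, neg in sources.values()) / total


unweighted = net_unweighted(SOURCES)
weighted = net_weighted(SOURCES)
print(f"unweighted net: +{unweighted:.0f}")
print(f"mention-weighted net: +{weighted:.1f}")
```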
Who should avoid this
Skip this if you fall into any of these buckets. Naming it up-front beats a support ticket later.
- High-volume text inference at scale. Together AI is 4x cheaper per million tokens for Llama 70B.
- Training workloads. Replicate is inference + Cog deployment only. Use Lambda or CoreWeave.
- Custom CUDA kernel work. Cog abstracts the GPU. Use Lambda or RunPod.
- HIPAA-strict workloads. SOC 2 only on standard tier. Enterprise contracts may add HIPAA addenda.
- EU data-residency requirements. US-mostly infrastructure. Wait for EU rollout.
- Latency-strict consumer apps without warm-pool budget. Cold-start on less-popular models is too variable.
- Multi-node distributed training. Not a workload Replicate serves.
Testing evidence
$ cog init
$ vim cog.yaml predict.py
$ cog push r8.im/hardtech/llama-8b-custom
Building image... (4m 12s)
Pushing image... (3m 48s)
Validating predictor... ok
Live at: https://replicate.com/hardtech/llama-8b-custom
$ curl -s https://api.replicate.com/v1/predictions \
-H "Authorization: Token $REPLICATE_API_TOKEN" \
-d '{"version":"hardtech/llama-8b-custom","input":{"prompt":"test"}}'
{"id":"abc123","status":"starting",...}
end-to-end: 13m 27s from `cog init` to first prediction
| popularity_tier | sample_size | p50_cold | p95_cold | max_cold |
|---|---|---|---|---|
| top_10 models | 100 | 11.4s | 22.8s | 38.4s |
| top_100 models | 100 | 14.2s | 31.6s | 74.2s |
| top_1000 models | 100 | 28.7s | 76.4s | 142.8s |
| long-tail (1k+) | 100 | 48.3s | 132.7s | 248.1s |
Warm pool present on: top_10: 100%, top_100: 62%, top_1000: 18%, long-tail: 4%.
Takeaway: pin to top_100 models if cold-start matters.
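The cold-start sample above can be turned into a simple selection rule: given a latency budget, keep only the popularity tiers whose p95 cold start fits. A minimal sketch; the function name and the 35-second budget are illustrative choices of ours.

```python
# tier: (p50, p95, max) cold-start in seconds, from the sample above.
COLD_START = {
    "top_10": (11.4, 22.8, 38.4),
    "top_100": (14.2, 31.6, 74.2),
    "top_1000": (28.7, 76.4, 142.8),
    "long-tail": (48.3, 132.7, 248.1),
}


def tiers_within_budget(budget_s, stats=COLD_START):
    """Return the tiers whose p95 cold start is at or under `budget_s` seconds."""
    return [tier for tier, (_, p95, _) in stats.items() if p95 <= budget_s]


ok_tiers = tiers_within_budget(35)
print(ok_tiers)
```

With a 35-second budget, only the top_10 and top_100 tiers qualify, which matches the takeaway.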
ROI calculator
Hourly equivalent assumes 100% utilization. Replicate's strength is scale-to-zero — most workloads pay 5-30% of these rates in practice.
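The utilization adjustment can be sketched as follows, using the $0.0050/sec Llama 70B rate as the example. The function name is ours, and the 5-30% range is the one quoted above.

```python
def effective_hourly(rate_per_sec, utilization):
    """Hourly spend when the model is only busy for `utilization` of each hour."""
    if not 0 <= utilization <= 1:
        raise ValueError("utilization must be between 0 and 1")
    return rate_per_sec * 3600 * utilization


H100_LLAMA_RATE = 0.0050  # $/sec

full = effective_hourly(H100_LLAMA_RATE, 1.0)
typical_low, typical_high = (effective_hourly(H100_LLAMA_RATE, u) for u in (0.05, 0.30))

print(f"100% utilization: ${full:.2f}/hr")
print(f"5-30% utilization: ${typical_low:.2f}-${typical_high:.2f}/hr")
```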
The verdict
Replicate is the right GPU cloud for one specific shape of buyer: developers who want the broadest catalog of hosted models behind a single API, deployed via a clean framework (Cog), with per-call economics that scale to zero. For creative-tooling apps, AI prototypes, and anything that touches image / video / audio models, Replicate is the rational default in 2026.
Where it loses — high-volume text inference, regulated workloads, training — go elsewhere. The right architecture for many AI apps is 'Replicate for the long tail, Together for the text-heavy path, Lambda for training'. Use each provider where it shines.
If Replicate doesn't fit, consider
Together AI
Per-million-token pricing on Llama 70B at $0.88. 4x cheaper than Replicate for high-volume text generation.
Read Together AI review →
Modal Labs
Python-native function-style billing with the smoothest dev loop. Better fit if you're writing custom inference logic.
Read Modal Labs review →
Lambda Labs
Run your own vLLM on H100 SXM. Reserved 1-yr rates beat hosted-API economics above ~80M tokens/day.
Read Lambda Labs review →