DEEP REVIEW GPU CLOUD · 2026 UPDATED NOV 8

Replicate is the right GPU cloud if you want the broadest hosted-model catalog and a single curl from idea to API.

Replicate is the model marketplace nobody else has tried to build at this scale. Ten thousand community-pushed models, each with a single API endpoint, billed by the second of inference time. The pattern: anybody packages a model into a Cog container, pushes it, and now it's an API. For image, video, audio, and long-tail text models, no other provider comes close to the catalog breadth.

FIG 1.0 — REPLICATE, CATEGORY ILLUSTRATIVE Image: Chris Ried · Unsplash
The verdict

The first product we've reviewed in three years that we'd actually buy ourselves.

Replicate doesn't just match the spec sheet — it changes the shape of how a team operates. There are real gaps (we'll get to them) but they're operational, not foundational.

86
HARDTECH SCORE · #7 of 10
Across 1,840 verified user reviews

How we tested

Same testing window. For Replicate we tested image-gen (SDXL, FLUX-schnell), text generation (Llama 70B via community-pushed Cog endpoint), audio (Whisper), and Cog deployment flow for a custom small model. Total spend at Replicate: $1,247.

We compared Replicate against Together for hosted text inference and against Modal for serverless workloads. Image-gen comparison ran against Lambda H100 SXM self-hosted SDXL.

  • SDXL inference, 1000 image generations, batch 1, 30 steps, FP16.
  • Llama 70B inference, via community Cog endpoint, 10M generation tokens.
  • Whisper transcription, 100 hours of audio across multiple files.
  • Cog deployment flow, packaging and pushing a custom Llama 8B fine-tune.
  • Cold-start latency, sampling popular vs niche models over time.

The verdict, in 60 seconds

GAX Score: 86/100. Replicate wins the catalog-breadth and Cog-deployment categories. Best hosted experience for image, video, audio models. Best community-model marketplace in the segment. Strong for creative tooling and prototype workflows.

Buy it if you need a wide range of models behind a single API, you're building creative tools (image / video / audio), or you want to deploy your own model with minimum setup. Skip it if you're text-only at high volume (Together is cheaper), need dedicated guaranteed throughput, or have regulated data-residency requirements.

Where the 86 comes from

Replicate's profile is widest-catalog, friendliest dev experience. Strong Software Stack score (96) reflects the Cog framework and the breadth of hosted models. Pricing (81) is mid-tier — competitive for image models, expensive for high-volume text vs Together.

| Dimension | Weight | Replicate | What it measures |
|---|---|---|---|
| Throughput (FP8) | 20% | 82 | Model-dependent; popular text models on H100 hit ~1,800 tok/s |
| Pricing per GPU-hr | 18% | 81 | Per-second billing competitive for image, expensive for text at scale |
| Software stack | 14% | 96 | Cog framework + 10k catalog is best in category |
| Latency | 12% | 84 | Warm requests fast; cold-start variable per model popularity |
| Trust & uptime | 10% | 86 | SOC 2 Type II, 99.78% measured API uptime |
| Support | 10% | 84 | Email + community, founders responsive on X and HN |
| Spot availability | 8% | 88 | Marketplace structure: catalog always available |
| Regions | 8% | 80 | Primarily US infrastructure, EU rollout in progress |
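
As a rough cross-check, the weights and scores in the table can be combined as a plain weighted average. This is a sketch that assumes the headline score is a simple weighted mean. The review does not publish its exact aggregation, and this version lands near 84.9 rather than 86, so some rounding or adjustment may be applied on top.

```python
# Weights (percent) and Replicate scores, copied from the scoring table.
DIMENSIONS = {
    "Throughput (FP8)": (20, 82),
    "Pricing per GPU-hr": (18, 81),
    "Software stack": (14, 96),
    "Latency": (12, 84),
    "Trust & uptime": (10, 86),
    "Support": (10, 84),
    "Spot availability": (8, 88),
    "Regions": (8, 80),
}


def weighted_score(dimensions):
    """Plain weighted average; weights are percentages expected to total 100."""
    total_weight = sum(weight for weight, _ in dimensions.values())
    return sum(weight * score for weight, score in dimensions.values()) / total_weight


score = weighted_score(DIMENSIONS)
print(f"weighted average: {score:.2f}")
```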

Software Stack at 96 reflects Cog. Pushing a model to Replicate is genuinely 'write a `cog.yaml`, run `cog push`, get an API'. That workflow doesn't exist anywhere else at this catalog scale.

What it gets right

Catalog breadth nobody else matches

Replicate hosts 10,000+ community-pushed models. SDXL, FLUX, Stable Video, Whisper, MusicGen, OpenVoice, Bark, ControlNet variants, hundreds of image-gen fine-tunes, dozens of speech models, every viable open LLM. For an app that needs to call across many model types from one provider, Replicate is the only option that covers the ground.

Together AI hosts ~50 curated models. Modal hosts whatever you deploy. Replicate is the marketplace where new models land first and developers find them. The catalog is the moat.

Cog framework is genuinely the best 'push your model to a hosted API'

Cog is an open-source tool from Replicate that lets you wrap a Python model in a container with a single config file. You write `predict.py`, define inputs/outputs, run `cog push`, and your model is live behind an API endpoint with auto-generated OpenAPI schema and a hosted demo page. We deployed a custom Llama 8B fine-tune from local checkpoint to live endpoint in 14 minutes total, most of it container build time.

The fact that Cog is open-source matters too — you can run the same containers locally for development without depending on Replicate at all. Few hosted-inference providers make this easy.
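
For readers who have not seen one, a `predict.py` can be quite small. The sketch below uses Cog's `BasePredictor` and `Input`, but the "model" is a hypothetical echo function, not the Llama 8B fine-tune we deployed. The `try/except` stand-ins are our own addition so the file runs without Cog installed.

```python
# predict.py: minimal Cog predictor sketch (hypothetical toy model).
try:
    from cog import BasePredictor, Input
except ImportError:
    # Stand-ins so this file can be read and exercised without Cog installed.
    class BasePredictor:
        pass

    def Input(default=None, **_kwargs):
        return default


class Predictor(BasePredictor):
    def setup(self):
        """Runs once when the container starts; load weights here."""
        # Placeholder state: a real predictor would load a checkpoint instead.
        self.prefix = "echo: "

    def predict(
        self,
        prompt: str = Input(description="Text to send to the model"),
        max_tokens: int = Input(description="Maximum tokens to return", default=64, ge=1),
    ) -> str:
        """Runs once per request; the typed arguments describe the inputs."""
        words = prompt.split()[:max_tokens]
        return self.prefix + " ".join(words)


if __name__ == "__main__":
    predictor = Predictor()
    predictor.setup()
    print(predictor.predict(prompt="hello from cog", max_tokens=8))
```

In a real project, `cog.yaml` would point its `predict` entry at `predict.py:Predictor` and list the Python packages the model needs.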

Best-in-class for image, video, and audio models

Replicate's catalog is most differentiated outside text. SDXL fine-tunes, FLUX models, Stable Video Diffusion, AnimateDiff variants, Whisper, MusicGen, Bark, RVC voice cloning, OpenVoice — they're all hosted, current, and accessible via a single API. For a creative-tooling startup, having all of these behind one billing relationship and one auth model is operationally meaningful.

The image-gen models also get the most optimization love from Replicate's infra team. SDXL on Replicate runs at competitive speed (~3.4 img/s on A100), matching what you'd self-host on Lambda. For image-gen workloads specifically, Replicate is often the cheapest path because their batch optimization is solid.

OpenAPI schema for every model

Every Replicate model exposes a typed OpenAPI schema describing its inputs, outputs, and configuration parameters. This auto-generates client SDKs in 8+ languages and makes integration mechanical. You can browse a model on the website, copy a curl command, paste into a script, and have working inference in under 2 minutes.

This is the kind of polish that's easy to dismiss until you've integrated 10 different model APIs from 10 different providers. Replicate's consistency across the catalog is a real time-saver.
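
The "copy a curl command into a script" step translates to a few lines of standard-library Python. This sketch builds the request without sending it. The endpoint, `Token` header form, and model identifier mirror the curl call in our testing evidence; the explicit `Content-Type` header and the placeholder token are our assumptions.

```python
import json
import os
import urllib.request

API_URL = "https://api.replicate.com/v1/predictions"  # endpoint from our curl transcript


def build_prediction_request(version: str, model_input: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a prediction."""
    body = json.dumps({"version": version, "input": model_input}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token}",   # header form used in the transcript
            "Content-Type": "application/json",  # assumption: declare the JSON body explicitly
        },
    )


req = build_prediction_request(
    version="hardtech/llama-8b-custom",  # identifier from the transcript
    model_input={"prompt": "test"},
    token=os.environ.get("REPLICATE_API_TOKEN", "<your-token>"),
)
print(req.get_method(), req.full_url)
# To send it, pass req to urllib.request.urlopen and json.load the response.
```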

Where it falls short

Per-second pricing surprises at scale

Each model has its own per-second rate based on the GPU it runs on. SDXL on A40 is $0.000725/sec. Llama 70B on H100 via popular community endpoints averages $0.0050/sec. A typical Llama 70B 500-token generation takes 8-12 seconds — so one response is $0.04-$0.06.

For high-volume text workloads, this adds up. At the $3.52 per million tokens implied by our measured throughput, 100M tokens/month of Llama 70B on Replicate runs about $352. The same workload on Together costs about $88 at $0.88 per million tokens. For text-only at scale, Replicate is the wrong economic shape.
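
The per-response arithmetic is simple enough to script. A minimal sketch, using only the per-second rates and runtimes quoted above plus the ≈ $0.005/image figure from our pricing table; the function name is ours.

```python
def run_cost(rate_per_sec: float, seconds: float) -> float:
    """Cost of one run billed per second of GPU time."""
    return rate_per_sec * seconds


LLAMA_70B_RATE = 0.0050  # $/sec on H100, from the text
SDXL_RATE = 0.000725     # $/sec on A40, from the text

low = run_cost(LLAMA_70B_RATE, 8)
high = run_cost(LLAMA_70B_RATE, 12)
print(f"500-token Llama 70B response: ${low:.2f}-${high:.2f}")

# Going the other way: how many billed seconds does a $0.005 SDXL image imply?
implied_seconds = 0.005 / SDXL_RATE
print(f"implied SDXL runtime: {implied_seconds:.1f}s")
```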

Cold-start varies dramatically by model popularity

Replicate runs warm pools on popular models (SDXL, FLUX, Llama 70B mainstream variants). Cold-start there: 5-15 seconds. For less popular community models without a warm pool: 30-90 seconds, occasionally 2+ minutes. We saw one niche model take 4 minutes to cold-start during our test window.

You can pay for 'always-on' configuration on enterprise tier to fix this for specific models you depend on. For long-tail catalog browsing, cold-start variance is part of the product.

No dedicated endpoint tier

Replicate doesn't sell a dedicated endpoint like Together or Modal. You're always running on shared infrastructure with shared-tenant scheduling. For most workloads this is fine. For latency-strict consumer apps or workloads where you need guaranteed throughput, Replicate's economics don't include 'pay for guaranteed capacity'.

Enterprise customers can negotiate dedicated capacity via custom contracts, but it's a sales motion, not a self-serve product.

US-mostly infrastructure

Replicate primarily runs in US regions. EU expansion is happening but as of May 2026 most inference still routes through US data centers. From APAC, P50 latency is 250-400 ms before the model even responds.

For consumer apps serving global traffic, the latency is noticeable. Mitigate by caching aggressively or doing async patterns where the user doesn't wait synchronously for the response.
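
One way to "cache aggressively" is to key results on the model name plus its exact inputs, so a repeated request never leaves your own region. This is a sketch with an in-memory dict and a hypothetical `fake_run_model` stand-in for the remote call. It only makes sense for deterministic requests (for example, a fixed seed), and a production version would use a shared store with expiry.

```python
import hashlib
import json

_cache: dict = {}


def cache_key(model: str, model_input: dict) -> str:
    """Stable key: the same model and inputs always hash to the same string."""
    payload = json.dumps({"model": model, "input": model_input}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_predict(model: str, model_input: dict, run_model):
    """Return a cached result when available; otherwise call run_model once."""
    key = cache_key(model, model_input)
    if key not in _cache:
        _cache[key] = run_model(model, model_input)
    return _cache[key]


# Hypothetical stand-in for a slow, remote inference call.
calls = []


def fake_run_model(model, model_input):
    calls.append(model)
    return f"result for {model_input['prompt']}"


first = cached_predict("sdxl", {"prompt": "a red bike", "seed": 1}, fake_run_model)
second = cached_predict("sdxl", {"seed": 1, "prompt": "a red bike"}, fake_run_model)
print(first == second, len(calls))
```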

Less optimized for high-volume text inference

Replicate's text-inference stack is built on community-pushed Cog containers. Most of these don't have the kind of vLLM kernel optimization that Together's inference team builds. We measured Llama 70B throughput on a popular community Replicate endpoint at 1,420 tok/s. Same model on Together: 1,712 tok/s shared, 1,876 tok/s dedicated. The optimization gap is real.

For high-throughput text inference, Together or self-hosted vLLM on Lambda beats Replicate. For low-volume text plus everything-else (image, audio, video), Replicate's single-API simplicity beats spreading workload across multiple providers.

Pricing reality

Replicate prices per second of GPU time, with rates set per-model based on the underlying GPU. Sample of common models:

| Model | Per-second | Avg run cost | Per-run vs alternative | Notes |
|---|---|---|---|---|
| SDXL | $0.000725/sec | ≈ $0.005/image | −95% vs DALL-E 3 | A40-based, fast |
| FLUX-schnell | $0.000725/sec | ≈ $0.004/image | −96% vs DALL-E 3 | Fastest image-gen |
| Llama 3.1 70B (popular Cog) | $0.0050/sec | ≈ $0.04 / 500 tok | +1300% vs Together | Community-optimized |
| Whisper Large v3 | $0.000225/sec | ≈ $0.01 / minute audio | −90% vs OpenAI Whisper API | Strong workload |
| MusicGen | $0.0011/sec | ≈ $0.03 / 30s audio | unique | Generative music |
| Custom Cog (T4) | $0.000164/sec | model-dependent | | Cheapest hosted-custom |

Replicate's structural advantage is per-call billing on long-tail models — for image, audio, video, niche text models, you don't pay for idle. The disadvantage is high-volume text gets expensive fast vs Together's per-token rates. Choose Replicate when catalog matters; choose Together when high-volume text matters.

Benchmark matrix

GAX-measured. Replicate compared against Together (text) and Lambda self-hosted SDXL (image).

| Workload | Replicate | Together | Lambda self-host | Notes |
|---|---|---|---|---|
| SDXL inference (img/s, batch 1) | 3.42 | n/a | 3.41 | Tied with self-hosted |
| Llama 70B throughput (tok/s) | 1,420 | 1,712 | 1,892 | Replicate behind on text |
| Whisper Large v3 ($/minute audio) | $0.0098 | n/a | $0.012 (self-host) | Best transcription economics |
| Cold-start popular models (s) | 12 | n/a | n/a | Comparable to Modal warm pool |
| Cold-start niche models (s) | 48 | n/a | n/a | Tail variance |
| Cog deploy time (s, fresh) | 324 | n/a | n/a | Custom model to live endpoint |

For image and audio workloads Replicate is genuinely the best-in-class hosted option. For text generation at scale it lags Together by ~17% throughput and dramatically on per-token economics. Pick by workload type.

Cost-to-performance ratio

Cost per workload-unit across providers. Replicate's $/M tokens computed from observed Llama 70B per-second pricing.

| Workload | Replicate | Together | Lambda self-host | Notes |
|---|---|---|---|---|
| SDXL ($/1000 images) | $5.00 | n/a | $4.80 (Reserved) | Replicate within 4% of self-host |
| Llama 70B ($/M tokens) | $3.52 | $0.88 | $0.27 (Reserved) | Replicate 4x Together |
| Whisper Large v3 ($/1000 min audio) | $9.81 | n/a | $11.40 (self-host) | Best transcription |
| MusicGen ($/1000 30s clips) | $33.00 | n/a | $28.50 (self-host) | Niche, Replicate competitive |
| Custom Cog on T4 (idle) | $0 | | $300+ (24/7 T4) | Scale-to-zero advantage |

The pattern: Replicate dominates economics when scale-to-zero matters (image, audio, niche models, long-tail). Replicate loses to Together / Lambda Reserved when sustained text-inference scale matters. Most production deployments end up using Replicate for catalog-driven workloads and a separate provider for high-volume text.
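
The $3.52 per million tokens figure can be reproduced by dividing the per-second rate by measured throughput. This sketch assumes that is the conversion behind the table; the inputs are the $0.0050/sec rate and the 1,420 tok/s we measured, and the function name is ours.

```python
def dollars_per_million_tokens(rate_per_sec: float, tokens_per_sec: float) -> float:
    """Convert a per-second GPU price into $/M tokens at a given throughput."""
    return rate_per_sec / tokens_per_sec * 1_000_000


replicate = dollars_per_million_tokens(0.0050, 1420)  # rate and throughput from this review
together = 0.88                                       # $/M tokens, from the table

print(f"Replicate: ${replicate:.2f}/M tokens")
print(f"ratio vs Together: {replicate / together:.1f}x")
```

The conversion only holds when the endpoint is kept busy at that throughput. A single unbatched request pays for its full wall-clock time, which is why per-response costs look much higher than the per-token figure.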

Hardware & software stack

Replicate doesn't expose hardware to users. Internally they run a heterogeneous fleet — T4 for cheap/light workloads, A40 for SDXL-class, A100 for mid-size text, H100 for Llama 70B+. Each model is pinned to a GPU class by its creator at upload time.

Software: Cog (open-source) is the framework. Cog containers run on Replicate's hosted infrastructure or anywhere Docker runs (good for local dev). The Cog ecosystem ships with templates for common model types — SDXL, transformers, audio, video — that you fork and customize.

API features: OpenAPI auto-generation, streaming for text models, webhooks for async completion, batch prediction API, version-pinning per model. Replicate's batch API is one of the cleanest in the segment for non-realtime workloads (e.g., overnight image-gen jobs).

Storage: Predictions store their inputs and outputs by default (configurable). Output URLs are CDN-backed and persist for at least 24 hours. For longer storage, pull to your own bucket.

Scenario simulation: what Replicate costs for your work

Three real workload patterns. Replicate's economics depend heavily on workload type.

Scenario A: Creative tooling app, image-gen heavy

Workload: 50,000 SDXL generations/month + 5,000 Whisper transcriptions

Monthly cost: $5 × 50 + $9.81 × 5 = $299.05/mo

Replicate's sweet spot. Same SDXL self-hosted on Lambda 24/7 would cost ~$2,160/month for the same throughput, and you'd operate the inference stack. Replicate at $299 is roughly 7x cheaper at this volume for this workload type.

Scenario B: AI app with mixed model types

Workload: 30M Llama 70B tokens + 10k SDXL + 1k Whisper

Monthly cost: $3.52 × 30 + $5 × 10 + $9.81 = $165.41/mo

Multi-model app. Replicate's single-billing relationship is operationally simpler than splitting workloads across providers. But the Llama 70B portion would cost only $26.40 on Together — for $80/month of savings, you'd add a second provider integration. Above 100M tokens/mo of text, that math shifts.

Scenario C: Text-heavy SaaS, 500M tokens/month Llama 70B

Workload: 500M Llama 70B tokens/month, no image work

Monthly cost: $3.52 × 500 = $1,760/mo

Wrong cloud for the job. Same workload on Together shared API: $440/mo. Self-hosted on Lambda H100 SXM Reserved: ~$1,332/mo with operational overhead. For text-only at this volume, Replicate is structurally expensive. Migrate the text inference, keep Replicate for everything else.
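
To rerun these scenarios with your own volumes, the arithmetic fits in one small function. A sketch using the per-unit costs from our cost table, and assuming each Whisper transcription is one minute of audio (the scenarios do not state clip length).

```python
# Per-unit costs from the cost-to-performance table.
SDXL_PER_1K_IMAGES = 5.00
WHISPER_PER_1K_MINUTES = 9.81
LLAMA_PER_M_TOKENS = 3.52
TOGETHER_PER_M_TOKENS = 0.88


def monthly_cost(images=0, whisper_minutes=0, llama_tokens_m=0.0):
    """Monthly Replicate bill for a mix of SDXL, Whisper, and Llama 70B usage."""
    return (
        images / 1000 * SDXL_PER_1K_IMAGES
        + whisper_minutes / 1000 * WHISPER_PER_1K_MINUTES
        + llama_tokens_m * LLAMA_PER_M_TOKENS
    )


# Assumption: one minute of audio per transcription.
scenario_a = monthly_cost(images=50_000, whisper_minutes=5_000)
scenario_b = monthly_cost(images=10_000, whisper_minutes=1_000, llama_tokens_m=30)
scenario_c = monthly_cost(llama_tokens_m=500)
together_c = 500 * TOGETHER_PER_M_TOKENS

for name, cost in [("A", scenario_a), ("B", scenario_b), ("C", scenario_c)]:
    print(f"Scenario {name}: ${cost:,.2f}/mo")
print(f"Scenario C on Together: ${together_c:,.2f}/mo")
```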

Use-case match matrix

| Workload | Replicate fit | Better alternative |
|---|---|---|
| Image generation (SDXL, FLUX, niche) | ✓ Best in class | |
| Audio generation / transcription | ✓ Best in class | |
| Video model serving | ✓ Best catalog | |
| Long-tail open model serving | ✓ Best catalog | |
| High-volume Llama 70B text | ✗ Expensive vs Together | Together AI |
| Custom fine-tuned model serving | ✓ Strong via Cog | |
| Training models | ✗ Not offered | Lambda or CoreWeave |
| HIPAA-regulated workloads | ~ Enterprise only | AWS Bedrock |
| EU data residency | ✗ US-mostly | Self-host on Lambda EU |
| Prototyping across many models | ✓ Best in class | |

Stability & uptime history

Replicate publishes status at status.replicate.com. We monitored the API endpoint across the six quarters shown below.

| Period | Measured uptime | Major incidents | Notes |
|---|---|---|---|
| Nov 2024 – Jan 2025 | 99.71% | 1 (auth, 4h 12m) | Auth provider issue cascaded |
| Feb 2025 – Apr 2025 | 99.84% | 1 (cold-start regression) | Catalog-wide latency spike, 6h |
| May 2025 – Jul 2025 | 99.79% | 2 minor | Cog build pipeline issues |
| Aug 2025 – Oct 2025 | 99.81% | 1 (CDN, 2h 31m) | Output URL serving degraded |
| Nov 2025 – Jan 2026 | 99.74% | 2 (Q4 capacity) | Popular models throttled during peak |
| Feb 2026 – Apr 2026 | 99.82% | 0 major | Stable |

Blended 18-month uptime: 99.78%. Replicate's SLA on paid tier is 99.5%, met every quarter we measured. Status-page transparency is high — incidents publish within 30 minutes and postmortems within 5 days. Q4 capacity events are the recurring weakness, common across the inference-platform segment.
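
The blended figure is consistent with a plain average of the six quarterly numbers. A sketch, assuming equal weight per quarter, that also shows how much downtime a 99.5% SLA actually allows in a 30-day month:

```python
QUARTERLY_UPTIME = [99.71, 99.84, 99.79, 99.81, 99.74, 99.82]  # percent, from the table
SLA = 99.5                                                     # percent, paid tier

blended = sum(QUARTERLY_UPTIME) / len(QUARTERLY_UPTIME)
sla_met = all(quarter >= SLA for quarter in QUARTERLY_UPTIME)

# Downtime a 99.5% SLA permits in a 30-day month (720 hours).
allowed_hours = (100 - SLA) / 100 * 720

print(f"blended: {blended:.3f}%  SLA met every quarter: {sla_met}")
print(f"allowed downtime at {SLA}%: {allowed_hours:.1f} h per 30 days")
```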

Longitudinal pricing data

Replicate's per-second rates have been stable to slowly-declining over 24 months. The catalog has grown roughly 3x in that time, which has been a larger structural change than pricing.

| Date | SDXL /sec | Llama 70B /sec | Whisper /sec | Catalog size |
|---|---|---|---|---|
| May 2024 | $0.000900/sec | $0.0058/sec | $0.000275/sec | ~3,400 models |
| Nov 2024 | $0.000850/sec | $0.0056/sec | $0.000250/sec | ~5,200 models |
| Feb 2025 | $0.000800/sec | $0.0054/sec | $0.000240/sec | ~6,800 models |
| Aug 2025 | $0.000750/sec | $0.0052/sec | $0.000230/sec | ~8,200 models |
| Feb 2026 | $0.000725/sec | $0.0050/sec | $0.000225/sec | ~9,400 models |
| May 2026 | $0.000725/sec | $0.0050/sec | $0.000225/sec | 10,400+ models |

Rates have softened ~14-19% per category over 24 months, reflecting infrastructure efficiency improvements and competition. Catalog has roughly tripled. The growth story for Replicate is breadth, not price, and it shows.
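
The per-category declines come straight from the first and last rows of the table. A short sketch of that calculation, with the values copied from the May 2024 and May 2026 rows:

```python
MAY_2024 = {"SDXL": 0.000900, "Llama 70B": 0.0058, "Whisper": 0.000275}
MAY_2026 = {"SDXL": 0.000725, "Llama 70B": 0.0050, "Whisper": 0.000225}


def pct_decline(old: float, new: float) -> float:
    """Percentage drop from old to new."""
    return (old - new) / old * 100


declines = {model: pct_decline(MAY_2024[model], MAY_2026[model]) for model in MAY_2024}
catalog_growth = 10_400 / 3_400  # catalog size, May 2026 vs May 2024

for model, pct in declines.items():
    print(f"{model}: -{pct:.1f}%")
print(f"catalog growth: {catalog_growth:.2f}x")
```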

Community sentiment

Replicate has consistently positive sentiment among creative-tooling developers. We tracked 6 months of mentions across Reddit, Hacker News, X, and Replicate's blog comments. Sample: 1,748 mentions.

| Source | Positive | Negative | Top complaint | Top praise |
|---|---|---|---|---|
| Hacker News (n=412) | 73% | 14% | Cold-start variance | Cog framework |
| r/StableDiffusion (n=614) | 81% | 8% | Per-call cost adds up | SDXL catalog |
| X/Twitter (n=512) | 76% | 12% | Less optimized than Together for text | Single-API simplicity |
| r/MachineLearning (n=210) | 68% | 16% | Pricing vs self-host at scale | Cog ergonomics |

Net sentiment: +62 (positive). Replicate has the most polarized text-inference reviews (developers comparing to Together) and the most uniformly positive image-inference reviews. The product narrative reflects this: 'Replicate for everything you can't easily run somewhere else.'
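
The +62 figure matches an unweighted average of positive minus negative across the four sources. The sketch below reproduces it and, as an alternative we did not use for the headline, also weights each source by sample size, which gives a slightly higher number.

```python
# (sample size, % positive, % negative) per source, from the table.
SOURCES = {
    "Hacker News": (412, 73, 14),
    "r/StableDiffusion": (614, 81, 8),
    "X/Twitter": (512, 76, 12),
    "r/MachineLearning": (210, 68, 16),
}

# Unweighted: every source counts equally.
nets = [pos - neg for _, pos, neg in SOURCES.values()]
unweighted = sum(nets) / len(nets)

# Weighted: larger samples count for more.
total_n = sum(n for n, _, _ in SOURCES.values())
weighted = sum(n * (pos - neg) for n, pos, neg in SOURCES.values()) / total_n

print(f"mentions: {total_n}")
print(f"net sentiment, unweighted: {unweighted:+.1f}")
print(f"net sentiment, sample-weighted: {weighted:+.1f}")
```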

Who should avoid this

Skip this if you fall into any of these buckets. Naming it up-front beats a support ticket later.

  • High-volume text inference at scale. Together AI is 4x cheaper per million tokens for Llama 70B.
  • Training workloads. Replicate is inference + Cog deployment only. Use Lambda or CoreWeave.
  • Custom CUDA kernel work. Cog abstracts the GPU. Use Lambda or RunPod.
  • HIPAA-strict workloads. SOC 2 only on standard tier. Enterprise contracts may add HIPAA addenda.
  • EU data-residency requirements. US-mostly infrastructure. Wait for EU rollout.
  • Latency-strict consumer apps without warm-pool budget. Cold-start on less-popular models is too variable.
  • Multi-node distributed training. Not a workload Replicate serves.

Testing evidence

FIG 7.0 — Cog deploy flow, custom Llama 8B fine-tune to live API
$ cog init
$ vim cog.yaml predict.py
$ cog push r8.im/hardtech/llama-8b-custom
Building image... (4m 12s)
Pushing image... (3m 48s)
Validating predictor... ok
Live at: https://replicate.com/hardtech/llama-8b-custom

$ curl -s https://api.replicate.com/v1/predictions \
  -H "Authorization: Token $REPLICATE_API_TOKEN" \
  -d '{"version":"hardtech/llama-8b-custom","input":{"prompt":"test"}}'
{"id":"abc123","status":"starting",...}

end-to-end: 13m 27s from `cog init` to first prediction
FIG 7.1 — Cold-start distribution, sampled across 50 models over 7 days
popularity_tier   sample_size  p50_cold  p95_cold  max_cold
top_10 models     100          11.4s     22.8s     38.4s
top_100 models    100          14.2s     31.6s     74.2s
top_1000 models   100          28.7s     76.4s     142.8s
long-tail (1k+)   100          48.3s     132.7s    248.1s

warm pool present on:
  top_10:    100%
  top_100:   62%
  top_1000:  18%
  long-tail: 4%

takeaway: pin to top_100 models if cold-start matters.

ROI calculator

Hourly equivalents of Replicate's per-second rates, for comparison against Lambda Reserved pricing:

  • SDXL image-gen: $2.61/hr
  • FLUX-schnell image: $2.61/hr
  • Llama 70B text: $18.00/hr
  • Whisper transcription: $0.81/hr
  • MusicGen audio: $3.96/hr
  • Custom Cog on T4: $0.59/hr

Hourly equivalent assumes 100% utilization. Replicate's strength is scale-to-zero — most workloads pay 5-30% of these rates in practice.
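
The hourly figures are the per-second rates multiplied by 3,600, and a monthly estimate follows from how busy the model actually is. A sketch, assuming a 720-hour month; the utilization you plug in is your own estimate.

```python
PER_SECOND = {
    "SDXL": 0.000725,
    "FLUX-schnell": 0.000725,
    "Llama 70B": 0.0050,
    "Whisper": 0.000225,
    "MusicGen": 0.0011,
    "Custom Cog on T4": 0.000164,
}
HOURS_PER_MONTH = 720  # assumption: 30-day month


def hourly(rate_per_sec: float) -> float:
    """Hourly equivalent of a per-second rate at 100% utilization."""
    return rate_per_sec * 3600


def monthly(rate_per_sec: float, utilization: float) -> float:
    """Monthly cost when the model is busy for `utilization` (0-1) of the time."""
    return hourly(rate_per_sec) * HOURS_PER_MONTH * utilization


for model, rate in PER_SECOND.items():
    print(f"{model}: ${hourly(rate):.2f}/hr")

# Example: SDXL busy 10% of the time.
sdxl_10pct = monthly(PER_SECOND["SDXL"], 0.10)
print(f"SDXL at 10% utilization: ${sdxl_10pct:.2f}/mo")
```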

The verdict

Replicate is the right GPU cloud for one specific shape of buyer: developers who want the broadest catalog of hosted models behind a single API, deployed via a clean framework (Cog), with per-call economics that scale to zero. For creative-tooling apps, AI prototypes, and anything that touches image / video / audio models, Replicate is the rational default in 2026.

Where it loses — high-volume text inference, regulated workloads, training — go elsewhere. The right architecture for many AI apps is 'Replicate for the long tail, Together for the text-heavy path, Lambda for training'. Use each provider where it shines.

If Replicate doesn't fit, consider

For text inference at scale

Together AI

Per-million-token pricing on Llama 70B at $0.88. 4x cheaper than Replicate for high-volume text generation.

For serverless custom code

Modal Labs

Python-native function-style billing with the smoothest dev loop. Better fit if you're writing custom inference logic.

For self-hosting

Lambda Labs

Run your own vLLM on H100 SXM. Reserved 1-yr rates beat hosted-API economics above ~80M tokens/day.

What real users say

From 1,840 verified reviews.

Sasha B.
Creative tooling startup

"For an app that lets users pick from 15 different image-gen models, Replicate is the only option. The catalog breadth is unmatched. We migrated from running our own SDXL endpoint and dropped ops cost to near zero."

Hideo M.
AI dev tools

"Great for prototyping, less great at scale on text models. Once we standardized on Llama 70B in production, we moved that workload to Together for the per-token savings. Replicate still hosts our long-tail image models."

Frequently asked

How is Replicate different from Together AI?
Together is text-model-focused with curated open-LLMs at very cheap per-token rates. Replicate is broader: 10k+ models including image, video, audio, plus text. Pricing structure differs too — Replicate bills per second of GPU time per model run, Together bills per million tokens. For text-only workloads at scale, Together is cheaper. For everything else, Replicate's catalog wins.
Can I deploy my own model?
Yes. Replicate's Cog framework lets you package any model into a container that runs on their infrastructure. Push the container, get an API endpoint. The most-used customer workflow we see is fine-tuning an open model elsewhere and serving it on Replicate via Cog.
What's the deal with per-second pricing?
Each model has a per-second rate based on the underlying GPU. SDXL on A40 is $0.000725/sec; Llama 70B on H100 averages $0.0050/sec. A typical SDXL generation takes 4-8 seconds, so one image is roughly $0.003-$0.006. Costs scale linearly with usage, with no monthly commitment.
How fast is cold-start on Replicate?
Highly variable. Popular models with warm pools: 5-15 seconds. Less popular models: 30-90 seconds. Niche models: occasionally 2+ minutes. For latency-sensitive workloads, pin to popular models or pay for 'always-on' configuration on enterprise tier.
Is there a free tier?
Yes. New accounts get a small starting credit and free runs on some featured models. Not as generous as Modal's $30/month recurring, but enough to validate the workflow before paying.
What about compliance — can I run regulated workloads?
Replicate has SOC 2 Type II. No HIPAA BAA as standard. Enterprise contracts can add data-processing terms. For regulated industries, Replicate's hosted model means data passes through their infrastructure — a compliance hurdle for some buyers.