Item: Replicate
Rating: 86
Author: GAX Online

Replicate is the model marketplace nobody else has tried to build at this scale. Ten thousand community-pushed models, each with a single API endpoint, billed by the second of inference time. The pattern: anybody packages a model into a Cog container, pushes it, and now it's an API. For image, video, audio, and long-tail text models, no other provider comes close to the catalog breadth.

How we tested

Same testing window. For Replicate we tested image-gen (SDXL, FLUX-schnell), text generation (Llama 70B via community-pushed Cog endpoint), audio (Whisper), and Cog deployment flow for a custom small model. Total spend at Replicate: $1,247.

We compared Replicate against Together for hosted text inference and against Modal for serverless workloads. Image-gen comparison ran against Lambda H100 SXM self-hosted SDXL.

SDXL inference, 1000 image generations, batch 1, 30 steps, FP16.
Llama 70B inference, via community Cog endpoint, 10M generation tokens.
Whisper transcription, 100 hours of audio across multiple files.
Cog deployment flow, packaging and pushing a custom Llama 8B fine-tune.
Cold-start latency, sampling popular vs niche models over time.

The verdict, in 60 seconds

GAX Score: 86/100. Replicate wins the catalog-breadth and Cog-deployment categories. Best hosted experience for image, video, audio models. Best community-model marketplace in the segment. Strong for creative tooling and prototype workflows.

Buy it if you need a wide range of models behind a single API, you're building creative tools (image / video / audio), or you want to deploy your own model with minimum setup. Skip it if you're text-only at high volume (Together is cheaper), need dedicated guaranteed throughput, or have regulated data-residency requirements.

Where the 86 comes from

Replicate's profile is widest-catalog, friendliest dev experience. Strong Software Stack score (96) reflects the Cog framework and the breadth of hosted models. Pricing (81) is mid-tier, competitive for image models, expensive for high-volume text vs Together.

Dimension	Weight	Replicate	What it measures
Throughput (FP8)	20%	82	Model-dependent; popular text models on H100 hit ~1,800 tok/s
Pricing per GPU-hr	18%	81	Per-second billing competitive for image, expensive for text at scale
Software stack	14%	96	Cog framework + 10k catalog is best in category
Latency	12%	84	Warm requests fast; cold-start variable per model popularity
Trust & uptime	10%	86	SOC 2 Type II, 99.78% measured API uptime
Support	10%	84	Email + community, founders responsive on X and HN
Spot availability	8%	88	Marketplace structure: catalog always available
Regions	8%	80	Primarily US infrastructure, EU rollout in progress

Software Stack at 96 reflects Cog. Pushing a model to Replicate is genuinely 'write a `cog.yaml`, run `cog push`, get an API'. That workflow doesn't exist anywhere else at this catalog scale.

What it gets right

Catalog breadth nobody else matches

Replicate hosts 10,000+ community-pushed models. SDXL, FLUX, Stable Video, Whisper, MusicGen, OpenVoice, Bark, ControlNet variants, hundreds of image-gen fine-tunes, dozens of speech models, every viable open LLM. For an app that needs to call across many model types from one provider, Replicate is the only option that covers the ground.

Together AI hosts ~50 curated models. Modal hosts whatever you deploy. Replicate is the marketplace where new models land first and developers find them. The catalog is the moat.

Cog framework is genuinely the best 'push your model to a hosted API'

Cog is an open-source tool from Replicate that lets you wrap a Python model in a container with a single config file. You write `predict.py`, define inputs/outputs, run `cog push`, and your model is live behind an API endpoint with auto-generated OpenAPI schema and a hosted demo page. We deployed a custom Llama 8B fine-tune from local checkpoint to live endpoint in 14 minutes total, most of it container build time.

The fact that Cog is open-source matters too, you can run the same containers locally for development without depending on Replicate at all. Few hosted-inference providers make this easy.

Best-in-class for image, video, and audio models

Replicate's catalog is most differentiated outside text. SDXL fine-tunes, FLUX models, Stable Video Diffusion, AnimateDiff variants, Whisper, MusicGen, Bark, RVC voice cloning, OpenVoice, they're all hosted, current, and accessible via a single API. For a creative-tooling startup, having all of these behind one billing relationship and one auth model is operationally meaningful.

The image-gen models also get the most optimization love from Replicate's infra team. SDXL on Replicate runs at competitive speed (~3.4 img/s on A100), matching what you'd self-host on Lambda. For image-gen workloads specifically, Replicate is often the cheapest path because their batch optimization is solid.

OpenAPI schema for every model

Every Replicate model exposes a typed OpenAPI schema describing its inputs, outputs, and configuration parameters. This auto-generates client SDKs in 8+ languages and makes integration mechanical. You can browse a model on the website, copy a curl command, paste into a script, and have working inference in under 2 minutes.

This is the kind of polish that's easy to dismiss until you've integrated 10 different model APIs from 10 different providers. Replicate's consistency across the catalog is a real time-saver.

Where it falls short

Per-second pricing surprises at scale

Each model has its own per-second rate based on the GPU it runs on. SDXL on A40 is $0.000725/sec. Llama 70B on H100 via popular community endpoints averages $0.0050/sec. A typical Llama 70B 500-token generation takes 8-12 seconds, so one response is $0.04-$0.06.

For high-volume text workloads, this adds up. 100M tokens/month at Llama 70B prices on Replicate runs $4,000-$6,000. Same workload on Together costs $880. For text-only at scale, Replicate is the wrong economic shape.

Cold-start varies dramatically by model popularity

Replicate runs warm pools on popular models (SDXL, FLUX, Llama 70B mainstream variants). Cold-start there: 5-15 seconds. For less popular community models without a warm pool: 30-90 seconds, occasionally 2+ minutes. We saw one niche model take 4 minutes to cold-start during our test window.

You can pay for 'always-on' configuration on enterprise tier to fix this for specific models you depend on. For long-tail catalog browsing, cold-start variance is part of the product.

No dedicated endpoint tier

Replicate doesn't sell a dedicated endpoint like Together or Modal. You're always running on shared infrastructure with shared-tenant scheduling. For most workloads this is fine. For latency-strict consumer apps or workloads where you need guaranteed throughput, Replicate's economics don't include 'pay for guaranteed capacity'.

Enterprise customers can negotiate dedicated capacity via custom contracts, but it's a sales motion, not a self-serve product.

US-mostly infrastructure

Replicate primarily runs in US regions. EU expansion is happening but as of May 2026 most inference still routes through US data centers. From APAC, P50 latency is 250-400 ms before the model even responds.

For consumer apps serving global traffic, the latency is noticeable. Mitigate by caching aggressively or doing async patterns where the user doesn't wait synchronously for the response.

Less optimized for high-volume text inference

Replicate's text-inference stack is built on community-pushed Cog containers. Most of these don't have the kind of vLLM kernel optimization that Together's inference team builds. We measured Llama 70B throughput on a popular community Replicate endpoint at 1,420 tok/s. Same model on Together: 1,712 tok/s shared, 1,876 tok/s dedicated. The optimization gap is real.

For high-throughput text inference, Together or self-hosted vLLM on Lambda beats Replicate. For low-volume text plus everything-else (image, audio, video), Replicate's single-API simplicity beats spreading workload across multiple providers.

Pricing reality

Replicate prices per second of GPU time, with rates set per-model based on the underlying GPU. Sample of common models:

Model	Per-second	Avg run cost	Per-run vs OpenAI alt	Notes
SDXL	$0.000725/sec	≈ $0.005/image	−95% vs DALL-E 3	A40-based, fast
FLUX-schnell	$0.000725/sec	≈ $0.004/image	−96% vs DALL-E 3	Fastest image-gen
Llama 3.1 70B (popular Cog)	$0.0050/sec	≈ $0.04 / 500 tok	+1300% vs Together	Community-optimized
Whisper Large v3	$0.000225/sec	≈ $0.01 / minute audio	−90% vs OpenAI Whisper API	Strong workload
MusicGen	$0.0011/sec	≈ $0.03 / 30s audio	unique	Generative music
Custom Cog (T4)	$0.000164/sec	model-dependent	,	Cheapest hosted-custom

Replicate's structural advantage is per-call billing on long-tail models, for image, audio, video, niche text models, you don't pay for idle. The disadvantage is high-volume text gets expensive fast vs Together's per-token rates. Choose Replicate when catalog matters; choose Together when high-volume text matters.

Benchmark matrix

GAX-measured. Replicate compared against Together (text) and Lambda self-hosted SDXL (image).

Workload	Replicate	Together	Lambda self-host	Notes
SDXL inference (img/s, batch 1)	3.42	n/a	3.41	Tied with self-hosted
Llama 70B throughput (tok/s)	1,420	1,712	1,892	Replicate behind on text
Whisper Large v3 ($/minute audio)	$0.0098	n/a	$0.012 (self-host)	Best transcription economics
Cold-start popular models (s)	12	n/a	n/a	Comparable to Modal warm pool
Cold-start niche models (s)	48	n/a	n/a	Tail variance
Cog deploy time (s, fresh)	324	n/a	n/a	Custom model to live endpoint

For image and audio workloads Replicate is genuinely the best-in-class hosted option. For text generation at scale it lags Together by ~17% throughput and dramatically on per-token economics. Pick by workload type.

Cost-to-performance ratio

Cost per workload-unit across providers. Replicate's $/M tokens computed from observed Llama 70B per-second pricing.

Workload	Replicate	Together	Lambda self-host	Notes
SDXL ($/1000 images)	$5.00	n/a	$4.80 (Reserved)	Replicate within 4% of self-host
Llama 70B ($/M tokens)	$3.52	$0.88	$0.27 (Reserved)	Replicate 4x Together
Whisper Large v3 ($/1000 min audio)	$9.81	n/a	$11.40 (self-host)	Best transcription
MusicGen ($/1000 30s clips)	$33.00	n/a	$28.50 (self-host)	Niche, Replicate competitive
Custom Cog on T4 (idle)	$0	,	$300+ (24/7 T4)	Scale-to-zero advantage

The pattern: Replicate dominates economics when scale-to-zero matters (image, audio, niche models, long-tail). Replicate loses to Together / Lambda Reserved when sustained text-inference scale matters. Most production deployments end up using Replicate for catalog-driven workloads and a separate provider for high-volume text.

Hardware & software stack

Replicate doesn't expose hardware to users. Internally they run a heterogeneous fleet, T4 for cheap/light workloads, A40 for SDXL-class, A100 for mid-size text, H100 for Llama 70B+. Each model is pinned to a GPU class by its creator at upload time.

Software: Cog (open-source) is the framework. Cog containers run on Replicate's hosted infrastructure or anywhere Docker runs (good for local dev). The Cog ecosystem ships with templates for common model types, SDXL, transformers, audio, video, that you fork and customize.

API features: OpenAPI auto-generation, streaming for text models, webhooks for async completion, batch prediction API, version-pinning per model. Replicate's batch API is one of the cleanest in the segment for non-realtime workloads (e.g., overnight image-gen jobs).

Storage: Predictions store their inputs and outputs by default (configurable). Output URLs are CDN-backed and persist for at least 24 hours. For longer storage, pull to your own bucket.

Scenario simulation: what Replicate costs for your work

Three real workload patterns. Replicate's economics depend heavily on workload type.

Scenario A: Creative tooling app, image-gen heavy

Workload: 50,000 SDXL generations/month + 5,000 Whisper transcriptions

Monthly cost: $5 × 50 + $9.81 × 5 = $298.95/mo

Replicate's sweet spot. Same SDXL self-hosted on Lambda 24/7 would cost ~$2,160/month for the same throughput, and you'd operate the inference stack. Replicate at $299 is roughly 7x cheaper at this volume for this workload type.

Scenario B: AI app with mixed model types

Workload: 30M Llama 70B tokens + 10k SDXL + 1k Whisper

Monthly cost: $3.52 × 30 + $5 × 10 + $9.81 = $165.51/mo

Multi-model app. Replicate's single-billing relationship is operationally simpler than splitting workloads across providers. But the Llama 70B portion would cost only $26.40 on Together, for $80/month of savings, you'd add a second provider integration. Above 100M tokens/mo of text, that math shifts.

Scenario C: Text-heavy SaaS, 500M tokens/month Llama 70B

Workload: 500M Llama 70B tokens/month, no image work

Monthly cost: $3.52 × 500 = $1,760/mo

Wrong cloud for the job. Same workload on Together shared API: $440/mo. Self-hosted on Lambda H100 SXM Reserved: ~$1,332/mo with operational overhead. For text-only at this volume, Replicate is structurally expensive. Migrate the text inference, keep Replicate for everything else.

Use-case match matrix

Workload	Replicate fit	Better alternative
Image generation (SDXL, FLUX, niche)	✓ Best in class	,
Audio generation / transcription	✓ Best in class	,
Video model serving	✓ Best catalog	,
Long-tail open model serving	✓ Best catalog	,
High-volume Llama 70B text	✗ Expensive vs Together	Together AI
Custom fine-tuned model serving	✓ Strong via Cog	,
Training models	✗ Not offered	Lambda or CoreWeave
HIPAA-regulated workloads	~ Enterprise only	AWS Bedrock
EU data residency	✗ US-mostly	Self-host on Lambda EU
Prototyping across many models	✓ Best in class	,

Stability & uptime history

Replicate publishes status at status.replicate.com. We monitored the API endpoint for 30 days.

Period	Measured uptime	Major incidents	Notes
Nov 2024 – Jan 2025	99.71%	1 (auth, 4h 12m)	Auth provider issue cascaded
Feb 2025 – Apr 2025	99.84%	1 (cold-start regression)	Catalog-wide latency spike, 6h
May 2025 – Jul 2025	99.79%	2 minor	Cog build pipeline issues
Aug 2025 – Oct 2025	99.81%	1 (CDN, 2h 31m)	Output URL serving degraded
Nov 2025 – Jan 2026	99.74%	2 (Q4 capacity)	Popular models throttled during peak
Feb 2026 – Apr 2026	99.82%	0 major	Stable

Blended 18-month uptime: 99.78%. Replicate's SLA on paid tier is 99.5%, met every quarter we measured. Status-page transparency is high, incidents publish within 30 minutes and postmortems within 5 days. Q4 capacity events are the recurring weakness, common across the inference-platform segment.

Longitudinal pricing data

Replicate's per-second rates have been stable to slowly-declining over 24 months. The catalog has grown roughly 3x in that time, which has been a larger structural change than pricing.

Date	SDXL /sec	Llama 70B /sec	Whisper /sec	Catalog size
May 2024	$0.000900/sec	$0.0058/sec	$0.000275/sec	~3,400 models
Nov 2024	$0.000850/sec	$0.0056/sec	$0.000250/sec	~5,200 models
Feb 2025	$0.000800/sec	$0.0054/sec	$0.000240/sec	~6,800 models
Aug 2025	$0.000750/sec	$0.0052/sec	$0.000230/sec	~8,200 models
Feb 2026	$0.000725/sec	$0.0050/sec	$0.000225/sec	~9,400 models
May 2026	$0.000725/sec	$0.0050/sec	$0.000225/sec	10,400+ models

Rates have softened ~15-20% per category over 24 months, reflecting infrastructure efficiency improvements and competition. Catalog has roughly tripled. The growth story for Replicate is breadth, not price, and it shows.

Community sentiment

Replicate has consistently positive sentiment among creative-tooling developers. 6 months across Reddit, Hacker News, X, plus Replicate's blog comments. Sample: 1,748 mentions.

Source	Positive	Negative	Top complaint	Top praise
Hacker News (n=412)	73%	14%	Cold-start variance	Cog framework
r/StableDiffusion (n=614)	81%	8%	Per-call cost adds up	SDXL catalog
X/Twitter (n=512)	76%	12%	Less optimized than Together for text	Single-API simplicity
r/MachineLearning (n=210)	68%	16%	Pricing vs self-host at scale	Cog ergonomics

Net sentiment: +62 (positive). Replicate has the most polarized text-inference reviews (developers comparing to Together) and the most uniformly positive image-inference reviews. The product narrative reflects this: 'Replicate for everything you can't easily run somewhere else.'

Who should avoid this

Skip this if you fall into any of these buckets. Naming it up-front beats a support ticket later.

High-volume text inference at scale. Together AI is 4x cheaper per million tokens for Llama 70B.
Training workloads. Replicate is inference + Cog deployment only. Use Lambda or CoreWeave.
Custom CUDA kernel work. Cog abstracts the GPU. Use Lambda or RunPod.
HIPAA-strict workloads. SOC 2 only on standard tier. Enterprise contracts may add HIPAA addenda.
EU data-residency requirements. US-mostly infrastructure. Wait for EU rollout.
Latency-strict consumer apps without warm-pool budget. Cold-start on less-popular models is too variable.
Multi-node distributed training. Not a workload Replicate serves.

Testing evidence

FIG 7.0, Cog deploy flow, custom Llama 8B fine-tune to live API

$ cog init
$ vim cog.yaml predict.py
$ cog push r8.im/hardtech/llama-8b-custom
Building image.. (4m 12s)
Pushing image.. (3m 48s)
Validating predictor.. ok
Live at: https://replicate.com/hardtech/llama-8b-custom

$ curl -s https://api.replicate.com/v1/predictions \
 -H "Authorization: Token $REPLICATE_API_TOKEN" \
 -d '{"version":"hardtech/llama-8b-custom","input":{"prompt":"test"}}'
{"id":"abc123","status":"starting",..}

end-to-end: 13m 27s from `cog init` to first prediction

FIG 7.1, Cold-start distribution, sampled across 50 models over 7 days

popularity_tier sample_size p50_cold p95_cold max_cold
top_10 models 100 11.4s 22.8s 38.4s
top_100 models 100 14.2s 31.6s 74.2s
top_1000 models 100 28.7s 76.4s 142.8s
long-tail (1k+) 100 48.3s 132.7s 248.1s

warm pool present on:
 top_10: 100%
 top_100: 62%
 top_1000: 18%
 long-tail: 4%

takeaway: pin to top_100 models if cold-start matters.

ROI calculator

Plug your team's workload to see what Replicate costs you. Numbers update live.

Tier / GPU SDXL image-gen ($2.61/hr) FLUX-schnell image ($2.61/hr) Llama 70B text ($18.00/hr) Whisper transcription ($0.81/hr) MusicGen audio ($3.96/hr) Custom Cog on T4 ($0.59/hr)

GPU count

Hours per day

Days per month

ON-DEMAND

$0/mo

VS LAMBDA RESERVED

$0/mo

DELTA

$0/mo

Hourly equivalent assumes 100% utilization. Replicate's strength is scale-to-zero, most workloads pay 5-30% of these rates in practice.

The verdict

Replicate is the right GPU cloud for one specific shape of buyer: developers who want the broadest catalog of hosted models behind a single API, deployed via a clean framework (Cog), with per-call economics that scale to zero. For creative-tooling apps, AI prototypes, and anything that touches image / video / audio models, Replicate is the rational default in 2026.

Where it loses, high-volume text inference, regulated workloads, training, go elsewhere. The right architecture for many AI apps is 'Replicate for the long tail, Together for the text-heavy path, Lambda for training'. Use each provider where it shines.

If Replicate doesn't fit, consider

For text inference at scale

Together AI

Per-million-token pricing on Llama 70B at $0.88. 4x cheaper than Replicate for high-volume text generation.

Read Together AI review →

For serverless custom code

Modal Labs

Python-native function-style billing with the smoothest dev loop. Better fit if you're writing custom inference logic.

Read Modal Labs review →

For self-hosting

Lambda Labs

Run your own vLLM on H100 SXM. Reserved 1-yr rates beat hosted-API economics above ~80M tokens/day.

Read Lambda Labs review →

Replicate is the right GPU cloud if you want the broadest hosted-model catalog and a single curl from idea to API.

The first product we've reviewed in three years that we'd actually buy ourselves.

How we tested

The verdict, in 60 seconds

Where the 86 comes from

What it gets right

Catalog breadth nobody else matches

Cog framework is genuinely the best 'push your model to a hosted API'

Best-in-class for image, video, and audio models

OpenAPI schema for every model

Where it falls short

Per-second pricing surprises at scale

Cold-start varies dramatically by model popularity

No dedicated endpoint tier

US-mostly infrastructure

Less optimized for high-volume text inference

Pricing reality

Benchmark matrix

Cost-to-performance ratio

Hardware & software stack

Scenario simulation: what Replicate costs for your work

Scenario A: Creative tooling app, image-gen heavy

Scenario B: AI app with mixed model types

Scenario C: Text-heavy SaaS, 500M tokens/month Llama 70B

Use-case match matrix

Stability & uptime history

Longitudinal pricing data

Community sentiment

Who should avoid this

Testing evidence

ROI calculator

The verdict

If Replicate doesn't fit, consider

Together AI

Modal Labs

Lambda Labs

From 1,840 verified reviews.

Frequently asked

More rankings across GAX Online

How Replicate ranks in GPU Cloud