How we tested
Same eleven-week testing window. We benchmarked Modal across three workload shapes: bursty inference (50k req/day at 200ms each), sustained inference (24/7 H100 serving), and short-burst fine-tuning (10-minute jobs). Total spend at Modal during the test: $2,140.
We compared Modal's cold-start latency, cost-per-request, and developer ergonomics against RunPod Serverless, Replicate, and a dedicated Lambda H100 SXM VM running the same vLLM endpoint.
- Cold-start latency, 50 concurrent requests to idle endpoint, sampled across warm-pool configurations.
- Sustained throughput, 24-hour run at constant load, comparing to dedicated VM cost.
- Function-define-to-first-request, time from `modal deploy` to first successful response.
- Per-request cost, calculated from active-second billing on production-shaped traffic.
- Image-build to function-ready, time on a fresh container image build.
Modal's pricing is per-second, so 'monthly cost' depends entirely on utilization. The scenarios below show three realistic profiles with measured numbers.
The verdict, in 60 seconds
GAX Score: 90/100. Modal Labs wins the Python-native serverless category. Best developer ergonomics in the segment, sub-15s cold-start with warm pool, per-second billing that scales to zero. SOC 2 Type II, real enterprise customers (Suno, Pika, Cursor reportedly).
Buy it if your workload is bursty (idle most of the day), you write Python, you want to ship in minutes not hours, or you're prototyping ML services. Skip it if you need SSH access, sustained 24/7 utilization (a dedicated VM is cheaper), multi-node tightly-coupled training, or HIPAA / FedRAMP compliance. Modal is excellent at one specific shape of workload; outside that shape, it's not the answer.
Where the 90 comes from
Modal's profile is the inverse of CoreWeave's: highest possible Software Stack score (98), strong Latency and Availability (you can't 'run out' of serverless capacity), middling Pricing for sustained workloads and moderate Throughput compared to dedicated GPUs.
| Dimension | Weight | Modal Labs | Notes |
|---|---|---|---|
| Throughput (FP8) | 20% | 86 | Same silicon, slight overhead from Modal's runtime container |
| Pricing per GPU-hr | 18% | 88 | Cheap below 30% utilization, expensive above 60% |
| Software stack | 14% | 98 | Python-native, deploy in 1 line, instant dev loop |
| Latency | 12% | 92 | Cold-start 8-15s with warm pool; warm requests ~normal |
| Trust & uptime | 10% | 89 | SOC 2 Type II, 99.91% measured platform uptime |
| Support | 10% | 92 | Slack-based, founders responsive, no phone for non-enterprise |
| Spot availability | 8% | 94 | Serverless = always there, no capacity coin-flip |
| Regions | 8% | 84 | 4 US + 1 EU; smaller than hyperscalers but covers most demand |
Software Stack at 98 is the highest score we've issued any provider on this dimension. Modal earned it. The dev loop from `modal serve` to seeing your function respond is the smoothest GPU developer experience that exists in 2026.
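The review does not spell out how the eight dimension scores roll up into the headline number. A minimal sketch, assuming a plain weighted average of the table rows (the aggregation method is our assumption, not something the review states), lands on the published 90:

```python
# Weights and scores copied from the dimension table above.
# Assumption: the GAX Score is a simple weighted average of these rows.
DIMENSIONS = {
    "Throughput (FP8)": (0.20, 86),
    "Pricing per GPU-hr": (0.18, 88),
    "Software stack": (0.14, 98),
    "Latency": (0.12, 92),
    "Trust & uptime": (0.10, 89),
    "Support": (0.10, 92),
    "Spot availability": (0.08, 94),
    "Regions": (0.08, 84),
}


def weighted_score(dimensions):
    """Return the weight-normalized average of (weight, score) pairs."""
    total_weight = sum(weight for weight, _ in dimensions.values())
    weighted_sum = sum(weight * score for weight, score in dimensions.values())
    return weighted_sum / total_weight


score = weighted_score(DIMENSIONS)
print(f"Weighted score: {score:.2f} (rounds to {round(score)})")
```

Under that assumption the Software Stack score alone contributes about 13.7 of the 90 points, which is why it carries the profile despite a 14% weight.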
What it gets right
The Python decorator pattern is genuinely better
Modal's core abstraction is a Python decorator. You write a function, decorate it with `@app.function(gpu='H100')`, and Modal handles everything else — packaging, scheduling, autoscaling, billing. There's no Dockerfile to write (unless you want to), no Kubernetes config, no SSH setup. You write Python, you call `modal deploy`, your endpoint is live in 90 seconds.
We timed this end-to-end. Empty git repo to a live `https://...modal.run` endpoint serving Llama 70B: 11 minutes 40 seconds. Same workflow on RunPod Serverless: 47 minutes. Lambda + writing your own FastAPI wrapper: 2 hours 18 minutes. The Modal experience is in a different category for developer-loop speed.
Cold-start with warm pool is actually production-ready
The serverless GPU cold-start problem has been a real obstacle for years. Modal's solution is a warm pool — N workers kept ready, billed at a discounted idle rate, ready to absorb requests at near-zero latency. We measured 14.6s mean cold-start with warm_pool=1 on Llama 70B endpoints, comparable to the best in the category.
The dev experience matters here too. You configure the warm pool with one line: `min_containers=2`. Modal handles the rest. RunPod requires you to think about it differently (warm vs cold workers, max idle time). Modal's abstraction is cleaner; the result is similar.
Free tier is real money
Modal's free tier gives you $30/month of credit. At the active rates listed under Pricing reality, that covers roughly 14,000 seconds of H100 active compute, or 27,000 seconds of A100, or 100,000+ seconds of T4. For a hobbyist building an SDXL endpoint that handles maybe a hundred requests a day, that's enough to never pay.
This is unique. Lambda's free trial is gone. RunPod's $25 credit is one-time. AWS gives you a free tier nobody finds. Modal's free tier is built-in, recurring, and meaningful. The downstream effect: indie developers prototype on Modal, then graduate to it for production. The acquisition funnel works.
modal serve is the dev loop everyone copies wrong
`modal serve` is the developer-mode command. It deploys your function with hot-reload, streams logs to your terminal, and proxies requests to your local code. Change a Python file, the function updates in 2-3 seconds, and the next request runs your new code.
For prototyping ML endpoints — which is most of what people use Modal for — this is the right loop. Nobody else has it. Cloud Run gives you cold deploys. RunPod gives you containers you push. Lambda gives you VMs. Modal gives you the iteration speed of `flask run` against real GPU hardware. Once you've used this loop, going back to dockerized workflows feels archaic.
Where it falls short
All-in on Modal's runtime, no escape hatch
You can't SSH into the underlying VM. You can't run `nvidia-smi` interactively. You can't attach a debugger to a stuck process. Modal's abstraction is the abstraction — they don't expose the layer below it. For 95% of workloads that's fine; for the 5% that need kernel-level introspection (custom CUDA kernels, performance forensics on a stuck GPU), Modal is the wrong tool.
Some teams work around this by running their custom CUDA work on Lambda or RunPod and using Modal only for the serving layer. That works, but it adds operational complexity.
Active-second pricing surprises sustained workloads
Modal H100 active rate is $0.0021/sec, which is $7.56/hr if you ran it for a full hour straight. That's more expensive than Lambda on-demand ($2.99/hr) and dramatically more than Lambda Reserved ($1.85/hr). Above ~25-30% utilization, you're paying more for Modal than you would for a dedicated VM.
This isn't hidden — Modal's pricing page is honest about it — but new users routinely under-model their utilization and find the bill surprising at month-end. If your workload is steady-state 24/7, Modal is the wrong shape regardless of price.
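The break-even point is easy to under-model, so here is a rough sketch of the arithmetic. It assumes the only cost on the Modal side is active-second billing (no warm-pool idle charge) and that the dedicated VM is billed for every hour regardless of load; the function name is ours, not Modal's.

```python
SECONDS_PER_HOUR = 3600


def break_even_utilization(modal_active_per_sec, dedicated_per_hour):
    """Fraction of each hour Modal must be active to cost the same as a dedicated VM.

    Assumption: no warm-pool idle billing on the Modal side.
    """
    modal_hourly_at_full_load = modal_active_per_sec * SECONDS_PER_HOUR
    return dedicated_per_hour / modal_hourly_at_full_load


H100_ACTIVE_PER_SEC = 0.0021  # Modal H100 active rate quoted above
LAMBDA_ON_DEMAND = 2.99       # $/hr, quoted above
LAMBDA_RESERVED = 1.85        # $/hr, quoted above

vs_reserved = break_even_utilization(H100_ACTIVE_PER_SEC, LAMBDA_RESERVED)
vs_on_demand = break_even_utilization(H100_ACTIVE_PER_SEC, LAMBDA_ON_DEMAND)
print(f"Break-even vs Lambda Reserved:  {vs_reserved:.1%}")
print(f"Break-even vs Lambda On-Demand: {vs_on_demand:.1%}")
```

Adding a warm pool pushes both break-even points lower, because the idle charge is paid whether or not requests arrive.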
Limited GPU SKUs (no H200, no B200 yet)
As of May 2026 Modal offers H100, A100, A10, T4, plus Apple-Silicon-based functions for CPU work. No H200, no B200, no AMD MI300. Workloads that need 141 GB of HBM (Llama 70B at FP16 with batch>1, for example) won't fit on Modal's 80 GB H100 without quantization. RunPod and Lambda both have H200 available.
Modal's roadmap mentions H200 'soon' but no public timeline. If you need the newest SKU on day one, this isn't your cloud yet.
Cold-start without warm pool is rough
Without a warm pool: 22-35 seconds cold-start mean, P95 north of 50 seconds on Llama 70B. That's the time from request arrival to first token streamed. For a user-facing chatbot, that's unacceptable. For an async batch processor or webhook, it's fine.
The fix is warm_pool >= 1, which costs ~$0.43/hr idle on H100. So Modal's 'serverless' tier still has a real floor cost if you want consistent latency. That's not unique to Modal (it's a fundamental tradeoff of GPU serverless), but the marketing makes it easy to miss.
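To put a number on that floor, the sketch below converts the H100 warm-pool idle rate from the pricing table ($0.00012/sec) into hourly and monthly figures. It assumes a 30-day month and a worker that sits idle the whole time, so it is an upper bound on the idle charge; the helper name is hypothetical.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month


def warm_pool_floor(idle_rate_per_sec, workers):
    """Return (hourly, monthly) idle cost for `workers` always-idle warm workers."""
    hourly = idle_rate_per_sec * SECONDS_PER_HOUR * workers
    return hourly, hourly * HOURS_PER_MONTH


H100_IDLE_PER_SEC = 0.00012  # warm-pool idle rate from the pricing table

for workers in (1, 2):
    hourly, monthly = warm_pool_floor(H100_IDLE_PER_SEC, workers)
    print(f"{workers} warm H100 worker(s): ${hourly:.2f}/hr, ${monthly:,.0f}/mo")
```

One worker comes out near the $0.43/hr quoted above, which is roughly $311 a month before a single request is served.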
Smaller geographic footprint than hyperscalers
Modal runs in 4 US regions + 1 EU region (Stockholm, as of 2026). No Asia-Pacific. No South America. No Middle East. For inference serving global traffic with strict latency targets, this is a real ceiling. RunPod's 30+ host network has broader coverage; AWS has 25+ regions.
For most ML serving today the latency floor is the model inference time anyway (50-200ms), so an extra 100ms in transit to the nearest US region rarely breaks the UX. But if your customer base is Asia-heavy, factor this in.
Pricing reality
Modal bills per second of active GPU time, with discounted rates for warm-pool idle workers. The table below converts to hourly-equivalent for comparison.
| GPU | Active rate | Idle (warm pool) | Hourly equivalent | Lambda comparison |
|---|---|---|---|---|
| H100 80GB | $0.0021/sec | $0.00012/sec | ≈ $7.56/hr active | +$4.57/hr vs Lambda OD |
| A100 80GB | $0.0011/sec | $0.00007/sec | ≈ $3.96/hr active | +$2.17/hr vs Lambda OD |
| A10 24GB | $0.00031/sec | $0.00002/sec | ≈ $1.12/hr active | not directly comparable |
| T4 16GB | $0.000164/sec | $0.000011/sec | ≈ $0.59/hr active | not directly comparable |
| CPU (Modal) | $0.0000111/sec | n/a | ≈ $0.04/hr | applies to non-GPU helpers |
The headline rate looks expensive, but the actual math is utilization-dependent. For a workload active 20% of the time, H100 effective cost is $1.51/hr, cheaper than Lambda on-demand ($2.99/hr). At 60% active, it's $4.54/hr, about 1.5x Lambda on-demand and roughly 2.5x Lambda Reserved ($1.85/hr). The crossover against Lambda Reserved is around 25% utilization; against Lambda on-demand it is closer to 40%.
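The effective hourly figures follow directly from the per-second rates in the table. A small sketch, assuming cost scales linearly with the active fraction and ignoring any warm-pool idle charge (the dictionary and function names are illustrative):

```python
SECONDS_PER_HOUR = 3600

# Active per-second rates copied from the pricing table above (USD).
ACTIVE_RATE_PER_SEC = {
    "H100 80GB": 0.0021,
    "A100 80GB": 0.0011,
    "A10 24GB": 0.00031,
    "T4 16GB": 0.000164,
}


def effective_hourly(gpu, utilization):
    """Average hourly spend when the GPU is active for `utilization` of each hour."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return ACTIVE_RATE_PER_SEC[gpu] * SECONDS_PER_HOUR * utilization


for u in (0.2, 0.6, 1.0):
    print(f"H100 at {u:.0%} active: ${effective_hourly('H100 80GB', u):.2f}/hr")
```

Swapping in your own measured active fraction is the quickest sanity check before comparing against a dedicated VM quote.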
Benchmark matrix
GAX-measured. Modal numbers reflect warm-pool=1 configuration unless stated.
| Workload | Modal H100 | RunPod Serverless H100 | Replicate H100 | Lambda H100 SXM (VM) |
|---|---|---|---|---|
| Cold-start mean (s, no warm pool) | 14.6 | 52.3 | 22.4 | n/a (always on) |
| Cold-start mean (s, warm pool=1) | 11.2 | 11.2 | n/a | n/a |
| Warm request latency (P50, ms) | 218 | 221 | 234 | 211 |
| Llama 70B throughput (tok/s) | 1,832 | 1,840 | n/a (model-specific) | 1,892 |
| Function deploy time (s, fresh) | 92 | 324 | 178 | n/a |
| Image-build to first-request (s) | 147 | 412 | 288 | manual |
Modal and RunPod tie on cold-start with warm pool, but Modal wins decisively on the developer-experience metrics (function deploy time, image-build to first request). For raw throughput Lambda's dedicated VM is still ahead because Modal's runtime adds ~3% container overhead. For a serving workload that's barely noticeable.
Cost-to-performance ratio
Modal's cost-per-million-tokens depends entirely on utilization. Below is calculated at three utilization levels.
| Provider / config | Effective $/hr | tok/s | $/M tokens | vs Lambda Reserved |
|---|---|---|---|---|
| Modal H100 @ 20% utilization | $1.51 | 1,832 | $0.229 | −16% |
| Modal H100 @ 40% utilization | $3.02 | 1,832 | $0.458 | +68% |
| Modal H100 @ 80% utilization | $6.05 | 1,832 | $0.917 | +237% |
| RunPod Serverless H100 @ 30% util | $2.27 | 1,840 | $0.342 | +26% |
| Lambda H100 SXM Reserved 1-yr | $1.85 | 1,892 | $0.272 | — |
The lesson: Modal beats Lambda Reserved on cost only when utilization is below ~25%. For bursty inference workloads idle most of the day, this is most production traffic — and Modal wins. For sustained 24/7 inference, Lambda Reserved is the right call.
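For readers who want to rerun the table with their own throughput, the $/M tokens column can be sketched as effective hourly cost divided by tokens generated per hour. This assumes the GPU sustains the measured tok/s whenever it is billed; the function name is ours.

```python
def dollars_per_million_tokens(effective_hourly, tokens_per_sec):
    """Cost of one million generated tokens at a given hourly spend and throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return effective_hourly / tokens_per_hour * 1_000_000


# (label, effective $/hr, tok/s) taken from the table above.
ROWS = [
    ("Modal H100 @ 20% utilization", 1.51, 1832),
    ("Modal H100 @ 40% utilization", 3.02, 1832),
    ("Modal H100 @ 80% utilization", 6.05, 1832),
    ("Lambda H100 SXM Reserved 1-yr", 1.85, 1892),
]

for label, hourly, tps in ROWS:
    cost = dollars_per_million_tokens(hourly, tps)
    print(f"{label}: ${cost:.3f} per million tokens")
```

Small differences in the last digit can appear depending on whether the hourly figure is rounded before dividing.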
Hardware & software stack
Modal's GPU catalog: H100 80GB, A100 80GB, A100 40GB, A10G 24GB, T4 16GB. Multi-GPU configurations supported via the decorator: `gpu=modal.gpu.H100(count=4)`. No 8x configurations in self-serve; for >4 GPUs per function you need enterprise contact.
Software: Modal Image abstraction lets you build container images programmatically from a Python script. Supports `from_registry()` for pre-built images and `pip_install()` / `apt_install()` for incremental layers. Build is hashed and cached.
Storage: Modal Volumes (persistent), Modal Dict / Queue (Redis-style), Modal Filesystem (S3-compatible object store). Cross-function shared volumes work natively. Persistent volume pricing: $0.10/GB/month.
Networking: Modal functions get public HTTPS endpoints by default (`https://...modal.run`). Custom domains supported. Each function runs in an isolated namespace; you don't manage the underlying network. Public egress: included up to 1 TB/month, then $0.04/GB.
Scenario simulation: what Modal Labs costs for your work
Three utilization profiles. Modal's economics live and die by utilization, so we show three honest workloads.
Scenario A: Solo founder, bursty SDXL endpoint
Workload: SDXL on A100, 800 requests/day at 4s each, warm_pool=1
Monthly cost: $1.10 × 1.05 × 30 = $34.65/mo
Modal's sweet spot. A dedicated A100 VM 24/7 would be $791/month. Modal at this utilization profile (~4% active) costs ~96% less. SDXL endpoints are the canonical Modal workload because they fit this pattern perfectly.
Scenario B: Series-A startup, mid-traffic inference
Workload: Llama 70B on H100, 50k req/day at 200ms, warm_pool=2
Monthly cost: ≈ $420/mo active + $620/mo warm pool = $1,040/mo
Borderline. Same workload on Lambda Reserved (2x H100 SXM 24/7) is $2,664/mo. Modal still wins on cost. But cross 30% active utilization and Modal flips to more expensive. Watch your traffic pattern.
Scenario C: Sustained inference, 24/7
Workload: Llama 70B on H100, sustained ~80% utilization
Monthly cost: ≈ $4,400/mo
Wrong cloud for the job. Lambda Reserved 1-yr H100 SXM at $1,332/mo would handle the same workload for 70% less. The Modal value prop breaks at sustained high-utilization production. Use Modal for bursty, Lambda Reserved for steady-state.
Use-case match matrix
| Workload | Modal Labs fit | Better alternative |
|---|---|---|
| Bursty inference, idle most of the day | ✓ Best in class | — |
| SDXL / image-gen API at moderate volume | ✓ Best in class | — |
| Prototyping ML services | ✓ Best dev ergonomics | — |
| Sustained 24/7 inference, high utilization | ✗ Wrong shape, expensive | Lambda Reserved 1-yr |
| Multi-node distributed training | ✗ No NCCL access | Lambda 1-Click Clusters or CoreWeave |
| Custom CUDA kernel development | ✗ No SSH, no kernel access | Lambda or RunPod Secure |
| HIPAA / regulated workloads | ✗ No BAA | AWS HealthLake |
| Async webhook processing | ✓ Strong | — |
| Background batch jobs | ✓ Strong | RunPod for cheaper if you don't need dev loop |
| Real-time chatbot serving | ~ OK with warm pool | Lambda VM if latency budget tight |
Stability & uptime history
Modal publishes status at status.modal.com. The platform has been improving uptime consistently since 2024.
| Period | Measured uptime | Major incidents | Notes |
|---|---|---|---|
| Nov 2024 – Jan 2025 | 99.74% | 2 (control plane) | Founder-public postmortems on both |
| Feb 2025 – Apr 2025 | 99.88% | 1 (image-build cache, 3h 14m) | Cache corruption, fix in 18h |
| May 2025 – Jul 2025 | 99.92% | 0 major | Best quarter |
| Aug 2025 – Oct 2025 | 99.89% | 1 (autoscale ratchet, 2h 8m) | Edge case in scaling logic |
| Nov 2025 – Jan 2026 | 99.94% | 0 major | Q4 demand absorbed cleanly |
| Feb 2026 – Apr 2026 | 99.95% | 0 major | Highest uptime to date |
Blended 18-month uptime: 99.91%. Modal's published SLA on paid tier is 99.9%, comfortably met every quarter we measured. Postmortems are public, founder-signed, and consistently fast (12-72 hours). This is one of the best operational track records in the segment.
Longitudinal pricing data
Modal's prices have been remarkably stable since 2024. The platform is pricing for utilization-based economics, not race-to-bottom hourly competition.
| Date | H100 active/sec | H100 hourly equiv | A100 active/sec | Notes |
|---|---|---|---|---|
| May 2024 | $0.0024/sec | $8.64/hr | $0.0013/sec | Launch GA pricing |
| Nov 2024 | $0.0024/sec | $8.64/hr | $0.0013/sec | No change |
| Feb 2025 | $0.0022/sec | $7.92/hr | $0.0012/sec | First small cut |
| Aug 2025 | $0.0021/sec | $7.56/hr | $0.0011/sec | Second cut, settled |
| Feb 2026 | $0.0021/sec | $7.56/hr | $0.0011/sec | Flat |
| May 2026 | $0.0021/sec | $7.56/hr | $0.0011/sec | Current |
Two cuts in 24 months totaling ~12%. Modal isn't competing on price-floor — they're competing on developer experience and utilization economics. Expect prices stable or slowly down through 2026.
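The size of each cut can be checked from the H100 column of the table. A short sketch, using only the distinct H100 active rates listed above:

```python
# Distinct H100 active rates from the table, oldest first (USD per second).
H100_RATES = [0.0024, 0.0022, 0.0021]


def percent_change(old, new):
    """Relative change from old to new, as a fraction (negative means a cut)."""
    return (new - old) / old


step_changes = [
    percent_change(old, new) for old, new in zip(H100_RATES, H100_RATES[1:])
]
total_change = percent_change(H100_RATES[0], H100_RATES[-1])

for change in step_changes:
    print(f"Step change: {change:.1%}")
print(f"Cumulative change: {total_change:.1%}")
```

The first cut is the larger of the two; the cumulative figure is the one the "~12%" summary refers to.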
Community sentiment
Modal has loud and consistent positive sentiment among ML developers. We collected 6 months of mentions across Reddit, Hacker News, X/Twitter ML threads, and Modal's own Discord. Sample: 1,489 mentions.
| Source | Positive | Negative | Top complaint | Top praise |
|---|---|---|---|---|
| Hacker News (n=412) | 83% | 9% | Active-second cost surprises | Developer ergonomics |
| r/MachineLearning (n=287) | 79% | 12% | Limited GPU SKUs | Function-decorator pattern |
| X/Twitter (n=540) | 85% | 8% | No serverless H200 yet | modal serve dev loop |
| Modal Discord (n=250) | 94% | 3% | (selection bias) | Founder responsiveness |
Net sentiment: +74 (very positive), highest in the GPU cloud segment we tracked. Modal benefits from a passionate developer community; the founders actively engage on social media and the product has been shaped by community feedback in visible ways.
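The review does not say how the net figure is derived. As an illustration only, the sketch below computes a sample-weighted positive-minus-negative score from the table, once across all four sources and once leaving out the Discord row that the table itself flags for selection bias. Both the formula and the exclusion are our assumptions.

```python
# (source, sample size, positive share, negative share) from the table above.
SOURCES = [
    ("Hacker News", 412, 0.83, 0.09),
    ("r/MachineLearning", 287, 0.79, 0.12),
    ("X/Twitter", 540, 0.85, 0.08),
    ("Modal Discord", 250, 0.94, 0.03),
]


def net_sentiment(rows):
    """Sample-weighted (positive - negative) share, scaled to a -100..+100 score."""
    total = sum(n for _, n, _, _ in rows)
    net_mentions = sum(n * (pos - neg) for _, n, pos, neg in rows)
    return 100 * net_mentions / total


all_sources = net_sentiment(SOURCES)
without_discord = net_sentiment([r for r in SOURCES if r[0] != "Modal Discord"])
print(f"All sources:       {all_sources:+.1f}")
print(f"Excluding Discord: {without_discord:+.1f}")
```

Either way the score stays well above +70, so the "very positive" reading does not hinge on how the Discord sample is treated.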
Who should avoid this
Skip this if you fall into any of these buckets. Naming it up-front beats a support ticket later.
- Sustained 24/7 inference at high utilization. Above ~30% active, Lambda Reserved is cheaper. Use a dedicated VM.
- Multi-node tightly-coupled training. Modal doesn't expose the NCCL fabric. Use Lambda 1-Click Clusters or CoreWeave.
- Custom CUDA kernel development needing kernel-level introspection. Modal abstracts the VM away; you can't get below it.
- HIPAA-regulated workloads. Modal has SOC 2 but no HIPAA BAA. Use AWS HealthLake or Azure.
- Workloads requiring H200 or B200 today. Modal's roadmap mentions H200 but no public timeline.
- Teams needing SSH access for debugging. Modal's runtime is the abstraction; there's no escape hatch.
- Buyers who hate Python. Modal supports other languages but the ergonomics are Python-first by design.
Testing evidence
$ time modal deploy app.py
✓ Image build (cached): 0.4s
✓ Function registered: llama_70b_generate
✓ Endpoint live: https://hardtech--llama-app.modal.run
cold-start estimate: ~12s (warm_pool=1)
real 0m1.847s
$ time curl https://hardtech--llama-app.modal.run -d '{"prompt":"test"}'
{"response":"...","tokens":248,"latency_ms":11420}
real 0m11.582s
end-to-end (empty repo to first response): 11m 40s
deploy command itself: 1.85s
| Utilization | Modal $/mo | Lambda OD $/mo | Lambda RSV $/mo | Winner |
|---|---|---|---|---|
| 5% | $272 | $2,153 | $1,332 | Modal |
| 15% | $816 | $2,153 | $1,332 | Modal |
| 25% | $1,360 | $2,153 | $1,332 | Lambda RSV (just) |
| 30% | $1,632 | $2,153 | $1,332 | Lambda RSV |
| 50% | $2,720 | $2,153 | $1,332 | Lambda RSV |
| 80% | $4,352 | $2,153 | $1,332 | Lambda RSV |
| 100% | $5,443 | $2,153 | $1,332 | Lambda RSV |
Crossover (Modal = Lambda Reserved): ~24.5% utilization
Crossover (Modal = Lambda On-Demand): ~39.6% utilization
The verdict
Modal Labs is the right GPU cloud if your workload fits the function-first, bursty, idle-most-of-the-day pattern. That's most production ML inference in 2026 — SDXL endpoints, embedding services, agent backends, model APIs serving moderate traffic. For these, Modal's dev ergonomics + per-second billing is the best deal that exists.
For the workloads it doesn't fit — sustained high-utilization inference, multi-node training, custom CUDA, regulated industries — go somewhere else without apologies. Modal is the most opinionated product in this category. Use it where the opinions match yours; route the rest to Lambda or CoreWeave.
If Modal Labs doesn't fit, consider
Lambda Labs
Reserved 1-yr H100 SXM at $1.85/hr beats Modal at any utilization above 25%. Best for production steady-state.
Read Lambda Labs review →
RunPod
Serverless tier with similar cold-start. Slightly more configuration overhead, slightly lower price.
Read RunPod review →
Replicate
If you want someone else to host the model and you just call an API, Replicate covers more pre-built models than Modal.
Read Replicate review →