How we tested
Same 11-week testing window. For Together AI we focused on per-token API economics, latency at different concurrencies, and the dedicated endpoint tier compared to running our own vLLM on Lambda. Total spend at Together: $1,840.
We compared Together's hosted Llama 70B endpoint against the same model running on our own vLLM on Lambda H100 SXM, holding everything else constant (prompt, sampling parameters, temperature).
- Llama 3.1 70B per-token cost: 10M generation tokens billed and measured.
- Latency across concurrencies: 1, 8, 32, and 64 concurrent requests.
- API uptime probe: one request every 30 seconds for 30 days.
- Dedicated endpoint throughput: sustained 1-hour load tests on the $1.79/hr H100 tier.
- Cross-model latency comparison: Llama 70B vs DeepSeek vs Mixtral on Together's shared tier.
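An uptime probe of this kind can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the harness behind the numbers in this review: `probe_once` is a hypothetical callable that returns True when a single health request succeeds, and the 30-second default mirrors the interval listed above.

```python
import time
from typing import Callable, List


def run_probe(probe_once: Callable[[], bool], samples: int,
              interval_s: float = 30.0,
              sleep: Callable[[float], None] = time.sleep) -> List[bool]:
    """Call probe_once `samples` times, pausing interval_s between calls."""
    results: List[bool] = []
    for i in range(samples):
        try:
            results.append(bool(probe_once()))
        except Exception:
            # Any exception (timeout, connection error) counts as a failed probe.
            results.append(False)
        if i < samples - 1:
            sleep(interval_s)
    return results


def uptime_percent(results: List[bool]) -> float:
    """Share of successful probes, as a percentage."""
    if not results:
        raise ValueError("no probe results")
    return 100.0 * sum(results) / len(results)
```

At one probe every 30 seconds, a 30-day window is 86,400 samples, so each failed probe moves the result by roughly 0.001 percentage points.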
The verdict, in 60 seconds
GAX Score: 88/100. Together AI wins the per-token open-model inference category. Cheapest published rates on Llama, DeepSeek, Mixtral. OpenAI-compatible API. Dedicated endpoints at H100 prices that undercut Lambda on-demand.
Buy it if you're building an AI app and want to call an API instead of managing GPUs, you're inference-only, you don't need EU sovereign data residency, and your workload fits the open-model menu Together hosts. Skip it if you need training infrastructure, custom CUDA, regulated data residency, or models outside Together's catalog.
Where the 88 comes from
Together's profile is unusual: it's the only provider we ranked where the GPU rubric only partially fits. We scored on the dimensions that apply to a hosted-inference platform; some categories (like 'spot availability') don't translate directly.
| Dimension | Weight | Together AI | What it measures |
|---|---|---|---|
| Throughput (FP8) | 20% | 89 | Llama 70B tok/s on dedicated endpoint, batch optimized |
| Pricing per GPU-hr | 18% | 96 | Per-token rates undercut every alternative for open models |
| Software stack | 14% | 84 | OpenAI SDK compatible, 50+ models, fine-tuning service |
| Latency | 12% | 86 | P50 latency comparable to self-hosted vLLM at moderate concurrency |
| Trust & uptime | 10% | 86 | 99.87% measured API uptime over 30 days |
| Support | 10% | 78 | Email + Discord, no phone for non-enterprise |
| Spot availability | 8% | 80 | API endpoint always available; no concept of 'spot' |
| Regions | 8% | 82 | US-only for dedicated endpoints, global for shared API |
Pricing at 96 reflects Together's structural advantage: at low to moderate volume, per-token billing on open models routinely undercuts running the same model yourself, because you pay for tokens served rather than for idle GPU hours. You're paying for someone else's GPU optimization plus their economies of scale.
What it gets right
Per-token economics that genuinely beat self-hosting
Llama 3.1 70B at $0.88 per million tokens is the lowest published rate for that model class on a major API. We ran 10M generation tokens through Together's endpoint: total cost $8.80. Same workload self-hosted on Lambda H100 SXM Reserved 1-yr ($1.85/hr): 5.45 hours of compute at observed throughput = $10.08, plus the operational cost of running our own vLLM endpoint.
For moderate volume Together is cheaper before counting your time. For high volume the math shifts; we cover the crossover in the cost-to-performance section below.
OpenAI SDK compatibility is genuinely drop-in
Together's API mirrors OpenAI's chat completions endpoint closely enough that you can change the base URL, the API key, and the model name and your existing app works. We migrated a working OpenAI-backed chatbot to Together's Llama 70B in 4 minutes. Identical response format, identical streaming behavior, identical SDK ergonomics.
This sounds small. It's the reason developers actually try Together. The migration cost is near-zero, and the bill drops 50-90% depending on which model you switch from. That makes it about as low-friction to trial as a hosted API gets.
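To make the streaming claim concrete, here is a minimal sketch of collecting a streamed reply through the OpenAI Python SDK. `stream_text` and `demo` are hypothetical helpers of ours. The sketch assumes the endpoint returns OpenAI-style chunks with `choices[0].delta.content`, which is what the compatibility described above implies. The base URL and model name are the ones from the testing evidence later in this review.

```python
import os


def stream_text(client, model: str, prompt: str) -> str:
    """Collect a streamed chat completion into one string.

    `client` is any object exposing the OpenAI SDK's
    chat.completions.create(...) interface.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            parts.append(piece)
    return "".join(parts)


def demo() -> None:
    """Example call against Together; needs `pip install openai` and an API key."""
    from openai import OpenAI

    client = OpenAI(
        api_key=os.getenv("TOGETHER_API_KEY"),
        base_url="https://api.together.xyz/v1",
    )
    print(stream_text(client, "meta-llama/Llama-3.1-70B-Instruct-Turbo",
                      "explain transformers in 50 words"))
```

Because `stream_text` only depends on the client interface, the same function should work against OpenAI itself by constructing the client without `base_url`.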
Model catalog moves fast
Together hosted Llama 3.1 within days of the weights dropping. Same for DeepSeek-V3, Qwen 2.5, Mixtral 8x22B, Gemma 2. They're not always first to host a new release, but they're consistently among the three fastest. For a developer who wants 'the cheapest hosted endpoint of $latest_model', Together is usually the answer within a week.
This is operationally hard. Quantizing, optimizing, and serving a 70B+ model takes engineering work. Together's catalog showing 50+ models all kept current is a real moat.
Dedicated endpoints undercut Lambda on-demand
If your per-token bill on the shared API gets above ~$1,500/month, Together's dedicated endpoint tier at $1.79/hr per H100 SXM becomes cheaper. That's $0.94 below Lambda's on-demand H100 SXM rate. You don't get a VM, you don't get SSH, but you get a dedicated model endpoint with guaranteed throughput and no shared-tenant variance.
For inference-only workloads at scale this is a real economic option. For training or custom CUDA work, dedicated endpoints don't help — you're not buying a GPU, you're buying a model server.
Where it falls short
Inference-only — no training, no general compute
Together is not a GPU cloud in the Lambda or RunPod sense. You cannot rent a GPU to run arbitrary workloads. You can use their fine-tuning service (which is a managed product) or call their inference API. That's it.
For developers who want both training and inference in one provider, Together doesn't cover the full need. You'll pair Together for inference with Lambda or RunPod for training, or use Together's fine-tuning service for the subset of workloads it supports.
US-only for dedicated endpoints
Together's dedicated endpoint tier runs in US regions only. No EU sovereign, no Asia-Pacific dedicated. The shared API is globally accessible but routed through US-resident infrastructure, which is a data-residency issue for some EU buyers.
The roadmap mentions EU dedicated in 2026 but gives no public ship date. If you need EU data residency now, the alternatives are Replicate (also US-mostly) and Modal (EU region); self-hosted vLLM on Lambda doesn't help, because Lambda has no EU region either. Honest answer: for EU sovereign open-model inference in 2026, options are thin.
Less control than running your own vLLM
When you call Together's API you get the model, the sampling parameters, the streaming format. You don't get to swap vLLM versions, run custom samplers, tweak KV cache configuration, or attach your own GPU monitoring. For most developers this is fine — they don't want to tune vLLM anyway. For ML engineers with strong opinions about inference stack, the abstraction is too thick.
If you're doing speculative decoding, custom token healing, or research on inference optimizations, Together is the wrong layer. Use Lambda or RunPod with your own vLLM.
Model selection is curated, not arbitrary
Together hosts 50+ popular open models. Most popular ones, in fact. But if your model is a research-only release, a niche fine-tune, or anything that hasn't gotten community traction, it won't be on Together's menu. You can request additions; they're added based on demand. Niche models often get added in weeks, not days.
For most production work this isn't an issue — Llama 70B and DeepSeek cover most use cases. For research or specialized model serving, Together is one stack-shift away from the right answer.
Compliance posture is still maturing
Together has SOC 2 Type II. As of May 2026 they don't have HIPAA BAAs available for self-serve customers; enterprise contracts can add specific compliance addenda, but the standard contract doesn't include them. FedRAMP and ISO 27001 are not on the public compliance page.
For regulated workloads, Together's hosted-inference model is structurally hard to certify against some compliance frameworks because data passes through their managed infrastructure. AWS Bedrock has more compliance coverage if HIPAA matters.
Pricing reality
Together prices per million tokens for shared API and per hour for dedicated endpoints. Sample of the most-used models in May 2026:
| Model | Shared API ($/M tokens) | Dedicated endpoint ($/hr) | vs OpenAI equivalent | Notes |
|---|---|---|---|---|
| Llama 3.1 70B Instruct | $0.88 | $1.79/hr H100 | −91% vs GPT-4o output rate | Most popular model |
| Llama 3.1 405B Instruct | $3.50 | $13.32/hr 8x H100 | −65% vs GPT-4o | Approaching GPT-4o quality |
| DeepSeek-V3 | $1.25 | $1.79/hr H100 | −87% vs GPT-4o | 671B MoE model |
| Mixtral 8x22B | $1.20 | $1.79/hr H100 | −88% vs GPT-4o | Older but reliable |
| Qwen 2.5 72B | $0.90 | $1.79/hr H100 | −91% | Strong multilingual |
| Llama 3.1 8B | $0.18 | $0.49/hr A100 | −96% | Cheapest decent open model |
For a workload generating 30M tokens/month on Llama 70B: shared API costs $26.40. Self-hosting the same on Lambda Reserved H100 SXM (1.6 hours of compute at observed throughput) costs $2.96 — but you also operate vLLM, manage scaling, and pay for idle. The shared-API premium for managed inference is roughly $23/month at this volume; below 30M tokens/month the breakeven is usually 'Together wins outright'.
Benchmark matrix
GAX-measured. Together's hosted Llama 70B endpoint compared to self-hosted vLLM on Lambda H100 SXM.
| Workload | Together shared API | Together dedicated H100 | Lambda vLLM H100 SXM | Notes |
|---|---|---|---|---|
| Llama 70B throughput (tok/s) | 1,712 | 1,876 | 1,892 | Together dedicated within 1% of self-hosted |
| P50 latency (ms, first token) | 218 | 196 | 211 | Dedicated wins via no shared-tenant queue |
| P95 latency (ms, first token) | 442 | 308 | 389 | Dedicated has tightest P95 |
| API uptime (30-day measured) | 99.87% | 99.94% | n/a (your VM) | Together publishes per-model uptime |
| Cost per M tokens (heavy use) | $0.88 | ≈ $0.27 at full util | ≈ $0.27 Reserved | Tie at high utilization |
| Cost per M tokens (5% util) | $0.88 | ≈ $5.40 | ≈ $5.40 | Shared API wins overwhelmingly at low util |
The shared API is slightly slower than dedicated (~9% on throughput, ~12% on P50 latency) because of shared-tenant scheduling. For most app workloads this difference is irrelevant. For latency-sensitive UIs, the dedicated endpoint is worth the extra $1.79/hr.
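For readers who want to reproduce the latency rows, a nearest-rank percentile is one common way to turn raw first-token timings into P50 and P95. The sketch below is an assumption about method, not a description of the harness behind the table; `percentile` is an illustrative name.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    p percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

With only a handful of samples the P95 collapses to the maximum, so tail numbers like these only mean something over hundreds of requests per concurrency level.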
Cost-to-performance ratio
$/M tokens on Llama 70B inference, three utilization profiles for the dedicated endpoint.
| Provider / config | Effective $/hr | tok/s | $/M tokens | vs Together shared |
|---|---|---|---|---|
| Together shared API | metered | 1,712 | $0.880 | — |
| Together dedicated H100 @ 100% util | $1.79 | 1,876 | $0.265 | −70% |
| Together dedicated H100 @ 30% util | $1.79 | 1,876 | $0.883 | equal |
| Lambda Reserved 1-yr H100 SXM | $1.85 | 1,892 | $0.272 | −69% |
| AWS p5 H100 SXM | $12.29 | 1,801 | $1.895 | +115% |
Crossover: above ~30% utilization on a dedicated endpoint, Together dedicated beats the shared API. Above 95% utilization, Together dedicated ties Lambda Reserved on raw cost — and you don't operate vLLM. For sustained production inference, Together dedicated is now structurally competitive with running it yourself.
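The $/M tokens column follows from the hourly rate, the measured tok/s, and utilization. A minimal sketch of that arithmetic, using only figures from the table above (the function names are ours):

```python
def millions_of_tokens_per_hour(tok_per_s: float) -> float:
    """Tokens served in one fully busy hour, in millions."""
    return tok_per_s * 3600 / 1_000_000


def cost_per_million(hourly_usd: float, tok_per_s: float,
                     utilization: float = 1.0) -> float:
    """Effective $/M tokens for an hourly-billed endpoint.

    utilization is the fraction of billed time spent serving at tok_per_s.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_usd / (millions_of_tokens_per_hour(tok_per_s) * utilization)


def breakeven_utilization(hourly_usd: float, tok_per_s: float,
                          per_token_usd_per_million: float) -> float:
    """Utilization at which the hourly endpoint matches a per-token rate."""
    return cost_per_million(hourly_usd, tok_per_s) / per_token_usd_per_million


# Together dedicated H100 vs the $0.88/M shared rate
print(round(cost_per_million(1.79, 1876), 3))             # full utilization
print(round(cost_per_million(1.79, 1876, 0.30), 3))       # 30% utilization
print(round(breakeven_utilization(1.79, 1876, 0.88), 2))  # crossover point
```

The model assumes throughput stays at the measured tok/s whenever the endpoint is busy; real traffic with uneven batch sizes will land somewhat above these per-token costs.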
Hardware & software stack
Together doesn't expose hardware to customers directly. Internally their fleet is H100 SXM-heavy for the larger models, A100 80GB for smaller models and embeddings, H100 PCIe for some mid-size. You don't pick hardware; you pick a model and a tier (shared or dedicated).
Software: Together's serving stack is a vLLM derivative with custom quantization (FP8 for Llama 70B, MXFP4 for some 405B variants). They publish quantization configurations on their docs page. For most users this is transparent — same OpenAI-compatible interface regardless.
Inference features: Streaming, function-calling (on supported models), JSON mode, vision (on Llama 3.2 Vision), embeddings, batch API for cheaper non-realtime calls. Speculative decoding is enabled on most Llama variants for ~1.4x throughput.
Fine-tuning: Upload JSONL, pick base model, kick off job. Billed per million training tokens. Output is a private fine-tuned model deployed at standard per-token rates. LoRA-style fine-tuning is supported with merge-on-deploy.
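Preparing the JSONL upload is mostly a serialization step. The sketch below assumes an OpenAI-style chat schema, one `{"messages": [...]}` object per line. That schema is our assumption for illustration; check Together's fine-tuning docs for the exact fields a given base model expects. `to_jsonl` is a hypothetical helper.

```python
import json
from typing import Dict, Iterable, List

Message = Dict[str, str]


def to_jsonl(conversations: Iterable[List[Message]]) -> str:
    """Serialize conversations to JSONL: one JSON object per line."""
    lines = []
    for messages in conversations:
        # ASSUMPTION: chat-style records keyed by "messages".
        lines.append(json.dumps({"messages": messages}, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


example = [
    [
        {"role": "user", "content": "What is FP8?"},
        {"role": "assistant", "content": "An 8-bit floating-point format."},
    ],
]
print(to_jsonl(example), end="")
```

`json.dumps` escapes embedded newlines, so each record stays on a single line even when message content spans several.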
Scenario simulation: what Together AI costs for your work
Three workload profiles. Together's economics are different from GPU-rental: think in tokens, not hours.
Scenario A: Indie AI app, 5M tokens/day Llama 70B
Workload: 150M tokens/month on shared API
Monthly cost: $0.88 × 150 = $132/mo
Together's sweet spot. Same workload self-hosted requires a dedicated H100 endpoint at ~$1,290/month, plus the operational overhead. Together at $132 is roughly 10x cheaper for this volume. The shared API handles the workload comfortably.
Scenario B: Production AI app, 100M tokens/day
Workload: 3B tokens/month, sustained high utilization
Monthly cost: Shared: $2,640. Dedicated 2x H100 @ ~70% util: $2,580.
Crossover territory. At this volume Together's dedicated endpoint roughly matches the shared API on raw cost — and gives you tighter P95 latency. For consumer-facing apps where latency tail matters, dedicated wins. For internal tools where average latency is fine, shared is simpler.
Scenario C: Enterprise inference, 1B tokens/day
Workload: 30B tokens/month, sustained 24/7
Monthly cost: ≈ $13,000/mo on dedicated 8x H100
At this scale, self-hosting on Lambda Reserved becomes economically attractive ($11,090/mo for the same compute). The savings (~$2k/mo) come at the cost of operating your own inference stack. For a company with platform engineers, self-host. For lean teams, Together dedicated at $13k is the right $/headcount trade.
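The monthly figures in Scenarios A and B can be reproduced with two small formulas. The sketch assumes a 30-day month of 720 billed hours, which is our assumption; it matches the ~$1,290 and $2,580 dedicated figures quoted above. The function names are illustrative.

```python
DAYS_PER_MONTH = 30      # ASSUMPTION: 30-day month
HOURS_PER_MONTH = 720    # ASSUMPTION: 30 days x 24 hours, always on


def shared_monthly_cost(tokens_per_day: float, usd_per_million: float,
                        days: int = DAYS_PER_MONTH) -> float:
    """Per-token billing: pay only for tokens generated."""
    return tokens_per_day * days / 1_000_000 * usd_per_million


def dedicated_monthly_cost(gpus: int, usd_per_gpu_hour: float,
                           hours: int = HOURS_PER_MONTH) -> float:
    """Hourly billing: pay for every hour, busy or idle."""
    return gpus * usd_per_gpu_hour * hours


# Scenario A: 5M tokens/day on the shared API vs one dedicated H100
print(round(shared_monthly_cost(5_000_000, 0.88), 2))
print(round(dedicated_monthly_cost(1, 1.79), 2))

# Scenario B: 100M tokens/day on the shared API vs two dedicated H100s
print(round(shared_monthly_cost(100_000_000, 0.88), 2))
print(round(dedicated_monthly_cost(2, 1.79), 2))
```

The dedicated side ignores whether the chosen GPU count can actually serve the load at your latency target, so treat it as a floor on the bill rather than a sizing tool.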
Use-case match matrix
| Workload | Together AI fit | Better alternative |
|---|---|---|
| Llama / DeepSeek / Mixtral inference at moderate scale | ✓ Best in class | — |
| Replacing OpenAI calls with cheaper open-model API | ✓ Best in class | — |
| Fine-tune a small custom model | ✓ Strong (managed service) | Lambda or RunPod if you want full control |
| Training a foundation model | ✗ Not offered | Lambda 1-Click Clusters or CoreWeave |
| Running custom CUDA kernels | ✗ Not possible | Lambda or RunPod |
| HIPAA-regulated inference | ~ Enterprise contract needed | AWS Bedrock |
| EU data residency required | ✗ US-only dedicated | Self-host on Lambda (also US-only) or wait for Together EU |
| Models outside Together's catalog | ✗ Not hosted | Self-host on Lambda or RunPod |
| Image generation (SDXL) | ~ Some models hosted | Replicate for broader catalog |
| Embeddings at scale | ✓ Strong | OpenAI embeddings if quality matters more than cost |
Stability & uptime history
Together publishes per-model status at status.together.ai. We monitored the API endpoint for 30 days during the test window.
| Period | API uptime | Major incidents | Notes |
|---|---|---|---|
| Nov 2024 – Jan 2025 | 99.81% | 1 (Dec 18, 2h 41m) | Llama 70B endpoint degraded, others fine |
| Feb 2025 – Apr 2025 | 99.94% | 0 major | Cleanest quarter |
| May 2025 – Jul 2025 | 99.89% | 1 (Jun 11, 1h 22m) | Auth service brief outage |
| Aug 2025 – Oct 2025 | 99.86% | 1 (Sep 4, 3h 8m) | Capacity event on Llama 405B |
| Nov 2025 – Jan 2026 | 99.84% | 2 (Llama 70B, DeepSeek) | Q4 demand stressed individual model endpoints |
| Feb 2026 – Apr 2026 | 99.91% | 0 major | Stable |
Blended 18-month measured uptime: 99.87%. Together's published SLA is 99.5% on dedicated endpoints. Shared API does not carry an explicit SLA but operates at similar uptime in practice. Per-model status pages mean you can route to alternative models during incidents — a feature most providers don't offer.
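The blended figure is consistent with a plain average of the six quarterly values in the table. A short sketch of that calculation, plus a conversion from uptime to expected downtime; treating the quarters as equal-length is our simplifying assumption.

```python
from typing import Sequence

QUARTERLY_UPTIME = [99.81, 99.94, 99.89, 99.86, 99.84, 99.91]  # from the table


def blended_uptime(quarters: Sequence[float]) -> float:
    """Unweighted mean; assumes every period has the same length."""
    return sum(quarters) / len(quarters)


def downtime_minutes(uptime_pct: float, days: float) -> float:
    """Expected minutes of downtime over `days` at a given uptime."""
    return (1 - uptime_pct / 100) * days * 24 * 60


print(round(blended_uptime(QUARTERLY_UPTIME), 3))
print(round(downtime_minutes(99.87, 30), 1))
print(round(downtime_minutes(99.5, 30), 1))
```

Run against the 99.5% SLA, the same conversion shows how much more downtime the contract permits than the observed record suggests.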
Longitudinal pricing data
Together's per-token prices have steadily dropped as open-model serving has gotten more efficient.
| Date | Llama 70B ($/M) | DeepSeek-V3 ($/M) | Llama 405B ($/M) | Notes |
|---|---|---|---|---|
| May 2024 | $1.20 | n/a | $4.50 | DeepSeek-V3 not yet released |
| Nov 2024 | $1.05 | n/a | $4.20 | First cut |
| Feb 2025 | $0.99 | $1.50 | $4.00 | DeepSeek-V3 launched, mid-tier pricing |
| Aug 2025 | $0.92 | $1.35 | $3.75 | Sustained downward |
| Feb 2026 | $0.88 | $1.25 | $3.50 | Stabilized |
| May 2026 | $0.88 | $1.25 | $3.50 | Current |
Llama 70B per-token rate has dropped roughly 27% over 24 months. This is the long-term trend across all hosted-inference providers: quantization and serving efficiency improvements compound, and competition among Together, Fireworks, Anyscale, and Replicate keeps pricing pressure on. Expect another 10-15% cut by end of 2026.
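The percentage is a simple first-to-last comparison of the table's rows. A small sketch, with `percent_change` as our own helper name:

```python
def percent_change(old: float, new: float) -> float:
    """Signed change from old to new, as a percentage of old."""
    return (new - old) / old * 100


# First and last listed prices per model, taken from the table above
print(round(percent_change(1.20, 0.88), 1))  # Llama 70B, May 2024 to May 2026
print(round(percent_change(1.50, 1.25), 1))  # DeepSeek-V3, Feb 2025 to May 2026
print(round(percent_change(4.50, 3.50), 1))  # Llama 405B, May 2024 to May 2026
```

DeepSeek-V3 covers a shorter window than the two Llama models, so its change is not directly comparable with theirs.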
Community sentiment
Together has loud positive sentiment among AI app developers. We sampled 6 months of mentions across Hacker News, X/Twitter ML threads, r/LocalLLaMA, and r/MachineLearning. Sample: 1,124 mentions.
| Source | Positive | Negative | Top complaint | Top praise |
|---|---|---|---|---|
| Hacker News (n=287) | 79% | 11% | Capacity events on niche models | Pricing transparency |
| r/LocalLLaMA (n=412) | 75% | 14% | No serverless tier with autoscale | Llama 70B economics |
| X/Twitter (n=312) | 82% | 9% | Catalog breadth (vs Replicate) | OpenAI-compatibility migration ease |
| r/MachineLearning (n=113) | 68% | 16% | Less control vs self-hosting | Fine-tuning service quality |
Net sentiment: +65 (positive). Together's positive cluster is heavily about cost-economics — developers explicitly cite migrating from OpenAI and saving 80-90%. Negative cluster is mostly experienced ML engineers who want more control than the API offers.
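One reading that reproduces the +65 is a sample-weighted positive share minus a sample-weighted negative share across the four sources. The sketch below shows that calculation. It is our reconstruction from the table, not a documented scoring formula.

```python
from typing import List, Tuple

# (source, mentions, positive share, negative share), from the table above
ROWS: List[Tuple[str, int, float, float]] = [
    ("Hacker News", 287, 0.79, 0.11),
    ("r/LocalLLaMA", 412, 0.75, 0.14),
    ("X/Twitter", 312, 0.82, 0.09),
    ("r/MachineLearning", 113, 0.68, 0.16),
]


def net_sentiment(rows: List[Tuple[str, int, float, float]]) -> float:
    """Weighted positive percent minus weighted negative percent."""
    total = sum(n for _, n, _, _ in rows)
    positive = sum(n * pos for _, n, pos, _ in rows)
    negative = sum(n * neg for _, n, _, neg in rows)
    return 100 * (positive - negative) / total


print(sum(n for _, n, _, _ in ROWS))
print(round(net_sentiment(ROWS), 1))
```

Weighting by mention count means r/LocalLLaMA, the largest sample, pulls the result more than r/MachineLearning does.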
Who should avoid this
Skip this if you fall into any of these buckets. Naming it up-front beats a support ticket later.
- Anyone needing to train a model from scratch. Together is inference + fine-tuning only. Use Lambda or CoreWeave for training.
- Custom CUDA kernel developers. No GPU access. Use Lambda or RunPod.
- HIPAA-strict workloads. No standard-tier BAA. Enterprise contracts may add HIPAA addenda; check current status.
- EU data-residency requirements. Together's dedicated endpoints are US-only. Wait for EU rollout or use a different inference strategy.
- Models outside Together's catalog. They host 50+ open models. Niche or research-only models won't be there.
- Inference workloads needing custom sampling, speculative decoding control, or vLLM internals. Use self-hosted vLLM on Lambda.
- Heavy image-generation workloads (SDXL, FLUX) at scale. Replicate has a broader image-model catalog.
Testing evidence
# BEFORE (OpenAI)
import os
from openai import OpenAI

# Read the key from the environment rather than hard-coding it.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "explain transformers in 50 words"}],
)
# Cost on 5M tokens/day: ~$1,250/month

# AFTER (Together: only base_url + model + key changed)
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("TOGETHER_API_KEY"),
    base_url="https://api.together.xyz/v1",
)
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "explain transformers in 50 words"}],
)
# Cost on 5M tokens/day: ~$132/month (89% reduction)
| Concurrency | P50 first token (ms) | P95 first token (ms) | Throughput (tok/s) |
|---|---|---|---|
| 1 | 187 | 242 | 94 |
| 8 | 218 | 341 | 812 |
| 32 | 274 | 468 | 1,712 |
| 64 | 352 | 641 | 2,184 |
| 128 | 511 | 1,124 | 2,890 |
Dedicated H100 at the same concurrency (32): P50 196 ms, P95 308 ms, throughput 1,876 tok/s.
Gap: shared API is ~12% slower at P50 and 31% slower at P95 vs dedicated.
The verdict
Together AI is the cheapest path from 'OpenAI works but it's expensive' to 'production AI app at sustainable margins'. For most developers building AI apps in 2026, the question isn't whether Together is good; it's whether your workload matches the inference-only, open-model menu. If yes, it's the right call, full stop.
If your workload extends past inference (you need training, custom CUDA, regulated data residency, models outside the catalog), Together is one of the pieces of your stack, not the whole answer. Pair it with Lambda for training and Replicate for image models if needed.
If Together AI doesn't fit, consider
Lambda Labs
Run your own vLLM on H100 SXM. Reserved 1-yr at $1.85/hr matches Together dedicated economics with full control.
Replicate
10k+ community models hosted. Better for image/video models. Slightly more expensive per-token on text.
Modal Labs
If your inference is bursty and you need a custom Python wrapper around the model, Modal's function-style billing fits better.