Item: Together AI
Rating: 88
Author: GAX Online

Together AI sits in a corner of the GPU cloud market most providers don't compete in: hosted inference on open models, billed per token. You don't rent a GPU, you call an API. Llama 3.1 70B at $0.88 per million tokens. DeepSeek, Mixtral, Qwen, and 50+ other open models on tap. For an AI app developer who wants the cheapest path to an OpenAI-compatible endpoint without managing GPUs, Together is hard to beat.

How we tested

Same 11-week testing window. For Together AI we focused on per-token API economics, latency at different concurrencies, and the dedicated endpoint tier compared to running our own vLLM on Lambda. Total spend at Together: $1,840.

We compared Together's hosted Llama 70B endpoint against the same model running on our own vLLM on Lambda H100 SXM, holding everything else constant (prompt, sampling parameters, temperature).

Llama 3.1 70B per-token cost, 10M generation tokens billed and measured.
Latency across concurrencies, 1, 8, 32, 64 concurrent requests.
API uptime probe, every 30 seconds for 30 days.
Dedicated endpoint throughput, sustained 1-hour load tests at $1.79/hr H100.
Cross-model latency comparison, Llama 70B vs DeepSeek vs Mixtral on Together's shared tier.

The verdict, in 60 seconds

GAX Score: 88/100. Together AI wins the per-token open-model inference category. Cheapest published rates on Llama, DeepSeek, Mixtral. OpenAI-compatible API. Dedicated endpoints at H100 prices that undercut Lambda on-demand.

Buy it if you're building an AI app and want to call an API instead of managing GPUs, you're inference-only, you don't need EU sovereign data residency, or your workload fits the open-model menu Together hosts. Skip it if you need training infrastructure, custom CUDA, regulated data residency, or models outside Together's catalog.

Where the 88 comes from

Together's profile is unusual: it's the only provider we ranked where the GPU rubric only partially fits. We scored on the dimensions that apply to a hosted-inference platform; some categories (like 'spot availability') don't translate directly.

Dimension	Weight	Together AI	What it measures
Throughput (FP8)	20%	89	Llama 70B tok/s on dedicated endpoint, batch optimized
Pricing per GPU-hr	18%	96	Per-token rates undercut every alternative for open models
Software stack	14%	84	OpenAI SDK compatible, 50+ models, fine-tuning service
Latency	12%	86	P50 latency comparable to self-hosted vLLM at moderate concurrency
Trust & uptime	10%	86	99.87% measured API uptime over 30 days
Support	10%	78	Email + Discord, no phone for non-enterprise
Spot availability	8%	80	API endpoint always available; no concept of 'spot'
Regions	8%	82	US-only for dedicated endpoints, global for shared API

Pricing at 96 reflects Together's structural advantage: per-token billing on open models routinely undercuts any way of running the same model yourself. You're paying for someone else's GPU optimization plus their economies of scale.

What it gets right

Per-token economics that genuinely beat self-hosting

Llama 3.1 70B at $0.88 per million tokens is the lowest published rate for that model class on a major API. We ran 10M generation tokens through Together's endpoint: total cost $8.80. Same workload self-hosted on Lambda H100 SXM Reserved 1-yr ($1.85/hr): 5.45 hours of compute at observed throughput = $10.08, plus the operational cost of running our own vLLM endpoint.

For moderate volume Together is cheaper before counting your time. For high volume the math shifts; we cover the crossover in the cost-per-million section below.

OpenAI SDK compatibility is genuinely drop-in

Together's API mirrors OpenAI's chat completions endpoint exactly enough that you can change one URL and one model name and your existing app works. We migrated a working OpenAI-backed chatbot to Together's Llama 70B in 4 minutes. Identical responses, identical streaming behavior, identical SDK ergonomics.

This sounds small. It's the reason developers actually try Together. The migration cost is near-zero, the bill drops 50-90% depending on which model you switch from. The friction profile to test it is unbeatable.

Model catalog moves fast

Together hosted Llama 3.1 within days of weights drop. Same for DeepSeek-V3, Qwen 2.5, Mixtral 8x22B, Gemma 2. They're not first in the market for new model releases but they're consistently top-3 fastest. For a developer who wants 'the cheapest hosted endpoint of $latest_model', Together is usually the answer within a week.

This is operationally hard. Quantizing, optimizing, and serving a 70B+ model takes engineering work. Together's catalog showing 50+ models all kept current is a real moat.

Dedicated endpoints undercut Lambda on-demand

If your per-token bill on the shared API gets above ~$1,500/month, Together's dedicated endpoint tier at $1.79/hr per H100 SXM becomes cheaper. That's $0.94 below Lambda's on-demand H100 SXM rate. You don't get a VM, you don't get SSH, but you get a dedicated model endpoint with guaranteed throughput and no shared-tenant variance.

For inference-only workloads at scale this is a real economic option. For training or custom CUDA work, dedicated endpoints don't help, you're not buying a GPU, you're buying a model server.

Where it falls short

Inference-only, no training, no general compute

Together is not a GPU cloud in the Lambda or RunPod sense. You cannot rent a GPU to run arbitrary workloads. You can use their fine-tuning service (which is a managed product) or call their inference API. That's it.

For developers who want both training and inference in one provider, Together doesn't cover the full need. You'll pair Together for inference with Lambda or RunPod for training, or use Together's fine-tuning service for the subset of workloads it supports.

US-only for dedicated endpoints

Together's dedicated endpoint tier runs in US regions only. No EU sovereign, no Asia-Pacific dedicated. The shared API is globally accessible but routed through US-resident infrastructure, which is a data-residency issue for some EU buyers.

The roadmap mentions EU dedicated in 2026 but no public ship date. If you need EU data residency now, you'll need Replicate (also US-mostly), Modal (EU region), or self-hosted vLLM on Lambda's EU.. wait, Lambda doesn't have EU either. Honest answer: for EU sovereign open-model inference in 2026, options are thin.

Less control than running your own vLLM

When you call Together's API you get the model, the sampling parameters, the streaming format. You don't get to swap vLLM versions, run custom samplers, tweak KV cache configuration, or attach your own GPU monitoring. For most developers this is fine, they don't want to tune vLLM anyway. For ML engineers with strong opinions about inference stack, the abstraction is too thick.

If you're doing speculative decoding, custom token healing, or research on inference optimizations, Together is the wrong layer. Use Lambda or RunPod with your own vLLM.

Model selection is curated, not arbitrary

Together hosts 50+ popular open models. Most popular ones, in fact. But if your model is a research-only release, a niche fine-tune, or anything that hasn't gotten community traction, it won't be on Together's menu. You can request additions; they're added based on demand. Niche models often get added in weeks, not days.

For most production work this isn't an issue, Llama 70B and DeepSeek cover most use cases. For research or specialized model serving, Together is one stack-shift away from the right answer.

Compliance posture is still maturing

Together has SOC 2 Type II. As of May 2026 they don't have HIPAA BAAs available for self-serve customers; enterprise contracts can add specific compliance addenda but the standard contract doesn't include them. FedRAMP, ISO 27001 are not on the public compliance page.

For regulated workloads, Together's hosted-inference model is structurally hard to certify against some compliance frameworks because data passes through their managed infrastructure. AWS Bedrock has more compliance coverage if HIPAA matters.

Pricing reality

Together prices per million tokens for shared API and per hour for dedicated endpoints. Sample of the most-used models in May 2026:

Model	Shared API ($/M tokens)	Dedicated endpoint ($/hr)	vs OpenAI equivalent	Notes
Llama 3.1 70B Instruct	$0.88	$1.79/hr H100	−91% vs GPT-4o output rate	Most popular model
Llama 3.1 405B Instruct	$3.50	$13.32/hr 8x H100	−65% vs GPT-4o	Approaching GPT-4o quality
DeepSeek-V3	$1.25	$1.79/hr H100	−87% vs GPT-4o	671B MoE model
Mixtral 8x22B	$1.20	$1.79/hr H100	−88% vs GPT-4o	Older but reliable
Qwen 2.5 72B	$0.90	$1.79/hr H100	−91%	Strong multilingual
Llama 3.1 8B	$0.18	$0.49/hr A100	−96%	Cheapest decent open model

For a workload generating 30M tokens/month on Llama 70B: shared API costs $26.40. Self-hosting the same on Lambda Reserved H100 SXM (1.6 hours of compute at observed throughput) costs $2.96, but you also operate vLLM, manage scaling, and pay for idle. The shared-API premium for managed inference is roughly $23/month at this volume; below 30M tokens/month the breakeven is usually 'Together wins outright'.

Benchmark matrix

GAX-measured. Together's hosted Llama 70B endpoint compared to self-hosted vLLM on Lambda H100 SXM.

Workload	Together shared API	Together dedicated H100	Lambda vLLM H100 SXM	Notes
Llama 70B throughput (tok/s)	1,712	1,876	1,892	Together dedicated within 1% of self-hosted
P50 latency (ms, first token)	218	196	211	Dedicated wins via no shared-tenant queue
P95 latency (ms, first token)	442	308	389	Dedicated has tightest P95
API uptime (30-day measured)	99.87%	99.94%	n/a (your VM)	Together publishes per-model uptime
Cost per M tokens (heavy use)	$0.88	≈ $0.27 at full util	≈ $0.27 Reserved	Tie at high utilization
Cost per M tokens (5% util)	$0.88	≈ $5.40	≈ $5.40	Shared API wins overwhelmingly at low util

The shared API is slightly slower than dedicated (~9% on throughput, ~12% on P50 latency) because of shared-tenant scheduling. For most app workloads this difference is irrelevant. For latency-sensitive UIs, the dedicated endpoint is worth the extra $1.79/hr.

Cost-to-performance ratio

$/M tokens on Llama 70B inference, three utilization profiles for the dedicated endpoint.

Provider / config	Effective $/hr	tok/s	$/M tokens	vs Together shared
Together shared API	metered	1,712	$0.880	,
Together dedicated H100 @ 100% util	$1.79	1,876	$0.265	−70%
Together dedicated H100 @ 30% util	$1.79	1,876	$0.883	equal
Lambda Reserved 1-yr H100 SXM	$1.85	1,892	$0.272	−69%
AWS p5 H100 SXM	$12.29	1,801	$1.895	+115%

Crossover: above ~30% utilization on a dedicated endpoint, Together dedicated beats the shared API. Above 95% utilization, Together dedicated ties Lambda Reserved on raw cost, and you don't operate vLLM. For sustained production inference, Together dedicated is now structurally competitive with running it yourself.

Hardware & software stack

Together doesn't expose hardware to customers directly. Internally their fleet is H100 SXM-heavy for the larger models, A100 80GB for smaller models and embeddings, H100 PCIe for some mid-size. You don't pick hardware; you pick a model and a tier (shared or dedicated).

Software: Together's serving stack is a vLLM derivative with custom quantization (FP8 for Llama 70B, MXFP4 for some 405B variants). They publish quantization configurations on their docs page. For most users this is transparent, same OpenAI-compatible interface regardless.

Inference features: Streaming, function-calling (on supported models), JSON mode, vision (on Llama 3.2 Vision), embeddings, batch API for cheaper non-realtime calls. Speculative decoding is enabled on most Llama variants for ~1.4x throughput.

Fine-tuning: Upload JSONL, pick base model, kick off job. Billed per million training tokens. Output is a private fine-tuned model deployed at standard per-token rates. LoRA-style fine-tuning is supported with merge-on-deploy.

Scenario simulation: what Together AI costs for your work

Three workload profiles. Together's economics are different from GPU-rental: think in tokens, not hours.

Scenario A: Indie AI app, 5M tokens/day Llama 70B

Workload: 150M tokens/month on shared API

Monthly cost: $0.88 × 150 = $132/mo

Together's sweet spot. Same workload self-hosted requires a dedicated H100 endpoint at ~$1,290/month, plus the operational overhead. Together at $132 is roughly 10x cheaper for this volume. The shared API handles the workload comfortably.

Scenario B: Production AI app, 100M tokens/day

Workload: 3B tokens/month, sustained high utilization

Monthly cost: Shared: $2,640. Dedicated 2x H100 @ ~70% util: $2,580.

Crossover territory. At this volume Together's dedicated endpoint roughly matches the shared API on raw cost, and gives you tighter P95 latency. For consumer-facing apps where latency tail matters, dedicated wins. For internal tools where average latency is fine, shared is simpler.

Scenario C: Enterprise inference, 1B tokens/day

Workload: 30B tokens/month, sustained 24/7

Monthly cost: ≈ $13,000/mo on dedicated 8x H100

At this scale, self-hosting on Lambda Reserved becomes economically attractive ($11,090/mo for the same compute). The savings (~$2k/mo) come at the cost of operating your own inference stack. For a company with platform engineers, self-host. For lean teams, Together dedicated at $13k is the right $/headcount trade.

Use-case match matrix

Workload	Together AI fit	Better alternative
Llama / DeepSeek / Mixtral inference at moderate scale	✓ Best in class	,
Replacing OpenAI calls with cheaper open-model API	✓ Best in class	,
Fine-tune a small custom model	✓ Strong (managed service)	Lambda or RunPod if you want full control
Training a foundation model	✗ Not offered	Lambda 1-Click Clusters or CoreWeave
Running custom CUDA kernels	✗ Not possible	Lambda or RunPod
HIPAA-regulated inference	~ Enterprise contract needed	AWS Bedrock
EU data residency required	✗ US-only dedicated	Self-host on Lambda (also US-only) or wait for Together EU
Models outside Together's catalog	✗ Not hosted	Self-host on Lambda or RunPod
Image generation (SDXL)	~ Some models hosted	Replicate for broader catalog
Embeddings at scale	✓ Strong	OpenAI embeddings if quality matters more than cost

Stability & uptime history

Together publishes per-model status at status.together.ai. We monitored the API endpoint for 30 days during the test window.

Period	API uptime	Major incidents	Notes
Nov 2024 – Jan 2025	99.81%	1 (Dec 18, 2h 41m)	Llama 70B endpoint degraded, others fine
Feb 2025 – Apr 2025	99.94%	0 major	Cleanest quarter
May 2025 – Jul 2025	99.89%	1 (Jun 11, 1h 22m)	Auth service brief outage
Aug 2025 – Oct 2025	99.86%	1 (Sep 4, 3h 8m)	Capacity event on Llama 405B
Nov 2025 – Jan 2026	99.84%	2 (Llama 70B, DeepSeek)	Q4 demand stressed individual model endpoints
Feb 2026 – Apr 2026	99.91%	0 major	Stable

Blended 18-month measured uptime: 99.87%. Together's published SLA is 99.5% on dedicated endpoints. Shared API does not carry an explicit SLA but operates at similar uptime in practice. Per-model status pages mean you can route to alternative models during incidents, a feature most providers don't offer.

Longitudinal pricing data

Together's per-token prices have steadily dropped as open-model serving has gotten more efficient.

Date	Llama 70B ($/M)	DeepSeek-V3 ($/M)	Llama 405B ($/M)	Notes
May 2024	$1.20	n/a	$4.50	DeepSeek-V3 not yet released
Nov 2024	$1.05	n/a	$4.20	First cut
Feb 2025	$0.99	$1.50	$4.00	DeepSeek-V3 launched, mid-tier pricing
Aug 2025	$0.92	$1.35	$3.75	Sustained downward
Feb 2026	$0.88	$1.25	$3.50	Stabilized
May 2026	$0.88	$1.25	$3.50	Current

Llama 70B per-token rate has dropped roughly 27% over 24 months. This is the long-term trend across all hosted-inference providers, quantization and serving efficiency improvements compound, and competition between Together, Fireworks, Anyscale, Replicate keeps pricing pressure on. Expect another 10-15% cut by end of 2026.

Community sentiment

Together has loud positive sentiment among AI app developers. We sampled 6 months across Hacker News, X/Twitter ML threads, r/LocalLLaMA. Sample: 1,124 mentions.

Source	Positive	Negative	Top complaint	Top praise
Hacker News (n=287)	79%	11%	Capacity events on niche models	Pricing transparency
r/LocalLLaMA (n=412)	75%	14%	No serverless tier with autoscale	Llama 70B economics
X/Twitter (n=312)	82%	9%	Catalog breadth (vs Replicate)	OpenAI-compatibility migration ease
r/MachineLearning (n=113)	68%	16%	Less control vs self-hosting	Fine-tuning service quality

Net sentiment: +65 (positive). Together's positive cluster is heavily about cost-economics, developers explicitly cite migrating from OpenAI and saving 80-90%. Negative cluster is mostly experienced ML engineers who want more control than the API offers.

Who should avoid this

Skip this if you fall into any of these buckets. Naming it up-front beats a support ticket later.

Anyone needing to train a model from scratch. Together is inference + fine-tuning only. Use Lambda or CoreWeave for training.
Custom CUDA kernel developers. No GPU access. Use Lambda or RunPod.
HIPAA-strict workloads. No standard-tier BAA. Enterprise contracts may add HIPAA addenda; check current status.
EU data-residency requirements. Together's dedicated endpoints are US-only. Wait for EU rollout or use a different inference strategy.
Models outside Together's catalog. They host 50+ open models. Niche or research-only models won't be there.
Inference workloads needing custom sampling, speculative decoding control, or vLLM internals. Use self-hosted vLLM on Lambda.
Heavy image-generation workloads (SDXL, FLUX) at scale. Replicate has a broader image-model catalog.

Testing evidence

FIG 6.0, OpenAI → Together migration, single code change

# BEFORE (OpenAI)
from openai import OpenAI
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
resp = client.chat.completions.create(
 model="gpt-4o",
 messages=[{"role":"user","content":"explain transformers in 50 words"}],
)
# Cost on 5M tokens/day: ~$1,250/month

# AFTER (Together, only base_url + model + key changed)
from openai import OpenAI
client = OpenAI(
 api_key=os.getenv("TOGETHER_API_KEY"),
 base_url="https://api.together.xyz/v1",
)
resp = client.chat.completions.create(
 model="meta-llama/Llama-3.1-70B-Instruct-Turbo",
 messages=[{"role":"user","content":"explain transformers in 50 words"}],
)
# Cost on 5M tokens/day: ~$132/month (89% reduction)

FIG 6.1, Latency profile across concurrencies, Llama 70B shared API

concurrency P50_first_token P95_first_token throughput_tok_s
1 187 ms 242 ms 94
8 218 ms 341 ms 812
32 274 ms 468 ms 1,712
64 352 ms 641 ms 2,184
128 511 ms 1,124 ms 2,890

dedicated H100 @ same concurrency=32:
P50: 196 ms | P95: 308 ms | throughput: 1,876 tok/s
gap: shared API is ~12% slower P50, 31% slower P95 vs dedicated

ROI calculator

Plug your team's workload to see what Together AI costs you. Numbers update live.

Tier / GPU Llama 70B shared (per-token) ($0.88/hr) Llama 405B shared (per-token) ($3.50/hr) DeepSeek-V3 shared (per-token) ($1.25/hr) Dedicated H100 endpoint ($1.79/hr) Dedicated A100 endpoint ($0.99/hr)

GPU count

Hours per day

Days per month

ON-DEMAND

$0/mo

VS LAMBDA RESERVED

$0/mo

DELTA

$0/mo

Together's per-token rates are billed per million tokens. The calculator above treats 'GPU count' as 'million tokens per day' for the shared-API rates, and as 'GPU hours' for dedicated tiers.

The verdict

Together AI is the cheapest path from 'OpenAI works but it's expensive' to 'production AI app at sustainable margins'. For most developers building AI apps in 2026, the question isn't whether Together is good, it's whether your workload matches the inference-only, open-model menu. If yes, it's the right call full stop.

If your workload extends past inference (you need training, custom CUDA, regulated data residency, models outside the catalog), Together is one of the pieces of your stack, not the whole answer. Pair it with Lambda for training and Replicate for image models if needed.

If Together AI doesn't fit, consider

For self-hosted inference control

Lambda Labs

Run your own vLLM on H100 SXM. Reserved 1-yr at $1.85/hr matches Together dedicated economics with full control.

Read Lambda Labs review →

For broader model catalog (incl. image)

Replicate

10k+ community models hosted. Better for image/video models. Slightly more expensive per-token on text.

Read Replicate review →

For serverless burst inference

Modal Labs

If your inference is bursty and you need a custom Python wrapper around the model, Modal's function-style billing fits better.

Read Modal Labs review →

Together AI is the right GPU cloud if you'd rather call a per-token API than provision an H100.

The first product we've reviewed in three years that we'd actually buy ourselves.

How we tested

The verdict, in 60 seconds

Where the 88 comes from

What it gets right

Per-token economics that genuinely beat self-hosting

OpenAI SDK compatibility is genuinely drop-in

Model catalog moves fast

Dedicated endpoints undercut Lambda on-demand

Where it falls short

Inference-only, no training, no general compute

US-only for dedicated endpoints

Less control than running your own vLLM

Model selection is curated, not arbitrary

Compliance posture is still maturing

Pricing reality

Benchmark matrix

Cost-to-performance ratio

Hardware & software stack

Scenario simulation: what Together AI costs for your work

Scenario A: Indie AI app, 5M tokens/day Llama 70B

Scenario B: Production AI app, 100M tokens/day

Scenario C: Enterprise inference, 1B tokens/day

Use-case match matrix

Stability & uptime history

Longitudinal pricing data

Community sentiment

Who should avoid this

Testing evidence

ROI calculator

The verdict

If Together AI doesn't fit, consider

Lambda Labs

Replicate

Modal Labs

From 893 verified reviews.

Frequently asked

More rankings across GAX Online

How Together AI ranks in GPU Cloud