The real total cost of running your own LLM in 2026
GPU sticker prices are only one line on the invoice. A breakdown of every cost center for self-hosted inference, with current 2026 numbers, so you can decide whether the math works for your use case.
We’ve written before about when self-hosting an LLM beats the cloud APIs. This post goes deeper into one specific question: what does it actually cost?
Cloud API pricing is easy to model. You pay per token, and the rate is on the OpenAI or Anthropic dashboard. Self-hosted is harder because the cost is spread across hardware, electricity, rack space (or cloud GPU rental), engineering time, and the always-underestimated category of “things that go wrong.”
Here’s the honest accounting in 2026.
The four cost centers
A self-hosted inference deployment has four cost categories. People who quote you “$2/hour for an A100” are quoting one of them.
- Hardware capex (or GPU rental opex if you skip ownership)
- Electricity and cooling
- Operational overhead. Hosting, networking, monitoring, on-call
- Engineering time. The part that’s never on the invoice
Skip any of these and the comparison vs. cloud APIs is dishonest.
1. Hardware capex (or rental)
The two paths most clients consider: buy GPUs and rack them, or rent from a cloud GPU provider.
Buying. Street prices verified May 2026 against ServerSupply, Thundercompute, and Compute Exchange listings:
| Card | VRAM | Approx. street price (USD, May 2026) | Best for |
|---|---|---|---|
| NVIDIA RTX 6000 Ada | 48 GB | $7,800–$9,400 | Workstation inference up to ~70B quantized |
| NVIDIA L40S | 48 GB | $7,500–$9,000 | Server inference, batch + single-user |
| NVIDIA H100 80GB SXM | 80 GB | $35,000–$40,000 (PCIe variant: $25,000–$30,000) | High-throughput, larger models |
| NVIDIA H200 141GB NVL | 141 GB | $30,000–$35,000 single card | Long-context, large models |
| AMD MI300X | 192 GB | $17,000–$20,000 | Cost-per-VRAM-GB winner if your stack supports ROCm |
A two-card workstation for a small team (e.g., 2× RTX 6000 Ada) lands around $16,000-19,000 in cards alone. Add a real server chassis, EPYC or Xeon CPU, fast NVMe, dual PSU, and you’re at $25k-30k for a workstation-class node, $60k+ for a proper rack server.
Renting. Hourly GPU rental from RunPod / Lambda Labs / Vast.ai / DataCrunch / TensorDock, verified May 2026 (rates move weekly; re-check before quoting a deployment):
- A100 80GB: roughly $1.40–$1.80/hr community / spot, $2.00–$2.80/hr on-demand
- H100 80GB: roughly $1.90–$2.50/hr community, $3.00–$4.00/hr on-demand
- H200 141GB: roughly $2.30–$4.00/hr depending on provider and tier
- L40S: roughly $0.79–$1.50/hr
At $2.20/hr for a community-tier H100 running 24/7, that’s roughly $1,600/month, about $19k/year. At that rate it takes roughly two years of continuous rental (22-25 months) to reach the price of a single retail H100 SXM, and that is before the power, chassis, and hosting costs that ownership adds. Renting makes sense when you’re prototyping or have lumpy demand. Buying makes sense when utilization is high and predictable.
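The rent-versus-buy arithmetic is easy to rerun as rates move. The sketch below is a minimal version of it, assuming a 720-hour month (24 h × 30 days) and ignoring power, chassis, and resale value; the function names are illustrative, not from any library.

```python
HOURS_PER_MONTH = 720  # 24 h x 30 days, the convention used in this post


def monthly_rental_cost(hourly_rate: float, utilization: float = 1.0) -> float:
    """Monthly rental bill for one GPU.

    utilization is the fraction of the month the instance is running
    (1.0 means 24/7).
    """
    return hourly_rate * HOURS_PER_MONTH * utilization


def months_to_match_purchase(card_price: float, hourly_rate: float,
                             utilization: float = 1.0) -> float:
    """Months of rental whose cumulative cost equals the card's price.

    Ignores power, chassis, and resale value, so it understates the
    true cost of owning.
    """
    return card_price / monthly_rental_cost(hourly_rate, utilization)


if __name__ == "__main__":
    rate = 2.20  # USD/hr, community-tier H100 figure from the text
    print(f"24/7 rental: ${monthly_rental_cost(rate):,.0f}/month")
    for price in (35_000, 40_000):  # retail H100 SXM range from the table
        months = months_to_match_purchase(price, rate)
        print(f"${price:,} card = {months:.1f} months of rental")
```

Lowering `utilization` stretches the break-even proportionally, which is why lumpy demand favors renting.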
2. Electricity and cooling
A typical inference card draws 300-700W under load. A two-card workstation pulling 800W average over 24 hours = 19.2 kWh/day = 580 kWh/month.
At current Toronto Hydro time-of-use rates (roughly $0.10-0.20/kWh depending on the period, blending to about $0.13-0.15/kWh after the Ontario Electricity Rebate), that’s roughly $75-90/month in electricity at the blended rate, and up to about $110 if most of the load lands in on-peak periods. That is before cooling.
Cooling adds roughly 25-50% on top of compute power depending on whether you’re using air or liquid cooling. A small server room with a dedicated AC unit will run another $30-50/month on top.
Realistic monthly electricity for a small on-prem inference setup: $130-180.
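To rerun the power math for a different wattage or tariff, a small helper is enough. This is a sketch under simple assumptions: a constant average draw, a single blended rate per kWh, a 30-day month, and cooling modeled as a flat percentage on top of compute power. The function names are illustrative.

```python
def monthly_kwh(avg_watts: float, days: int = 30) -> float:
    """Energy used in a month at a constant average draw."""
    return avg_watts / 1000 * 24 * days


def monthly_electricity_cost(avg_watts: float, rate_per_kwh: float,
                             cooling_overhead: float = 0.0,
                             days: int = 30) -> float:
    """Monthly power bill in the currency of rate_per_kwh.

    cooling_overhead is a fraction of compute power, e.g. 0.25 for the
    25% end of the range quoted in the text.
    """
    return monthly_kwh(avg_watts, days) * rate_per_kwh * (1 + cooling_overhead)


if __name__ == "__main__":
    watts = 800  # two-card workstation average from the text
    print(f"{monthly_kwh(watts):.0f} kWh/month")
    for rate in (0.13, 0.15):          # blended rate range from the text
        for cooling in (0.0, 0.25, 0.50):
            cost = monthly_electricity_cost(watts, rate, cooling)
            print(f"rate ${rate:.2f}/kWh, cooling +{cooling:.0%}: ${cost:.0f}/month")
```

A dedicated AC unit is a separate fixed cost and is not covered by the percentage overhead here.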
3. Operational overhead
Things that are easy to forget when you’re modeling:
- Hosting / bandwidth. Even on-prem, you need a static IP, redundant ISP, or a Cloudflare Tunnel. Budget $50-150/month.
- Monitoring. Grafana/Prometheus on a small VPS, or a managed service like Datadog/Better Stack. $20-200/month.
- Backup of model weights and configs. Small but real, around $5-15/month.
- Spare parts. Fans, drives, the occasional PSU. Budget 5-10% of hardware cost per year as a maintenance reserve.
Realistic monthly opex for a small on-prem deployment: $100-300.
For rented GPUs, almost all of this is bundled into the hourly rate (except monitoring), so adjust.
4. Engineering time, the line nobody puts on the invoice
The biggest hidden cost. Self-hosting an LLM in 2026 is much easier than it was in 2024, but it is not free of engineering work.
What we’ve seen across deployments:
- Initial setup: 20-60 hours for a single-model, single-tenant deployment. Includes choosing a serving framework (vLLM, SGLang, TensorRT-LLM, llama.cpp server), tuning quantization, wiring up an API gateway, setting up TLS, basic logging.
- Ongoing maintenance: 2-8 hours/month, ramping up around model swaps and framework upgrades.
- Incidents: budget 1-2 unscheduled incidents per quarter requiring 4-8 hours each. Things that fail: driver updates, OOM under unexpected load, badly sized batches, KV-cache configuration, certificate renewals.
At a Toronto senior-developer consultant rate of $120-200/hour (or a fully loaded in-house cost of roughly $60-90/hour), that’s $250-1,500/month in engineering time once maintenance and incident hours are averaged out, depending on how stable your deployment is and how many models you’re juggling.
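One way to turn the maintenance and incident ranges above into a monthly figure is to spread quarterly incident hours evenly over three months. The sketch below does that; it leaves out the one-time 20-60 setup hours, and the function names are illustrative.

```python
def monthly_engineering_hours(maintenance_hours_per_month: float,
                              incidents_per_quarter: float,
                              hours_per_incident: float) -> float:
    """Average recurring hours per month: maintenance plus incidents.

    Incident hours are quoted per quarter, so they are divided by 3.
    """
    return maintenance_hours_per_month + incidents_per_quarter * hours_per_incident / 3


def monthly_engineering_cost(hours_per_month: float, hourly_rate: float) -> float:
    """Recurring monthly engineering cost at a given hourly rate."""
    return hours_per_month * hourly_rate


if __name__ == "__main__":
    quiet = monthly_engineering_hours(2, 1, 4)   # low end of each range
    busy = monthly_engineering_hours(8, 2, 8)    # high end of each range
    for label, rate in (("in-house, low", 60), ("in-house, high", 90),
                        ("consultant, low", 120), ("consultant, high", 200)):
        low = monthly_engineering_cost(quiet, rate)
        high = monthly_engineering_cost(busy, rate)
        print(f"{label:>16} (${rate}/hr): ${low:,.0f} to ${high:,.0f}/month")
```

The extremes of this grid are wider than a typical deployment; most land somewhere in the middle because quiet months and cheap hours rarely coincide with the worst incident load.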
This is the line that flips the cost-comparison vs. cloud APIs more often than any other.
Putting it together: realistic monthly TCO
Two illustrative scenarios:
Scenario A. Small clinic running a 70B-class model on-prem (single workstation)
- Capex amortization (3-year): $26k / 36 months = $725/month
- Electricity: $150/month
- Operational: $150/month
- Engineering: $400/month (4 hours of light maintenance at $100/hr blended)
- Total: ~$1,400/month
For comparison, the same workload on Anthropic’s Claude Sonnet 4.6 API at moderate volume (5M tokens/month, mixed input/output) would cost roughly $15-75/month at current 2026 pricing, using the same per-token rates as the Scenario B comparison. That is far cheaper than self-hosting on pure cost. Self-hosting only wins here for non-cost reasons, typically PHIPA / data-residency requirements, which is exactly the situation we recommend it for.
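The Scenario A total is a straight-line amortization plus three recurring lines. A minimal sketch, with an illustrative `OnPremScenario` class that is not taken from any library, makes it easy to swap in your own figures:

```python
from dataclasses import dataclass


@dataclass
class OnPremScenario:
    capex: float               # total hardware cost, USD
    amortization_months: int   # straight-line depreciation period
    electricity: float         # USD per month
    operations: float          # USD per month
    engineering_hours: float   # recurring hours per month
    engineering_rate: float    # USD per hour

    def monthly_capex(self) -> float:
        """Hardware cost spread evenly over the amortization period."""
        return self.capex / self.amortization_months

    def monthly_total(self) -> float:
        """Sum of all four cost centers for one month."""
        return (self.monthly_capex() + self.electricity + self.operations
                + self.engineering_hours * self.engineering_rate)


if __name__ == "__main__":
    # Inputs are the Scenario A figures from the list above.
    clinic = OnPremScenario(capex=26_000, amortization_months=36,
                            electricity=150, operations=150,
                            engineering_hours=4, engineering_rate=100)
    print(f"capex: ${clinic.monthly_capex():,.0f}/month")
    print(f"total: ${clinic.monthly_total():,.0f}/month")
```

Shortening `amortization_months` to 24 is a quick way to stress-test the total against faster GPU depreciation.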
Scenario B. Mid-size SaaS doing 200M tokens/month, predictable load
- 1× rented H100 community-tier ($2.20/hr × 720 hours): $1,600/month
- Operational + monitoring: $100/month
- Engineering: $600/month (heavier load = more tuning)
- Total: ~$2,300/month
Same volume on cloud APIs at current 2026 rates: roughly $600-5,000/month depending on model and input/output mix (Sonnet 4.6: $600-3,000; Opus 4.7: $1,000-5,000; GPT-5.4 mid-tier: similar to Sonnet). Self-hosting now ties or wins above ~250M tokens/month for output-heavy workloads, or above ~150M tokens/month if you’re using Opus-class models. Cloud APIs got cheaper through 2025-26; the break-even moved up.
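The break-even volume depends heavily on the input/output mix, so it is worth computing for your own traffic. The sketch below assumes the self-hosted bill is a flat monthly cost and that one GPU can absorb the volume, both of which are simplifications. The example prices of $3 and $15 per million tokens are simply the Sonnet range above divided by 200M tokens, not figures quoted from a price list, and the function names are illustrative.

```python
def blended_price_per_million(input_price: float, output_price: float,
                              output_share: float) -> float:
    """Average API price per million tokens for a given traffic mix.

    output_share is the fraction of all tokens that are output tokens.
    """
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return input_price * (1 - output_share) + output_price * output_share


def breakeven_million_tokens(self_host_monthly: float,
                             api_price_per_million: float) -> float:
    """Monthly volume (millions of tokens) where the API bill equals
    a flat self-hosting bill."""
    return self_host_monthly / api_price_per_million


if __name__ == "__main__":
    self_host = 1_600 + 100 + 600  # Scenario B lines from the list above
    for share in (0.1, 0.5, 0.9):
        # 3.0 and 15.0 USD/M are assumptions implied by the range in the text.
        price = blended_price_per_million(3.0, 15.0, share)
        volume = breakeven_million_tokens(self_host, price)
        print(f"{share:.0%} output: ${price:.2f}/M, break-even at {volume:.0f}M tokens/month")
```

If the volume outgrows a single card, the self-hosted cost steps up with each added GPU, and the break-even has to be recomputed per step.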
When the math says “self-host”
Three patterns where self-hosting genuinely wins, on TCO or on requirements that cost alone can’t satisfy:
- Token volume above ~250M/month with predictable, sustained load (or above ~150M if you’d otherwise be on Opus-class pricing).
- Data sensitivity that prohibits cloud APIs entirely (PHIPA, PIPEDA-regulated client data, internal IP).
- Latency requirements below ~200ms p99 that the cloud APIs can’t reliably hit because of geography or queue time.
If none of these apply, a cloud API is the right answer. The math we run on most discovery calls confirms it.
When the math says “cloud API”
Three patterns where self-hosting is a mistake:
- Your monthly spend on cloud APIs is currently under $300. The engineering time alone exceeds the saving.
- Demand is bursty (10x peaks). You’d over-provision hardware to handle peaks that happen 5% of the time.
- You don’t have a dedicated person who can be on-call for the deployment. Self-hosted means you’re operating infrastructure.
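The two checklists can be encoded as a rough screening function. This is only a sketch of the rules of thumb in this post, with the thresholds copied from the text. The `Workload` fields and `collect_signals` name are made up for illustration, and the function reports signals on both sides instead of issuing a verdict, because the post does not define how to weigh them against each other.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

BURSTY_RATIO = 10.0  # "10x peaks" from the text


@dataclass
class Workload:
    tokens_per_month_millions: float
    opus_class_pricing: bool                 # cloud option would be Opus-class
    cloud_prohibited: bool                   # e.g. PHIPA rules out cloud APIs
    p99_latency_target_ms: Optional[float]   # None if there is no hard target
    monthly_api_spend: float                 # current cloud API bill, USD
    peak_to_average_ratio: float             # 10.0 means 10x peaks
    has_on_call_owner: bool


def collect_signals(w: Workload) -> Dict[str, List[str]]:
    """Return the self-host and cloud-API signals that apply to a workload."""
    self_host: List[str] = []
    cloud: List[str] = []
    bursty = w.peak_to_average_ratio >= BURSTY_RATIO

    volume_threshold = 150 if w.opus_class_pricing else 250
    if w.tokens_per_month_millions > volume_threshold and not bursty:
        self_host.append(f"sustained volume above {volume_threshold}M tokens/month")
    if w.cloud_prohibited:
        self_host.append("data sensitivity rules out cloud APIs")
    if w.p99_latency_target_ms is not None and w.p99_latency_target_ms < 200:
        self_host.append("p99 latency target below 200 ms")

    if w.monthly_api_spend < 300:
        cloud.append("cloud API spend under $300/month")
    if bursty:
        cloud.append("bursty demand (10x peaks)")
    if not w.has_on_call_owner:
        cloud.append("nobody available to be on-call")

    return {"self_host": self_host, "cloud_api": cloud}


if __name__ == "__main__":
    clinic = Workload(5, False, True, None, 100, 1.0, False)
    for side, reasons in collect_signals(clinic).items():
        print(side, "->", reasons or "none")
```

A workload that shows signals on both sides, such as a clinic with a hard data-residency requirement and no on-call owner, is exactly the case that needs a conversation instead of a formula.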
What we don’t recommend
- Buying GPUs as an investment. The depreciation curve on consumer / prosumer cards has been brutal across 2024-2026. Treat GPUs as 2-3 year depreciable assets, not stores of value.
- Building your own serving stack. vLLM, SGLang, TensorRT-LLM, and llama.cpp server are all production-grade in 2026 (HuggingFace put TGI into maintenance mode; SGLang has emerged as the other major option alongside vLLM). Use them. Don’t rewrite.
- Mixing model versions on the same GPU. Memory fragmentation will bite you under load. One model per card, or use a proper batching framework.
If you’re trying to run the math for your specific use case, send us your numbers. Token volume, latency targets, data sensitivity. We’ll tell you honestly which side the comparison lands on.