LLM Resource Usage & Cost Calculator
Analyse memory requirements, throughput and cloud spend for open-source language models in seconds.
Curated presets
Start from modern open-weight models with known serving notes.
- Parameters
- 7.000 B (exact, sourced value)
- Hidden size
- 4096
- Layers
- 32
Quick estimator
Size weights and KV cache instantly from core workload inputs.
KV cache memory scales linearly with concurrent sequences and context length.
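The linear scaling described above can be sketched as follows. This is a minimal fp16 sketch, not the calculator's actual code; `kv_cache_bytes` is a hypothetical helper that assumes full multi-head attention (KV width equal to hidden size) and 2 bytes per element:

```python
def kv_cache_bytes(layers: int, hidden_size: int, context_len: int,
                   concurrent_seqs: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: keys and values stored for every layer, token, and sequence.

    Doubling the context length or the number of concurrent sequences
    doubles the result, which is the linear scaling noted above.
    """
    # 2 tensors (K and V) x layers x hidden width x tokens x sequences x bytes
    return 2 * layers * hidden_size * context_len * concurrent_seqs * bytes_per_elem

# Preset above: 32 layers, hidden size 4096, one user at 4096 tokens, fp16.
gib = kv_cache_bytes(32, 4096, 4096, 1) / 1024**3
print(f"{gib:.2f} GiB")  # 2.00 GiB for this configuration
```

Grouped-query attention (fewer KV heads than attention heads) shrinks this proportionally, which is why some checkpoints report much smaller caches at the same context length.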
Architecture assumptions
Using LLaMA-style scaling heuristics derived from parameter count. Enable manual mode to match a specific checkpoint.
- Hidden size
- 4096
- Layers
- 32
- Attention heads
- 32
- Feed-forward size
- 16384
Hidden size and depth follow public LLaMA scaling heuristics. Switch to manual mode to match custom architectures.
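One way such a heuristic can work is a nearest-neighbour lookup over published checkpoint shapes. This is a hypothetical sketch, not the calculator's actual code, and the table values are assumptions; note that real LLaMA checkpoints use narrower SwiGLU feed-forward layers than the classic 4× hidden-size ratio shown in the preset:

```python
# Hypothetical shape table: parameter count -> (hidden size, layers, attention heads)
KNOWN_SHAPES = {
    7e9:  (4096, 32, 32),
    13e9: (5120, 40, 40),
    70e9: (8192, 80, 64),
}

def guess_shape(params: float) -> dict:
    """Pick the closest known shape for a parameter count (a heuristic sketch)."""
    hidden, layers, heads = KNOWN_SHAPES[min(KNOWN_SHAPES, key=lambda p: abs(p - params))]
    # FFN width uses the classic 4x ratio, matching the 16384 preset above.
    return {"hidden": hidden, "layers": layers, "heads": heads, "ffn": 4 * hidden}

print(guess_shape(7e9))  # {'hidden': 4096, 'layers': 32, 'heads': 32, 'ffn': 16384}
```

Manual mode exists precisely because a lookup like this cannot capture every architecture; a custom checkpoint's real config file always wins.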
Memory & hardware
Assumes 1 concurrent user with a 4096-token context.
Weight memory and KV cache are exact arithmetic from your inputs. Activations, optimizer state, total VRAM, fit checks, and GPU recommendations are heuristic estimates because they depend on runtime behavior and overhead assumptions.
Fits on the selected GPU with 11.71 GB of headroom.
Closest matching GPUs
- NVIDIA GeForce RTX 3090: 3.71 GB spare
- NVIDIA GeForce RTX 4090: 3.71 GB spare
- NVIDIA GeForce RTX 5090: 11.71 GB spare
- NVIDIA A100 PCIe 40GB: 19.71 GB spare
- NVIDIA RTX 6000 Ada: 27.71 GB spare
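The fit check itself is simple subtraction of the estimated requirement from each card's VRAM. A sketch, where the GPU capacities are the cards' advertised sizes and the ~20.29 GB requirement is read off the headroom figures above (32 GB minus 11.71 GB of spare):

```python
# Advertised VRAM capacities in GB for the cards listed above.
GPU_VRAM_GB = {
    "RTX 3090": 24, "RTX 4090": 24, "RTX 5090": 32,
    "A100 PCIe 40GB": 40, "RTX 6000 Ada": 48,
}

def spare_vram(required_gb: float) -> dict:
    """Spare capacity per GPU after subtracting the estimated requirement."""
    return {name: round(vram - required_gb, 2) for name, vram in GPU_VRAM_GB.items()}

print(spare_vram(20.29))
# RTX 3090/4090: 3.71, RTX 5090: 11.71, A100 40GB: 19.71, RTX 6000 Ada: 27.71
```

Because the requirement includes heuristic activation and overhead terms, a card showing under a gigabyte of spare capacity should be treated as a borderline fit rather than a guarantee.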
Performance
All performance outputs below are heuristic estimates. They use a simplified compute model and default to total parameters when no active-parameter metadata is available.
GPU FP32 throughput: 104.9 TFLOPS
Adjust the efficiency multiplier to reflect framework and kernel optimisations.
Estimated FLOPs / sequence: 27.46 TFLOPs
Tokens per second: 2,247.86
Milliseconds per token: 0.44
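The simplified compute model reduces to achievable FLOP/s divided by FLOPs per token. A sketch under two stated assumptions: roughly 2 × parameters FLOPs per generated token (forward pass only), and an efficiency multiplier of 0.30, which happens to reproduce the figures above:

```python
def tokens_per_second(gpu_tflops: float, params: float, efficiency: float,
                      flops_per_token_factor: float = 2.0) -> float:
    """Simplified roofline: achievable FLOP/s divided by FLOPs per token.

    The efficiency multiplier folds framework and kernel overheads into
    a single knob, as the note above suggests adjusting.
    """
    flops_per_token = flops_per_token_factor * params  # ~2 * params per token
    return gpu_tflops * 1e12 * efficiency / flops_per_token

tps = tokens_per_second(104.9, 7e9, 0.30)
ms_per_token = 1000.0 / tps
print(f"{tps:,.2f} tok/s, {ms_per_token:.2f} ms/token")  # 2,247.86 tok/s, 0.44 ms/token
```

A real deployment is often memory-bandwidth bound during decoding rather than compute bound, which is another reason these outputs are labelled heuristic.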
Cloud cost projection
Cloud cost is exact arithmetic from hourly rate × runtime. It does not include hidden provider fees outside the selected rate.
Effective hourly rate: $1.01
Projected cost: $1.01
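Since rate and projected cost match above, a one-hour runtime is implied. The arithmetic is a single multiplication (a trivial sketch; the 1-hour figure is an assumption inferred from the matching values):

```python
def projected_cost(hourly_rate: float, runtime_hours: float) -> float:
    """Exact arithmetic: hourly rate times runtime, with no hidden provider fees."""
    return hourly_rate * runtime_hours

print(f"${projected_cost(1.01, 1.0):.2f}")  # $1.01 for a one-hour run
```

Longer batch jobs scale linearly, so a 10-hour run at the same effective rate projects to $10.10.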