LLM Resource Usage & Cost Calculator
Analyse memory requirements, throughput and cloud spend for open-source language models in seconds.
Curated presets
Start from modern open-weight models with known serving notes.
- Parameters
- 7.000 B (exact, sourced value)
- Hidden size
- 4096
- Layers
- 32
Quick estimator
Size weights and KV cache instantly from core workload inputs.
KV cache memory scales linearly with concurrent sequences and context length.
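The linear scaling described above can be sketched as follows. This is a minimal fp16 sketch, not the calculator's actual code; `kv_cache_bytes` is a hypothetical helper that assumes full multi-head attention (KV width equal to hidden size) and 2 bytes per element:

```python
def kv_cache_bytes(layers: int, hidden_size: int, context_len: int,
                   concurrent_seqs: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: keys and values stored for every layer, token, and sequence.

    Doubling the context length or the number of concurrent sequences
    doubles the result, which is the linear scaling noted above.
    """
    # 2 tensors (K and V) x layers x hidden width x tokens x sequences x bytes
    return 2 * layers * hidden_size * context_len * concurrent_seqs * bytes_per_elem

# Preset above: 32 layers, hidden size 4096, one user at 4096 tokens, fp16.
gib = kv_cache_bytes(32, 4096, 4096, 1) / 1024**3
print(f"{gib:.2f} GiB")  # 2.00 GiB for this configuration
```

Grouped-query attention (fewer KV heads than attention heads) shrinks this proportionally, which is why some checkpoints report much smaller caches at the same context length.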
Architecture assumptions
Using LLaMA-style scaling heuristics derived from parameter count. Enable manual mode to match a specific checkpoint.
- Hidden size
- 4096
- Layers
- 32
- Attention heads
- 32
- Feed-forward size
- 16384
Hidden size and depth follow public LLaMA scaling heuristics. Switch to manual mode to match custom architectures.
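One way such a heuristic can work is a nearest-neighbour lookup over published checkpoint shapes. This is a hypothetical sketch, not the calculator's actual code, and the table values are assumptions; note that real LLaMA checkpoints use narrower SwiGLU feed-forward layers than the classic 4× hidden-size ratio shown in the preset:

```python
# Hypothetical shape table: parameter count -> (hidden size, layers, attention heads)
KNOWN_SHAPES = {
    7e9:  (4096, 32, 32),
    13e9: (5120, 40, 40),
    70e9: (8192, 80, 64),
}

def guess_shape(params: float) -> dict:
    """Pick the closest known shape for a parameter count (a heuristic sketch)."""
    hidden, layers, heads = KNOWN_SHAPES[min(KNOWN_SHAPES, key=lambda p: abs(p - params))]
    # FFN width uses the classic 4x ratio, matching the 16384 preset above.
    return {"hidden": hidden, "layers": layers, "heads": heads, "ffn": 4 * hidden}

print(guess_shape(7e9))  # {'hidden': 4096, 'layers': 32, 'heads': 32, 'ffn': 16384}
```

Manual mode exists precisely because a lookup like this cannot capture every architecture; a custom checkpoint's real config file always wins.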
Memory & hardware
Assumes 1 concurrent user with a 4096-token context.
Weight memory and KV cache are exact arithmetic from your inputs. Activations, optimizer state, total VRAM, fit checks, and GPU recommendations are heuristic estimates because they depend on runtime behavior and overhead assumptions.
Fits on the selected GPU with 11.71 GB of headroom.
Closest matching GPUs
- NVIDIA GeForce RTX 3090: 3.71 GB spare
- NVIDIA GeForce RTX 4090: 3.71 GB spare
- NVIDIA GeForce RTX 5090: 11.71 GB spare
- NVIDIA A100 PCIe 40GB: 19.71 GB spare
- NVIDIA RTX 6000 Ada: 27.71 GB spare
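The fit check itself is simple subtraction of the estimated requirement from each card's VRAM. A sketch, where the GPU capacities are the cards' advertised sizes and the ~20.29 GB requirement is read off the headroom figures above (32 GB minus 11.71 GB of spare):

```python
# Advertised VRAM capacities in GB for the cards listed above.
GPU_VRAM_GB = {
    "RTX 3090": 24, "RTX 4090": 24, "RTX 5090": 32,
    "A100 PCIe 40GB": 40, "RTX 6000 Ada": 48,
}

def spare_vram(required_gb: float) -> dict:
    """Spare capacity per GPU after subtracting the estimated requirement."""
    return {name: round(vram - required_gb, 2) for name, vram in GPU_VRAM_GB.items()}

print(spare_vram(20.29))
# RTX 3090/4090: 3.71, RTX 5090: 11.71, A100 40GB: 19.71, RTX 6000 Ada: 27.71
```

Because the requirement includes heuristic activation and overhead terms, a card showing under a gigabyte of spare capacity should be treated as a borderline fit rather than a guarantee.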
Performance
All performance outputs below are heuristic estimates. They use a simplified compute model and default to total parameters when no active-parameter metadata is available.
GPU FP32 throughput: 104.9 TFLOPS
Adjust the efficiency multiplier to reflect framework and kernel optimisations.
Estimated FLOPs / sequence: 27.46 TFLOPs
Tokens per second: 2,247.86
Milliseconds per token: 0.44
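The simplified compute model reduces to achievable FLOP/s divided by FLOPs per token. A sketch under two stated assumptions: roughly 2 × parameters FLOPs per generated token (forward pass only), and an efficiency multiplier of 0.30, which happens to reproduce the figures above:

```python
def tokens_per_second(gpu_tflops: float, params: float, efficiency: float,
                      flops_per_token_factor: float = 2.0) -> float:
    """Simplified roofline: achievable FLOP/s divided by FLOPs per token.

    The efficiency multiplier folds framework and kernel overheads into
    a single knob, as the note above suggests adjusting.
    """
    flops_per_token = flops_per_token_factor * params  # ~2 * params per token
    return gpu_tflops * 1e12 * efficiency / flops_per_token

tps = tokens_per_second(104.9, 7e9, 0.30)
ms_per_token = 1000.0 / tps
print(f"{tps:,.2f} tok/s, {ms_per_token:.2f} ms/token")  # 2,247.86 tok/s, 0.44 ms/token
```

A real deployment is often memory-bandwidth bound during decoding rather than compute bound, which is another reason these outputs are labelled heuristic.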
Cloud cost projection
Cloud cost is exact arithmetic from hourly rate × runtime. It does not include hidden provider fees outside the selected rate.
Effective hourly rate: $1.01
Projected cost: $1.01
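Since rate and projected cost match above, a one-hour runtime is implied. The arithmetic is a single multiplication (a trivial sketch; the 1-hour figure is an assumption inferred from the matching values):

```python
def projected_cost(hourly_rate: float, runtime_hours: float) -> float:
    """Exact arithmetic: hourly rate times runtime, with no hidden provider fees."""
    return hourly_rate * runtime_hours

print(f"${projected_cost(1.01, 1.0):.2f}")  # $1.01 for a one-hour run
```

Longer batch jobs scale linearly, so a 10-hour run at the same effective rate projects to $10.10.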