Disclosure: We may earn affiliate commissions from retailer links, but paid placements do not control rankings. See our disclosure policy.
GPU Ranking for AI Workloads — May 2026
Full GPU tier ranking — May 2026
Tiers reflect AI workload fit, not gaming performance. A card can be excellent for gaming and mediocre for local AI if VRAM is the constraint.
| Tier | GPU | VRAM | Architecture | Approx. Price (May 2026) | AI Workload Fit | Key Limitation |
|---|---|---|---|---|---|---|
| Flagship | RTX 5090 | 32 GB GDDR7 | Blackwell (GB202) | $1,999+ MSRP; street higher | Best available for 70B+ quantized LLMs, Flux, SD 3.x, ComfyUI video. Meaningfully faster memory bandwidth than 4090. | Stock constrained; street prices well above MSRP. Power draw 575 W TDP — large PSU required. |
| Flagship | RTX 5080 | 16 GB GDDR7 | Blackwell (GB203) | $999+ MSRP; street higher | Strong for SDXL, ComfyUI, local LLMs up to 34B (quantized). GDDR7 bandwidth noticeably faster than 4080 in memory-bound tasks. | 16 GB is still 16 GB — no advantage over a 4090 for models that exceed 16 GB. Also stock-constrained. |
| High-end | RTX 4090 | 24 GB GDDR6X | Ada Lovelace (AD102) | $1,400–$1,700 (clearance improving) | Best value top-tier card while RTX 50 pricing is unsettled. Handles 70B q4 LLMs, SDXL high-res, ComfyUI workflows. 24 GB headroom remains decisive. | 450 W TDP; large cooler. Availability better than 5090 but clearance stock varies. |
| High-end | RTX 4080 Super | 16 GB GDDR6X | Ada Lovelace (AD103) | $800–$950 (clearance) | Best balance of price and real AI throughput in 2026. Handles 13B–34B models (quantized), SDXL, ComfyUI, standard Flux pipelines comfortably. | 16 GB ceiling — 70B models require heavy quantization with partial CPU offload. |
| High-end | RTX 4080 (non-Super) | 16 GB GDDR6X | Ada Lovelace (AD103) | $700–$850 (used/clearance) | Nearly identical to 4080 Super for AI workloads. Worth considering used if the price gap is significant. | Same 16 GB ceiling. Lower shader count than Super has minimal AI impact. |
| Mid-range | RTX 4070 Ti Super | 16 GB GDDR6X | Ada Lovelace (AD103) | $550–$700 | The 16 GB mid-range sweet spot. Handles 13B–34B quantized models, SDXL, ComfyUI. Prefer this over the base 4070 Ti if local AI is the primary use. | Pricing can drift toward 4080 territory — always check before buying. Not for 70B models. |
| Mid-range | RTX 4070 Super | 12 GB GDDR6X | Ada Lovelace (AD104) | $450–$550 | Solid for SD 1.5, SDXL (with VRAM management), 7B–13B LLMs. The 12 GB ceiling is real but manageable with current tooling. | 12 GB will feel tight with Flux and newer high-VRAM pipelines. Not recommended as a long-term LLM card. |
| Mid-range | RTX 4070 (base) | 12 GB GDDR6X | Ada Lovelace (AD104) | $400–$480 | Adequate for 7B LLMs and SD 1.5. Weaker value than 4070 Super at typical price differences. | Skip in favor of 4070 Super if the gap is under $70. Same 12 GB ceiling. |
| Value | RTX 4060 Ti 16 GB | 16 GB GDDR6 | Ada Lovelace (AD106) | $380–$440 | Surprising VRAM-per-dollar. 16 GB fits 13B quantized models and SDXL. Lower bandwidth limits throughput but fits more than any 12 GB card. | 128-bit bus width is a meaningful bandwidth bottleneck. Inference is noticeably slower than cards with wider buses. |
| Value | RTX 4060 Ti 8 GB | 8 GB GDDR6 | Ada Lovelace (AD106) | $300–$350 | SD 1.5 runs well. 7B models need heavy quantization (Q4 minimum). Fine for learning. | 8 GB is the hard floor — you will hit it quickly. Avoid this variant if SDXL or LLMs beyond 7B are planned. |
| Entry | RTX 4060 (base) | 8 GB GDDR6 | Ada Lovelace (AD107) | $260–$310 | Lowest entry point. SD 1.5 functional. 7B models Q4 possible but constrained. Good for learning and experimentation before a real upgrade. | 8 GB ceiling; narrower bus than Ti. Will feel limiting within 6–12 months for most users. |
| Alternative | AMD RX 7900 XTX | 24 GB GDDR6 | RDNA 3 (Navi 31) | $650–$800 | 24 GB at a lower price than RTX 4090. Legitimate for VRAM-heavy workloads with ROCm (Linux). Ollama, llama.cpp, and some SD pipelines work well. | ROCm ecosystem is maturing but narrower than CUDA. Windows support is inconsistent for AI workloads. Not recommended for CUDA-only workflows. |
| Alternative | AMD RX 7900 XT | 20 GB GDDR6 | RDNA 3 (Navi 31) | $500–$650 | 20 GB has no NVIDIA equivalent in this price range. Good for VRAM-hungry workflows on Linux/ROCm. Better VRAM-per-dollar than most NVIDIA options at this price. | Same ROCm/CUDA caveats as 7900 XTX. Check your framework's ROCm support before buying. |
Note on RTX 50-series (Blackwell): RTX 5070 Ti (16 GB) and RTX 5070 (12 GB) exist but have had erratic availability and pricing since launch. They are not listed above because their pricing and availability do not yet make them clear buys over the 40-series cards they would displace. Check current pricing — if the 5070 Ti is near 4080 Super pricing, it becomes competitive.
VRAM requirements by workload
These are practical loading requirements, not theoretical minima. Below these floors, models either fail to load or require CPU offloading that kills performance.
| Model / Workload | Min VRAM (FP16) | Min VRAM (Q4 / INT4) | Comfortable VRAM | Notes |
|---|---|---|---|---|
| 7B LLM (Llama 3, Mistral, Phi-3) | 14 GB | ~5–6 GB | 12 GB | Q4 on 8 GB is possible but output quality degrades. 12 GB gives comfortable Q8 operation. |
| 13B LLM (Llama 2 13B, Mistral 12B) | ~26 GB | ~9–10 GB | 16 GB | Q4 on 12 GB works but is near the limit. 16 GB gives Q8 headroom. |
| 34B LLM (CodeLlama 34B, Yi 34B) | ~68 GB | ~20–22 GB | 24 GB+ | Needs 24 GB minimum at Q4. 70B requires multi-GPU or heavy CPU offload. |
| 70B LLM (Llama 3 70B) | ~140 GB | ~40–45 GB | 48 GB+ (multi-GPU) | Q4 (~40 GB) exceeds even the RTX 5090's 32 GB — it needs a lower Q3-class quant or partial CPU offload. Most practical path is dual-GPU or CPU+GPU offload. |
| Stable Diffusion 1.5 (512×512) | 4 GB | — | 6–8 GB | Runs on almost any modern GPU. 8 GB recommended for batching and higher-res. |
| SDXL (1024×1024) | 8 GB | — | 12–16 GB | 8 GB works with attention slicing enabled (slower). 12 GB recommended for comfort. |
| Stable Diffusion 3.x / Flux.1 | ~24 GB | ~12 GB (FP8 / quantized transformer) | 16–24 GB | Flux.1 Dev at FP8 fits in roughly 12–16 GB; full FP16 needs 24 GB+. These models separate 12 GB cards from 16 GB+ cards clearly. |
| ComfyUI video workflows (SVD, Wan) | 16 GB | — | 24 GB+ | Video diffusion models are VRAM-hungry. 16 GB is the minimum; 24 GB is genuinely more usable. |
| LoRA training (7B, QLoRA) | — | 12–16 GB | 24 GB | QLoRA with gradient checkpointing fits 16 GB for 7B. 13B QLoRA typically needs 24 GB for comfortable batch sizes. |
| Full fine-tuning (7B) | ~80 GB | — | Multi-GPU or cloud | Not realistic on consumer hardware. QLoRA is the consumer-viable alternative. |
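The weight-only floors in the table follow from a simple calculation — parameter count times bits per weight — with runtime buffers and KV cache added on top. A back-of-the-envelope sketch in Python (the ~4.8 bits per weight for Q4_K_M-style quants is an approximation, and the extra 1–2 GB of runtime overhead is why the table's Q4 figures sit slightly above these numbers):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB; runtime buffers and KV cache come on top."""
    return params_billion * bits_per_weight / 8  # 1e9 params x (bits/8) bytes ~= GB

for size in (7, 13, 34, 70):
    print(
        f"{size:>3}B   FP16 ~{weight_gb(size, 16):5.1f} GB   "
        f"Q8 ~{weight_gb(size, 8):5.1f} GB   Q4_K_M ~{weight_gb(size, 4.8):5.1f} GB"
    )
# 7B:  14.0 / 7.0 / 4.2     13B: 26.0 / 13.0 / 7.8
# 34B: 68.0 / 34.0 / 20.4   70B: 140.0 / 70.0 / 42.0
```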
Workload performance: Stable Diffusion, local LLMs, fine-tuning
Stable Diffusion / image generation
| GPU | VRAM | SD 1.5 (512px) it/s | SDXL (1024px) it/s | Flux.1 Dev (1024px) | ComfyUI batch |
|---|---|---|---|---|---|
| RTX 5090 | 32 GB | ~55–70 it/s | ~22–28 it/s | Full FP16, fast | Large batches |
| RTX 5080 | 16 GB | ~45–58 it/s | ~18–24 it/s | FP8/FP16, good | Medium-large batches |
| RTX 4090 | 24 GB | ~42–55 it/s | ~16–22 it/s | Full FP16 | Large batches |
| RTX 4080 Super | 16 GB | ~30–42 it/s | ~12–17 it/s | FP8, good | Medium batches |
| RTX 4070 Ti Super | 16 GB | ~26–36 it/s | ~10–14 it/s | FP8, adequate | Medium batches |
| RTX 4070 Super | 12 GB | ~22–30 it/s | ~8–12 it/s | FP8 w/ quantization | Small batches |
| RTX 4060 Ti 16 GB | 16 GB | ~16–22 it/s | ~6–9 it/s | FP8, slow | Small batches |
| RTX 4060 (8 GB) | 8 GB | ~12–16 it/s | ~4–6 it/s (attn slicing) | Not practical | Very limited |
it/s = iterations per second using DDIM sampler, 20 steps, no LoRA. Real-world numbers vary with scheduler, LoRA count, and VAE. These are representative ranges, not guaranteed benchmarks.
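Several of the 8–12 GB rows above assume diffusers-style memory-saving features are switched on. A minimal sketch of the two most common ones, attention slicing and model CPU offload — this assumes the stabilityai/stable-diffusion-xl-base-1.0 checkpoint and a recent diffusers install, and is illustrative rather than a tuned pipeline:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

# Memory-saving switches for 8-12 GB cards (both cost speed):
pipe.enable_attention_slicing()      # compute attention in slices instead of all at once
# pipe.enable_model_cpu_offload()    # heavier option: call this *instead of* .to("cuda")

image = pipe(
    "a watercolor fox in a pine forest",
    num_inference_steps=20,
    height=1024,
    width=1024,
).images[0]
image.save("sdxl_test.png")
```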
Local LLM inference — tokens per second
Numbers are approximate, measured with llama.cpp or Ollama using quantized models (Q4_K_M unless noted). First-token latency is not included — it varies with prompt length.
| GPU | VRAM | 7B (Q4_K_M) tok/s | 13B (Q4_K_M) tok/s | 34B (Q4_K_M) tok/s | 70B (Q4_K_M) tok/s | Context (128K viable?) |
|---|---|---|---|---|---|---|
| RTX 5090 | 32 GB | ~110–140 | ~65–85 | ~28–38 | ~12–18 (tight) | Yes, 32 GB handles large contexts |
| RTX 5080 | 16 GB | ~90–115 | ~52–68 | ~18–26 (quantized) | No — offload needed | Moderate (16 GB limits context cache) |
| RTX 4090 | 24 GB | ~80–105 | ~48–62 | ~22–32 | ~10–15 (tight) | Yes, 24 GB handles long contexts well |
| RTX 4080 Super | 16 GB | ~62–82 | ~36–50 | ~14–20 (quantized) | No | Moderate (KV cache limits at 16 GB) |
| RTX 4070 Ti Super | 16 GB | ~55–72 | ~32–45 | ~12–18 (quantized) | No | Moderate |
| RTX 4070 Super | 12 GB | ~45–60 | ~20–30 (near limit) | No — exceeds VRAM | No | Short contexts only (≤16K) |
| RTX 4060 Ti 16 GB | 16 GB | ~32–44 | ~18–26 | ~8–12 (slow, quantized) | No | Fits, but bandwidth-limited |
| RTX 4060 (8 GB) | 8 GB | ~22–32 (Q4 only) | No — exceeds VRAM | No | No | Very short only |
KV cache (context window) is a second VRAM consumer alongside model weights. A 7B model at Q4 uses ~5 GB, leaving 7 GB for KV cache on a 12 GB card — which limits effective context to roughly 32K–64K tokens depending on configuration. Longer context requires more VRAM, not just a faster GPU.
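The KV cache grows linearly with context length: roughly 2 (K and V) × layers × KV heads × head dimension × bytes per element, per token. A quick estimator, assuming a Llama 3 / Mistral-style 7–8B model with grouped-query attention (32 layers, 8 KV heads, head dim 128, FP16 cache — read the real values from your model's config):

```python
def kv_cache_gib(tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV-cache size for a grouped-query-attention transformer (FP16 cache).
    Defaults are assumptions for a Llama 3 / Mistral-style 7-8B model."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return tokens * per_token / 1024**3

# ~5 GB of Q4 weights on a 12 GB card leaves roughly 7 GB for the cache:
for ctx in (8_192, 32_768, 65_536, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gib(ctx):.1f} GiB of KV cache")
# 8K ~1 GiB, 32K ~4 GiB, 64K ~8 GiB, 128K ~16 GiB -- which is why a 12 GB card
# tops out around 32K-64K of context and 128K realistically wants 24 GB+.
```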
Fine-tuning: what's realistic on consumer hardware
Full fine-tuning of even a 7B model in FP16 requires ~80 GB VRAM. That is not happening on consumer hardware. The practical path is QLoRA (Quantized LoRA) using libraries like Unsloth, PEFT, or Axolotl.
| GPU | VRAM | 7B QLoRA | 13B QLoRA | 34B QLoRA | Practical batch size |
|---|---|---|---|---|---|
| RTX 5090 | 32 GB | Yes — comfortable | Yes — comfortable | Yes — tight | 4–8 (7B); 2–4 (13B) |
| RTX 4090 | 24 GB | Yes — comfortable | Yes — comfortable | Possible, small batch | 4–8 (7B); 2–4 (13B) |
| RTX 4080 Super / 4080 | 16 GB | Yes — comfortable | Yes — tight, small batch | Not practical | 2–4 (7B); 1 (13B) |
| RTX 4070 Ti Super | 16 GB | Yes | Yes — very tight | No | 2–4 (7B) |
| RTX 4060 Ti 16 GB | 16 GB | Yes — slow | Barely (batch size 1) | No | 1–2 (7B) |
| RTX 4070 Super / 4070 | 12 GB | Yes — small batch | No — OOM risk | No | 1–2 (7B) |
| RTX 4060 8 GB | 8 GB | Very marginal — OOM likely without significant workarounds | No | No | Not recommended |
Gradient checkpointing reduces VRAM use at a training speed penalty. Flash Attention 2 helps. Even with these, 8 GB is not a fine-tuning card in any practical sense.
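For reference, this is roughly what the QLoRA setup looks like with Hugging Face transformers + PEFT + bitsandbytes — a minimal sketch rather than a tuned recipe; the model ID, LoRA rank, and target modules are placeholder choices:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder; any 7-8B causal LM works

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit NF4 base weights -- the "Q" in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # readies the quantized base for training
model.gradient_checkpointing_enable()           # trades recompute time for VRAM

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```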
Tier-by-tier notes and real tradeoffs
RTX 5090 / 5080 — Blackwell
The RTX 5090 is the first consumer GPU to give real headroom for 70B quantized models, ComfyUI video diffusion, and Flux at full FP16. The GDDR7 bandwidth improvement is meaningful for inference — not marginal. The 5080 with 16 GB is faster than the 4080 Super at equivalent VRAM but does not give you more of it. The catch: as of May 2026, both cards are selling above MSRP due to constrained supply. If you can find them at or near MSRP, they represent good long-term buys. At 30–40% above MSRP, the RTX 4090 remains better value.
- Power draw: RTX 5090 is 575 W TDP. You need a 1000 W+ PSU and good case airflow. RTX 5080 is more reasonable at ~360 W.
- Laptop variants: RTX 5090 and 5080 laptop cards exist but run at substantially lower power limits. Do not assume desktop-equivalent performance from laptop branding.
RTX 4090 — still the top-value pick
24 GB GDDR6X remains decisive for long-context LLMs, 34B–70B models, and any workflow where 16 GB regularly runs out. As RTX 40-series stock clears, pricing has improved. The 4090 is not efficient — 450 W is a lot — but for buyers who want maximum capability without paying RTX 5090 premiums, it remains the card to buy.
RTX 4080 Super — the serious buyer's balance point
16 GB is the minimum VRAM floor worth targeting in 2026 for Stable Diffusion 3.x and 13B+ LLMs. The 4080 Super delivers strong throughput at a significantly lower price than a 4090 or 5090. Clearance pricing is favorable. This is the card most serious buyers should be considering in Q2 2026 unless 70B models or very large context windows are required.
RTX 4070 Ti Super — the 16 GB value card
Meaningfully better for AI than the base 4070 Ti (12 GB) due to the VRAM jump. Handles 13B models, SDXL, and standard ComfyUI workflows. Recommended over the base 4070 Ti whenever the price delta is under $100.
RTX 4060 Ti 16 GB — the bandwidth tradeoff
This card frequently confuses people because 16 GB sounds impressive for the price. It genuinely is useful for VRAM-dependent workloads — you can load 13B models where a 12 GB card cannot. But the 128-bit memory bus is a real bottleneck: inference throughput is noticeably slower than cards with a 192-bit or 256-bit bus. Think of it as a VRAM-first pick when budget is the constraint, not as a performance pick.
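The intuition: token generation is memory-bound, so the ceiling on decode speed is roughly memory bandwidth divided by the bytes read per token (about the size of the active weights). A rough sketch using approximate spec-sheet bandwidth figures (assumed values — verify per card); real throughput lands well below these ceilings, but the relative gap is what matters:

```python
# Approximate spec-sheet memory bandwidth in GB/s (assumed values -- verify per card)
BANDWIDTH_GBPS = {
    "RTX 4060 Ti (128-bit)":        288,
    "RTX 4070 Ti Super (256-bit)":  672,
    "RTX 4080 Super (256-bit)":     736,
    "RTX 4090 (384-bit)":          1008,
}

model_gb = 7.8  # ~13B model at Q4_K_M, weights only

for gpu, bw in BANDWIDTH_GBPS.items():
    # Each generated token reads roughly the full set of weights once,
    # so bandwidth / model size is an upper bound on decode speed.
    print(f"{gpu:<30} ceiling ~{bw / model_gb:5.0f} tok/s")
# The 128-bit card's ceiling is ~2.3-3.5x lower than the wider-bus cards',
# which matches the relative gap in the measured tokens-per-second table above.
```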
12 GB cards (RTX 4070 / 4070 Super)
Capable for SD 1.5, SDXL with management, and 7B–13B LLMs. The 12 GB ceiling becomes a real issue with Flux.1, Stable Diffusion 3, video diffusion, and long-context LLMs. If you're planning a 2–3 year horizon, budget for 16 GB minimum. These cards are best for users focused primarily on SD 1.5/SDXL and 7B models where they will likely spend most of their time.
8 GB cards (RTX 4060, 4060 Ti 8 GB)
Fine for getting started with SD 1.5 and 7B LLMs at Q4 quantization. Will feel restrictive for anything else. Do not buy an 8 GB card if 16 GB is within reach of your budget — the upgrade cost later will exceed the savings now.
Laptop vs desktop GPU variants
| GPU Name | Desktop TDP | Laptop TDP range | VRAM (same?) | AI performance gap | Notes |
|---|---|---|---|---|---|
| RTX 5090 / 5080 | 575 W / 360 W | 150–175 W | Mostly — laptop 5080 keeps 16 GB; laptop 5090 has 24 GB, not the desktop's 32 GB | Large — laptop is 40–55% slower | Laptop 5090 is closer to a desktop 5080 in sustained workloads. Chassis cooling is the key variable. |
| RTX 4090 | 450 W | 80–150 W | 16 GB (laptop version is 16 GB, not 24 GB) | Very large — VRAM AND performance both differ | Laptop RTX 4090 has 16 GB, not 24 GB. This is a significant distinction for model fit. |
| RTX 4080 | 320 W | 80–150 W | 12 GB (laptop) vs 16 GB (desktop) | Large — both VRAM and throughput | Laptop 4080 has 12 GB vs desktop 16 GB. Check the spec sheet carefully. |
| RTX 4070 / 4070 Ti Super | 200–285 W | 80–125 W | 8 GB (laptop 4070) vs 12–16 GB (desktop) | Moderate — laptop is 25–40% slower | Laptop 4070 has 8 GB vs 12 GB on the desktop card, and there is no laptop 4070 Ti Super at all. Another spec to confirm before buying. |
| RTX 4060 | 115 W | 70–115 W | 8 GB (same) | Small — laptop and desktop are closest in this tier | This is one of the least penalized tiers in laptop form. |
Practical guidance: if you want a laptop for serious local LLM work, the laptop RTX 5090's 24 GB is the honest ceiling, with the laptop RTX 4090's 16 GB the more affordable step below it. For sustained AI workloads, chassis cooling and power delivery matter more than the GPU nameplate. Check the laptop's maximum GPU TDP in reviews, not just the GPU model. See our best AI laptops guide for specific system recommendations.
Recommendations by budget and use case
Under $350 — Learning / experimentation
RTX 4060 (8 GB)
Run SD 1.5, experiment with 7B models at Q4. Good starting point. You will outgrow it, but it is a real entry into local AI at a low price.
Under $450 — Budget, prioritize VRAM
RTX 4060 Ti 16 GB
The bandwidth tradeoff is real, but 16 GB fits more than any 12 GB card. Recommended over the 8 GB 4060 Ti if you plan to run 13B models or SDXL regularly.
$450–$600 — Mid-range, image gen focus
RTX 4070 Super (12 GB)
Strong throughput for SD 1.5 and SDXL. Good 7B LLM performance. Ceiling will be felt with Flux and newer pipelines. Right choice if SD is the primary use.
$550–$700 — Mid-range, LLM focus
RTX 4070 Ti Super (16 GB)
16 GB matters for 13B models, long context, and Flux. Prefer this over the 4070 Super if local LLMs are a primary use case, not just image generation.
$800–$950 — Serious, best balance
RTX 4080 Super
The clearest buy in Q2 2026 for buyers who want real AI capability without flagship pricing. Handles 13B–34B models, SDXL, Flux, ComfyUI, and LoRA training. Strong clearance pricing.
$1,400–$1,700 — Top-end, best value flagship
RTX 4090 (24 GB)
Buy when 70B models, very long context (128K+), or ComfyUI video are planned. Best top-tier value while RTX 5090 supply and pricing remain unstable. Clearance pricing makes this increasingly attractive.
$1,999+ — If you need the absolute ceiling
RTX 5090 (32 GB)
Only when 70B models at useful speed, ComfyUI video at scale, or Flux at full FP16 quality are requirements you will actually use. At MSRP it makes sense. At 30%+ above MSRP, wait or choose the 4090.
$650–$800 — Linux / ROCm alternative
AMD RX 7900 XTX (24 GB)
24 GB at a lower price than RTX 4090. Legitimate on Linux with ROCm for llama.cpp, Ollama, and some SD pipelines. Avoid if you need CUDA-specific tools or are on Windows.
Frequently asked questions
Is VRAM really more important than GPU tier for AI?
For most local AI workloads: yes. A model that doesn't fit in VRAM either fails to load or requires CPU offloading that reduces throughput by 5–20x. A slower GPU that fits the model comfortably in VRAM will almost always outperform a faster GPU that can't. That said, once VRAM is sufficient, GPU tier determines how fast inference runs — so both matter, with VRAM being the first gate to clear.
Should I buy RTX 40-series or wait for RTX 50-series?
If the RTX 50-series card you want is available at or near MSRP, it's a reasonable buy — particularly the 5090 for its 32 GB or the 5080 for GDDR7 bandwidth. If it's 25%+ above MSRP (common as of May 2026), the RTX 4090 and 4080 Super offer better value right now. RTX 40-series clearance pricing is improving. Don't pay a significant premium for availability.
Can I run a 70B model on a consumer GPU?
Barely, with caveats. The RTX 5090 (32 GB) is the first consumer card where 70B models are genuinely usable, but a Q4_K_M quant (~40 GB) still exceeds its VRAM — you either drop to a Q3-class quant or offload some layers to the CPU. The RTX 4090 (24 GB) needs heavier quantization or more offloading, plus careful configuration for larger context windows. Below 24 GB, 70B models are not practical for real use — expect very slow speeds with heavy CPU involvement.
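In practice the offload knob is the number of transformer layers kept in VRAM. A minimal sketch with the llama-cpp-python bindings — the file name is a placeholder and the layer count is a starting point to tune down until the model loads without an out-of-memory error:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with GPU support)

llm = Llama(
    model_path="llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder; the file is ~40 GB
    n_gpu_layers=60,   # layers kept in VRAM (a 70B model has 80); lower until it loads
    n_ctx=8192,        # context window -- a larger value grows the KV cache
)

out = llm("Explain the KV cache in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```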
What's the minimum GPU for Stable Diffusion 3 / Flux.1?
Flux.1 Dev at FP8 needs approximately 12–14 GB of VRAM. At FP16 it needs 24 GB+. In practice, 16 GB is the comfortable minimum for Flux.1 pipelines — cards like the RTX 4080 Super or 4070 Ti Super. At 12 GB, you can run quantized versions with reduced quality.
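On a 16 GB card the usual way to fit Flux.1 Dev is bfloat16 weights plus model CPU offload. A minimal diffusers sketch — it assumes a recent diffusers release with FluxPipeline and access to the gated black-forest-labs/FLUX.1-dev checkpoint, and exact VRAM use depends on the offload strategy:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # gated repo -- requires a Hugging Face token
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()      # keeps only the active module in VRAM

image = pipe(
    "a macro photo of a circuit board, shallow depth of field",
    num_inference_steps=28,
    guidance_scale=3.5,
    height=1024,
    width=1024,
).images[0]
image.save("flux_test.png")
```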
Can I fine-tune models on a consumer GPU?
Yes, using QLoRA. With libraries like Unsloth or Axolotl, QLoRA fine-tuning of 7B models is practical on 16 GB cards (RTX 4080 Super, 4070 Ti Super) and comfortable on 24 GB (RTX 4090). Full fine-tuning requires enterprise hardware. For 13B models via QLoRA, 24 GB is recommended — 16 GB works but with batch size 1.
Are AMD GPUs good for local AI?
On Linux with ROCm, yes — the RX 7900 XTX (24 GB) and RX 7900 XT (20 GB) are legitimate alternatives with strong VRAM per dollar. Tools like llama.cpp, Ollama, and some Stable Diffusion pipelines work well. On Windows, ROCm support is uneven and many CUDA-only tools won't work. If you're on Linux and familiar with the ecosystem, AMD is a real option. If you depend on CUDA-specific tools or are on Windows, stick with NVIDIA.
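A quick way to confirm your stack will actually use the AMD card is to check whether your PyTorch install is a ROCm (HIP) build and whether it sees the device — ROCm builds reuse the CUDA API surface, so torch.cuda.is_available() is still the call to make. A minimal sketch, assuming a ROCm build of PyTorch:

```python
import torch

print("PyTorch:", torch.__version__)
print("HIP/ROCm build:", torch.version.hip)        # None on CUDA-only builds
print("GPU visible:", torch.cuda.is_available())   # True once ROCm sees the card
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.0f} GiB VRAM")
```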
Does a laptop RTX GPU perform the same as a desktop RTX GPU?
No. Laptop GPUs run at lower TDP (typically 80–150 W vs 300–575 W for desktops), which significantly reduces sustained inference throughput. Additionally, laptop RTX 4090 has 16 GB VRAM, not 24 GB — a critical difference for model fit. Laptop RTX 4080 has 12 GB, not 16 GB. Always check the VRAM spec of the specific laptop, not just the GPU model name.
About these rankings: GPU pricing and availability shift regularly. Tier positions and price ranges reflect conditions in May 2026. Performance figures are representative ranges from community benchmarks (llama.cpp, Ollama, A1111, ComfyUI) across multiple configurations — they are not single-run lab results. Always verify current pricing before purchasing. See our methodology page for more detail.