AMD GPUs for AI Inference in 2026: What Works, What's Broken, and What to Actually Buy

Eric Donnell & Luna, IDFS AI 2026-04-14

TL;DR: AMD's ROCm is finally usable for consumer GPUs in 2026. The RX 7900 XTX runs PyTorch, Ollama, ComfyUI, llama.cpp, and vLLM officially. But there's still a 30–40% performance gap vs. the RTX 4090, critical tools like HuggingFace TEI have no AMD support at all, and the software ecosystem assumes NVIDIA at every turn. Here's the honest field report from evaluating AMD as a replacement for our production inference stack.

Why We Were Looking

A few weeks ago we published a deep-dive on HuggingFace TEI vs PyTorch embedding divergence — a discovery that cost us real migration pain when we moved embedding infrastructure across hardware. That post was the first half of the story. This is the second half.

The question we were trying to answer: can AMD replace NVIDIA for production AI inference in 2026?

Our production environment runs an RTX 4090 (24GB) and an RTX 3060 (12GB). The 4090 handles TEI with e5-mistral-7b-instruct for embeddings, ComfyUI with a custom LoRA, Chatterbox TTS, and a handful of other services. The 3060 runs VibeVoice-ASR (4-bit quantized). We've been happy with it, but NVIDIA street prices for anything new are brutal, and AMD has been loudly claiming ROCm has caught up. So we put it to the test.

The short answer: AMD is viable as a supplemental node. It is not a drop-in RTX 4090 replacement. Here's the long answer.

ROCm 7.2: Where AMD Actually Stands Today

AMD's ROCm stack (their CUDA equivalent) is now at version 7.2.0, released January 21, 2026, with 7.11 in preview. PyTorch 2.10 has first-class support for ROCm 7.1 in its stable builds and for ROCm 7.2 in its nightly builds.

The consumer GPUs officially supported in ROCm 7.2:

RDNA 4 (gfx12xx): RX 9070 XT, RX 9070 GRE, RX 9070, RX 9060 XT LP, RX 9060 XT, RX 9060
RDNA 3 (gfx11xx): RX 7900 XTX (gfx1100), RX 7900 XT, RX 7900 GRE, RX 7800 XT, RX 7700 XT, RX 7700
Professional (RDNA 4 and RDNA 3): Radeon AI PRO R9700, R9600D, Radeon PRO W7900, W7800, W7700
Data center CDNA: MI355X, MI350X, MI325X, MI300X, MI300A, MI250X, MI250, MI210, MI100

Operating system support: Linux only for serious ML work. Ubuntu 24.04, 22.04, RHEL 8/9/10, SLES 15 SP7, Debian 12/13. Windows has a preview Ollama build but ROCm itself is not supported on Windows for ML workloads. macOS: no.

Critical nuance: "officially supported" on the consumer side is not the same as "tested, optimized, and production-ready." Most framework authors test on MI-series (data center) cards. Consumer RDNA support is real but secondary. You will feel this.

Hardware: The RX 7900 XTX vs RTX 4090 Showdown

Let's start with the card worth buying, if you're buying AMD: the RX 7900 XTX.

Specification	RX 7900 XTX	RTX 4090
Architecture	RDNA 3 (gfx1100)	Ada Lovelace
VRAM	24 GB GDDR6	24 GB GDDR6X
Memory Bandwidth	960 GB/s	1,008 GB/s
FP32 TFLOPS	61.42	82.58
FP16 TFLOPS	~122 (no tensor)	165 (tensor: 330)
Tensor Cores	None	512 (4th gen)
TDP	355W	450W
Street Price (2026)	$500–$750	End-of-life; successor is RTX 5090

The critical line in that table: the RX 7900 XTX has zero dedicated tensor cores. The RTX 4090 has 512 of them, providing 2x FP16 throughput with sparsity. This is the architectural reason for the performance gap in ML workloads — not software, not drivers, not ROCm. The silicon doesn't have the accelerators.

What about RDNA 4?

If you're thinking "wait, isn't the RX 9070 XT the new card?" — yes, and it has a problem: 16 GB VRAM on a 256-bit bus. For LLM inference, which is memory-bandwidth bound, the older RX 7900 XTX is actually the better AMD choice. No current RDNA 4 card has 24GB+. The RX 9070 XT improves ray tracing and has better AI accelerators than RDNA 3, but if VRAM is your limiter (and for LLMs it usually is), the 7900 XTX wins.

Price-per-GB Reality Check

This is where AMD is legitimately compelling:

Card	VRAM	Price	$/GB
RX 7900 XTX	24 GB	~$600 (used)	$25/GB
RX 9070 XT	16 GB	~$850	$53/GB
RTX 4090	24 GB	~$1,600	$67/GB
RTX 5090	32 GB	~$2,000	$63/GB
Radeon PRO W7900 DS	48 GB	$4,399	$92/GB

At $25/GB of VRAM, the RX 7900 XTX is roughly 2.7x cheaper per gigabyte than the RTX 4090 ($67/GB). If your workload is memory-bound but compute-tolerant, that's a serious argument.
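The $/GB column is simple division, and it is worth redoing with the prices you can actually find. A minimal sketch using the approximate figures from the table above (the card labels and the helper name are ours, not from any tool):

```python
# VRAM (GB) and approximate price (USD), copied from the table above.
cards = {
    "RX 7900 XTX (used)": (24, 600),
    "RX 9070 XT": (16, 850),
    "RTX 4090": (24, 1600),
    "RTX 5090": (32, 2000),
    "Radeon PRO W7900 DS": (48, 4399),
}


def dollars_per_gb(vram_gb: int, price_usd: float) -> float:
    """Price divided by VRAM capacity."""
    return price_usd / vram_gb


per_gb = {name: dollars_per_gb(vram, price) for name, (vram, price) in cards.items()}

# Cheapest VRAM first.
for name, value in sorted(per_gb.items(), key=lambda item: item[1]):
    print(f"{name:<22} ${value:6.2f}/GB")

ratio = per_gb["RTX 4090"] / per_gb["RX 7900 XTX (used)"]
print(f"RTX 4090 costs {ratio:.2f}x as much per GB as the RX 7900 XTX")
```

Swap in current local prices before drawing conclusions; used-market prices move quickly.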

Per-Service Reality Check: What Actually Runs

This is where the rubber meets the road. Here's every service we run on our NVIDIA stack, and whether it works on AMD.

HuggingFace TEI (Text Embeddings Inference)

Status: BROKEN. DO NOT EXPECT THIS TO WORK.

This one hurts. TEI officially supports only NVIDIA GPUs (Turing through Blackwell). PR #295 would add ROCm support, but it has been stalled for more than 10 months awaiting maintainer review. Even if you build from the unmerged PR branch, it targets only MI-series data center cards. Consumer RDNA is not in the PR, and our e5-mistral-7b-instruct model is untested on that branch.

Workaround: run the embedding model directly through sentence-transformers on PyTorch+ROCm. It works, but you lose TEI's optimized Rust backend. Performance will be noticeably lower.
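A rough sketch of that workaround, not a tested drop-in for TEI. Assumptions on our part: the Hugging Face model id `intfloat/e5-mistral-7b-instruct`, a recent sentence-transformers release that accepts `model_kwargs`, and fp16 weights so a 7B model fits in 24 GB. It also skips the instruction prefix that e5-style models expect on queries.

```python
def embed(texts, model_name="intfloat/e5-mistral-7b-instruct"):
    """Encode texts with sentence-transformers on whatever GPU PyTorch sees."""
    # Imported lazily so this module loads on machines without the GPU stack.
    import torch
    from sentence_transformers import SentenceTransformer

    # ROCm builds of PyTorch expose AMD GPUs under the "cuda" device name.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Assumption: half precision on GPU to fit a 7B model; full precision on CPU.
    dtype = torch.float16 if device == "cuda" else torch.float32
    model = SentenceTransformer(
        model_name,
        device=device,
        model_kwargs={"torch_dtype": dtype},
    )

    # Unit-length vectors make cosine similarity a plain dot product.
    return model.encode(texts, batch_size=8, normalize_embeddings=True)
```

Calling `embed(["first passage", "second passage"])` returns one vector per input. Compare a sample of these vectors against your existing TEI output before switching, since the two backends are not guaranteed to produce identical embeddings.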

If your embeddings pipeline depends on TEI, this is a hard blocker for moving to AMD.

ComfyUI / Stable Diffusion

Status: Officially supported. Works well.

ComfyUI's README has explicit ROCm installation commands. ROCm 7.1 and 7.2 both work. RDNA 3 is listed as supported. SDXL and Flux models confirmed working. Custom LoRA loading works fine — we tested this with our Luna LoRA.

Expect 25–40% slower generation times vs. the RTX 4090, owing to the tensor core gap.

Caveat: xformers has only experimental ROCm support. Some ComfyUI nodes that depend on xformers may have stability issues.

Ollama

Status: Officially supported. Just works.

RX 7900 XTX is explicitly listed as supported. Ollama uses the llama.cpp backend, which has mature HIP support. Community reports confirm good performance. If all you want is local LLM inference with a pleasant UX, Ollama on AMD is genuinely fine.
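To measure generation speed on your own card rather than rely on community numbers, Ollama's REST API reports token counts and timings with each non-streaming response. A minimal sketch, assuming a local server on the default port and a model tag of `llama3.1:8b` (both are assumptions; use whatever you have pulled):

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(prompt: str, model: str = "llama3.1:8b",
             host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.load(resp)


# Offline check with a made-up response: 238 tokens generated in 2 seconds.
sample = {"eval_count": 238, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

With a server running, `tokens_per_second(generate("Explain ROCm in one paragraph."))` gives a figure for your hardware. Run it a few times and discard the first call, which includes model load time.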

llama.cpp

Status: Officially supported via HIP backend.

Build with cmake -B build -DGGML_HIP=ON. All GGUF quantization formats work (Q4_K_M, Q5_K_M, Q6_K, Q8_0).

Real benchmarks on Llama 3.1 8B Q4_K_M:

Backend	Output tokens/sec	Total tokens/sec
llama.cpp (AMD)	68.55	97.60
MLC-LLM (AMD)	92.80	137.19
ExLlamaV2 (AMD)	3,900+ prompt proc	—

Compared to RTX 4090 on the same model:

Metric	AMD (7900 XTX)	NVIDIA (RTX 4090)	Gap
Prompt throughput	2,576 tok/s	5,415 tok/s	RTX 4090 is 2.1x faster
Inference throughput	119.1 tok/s	158.4 tok/s	RTX 4090 is 33% faster

The 30–40% gap in generation throughput is consistent across benchmarks, and prompt processing is worse, at roughly 2x. It is real. It is the cost of not having tensor cores.
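One thing to watch when comparing numbers like these: "X is 33% faster" and "Y is 33% slower" are not the same statement. A short check using the figures from the table above (the helper names are ours):

```python
def speedup(fast: float, slow: float) -> float:
    """How many times faster the fast card is."""
    return fast / slow


def percent_slower(fast: float, slow: float) -> float:
    """How much slower the slow card is, as a percentage of the fast card."""
    return (1 - slow / fast) * 100


# Tokens per second from the comparison table above.
prompt_4090, prompt_xtx = 5415, 2576
gen_4090, gen_xtx = 158.4, 119.1

print(f"prompt:     RTX 4090 is {speedup(prompt_4090, prompt_xtx):.2f}x faster")
print(f"generation: RTX 4090 is {(speedup(gen_4090, gen_xtx) - 1) * 100:.0f}% faster")
print(f"generation: RX 7900 XTX is {percent_slower(gen_4090, gen_xtx):.0f}% slower")
```

The same generation gap reads as about 33% when stated from the RTX 4090's side and about 25% when stated from the RX 7900 XTX's side, so check which direction a benchmark is quoting before comparing it with another.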

vLLM

Status: Officially supported with caveats for consumer GPUs.

Works on RDNA 3. AMD provides prebuilt Docker images, but they are optimized for MI300X, and consumer RDNA is secondary. There is no FP8 quantization support on consumer cards (MI-series only). AWQ and GPTQ quantization work.

Expect serving optimizations like PagedAttention and continuous batching to lag NVIDIA significantly.
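If you script vLLM launches across mixed hardware, it can help to encode the quantization caveat above so a bad config fails early. A small sketch: `vllm_kwargs` and the `consumer_rdna` flag are hypothetical helpers of ours, while `model` and `quantization` are real keyword arguments of vLLM's `LLM` class.

```python
from typing import Optional


def vllm_kwargs(model: str, quantization: Optional[str] = None,
                consumer_rdna: bool = True) -> dict:
    """Build keyword arguments for vllm.LLM, rejecting FP8 on consumer RDNA."""
    if consumer_rdna and quantization == "fp8":
        # Per the notes above, FP8 is MI-series only.
        raise ValueError("FP8 is not available on consumer RDNA; use AWQ or GPTQ")
    kwargs = {"model": model}
    if quantization is not None:
        kwargs["quantization"] = quantization
    return kwargs


# Placeholder checkpoint name, for illustration only.
print(vllm_kwargs("your-org/your-model-awq", quantization="awq"))
```

On a machine with vLLM installed, the result can be passed straight through: `from vllm import LLM` followed by `LLM(**vllm_kwargs("your-org/your-model-awq", quantization="awq"))`.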

Flash Attention

Status: Supported via AMD's fork, not mainline.

This is the one that causes the most confusion. AMD maintains a fork of Flash Attention with two backends:

Composable Kernel (CK) backend (default): supports MI200/MI250/MI300/MI355 only. Does not support RDNA consumer GPUs.

Triton backend: supports both CDNA (MI200, MI300) and RDNA consumer GPUs. Covers fp16, bf16, fp32, causal masking, variable sequence lengths, MQA/GQA, rotary embeddings, paged attention.

Install it with:

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" pip install --no-build-isolation .

Important gotcha: the mainline Dao-AILab/flash-attention repo does NOT support AMD. Frameworks that expect the Dao-AILab version — and most do — may need patching. If a third-party library imports flash-attention and your rig crashes, this is probably why.
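One defensive pattern is to fall back to PyTorch's built-in scaled-dot-product attention when no `flash_attn` package is importable, instead of letting the import fail deep inside a third-party library. A sketch under two assumptions: the model is loaded through Hugging Face transformers (whose `attn_implementation` argument accepts `"flash_attention_2"` and `"sdpa"`), and the Triton backend needs the same environment variable at runtime as at install time.

```python
import importlib.util
import os


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if a flash_attn package is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


# Assumption: the Triton backend reads this at runtime as well as at install time.
os.environ.setdefault("FLASH_ATTENTION_TRITON_AMD_ENABLE", "TRUE")

print(pick_attn_implementation())
```

The result would then be passed as `attn_implementation=pick_attn_implementation()` to `from_pretrained`. This only helps for libraries that let you choose the attention backend; code that hard-imports `flash_attn` still needs patching.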

Chatterbox TTS

Status: Likely works (pure PyTorch, no custom CUDA kernels).

Pure PyTorch models with device="cuda" calls map transparently to HIP on ROCm. No Flash Attention requirement. No custom .cu files. Chatterbox should be a 1–2 hour port at most.
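Before starting a port like this, it is worth confirming which backend your PyTorch build actually targets. A minimal check: `torch.version.hip` is set on ROCm builds and `None` on CUDA builds, and the `torch.cuda` namespace is used for both. The function name and the dictionary keys are ours.

```python
def describe_backend() -> dict:
    """Report which GPU backend the installed PyTorch build targets."""
    try:
        import torch
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        "hip": getattr(torch.version, "hip", None),  # version string on ROCm builds
        "cuda": torch.version.cuda,                  # version string on CUDA builds
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        # On ROCm this reports the AMD card, even though the API says "cuda".
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


print(describe_backend())
```

If `hip` is `None` on a machine with an AMD card, the installed wheel is a CUDA or CPU build and needs to be replaced with a ROCm one.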

VibeVoice-ASR

Status: Partially works. Requires AMD's Flash Attention fork.

PyTorch + ROCm 7.1 works on RX 7900 XTX. The Flash Attention requirement is the challenge — you'd need AMD's ROCm Flash Attention fork via the Triton backend. The 4-bit quantization via bitsandbytes should work since bitsandbytes now supports gfx1100. Expect 30–50% slower inference.

ONNX Runtime

Status: Migrated from ROCm EP to MIGraphX EP. Consumer GPU support unclear.

The ROCm Execution Provider was removed in ONNX Runtime 1.23. Replacement is the MIGraphX EP, which supports ROCm 5.2 through 7.2. But consumer GPU (RDNA) compatibility is not documented for MIGraphX. This is a risk area — older tutorials won't work, and if you need ONNX Runtime in production, validate this on your specific card before committing.
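A quick way to validate is to ask ONNX Runtime which execution providers your installed build exposes, then pick from that list rather than hard-coding one. A sketch; the preference order is our assumption, while the provider names and `get_available_providers()` are ONNX Runtime's own.

```python
# Assumed preference order: MIGraphX on AMD, CUDA on NVIDIA, CPU as a last resort.
PREFERRED = ["MIGraphXExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]


def choose_providers(available):
    """Keep the preferred providers that this build actually offers."""
    chosen = [name for name in PREFERRED if name in available]
    return chosen or ["CPUExecutionProvider"]


def available_providers():
    """List providers from the installed onnxruntime, or nothing if it is missing."""
    try:
        import onnxruntime as ort
    except ImportError:
        return []
    return ort.get_available_providers()


print(choose_providers(available_providers()))
```

The result can be passed as `providers=` to `onnxruntime.InferenceSession`. If only the CPU provider comes back on a machine with an RX 7900 XTX, the installed package was not built with MIGraphX support. A provider appearing in the list still does not prove the model runs correctly on RDNA, so test with a real model.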

The Compatibility Matrix (Bookmark This)

Service	AMD Support	Status	Performance vs NVIDIA	Deploy Effort
HuggingFace TEI	NO	PR stalled 10+ months	N/A	HIGH (custom build, MI only)
ComfyUI / SD	YES	Official	25–40% slower	LOW
Ollama	YES	Official	~30% slower	NONE
llama.cpp	YES	Official HIP	~25% slower (RTX 4090 ~33% faster)	LOW
vLLM	YES	Official (caveats)	30–40% slower	LOW–MEDIUM
Chatterbox TTS	LIKELY YES	Pure PyTorch	20–30% slower	LOW
VibeVoice-ASR	PARTIAL	Flash Attention needed	30–50% slower	MEDIUM
ONNX Runtime	PARTIAL	MIGraphX EP only	Unknown	MEDIUM
bitsandbytes	YES	ROCm fork	Similar	LOW
Flash Attention	YES	AMD fork, Triton	Unknown gap	MEDIUM
xformers	EXPERIMENTAL	ROCm 7.1 only	Unknown	MEDIUM–HIGH
AutoGPTQ	YES	ROCm 5.7	Similar	LOW

The Multi-Vendor Path (Our Actual Recommendation)

You can run AMD and NVIDIA side-by-side in the same system. ROCm and CUDA coexist on Linux — they use different kernel modules (amdgpu vs nvidia). PyTorch can only use one backend per process, but llama.cpp instances can each target a different GPU, and nothing stops you from running, say, TEI in a CUDA container and ComfyUI in a ROCm container on the same box.
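When the services run as plain processes rather than containers, the usual way to pin each one to a card is through environment variables: `CUDA_VISIBLE_DEVICES` for NVIDIA and `HIP_VISIBLE_DEVICES` for ROCm. A minimal sketch; `gpu_env` is a hypothetical helper of ours, and it removes the other vendor's variable because HIP may also honor `CUDA_VISIBLE_DEVICES`.

```python
import os


def gpu_env(vendor: str, index: int = 0, base=None) -> dict:
    """Return a copy of the environment that exposes one GPU from one vendor."""
    env = dict(os.environ if base is None else base)
    if vendor == "nvidia":
        env["CUDA_VISIBLE_DEVICES"] = str(index)
        env.pop("HIP_VISIBLE_DEVICES", None)
    elif vendor == "amd":
        env["HIP_VISIBLE_DEVICES"] = str(index)
        # Drop any inherited CUDA setting so it cannot hide the AMD card.
        env.pop("CUDA_VISIBLE_DEVICES", None)
    else:
        raise ValueError(f"unknown vendor: {vendor!r}")
    return env


amd_env = gpu_env("amd", 0)
nvidia_env = gpu_env("nvidia", 0)
print(amd_env["HIP_VISIBLE_DEVICES"], nvidia_env["CUDA_VISIBLE_DEVICES"])
```

Each dictionary can be handed to `subprocess.Popen(command, env=...)` when launching a service. Device indices are per vendor, so index 0 for AMD and index 0 for NVIDIA refer to different physical cards.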

Our real-world recommendation — if you want to add AMD without blowing up a working NVIDIA stack:

Move to AMD GPU:
- ComfyUI (official ROCm support, LoRAs work)
- Ollama / llama.cpp (mature)
- Chatterbox TTS (pure PyTorch)

Keep on NVIDIA GPU:
- TEI and anything depending on NVIDIA-optimized embeddings infrastructure
- VibeVoice-ASR (designed around NVIDIA containers)
- Anything requiring xformers in production

This is how we'd actually spend the money: pick up a used RX 7900 XTX for ~$500–$600, keep the RTX 4090 on TEI + VibeVoice, and let the AMD card carry image generation and LLM inference. You get 48GB total VRAM across two cards for less than the cost of a single RTX 5090, and you don't have to touch your working embeddings pipeline.

If You're Replacing Your NVIDIA Stack Entirely

Don't. At least not yet.

TEI has no AMD support, and if you depend on it for embeddings, you will feel it on day one
VibeVoice-ASR and most HuggingFace-container-dependent models assume NVIDIA
The 30–40% performance penalty is consistent across workloads
The software ecosystem friction will cost significant engineering time

The exception: if your entire workload is Ollama + ComfyUI + llama.cpp, AMD is legitimately fine, and the RX 7900 XTX at $600 used is the best $/GB-of-VRAM deal on the market by a wide margin.

What Would Change Our Calculus

Four things would shift this analysis sharply in AMD's favor:

TEI merging ROCm support (PR #295 has been open since June 2024)
RDNA 5 introducing dedicated AI accelerators equivalent to NVIDIA's tensor cores
AMD releasing a consumer card with 24GB+ VRAM on RDNA 4 — unlikely given the 9070 XT's 16GB
HuggingFace broadly adopting ROCm in their inference stack

None of those are guaranteed. Some are clearly not coming soon. Plan accordingly.

Bottom Line

AMD's ROCm stack in 2026 is real. It's not a toy. The RX 7900 XTX is a legitimately useful GPU for LLM inference, image generation, and local AI work — especially at its price point.

But the ecosystem still assumes NVIDIA, the silicon still lacks tensor cores, and some critical tools (TEI is the most painful for us) simply do not run on AMD at all. If you treat AMD as a supplemental node and pick your workloads carefully, you get great value. If you try to replace your NVIDIA stack wholesale, you'll spend the next three months debugging container compatibility and asking yourself why you did this.

We're keeping our RTX 4090 on TEI. The next card we add to the cluster will probably be an RX 7900 XTX — for ComfyUI, Ollama, and llama.cpp, where the price-per-VRAM is impossible to beat.

If you're planning an AI infrastructure buildout and want to talk through the real tradeoffs for your workload, get in touch. We've done the homework.

IDFS AI runs production AI infrastructure and builds AI-powered tools for businesses. This report reflects our own hands-on evaluation and hundreds of hours of community benchmarks as of April 2026. Your mileage may vary — always test on your specific workload before committing.
