Six Days With the Intel Arc Pro B70: An Honest AI Inference Build Diary
Why We Bought Them
In early April 2026 we needed more VRAM and the NVIDIA market was, charitably, unkind. A pair of professional-tier CUDA cards with 32 GB each was running close to the price of a small car. Intel had been shipping the Arc Pro B70 since late 2025: 32 GB of GDDR6 per card, Xe2 "Battlemage" architecture, PCIe 5.0 x16, ECC support, and a price tag that bought us two cards and 64 GB of total VRAM for less than the single equivalent NVIDIA card we had been pricing.[^1]
The math made sense. The software situation, less so. What follows is the unedited build diary — six days, every dead end, and the runbook we wished someone else had written before us.
This post does not cover how we orchestrate inference once the cards are up. The point here is the floor: how hard is it in 2026 to get an Intel Arc Pro B-series running a real embedding model in a container, and what should you expect to hit on the way?
Day 1 — The Kernel Wall
We installed Ubuntu 24.04, plugged the cards in, and the system booted to a nomodeset console. lspci showed the GPUs by their PCI device ID — 0xE223 — and that was the entire signal of life we got. No /dev/dri nodes. No GPU process visible to intel_gpu_top. The stock Ubuntu kernel did not have a working xe driver for Battlemage.
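Before touching drivers, it helps to confirm the cards are at least enumerated on the PCI bus. The sketch below scans lspci -nn output for the vendor:device pair. The helper names and the sample lines are illustrative assumptions, not output captured from the build machine.

```python
import re
import subprocess

# Intel's PCI vendor ID is 8086; e223 is the device ID reported for these cards.
INTEL_VENDOR = "8086"
B70_DEVICE = "e223"

def find_pci_devices(lspci_text, vendor=INTEL_VENDOR, device=B70_DEVICE):
    """Return the PCI slot of every `lspci -nn` line that carries [vendor:device]."""
    tag = re.compile(r"\[%s:%s\]" % (vendor, device), re.IGNORECASE)
    slots = []
    for line in lspci_text.splitlines():
        if tag.search(line):
            slots.append(line.split()[0])  # first column is the slot, e.g. 03:00.0
    return slots

def lspci_output():
    """Run `lspci -nn`; return an empty string if lspci is not installed."""
    try:
        result = subprocess.run(["lspci", "-nn"], capture_output=True,
                                text=True, check=False)
    except FileNotFoundError:
        return ""
    return result.stdout

# Illustrative input only; real slot numbers and device names will differ.
SAMPLE = (
    "00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:a780]\n"
    "03:00.0 VGA compatible controller [0300]: Intel Corporation Device [8086:e223]\n"
    "07:00.0 VGA compatible controller [0300]: Intel Corporation Device [8086:e223]\n"
)
```

Calling find_pci_devices(lspci_output()) on the host should list one slot per card. An empty list means the problem sits below the driver layer (seating, power, BIOS settings).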
The Arc Pro B-series uses Intel's new xe kernel driver rather than the older i915. Intel's big driver maturation work for Battlemage landed in Linux 6.17, including SR-IOV groundwork and the device-ID coverage that recognizes E223 silicon as something the xe module wants to claim.[^2]
Ubuntu 24.04's general-availability kernel is older than that. The fix is the HWE (hardware enablement) kernel stream:
sudo apt install --install-recommends linux-generic-hwe-24.04
sudo reboot
After the reboot, uname -r reported a 6.17-line kernel, /dev/dri/card0 and /dev/dri/card1 appeared, and dmesg showed xe claiming both devices.
It also showed both devices failing to fully initialize because their GuC firmware was missing.
Day 1, partially won.
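The kernel requirement is easy to check from a script before going further. This is a minimal sketch that assumes 6.17 is the floor, as it was for us; the function names are hypothetical and the release strings in the usage note are examples.

```python
import platform
import re

def kernel_version(release):
    """Parse the leading 'major.minor' of a `uname -r` string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))

def kernel_at_least(release, minimum=(6, 17)):
    """True if the kernel release is at or above `minimum`."""
    return kernel_version(release) >= minimum

def running_kernel_ok():
    """Check the kernel this process is running on."""
    # platform.release() returns the same string as `uname -r` on Linux.
    return kernel_at_least(platform.release())
```

For example, kernel_at_least("6.8.0-31-generic") is False and kernel_at_least("6.17.0-14-generic") is True. Comparing tuples rather than strings avoids the trap where "6.8" sorts after "6.17" lexically.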
Day 2 — The Firmware Quest
GuC ("Graphics microController") is the on-die scheduler for modern Intel discrete GPUs. The kernel xe driver expects to load a firmware blob into the GPU at probe time. Without it, the driver still claims the device, but workloads do not run.
The canonical firmware tree is on kernel.org: git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git, with Battlemage-era blobs (bmg_guc_*.bin, bmg_huc_*.bin) in the xe/ subdirectory.[^3] Distribution linux-firmware packages are slower-moving than upstream and on Ubuntu 24.04 ours did not yet ship the BMG blobs we needed. The fix is a small one once you know where to look:
git clone https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
sudo mkdir -p /lib/firmware/xe
sudo cp linux-firmware/xe/bmg_guc_*.bin /lib/firmware/xe/
sudo cp linux-firmware/xe/bmg_huc_*.bin /lib/firmware/xe/
sudo update-initramfs -u
sudo reboot
After this, dmesg | grep -i xe showed clean GuC + HuC loads on both cards. clinfo listed two OpenCL devices. We had functional GPUs.
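A quick way to tell whether the firmware copy actually landed is to check the directory for the blob names mentioned above. The sketch below assumes the default /lib/firmware/xe location. The helper name is hypothetical, and the trailing wildcard is there because some distributions ship firmware compressed (for example with a .zst suffix).

```python
import glob
import os

# Glob patterns for the Battlemage blobs named above. The trailing '*' also
# matches compressed copies that some distributions install.
REQUIRED_PATTERNS = ("bmg_guc_*.bin*", "bmg_huc_*.bin*")

def missing_firmware(fw_dir="/lib/firmware/xe", patterns=REQUIRED_PATTERNS):
    """Return the patterns that match no file in fw_dir (empty list means all present)."""
    return [p for p in patterns if not glob.glob(os.path.join(fw_dir, p))]
```

An empty list from missing_firmware() only says the files exist on disk. Whether the driver accepted them is still a question for dmesg.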
We did not yet have a way to run an embedding model on them. That was Day 3.
Day 3 — The TEI XPU Disappointment
HuggingFace ships an XPU build of Text Embeddings Inference (TEI), their high-throughput embedding server, as a Docker image: ghcr.io/huggingface/text-embeddings-inference:xpu-ipex-latest.[^4] In principle it is plug-and-play: pull the image, point it at a model, and get TEI's /embed endpoint backed by Intel hardware.
In practice on a B-series Arc Pro it does not work. The image successfully downloads the model (we used intfloat/e5-mistral-7b-instruct, ~14 GB pulled cleanly), then fails at startup with a generic message:
Could not start Python backend: Python backend failed to start.
Verbose logs (and a couple of hours of strace) pointed to the root cause: the IPEX (Intel Extension for PyTorch) build inside the container predates the B-series silicon and does not recognize device ID 0xE223. TEI's own current docs list Intel Gaudi 2 and Gaudi 3 as the explicitly supported XPU targets, with no mention of Arc Pro B-series.[^4] The omission tells the story: the image is not built for these cards.
This is a straightforward gap and likely to close in a future image build. As of our work in early April 2026, we needed a different path.
Day 4 — intel/vllm Saves The Day
Intel themselves publish AI containers for the Arc Pro B-series. The one we needed was the vLLM XPU image, which Intel explicitly validates on Arc Pro B-series Graphics: intel/vllm:0.10.2-xpu, with PyTorch 2.10.0+xpu as its base.[^5]
docker run --rm -it \
--device /dev/dri \
-v /dev/shm:/dev/shm \
intel/vllm:0.10.2-xpu \
python -c "import torch; print(torch.xpu.is_available(), torch.xpu.device_count())"
True 2 came back. Both cards visible to PyTorch as XPU devices. We loaded e5-mistral-7b-instruct via the standard transformers path against device='xpu:0', fed it a batch, and got a 4096-dimensional vector out.
First successful B70 embedding. Day 4, ~6:44 AM EDT.
A note that bit us hard before we figured it out: pip install intel-extension-for-pytorch outside the container produces a different ABI than the one inside the container. Mixing them — for example, exporting a venv into the container — gives you a cryptic oneDNN ABI mismatch at the first matmul. The IPEX release notes are clear that the SYCL runtime version, the _GLIBCXX_USE_CXX11_ABI setting, and the icpx compiler version all have to align with whatever was used to build the wheel.[^6] In practice: stay inside the container, or build IPEX from source against your exact stack. Don't mix.
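One way to catch a mismatch before the first matmul is to print the same build fingerprint inside and outside the container and compare. The sketch below is an assumption about what is worth comparing, not a complete ABI check: it covers the PyTorch version string and the C++11 ABI flag, and it does not inspect the SYCL runtime or icpx version. The function names are hypothetical.

```python
import sys

def torch_fingerprint():
    """Collect build facts that should match between two environments."""
    info = {
        "python": sys.version.split()[0],
        "torch": None,
        "cxx11_abi": None,
        "xpu_available": False,
        "xpu_count": 0,
    }
    try:
        import torch
    except ImportError:
        return info  # torch is not installed here; nothing more to report
    info["torch"] = str(torch.__version__)
    info["cxx11_abi"] = torch.compiled_with_cxx11_abi()
    xpu = getattr(torch, "xpu", None)  # absent on older PyTorch builds
    if xpu is not None and xpu.is_available():
        info["xpu_available"] = True
        info["xpu_count"] = xpu.device_count()
    return info

def fingerprint_diff(a, b):
    """Return the sorted keys whose values differ between two fingerprints."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))
```

Run torch_fingerprint() in both places and pass the two dictionaries to fingerprint_diff. Any difference in the torch or cxx11_abi entries is a reason not to share wheels or a venv between them.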
Day 5 — The Raw Compute Reality Check
With one embedding running we wanted to know what the cards were actually delivering. We ran a simple synthetic benchmark: a 4096×4096 FP16 matrix multiply, warmed up and then timed over many iterations, plus a memory-bandwidth probe via large allocations and reads.
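The arithmetic behind a matmul benchmark like this is short enough to write down. The sketch below is not the script used for the measurements; it shows the usual pattern (warm up, synchronize, time many iterations, convert to TFLOPS) with hypothetical function names and a sync hook standing in for a device synchronize call.

```python
import time

def matmul_flops(n):
    """Floating-point operations in one n-by-n times n-by-n multiply (one multiply and one add per term)."""
    return 2 * n ** 3

def tflops(n, seconds_per_iter):
    """Convert a per-iteration time for an n-by-n matmul into TFLOPS."""
    return matmul_flops(n) / seconds_per_iter / 1e12

def time_per_iter(op, warmup=10, iters=100, sync=lambda: None):
    """Average wall-clock seconds per call of op(), measured after warmup calls.

    `sync` should block until queued device work has finished. On an
    asynchronous backend, timing without it measures only kernel launch.
    """
    for _ in range(warmup):
        op()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        op()
    sync()
    return (time.perf_counter() - start) / iters
```

On the cards, op would be a matmul on two preallocated FP16 tensors and sync would be torch.xpu.synchronize. For scale, 1 ms per 4096×4096 multiply works out to about 137 TFLOPS.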
What we measured on our specific stack (intel/vllm:0.10.2-xpu, PyTorch 2.10.0+xpu, kernel 6.17 HWE, no SR-IOV, default frequencies):
| Metric | B70 (measured) | RTX 3090 (measured, same script) | Ratio |
|---|---|---|---|
| FP16 matmul, 4096×4096 | 128.3 TFLOPS | ~71 TFLOPS | 1.81× |
| Effective memory bandwidth | 443 GB/s | ~936 GB/s | 0.47× |
Two things to flag honestly. First, those are measured-on-our-stack numbers, not theoretical peaks. Intel's published specifications for the B70 give a higher theoretical memory bandwidth than what we hit in practice; what we measured is what we'd expect a real workload to see, not what a synthetic peak benchmark on a perfectly tuned driver would report.[^1] Second, NVIDIA's RTX 3090 whitepaper publishes ~71 TFLOPS as the FP16 with-sparsity number on Ampere; our 3090 measurement happens to land near that figure but the apples-to-apples comparison would be FP16 dense.[^7] In other words: do not screenshot the table above and call it a B70-vs-3090 verdict. It is a snapshot of one workload on one stack at one moment.
The headline finding for our use case (embedding inference, which is heavy on small-batch matmul against medium-sized tensors) was clear, though: the B70 came out ahead on compute throughput and behind on memory bandwidth. A workload dominated by streaming tensors back and forth would prefer the 3090. A workload dominated by repeated arithmetic on tensors that fit in VRAM would prefer the B70. Embedding inference at our batch sizes was much closer to the second case.
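The compute-versus-bandwidth trade-off can be made concrete with a roofline-style estimate. This is a simplification introduced here for illustration, not an analysis from the original benchmarking: it treats runtime as the slower of compute time and memory time, using the measured figures from the table above.

```python
def ridge_point(peak_flops, bandwidth_bytes):
    """Arithmetic intensity (FLOP per byte) above which a kernel is compute-bound."""
    return peak_flops / bandwidth_bytes

def roofline_time(flops, bytes_moved, peak_flops, bandwidth_bytes):
    """Lower-bound runtime: the slower of the compute time and the memory time."""
    return max(flops / peak_flops, bytes_moved / bandwidth_bytes)

# Measured figures from the table above (one stack, not vendor peaks).
B70 = {"peak_flops": 128.3e12, "bandwidth_bytes": 443e9}
RTX3090 = {"peak_flops": 71e12, "bandwidth_bytes": 936e9}

# Two hypothetical kernels: one arithmetic-heavy, one streaming-heavy.
HEAVY = {"flops": 1e12, "bytes_moved": 1e9}   # 1000 FLOP per byte
STREAM = {"flops": 1e9, "bytes_moved": 1e9}   # 1 FLOP per byte
```

With these numbers the ridge points are roughly 290 FLOP per byte for the B70 and 76 for the 3090. Under this model the B70 finishes first once a kernel exceeds about 160 FLOP per byte (71e12 / 443e9), and the 3090 finishes first below that. Real kernels rarely hit either roof, so treat the break-even as a rough guide only.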
Two B70s gave us 64 GB of VRAM, more compute headroom for our specific workload, and ran cool. Worth the six days.
Day 6 — Wrapping It in a TEI-Compatible API
Our existing internal services were already speaking the TEI HTTP shape: POST /embed with a list of strings, get back a list of vectors. Rewriting every consumer to talk to a different server was not the plan. So we wrote a thin FastAPI wrapper around IPEX-LLM that loads the model on XPU and exposes the same endpoint.[^8]
The wrapper is small, about 120 lines, and three things mattered:
- Load the model once at process start, not per request. The first forward pass on XPU compiles kernels and takes seconds; you do not want to pay that on a user request.
- Pin the model to a specific XPU. With two cards we explicitly bind via device='xpu:0' (or xpu:1), and run two server processes, one per card, behind a tiny round-robin proxy. Multi-XPU inside one process is a separate journey we deliberately did not take.
- Use the model's native batch path. For E5-Mistral that means model.encode(...) with batch_size tuned to your VRAM headroom; for us, 32 was a comfortable point on a 7-billion-parameter model with a 32 GB card.
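Two of the points above, the per-card processes behind a round-robin proxy and the fixed batch size, reduce to very little code. The sketch below is an illustration under assumptions, not the wrapper itself: the backend addresses are hypothetical, and a real proxy would also want health checks and timeouts.

```python
import itertools

def chunked(items, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

class RoundRobin:
    """Hand out backends in a fixed rotation, one per request."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("need at least one backend")
        self._cycle = itertools.cycle(list(backends))

    def next_backend(self):
        return next(self._cycle)

# Hypothetical addresses: one server process per card.
BACKENDS = ["http://127.0.0.1:8081", "http://127.0.0.1:8082"]
```

Each incoming request takes RoundRobin(BACKENDS).next_backend() as its target, and each server feeds its inputs to the model through chunked(texts, 32). Keeping the two cards in separate processes means a crash or stall on one does not take the other down.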
Once that was running, our existing services could not tell the difference between this and the old TEI server. Same JSON in, same vectors out (modulo a finding we'll cover in a separate post about why "same vectors" turned out to be a much harder claim than we thought).
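The request handling that makes a wrapper look like TEI to its clients can be sketched without any web framework. This is a minimal illustration, not the FastAPI code described above: the inputs field name follows TEI's request format as far as we understand it, the status codes are our own choice, and fake_encode is a stand-in for the real model call.

```python
import json

def handle_embed(body, encode):
    """Turn a POST /embed request body into a (status, response_body) pair.

    `encode` is any callable mapping a list of strings to a list of vectors.
    """
    try:
        payload = json.loads(body)
        inputs = payload["inputs"]
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "expected JSON with an 'inputs' field"})
    if isinstance(inputs, str):
        inputs = [inputs]  # accept a single string as a batch of one
    if not isinstance(inputs, list) or not all(isinstance(t, str) for t in inputs):
        return 422, json.dumps({"error": "'inputs' must be a string or a list of strings"})
    vectors = encode(inputs)
    # Convert to plain floats so the result is JSON-serializable.
    return 200, json.dumps([[float(x) for x in v] for v in vectors])

def fake_encode(texts):
    """Stand-in encoder for testing: a 2-dimensional vector per text."""
    return [[float(len(t)), 1.0] for t in texts]
```

Keeping the parsing in a plain function like this makes the contract testable without a GPU: swap fake_encode for the real model call and the HTTP layer does not change.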
The Verdict, Honestly
Would we buy them again? Yes, for VRAM-heavy workloads where compute-per-dollar matters more than peak memory bandwidth. The 64 GB across two cards has paid for itself in models we can actually fit.
Would we recommend them as a drop-in replacement? No. The HuggingFace ecosystem assumes CUDA. Intel's containers work, but you will be reading more dmesg output and Intel release notes than your peers on the green team. If your team's tolerance for kernel-and-firmware troubleshooting is low, the time-to-first-inference penalty is real.
One caveat we want to be honest about: IPEX-LLM (the project that gives you the friendliest LLM-on-XPU experience) was archived read-only on January 28, 2026. Continued development is moving to intel-extension-for-pytorch and torch-xpu-ops directly. The path we used works today; expect the recommended path to keep shifting for the next few quarters.[^9]
If you are on the fence: budget six days, follow the runbook above, and you will know by the end of the week whether the platform fits your team. We did, and the cards have been quietly serving inference for us ever since.
Quick Reference Runbook
# 1. Kernel
sudo apt install --install-recommends linux-generic-hwe-24.04 && sudo reboot
# 2. GuC + HuC firmware (until distros catch up)
git clone https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
sudo mkdir -p /lib/firmware/xe
sudo cp linux-firmware/xe/bmg_guc_*.bin /lib/firmware/xe/
sudo cp linux-firmware/xe/bmg_huc_*.bin /lib/firmware/xe/
sudo update-initramfs -u && sudo reboot
# 3. Verify
dmesg | grep -i 'xe.*GuC' | head
clinfo -l
# 4. Container
docker pull intel/vllm:0.10.2-xpu
docker run --rm -it --device /dev/dri -v /dev/shm:/dev/shm intel/vllm:0.10.2-xpu \
python -c "import torch; print(torch.xpu.is_available(), torch.xpu.device_count())"
# 5. Load your embedding model on xpu:0 inside the container.
If you make it to the end of step 5 and torch.xpu.is_available() returns True, the rest is the same engineering you already do. The cards are real, the silicon works, the software is climbing the curve. Catch them mid-climb and the value is genuine.
Footnotes
[^1]: Intel Arc Pro B70 Graphics — Product Specifications. https://www.intel.com/content/www/us/en/products/sku/245797/intel-arc-pro-b70-graphics/specifications.html
[^2]: "Intel Readies Big Graphics Driver Changes For Linux 6.17," Phoronix, 2025. https://www.phoronix.com/news/Intel-Xe-Driver-Linux-6.17-Big
[^3]: linux-firmware on kernel.org, xe/ subdirectory. https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/xe
[^4]: Text Embeddings Inference — Intel container docs (XPU targets). https://github.com/huggingface/text-embeddings-inference/blob/main/docs/source/en/intel_container.md
[^5]: Intel AI Containers — vLLM 0.10.2 XPU (validated on Arc Pro B-Series). https://github.com/intel/ai-containers/blob/main/vllm/0.10.2-xpu.md
[^6]: Intel Extension for PyTorch — XPU release / known-issues notes. https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/releases.html
[^7]: NVIDIA Ampere GA102 GPU Architecture Whitepaper (v2.1). https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.1.pdf
[^8]: IPEX-LLM project (archived January 2026). https://github.com/intel/ipex-llm
[^9]: Phoronix — "Intel Arc Pro B70 Benchmarks With LLM / AI, OpenCL, OpenGL & Vulkan." https://www.phoronix.com/review/intel-arc-pro-b70-linux