AMD Instinct Shrinks the CUDA Moat: Serving GLM-5.2 at 2,626 tok/s on MI355X Infrastructure
SAN FRANCISCO — In a technical demonstration challenging Nvidia’s enterprise compute monopoly, AI engineering firm Wafer has successfully served Z-ai’s frontier GLM-5.2 model on AMD Instinct MI355X hardware, achieving an aggregate node throughput of 2,626 tokens per second (tok/s) and a single-stream speed of 213 tok/s.
The benchmark shows that while Nvidia’s newly released Blackwell B200 infrastructure still holds a raw performance lead (3,192 tok/s per node on identical workloads, about 22% more than the MI355X), AMD’s hardware delivers roughly 2x lower cost per token, a strong result on performance per dollar.
The Economics of the Inference Supply Squeeze
Demand for inference is outpacing silicon supply. With frontier models such as Claude Fable, GLM-5.2, and MiniMax M3 arriving in quick succession, demand for token processing has driven the price of Nvidia Blackwell allocations sharply higher.
By contrast, the AMD Instinct MI355X is roughly 2.75x cheaper per GPU than an Nvidia B300 while maintaining comparable raw hardware specifications. Historically, Nvidia's primary moat has been software: its Day-0 developer support lets providers spin up new models almost immediately, whereas running frontier networks on AMD's ROCm stack frequently requires engineering workarounds for container images that fail during initialization.
Wafer's optimization run signals that automated kernel tuning and developer framework compliance are rapidly closing this software gap.
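The price and throughput figures quoted above can be combined into a rough cost-per-token comparison. The sketch below is a back-of-the-envelope check, not Wafer's methodology. It assumes the 2.75x per-GPU price gap carries over to whole-node pricing and that it applies to the same Nvidia node that produced the 3,192 tok/s figure (the article quotes the price against a B300 and the throughput against a B200). The function and variable names are illustrative.

```python
def cost_per_token_advantage(price_ratio: float, amd_tps: float, nvidia_tps: float) -> float:
    """How many times cheaper each AMD token is than each Nvidia token.

    price_ratio: Nvidia node price divided by AMD node price (an assumption here).
    amd_tps, nvidia_tps: aggregate node throughput in tokens per second.
    """
    # Cost per token is price / throughput, so the advantage is
    # (nvidia_price / nvidia_tps) / (amd_price / amd_tps).
    return price_ratio * (amd_tps / nvidia_tps)


# Figures quoted in the article; pairing them this way is an assumption.
advantage = cost_per_token_advantage(price_ratio=2.75, amd_tps=2626, nvidia_tps=3192)
print(f"AMD cost-per-token advantage: {advantage:.2f}x")
```

Under these assumptions the result lands slightly above 2x, which is in line with the roughly 2x cost-per-token figure quoted earlier.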
Technical Deep Dive: Lossless 4-Bit Quantization & Sglang Fixes
To reach these speeds without degrading model quality, Wafer passed the base bf16 weights of GLM-5.2 through AMD Quark, quantizing the network down to an ultra-lean MXFP4 format. Benchmark results showed the 4-bit compression to be functionally lossless, with every score within two points of the FP8 baseline:
| Evaluation Metric | FP8 Baseline | MXFP4 Quantization | Delta ($\Delta$) |
|---|---|---|---|
| GSM8K (5-shot, greedy) | 0.965 | 0.955 | -0.010 |
| GPQA-Diamond (Graduate Reasoning) | 0.9217 | 0.9026 | -0.019 |
| tau2 macro | 0.819 | 0.834 | +0.015 |
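As a quick check on the table above, the deltas can be recomputed and expressed as relative changes. This is a small illustrative script using only the scores listed in the table; the helper name `summarize` is made up for this example.

```python
# Scores copied from the table above: (FP8 baseline, MXFP4).
scores = {
    "GSM8K": (0.965, 0.955),
    "GPQA-Diamond": (0.9217, 0.9026),
    "tau2 macro": (0.819, 0.834),
}


def summarize(baseline: float, quantized: float) -> tuple[float, float]:
    """Return (absolute delta, relative change in percent)."""
    delta = quantized - baseline
    return delta, 100.0 * delta / baseline


for name, (fp8, mxfp4) in scores.items():
    delta, rel = summarize(fp8, mxfp4)
    print(f"{name:14s} delta={delta:+.4f} relative={rel:+.2f}%")
```

The largest drop, on GPQA-Diamond, works out to about 2% of the baseline score.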
While evaluating execution engines, Wafer selected sglang over vLLM (which lacked a working MXFP4 Mixture-of-Experts path) and ATOM (which suffered from severe context degradation). However, maximizing the MI355X silicon required fixing two critical framework bugs:
1. Speculative Decoding Initialization Patch
The base ROCm image for sglang initially crashed on a shape mismatch during initialization. The Multi-Token Prediction (MTP) head uses a different module prefix (model.decoder.*) from Quark's standard decoder prefix (model.layers.78.mlp.shared_experts.*), so sglang's quantization lookup failed and it tried to load full-width bf16 shared-expert weights into a half-width 4-bit slot. Engineers resolved the crash by manually duplicating the layer 78 entries under the names sglang expects, unblocking multi-token prediction and yielding a 3x gain in single-stream throughput.
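The idea behind the fix, duplicating quantization entries under a second prefix, can be sketched in a few lines. This is a hypothetical illustration only. The flat name-to-settings dictionary, the module names, and the `alias_entries` helper are assumptions made for the example and do not reflect the actual Quark export layout or sglang's loader.

```python
def alias_entries(config: dict[str, dict], src_prefix: str, dst_prefix: str) -> dict[str, dict]:
    """Return a copy of config where every src_prefix entry also exists under dst_prefix."""
    aliased = dict(config)
    for name, settings in config.items():
        if name.startswith(src_prefix):
            alias = dst_prefix + name[len(src_prefix):]
            # Keep any entry that already exists under the alias name.
            aliased.setdefault(alias, dict(settings))
    return aliased


# Hypothetical per-module quantization entries; not the real export layout.
quant_config = {
    "model.layers.78.mlp.shared_experts.gate_proj": {"dtype": "mxfp4"},
    "model.layers.78.mlp.shared_experts.down_proj": {"dtype": "mxfp4"},
    "model.layers.0.self_attn.q_proj": {"dtype": "mxfp4"},
}

patched = alias_entries(quant_config, "model.layers.78.", "model.decoder.")
for name in sorted(patched):
    print(name, patched[name])
```

Copying rather than renaming keeps the original entries intact, so any loader that still looks modules up under the original prefix continues to find them.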
2. The Fused Kernel ROCm Guard
Deep speculative configurations (specifically the 5/1/6 depth layout) were blocked because the fused multi-step metadata kernel hardcoded a #include <cuda_runtime.h> directive with no ROCm alternative. Adding a simple #ifdef USE_ROCM preprocessor guard unlocked the execution path.
Combined with targeted flag tuning (--kv-cache-dtype fp8_e4m3 and --enable-aiter-allreduce-fusion), these adjustments pushed single-stream performance to 213 tok/s.
Optimizing for Massive Prefill Workloads
For large-scale enterprise automation, single-stream optimizations are insufficient. Under a common production profile—20K input tokens, 1K output tokens, and a 60% KV cache hit rate—the infrastructure bottleneck shifts entirely to the prefill stage.
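The token arithmetic behind that profile can be made explicit. The sketch below is a simple count of tokens per request, not a latency model, and the `WorkloadProfile` class is an illustrative name introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    input_tokens: int
    output_tokens: int
    cache_hit_rate: float  # fraction of input tokens served from the KV cache

    def uncached_prefill_tokens(self) -> float:
        """Prompt tokens that still have to be prefilled after cache hits."""
        return self.input_tokens * (1.0 - self.cache_hit_rate)

    def prefill_to_decode_ratio(self) -> float:
        """Uncached prompt tokens per generated token."""
        return self.uncached_prefill_tokens() / self.output_tokens


profile = WorkloadProfile(input_tokens=20_000, output_tokens=1_000, cache_hit_rate=0.60)
print(f"uncached prefill tokens per request: {profile.uncached_prefill_tokens():.0f}")
print(f"prefill-to-decode token ratio: {profile.prefill_to_decode_ratio():.1f}")
```

With the profile's numbers, each request still carries roughly 8,000 uncached prompt tokens, about eight times the number of tokens it generates.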
Initially, running at Tensor Parallel 8 (TP8) restricted the MI355X to 1,461 tok/s/node. Shifting the architecture to a hybrid TP4 × DP2 topology bumped performance to 1,944 tok/s/node at 2.0 requests per second (RPS).
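To make the two layouts concrete, the sketch below partitions GPU indices into data-parallel replicas of a given tensor-parallel width. It assumes an 8-GPU node (implied by TP8) and contiguous device numbering. Actual rank placement is decided by the serving framework, and `replica_groups` is a hypothetical helper.

```python
def replica_groups(num_gpus: int, tp_size: int) -> list[list[int]]:
    """Split GPU indices into data-parallel replicas, each tp_size GPUs wide."""
    if tp_size <= 0 or num_gpus % tp_size != 0:
        raise ValueError("num_gpus must be a positive multiple of tp_size")
    return [list(range(start, start + tp_size)) for start in range(0, num_gpus, tp_size)]


print("TP8      :", replica_groups(8, 8))  # one replica spanning all eight GPUs
print("TP4 x DP2:", replica_groups(8, 4))  # two independent four-GPU replicas
```

In the hybrid layout each replica handles its own requests, so tensor-parallel communication spans four GPUs instead of eight.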
The largest gain came when engineers discovered that sglang’s container was silently routing GLM-5.2’s FP4 Mixture-of-Experts (MoE) layer through a slow, generic FlyDSL fallback heuristic, because the underlying aiter library only shipped pre-tuned configurations for fp8 paths. By manually tuning the MoE kernel selection grid to GLM’s specific dimensions (model_dim 6144, moe_inter 2048, E=256, topk=8), Wafer saturated the node at 2,626 tok/s at 2.4 RPS, about 35% above the 1,944 tok/s reached before tuning.
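The failure mode described here, a shape-keyed tuning table with no entry for the FP4 case, can be illustrated with a toy lookup. Everything in this sketch is hypothetical: the `MoEShape` key, the kernel names, and the table are placeholders and do not represent aiter's real configuration format or sglang's selection logic.

```python
from typing import NamedTuple


class MoEShape(NamedTuple):
    dtype: str
    model_dim: int
    moe_inter: int
    num_experts: int
    topk: int


# Hypothetical tuned-kernel table that only covers an fp8 path.
TUNED = {
    MoEShape("fp8", 6144, 2048, 256, 8): "tuned_fp8_kernel",
}
GENERIC_FALLBACK = "generic_fallback_kernel"


def select_kernel(shape: MoEShape, table: dict[MoEShape, str]) -> tuple[str, bool]:
    """Return (kernel name, used_fallback) so a fallback is never silent."""
    kernel = table.get(shape)
    if kernel is None:
        return GENERIC_FALLBACK, True
    return kernel, False


glm_fp4 = MoEShape("fp4", 6144, 2048, 256, 8)
print(select_kernel(glm_fp4, TUNED))  # no fp4 entry, so the fallback is used

# Adding an entry for the exact shape removes the fallback.
tuned_with_fp4 = {**TUNED, glm_fp4: "tuned_fp4_kernel"}
print(select_kernel(glm_fp4, tuned_with_fp4))
```

Returning an explicit fallback flag is one way to make this kind of miss visible in logs instead of surfacing only as lower throughput.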
Single-Node Saturation Metrics (AMD MI355X)
| Sustained RPS | Aggregate Throughput (tok/s/node) | TTFT p50 / p95 | Request Success Rate |
|---|---|---|---|
| 0.5 | 449 | 0.59s / 0.60s | 100% |
| 1.0 | 974 | 0.60s / 0.81s | 100% |
| 1.5 | 1,913 | 0.62s / 1.03s | 100% |
| 2.0 | 1,944 | 0.62s / 1.05s | 100% |
| 2.25 | 2,089 | 0.63s / 1.23s | 100% |
| 2.4 (Saturation) | 2,626 | 0.81s / 2.22s | 100% |
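One way to read the table for capacity planning is to pick the highest load that still meets a time-to-first-token target. The sketch below uses the rows above; the 1.5-second p95 budget is an arbitrary example value, and `highest_load_within` is an illustrative helper.

```python
# Rows copied from the table above: (rps, tok_per_s, ttft_p50_s, ttft_p95_s).
rows = [
    (0.5, 449, 0.59, 0.60),
    (1.0, 974, 0.60, 0.81),
    (1.5, 1913, 0.62, 1.03),
    (2.0, 1944, 0.62, 1.05),
    (2.25, 2089, 0.63, 1.23),
    (2.4, 2626, 0.81, 2.22),
]


def highest_load_within(rows, p95_budget_s: float):
    """Return the highest-RPS row whose p95 TTFT fits the budget, or None."""
    eligible = [row for row in rows if row[3] <= p95_budget_s]
    return max(eligible, key=lambda row: row[0]) if eligible else None


best = highest_load_within(rows, p95_budget_s=1.5)  # example budget, not from the source
print(best)
```

With a 1.5-second p95 budget the selected operating point is 2.25 RPS, because p95 TTFT at the 2.4 RPS saturation point rises to 2.22 seconds.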
Structural Takeaways for Enterprise AI Architects
This optimization run suggests that achieving strong performance on AMD Instinct silicon no longer requires writing custom, low-level Triton or CUDA kernels from scratch. By relying on standard libraries and debugging at the framework level, teams can avoid much of the Nvidia price premium. As software orchestration layers mature, the legacy CUDA moat is shifting from a technical barrier to a framework configuration challenge.
Source & References
- Primary Source: Performance per dollar is getting faster and cheaper: Serving GLM5.2 on AMD MI355X — Wafer AI