AI Hardware Benchmarking — Performance & Price Analysis

Comprehensive benchmarking of AI accelerator systems for language model inference. We measure how performance scales with concurrent load on NVIDIA 8×H100, 8×H200, and 8×B200 systems using DeepSeek R1, Llama 4 Maverick, Llama 3.3 70B, and GPT-OSS 120B. Benchmarks are conducted periodically (at least quarterly). Methodology and benchmark specs appear below.
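As a rough illustration of how such a concurrency sweep can be run (a minimal sketch, not the actual harness: the `generate()` placeholder, token counts, and concurrency levels are assumptions), the snippet below times batches of identical requests at increasing concurrency and reports system throughput alongside per-query speed:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> int:
    # Placeholder for a real call to the inference endpoint under test;
    # here we simply simulate a request that returns 256 output tokens.
    time.sleep(0.1)
    return 256

def sweep(prompt: str, concurrency_levels=(1, 4, 16, 64)):
    """Time batches of identical requests at increasing concurrency levels."""
    for c in concurrency_levels:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=c) as pool:
            tokens = list(pool.map(generate, [prompt] * c))
        elapsed = time.perf_counter() - start
        total = sum(tokens)
        print(f"concurrency={c:>3}  "
              f"system throughput={total / elapsed:9.1f} tok/s  "
              f"per-query speed={total / c / elapsed:7.1f} tok/s/query")

sweep("Hello")
```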
Model Deployment Report: Coming soon
Peak System Output Throughput
Llama 3.3 70B • Tokens/sec • Higher is better
Peak Output Speed per Query
Llama 3.3 70B • Tokens/sec/query • Higher is better
Rental Price (On-Demand)
USD per GPU Hour • Lower is better

Price per GPU Hour

Section guide: What this shows

Purpose: Compares on-demand hourly rental costs across major cloud providers for key AI accelerator chips. This establishes the baseline "unit cost" of compute.

  • Compare providers for the same chip to find arbitrage opportunities.
  • Consider committed use discounts (1-year) for long-term deployments.

Common mistake: Ignoring data egress fees or spot instance availability, both of which can affect total cost of ownership.
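To make the total-cost point concrete, a back-of-the-envelope comparison might look like the sketch below; every price, discount, and egress figure in it is a placeholder assumption, not a quoted rate:

```python
def effective_hourly_cost(on_demand_rate: float,
                          committed_discount: float = 0.0,
                          egress_gb_per_hour: float = 0.0,
                          egress_price_per_gb: float = 0.0) -> float:
    """Rough effective USD-per-GPU-hour estimate.

    on_demand_rate      -- listed on-demand price (USD per GPU hour)
    committed_discount  -- fractional discount from a 1-year commitment, e.g. 0.3
    egress_gb_per_hour  -- data transferred out of the provider per GPU hour
    egress_price_per_gb -- provider's egress fee (USD per GB)
    """
    compute = on_demand_rate * (1.0 - committed_discount)
    egress = egress_gb_per_hour * egress_price_per_gb
    return compute + egress

# Made-up numbers: a cheaper sticker price can lose its edge once egress is included.
print(effective_hourly_cost(2.50, egress_gb_per_hour=20, egress_price_per_gb=0.09))
print(effective_hourly_cost(2.90, committed_discount=0.3))
```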

Provider • H100 • H200 • B200 • MI300X • TPU v6e
Pricing notes: GPU Variations • Regional Pricing Basis • Provider Pricing Basis • Pricing Update Schedule

Performance Benchmarks

Models: gpt-oss-120B (high) • Llama 4 Maverick • DeepSeek R1 0528 • Llama 3.3 70B
Views: At Reference Speed • At Peak Throughput
Section guide: What this shows

Purpose: Measures real-world inference performance. Throughput indicates total system capacity (how many concurrent users can be served), while speed per query reflects how quickly a single user receives a response.

  • Use Throughput for batch processing or high-traffic serving.
  • Use Speed per Query for interactive chat applications.
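As a rough illustration of how the two metrics relate (the numbers below are invented, not benchmark results): total system throughput is approximately per-query speed multiplied by the number of concurrent streams, which is why tuning for one usually costs the other.

```python
# Illustrative only: per-query speed typically degrades as concurrency grows,
# while aggregate system throughput keeps rising until the hardware saturates.
illustrative_per_query_speed = {1: 120.0, 8: 95.0, 32: 60.0, 128: 25.0}  # tok/s/query

for concurrency, speed in illustrative_per_query_speed.items():
    system_throughput = speed * concurrency  # tok/s across all streams
    print(f"{concurrency:>4} concurrent queries: "
          f"{speed:6.1f} tok/s/query -> {system_throughput:8.1f} tok/s system")
```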
System Output Throughput
Tokens/sec • Higher is better
Peak Output Speed per Query
Tokens/sec/query • Higher is better
About Speed

Speed per query measures the generation rate for a single stream and is crucial for user experience in chatbots; above roughly 50 t/s, generation outpaces human reading speed.

System Throughput vs Output Speed
Pareto Frontier • Top Right is better
Understanding the Tradeoff

Systems often trade single-user speed for total system throughput. The ideal hardware sits in the top-right corner, offering both high capacity and fast individual responses.
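For readers who want to reconstruct the frontier from raw (throughput, speed) points, a minimal sketch follows; the sample points are invented for illustration:

```python
def pareto_frontier(points):
    """Return the points not dominated on both axes.

    points -- iterable of (system_throughput, per_query_speed) tuples.
    A point is on the frontier if no other point is at least as good on both
    axes and strictly better on one.
    """
    pts = sorted(points, reverse=True)      # sort by throughput, descending
    frontier, best_speed = [], float("-inf")
    for throughput, speed in pts:
        if speed > best_speed:              # faster than every higher-throughput point
            frontier.append((throughput, speed))
            best_speed = speed
    return frontier

# Invented example points (tok/s, tok/s/query):
print(pareto_frontier([(9000, 40), (7000, 90), (7500, 60), (4000, 85)]))
```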

Throughput & Speed vs. Concurrency
Scaling Performance
Cost per Million Tokens
Input + Output • USD • Lower is better
Cost Calculation

Derived from the hourly system rental price divided by the system's token throughput at the reference usage point, scaled to one million tokens.

Cost per 1M tokens = (Price per Hour ÷ Tokens per Hour) × 1,000,000
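A worked example of that calculation, assuming an 8-GPU system and placeholder price and throughput figures:

```python
def cost_per_million_tokens(price_per_gpu_hour: float,
                            num_gpus: int,
                            system_throughput_tok_s: float) -> float:
    """USD per one million (input + output) tokens.

    price_per_gpu_hour      -- on-demand rental price per GPU hour (USD)
    num_gpus                -- GPUs in the system (8 for the systems benchmarked here)
    system_throughput_tok_s -- system token throughput at the reference usage point
    """
    system_price_per_hour = price_per_gpu_hour * num_gpus
    tokens_per_hour = system_throughput_tok_s * 3600
    return system_price_per_hour / tokens_per_hour * 1_000_000

# Placeholder example: $3.00/GPU-hr on an 8-GPU system producing 10,000 tok/s.
print(f"${cost_per_million_tokens(3.00, 8, 10_000):.3f} per 1M tokens")
```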
End-to-End Latency vs. Concurrency
Seconds • Lower is better
End-to-End Latency

Total time to receive a full response. As concurrency rises (x-axis), requests queue up, increasing wait times.
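A simplified mental model (not the measured methodology): end-to-end latency is roughly time-to-first-token plus output length divided by per-query generation speed at that load. The figures below are placeholders:

```python
def e2e_latency_s(ttft_s: float, output_tokens: int, per_query_speed_tok_s: float) -> float:
    """Rough end-to-end latency estimate for one request (seconds).

    ttft_s                -- time to first token, including any queueing delay
    output_tokens         -- tokens generated for the response
    per_query_speed_tok_s -- generation speed of a single stream at the current load
    """
    return ttft_s + output_tokens / per_query_speed_tok_s

# Placeholder numbers: as concurrency rises, queueing pushes TTFT up and
# per-query speed down, so both terms grow.
print(e2e_latency_s(ttft_s=0.5, output_tokens=1000, per_query_speed_tok_s=90))  # light load
print(e2e_latency_s(ttft_s=4.0, output_tokens=1000, per_query_speed_tok_s=30))  # heavy load
```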

System & Benchmark Specifications

Section guide: Data Manifest

Purpose: Detailed configuration logs for every benchmark run to ensure reproducibility and transparency.

Model Name • System • Provider • Precision • TP/PP/DP • Framework • Date
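One way to represent a row of this manifest, with field names mirroring the columns above (the example values are placeholders, not a logged run):

```python
from dataclasses import dataclass

@dataclass
class BenchmarkRun:
    """One row of the data manifest, mirroring the columns above."""
    model_name: str   # e.g. "Llama 3.3 70B"
    system: str       # e.g. "8xH100"
    provider: str     # provider the run was executed on
    precision: str    # numeric format used for inference, e.g. "FP8"
    tp_pp_dp: str     # tensor / pipeline / data parallel degrees, e.g. "8/1/1"
    framework: str    # serving framework and version
    date: str         # date of the benchmark run (ISO 8601)

# Placeholder entry, not an actual logged run:
example = BenchmarkRun("Llama 3.3 70B", "8xH100", "ExampleCloud",
                       "FP8", "8/1/1", "ExampleFramework 1.0", "2025-01-01")
```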
