AI Hardware Benchmarking — Performance & Price Analysis

Comprehensive benchmarking of AI accelerator systems for language model inference. We measure how performance scales with concurrent load on NVIDIA 8×H100, 8×H200, and 8×B200 systems using DeepSeek R1, Llama 4 Maverick, Llama 3.3 70B, and GPT-OSS 120B. Benchmarks are conducted periodically (at least quarterly). Methodology and benchmark specs appear below.
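As a rough illustration of how such a concurrency sweep can be run (a minimal sketch, not the actual harness: the `generate()` placeholder, token counts, and concurrency levels are assumptions), the snippet below times batches of identical requests at increasing concurrency and reports system throughput alongside per-query speed:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> int:
    # Placeholder for a real call to the inference endpoint under test;
    # here we simply simulate a request that returns 256 output tokens.
    time.sleep(0.1)
    return 256

def sweep(prompt: str, concurrency_levels=(1, 4, 16, 64)):
    """Time batches of identical requests at increasing concurrency levels."""
    for c in concurrency_levels:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=c) as pool:
            tokens = list(pool.map(generate, [prompt] * c))
        elapsed = time.perf_counter() - start
        total = sum(tokens)
        print(f"concurrency={c:>3}  "
              f"system throughput={total / elapsed:9.1f} tok/s  "
              f"per-query speed={total / c / elapsed:7.1f} tok/s/query")

sweep("Hello")
```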
Model Deployment Report: Coming soon
Peak System Output Throughput
Llama 3.3 70B • Tokens/sec • Higher is better
Peak Output Speed per Query
Llama 3.3 70B • Tokens/sec/query • Higher is better
Rental Price (On-Demand)
USD per GPU Hour • Lower is better

Price per GPU Hour

Section guide: What this shows

Purpose: Compares on-demand hourly rental costs across major cloud providers for key AI accelerator chips. This establishes the baseline "unit cost" of compute.

  • Compare providers for the same chip to find arbitrage opportunities.
  • Consider committed use discounts (1-year) for long-term deployments.

Common mistake: Ignoring data egress fees or spot instance availability, both of which can affect total cost of ownership.
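To make the total-cost point concrete, a back-of-the-envelope comparison might look like the sketch below; every price, discount, and egress figure in it is a placeholder assumption, not a quoted rate:

```python
def effective_hourly_cost(on_demand_rate: float,
                          committed_discount: float = 0.0,
                          egress_gb_per_hour: float = 0.0,
                          egress_price_per_gb: float = 0.0) -> float:
    """Rough effective USD-per-GPU-hour estimate.

    on_demand_rate      -- listed on-demand price (USD per GPU hour)
    committed_discount  -- fractional discount from a 1-year commitment, e.g. 0.3
    egress_gb_per_hour  -- data transferred out of the provider per GPU hour
    egress_price_per_gb -- provider's egress fee (USD per GB)
    """
    compute = on_demand_rate * (1.0 - committed_discount)
    egress = egress_gb_per_hour * egress_price_per_gb
    return compute + egress

# Made-up numbers: a cheaper sticker price can lose its edge once egress is included.
print(effective_hourly_cost(2.50, egress_gb_per_hour=20, egress_price_per_gb=0.09))
print(effective_hourly_cost(2.90, committed_discount=0.3))
```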

Provider • H100 • H200 • B200 • MI300X • TPU v6e
Pricing notes: GPU Variations • Regional Pricing Basis • Provider Pricing Basis • Pricing Update Schedule

Performance Benchmarks

Models: gpt-oss-120B (high) • Llama 4 Maverick • DeepSeek R1 0528 • Llama 3.3 70B
Views: At Reference Speed • At Peak Throughput
Section guide: What this shows

Purpose: Measures real-world inference performance. Throughput indicates total system capacity (how many concurrent users can be served), while speed per query reflects how quickly a single user receives a response.

  • Use Throughput for batch processing or high-traffic serving.
  • Use Speed per Query for interactive chat applications.
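As a rough illustration of how the two metrics relate (the numbers below are invented, not benchmark results): total system throughput is approximately per-query speed multiplied by the number of concurrent streams, which is why tuning for one usually costs the other.

```python
# Illustrative only: per-query speed typically degrades as concurrency grows,
# while aggregate system throughput keeps rising until the hardware saturates.
illustrative_per_query_speed = {1: 120.0, 8: 95.0, 32: 60.0, 128: 25.0}  # tok/s/query

for concurrency, speed in illustrative_per_query_speed.items():
    system_throughput = speed * concurrency  # tok/s across all streams
    print(f"{concurrency:>4} concurrent queries: "
          f"{speed:6.1f} tok/s/query -> {system_throughput:8.1f} tok/s system")
```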
System Output Throughput
Tokens/sec • Higher is better
Peak Output Speed per Query
Tokens/sec/query • Higher is better
About Speed

Speed per query measures the generation rate for a single stream and is crucial for user experience in chatbots; above roughly 50 t/s, generation outpaces human reading speed.

System Throughput vs Output Speed
Pareto Frontier • Top Right is better
Understanding the Tradeoff

Systems often trade single-user speed for total system throughput. The ideal hardware sits in the top-right corner, offering both high capacity and fast individual responses.
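For readers who want to reconstruct the frontier from raw (throughput, speed) points, a minimal sketch follows; the sample points are invented for illustration:

```python
def pareto_frontier(points):
    """Return the points not dominated on both axes.

    points -- iterable of (system_throughput, per_query_speed) tuples.
    A point is on the frontier if no other point is at least as good on both
    axes and strictly better on one.
    """
    pts = sorted(points, reverse=True)      # sort by throughput, descending
    frontier, best_speed = [], float("-inf")
    for throughput, speed in pts:
        if speed > best_speed:              # faster than every higher-throughput point
            frontier.append((throughput, speed))
            best_speed = speed
    return frontier

# Invented example points (tok/s, tok/s/query):
print(pareto_frontier([(9000, 40), (7000, 90), (7500, 60), (4000, 85)]))
```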

Throughput & Speed vs. Concurrency
Scaling Performance
Cost per Million Tokens
Input + Output • USD • Lower is better
Cost Calculation

Derived from the hourly system rental price divided by the system's token throughput at the reference usage point, scaled to one million tokens.

Cost per 1M tokens = (Price per Hour ÷ Tokens per Hour) × 1,000,000
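A worked example of that calculation, assuming an 8-GPU system and placeholder price and throughput figures:

```python
def cost_per_million_tokens(price_per_gpu_hour: float,
                            num_gpus: int,
                            system_throughput_tok_s: float) -> float:
    """USD per one million (input + output) tokens.

    price_per_gpu_hour      -- on-demand rental price per GPU hour (USD)
    num_gpus                -- GPUs in the system (8 for the systems benchmarked here)
    system_throughput_tok_s -- system token throughput at the reference usage point
    """
    system_price_per_hour = price_per_gpu_hour * num_gpus
    tokens_per_hour = system_throughput_tok_s * 3600
    return system_price_per_hour / tokens_per_hour * 1_000_000

# Placeholder example: $3.00/GPU-hr on an 8-GPU system producing 10,000 tok/s.
print(f"${cost_per_million_tokens(3.00, 8, 10_000):.3f} per 1M tokens")
```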
End-to-End Latency vs. Concurrency
Seconds • Lower is better
End-to-End Latency

Total time to receive a full response. As concurrency rises (x-axis), requests queue up, increasing wait times.
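A simplified mental model (not the measured methodology): end-to-end latency is roughly time-to-first-token plus output length divided by per-query generation speed at that load. The figures below are placeholders:

```python
def e2e_latency_s(ttft_s: float, output_tokens: int, per_query_speed_tok_s: float) -> float:
    """Rough end-to-end latency estimate for one request (seconds).

    ttft_s                -- time to first token, including any queueing delay
    output_tokens         -- tokens generated for the response
    per_query_speed_tok_s -- generation speed of a single stream at the current load
    """
    return ttft_s + output_tokens / per_query_speed_tok_s

# Placeholder numbers: as concurrency rises, queueing pushes TTFT up and
# per-query speed down, so both terms grow.
print(e2e_latency_s(ttft_s=0.5, output_tokens=1000, per_query_speed_tok_s=90))  # light load
print(e2e_latency_s(ttft_s=4.0, output_tokens=1000, per_query_speed_tok_s=30))  # heavy load
```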

System & Benchmark Specifications

Section guide: Data Manifest

Purpose: Detailed configuration logs for every benchmark run to ensure reproducibility and transparency.

Model Name • System • Provider • Precision • TP/PP/DP • Framework • Date
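One way to represent a row of this manifest, with field names mirroring the columns above (the example values are placeholders, not a logged run):

```python
from dataclasses import dataclass

@dataclass
class BenchmarkRun:
    """One row of the data manifest, mirroring the columns above."""
    model_name: str   # e.g. "Llama 3.3 70B"
    system: str       # e.g. "8xH100"
    provider: str     # provider the run was executed on
    precision: str    # numeric format used for inference, e.g. "FP8"
    tp_pp_dp: str     # tensor / pipeline / data parallel degrees, e.g. "8/1/1"
    framework: str    # serving framework and version
    date: str         # date of the benchmark run (ISO 8601)

# Placeholder entry, not an actual logged run:
example = BenchmarkRun("Llama 3.3 70B", "8xH100", "ExampleCloud",
                       "FP8", "8/1/1", "ExampleFramework 1.0", "2025-01-01")
```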
