Granite 4.0 H Small (non-reasoning)

A fast, concise, and capable open-weight model from IBM.

An open-weight model from IBM that balances above-average intelligence with exceptional speed and remarkable conciseness, making it a strong choice for high-throughput text generation tasks.

IBM · Open License · 128k Context · Text Generation · Fast · Concise

Granite 4.0 H Small is a member of IBM's latest generation of open-weight models, designed to offer a compelling balance of performance, efficiency, and cost. Released under the permissive Apache 2.0 license, it represents a significant contribution to the open-source AI ecosystem, providing developers with a powerful tool that can be freely used, modified, and deployed. As its name suggests, this "Small" variant is optimized for efficiency, yet it punches above its weight class in key performance areas, particularly speed and conciseness.

The model's standout characteristic is its blistering output speed. Benchmarks show it generating text at over 340 tokens per second, placing it among the fastest models in its category. This level of throughput makes it an excellent candidate for real-time applications such as interactive chatbots, live content summarization, and high-volume data processing pipelines where rapid response is paramount. The speed is coupled with remarkable conciseness: in our tests, it used less than half as many tokens as the average model to answer the same questions, a trait that directly translates into lower operational costs, especially on platforms that charge more for output tokens.

Despite its focus on speed and efficiency, Granite 4.0 H Small does not significantly compromise on intelligence. It scores above average on the Artificial Analysis Intelligence Index compared to similarly sized non-reasoning models. This indicates a strong capability for tasks like summarization, classification, and question-answering within a given context. With a generous 128,000-token context window, the model can ingest and analyze vast amounts of information—equivalent to a 300-page book—in a single pass. This combination of a large context window, solid intelligence, and extreme speed makes it a versatile workhorse for a wide range of enterprise and developer use cases.

Scoreboard

Intelligence

23 (#22 / 55)

Scores above the class average of 20, indicating solid performance on our intelligence benchmarks for a non-reasoning model.

Output speed

340.2 tokens/s

Exceptionally fast, ranking #2 in its class. Ideal for applications requiring high throughput.

Input price

$0.06 / 1M tokens

More affordable than the class average of $0.10 for inputs, making it economical for context-heavy tasks.

Output price

$0.25 / 1M tokens

Slightly more expensive than the class average of $0.20 for outputs, a cost offset by its high conciseness.

Verbosity signal

5.2M tokens

Highly concise, using less than half the tokens of the average model (13M) on our tests.

Provider latency

8.82 seconds

Time to first token is high, suggesting a significant cold start penalty on the benchmarked serverless provider.

Technical specifications

Spec | Details
Model Owner | IBM
License | Open (Apache 2.0)
Context Window | 128,000 tokens
Model Family | Granite 4.0
Model Size | Small
Input Modalities | Text
Output Modalities | Text
Architecture | Hybrid Mamba-2/Transformer (mixture-of-experts)
Release Date | October 2025
Fine-Tuning | Supported (as an open model)

What stands out beyond the scoreboard

Where this model wins
  • Blazing Speed: Its output speed of over 340 tokens/second is elite, making it perfect for applications requiring rapid text generation like chatbots or live content creation.
  • Extreme Conciseness: The model is remarkably non-verbose, delivering answers with significantly fewer tokens than its peers. This directly translates to lower costs on output-heavy tasks.
  • Large Context Window: A 128k context window allows it to process and analyze very large documents, from legal contracts to entire codebases, without losing track of details.
  • Solid Intelligence: Despite its speed and 'Small' designation, it scores above average on intelligence benchmarks, proving it doesn't sacrifice capability for efficiency.
  • Open and Accessible: Released under an open license by IBM, it allows for greater flexibility, custom fine-tuning, and self-hosting, avoiding vendor lock-in.
Where costs sneak up
  • High Latency / Cold Starts: The benchmarked latency of over 8 seconds to first token suggests significant cold start times. This can be detrimental for user-facing, on-demand applications where immediate response is critical.
  • Relatively High Output Token Price: At $0.25 per million output tokens, it's more expensive than the average model in its class. Its natural conciseness helps offset this, but for verbose tasks, costs can escalate.
  • Not a Reasoning Specialist: While intelligent for its class, it's categorized as a non-reasoning model. It may struggle with complex, multi-step logical problems compared to larger, reasoning-focused models.
  • Input Processing Costs: While the input price is competitive, feeding its large 128k context window with dense information for every query can become a significant cost driver if not managed carefully.
  • Provider-Dependent Performance: Performance metrics like speed and latency are highly dependent on the API provider's infrastructure. The observed high latency might be specific to the benchmarked setup and not inherent to the model itself.

Provider pick

Granite 4.0 H Small was benchmarked on a single provider, Replicate. This gives us a clear snapshot of its performance on that specific platform, which is known for hosting a wide variety of open-weight models. The choice of provider significantly impacts real-world performance, especially for metrics like latency and throughput.

Priority | Pick | Why | Tradeoff to accept
Max Throughput | Replicate | Delivers an exceptional median output speed of over 340 tokens/second, making it a top choice for high-volume generation. | High time-to-first-token (latency) suggests cold start issues, making it less suitable for interactive, single-user sessions.
Cost-Effectiveness | Replicate | Offers a competitive input price. The model's inherent conciseness helps manage the higher output price, leading to good overall value. | The output price is still above average, so intentionally verbose use cases could become more expensive than alternatives.
Ease of Access | Replicate | Provides a simple, standardized API for a vast catalog of open models, including Granite 4.0 H Small, simplifying integration. | Performance is tied to a shared resource pool, which can lead to variability in latency and queue times during peak demand.
Lowest Latency | Self-Hosted / Dedicated | The benchmarked 8.82-second latency is too high for real-time interactive use. A dedicated instance would be required to eliminate cold starts. | Requires significant infrastructure management, setup costs, and technical expertise compared to a managed API.

Provider performance and pricing are subject to change. The metrics shown are based on benchmarks conducted by Artificial Analysis at a specific point in time. Your own results may vary based on workload, region, and provider capacity.

Real workloads cost table

Theoretical prices per million tokens are useful, but seeing costs for real-world tasks provides a more tangible understanding of a model's financial impact. Below are estimated costs for running Granite 4.0 H Small on Replicate for several common scenarios, using its pricing of $0.06/1M input and $0.25/1M output tokens.

Scenario | Input | Output | What it represents | Estimated cost
Article Summarization | 10,000 token article | 500 token summary | Content summarization for research or newsletters. | $0.00073
Chatbot Response | 1,500 token history | 150 token reply | A single turn in an automated customer service interaction. | $0.00013
Code Generation | 2,000 token context | 800 token function | A typical co-pilot style code generation task. | $0.00032
RAG Document Query | 100,000 token document | 300 token answer | Querying a large document using Retrieval-Augmented Generation. | $0.00608
Bulk Data Classification | 500 token product review | 10 token category | A single item in a large-scale data processing pipeline. | $0.00003

The model's conciseness and low input cost make it highly economical for tasks involving large inputs and small outputs, like RAG and classification. Even for balanced tasks like code generation, the costs remain very low. The higher output price is effectively mitigated by the model's tendency to produce short, relevant responses.
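
The arithmetic behind these estimates is simple enough to sanity-check in a few lines. The sketch below reproduces the table's figures from the two benchmarked prices.

```python
IN_PRICE, OUT_PRICE = 0.06, 0.25  # Replicate pricing, $ per 1M tokens

def cost(in_tokens: int, out_tokens: int) -> float:
    """Dollar cost of a single request at the benchmarked rates."""
    return (in_tokens * IN_PRICE + out_tokens * OUT_PRICE) / 1_000_000

print(cost(10_000, 500))   # article summarization: ≈ $0.00073
print(cost(100_000, 300))  # RAG document query:    ≈ $0.00608
print(cost(500, 10))       # bulk classification:   ≈ $0.00003
```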

How to control cost (a practical playbook)

While Granite 4.0 H Small is reasonably priced, its cost structure—with cheap inputs and more expensive outputs—creates opportunities for optimization. Proactive strategies can help you maximize its value and minimize spend, especially when scaling up usage.

Leverage Its Conciseness

This model's greatest cost-saving feature is its tendency to be brief. You can lean into this strength to manage costs associated with its higher output price.

  • Prompt for Brevity: Explicitly ask for concise answers in your prompts (e.g., "Summarize in one sentence," or "Answer with only 'Yes' or 'No'").
  • Structure Output: Use few-shot prompting or instructions to guide the model into a structured, minimal format like JSON, which reduces extraneous conversational text; the sketch below shows both tactics together.
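
A minimal sketch of both tactics using the Replicate Python client. The model slug "ibm-granite/granite-4.0-h-small" and the input parameter names ("prompt", "max_tokens") are assumptions here; parameter names vary by model, so check the model page in Replicate's catalog.

```python
import json
import replicate  # pip install replicate; reads REPLICATE_API_TOKEN from the environment

# Ask for a strict, minimal JSON answer to suppress conversational filler.
PROMPT = (
    'Classify the sentiment of the review below.\n'
    'Respond with ONLY this JSON object, nothing else: {"sentiment": "positive" | "negative" | "neutral"}\n\n'
    'Review: The battery died after two days.'
)

output = replicate.run(
    "ibm-granite/granite-4.0-h-small",           # assumed slug; verify in the catalog
    input={"prompt": PROMPT, "max_tokens": 32},  # hard cap on billable output tokens
)
result = json.loads("".join(output))             # language models stream text chunks
print(result["sentiment"])
```
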
Optimize Context Window Usage

A 128k context window is powerful but can be expensive if filled unnecessarily. Since input tokens are cheap, the primary goal is to avoid redundant processing and ensure the context is effective.

  • Pre-process Inputs: Before sending a large document, use a cheaper model or an algorithm to extract the most relevant sections.
  • Manage Chat History: For chatbots, implement a sliding window or summarization strategy for the conversation history rather than feeding the entire transcript every time (a minimal sliding-window sketch follows this list).
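
The sliding-window version is the simplest to implement. The sketch below keeps the system prompt plus only the most recent turns under a token budget; it approximates tokens at roughly four characters each, which a real tokenizer would replace.

```python
def trim_history(system_prompt: str, turns: list[str], budget_tokens: int = 4000) -> list[str]:
    """Keep the system prompt plus the newest turns that fit under the token budget."""
    approx = lambda text: len(text) // 4  # rough heuristic: ~4 characters per token
    kept: list[str] = []
    used = approx(system_prompt)
    for turn in reversed(turns):          # walk from the newest turn backwards
        if used + approx(turn) > budget_tokens:
            break
        kept.append(turn)
        used += approx(turn)
    return [system_prompt] + kept[::-1]   # restore chronological order
```
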
Mitigate High Latency

The observed 8.8-second latency is likely a "cold start" problem on serverless infrastructure. This can be a deal-breaker for interactive applications but can be managed.

  • Use Provisioned Concurrency: Some providers offer a paid tier where an instance of the model is kept "warm" for you, eliminating cold starts at the cost of an hourly fee.
  • Batch Requests: For non-interactive workloads, group many requests into a single batch. The cold start penalty is paid only once for the entire batch, significantly improving overall throughput.
  • Implement a Keep-Alive: For low-traffic applications, a simple script that pings the model endpoint every few minutes can prevent the instance from shutting down, as sketched below.
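
A keep-alive can be as small as the sketch below. The slug, the one-token ping, and the four-minute interval are all assumptions to tune against your provider's idle-shutdown window, and note that the pings themselves incur a small daily cost.

```python
import time
import replicate  # pip install replicate; reads REPLICATE_API_TOKEN from the environment

while True:
    replicate.run(
        "ibm-granite/granite-4.0-h-small",          # assumed slug; verify in the catalog
        input={"prompt": "ping", "max_tokens": 1},  # smallest possible billable request
    )
    time.sleep(240)  # 4 minutes; tune to stay inside the idle-shutdown window
```
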
Consider Self-Hosting for Scale

As an open-weight model, Granite 4.0 H Small can be hosted on your own infrastructure. This shifts the cost model from pay-per-token to a fixed cost for hardware and maintenance.

  • When it makes sense: If you have a very high, consistent volume of requests, the cost of running your own server can be lower than the cumulative cost of API calls; the back-of-the-envelope sketch after this list shows how to find the break-even point.
  • The Tradeoff: This approach requires significant technical expertise in machine learning operations (MLOps) to manage deployment, scaling, and hardware maintenance.
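
A quick comparison makes the break-even point concrete. Every number below other than the benchmarked per-token prices is an illustrative placeholder: replace the $2.50/hour GPU rate and the per-request token counts with your own figures.

```python
API_IN, API_OUT = 0.06 / 1e6, 0.25 / 1e6  # benchmarked Replicate prices, $ per token
GPU_HOURLY = 2.50                         # assumed dedicated-GPU rate, $ per hour

def api_cost(requests: int, in_tok: int = 2000, out_tok: int = 300) -> float:
    """Monthly API spend for a given request volume (token counts are placeholders)."""
    return requests * (in_tok * API_IN + out_tok * API_OUT)

monthly_gpu = GPU_HOURLY * 24 * 30        # always-on instance: $1,800/month
for reqs in (100_000, 1_000_000, 10_000_000):
    print(f"{reqs:>10,} req/month: API ${api_cost(reqs):>8,.0f} vs self-hosted ${monthly_gpu:,.0f}")
```

At these placeholder numbers the API remains cheaper until roughly nine million requests per month, which is why self-hosting only pays off for sustained, high-volume workloads.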

FAQ

What is Granite 4.0 H Small?

Granite 4.0 H Small is an open-weight large language model developed and released by IBM. It is part of the Granite 4.0 family and is designed to be efficient, fast, and highly concise while maintaining above-average intelligence for its size class. Its open license (Apache 2.0) allows for broad use, including commercial applications.

What is this model good for?

This model excels at tasks where speed, a large context window, and conciseness are important. Key use cases include:

  • RAG (Retrieval-Augmented Generation): Querying large documents provided in its 128k context window.
  • Summarization: Quickly creating brief summaries of long articles or reports.
  • Data Classification and Extraction: Processing and categorizing large volumes of text data efficiently.
  • Real-time Chatbots: Powering conversational agents where rapid, concise responses are valued.
What are its main limitations?

The primary limitations are its high latency on serverless platforms (cold starts), its status as a non-reasoning model (making it less suitable for complex logic), and an output token price that is higher than the class average. Performance can also vary depending on the hosting provider.

How does it compare to models like Llama 3 8B?

Granite 4.0 H Small and Llama 3 8B are in a similar size class of open models. Granite's key differentiators are its significantly higher output speed and greater conciseness. Llama 3 8B is widely regarded as a very strong all-around model and may have an edge in general knowledge and reasoning capabilities, while Granite is more specialized for high-throughput, efficient text generation.

Why is the latency so high in the benchmark?

The 8.82-second latency to first token is characteristic of a "cold start" on a serverless GPU platform like Replicate. When a model hasn't been used for a few minutes, the server unloads it from memory. The first request to the idle model must wait for the system to load the multi-gigabyte model files onto a GPU, which causes a significant delay. Subsequent requests are fast until the model becomes idle again.

What does the "H" in the name stand for?

The 'H' stands for "hybrid." Granite 4.0 H models pair Mamba-2 state-space layers with conventional transformer attention layers, an architecture IBM credits for the series' speed and low memory footprint. Non-H variants in the family, such as Granite 4.0 Micro, use a standard transformer design.

