gpt-oss-20B (low)

A high-speed, highly intelligent open-source model with a massive context window.

An open-source 20B parameter model from OpenAI that delivers exceptional speed and top-tier intelligence at a competitive price point, making it a strong all-rounder.

Open Source · 131k Context · Text Generation · High Speed · Strong Intelligence · 20B Parameters

The gpt-oss-20B (low) model emerges as a formidable contender in the open-source landscape, offering a compelling blend of performance, intelligence, and cost-effectiveness. With approximately 20 billion parameters, it occupies a sweet spot, providing sophisticated reasoning and generation capabilities without the overhead of much larger models. Its performance metrics reveal a model that is not just a jack-of-all-trades but a master of several, particularly in speed and raw intelligence, where it ranks among the top performers in its class.

Scoring an impressive 44 on the Artificial Analysis Intelligence Index, gpt-oss-20B (low) significantly outperforms the average score of 26 for comparable models, placing it in the top echelon (#8 out of 84). This indicates a strong aptitude for complex tasks like reasoning, instruction following, and creative generation. Interestingly, it achieves this high score with relative conciseness, generating 15 million tokens during the evaluation compared to the class average of 23 million. This suggests an efficient model that can deliver high-quality responses without unnecessary verbosity—a crucial factor for managing output costs and improving user experience.

Perhaps its most eye-catching feature is its speed. When served by optimized inference providers like Together.ai and Groq, it achieves output speeds exceeding 900 tokens per second, a rate that rivals or even surpasses many smaller, specialized models. This makes it exceptionally well-suited for real-time applications such as interactive chatbots, live coding assistants, and rapid content creation. This speed, combined with a massive 131,000-token context window, unlocks new possibilities for processing and analyzing long documents, maintaining coherent, extended conversations, and performing complex, context-aware tasks that were previously impractical.

From a financial perspective, gpt-oss-20B (low) is positioned as a high-value option. While not the absolute cheapest on the market, its pricing is moderate and highly competitive given its performance profile. With input costs around $0.07 per million tokens and output at $0.20, it provides access to top-tier capabilities at a fraction of the cost of leading proprietary models. This balance of power, speed, and price makes gpt-oss-20B (low) a strategic choice for developers and businesses looking to build advanced AI features without committing to the high costs and closed ecosystems of flagship commercial offerings.

Scoreboard

  • Intelligence: 44 (#8 / 84). Scores 44 on the Artificial Analysis Intelligence Index, well above the average of 26 for comparable models.
  • Output speed: 245.9 tokens/s. Notably fast, ranking #4 out of 84 models benchmarked; top providers achieve speeds over 900 t/s.
  • Input price: $0.07 / 1M tokens. Moderately priced for input, ranking #32 out of 84 models and offering excellent value.
  • Output price: $0.20 / 1M tokens. Also moderately priced, ranking #35 out of 84; the roughly 3x cost relative to input is a key factor to manage.
  • Verbosity signal: 15M tokens. Fairly concise, generating significantly fewer tokens than the class average of 23M during intelligence tests.
  • Provider latency: 0.15 seconds. Best-in-class latency is achievable, with top providers like Groq and Google Vertex reaching as low as 0.15s time-to-first-token.

Technical specifications

Model Name: gpt-oss-20B (low)
Owner: OpenAI
License: Open
Parameters: ~20 Billion
Context Window: 131,000 tokens
Input Modalities: Text
Output Modalities: Text
Knowledge Cutoff: May 2025
Intelligence Index Score: 44
Intelligence Index Rank: #8 / 84
Default Output Speed: 245.9 tokens/s
Default Input Price: $0.07 / 1M tokens
Default Output Price: $0.20 / 1M tokens

What stands out beyond the scoreboard

Where this model wins
  • Elite Speed: With top providers like Together.ai and Groq, it delivers truly exceptional output speeds (900+ t/s), making it ideal for real-time, interactive applications.
  • High Intelligence: Ranks in the top 10% of its class on the Intelligence Index, demonstrating strong reasoning and instruction-following capabilities for a model of its size.
  • Massive Context Window: The 131k token context window allows for deep analysis of long documents, extended conversations, and complex multi-shot prompting.
  • Cost-Performance Ratio: Offers a fantastic balance of high-end performance characteristics at a moderate price point, providing excellent value compared to both proprietary and other open-source models.
  • Provider Diversity: Supported by a wide range of API providers, from hyperscalers like Google and Amazon to specialized, speed-focused startups, giving users ample choice.

Where costs sneak up
  • Output Price Multiplier: Output tokens are nearly three times more expensive than input tokens. Generative tasks that produce long responses can quickly become more costly than expected.
  • Context Window Trap: While powerful, consistently filling the 131k context window will lead to high per-request costs, especially for input-heavy tasks like RAG.
  • Provider Price Variance: The cost to run this model can vary significantly between providers. Choosing a provider based on speed alone may result in a much higher bill.
  • Real-Time Application Costs: The model's high speed encourages its use in high-throughput, real-time scenarios. At scale, the volume of calls can lead to substantial cumulative costs if not monitored.
  • Inefficient Prompting: Failing to engineer prompts for conciseness can inflate costs due to the higher price of output tokens.

Provider pick

Choosing the right API provider for gpt-oss-20B (low) is critical, as performance and cost can vary dramatically. Your ideal choice depends entirely on whether your primary goal is minimizing cost, maximizing throughput, achieving the lowest possible latency, or finding a balanced, all-around option. We've benchmarked the leading providers to help you make an informed decision.

  • Lowest Blended Price: Novita. At a blended price of just $0.07 per million tokens, Novita is the undisputed cost leader for running this model at scale. Tradeoff: its output speed (246 t/s) and latency are solid but fall short of the top-tier speed specialists.
  • Maximum Speed: Together.ai. Delivers the highest output throughput at a blistering 975 tokens/second, making it the top choice for bulk processing and high-volume generation. Tradeoff: slightly more expensive than the absolute cheapest options, and its latency isn't the lowest available.
  • Lowest Latency: Groq. Tied for the lowest time-to-first-token (TTFT) at just 0.15 seconds, this is the pick for the most responsive, real-time user experiences. Tradeoff: while extremely fast in output (933 t/s), its pricing is not as competitive as budget-focused providers.
  • Balanced Performance: Lightning AI. Offers a strong middle ground of very low price ($0.09/M), strong speed (312 t/s), and low latency (0.41s), making it a great default choice for many use cases. Tradeoff: not the absolute number one in any single category, but a versatile all-rounder.
  • Enterprise Choice: Google Vertex AI. Provides the reliability, security, and support of a major cloud platform, and matches Groq for the lowest latency (0.15s). Tradeoff: one of the more expensive providers, reflecting the cost of the enterprise-grade ecosystem and support.

Note: Performance and pricing data are subject to change. Benchmarks reflect point-in-time analysis. The 'Blended Price' is a weighted average and may not reflect your exact costs.

Real workloads cost table

To understand the real-world cost implications of using gpt-oss-20B (low), let's model a few common scenarios. These estimates are based on the average pricing of $0.07 per 1M input tokens and $0.20 per 1M output tokens. Note how the cost balance shifts depending on whether the task is input-heavy or output-heavy.

  • RAG Chatbot Query (15,000 input / 500 output tokens): a single user query where extensive documentation is injected as context. Estimated cost: ~$0.00115.
  • Long Document Summary (100,000 input / 2,000 output tokens): summarizing a large PDF report or legal document into key takeaways. Estimated cost: ~$0.00740.
  • Code Generation Task (1,000 input / 3,000 output tokens): generating a Python script or a complex SQL query from a detailed prompt. Estimated cost: ~$0.00067.
  • Content Creation (500 input / 8,000 output tokens): writing a draft for a blog post or marketing email based on a short outline. Estimated cost: ~$0.00164.
  • Data Extraction (JSON) (20,000 input / 1,000 output tokens): parsing an unstructured text document to extract structured data. Estimated cost: ~$0.00160.

The key takeaway is the significant impact of the input-to-output price ratio. Tasks that generate a lot of text, like content creation and code generation, see their costs driven primarily by the $0.20/M output price. Conversely, input-heavy tasks like RAG are more sensitive to the $0.07/M input price. Optimizing for either input length or output verbosity is the most direct path to cost management.
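
To make the arithmetic explicit, here is a minimal Python sketch that reproduces these estimates. The prices are hard-coded from this article, and the helper function is hypothetical, not any provider's billing API:

```python
# Per-token prices quoted in this article (subject to change).
INPUT_PRICE = 0.07 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 0.20 / 1_000_000  # dollars per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of a single request."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

scenarios = {
    "RAG Chatbot Query":      (15_000, 500),
    "Long Document Summary":  (100_000, 2_000),
    "Code Generation Task":   (1_000, 3_000),
    "Content Creation":       (500, 8_000),
    "Data Extraction (JSON)": (20_000, 1_000),
}

for name, (inp, out) in scenarios.items():
    print(f"{name}: ~${estimate_cost(inp, out):.5f}")
```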

How to control cost (a practical playbook)

Effectively managing the cost of gpt-oss-20B (low) involves a multi-faceted strategy. While the model offers great value, its powerful features like the large context window and high speed can lead to unexpected expenses if not handled carefully. Here are several key strategies to keep your operational costs in check.

Optimize Your Provider Choice

Your choice of API provider is the single biggest lever on your cost and performance. Don't default to one provider for all tasks; a minimal routing sketch follows the list below.

  • For background tasks: Use a cost-leader like Novita where latency and top speed are not critical. This is ideal for batch processing, report generation, or asynchronous workflows.
  • For user-facing features: Use a low-latency provider like Groq or Google Vertex for chatbots and interactive tools where responsiveness is paramount.
  • For general purpose use: A balanced provider like Lightning AI often provides the best blend of cost and performance for a wide range of applications.
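
As a concrete illustration, here is a minimal routing sketch. It assumes each provider exposes an OpenAI-compatible endpoint; the base URLs, environment-variable names, and model ID are placeholders to verify against each provider's documentation:

```python
# Route workloads to different OpenAI-compatible providers by priority.
import os
from openai import OpenAI

# Illustrative placeholder values; check each provider's docs.
PROVIDERS = {
    "cheap":       {"base_url": "https://api.novita.ai/v3/openai", "key_env": "NOVITA_API_KEY"},
    "fast":        {"base_url": "https://api.together.xyz/v1",     "key_env": "TOGETHER_API_KEY"},
    "low_latency": {"base_url": "https://api.groq.com/openai/v1",  "key_env": "GROQ_API_KEY"},
}

def client_for(priority: str) -> OpenAI:
    cfg = PROVIDERS[priority]
    return OpenAI(base_url=cfg["base_url"], api_key=os.environ[cfg["key_env"]])

batch_client = client_for("cheap")       # background jobs: prioritize cost
chat_client = client_for("low_latency")  # user-facing chat: prioritize latency

reply = chat_client.chat.completions.create(
    model="gpt-oss-20b",  # exact model ID varies by provider
    messages=[{"role": "user", "content": "Hello!"}],
)
print(reply.choices[0].message.content)
```
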
Actively Manage the Context Window

The 131k context window is a powerful tool but also a significant cost driver. Sending 100k tokens on every call is rarely necessary and always expensive; a summarization sketch follows the list below.

  • Use RAG intelligently: Instead of passing entire documents, use an efficient retrieval step to inject only the most relevant chunks of text into the prompt.
  • Summarize conversations: For long-running chats, create a summary of the conversation history to pass along instead of the full transcript. This keeps the context relevant while drastically cutting token count.
  • Prune your prompts: Regularly review system prompts and few-shot examples to ensure they are as concise as possible while still being effective.
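
Here is the summarization idea as a minimal sketch. It assumes an OpenAI-compatible `client` (as in the routing sketch above); the token threshold and the number of verbatim turns kept are illustrative choices:

```python
MAX_HISTORY_TOKENS = 8_000  # well below the 131k window, chosen for cost

def rough_token_count(messages) -> int:
    # Crude heuristic (~4 characters per token); use a real tokenizer in production.
    return sum(len(m["content"]) for m in messages) // 4

def compact_history(client, messages):
    """Replace older turns with a model-written summary once the history grows."""
    if rough_token_count(messages) <= MAX_HISTORY_TOKENS:
        return messages
    old, recent = messages[:-6], messages[-6:]  # keep the last few turns verbatim
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.chat.completions.create(
        model="gpt-oss-20b",  # exact model ID varies by provider
        messages=[{"role": "user",
                   "content": "Summarize this conversation in under 200 words:\n" + transcript}],
        max_tokens=300,  # the summary itself is a small, bounded output cost
    ).choices[0].message.content
    return [{"role": "system", "content": "Conversation so far: " + summary}] + recent
```
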
Control Output Verbosity

With output tokens costing nearly three times as much as input tokens, controlling the model's verbosity is crucial for managing expenses, especially in generative tasks. A minimal example follows the list below.

  • Use prompt engineering: Explicitly instruct the model to be concise. Phrases like "Be brief," "Answer in one paragraph," or "Use bullet points" can be very effective.
  • Set max_tokens: Use the `max_tokens` parameter in your API call to set a hard limit on the output length, preventing the model from generating excessively long and costly responses.
  • Refine with a cheaper model: For some workflows, you can use gpt-oss-20B (low) for the core reasoning and then use a smaller, cheaper model to refine or expand the text, shifting the token generation burden to a less expensive resource.
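
For example, a short sketch combining both levers, again assuming an OpenAI-compatible `client`:

```python
# Cap output spend with max_tokens plus an explicit brevity instruction.
response = client.chat.completions.create(
    model="gpt-oss-20b",  # exact model ID varies by provider
    messages=[
        {"role": "system", "content": "Be brief. Answer in one paragraph."},
        {"role": "user", "content": "Explain what a 131k context window means."},
    ],
    max_tokens=256,  # hard cap: at $0.20/M, output cost is bounded at ~$0.00005
)
print(response.choices[0].message.content)
```
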
Implement Caching and Batching

Reduce redundant API calls and improve throughput with smart architectural choices; a simple caching sketch follows the list below.

  • Cache responses: For common or identical queries, store the result in a cache (like Redis). Serving a cached response is dramatically faster and cheaper than making a new API call. This is highly effective for informational websites or FAQ bots.
  • Batch requests: When you have multiple, non-interactive requests to process, batch them together into a single API call if the provider's API supports it. This can reduce network overhead and may unlock better pricing.
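
A simple in-memory version of the caching idea, assuming an OpenAI-compatible `client`; in production you would typically swap the dict for a shared store such as Redis:

```python
import hashlib

_cache: dict[str, str] = {}

def cached_completion(client, prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:
        return _cache[key]  # cache hit: no API cost, near-zero latency
    answer = client.chat.completions.create(
        model="gpt-oss-20b",  # exact model ID varies by provider
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    _cache[key] = answer
    return answer
```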

FAQ

What is gpt-oss-20B (low)?

gpt-oss-20B (low) is an open-source large language model from OpenAI with approximately 20 billion parameters. It is designed to provide a strong balance of high intelligence, extremely fast inference speed, and moderate cost. It features a large 131,000-token context window and is proficient in a wide range of text-based tasks, from generation and summarization to complex reasoning.

What does the "(low)" in the name signify?

gpt-oss models can be run at configurable reasoning-effort levels (low, medium, and high), which control how many reasoning tokens the model spends before producing its final answer. The "(low)" suffix indicates that this variant was benchmarked at the low reasoning-effort setting. Lower effort trades some reasoning depth for:

  • Speed: fewer reasoning tokens means the final answer starts and finishes sooner, which is consistent with the elite throughput and latency figures reported above.
  • Cost: less internal deliberation means fewer billable output tokens, which also matches this variant's relatively concise 15M-token footprint in the intelligence evaluation.

If your workload needs deeper reasoning, the same weights can be run at the medium or high setting, at the cost of speed and output spend.
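
How the effort level is selected depends on the serving stack. A minimal, hedged sketch, assuming an OpenAI-compatible provider that reads the effort level from a "Reasoning:" line in the system prompt (some providers instead expose a dedicated parameter; check your provider's docs):

```python
from openai import OpenAI

client = OpenAI(base_url="https://example-provider/v1", api_key="...")  # placeholders

response = client.chat.completions.create(
    model="gpt-oss-20b",  # exact model ID varies by provider
    messages=[
        {"role": "system", "content": "Reasoning: low"},  # request the low effort level
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)
```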

How does it compare to other open-source models?

gpt-oss-20B (low) positions itself very competitively. Compared to other models in the 15-30B parameter range, it stands out for its combination of top-tier intelligence (ranking #8/84) and elite speed on optimized hardware. While some models might be slightly cheaper, they often don't match its intelligence score or throughput. Conversely, models that are more intelligent are typically much larger, slower, and more expensive to run. Its large context window is also a significant advantage over many other models in its class.

What are the best use cases for this model?

Given its profile, gpt-oss-20B (low) excels in a variety of applications:

  • Interactive Chatbots: Its low latency and high throughput make it perfect for creating responsive, engaging conversational AI.
  • Long-Document Q&A: The 131k context window is ideal for feeding large documents (e.g., legal contracts, research papers, financial reports) and asking detailed questions.
  • Real-Time Coding Assistants: The model's speed allows it to provide instant code suggestions and completions within an IDE.
  • High-Volume Content Generation: Its fast output speed makes it suitable for generating marketing copy, product descriptions, or social media posts at scale.

Who is the fastest provider for gpt-oss-20B (low)?

Based on our benchmarks, Together.ai offers the highest output speed (throughput) at 975 tokens per second. For the lowest latency (time to first token), Groq and Google Vertex are tied for the lead at just 0.15 seconds. Your choice depends on whether you need to generate a lot of text quickly (throughput) or get the first word back as fast as possible (latency).

Who is the cheapest provider for gpt-oss-20B (low)?

The most cost-effective provider is Novita, with a blended price of $0.07 per million tokens. They also offer the cheapest input token price at $0.04/M. This makes them an excellent choice for cost-sensitive applications, especially background tasks where maximum speed is not a requirement.

