EXAONE 4.0 32B (Non-reasoning)

A highly intelligent open model with a premium price tag.

From LG AI Research, this 32-billion parameter model delivers top-tier intelligence and a massive context window, but its high cost and moderate speed require careful consideration.

32B Parameters · 131k Context · Open License · Text Generation · High Intelligence

EXAONE 4.0 32B (Non-reasoning) is a formidable entry into the open-weight model landscape, developed by the dedicated team at LG AI Research. As a 32-billion parameter model, it occupies a competitive middle ground in terms of size, but its performance on intelligence benchmarks sets it apart. With a score of 30 on the Artificial Analysis Intelligence Index, it significantly outperforms the average score of 20 for comparable models, placing it in the upper echelon for tasks that require deep knowledge and nuanced text generation.

This model is not designed for complex, multi-step reasoning. The "Non-reasoning" designation clarifies its focus: it excels at knowledge-intensive tasks like advanced question-answering, detailed summarization, and high-quality content creation. Its massive 131k-token context window is a key enabler for these use cases, allowing it to process and reference entire research papers, legal documents, or extensive codebases in a single prompt. This makes it a powerful tool for applications built on Retrieval-Augmented Generation (RAG), where providing large amounts of context is crucial for generating accurate and relevant responses.
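
As a rough illustration of that RAG pattern, the sketch below packs pre-ranked retrieved chunks into a single prompt under a token budget. It is a minimal Python sketch, not the model's actual tokenizer: token counts are approximated at four characters per token.

    # Minimal sketch: pack retrieved chunks into one large prompt while staying
    # under the 131k-token window. Token counts use a rough 4-characters-per-token
    # heuristic, not EXAONE's real tokenizer.
    CONTEXT_WINDOW = 131_072
    RESERVED_FOR_OUTPUT = 4_096  # leave headroom for the generated answer

    def approx_tokens(text: str) -> int:
        return len(text) // 4

    def build_rag_prompt(question: str, chunks: list[str]) -> str:
        budget = CONTEXT_WINDOW - RESERVED_FOR_OUTPUT - approx_tokens(question)
        selected = []
        for chunk in chunks:  # chunks assumed pre-ranked by relevance
            cost = approx_tokens(chunk)
            if cost > budget:
                break
            selected.append(chunk)
            budget -= cost
        context = "\n\n---\n\n".join(selected)
        return f"Use only the context below to answer.\n\n{context}\n\nQuestion: {question}"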

However, this premium intelligence comes at a literal premium. EXAONE 4.0 32B is one of the more expensive open-weight models available via API. Its input price of $0.60 per million tokens is six times the class average, and its output price of $1.00 is five times the average. This pricing strategy positions it as a high-end option for developers who prioritize top-tier generative quality and are willing to invest accordingly. The total cost to run our standard intelligence evaluation, $68.69, serves as a stark reminder of how quickly expenses can accumulate with intensive use.

Performance metrics beyond intelligence and cost present a mixed picture. Its output speed of 88 tokens per second is respectable for a model of its size but falls slightly below the class average of 93 tokens per second. This makes it fast enough for many applications but potentially less suitable for highly interactive, real-time experiences where every millisecond counts. Furthermore, the model exhibits a tendency towards verbosity, generating more tokens on average than its peers. While this can sometimes lead to more detailed answers, it also directly increases output costs. Consequently, prospective users must weigh EXAONE's impressive intelligence and context capabilities against its significant cost and moderate performance profile.

Scoreboard

  • Intelligence: 30 (ranked 8 of 55). Well above the class average of 20, placing it in the top 15% for intelligence among comparable models.
  • Output speed: 88.2 tokens/s. Slightly slower than the class average of 93 tokens/s.
  • Input price: $0.60 per 1M tokens. Six times the class average of $0.10.
  • Output price: $1.00 per 1M tokens. Five times the class average of $0.20.
  • Verbosity signal: 15M tokens. More verbose than the class average of 13M tokens during intelligence testing, which can increase output costs.
  • Provider latency: 0.38 seconds. A good time-to-first-token, ensuring a responsive feel for interactive use cases.

Technical specifications

  • Owner: LG AI Research
  • License: Open License (Non-commercial)
  • Model family: EXAONE
  • Parameters: 32 billion
  • Context window: 131,072 tokens
  • Input modality: Text
  • Output modality: Text
  • Architecture: Decoder-only Transformer
  • Specialization: Non-reasoning, knowledge-intensive tasks
  • Benchmarked provider: FriendliAI

What stands out beyond the scoreboard

Where this model wins
  • Top-Tier Intelligence: With an intelligence score of 30, it excels at tasks requiring nuanced understanding, making it a strong choice for high-quality content generation and complex Q&A.
  • Massive Context Window: The 131k-token context length is a significant advantage for processing and analyzing long documents, enabling sophisticated RAG and summarization applications that shorter-context models cannot handle.
  • Open and Transparent: As an open-weight model from a major research institution, it offers a powerful alternative to closed, proprietary systems, allowing for greater scrutiny and potential for fine-tuning.
  • Strong Corporate Backing: Developed and maintained by LG AI Research, the model benefits from a foundation of dedicated corporate R&D, suggesting reliability and future development.
Where costs sneak up
  • Premium Token Pricing: Both input ($0.60/M) and output ($1.00/M) prices are substantially higher than the average for open-weight models, making it a costly choice for high-volume applications.
  • Cost of Large Context: While the 131k context window is a powerful feature, filling it is not free. A single prompt using the full context costs about $0.08 in input tokens alone, and at thousands of requests per day that becomes a significant line item.
  • Higher Verbosity: The model's tendency to produce longer outputs than average directly translates to higher costs, as you pay more per generated response.
  • Moderate Speed: Its output speed is slightly below average. For applications requiring high throughput or real-time generation, this can lead to longer processing times and a less snappy user experience.
  • High Evaluation Cost: The benchmark evaluation cost of over $68 highlights how quickly expenses can escalate during development, testing, and production use.

Provider pick

EXAONE 4.0 32B is currently available as a serverless API endpoint through FriendliAI, which specializes in optimizing and serving large language models. As the sole benchmarked provider, FriendliAI is the go-to choice for accessing this model's capabilities without managing the underlying infrastructure.

  • Best Overall: FriendliAI. Why: as the exclusive benchmarked provider, FriendliAI offers a fully managed, optimized API for EXAONE 4.0 32B. Tradeoff: no other providers are available to compare on price or performance.
  • Fastest Throughput: FriendliAI. Why: delivers a solid 88 tokens per second, strong for a model of this size and complexity. Tradeoff: the speed is slightly below the class average and may not be sufficient for the most demanding real-time applications.
  • Lowest Latency: FriendliAI. Why: achieves a quick 0.38-second time-to-first-token, ensuring a responsive start for interactive sessions. Tradeoff: this initial responsiveness is paired with a high per-token cost that accrues as generation continues.
  • Lowest Price: FriendliAI. Why: as the only available API provider, FriendliAI sets the market price by default. Tradeoff: the price is significantly higher than nearly all other open-weight models in its class.

Provider analysis is based on benchmark data collected by Artificial Analysis. The market for LLM APIs is dynamic, and provider availability and pricing may change.

Real workloads cost table

To understand the real-world financial impact of EXAONE 4.0 32B's pricing, let's estimate the cost for several common tasks. These scenarios illustrate how the high input and output token prices, combined with its verbosity, affect the bottom line at scale.

  • Summarize a long report: 15,000 input tokens (approx. 30 pages), 1,000 output tokens. Represents academic or business document analysis. Estimated cost: ~$0.01 per summary.
  • Draft a marketing blog post: 500 input tokens (a detailed prompt), 2,000 output tokens. Represents content creation and copywriting. Estimated cost: ~$0.0023 per post.
  • Power a chatbot session: 2,500 input tokens (10 user turns), 2,000 output tokens. Represents interactive customer support or Q&A. Estimated cost: ~$0.0035 per session.
  • Analyze a financial document: 65,000 input tokens (half the context window), 2,500 output tokens. Represents high-context RAG for financial insights. Estimated cost: ~$0.0415 per analysis.

While the cost per individual task appears low, the model's premium pricing becomes a critical factor at scale. Applications that rely heavily on large context inputs, such as document analysis, will see costs escalate quickly due to the expensive $0.60/M input token price. A service handling thousands of these requests per day would face substantial operational expenses.
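
For budget planning, the arithmetic behind these estimates is simple enough to script. A minimal Python sketch using the list prices above:

    # Reproduce the per-task estimates from EXAONE 4.0 32B's list prices.
    INPUT_PRICE = 0.60 / 1_000_000   # USD per input token
    OUTPUT_PRICE = 1.00 / 1_000_000  # USD per output token

    def task_cost(input_tokens: int, output_tokens: int) -> float:
        return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

    print(f"Report summary:     ${task_cost(15_000, 1_000):.4f}")  # ~$0.0100
    print(f"Blog post:          ${task_cost(500, 2_000):.4f}")     # ~$0.0023
    print(f"Chatbot session:    ${task_cost(2_500, 2_000):.4f}")   # ~$0.0035
    print(f"Financial analysis: ${task_cost(65_000, 2_500):.4f}")  # ~$0.0415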

How to control cost (a practical playbook)

Given its high cost, effectively managing expenses is crucial when deploying EXAONE 4.0 32B. A proactive approach to cost optimization can help you leverage its powerful intelligence without breaking your budget. The following strategies are essential for any team building with this model.

Control Verbosity with Prompting

Since the model tends to be verbose and output tokens are expensive, explicitly instructing the model to be concise can yield significant savings. Add phrases to your prompts like:

  • "Provide a brief, one-paragraph summary."
  • "Answer in three sentences or less."
  • "Be direct and concise."

This reduces the number of output tokens you pay for and can also improve response speed.
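
A minimal sketch of a capped request, assuming an OpenAI-compatible chat endpoint; the base URL and model ID below are placeholders to verify against FriendliAI's documentation:

    # Sketch: cap verbosity in the prompt and at the API level.
    # base_url and model are placeholder values, not confirmed identifiers.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.friendli.ai/serverless/v1",  # placeholder endpoint
        api_key="YOUR_API_KEY",
    )

    response = client.chat.completions.create(
        model="exaone-4.0-32b",  # placeholder model ID
        messages=[
            {"role": "system", "content": "Be direct and concise. Answer in three sentences or less."},
            {"role": "user", "content": "Summarize the key risks in the attached report."},
        ],
        max_tokens=300,  # hard ceiling on billable output tokens
    )
    print(response.choices[0].message.content)

Note that max_tokens is a blunt instrument: it truncates rather than condenses, so pair it with the prompt instructions above.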

Use the Large Context Window Strategically

The 131k context window is powerful but expensive to fill. Rather than re-sending a long document with every question, batch multiple questions about the same document into a single API call; you pay for the document's input tokens once instead of once per question.

However, always be mindful of the input cost. Only provide the context that is absolutely necessary for the task at hand to avoid unnecessary expense.
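
A minimal sketch of the batching idea, using a hypothetical build_batched_prompt helper: at the $0.60/M input price, a 65,000-token document costs about $0.039 to send once, versus roughly $0.195 if re-sent for five separate questions.

    # Sketch: pay for an expensive document once by batching questions into
    # a single call instead of re-sending the document per question.
    def build_batched_prompt(document: str, questions: list[str]) -> str:
        numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
        return (
            "Answer each numbered question using only the document below.\n\n"
            f"DOCUMENT:\n{document}\n\n"
            f"QUESTIONS:\n{numbered}\n\n"
            "Give numbered answers in the same order."
        )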

Implement Aggressive Caching

Many applications receive repetitive queries. Implementing a caching layer (e.g., using Redis) to store the results of common prompts can dramatically reduce API calls. For any incoming request, first check if an identical or semantically similar request exists in your cache. Serving a cached response avoids the cost and latency of a new API call entirely.
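
A minimal exact-match cache sketch using the redis-py client; call_model is a stand-in for your API wrapper, and semantic-similarity matching would additionally require an embedding index:

    # Sketch: exact-match prompt cache in front of the API.
    import hashlib

    import redis

    cache = redis.Redis(host="localhost", port=6379)
    TTL_SECONDS = 24 * 3600  # expire cached answers after a day

    def cached_completion(prompt: str, call_model) -> str:
        key = "llm:" + hashlib.sha256(prompt.encode()).hexdigest()
        hit = cache.get(key)
        if hit is not None:
            return hit.decode()  # served from cache: no API cost or latency
        answer = call_model(prompt)  # call_model wraps your actual API call
        cache.setex(key, TTL_SECONDS, answer)
        return answer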

Post-Process and Truncate Outputs

In some cases, you may need a detailed internal representation but only a short final answer for the user. You can let the model generate a longer, more detailed response (e.g., a chain-of-thought process) and then programmatically extract or summarize the final answer in your application logic. This gives you the benefit of the model's full capabilities while controlling what the end-user sees, though you still pay for the full generation.
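
One simple way to implement this, assuming you instruct the model to end its response with a "FINAL:" marker (a prompt convention of our own, not a model feature):

    # Sketch: surface only the marked final answer to the end user.
    # The full (billable) generation is still available for logging or audit.
    import re

    def extract_final_answer(full_output: str) -> str:
        match = re.search(r"FINAL:\s*(.+)", full_output, re.DOTALL)
        return match.group(1).strip() if match else full_output.strip()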

FAQ

What is EXAONE 4.0 32B?

EXAONE 4.0 32B is a 32-billion parameter large language model developed by LG AI Research. It is an open-weight model optimized for knowledge-intensive, non-reasoning tasks. It is distinguished by its high performance on intelligence benchmarks and its very large 131,072-token context window.

Who should use this model?

This model is ideal for developers and organizations that require top-tier performance for tasks like long-document summarization, advanced RAG-based Q&A, and high-quality content generation. Users must be prepared to handle its premium pricing, making it best suited for high-value applications where quality and deep context are more important than minimizing cost.

What does "(Non-reasoning)" mean?

The "Non-reasoning" tag indicates that the model is not primarily designed for complex, multi-step logical deduction or mathematical problem-solving. Instead, its strengths lie in retrieving, processing, and generating text based on the vast knowledge encoded in its parameters and the context provided in the prompt.

How does its 131k context window help?

A 131k token context window allows the model to consider a massive amount of information—equivalent to a small book—in a single request. This is a game-changer for applications like:

  • Analyzing lengthy legal contracts or financial reports.
  • Creating chatbots that remember an entire long conversation.
  • Performing Retrieval-Augmented Generation (RAG) with extensive source material for highly accurate answers.
Why is this model so much more expensive than other open models?

The high price is likely a combination of several factors. First, the resources required to train a high-quality 32B parameter model with a large context window are immense. Second, its high intelligence score positions it as a premium product. Finally, the provider, FriendliAI, sets a price that reflects the value of this performance and the cost of serving such a large model efficiently via their optimized inference engine.

Is this model a good choice for real-time chat applications?

It's a mixed bag. The time-to-first-token (latency) of 0.38 seconds is good, providing a responsive initial feel. However, its output speed of 88 tokens/s is slightly below average. For short responses, it will feel fast enough. For longer, more detailed generations, users might notice a slight delay compared to faster models. The high cost per token is also a major consideration for a high-volume chat service.
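
As a back-of-the-envelope check, perceived response time can be estimated from the two benchmarked numbers; actual latency will vary with load and prompt length:

    # Estimate end-to-end response time from time-to-first-token and output speed.
    TTFT_SECONDS = 0.38       # benchmarked time to first token
    TOKENS_PER_SECOND = 88.2  # benchmarked output speed

    def response_time(output_tokens: int) -> float:
        return TTFT_SECONDS + output_tokens / TOKENS_PER_SECOND

    print(f"Short reply (50 tokens):  {response_time(50):.1f}s")   # ~0.9s
    print(f"Long answer (500 tokens): {response_time(500):.1f}s")  # ~6.0s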

