Devstral Small (non-reasoning)

A fast, concise, and intelligent open-weight model for developers.

An exceptionally fast and highly intelligent open-weight model that balances strong performance and a massive context window with a moderate, output-weighted price.

256k Context · Open License · Code Generation · High Intelligence · Very Fast · Concise Output

Devstral Small, released by Mistral in July 2025, establishes itself as a formidable player in the open-weight model landscape. As its name suggests, it is engineered with developers in mind, delivering a potent combination of high-speed performance, strong intelligence, and remarkable conciseness. It carves out a niche for itself by offering capabilities that challenge larger, more expensive models, particularly in tasks that benefit from rapid responses and factual accuracy. With a massive 256k token context window, it is well-equipped to handle complex inputs like entire codebases or lengthy technical documents, making it a versatile tool for a wide range of professional applications.

On the Artificial Analysis Intelligence Index, Devstral Small achieves a score of 27, placing it at an impressive #14 out of 55 benchmarked models in its class. This score is significantly higher than the class average of 20, demonstrating comprehension and problem-solving ability that punches well above its weight. This intelligence is complemented by top-tier speed: at 234.3 tokens per second, it ranks #3 overall, making it one of the fastest models available. That combination is critical for user-facing applications where latency can make or break the experience. The model is also notably concise, generating just 5.4 million tokens during the intelligence evaluation, compared to a class average of 13 million. This brevity is not only a stylistic feature but also a crucial cost-control mechanism.

The pricing model for Devstral Small presents a nuanced picture. Its input price of $0.10 per million tokens is squarely average for its category. However, its output price of $0.30 per million tokens is somewhat expensive, exceeding the class average of $0.20. This pricing structure makes Devstral Small particularly cost-effective for tasks that are input-heavy and output-light, such as Retrieval-Augmented Generation (RAG), classification, or summarization. Conversely, for workloads that require extensive generated text, such as verbose chatbots or long-form content creation, the higher output cost can become a significant factor. The total cost to run the model through the entire Intelligence Index benchmark was $9.62, providing a tangible measure of its operational cost under sustained, complex usage.
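
As a quick illustration of how this asymmetric pricing plays out, here is a minimal sketch in Python. It uses the headline per-token prices quoted above; the workload table later on this page uses Deepinfra's cheaper rates instead.

```python
# Illustrative cost arithmetic using the headline per-token prices above
# ($0.10 per 1M input tokens, $0.30 per 1M output tokens).
INPUT_PRICE_PER_M = 0.10
OUTPUT_PRICE_PER_M = 0.30

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single API call."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Input-heavy RAG call: cheap, because most tokens hit the $0.10 rate.
print(f"${call_cost(8_000, 400):.6f}")    # $0.000920
# Output-heavy generation: the $0.30 output rate dominates.
print(f"${call_cost(1_500, 3_000):.6f}")  # $0.001050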

Given its profile, Devstral Small excels in specific domains. Its primary strength lies in code-related tasks—generation, debugging, explanation, and autocompletion—where its speed, large context, and intelligence provide a seamless developer experience. It is also an excellent choice for building sophisticated RAG systems, where it can quickly process large amounts of retrieved context to generate a concise, accurate answer. Its speed and conciseness make it a prime candidate for powering real-time chatbots and virtual assistants that need to deliver immediate, relevant responses without unnecessary verbosity.

Scoreboard

Intelligence: 27 (#14 / 55)
Scores well above the class average of 20, placing it among the top-tier intelligent models in this category.

Output speed: 234.3 tokens/s
Exceptionally fast, ranking #3 out of 55 models benchmarked for output speed.

Input price: $0.10 / 1M tokens
Average pricing for its class, ranking #27 out of 55.

Output price: $0.30 / 1M tokens
Slightly more expensive than average, ranking #35 out of 55.

Verbosity signal: 5.4M tokens
Highly concise, generating far fewer tokens than the class average of 13M.

Provider latency: 0.26s TTFT
Excellent time-to-first-token, with the lowest-latency provider starting responses in just over a quarter of a second.

Technical specifications

Spec                | Details
--------------------|---------------------------------------------
Model Owner         | Mistral
License             | Open
Model Release       | July 2025
Context Window      | 256,000 tokens
Input Modalities    | Text
Output Modalities   | Text
Architecture        | Transformer-based
Parameters          | Not disclosed
Training Data       | Not publicly disclosed
Fine-tuning Support | Yes, via standard methods and provider APIs
API Providers       | Mistral, Deepinfra

What stands out beyond the scoreboard

Where this model wins
  • Exceptional Speed: With a top-tier output speed of over 234 tokens/second on its fastest provider, it's ideal for real-time applications like chatbots and interactive coding assistants.
  • High Intelligence for its Size: Scoring significantly above average on the Intelligence Index, it provides reliable and accurate outputs for complex tasks without the overhead of larger models.
  • Massive Context Window: The 256k token context window allows it to process and analyze very large documents, extensive codebases, or long conversation histories in a single pass.
  • Output Conciseness: Its tendency to be concise saves on output token costs and provides users with direct, to-the-point answers, which is valuable for summarization and data extraction.
  • Strong for Code-Related Tasks: The combination of speed, a large context window, and high intelligence makes it a powerful tool for code generation, debugging, and explanation.
Where costs sneak up
  • Above-Average Output Costs: The $0.30 per 1M output token price is higher than many competitors, making generation-heavy tasks like long-form content creation or verbose chat sessions more expensive.
  • The Large Context Trap: While the 256k context is a feature, filling it with input on every API call adds up fast. A full context window costs about $0.026 per call at the $0.10/1M input rate, which works out to over $25 per thousand calls in input tokens alone.
  • Not the Cheapest Overall: While not a premium-priced model, it's not a budget option either. Users seeking the absolute lowest cost may find other open-weight models more suitable.
  • Performance vs. Cost Trade-off: The fastest provider (Mistral) is also the more expensive one. Users must choose between top-tier performance for interactive applications and lower operational costs for background tasks.
  • Verbosity Creep: While naturally concise, if prompts encourage verbose outputs, the higher output token price will quickly inflate costs. Careful prompt engineering is required to maintain cost-efficiency.

Provider pick

Choosing a provider for Devstral Small involves a direct trade-off between performance and cost. The model's creator, Mistral, offers the highest throughput, making it the clear choice for applications where speed is paramount. In contrast, Deepinfra provides a more budget-friendly option with lower latency (Time to First Token), making it ideal for cost-sensitive projects that can tolerate a slower generation rate.

Priority         | Pick      | Why                                                                                                       | Tradeoff to accept
-----------------|-----------|-----------------------------------------------------------------------------------------------------------|--------------------
Max Speed        | Mistral   | At 234 t/s, it's over 5x faster than the alternative, essential for real-time user experiences.           | Higher blended price ($0.15 vs $0.12) and slightly higher latency.
Lowest Latency   | Deepinfra | With a TTFT of 0.26s, it begins generating a response faster than any other provider.                     | Significantly slower output speed (45 t/s) once generation begins.
Lowest Price     | Deepinfra | Offers the lowest blended price ($0.12/M tokens) and the cheapest rates for both input and output tokens. | Much lower throughput, making it less suitable for high-velocity generation.
Balanced Profile | Deepinfra | Provides the best combination of low latency and low cost, making it a strong default choice for most non-interactive workloads. | The primary trade-off is the substantially lower tokens/second output rate.

Provider performance and pricing are based on benchmarks conducted in July 2025 and can change. Blended price as calculated by the source benchmark assumes a 3:1 input-to-output token ratio, which favors input-heavy workloads.
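
For concreteness, here is a minimal sketch of that blended-price calculation; the 3:1 weighting comes from the benchmark's stated assumption, and the per-provider rates are the ones quoted on this page.

```python
def blended_price(input_usd_per_m: float, output_usd_per_m: float) -> float:
    """Blended USD per 1M tokens at the benchmark's 3:1 input:output ratio."""
    return (3 * input_usd_per_m + 1 * output_usd_per_m) / 4

print(f"{blended_price(0.10, 0.30):.4f}")  # Mistral:   0.1500
print(f"{blended_price(0.07, 0.28):.4f}")  # Deepinfra: 0.1225 (the quoted ~$0.12)
```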

Real workloads cost table

To understand how Devstral Small's pricing translates to real-world use, let's estimate the cost for several common tasks. These scenarios highlight how input and output token counts affect the final cost. Calculations use the lowest-cost provider, Deepinfra, with prices of $0.07 per 1M input tokens and $0.28 per 1M output tokens.

Scenario               | Input                          | Output       | What it represents                                      | Estimated cost
-----------------------|--------------------------------|--------------|---------------------------------------------------------|---------------
RAG Chatbot Query      | 8,000 tokens (query + context) | 400 tokens   | Answering a user question using retrieved documents.    | $0.00067
Code Generation        | 1,500 tokens (detailed prompt) | 3,000 tokens | Generating a Python script from specifications.         | $0.00095
Document Summarization | 15,000 tokens (long article)   | 500 tokens   | Creating a bulleted summary of a long text.             | $0.00119
Email Classification   | 500 tokens (email body)        | 50 tokens    | Categorizing an inbound email and extracting key info.  | $0.00005
Multi-Turn Conversation| 10,000 tokens (chat history)   | 2,000 tokens | A 10-turn conversation where context is maintained.     | $0.00126

Devstral Small is highly cost-effective for tasks with large inputs and concise outputs, such as RAG and summarization, where individual operations cost a fraction of a cent. Costs remain low even for generative tasks like coding, but can accumulate more quickly in verbose, chat-heavy applications due to the higher output price.
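
These estimates are straightforward to reproduce. The sketch below recomputes the table from Deepinfra's quoted rates; the scenario names and token counts are taken directly from the rows above.

```python
# Recomputes the workload table above from Deepinfra's quoted rates
# ($0.07 per 1M input tokens, $0.28 per 1M output tokens).
IN_RATE, OUT_RATE = 0.07, 0.28

workloads = {
    "RAG Chatbot Query":       (8_000, 400),
    "Code Generation":         (1_500, 3_000),
    "Document Summarization":  (15_000, 500),
    "Email Classification":    (500, 50),
    "Multi-Turn Conversation": (10_000, 2_000),
}

for name, (tokens_in, tokens_out) in workloads.items():
    cost = (tokens_in * IN_RATE + tokens_out * OUT_RATE) / 1_000_000
    print(f"{name}: ${cost:.6f}")  # matches the table after rounding
```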

How to control cost (a practical playbook)

Given Devstral Small's pricing structure—average on input, expensive on output—managing costs requires a strategic approach. Optimizing usage is key to leveraging its high performance without incurring unexpected expenses. The following playbook offers techniques to minimize spend while maximizing the model's value.

Engineer Prompts for Conciseness

The most direct way to control costs is to minimize expensive output tokens. Design your prompts to explicitly ask for brevity.

  • Request structured output like JSON or XML, which is less verbose than natural language.
  • Instruct the model to respond in bullet points or a numbered list instead of full paragraphs.
  • Set explicit length constraints, such as "summarize in three sentences" or "provide a one-paragraph explanation."
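
Concretely, a brevity-constrained request might look like the sketch below. It assumes an OpenAI-style chat-completions payload; the model identifier and the max_tokens value are illustrative assumptions, not confirmed values for any particular provider.

```python
# A minimal sketch of a brevity-constrained request. The payload shape
# follows the common OpenAI-style chat-completions format; the model
# name and token cap are illustrative assumptions.
request = {
    "model": "devstral-small",  # hypothetical identifier; check your provider
    "max_tokens": 300,          # hard ceiling on expensive output tokens
    "messages": [
        {
            "role": "system",
            "content": (
                "Answer in at most three sentences. "
                "When asked for data, reply with JSON only, no surrounding prose."
            ),
        },
        {"role": "user", "content": "Summarize the attached incident report."},
    ],
}
```
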
Optimize Context Window Usage

The 256k context window is powerful but expensive to fill. Avoid sending the entire conversation history or full documents with every call. Instead, employ more sophisticated context management.

  • For chatbots, use a summarization model to condense the history before sending it to Devstral Small.
  • For RAG, ensure your retrieval step is highly effective, sending only the most relevant chunks of text as context.
  • Implement a sliding window technique for long conversations, keeping only the most recent N tokens.
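
As a concrete illustration of the sliding-window idea, here is a minimal sketch. It assumes chat messages are dicts with a "content" field, and count_tokens stands in for a real tokenizer-backed counter.

```python
def sliding_window(history, max_tokens, count_tokens):
    """Keep only the most recent messages that fit within max_tokens."""
    kept, total = [], 0
    for message in reversed(history):          # walk newest-first
        cost = count_tokens(message["content"])
        if total + cost > max_tokens:
            break                              # older messages are dropped
        kept.append(message)
        total += cost
    return list(reversed(kept))                # restore chronological order

# Rough heuristic counter (~4 characters per token), for illustration only.
trimmed = sliding_window(history=[{"role": "user", "content": "..."}],
                         max_tokens=8_000,
                         count_tokens=lambda text: len(text) // 4)
```
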
Match the Provider to the Workload

Don't default to a single provider for all tasks. A hybrid approach can yield significant savings.

  • Use the fastest provider (Mistral) for user-facing, real-time applications where low latency and high throughput are critical.
  • Use the cheapest provider (Deepinfra) for asynchronous, background tasks like document processing, summarization, or report generation where users are not waiting for an immediate response.
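
A hybrid setup can be as simple as a routing rule keyed on latency sensitivity. The sketch below uses the benchmark figures from this page; the routing policy itself is an illustrative assumption.

```python
# Provider figures are this page's benchmark numbers.
PROVIDERS = {
    "mistral":   {"output_tps": 234.3, "blended_usd_per_m": 0.15},
    "deepinfra": {"output_tps": 45.0,  "blended_usd_per_m": 0.12},
}

def pick_provider(latency_sensitive: bool) -> str:
    """Send interactive traffic to the fast provider, batch work to the cheap one."""
    return "mistral" if latency_sensitive else "deepinfra"

print(pick_provider(latency_sensitive=True))   # chat UI        -> mistral
print(pick_provider(latency_sensitive=False))  # nightly report -> deepinfra
```
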
Implement Aggressive Caching

Many applications receive repetitive queries. Caching responses to common inputs can eliminate a large number of API calls entirely.

  • Identify and cache responses for frequently asked questions in a customer support bot.
  • Store the results of common code generation or explanation prompts.
  • Use a semantic cache that can match new queries to existing, similar cached results to further reduce API calls.
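
Here is a minimal exact-match cache sketch; call_model stands in for whatever function actually hits the API, and a production system might back this with Redis and add embedding-based lookup for the semantic-cache variant.

```python
import hashlib

_cache: dict[str, str] = {}

def cached_completion(prompt: str, call_model) -> str:
    """Return a cached answer when the exact prompt has been seen before.

    call_model is a stand-in for the function that calls the API and
    returns the response text.
    """
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)  # only cache misses cost money
    return _cache[key]
```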

FAQ

What is Devstral Small best at?

Devstral Small excels at tasks requiring a combination of speed, intelligence, and the ability to process long context. Its prime use cases include code generation and assistance, Retrieval-Augmented Generation (RAG) systems, real-time chatbots, and complex document summarization or analysis.

How does it compare to other Mistral models?

Devstral Small is positioned as a highly efficient and fast model within Mistral's portfolio. It likely sits below larger, more powerful (and expensive) models like Mistral Large in terms of raw reasoning capability, but offers superior speed and cost-effectiveness for a wide range of common tasks, making it a workhorse model.

Is the 256k context window always useful?

While the 256k context window is a powerful feature, it is not always practical or cost-effective to use it to its full capacity. It is most valuable for specific tasks that genuinely require a vast amount of information in a single pass, such as analyzing an entire codebase or a comprehensive legal document. For most other tasks, it's more efficient to provide a smaller, more targeted context.

Why is it called 'Devstral'?

The name is a strong indicator of its intended audience: developers. Its performance characteristics—high speed for interactive use, strong intelligence for logical tasks, and a large context window for handling code—are all tailored to development-centric workflows, from coding and debugging to technical documentation.

Is Devstral Small a good choice for creative writing?

While it is capable of generating text, its natural tendency towards conciseness and its relatively high output token price may make it less ideal for long-form creative writing compared to other models. Models that are more naturally verbose or have lower output costs might be a more economical and suitable choice for generating stories, articles, or other creative content.

What does the 'Open License' mean for developers?

An open license, such as Apache 2.0 which is common for Mistral models, means the model's weights are publicly available. This allows developers to download, modify, and deploy the model on their own infrastructure. It offers maximum flexibility and control, enabling self-hosting for privacy, security, or cost reasons, and allows for deep customization and fine-tuning.

