An exceptionally fast and highly intelligent open-weight model that balances strong performance and a massive context window with a moderate, output-weighted price.
Devstral Small, released by Mistral in July 2025, establishes itself as a formidable player in the open-weight model landscape. As its name suggests, it is engineered with developers in mind, delivering a potent combination of high-speed performance, strong intelligence, and remarkable conciseness. It carves out a niche for itself by offering capabilities that challenge larger, more expensive models, particularly in tasks that benefit from rapid responses and factual accuracy. With a massive 256k token context window, it is well-equipped to handle complex inputs like entire codebases or lengthy technical documents, making it a versatile tool for a wide range of professional applications.
On the Artificial Analysis Intelligence Index, Devstral Small achieves a score of 27, placing it at an impressive #14 out of 55 benchmarked models in its class. This score is significantly higher than the class average of 20, showing the model punches well above its weight in comprehension and problem-solving. This intelligence is complemented by its top-tier speed; at 234.3 tokens per second, it ranks #3 overall, making it one of the fastest models available. This combination is critical for user-facing applications where latency can make or break the user experience. Furthermore, the model is notably concise, generating just 5.4 million tokens during the intelligence evaluation, compared to a class average of 13 million. This brevity is not only a stylistic feature but also a crucial cost-control mechanism.
The pricing model for Devstral Small presents a nuanced picture. Its input price of $0.10 per million tokens is squarely average for its category. However, its output price of $0.30 per million tokens is somewhat expensive, exceeding the class average of $0.20. This pricing structure makes Devstral Small particularly cost-effective for tasks that are input-heavy and output-light, such as Retrieval-Augmented Generation (RAG), classification, or summarization. Conversely, for workloads that require extensive generated text, such as verbose chatbots or long-form content creation, the higher output cost can become a significant factor. The total cost to run the model through the entire Intelligence Index benchmark was $9.62, providing a tangible measure of its operational cost under sustained, complex usage.
Given its profile, Devstral Small excels in specific domains. Its primary strength lies in code-related tasks—generation, debugging, explanation, and autocompletion—where its speed, large context, and intelligence provide a seamless developer experience. It is also an excellent choice for building sophisticated RAG systems, where it can quickly process large amounts of retrieved context to generate a concise, accurate answer. Its speed and conciseness make it a prime candidate for powering real-time chatbots and virtual assistants that need to deliver immediate, relevant responses without unnecessary verbosity.
| Metric | Value |
|---|---|
| Intelligence Index | 27 (#14 / 55) |
| Output Speed | 234.3 tokens/s |
| Input Price | $0.10 / 1M tokens |
| Output Price | $0.30 / 1M tokens |
| Tokens Generated in Eval | 5.4M tokens |
| Latency (TTFT) | 0.26s |
| Spec | Details |
|---|---|
| Model Owner | Mistral |
| License | Open |
| Model Release | July 2025 |
| Context Window | 256,000 tokens |
| Input Modalities | Text |
| Output Modalities | Text |
| Architecture | Transformer-based |
| Parameters | Not Disclosed |
| Training Data | Not publicly disclosed |
| Fine-tuning Support | Yes, via standard methods and provider APIs |
| API Providers | Mistral, Deepinfra |
Choosing a provider for Devstral Small involves a direct trade-off between performance and cost. The model's creator, Mistral, offers the highest throughput, making it the clear choice for applications where speed is paramount. In contrast, Deepinfra provides a more budget-friendly option with lower latency (Time to First Token), making it ideal for cost-sensitive projects that can tolerate a slower generation rate.
| Priority | Pick | Why | Tradeoff to accept |
|---|---|---|---|
| Max Speed | Mistral | At 234 t/s, it's over 5x faster than the alternative, essential for real-time user experiences. | Higher blended price ($0.15 vs $0.12) and slightly higher latency. |
| Lowest Latency | Deepinfra | With a TTFT of 0.26s, it begins generating a response faster than any other provider. | Significantly slower output speed (45 t/s) once generation begins. |
| Lowest Price | Deepinfra | Offers the lowest blended price ($0.12/M tokens) and the cheapest rates for both input and output tokens. | Much lower throughput, making it less suitable for high-velocity generation. |
| Balanced Profile | Deepinfra | Provides the best combination of low latency and low cost, making it a strong default choice for most non-interactive workloads. | The primary trade-off is the substantially lower tokens/second output rate. |
Provider performance and pricing are based on benchmarks conducted in July 2025 and can change. Blended price as calculated by the source benchmark assumes a 3:1 input-to-output token ratio, which favors input-heavy workloads.
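As a worked check: at a 3:1 ratio, Deepinfra's blended rate comes out to (3 × $0.07 + 1 × $0.28) / 4 = $0.1225, which rounds to the $0.12/M figure above, while Mistral's is (3 × $0.10 + 1 × $0.30) / 4 = $0.15/M. (Deepinfra's per-token rates are quoted in the next section.)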
To understand how Devstral Small's pricing translates to real-world use, let's estimate the cost for several common tasks. These scenarios highlight how input and output token counts affect the final cost. Calculations use the lowest-cost provider, Deepinfra, with prices of $0.07 per 1M input tokens and $0.28 per 1M output tokens.
| Scenario | Input | Output | What it represents | Estimated cost |
|---|---|---|---|---|
| RAG Chatbot Query | 8,000 tokens (query + context) | 400 tokens | Answering a user question using retrieved documents. | $0.00067 |
| Code Generation | 1,500 tokens (detailed prompt) | 3,000 tokens | Generating a Python script from specifications. | $0.00095 |
| Document Summarization | 15,000 tokens (long article) | 500 tokens | Creating a bulleted summary of a long text. | $0.00119 |
| Email Classification | 500 tokens (email body) | 50 tokens | Categorizing an inbound email and extracting key info. | $0.00005 |
| Multi-Turn Conversation | 10,000 tokens (chat history) | 2,000 tokens | A 10-turn conversation where context is maintained. | $0.00126 |
Devstral Small is highly cost-effective for tasks with large inputs and concise outputs, such as RAG and summarization, where individual operations cost a fraction of a cent. Costs remain low even for generative tasks like coding, but can accumulate more quickly in verbose, chat-heavy applications due to the higher output price.
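For readers who want to adapt these estimates to their own token counts, here is a minimal sketch of the arithmetic behind the table, using the Deepinfra rates quoted above; the results match the table up to rounding.

```python
# Per-token rates at Deepinfra: $0.07 / 1M input, $0.28 / 1M output.
INPUT_PRICE = 0.07 / 1_000_000
OUTPUT_PRICE = 0.28 / 1_000_000

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of a single request."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Token counts taken from the scenario table above.
scenarios = {
    "RAG chatbot query": (8_000, 400),
    "Code generation": (1_500, 3_000),
    "Document summarization": (15_000, 500),
    "Email classification": (500, 50),
    "Multi-turn conversation": (10_000, 2_000),
}

for name, (inp, out) in scenarios.items():
    print(f"{name}: ${estimate_cost(inp, out):.5f}")
```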
Given Devstral Small's pricing structure—average on input, expensive on output—managing costs requires a strategic approach. Optimizing usage is key to leveraging its high performance without incurring unexpected expenses. The following playbook offers techniques to minimize spend while maximizing the model's value.
The most direct way to control costs is to minimize expensive output tokens. Design your prompts to explicitly ask for brevity.
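Concretely, that means a terse system prompt plus a hard `max_tokens` cap. Below is a minimal sketch against an OpenAI-compatible endpoint; the base URL, API key, and model ID are placeholders, so substitute whatever your provider actually lists:

```python
from openai import OpenAI

# Placeholder endpoint and credentials -- use your provider's
# OpenAI-compatible base URL and its listed Devstral Small model name.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="devstral-small",  # hypothetical model ID
    messages=[
        # Asking for brevity in the system prompt trims billable output tokens.
        {"role": "system", "content": "Answer in at most three sentences."},
        {"role": "user", "content": "Explain what a race condition is."},
    ],
    max_tokens=150,  # hard cap on output tokens, the expensive side
)
print(response.choices[0].message.content)
```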
The 256k context window is powerful but expensive to fill. Avoid sending the entire conversation history or full documents with every call. Instead, employ more sophisticated context management.
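One simple pattern is a token-budgeted sliding window that keeps the system prompt and only the most recent turns. The sketch below uses a crude 4-characters-per-token estimate rather than the model's real tokenizer, which is an assumption you would replace in production:

```python
def trim_history(messages: list[dict], max_tokens: int = 8_000) -> list[dict]:
    """Keep the system prompt plus as many recent turns as fit the budget."""
    def rough_tokens(text: str) -> int:
        return len(text) // 4  # crude heuristic; swap in a real tokenizer

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    kept = []
    budget = max_tokens - sum(rough_tokens(m["content"]) for m in system)
    for msg in reversed(turns):  # walk backwards from the newest turn
        cost = rough_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))
```

More sophisticated variants summarize the evicted turns into a single short message instead of discarding them outright.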
Don't default to a single provider for all tasks. A hybrid approach can yield significant savings.
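For instance, a request router can send latency-sensitive traffic to the fast provider and background work to the cheap one. This sketch hard-codes the throughput and blended-price figures from the provider table above; the routing rule itself is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    tokens_per_sec: float
    blended_price: float  # $ per 1M tokens at a 3:1 input:output ratio

# Figures from the provider comparison above.
MISTRAL = Provider("Mistral", 234.0, 0.15)
DEEPINFRA = Provider("Deepinfra", 45.0, 0.12)

def pick_provider(interactive: bool) -> Provider:
    """Route latency-sensitive traffic to the fast provider,
    batch/background work to the cheap one."""
    return MISTRAL if interactive else DEEPINFRA

print(pick_provider(interactive=True).name)   # Mistral: a user is waiting
print(pick_provider(interactive=False).name)  # Deepinfra: offline batch job
```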
Many applications receive repetitive queries. Caching responses to common inputs can eliminate a large number of API calls entirely.
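A minimal in-process version keyed on a normalized prompt looks like the sketch below; `call_model` is a hypothetical stand-in for your real API wrapper, and a production system would typically add a TTL or an external store such as Redis:

```python
import hashlib

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for the real API call."""
    return f"response to: {prompt}"

_cache: dict[str, str] = {}

def cached_completion(prompt: str) -> str:
    """Serve repeated queries from memory; hit the API only on a cache miss."""
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)
    return _cache[key]

# The second, nearly identical query costs nothing: no tokens are billed.
print(cached_completion("What are your support hours?"))
print(cached_completion("What are your support hours?  "))
```

Note that exact-match caching only catches verbatim repeats; semantic caching via embeddings can widen the hit rate at the cost of extra infrastructure.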
Devstral Small excels at tasks requiring a combination of speed, intelligence, and the ability to process long context. Its prime use cases include code generation and assistance, Retrieval-Augmented Generation (RAG) systems, real-time chatbots, and complex document summarization or analysis.
Devstral Small is positioned as a highly efficient and fast model within Mistral's portfolio. It likely sits below larger, more powerful (and expensive) models like Mistral Large in terms of raw reasoning capability, but offers superior speed and cost-effectiveness for a wide range of common tasks, making it a workhorse model.
While the 256k context window is a powerful feature, it is not always practical or cost-effective to use it to its full capacity. It is most valuable for specific tasks that genuinely require a vast amount of information in a single pass, such as analyzing an entire codebase or a comprehensive legal document. For most other tasks, it's more efficient to provide a smaller, more targeted context.
The name is a strong indicator of its intended audience: developers. Its performance characteristics—high speed for interactive use, strong intelligence for logical tasks, and a large context window for handling code—are all tailored to development-centric workflows, from coding and debugging to technical documentation.
While it is capable of generating text, its natural tendency towards conciseness and its relatively high output token price may make it less ideal for long-form creative writing compared to other models. Models that are more naturally verbose or have lower output costs might be a more economical and suitable choice for generating stories, articles, or other creative content.
An open license, such as Apache 2.0 which is common for Mistral models, means the model's weights are publicly available. This allows developers to download, modify, and deploy the model on their own infrastructure. It offers maximum flexibility and control, enabling self-hosting for privacy, security, or cost reasons, and allows for deep customization and fine-tuning.
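As a rough illustration of what self-hosting can look like, here is a sketch using Hugging Face `transformers`; the repository ID is an assumption (check Mistral's official listing), and a model of this size requires substantial GPU memory:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo ID -- verify against Mistral's Hugging Face organization.
MODEL_ID = "mistralai/Devstral-Small-2507"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = "Write a function that reverses a linked list."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```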