An open-license model from Mistral offering exceptional speed and a 256k context window, balanced by below-average intelligence and moderately high output costs.
Devstral Small (May '25) is a recent addition to Mistral's growing family of open-weight models, specifically tailored for developer-centric tasks. As its name implies, it's designed with a strong focus on code generation, completion, and analysis, while also being a capable text generator. Positioned as a 'small' model, it prioritizes speed and efficiency, making it a compelling choice for applications that require real-time feedback and high throughput, such as interactive coding assistants, chatbots, and content moderation systems.
The model's performance profile is one of stark trade-offs. Its most prominent feature is its exceptional inference speed. When served on Mistral's native API, it achieves nearly 200 tokens per second, placing it among the fastest models on the market. This speed, combined with a very generous 256,000-token context window, creates powerful possibilities for processing and generating content based on large codebases or extensive documentation. However, this performance comes at the cost of raw intelligence. With a score of 20 on the Artificial Analysis Intelligence Index, it sits in the below-average tier for reasoning and complex instruction following. This makes it less suitable for tasks that demand deep analytical capabilities or nuanced understanding, positioning it as a specialized tool rather than a general-purpose reasoning engine.
The pricing structure for Devstral Small reflects its specialized nature. Input tokens are moderately priced at around $0.10 per million, which is average for its class and makes the model cost-effective for input-heavy tasks like Retrieval-Augmented Generation (RAG) or document analysis, where it processes large amounts of text. Output tokens, by contrast, are comparatively expensive, reaching up to $0.30 per million on some platforms. This pricing penalizes verbosity: applications that generate lengthy responses, such as detailed explanations or long-form content creation, will see costs accumulate more quickly. Developers must carefully consider the input/output ratio of their intended workload to forecast expenses accurately.
Currently, Devstral Small is accessible through a limited number of API providers, primarily Mistral's own platform and Deepinfra. The choice of provider has a dramatic impact on both performance and cost. Mistral's native offering provides unmatched throughput, making it the go-to for high-volume streaming. Deepinfra presents a much more economical option, at less than half the price, with the added benefit of lower time-to-first-token (TTFT) latency. This makes Deepinfra a strong contender for cost-sensitive projects or applications where initial responsiveness matters more than sustained generation speed. Choosing a provider is therefore not just a matter of preference but a strategic decision that directly affects application performance and operational budget.
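Because both providers expose OpenAI-compatible chat endpoints, switching between them can be as simple as changing a base URL. The sketch below illustrates this; the base URLs and model identifiers (`devstral-small-2505`, `mistralai/Devstral-Small-2505`) are assumptions to verify against each provider's documentation.

```python
# Minimal sketch: calling Devstral Small through either provider's
# OpenAI-compatible chat endpoint. Endpoints and model ids are assumed.
import os

from openai import OpenAI

PROVIDERS = {
    "mistral": {
        "base_url": "https://api.mistral.ai/v1",           # assumed endpoint
        "model": "devstral-small-2505",                     # assumed model id
        "api_key": os.environ["MISTRAL_API_KEY"],
    },
    "deepinfra": {
        "base_url": "https://api.deepinfra.com/v1/openai",  # assumed endpoint
        "model": "mistralai/Devstral-Small-2505",           # assumed model id
        "api_key": os.environ["DEEPINFRA_API_KEY"],
    },
}

def complete(provider: str, prompt: str) -> str:
    """Send a single-turn chat request to the chosen provider."""
    cfg = PROVIDERS[provider]
    client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(complete("deepinfra", "Write a Python function that reverses a string."))
```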
| Metric | Value |
|---|---|
| Intelligence Index | 20 (rank #29 of 55) |
| Output Speed | 194.7 tokens/s |
| Input Price | $0.10 / 1M tokens |
| Output Price | $0.30 / 1M tokens |
| Latency (TTFT) | 0.29s |
| Spec | Details |
|---|---|
| Model Name | Devstral Small (May '25) |
| Owner | Mistral |
| License | Open License |
| Context Window | 256,000 tokens |
| Model Type | Text & Code Generation |
| Architecture | Transformer-based |
| Input Modality | Text |
| Output Modality | Text |
| Intelligence Index | 20 (Below Average) |
| Rank (Intelligence) | #29 out of 55 |
| Rank (Speed) | #7 out of 55 |
| Cheapest Provider (Blended) | Deepinfra ($0.07 / 1M tokens) |
| Fastest Provider (Throughput) | Mistral (196 tokens/s) |
| Lowest Latency Provider (TTFT) | Deepinfra (0.29s) |
The best API provider for Devstral Small depends entirely on your primary goal: minimizing cost, maximizing speed, or achieving the lowest latency. The performance and price differences between the available providers are significant, making this a critical decision.
| Priority | Pick | Why | Tradeoff to accept |
|---|---|---|---|
| Lowest Cost | Deepinfra | At a blended price of $0.07/M tokens, it's less than half the price of Mistral's API. Both input ($0.06) and output ($0.12) are significantly cheaper. | Much lower throughput (41 t/s vs. 196 t/s). Not suitable for high-volume, real-time generation. |
| Highest Speed | Mistral | Delivers an exceptional 196 tokens/second, nearly 5x faster than Deepinfra. The best choice for real-time, streaming applications. | Higher cost, especially for output tokens ($0.30/M vs. Deepinfra's $0.12/M). |
| Lowest Latency | Deepinfra | Achieves the lowest time-to-first-token (0.29s), making it feel the most responsive for interactive, single-turn tasks. | Slower overall generation speed after the first token is delivered. |
| Balanced Performance | Mistral | Offers a good combination of very low latency (0.35s) and market-leading speed. A solid default choice if cost is not the primary constraint. | The most expensive option, particularly for output-heavy tasks. |
Provider benchmarks are based on an aggregation of recent performance data. Actual performance may vary. Prices are as of May 2025 and are subject to change.
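To make the trade-off concrete, here is a small estimator, a sketch using the benchmark figures quoted above (prices in $/1M tokens, throughput in tokens/s), that projects both the dollar cost and the rough wall-clock time of a workload on each provider:

```python
# Rough cost/time estimator using the benchmark figures from the table above.
PROFILES = {
    "mistral":   {"in_price": 0.10, "out_price": 0.30, "tps": 196, "ttft": 0.35},
    "deepinfra": {"in_price": 0.06, "out_price": 0.12, "tps": 41,  "ttft": 0.29},
}

def estimate(provider: str, input_tokens: int, output_tokens: int) -> dict:
    p = PROFILES[provider]
    cost = (input_tokens * p["in_price"] + output_tokens * p["out_price"]) / 1_000_000
    seconds = p["ttft"] + output_tokens / p["tps"]  # generation time dominates
    return {"cost_usd": round(cost, 5), "seconds": round(seconds, 1)}

# An 8k-token RAG prompt with a 500-token answer:
print(estimate("mistral", 8_000, 500))    # {'cost_usd': 0.00095, 'seconds': 2.9}
print(estimate("deepinfra", 8_000, 500))  # {'cost_usd': 0.00054, 'seconds': 12.5}
```

For this workload, Deepinfra is roughly 40% cheaper but more than 4x slower end to end, which is exactly the trade-off the table describes.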
To understand the real-world cost of using Devstral Small, let's examine a few common scenarios. These estimates use pricing from the most cost-effective provider, Deepinfra ($0.06/M input, $0.12/M output), to illustrate the best-case cost for each task; the short sketch after the table reproduces the arithmetic. Note that using a faster but more expensive provider like Mistral would significantly increase these costs.
| Scenario | Input | Output | What it represents | Estimated cost |
|---|---|---|---|---|
| Code Snippet Generation | 500 input tokens | 1,500 output tokens | A typical copilot-style request to write a function based on a comment. | $0.00021 |
| RAG-based Q&A | 8,000 input tokens | 500 output tokens | Answering a question using a chunk of a provided document as context. | $0.00054 |
| Live Chat Session | 4,000 input tokens | 6,000 output tokens | A moderately long, interactive conversation with multiple turns. | $0.00096 |
| Large Document Summarization | 100,000 input tokens | 2,000 output tokens | Using the large context window to create a summary of a technical paper. | $0.00624 |
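The figures follow directly from per-token arithmetic, as this minimal sketch at Deepinfra's rates shows:

```python
# Per-token arithmetic behind the scenario table, at Deepinfra's rates.
IN_PRICE, OUT_PRICE = 0.06, 0.12  # $ per 1M tokens

def cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request."""
    return (input_tokens * IN_PRICE + output_tokens * OUT_PRICE) / 1_000_000

scenarios = {
    "code snippet generation": (500, 1_500),
    "rag-based q&a": (8_000, 500),
    "live chat session": (4_000, 6_000),
    "document summarization": (100_000, 2_000),
}
for name, (tokens_in, tokens_out) in scenarios.items():
    print(f"{name}: ${cost(tokens_in, tokens_out):.5f}")
# code snippet generation: $0.00021
# rag-based q&a: $0.00054
# live chat session: $0.00096
# document summarization: $0.00624
```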
The takeaway is clear: Devstral Small is highly economical for input-heavy tasks like RAG and analysis, where the output is concise. However, costs can rise in conversational or generative scenarios with high output token counts. Utilizing the large context window for a single task, like summarization, remains affordable, but frequent use with large inputs will add up.
Managing costs for Devstral Small involves a strategic approach to provider selection, prompt engineering, and application architecture. Given the significant price difference between input and output tokens, controlling verbosity is key. Here are several strategies to keep your operational expenses in check.
Your choice of provider is the single biggest factor in your total cost. The trade-off is stark: speed vs. price.
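One way to operationalize this is to route traffic by latency tolerance: interactive requests go to the fast provider, batch or background jobs to the cheap one. A toy sketch, reusing the hypothetical `complete()` helper from earlier:

```python
# Illustrative routing rule: latency-tolerant batch jobs go to the cheap
# provider; interactive traffic goes to the fast one.
def pick_provider(interactive: bool) -> str:
    # Deepinfra: ~$0.07/M blended but 41 t/s; Mistral: 196 t/s but 2x+ the price.
    return "mistral" if interactive else "deepinfra"

# A nightly summarization job can tolerate slow generation:
summary = complete(pick_provider(interactive=False), "Summarize this changelog: ...")
```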
Output tokens are 2x more expensive on Deepinfra and 3x more expensive on Mistral than input tokens. Every extra word the model generates has an outsized impact on your bill.
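Two levers help here: prompt the model to be brief, and enforce a hard cap with the standard `max_tokens` parameter. A minimal sketch, again assuming Deepinfra's OpenAI-compatible endpoint and model id:

```python
# Capping generation length directly bounds the expensive side of the bill.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # assumed endpoint
    api_key=os.environ["DEEPINFRA_API_KEY"],
)
resp = client.chat.completions.create(
    model="mistralai/Devstral-Small-2505",  # assumed model id
    messages=[
        {"role": "system", "content": "Answer in at most three sentences."},
        {"role": "user", "content": "Explain what a mutex is."},
    ],
    max_tokens=150,  # hard ceiling: 150 output tokens cost at most ~$0.00002 at $0.12/M
)
print(resp.choices[0].message.content)
```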
The 256k context window is a powerful feature but also a potential cost trap if used inefficiently. Sending 100k tokens on every call gets expensive quickly.
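A common mitigation is to send only the relevant slices of a document rather than the whole thing on every call. The sketch below uses a deliberately crude keyword score as a stand-in for real retrieval (embeddings, BM25, and so on); at Deepinfra's assumed input rate, resending a 100k-token document costs about $0.006 per call, which chunk selection cuts by an order of magnitude.

```python
# Toy context pruning: keep only the chunks most relevant to the query
# instead of resending the entire document on every request.
def select_chunks(document: str, query: str, chunk_size: int = 2_000, k: int = 4) -> str:
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    # Crude relevance score: how many query words appear in the chunk.
    words = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: sum(w in c.lower() for w in words), reverse=True)
    return "\n---\n".join(scored[:k])

full_document = open("big_spec.txt").read()  # e.g. ~100k tokens of documentation
context = select_chunks(full_document, "How is authentication handled?")
# The prompt now carries ~8k characters of context instead of the full document.
```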
Devstral Small is a fast, open-license language model from Mistral. It is optimized for speed and developer-focused tasks like code generation, but also handles general text generation. The 'May '25' designation indicates the version and release timeframe of this particular model build.
It is significantly faster than most competitors, especially on its native Mistral API. Its key differentiator is the combination of high speed and a very large 256,000-token context window. However, its raw intelligence and reasoning ability are below average compared to other leading models in its class.
No. It is explicitly a 'non-reasoning' model, meaning it is not designed for multi-step logic, complex problem-solving, or deep analytical tasks. For those use cases, larger and more capable models like GPT-4, Claude 3, or Mistral's own Large model are more appropriate.
The primary differences are speed and cost. Mistral's API is extremely fast (196 tokens/s) but more expensive, especially for output. Deepinfra's API is much cheaper (less than half the price) and has slightly lower latency (faster first token), but its overall generation speed is much slower (41 tokens/s).
It means the model can read, process, and reference up to 256,000 tokens (roughly 190,000 words) in a single prompt. This is extremely useful for tasks like summarizing entire books, answering questions about large codebases, or maintaining context over very long conversations without losing track of details.
Yes. Because Devstral Small is released under an open license, you have the freedom to download the model weights. This allows you to fine-tune it on your own proprietary data to create a specialized version for your specific needs, or to self-host it on your own infrastructure for maximum privacy and control.
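For instance, here is a minimal self-hosting sketch with vLLM; the Hugging Face repo id is an assumption to verify, and weights of this size generally require a large GPU or quantization:

```python
# Minimal self-hosting sketch with vLLM, assuming the open weights are
# published under this Hugging Face id (verify before use).
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Devstral-Small-2505")  # assumed repo id
params = SamplingParams(max_tokens=256, temperature=0.2)
out = llm.generate(["Write a unit test for a FIFO queue class."], params)
print(out[0].outputs[0].text)
```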