Devstral Small (May '25) (non-reasoning)

A blazing-fast developer model with a massive context window.

An open-license model from Mistral offering exceptional speed and a 256k context window, offset by below-average intelligence and moderately high output costs.

Mistral · Open License · 256k Context · Code Generation · Text Generation · High Throughput

Devstral Small (May '25) is a recent addition to Mistral's growing family of open-weight models, specifically tailored for developer-centric tasks. As its name implies, it's designed with a strong focus on code generation, completion, and analysis, while also being a capable text generator. Positioned as a 'small' model, it prioritizes speed and efficiency, making it a compelling choice for applications that require real-time feedback and high throughput, such as interactive coding assistants, chatbots, and content moderation systems.

The model's performance profile is one of stark trade-offs. Its most prominent feature is its exceptional inference speed. When served on Mistral's native API, it achieves nearly 200 tokens per second, placing it among the fastest models on the market. This speed, combined with a very generous 256,000-token context window, creates powerful possibilities for processing and generating content based on large codebases or extensive documentation. However, this performance comes at the cost of raw intelligence. With a score of 20 on the Artificial Analysis Intelligence Index, it sits in the below-average tier for reasoning and complex instruction following. This makes it less suitable for tasks that demand deep analytical capabilities or nuanced understanding, positioning it as a specialized tool rather than a general-purpose reasoning engine.

The pricing structure for Devstral Small reflects its specialized nature. Input tokens are moderately priced at around $0.10 per million tokens, which is average for its class. This makes it cost-effective for input-heavy tasks like Retrieval-Augmented Generation (RAG) or document analysis, where the model processes large amounts of text. Conversely, the cost for output tokens is somewhat expensive, reaching up to $0.30 per million tokens on some platforms. This pricing model penalizes verbosity; applications that generate lengthy responses, such as detailed explanations or long-form content creation, will see costs accumulate more quickly. Developers must carefully consider the input/output ratio of their intended workload to accurately forecast expenses.

Currently, Devstral Small is accessible through a limited number of API providers, primarily Mistral's own platform and Deepinfra. The choice of provider has a dramatic impact on both performance and cost. Mistral's native offering provides unmatched throughput, making it the go-to for high-volume streaming. Deepinfra, on the other hand, is a far more economical option, at less than half the price, with the added benefit of a lower time-to-first-token. That makes it a strong contender for cost-sensitive projects or applications where initial responsiveness matters more than sustained generation speed. Choosing a provider is therefore not just a matter of preference but a strategic decision that directly affects application performance and operating budget.

Scoreboard

Intelligence

20 (rank #29 of 55)

Scores 20 on the Artificial Analysis Intelligence Index. This places it below average among 55 comparable open-weight models, indicating it is not designed for complex reasoning tasks.

Output speed

194.7 tokens/s

Exceptionally fast, ranking #7 out of 55 models. This speed is primarily achieved on Mistral's native API.

Input price

$0.10 / 1M tokens

Moderately priced for input, ranking #27 out of 55. Competitive for tasks involving large context ingestion.

Output price

$0.30 / 1M tokens

Somewhat expensive for output, ranking #35 out of 55. Can become costly for verbose generation tasks.

Verbosity signal

N/A

Verbosity data is not yet available for this model.

Provider latency

0.29s TTFT

Excellent time-to-first-token. Deepinfra offers the lowest latency, making it feel very responsive for single-turn interactions.

Technical specifications

| Spec | Details |
| --- | --- |
| Model Name | Devstral Small (May '25) |
| Owner | Mistral |
| License | Open License |
| Context Window | 256,000 tokens |
| Model Type | Text & Code Generation |
| Architecture | Transformer-based |
| Input Modality | Text |
| Output Modality | Text |
| Intelligence Index | 20 (Below Average) |
| Rank (Intelligence) | #29 out of 55 |
| Rank (Speed) | #7 out of 55 |
| Cheapest Provider (Blended) | Deepinfra ($0.07 / 1M tokens) |
| Fastest Provider (Throughput) | Mistral (196 tokens/s) |
| Lowest Latency Provider (TTFT) | Deepinfra (0.29s) |

What stands out beyond the scoreboard

Where this model wins
  • Blazing Inference Speed: With speeds approaching 200 tokens/second on its native platform, it's ideal for real-time, high-throughput applications like interactive chatbots or live code completion.
  • Massive Context Window: A 256k context window allows it to process and reference vast amounts of information, making it suitable for complex document analysis, RAG, or maintaining long conversations.
  • Developer-Focused Capabilities: As a 'Devstral' model, it's optimized for code generation, explanation, and debugging tasks, serving as a powerful copilot for software engineers.
  • Low-Latency Starts: For applications where initial responsiveness is critical, providers like Deepinfra offer sub-300ms time-to-first-token, ensuring a snappy user experience for single-turn queries.
  • Open and Accessible: Being an open-license model allows for greater flexibility, self-hosting possibilities, and fine-tuning for specific domains without vendor lock-in.
Where costs sneak up
  • Below-Average Intelligence: Its score of 20 on the Intelligence Index is below average for its class. For tasks requiring deep reasoning, complex instruction following, or nuanced creative writing, more powerful models are necessary.
  • High Output Token Cost: The output token price is up to 3x the input price. This can make verbose applications, like detailed explanations or content generation, unexpectedly expensive.
  • Performance Fragmentation: Key metrics like speed, latency, and price vary dramatically between API providers. Choosing the wrong provider for your use case can lead to either poor performance or a much higher bill.
  • Not a Reasoning Powerhouse: This is a 'Small' non-reasoning model. It's not designed for multi-step logical problems or complex chain-of-thought tasks, where larger, instruction-tuned models excel.
  • Large Context Cost Trap: While the 256k context window is powerful, filling it with input tokens can become costly, especially for frequent, large-scale processing jobs. The cost per prompt can escalate quickly if not managed.

Provider pick

The best API provider for Devstral Small depends entirely on your primary goal: minimizing cost, maximizing speed, or achieving the lowest latency. The performance and price differences between the available providers are significant, making this a critical decision.

| Priority | Pick | Why | Tradeoff to accept |
| --- | --- | --- | --- |
| Lowest Cost | Deepinfra | At a blended price of $0.07/M tokens, it's less than half the price of Mistral's API. Both input ($0.06) and output ($0.12) are significantly cheaper. | Much lower throughput (41 t/s vs. 196 t/s). Not suitable for high-volume, real-time generation. |
| Highest Speed | Mistral | Delivers an exceptional 196 tokens/second, nearly 5x faster than Deepinfra. The best choice for real-time, streaming applications. | Higher cost, especially for output tokens ($0.30/M vs. Deepinfra's $0.12/M). |
| Lowest Latency | Deepinfra | Achieves the lowest time-to-first-token (0.29s), making it feel the most responsive for interactive, single-turn tasks. | Slower overall generation speed after the first token is delivered. |
| Balanced Performance | Mistral | Offers a good combination of very low latency (0.35s) and market-leading speed. A solid default choice if cost is not the primary constraint. | The most expensive option, particularly for output-heavy tasks. |

Provider benchmarks are based on an aggregation of recent performance data. Actual performance may vary. Prices are as of May 2025 and are subject to change.

Real workloads cost table

To understand the real-world cost of using Devstral Small, let's examine a few common scenarios. These estimates use the pricing from the most cost-effective provider, Deepinfra ($0.06/M input, $0.12/M output), to illustrate the best-case cost for each task. Note that using a faster but more expensive provider like Mistral would significantly increase these costs.

| Scenario | Input | Output | What it represents | Estimated cost |
| --- | --- | --- | --- | --- |
| Code Snippet Generation | 500 tokens | 1,500 tokens | A typical copilot-style request to write a function based on a comment. | $0.00021 |
| RAG-based Q&A | 8,000 tokens | 500 tokens | Answering a question using a chunk of a provided document as context. | $0.00054 |
| Live Chat Session | 4,000 tokens | 6,000 tokens | A moderately long, interactive conversation with multiple turns. | $0.00096 |
| Large Document Summarization | 100,000 tokens | 2,000 tokens | Using the large context window to create a summary of a technical paper. | $0.00624 |
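
The arithmetic behind these figures is linear in token counts. A minimal sketch of the calculation in Python, with Deepinfra's quoted rates hard-coded as assumptions:

```python
# Per-million-token prices, assuming Deepinfra's rates quoted above.
INPUT_PRICE_PER_M = 0.06
OUTPUT_PRICE_PER_M = 0.12

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Reproduces the summarization row: (100,000 * 0.06 + 2,000 * 0.12) / 1e6
print(f"${estimate_cost(100_000, 2_000):.5f}")  # $0.00624
```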

The takeaway is clear: Devstral Small is highly economical for input-heavy tasks like RAG and analysis, where the output is concise. However, costs can rise in conversational or generative scenarios with high output token counts. Utilizing the large context window for a single task, like summarization, remains affordable, but frequent use with large inputs will add up.

How to control cost (a practical playbook)

Managing costs for Devstral Small involves a strategic approach to provider selection, prompt engineering, and application architecture. Given the significant price difference between input and output tokens, controlling verbosity is key. Here are several strategies to keep your operational expenses in check.

Choose the Right Provider for the Job

Your choice of provider is the single biggest factor in your total cost. The trade-off is stark: speed vs. price.

  • For Cost-Sensitive & Batch Jobs: Use Deepinfra. It's less than half the price of Mistral's API. This is the ideal choice for asynchronous tasks, data analysis, summarization, or any application where budget is the top priority.
  • For Real-Time & UX-Critical Apps: Use Mistral. The 5x speed advantage is crucial for chatbots, live coding assistants, and other applications where users are waiting for a streaming response. Absorb the higher cost as a necessity for a premium user experience.
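
To make switching painless, treat the provider as configuration. Below is a minimal sketch using the `openai` Python client against the OpenAI-compatible chat endpoints both providers expose; the base URLs and model IDs are assumptions to confirm against each provider's documentation:

```python
import os
from openai import OpenAI

# Assumed base URLs and model IDs -- verify against each provider's docs.
PROVIDERS = {
    "deepinfra": {  # cheapest: batch jobs, summarization, analysis
        "base_url": "https://api.deepinfra.com/v1/openai",
        "api_key": os.environ["DEEPINFRA_API_KEY"],
        "model": "mistralai/Devstral-Small-2505",
    },
    "mistral": {  # fastest: streaming, real-time UX
        "base_url": "https://api.mistral.ai/v1",
        "api_key": os.environ["MISTRAL_API_KEY"],
        "model": "devstral-small-2505",
    },
}

def complete(provider: str, prompt: str) -> str:
    cfg = PROVIDERS[provider]
    client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Route batch work to the cheap provider, live traffic to the fast one.
summary = complete("deepinfra", "Summarize this changelog in three bullets: ...")
```

Keeping provider details in one dictionary makes re-routing a workload a one-line change rather than a refactor.
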
Aggressively Manage Output Verbosity

Output tokens cost twice as much as input tokens on Deepinfra and three times as much on Mistral. Every extra word the model generates has an outsized impact on your bill.

  • Prompt for Brevity: Instruct the model directly in your prompt to be concise. Use phrases like "Answer in one sentence," "Provide only the code," or "Be brief."
  • Use `max_tokens`: Always set the `max_tokens` parameter in your API call. This acts as a hard ceiling, preventing the model from generating excessively long (and expensive) responses, especially if it gets stuck in a loop.
  • Refine Few-Shot Examples: If using few-shot prompting, ensure your examples show the desired level of conciseness.
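
For example, a minimal sketch combining a brevity instruction with a `max_tokens` ceiling, reusing the client configuration from the provider sketch above (model ID again assumed):

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint, configured as above

resp = client.chat.completions.create(
    model="devstral-small-2505",  # assumed model ID
    messages=[
        # The brevity instruction trims output tokens at the source.
        {"role": "system", "content": "Be brief. Provide only the code, no commentary."},
        {"role": "user", "content": "Write a Python function that deduplicates a list while preserving order."},
    ],
    max_tokens=256,  # hard ceiling on billable output tokens
)
print(resp.choices[0].message.content)
```
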
Optimize Large Context Window Usage

The 256k context window is a powerful feature but also a potential cost trap if used inefficiently. Sending 100k tokens on every call gets expensive quickly.

  • Summarize Conversation History: For chatbots, instead of sending the entire chat history, create a running summary of the conversation and inject that into the prompt along with the last few user turns.
  • Be Selective with RAG: When using Retrieval-Augmented Generation, refine your retrieval step to pull in only the most relevant chunks of text, rather than stuffing the context window with marginally useful information.
  • Cache Full-Context Results: If you frequently analyze the same large document, cache the results of the analysis to avoid reprocessing the entire document for similar queries.
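
To make the first tactic concrete, here is a minimal sketch of running-summary history management, reusing the same assumed client and model ID; the prompt wording is illustrative, not prescriptive:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint, as above
MODEL = "devstral-small-2505"  # assumed model ID
KEEP_RECENT_TURNS = 4  # turns kept verbatim; older turns get summarized

def fold_into_summary(summary: str, expired_turns: list[dict]) -> str:
    """Compress turns that fall out of the window into the running summary."""
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in expired_turns)
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
                   f"Current summary:\n{summary}\n\nNew turns:\n{transcript}\n\n"
                   "Update the summary. Be brief."}],
        max_tokens=150,  # the summary itself stays cheap to maintain
    )
    return resp.choices[0].message.content

def build_messages(summary: str, turns: list[dict], user_msg: str) -> list[dict]:
    """Send only the summary plus the last few verbatim turns."""
    messages = [{"role": "system",
                 "content": f"Conversation summary so far: {summary}"}]
    messages.extend(turns[-KEEP_RECENT_TURNS:])
    messages.append({"role": "user", "content": user_msg})
    return messages
```

This keeps prompt size roughly constant as the conversation grows, instead of letting input costs climb with every turn.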

FAQ

What is Devstral Small (May '25)?

Devstral Small is a fast, open-license language model from Mistral. It is optimized for speed and developer-focused tasks like code generation, but also handles general text generation. The 'May '25' designation indicates the version and release timeframe of this particular model build.

How does it compare to other small models?

It is significantly faster than most competitors, especially on its native Mistral API. Its key differentiator is the combination of high speed and a very large 256,000-token context window. However, its raw intelligence and reasoning ability are below average compared to other leading models in its class.

Is Devstral Small good for complex reasoning?

No. It is explicitly a 'non-reasoning' model, meaning it is not designed for multi-step logic, complex problem-solving, or deep analytical tasks. For those use cases, larger and more capable models like GPT-4, Claude 3, or Mistral's own Large model are more appropriate.

What's the difference between the Mistral and Deepinfra APIs?

The primary differences are speed and cost. Mistral's API is extremely fast (196 tokens/s) but more expensive, especially for output. Deepinfra's API is much cheaper (less than half the price) and has slightly lower latency (faster first token), but its overall generation speed is much slower (41 tokens/s).

What does the 256k context window mean for developers?

It means the model can read, process, and reference up to 256,000 tokens (roughly 190,000 words) in a single prompt. This is extremely useful for tasks like summarizing entire books, answering questions about large codebases, or maintaining context over very long conversations without losing track of details.

Can I fine-tune or self-host this model?

Yes. Because Devstral Small is released under an open license, you have the freedom to download the model weights. This allows you to fine-tune it on your own proprietary data to create a specialized version for your specific needs, or to self-host it on your own infrastructure for maximum privacy and control.
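
As an illustration, a minimal local-inference sketch with Hugging Face Transformers, assuming the weights are published in a Transformers-compatible repository; the repo ID is a placeholder to verify against Mistral's official release, and a model of this size needs substantial GPU memory (or quantization):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Devstral-Small-2505"  # assumed repo ID -- check the release

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs. fp32
    device_map="auto",           # spread layers across available GPUs
)

prompt = "Write a Python function that checks whether a string is a palindrome."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```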

