GPT-3.5 Turbo (0613) (Function Calling)

The classic workhorse model that introduced reliable function calling.

A snapshot model from OpenAI that balanced speed, cost, and the groundbreaking ability to connect language generation with external tools and APIs.

Text Generation · Function Calling · 4K Context · OpenAI · Legacy Model · Cost-Effective

Released in June 2023, GPT-3.5 Turbo (0613) represents a significant milestone in the evolution of large language models. While not the largest or most powerful model of its time, it was the first version of the widely popular GPT-3.5 Turbo series to be fine-tuned specifically for reliable function calling. This capability transformed the model from a pure text generator into a reasoning engine that could interact with external systems, read data, and trigger actions in a structured, predictable manner. It effectively became the brain for a new generation of AI-powered applications that could go beyond conversation to perform concrete tasks.

The '0613' snapshot quickly became a developer favorite, establishing itself as the go-to workhorse for a vast array of applications. Its appeal was rooted in a compelling trifecta of attributes: it was fast enough for real-time user interfaces, cheap enough for high-volume workloads, and capable enough for tasks ranging from customer support and content creation to simple code generation. Before this model, integrating LLMs with other software often required fragile parsing of natural language output. The `0613` version, with its ability to generate structured JSON arguments for predefined functions, provided a robust and scalable solution, paving the way for more complex and reliable AI agents.
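
To make the mechanism concrete, the sketch below shows roughly what a function-calling request against this snapshot looks like, assuming the current openai Python SDK (which still accepts the legacy `functions` and `function_call` parameters); the `get_weather` schema is a hypothetical example, not taken from OpenAI's documentation.

```python
# Minimal function-calling sketch (illustrative; the get_weather schema is hypothetical).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

functions = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Paris"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    functions=functions,       # legacy parameter introduced with the 0613 snapshots
    function_call="auto",      # let the model decide whether to call a function
)

message = response.choices[0].message
if message.function_call:
    # The model returns a function name plus JSON-encoded arguments; it never runs code itself.
    print(message.function_call.name, message.function_call.arguments)
```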

However, the pace of AI development is relentless. The `0613` model, while foundational, has since been superseded by more advanced versions like `1106` and the `gpt-3.5-turbo-0125` update, which offer larger context windows, improved instruction following, and even lower pricing. Furthermore, OpenAI has scheduled this model version for deprecation. As of June 13, 2024, API calls to `gpt-3.5-turbo-0613` will be automatically rerouted to a newer model. Despite its legacy status, analyzing this model provides critical context for understanding the trajectory of AI development and offers a performance baseline against which newer, more capable models are measured.

Scoreboard

Intelligence

Good (Not Ranked / 93)

While not benchmarked in this specific index, GPT-3.5 Turbo (0613) is widely regarded as highly capable for a broad range of standard tasks, though it lacks the deep reasoning and nuance of GPT-4 class models.
Output speed

Fast (qualitative)

Known for high token throughput, making it ideal for streaming responses in chatbots and other interactive applications.
Input price

$1.50 per 1M tokens

This historical pricing established a new standard for cost-effective AI, enabling high-volume applications. Note: Newer versions are even cheaper.
Output price

$2.00 per 1M tokens

The higher output price follows the standard industry pricing pattern, reflecting the greater computational cost of generating tokens versus ingesting them.
Verbosity signal

Moderate (qualitative)

Generally provides concise and relevant answers but can be guided toward more or less verbose outputs through careful system prompting.
Provider latency

Low (qualitative)

Time-to-first-token is typically low, contributing to a responsive user experience in real-time scenarios.

Technical specifications

| Spec | Details |
|---|---|
| Model Name | GPT-3.5 Turbo (snapshot 0613) |
| Owner / Developer | OpenAI |
| Release Date | June 13, 2023 |
| Model Type | Generative Pre-trained Transformer |
| Key Feature | First version with stable Function Calling |
| Context Window | 4,096 tokens |
| Input Modality | Text |
| Output Modality | Text (including structured JSON for functions) |
| License | Proprietary |
| Fine-Tuning | Supported, but the feature is also being deprecated for this model version |
| Deprecation Date | June 13, 2024 (scheduled) |
| API Endpoint | `gpt-3.5-turbo-0613` |
| Successor Models | `gpt-3.5-turbo-1106`, `gpt-3.5-turbo-0125` |

What stands out beyond the scoreboard

Where this model wins
  • Cost-Performance Ratio: For its time, it offered an unmatched balance of capability, speed, and low cost, making sophisticated AI accessible for mainstream applications and startups.
  • Reliable Function Calling: It was the first model to make tool use a dependable feature. Its ability to generate structured JSON for specified functions was a game-changer for building AI agents.
  • Speed and Throughput: Excellent performance for interactive use cases. Its low latency and high output tokens-per-second rate are suitable for chatbots, content completion, and other real-time tasks.
  • Versatility: As a general-purpose model, it performs well across a wide spectrum of non-specialized tasks, including summarization, classification, translation, and creative writing.
  • Maturity and Stability: As a snapshot model, its behavior was fixed, providing developers with a predictable and stable target for prompt engineering and application logic, free from the variations of continuously updated models.
Where costs sneak up
  • Limited Context Window: The 4K token context window is small by modern standards. Processing long documents or maintaining long conversational histories requires complex chunking and state management, adding engineering overhead.
  • Impending Deprecation: The model has a fixed end-of-life date. Any new development on this model is technical debt, as applications will need to be tested and migrated to a newer version.
  • Weaker Complex Reasoning: It struggles with tasks requiring multiple steps of reasoning, deep logical inference, or strict adherence to complex constraints, often failing where GPT-4 models would succeed.
  • Factuality and Hallucination: Like all models of its generation, it can confidently invent facts, sources, and details. It requires fact-checking and guardrails for any application where accuracy is critical.
  • Function Calling Nuances: While reliable, crafting effective function-calling prompts can be an iterative process. The model can occasionally fail to call a function when it should, or hallucinate arguments.
  • No Native JSON Mode: Unlike newer models, it lacks a guaranteed JSON output mode outside of function calling, requiring extra prompting and parsing to ensure valid JSON is returned for other tasks.
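
Because there is no guaranteed JSON mode on this snapshot, a common workaround is to instruct the model to return only JSON and then validate and retry on the client side. The sketch below illustrates one such loop; the system prompt and retry policy are assumptions, not an official pattern.

```python
# Workaround sketch for models without a native JSON mode: ask for JSON,
# then validate the reply and retry if parsing fails. Illustrative only.
import json
from openai import OpenAI

client = OpenAI()

def ask_for_json(prompt: str, retries: int = 1) -> dict:
    messages = [
        {"role": "system", "content": "Reply with a single valid JSON object and nothing else."},
        {"role": "user", "content": prompt},
    ]
    for _ in range(retries + 1):
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo-0613",
            messages=messages,
            temperature=0,
        ).choices[0].message.content
        try:
            return json.loads(reply)
        except json.JSONDecodeError:
            # Feed the bad reply back and ask the model to correct it on the next attempt.
            messages.append({"role": "assistant", "content": reply})
            messages.append({"role": "user", "content": "That was not valid JSON. Return only valid JSON."})
    raise ValueError("Model did not return valid JSON")
```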

Provider pick

Accessing GPT-3.5 Turbo (0613) is primarily done through an API provider. While OpenAI is the direct source, other platforms offer this model as part of a larger suite of services, often with value-added features like unified APIs, enterprise-grade security, or performance monitoring. As the model is now a legacy product, provider availability may be limited.

| Priority | Pick | Why | Tradeoff to accept |
|---|---|---|---|
| Lowest Cost & Direct Access | OpenAI | As the creator of the model, OpenAI offers direct, unadulterated access at the base cost. This is the simplest and most common way to use the model. | Lacks built-in features like advanced caching, failover to other models, or a unified API structure for multi-provider setups. |
| Enterprise Compliance | Microsoft Azure OpenAI Service | Provides the power of OpenAI models within Azure's secure and compliant cloud environment, including VNet integration and regional data residency. | Setup is more complex than the direct OpenAI API. Pricing and model availability can lag behind OpenAI's public releases. |
| Unified API & Resilience | Proxy Providers (e.g., Portkey, LiteLLM) | These services provide a single API endpoint to interact with multiple models from different providers, simplifying code and enabling automatic failover or load balancing. | Introduces an additional point of failure and potential latency. Often comes with its own subscription cost on top of model usage fees. |
| Performance & Caching | Custom Proxy / Monitoring (e.g., Helicone) | Building a custom proxy or using a service like Helicone allows for implementing semantic caching, which can dramatically reduce costs and latency for repetitive queries. | Requires significant engineering effort to build and maintain a custom solution, or a subscription fee for a managed monitoring/caching service. |

Note: Provider offerings are subject to change. Given the deprecation status of `gpt-3.5-turbo-0613`, many providers may no longer feature it prominently or may automatically route requests to newer equivalents.

Real workloads cost table

Understanding the per-token price is one thing; seeing the cost of real-world tasks is another. The true cost-effectiveness of GPT-3.5 Turbo (0613) becomes clear when applied to common application workloads. The following examples are based on its historical pricing of $1.50 per 1M input tokens and $2.00 per 1M output tokens.

| Scenario | Input | Output | What it represents | Estimated cost |
|---|---|---|---|---|
| Customer Support Bot (1 turn) | ~350 tokens | ~100 tokens | A single user question with conversation history and a bot's response. | ~$0.00073 |
| Email Summarization | ~1,500 tokens | ~200 tokens | Processing a moderately long email thread to produce a concise summary. | ~$0.00265 |
| API Call via Function | ~400 tokens | ~80 tokens | A prompt including function definitions, user query, and the model's JSON argument output. | ~$0.00076 |
| Blog Post Idea Generation (5 ideas) | ~100 tokens | ~250 tokens | A short prompt asking for five distinct ideas based on a topic. | ~$0.00065 |
| Language Translation (Paragraph) | ~200 tokens | ~200 tokens | Translating a paragraph of text from one language to another. | ~$0.00070 |
| Sentiment Analysis (Batch of 20) | ~1,000 tokens | ~100 tokens | Classifying 20 short customer reviews as positive, negative, or neutral. | ~$0.00170 |

The takeaway is clear: individual operations are exceptionally cheap, often costing fractions of a cent. The model's affordability allows for its integration into high-volume, user-facing applications without incurring prohibitive costs. The total cost is a simple function of volume, making budget forecasting straightforward for established workflows.
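
For reference, the estimates above follow directly from the per-token prices; a quick sketch of the arithmetic:

```python
# Reproducing the cost estimates from the table above using the 0613 snapshot's
# historical prices ($1.50 per 1M input tokens, $2.00 per 1M output tokens).
INPUT_PRICE = 1.50 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 2.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

print(request_cost(350, 100))    # ≈ 0.000725 (customer support turn, ~$0.00073)
print(request_cost(1500, 200))   # ≈ 0.00265  (email summarization)
```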

How to control cost (a practical playbook)

While GPT-3.5 Turbo (0613) is inexpensive, costs can accumulate rapidly in high-volume applications. A strategic approach to token management is essential for maintaining a healthy budget. Implementing a few key practices can yield significant savings and improve application efficiency.

Optimize Your Prompts

The number of input tokens is a primary cost driver. Keeping prompts concise without sacrificing clarity is key.

  • Refine System Prompts: Iteratively shorten your system message. Remove redundant words or sentences that don't significantly improve output quality.
  • Minimize Context: In conversational applications, don't send the entire chat history with every turn. Use summarization techniques or a sliding window (see the sketch after this list) to keep the context relevant and brief.
  • Use Few-Shot Examples Wisely: While few-shot prompting can improve accuracy, the examples add to your input token count. Use them only when necessary and keep them as short as possible.
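
As referenced under "Minimize Context", a sliding window can be implemented by counting tokens and keeping only the most recent turns that fit a budget. The sketch below uses the tiktoken library for counting; the 3,000-token budget and per-message overhead are illustrative assumptions.

```python
# Sliding-window sketch: keep only as much recent chat history as fits a token budget.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(message: dict) -> int:
    # Rough count: content tokens plus a small per-message overhead (approximation).
    return len(enc.encode(message["content"])) + 4

def trim_history(system: dict, history: list[dict], budget: int = 3000) -> list[dict]:
    kept, used = [], count_tokens(system)
    # Walk backwards so the most recent turns are kept first.
    for message in reversed(history):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        used += cost
        kept.append(message)
    return [system] + list(reversed(kept))
```
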
Leverage Caching Strategies

Many applications receive identical or semantically similar requests repeatedly. Responding from a cache instead of calling the API can lead to massive cost and latency reductions.

  • Exact-Match Caching: For identical prompts, store the result and serve it directly from a simple key-value store (like Redis); a minimal sketch follows this list. This is highly effective for stateless, repetitive tasks.
  • Semantic Caching: For similar but not identical prompts, use embedding models to determine if a new prompt is close enough to a previously answered one. If so, you can serve the cached response, avoiding an API call.
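
A minimal exact-match cache, as mentioned above, can be as simple as hashing the full request and reusing the stored completion. The sketch below uses an in-memory dictionary as a stand-in for Redis; the hashing scheme is an assumption, not a prescribed pattern.

```python
# Exact-match cache sketch: hash the request and reuse stored completions.
import hashlib
import json
from openai import OpenAI

client = OpenAI()
_cache: dict[str, str] = {}  # swap in a Redis client for anything beyond a single process

def cached_completion(messages: list[dict], model: str = "gpt-3.5-turbo-0613") -> str:
    # The cache key covers model and messages so different prompts never collide.
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        response = client.chat.completions.create(model=model, messages=messages)
        _cache[key] = response.choices[0].message.content
    return _cache[key]
```
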
Batch Your Requests

If you have many non-urgent tasks, grouping them is more efficient than making numerous individual calls. Packing several items into one prompt avoids resending the same instructions for every item, and collecting work into scheduled batches on your end simplifies application logic and reduces network round trips.

  • Asynchronous Processing: Use a job queue system to collect tasks (e.g., summarizing articles, generating product descriptions) and have a background worker process them in batches.
  • Consolidate Data: Instead of asking the model to analyze one piece of data at a time, structure your prompt to request analysis on a list of items at once.
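
For example, the sentiment-analysis workload from the cost table can be handled with one consolidated prompt rather than twenty separate calls; the prompt wording and sample reviews below are illustrative.

```python
# Consolidation sketch: classify a batch of reviews in a single request.
from openai import OpenAI

client = OpenAI()
reviews = ["Great product!", "Arrived broken.", "Does the job."]  # illustrative data

numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(reviews))
prompt = (
    "Classify each review below as positive, negative, or neutral. "
    "Answer as a numbered list with one label per line.\n\n" + numbered
)

# One request covers the whole batch instead of one request per review.
response = client.chat.completions.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```
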
Monitor and Log Everything

You can't optimize what you don't measure. Implement robust logging to track token usage for every API call.

  • Log Prompt and Completion Tokens: Record the input and output token counts returned by the API for every request.
  • Associate Costs with Features: Tag API calls with the application feature that triggered them. This helps identify which parts of your product are responsible for the most significant costs.
  • Set Up Alerts: Create budget alerts in your provider's dashboard or your own monitoring system to get notified when costs exceed a certain threshold.
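
One way to capture this data is to wrap every call in a helper that records the token counts the API returns and tags them with the originating feature. The sketch below is a minimal example; the logger setup and `feature` tag are application-specific assumptions.

```python
# Usage-logging sketch: record the token counts reported by the API for each call.
import logging
from openai import OpenAI

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_costs")
client = OpenAI()

def tracked_completion(messages: list[dict], feature: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        messages=messages,
    )
    usage = response.usage  # token counts reported by the API for this request
    logger.info(
        "feature=%s prompt_tokens=%d completion_tokens=%d",
        feature, usage.prompt_tokens, usage.completion_tokens,
    )
    return response.choices[0].message.content
```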

FAQ

What is GPT-3.5 Turbo (0613)?

GPT-3.5 Turbo (0613) is a specific, versioned "snapshot" of the GPT-3.5 Turbo model family, released by OpenAI on June 13, 2023. Its defining feature was being the first version heavily optimized for reliable function calling, allowing it to interact with external tools and APIs by generating structured JSON output.

How is it different from other GPT-3.5 models?

Compared to the base `gpt-3.5-turbo` model (which is continuously updated), the `0613` snapshot was static, offering predictable behavior. Its main advantage over its predecessors was reliable function calling. Newer models like `1106` and `0125` have surpassed it by offering larger context windows (16K vs 4K), better instruction-following, a dedicated JSON mode, and lower prices.

Is this model still relevant to use?

For new projects, no. Its official deprecation date is June 13, 2024. Developers should use newer, more capable, and cheaper versions like `gpt-3.5-turbo-0125`. Its relevance today is primarily for maintaining legacy systems that were built specifically around its behavior and for serving as a historical benchmark for model progress.

What happens after its deprecation date?

OpenAI has stated that after the deprecation date, API calls made to the `gpt-3.5-turbo-0613` endpoint will be automatically rerouted to the current standard `gpt-3.5-turbo` model. While this prevents applications from breaking entirely, it can lead to unexpected behavior or changes in output quality, so proactive migration is strongly recommended.

What is "function calling"?

Function calling is a feature that allows a developer to define a set of tools or functions within their code. The LLM can then choose to "call" one of these functions in response to a prompt. The model doesn't execute the code itself; instead, it generates a JSON object containing the name of the function to call and the arguments to use, which the developer's application can then parse and execute.
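
A minimal sketch of that loop on the application side might look like the following, where `get_weather` is a hypothetical local function and `message` comes from a chat completion response:

```python
# Dispatch sketch: parse the model's function_call and run the matching local function.
import json

def get_weather(city: str, unit: str = "celsius") -> str:
    return f"22 degrees {unit} in {city}"  # stand-in for a real weather lookup

AVAILABLE_FUNCTIONS = {"get_weather": get_weather}

def handle(message):
    call = message.function_call
    if call is None:
        return None  # the model answered in plain text instead of calling a function
    func = AVAILABLE_FUNCTIONS[call.name]
    args = json.loads(call.arguments)  # arguments arrive as a JSON-encoded string
    return func(**args)
```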

What are the main limitations of this model?

The primary limitations are its small 4K token context window, its impending deprecation, and its comparatively weaker performance on complex reasoning tasks when measured against state-of-the-art models like GPT-4. It is also prone to hallucination and requires careful prompt engineering for best results.

