Anthropic has released a suite of updates to the Anthropic API designed to increase throughput and reduce token usage with Claude 3.7 Sonnet. The new features, including cache-aware rate limits, simplified prompt caching, and token-efficient tool use, allow developers to process more requests within existing rate limits while reducing costs with minimal code changes.
Cache-Aware Rate Limits Transform Throughput
The most impactful change involves how Anthropic calculates rate limits for applications using prompt caching. Prompt caching, introduced earlier to let developers store and reuse frequently accessed context between API calls, can reduce costs by up to 90% and latency by up to 85% for long prompts.
The new cache-aware rate limits mean that prompt cache read tokens no longer count against the Input Tokens Per Minute (ITPM) limit for Claude 3.7 Sonnet on the Anthropic API. This fundamental change allows developers to optimize prompt caching usage to dramatically increase throughput within existing ITPM rate limits. Output Tokens Per Minute (OTPM) rate limits remain unchanged.
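As a rough illustration, the sketch below (using the Anthropic Python SDK; the model ID is Claude 3.7 Sonnet's, while the system prompt and reference text are placeholders) sets a cache_control breakpoint on a large block of stable context. Tokens served from the cache on later calls show up under cache_read_input_tokens in the usage object, and these are the reads that cache-aware rate limits exclude from ITPM.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder for a large, stable block of context (product docs, a codebase
# summary, etc.); in practice this is the content worth caching.
LARGE_REFERENCE_TEXT = "..."

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a support assistant for our product."},
        {
            "type": "text",
            "text": LARGE_REFERENCE_TEXT,
            # Cache breakpoint: the prefix up to this block is cached for reuse.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "How do I reset my API key?"}],
)

# Tokens written to and served from the cache are reported separately; under
# cache-aware rate limits, cache reads no longer count against ITPM.
print(response.usage.cache_creation_input_tokens)
print(response.usage.cache_read_input_tokens)
```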
This makes Claude 3.7 Sonnet particularly powerful for applications that require both extensive context and high throughput. Document analysis platforms that maintain large knowledge bases in context, coding assistants that reference extensive codebases, and customer support systems that draw on detailed product documentation all benefit substantially from this change.
Simplified Cache Management
Anthropic has also streamlined how prompt caching works. When developers set a cache breakpoint, Claude now automatically reads from the longest previously cached prefix. This eliminates the need to manually track and specify which cached segments to use.
The automatic identification of the most relevant cached content reduces developer workload while freeing up more tokens for actual content. This feature is available on both the Anthropic API and Google Cloud's Vertex AI, making it accessible across major cloud platforms.
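In practice, this means a long-running conversation needs only a single trailing breakpoint. A minimal sketch, assuming the Anthropic Python SDK and an illustrative exchange:

```python
import anthropic

client = anthropic.Anthropic()

# A growing conversation: only the newest turn carries a cache breakpoint.
# Claude automatically reads from the longest previously cached prefix, so
# earlier breakpoints never need to be tracked or re-specified by hand.
messages = [
    {"role": "user", "content": "Summarize chapter 1 of the attached spec."},
    {"role": "assistant", "content": "Chapter 1 covers the wire protocol..."},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Now compare it with chapter 2.",
                "cache_control": {"type": "ephemeral"},  # single trailing breakpoint
            }
        ],
    },
]

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    messages=messages,
)
print(response.content[0].text)
```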
Token-Efficient Tool Use
Claude has long supported interacting with external client-side tools and functions, allowing developers to equip the model with custom tools for tasks like extracting structured data or automating workflows via APIs. The new token-efficient tool use capability reduces output token consumption by up to 70%, with early users seeing an average reduction of 14%.
Implementing this feature requires minimal effort. Developers simply add the beta header token-efficient-tools-2025-02-19 to tool use requests with Claude 3.7 Sonnet. SDK users need to ensure they're using the beta SDK with anthropic.beta.messages. The feature is currently available in beta on the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.
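A hedged sketch of that setup with the Anthropic Python SDK follows; the get_weather tool is hypothetical, while the beta header value and the anthropic.beta.messages namespace come from the announcement.

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical tool definition, used purely for illustration.
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]

# The beta namespace plus the token-efficient-tools header opt this request
# into reduced output-token tool calls.
response = client.beta.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    betas=["token-efficient-tools-2025-02-19"],
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
)
print(response.usage.output_tokens)
```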
Text Editor Tool for Collaborative Workflows
Anthropic has introduced a new text_editor tool designed for applications where users collaborate with Claude on documents. The tool enables Claude to make targeted edits to specific portions of text within source code, documents, or research reports. This targeted editing reduces token consumption and latency while increasing accuracy.
Developers can implement this tool by providing it in their API requests and handling the tool use responses. The text_editor tool is available on the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI, with comprehensive documentation to help developers get started.
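A minimal sketch of that flow, assuming the Anthropic Python SDK: the tool type and name (text_editor_20250124 / str_replace_editor) follow Anthropic's text editor documentation for Claude 3.7 Sonnet, while the user request and the response handling are illustrative.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    # Tool type and name per Anthropic's text editor tool documentation for
    # Claude 3.7 Sonnet; no input_schema is supplied for built-in tool types.
    tools=[{"type": "text_editor_20250124", "name": "str_replace_editor"}],
    messages=[
        {"role": "user", "content": "Fix the syntax error in primes.py."}
    ],
)

# Claude replies with tool_use blocks such as a "view" or "str_replace"
# command; the application executes each command against its own files and
# returns a tool_result block in the next request (omitted here).
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```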
Real-World Impact: Cognition Case Study
Early adopters are already reporting substantial benefits from these updates. Cognition, an applied AI lab and creator of Devin—a collaborative AI teammate for engineering teams—has leveraged the improvements extensively.
"Prompt caching allows us to provide more context about the codebase to get higher quality results while reducing cost and latency," said Scott Wu, Co-founder and CEO at Cognition. "With cache-aware ITPM limits, we are further optimizing our prompt caching usage to increase our throughput and get more out of our existing rate limits."
This real-world validation demonstrates that the token-saving updates deliver tangible value for production applications at scale.
Implementation Guidance
All features are available immediately to Anthropic API customers and can be implemented with minimal code changes. Developers can take advantage of cache-aware rate limits by using prompt caching with Claude 3.7 Sonnet. For token-efficient tool use, adding the specified beta header to requests enables immediate token savings. The text_editor tool can be integrated into applications for more efficient document editing workflows.
The updates reflect Anthropic's commitment to making Claude more accessible and cost-effective for developers building production applications. By reducing token consumption and increasing throughput within existing rate limits, these changes lower the barrier to deploying sophisticated AI-powered features while improving performance for existing implementations.
Broader Implications for AI Development
These token-saving updates represent an important evolution in how AI APIs balance performance, cost, and accessibility. As language models become increasingly capable and context windows expand, efficient token management becomes critical for production deployments. By automatically handling cache management and optimizing tool use, Anthropic reduces the operational complexity of working with large language models.
For developers building applications that require extensive context—whether analyzing large documents, working with substantial codebases, or maintaining detailed system knowledge—these updates materially improve the economics and performance of AI integration. The combination of reduced costs, lower latency, and increased throughput within existing limits makes sophisticated AI features more accessible to a broader range of applications and organizations.
Source: Token-saving updates on the Anthropic API - Anthropic News