
When caching is in use (whether automatically on supported models, or via the `cache_control` property), BlackBox will make a best effort to continue routing to the same provider to make use of the warm cache. In the event that the provider with your cached prompt is not available, BlackBox will try the next-best provider.
Inspecting cache usage
To see how much caching saved on each generation, you can:
- Use `usage: {include: true}` in your request to get the cache tokens at the end of the response.

The `cache_discount` field in the response body will tell you how much the response saved on cache usage. Some providers, like Anthropic, will have a negative discount on cache writes, but a positive discount (which reduces total cost) on cache reads.
OpenAI
Caching price changes:
- Cache writes: no cost
- Cache reads: charged at 0.25x or 0.50x the price of the original input pricing, depending on the model
Grok
Caching price changes:
- Cache writes: no cost
- Cache reads: charged at 0.25x the price of the original input pricing
Moonshot AI
Caching price changes:
- Cache writes: no cost
- Cache reads: charged at x the price of the original input pricing
Groq
Caching price changes:
- Cache writes: no cost
- Cache reads: charged at 0.5x the price of the original input pricing
Anthropic Claude
Caching price changes:
- Cache writes: charged at 1.25x the price of the original input pricing
- Cache reads: charged at 0.1x the price of the original input pricing

To enable caching with Claude, insert `cache_control` breakpoints in your messages. There is a limit of four breakpoints, and the cache will expire within five minutes. It is therefore recommended to reserve the cache breakpoints for large bodies of text, such as character cards, CSV data, RAG data, book chapters, etc.
Read more about Anthropic prompt caching and its limitations in Anthropic's documentation.
The `cache_control` breakpoint can only be inserted into the text part of a multipart message.
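As an illustration, a breakpoint attached to the text part of a system message might look like the sketch below; the message shape and contents are placeholders, and `type: "ephemeral"` is Anthropic's cache type:

```typescript
// Sketch of a cache_control breakpoint on the text part of a multipart
// message (Anthropic's "ephemeral" cache type; the surrounding message
// shape and contents are illustrative).
const HUGE_REFERENCE_TEXT = "..."; // placeholder for a large, reusable document

const messages = [
  {
    role: "system",
    content: [
      { type: "text", text: "You are a helpful assistant." },
      {
        type: "text",
        text: HUGE_REFERENCE_TEXT, // e.g. a book chapter or RAG context
        cache_control: { type: "ephemeral" }, // breakpoint: cache up to here
      },
    ],
  },
  { role: "user", content: "What does chapter 3 say about pricing?" },
];
```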
DeepSeek
Caching price changes:
- Cache writes: charged at the same price as the original input pricing
- Cache reads: charged at 0.1x the price of the original input pricing
Google Gemini
Implicit Caching
Gemini 2.5 Pro and 2.5 Flash models support implicit caching, providing automatic caching similar to OpenAI; no manual setup is required.

Caching price changes:
- No cache write or storage costs.
- Cached tokens are charged at 0.25x the original input token cost.

To maximize implicit cache hits, keep the initial portion of your message arrays consistent between requests, and place content that varies (such as the user's question) at the end.
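One way to do this is to keep the shared prefix in one place and append only the varying turn, as in this sketch (the model slug, helper function, and placeholder text are illustrative):

```typescript
// Sketch: structuring requests so Gemini's implicit cache can match a shared
// prefix. The model slug and placeholder text are illustrative assumptions.
const LONG_CONTRACT_TEXT = "..."; // placeholder for a large, stable document

// Identical leading messages across requests -> eligible for implicit cache hits.
const SHARED_PREFIX = [
  { role: "system", content: "You are a contract-review assistant." },
  { role: "user", content: LONG_CONTRACT_TEXT },
];

function buildRequest(question: string) {
  return {
    model: "google/gemini-2.5-flash",
    messages: [
      ...SHARED_PREFIX,                    // stable prefix first
      { role: "user", content: question }, // variation goes at the end
    ],
  };
}

// Both calls share the same prefix, so the second can hit the implicit cache.
const reqA = buildRequest("Which clauses mention termination?");
const reqB = buildRequest("Summarize the payment terms.");
console.log(reqA, reqB);
```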
Explicit Caching
For other Gemini models, insert `cache_control` breakpoints explicitly. There is no limit on the number of breakpoints, but only the last one is used for caching.

Caching price changes:
- Cache writes: charged at the input token cost plus five minutes of cache storage
- Cache reads: charged at 0.25x the price of the original input pricing

The cache TTL is five minutes, and a minimum prompt size of 4,096 tokens is typically required.
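Because only the last breakpoint is honored, a single `cache_control` placed immediately after the large content is sufficient. A sketch with placeholder content, mirroring the Claude example above:

```typescript
// Sketch of explicit Gemini caching (assumed request shape): only the last
// cache_control breakpoint is used, so one breakpoint follows the large
// content to be cached.
const LARGE_DATASET_TEXT = "..."; // placeholder; typically must exceed 4096 tokens

const geminiMessages = [
  {
    role: "user",
    content: [
      {
        type: "text",
        text: LARGE_DATASET_TEXT,
        cache_control: { type: "ephemeral" }, // cached for ~5 minutes
      },
      { type: "text", text: "List every column that contains dates." },
    ],
  },
];
```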