Interesting idea. I suppose one could also set response limits (e.g. max response tokens) so the model doesn't waffle on and run up costs. In the best case, "ping" would be one or two input tokens and the "pong" response one or two output tokens, so the cost of each ping would be roughly the preserved context size times the cache read price (one could avoid doing a cache write, since I believe a cache read resets the platform's cache timer).
It would be interesting to graph the cost/savings of this approach based on context length, percent cached, etc.
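As a rough starting point for that graph, here is a minimal Python sketch of the arithmetic. Everything in it is an assumption for illustration: the `Prices` values are made-up per-token rates rather than any provider's real pricing, the names `ping_cost` and `break_even_pings` are hypothetical, and it assumes a cache read really does refresh the cache timer and that an expired cache means paying the cache write price on the whole context again.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    # Hypothetical per-token prices, NOT any provider's real rates.
    input: float = 1.0e-6
    output: float = 5.0e-6
    cache_read: float = 0.1e-6
    cache_write: float = 1.25e-6


def ping_cost(context_tokens: int, prices: Prices,
              ping_tokens: int = 2, pong_tokens: int = 2) -> float:
    """Cost of one keep-alive: read the cached context, send 'ping', get 'pong'."""
    return (context_tokens * prices.cache_read
            + ping_tokens * prices.input
            + pong_tokens * prices.output)


def break_even_pings(context_tokens: int, prices: Prices) -> int:
    """How many pings cost less than letting the cache expire.

    Letting it expire means the next real message pays cache_write instead
    of cache_read on the context, so that difference is what pinging saves.
    """
    saved = context_tokens * (prices.cache_write - prices.cache_read)
    return int(saved // ping_cost(context_tokens, prices))


if __name__ == "__main__":
    prices = Prices()
    for context in (10_000, 100_000, 500_000):
        print(context, ping_cost(context, prices), break_even_pings(context, prices))
```

With these placeholder numbers the per-ping cost is dominated by the cache read on the context, so the break-even count barely changes with context length; it is mostly set by the ratio of write price to read price. Plugging in real prices and the real cache lifetime would show how long pinging is worth it.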
The UI for this is a bit tricky. I could mark conversations as "active" and then do the ping/pong dance only on active conversations, up to some maximum cache lifetime (e.g. 1 hour).
Agree. Also because of the way AI writes, it takes SO LONG to read through it (they're trained on blogspam where the page tells you the author's life story as well as the bloody history of bread before telling you how to bake it)
That's why in this case I usually ask another AI to make me a short summary of the main points. I wish the human behind the looong article would just publish a short summary directly instead.