Today, I write the code. It’s trivial and takes a lot less time than writing the spec, and since I’m using conventional tooling for WCET and stack sizing it’s nice to get those right up front. The LLMs sometimes tweak the code slightly for provability, but this is usually either direct operator replacement (shift with multiplication, and with modulus, etc) or factoring out a block to a function to tie a contract onto it, both of which I trust my compiler to undo (simple arithmetic operations and inlining, respectively) with zero to minimal impact on the generated binary.
I have been testing formal verification methods with multiple products. It will be great to also understand more about what’s tried and how it was done. For example attempting to verify the spec is what I have been trying to implement.
In an interesting coincidence I ended up watching Person of Interest S4 E5 while reading the announcement. The series showed some code supposedly belonging to to an AI.
Fable 5 said the first screen shot is from “ IDA Pro’s Hex-Rays decompiler” and a windows driver. The second screenshot triggered the safety guard rails and pushed me into Haiku.
It’s impossible to write a spec that’s not ambiguous , complete and correct in natural languages. Thus prompts will always generate unreliable software.
IMHO even if we are using auditing tools I believe we must use deterministic tools for critical analysis like this. Such rule and pattern based systems may not scale beyond certain point but they can be accurate.
At work we are now in the process of migrating away from Figma. We had spend years perfecting our Figma based design workflow. Currently we are moving all the designs into the code itself using Storybook. The gap currently is reviews and feedback which is addressed by Chromatic now.
I tried building a deliberately vague project around managing MCP servers [0]. The purpose was to find what LLMs and agents can do. While the project didn’t reach anywhere I was amazed by how it’s possible to navigate even with no clear direction. The ability of the “glorified auto-complete” system to pull off something this sort was an eye opener for me.
False positives from the deterministic audits a very difficult problem to address. Comparing and deduplicating across different methods or LLM audits seems to the only way.
I think these audit tools can look beyond just security and can look for compliance audits as well. The ability to audit real targets in staging environments makes it easy to identify issues.
reply