Agentic Design & Building Production Tooling

AI looks like magic because everything is a hack to make it seem smarter than it is. For every message, the system reassembles the entire conversation history, system prompt, uploaded files, tool results, and your new message into one massive prompt. Every token competes for attention. In large prompts, material at the top and bottom gets noticed, while material in the middle tends to get lost.
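To make the per-message rebuild concrete, here is a minimal sketch of stateless prompt assembly. The `Conversation` class, the `build_prompt` function, and the bracketed section labels are all assumptions made for illustration. Real systems use their own message formats and ordering.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Client-side state; the model itself remembers nothing between calls."""
    system_prompt: str
    history: list[tuple[str, str]] = field(default_factory=list)  # (role, text)
    files: list[str] = field(default_factory=list)
    tool_results: list[str] = field(default_factory=list)


def build_prompt(conv: Conversation, new_message: str) -> str:
    """Reassemble the full prompt from scratch for a single turn."""
    parts = [f"[system]\n{conv.system_prompt}"]
    parts += [f"[file]\n{text}" for text in conv.files]
    parts += [f"[{role}]\n{text}" for role, text in conv.history]
    parts += [f"[tool]\n{text}" for text in conv.tool_results]
    parts.append(f"[user]\n{new_message}")  # the new message always goes last
    return "\n\n".join(parts)


conv = Conversation(system_prompt="You are a helpful assistant.")
first = build_prompt(conv, "Summarize the report.")

# The caller records the turn; the next prompt then contains all of it again.
conv.history += [("user", "Summarize the report."), ("assistant", "Here is a summary.")]
second = build_prompt(conv, "Now shorten it.")
```

The second prompt is strictly longer than the first because the earlier turn is replayed in full, which is why long conversations push older material into the middle of the prompt.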

Google ran 180 experiments across GPT, Gemini, and Claude. On sequential reasoning, every multi-agent variant made things worse by 39–70%. The study also found that once a single agent already completes a task correctly about 45% of the time or more, adding agents brings diminishing or negative returns. Separately, GPT-3.5 in a reflection loop scored 95.1% on HumanEval versus 67% for GPT-4 zero-shot. A weaker model with the right workflow can beat a stronger model without one.
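A reflection loop of the kind mentioned above can be sketched in a few lines: generate, check, and feed the failure back as critique. This is not the setup behind the HumanEval numbers. The names `reflect_until_pass`, `fake_model`, and `run_tests` are hypothetical, and the "model" is a stub so the example runs without any API.

```python
from typing import Callable, Optional


def reflect_until_pass(
    generate: Callable[[str, Optional[str]], str],
    check: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Generate a candidate, check it, and retry with the failure as feedback."""
    feedback: Optional[str] = None
    candidate = ""
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        feedback = check(candidate)  # None means the candidate passed
        if feedback is None:
            return candidate, True
    return candidate, False


def fake_model(task: str, feedback: Optional[str]) -> str:
    # Stand-in for an LLM call: wrong on the first try, corrected after critique.
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def run_tests(code: str) -> Optional[str]:
    namespace: dict = {}
    exec(code, namespace)  # acceptable here only because the code is our own stub
    result = namespace["add"](2, 3)
    return None if result == 5 else f"add(2, 3) returned {result}, expected 5"


code, passed = reflect_until_pass(fake_model, run_tests, "Write add(a, b).")
```

The loop only helps when `check` is an objective signal such as a test suite. In a real system, generated code should run in a sandbox rather than through a bare `exec`.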

Route easy queries to cheap models and hard ones to expensive ones. Clearly define a pass/fail case for your agent and evaluate it often. Every interaction with an LLM is a cold start — that constraint is the foundation of healthy architecture.
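As a rough illustration of both ideas, the sketch below routes queries by a crude difficulty heuristic and scores the router against explicit pass/fail cases. The model identifiers, the keyword list, and the word-count threshold are assumptions chosen for the example, not recommendations.

```python
from dataclasses import dataclass
from typing import Callable

CHEAP_MODEL = "cheap-model"          # hypothetical model identifiers
EXPENSIVE_MODEL = "expensive-model"


def route(query: str, max_easy_words: int = 20) -> str:
    """Pick a model tier from a crude difficulty heuristic."""
    hard_markers = ("prove", "refactor", "multi-step", "debug")
    text = query.lower()
    is_hard = len(text.split()) > max_easy_words or any(m in text for m in hard_markers)
    return EXPENSIVE_MODEL if is_hard else CHEAP_MODEL


@dataclass
class EvalCase:
    query: str
    passes: Callable[[str], bool]  # explicit pass/fail check on the output


def pass_rate(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose pass/fail check succeeds."""
    results = [case.passes(agent(case.query)) for case in cases]
    return sum(results) / len(results)


cases = [
    EvalCase("What is the capital of France?", lambda out: out == CHEAP_MODEL),
    EvalCase("Debug this race condition in my scheduler.", lambda out: out == EXPENSIVE_MODEL),
]
rate = pass_rate(route, cases)
```

Because each case carries its own boolean check, the same harness can be rerun on every change to the router or the prompts, and a drop in `rate` is an unambiguous regression.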

https://arxiv.org/abs/2512.08296v2

Tooling

MITRE ATLAS catalogs the techniques used to attack AI systems. The problem is that knowing the taxonomy doesn’t get you any closer to testing anything, and the tooling landscape is a minefield of overpromises and projects leapfrogging one another. We made a serious attempt to pick industry-leading tools in this rapidly changing field, and we know we’ll need to adapt as the landscape changes.

We built 51 agentic skills that map directly to ATLAS techniques. The mapping covers LLMs, computer vision, recommendation systems, autonomous agents, and ML supply chain. Instead of picking through five different ecosystems to figure out which tool applies to which attack, the decision is already made. These skills are pre-audited both manually and via the auditing skill we built ourselves.
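One way to picture a technique-to-skill mapping is a plain lookup table. The sketch below is not the actual mapping. The skill names and domain assignments are hypothetical, and the technique IDs are quoted from memory of the ATLAS matrix, so they should be checked against the current release before use.

```python
from typing import NamedTuple


class Skill(NamedTuple):
    name: str    # hypothetical skill name
    domain: str  # one of the domains the mapping covers


# Illustrative entries only; verify IDs against the published ATLAS matrix.
SKILLS_BY_TECHNIQUE: dict[str, Skill] = {
    "AML.T0051": Skill("llm-prompt-injection-test", "llm"),
    "AML.T0043": Skill("adversarial-example-test", "computer-vision"),
    "AML.T0020": Skill("training-data-poisoning-test", "recommendation"),
    "AML.T0010": Skill("supply-chain-audit", "ml-supply-chain"),
}


def skill_for(technique_id: str) -> Skill:
    """Return the skill mapped to a technique ID, normalizing case and whitespace."""
    try:
        return SKILLS_BY_TECHNIQUE[technique_id.strip().upper()]
    except KeyError:
        raise LookupError(f"no skill mapped to {technique_id!r}") from None


selected = skill_for("aml.t0051")
```

Failing loudly on an unmapped technique keeps gaps in coverage visible instead of silently falling back to a default tool.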

The skills were generated by an LLM. We edited every one of them by hand before anything shipped, then audited them manually, fixing the obvious problems as we went. Only after that did we write the automated auditor.

The auditor found more issues than we did, not because it was smarter, but because it was more thorough than any person is going to be across 51 skills. It caught unpinned commits, missing venv enforcement, and subtle environment pollution that we had either missed or considered low priority. It produced more nitpicky findings than our manual pass did, but nothing it flagged was wrong. That gave us confidence that it could reproduce our own judgment reliably.
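Two of the finding types named above, unpinned commits and missing venv enforcement, lend themselves to simple mechanical checks. The sketch below is a hypothetical illustration of such checks over a skill's text, not the auditor described here. The function name `audit_skill` and both regular expressions are assumptions.

```python
import re

# A clone counts as pinned only if a full 40-character SHA is checked out.
PINNED_COMMIT = re.compile(r"git\s+checkout\s+[0-9a-f]{40}\b")
VENV_SETUP = re.compile(r"python3?\s+-m\s+venv\b")
PIP_INSTALL = re.compile(r"\bpip3?\s+install\b")


def audit_skill(text: str) -> list[str]:
    """Return human-readable findings for one skill's instructions."""
    findings = []
    if "git clone" in text and not PINNED_COMMIT.search(text):
        findings.append("unpinned commit: git clone without checkout of a full SHA")
    if PIP_INSTALL.search(text) and not VENV_SETUP.search(text):
        findings.append("missing venv enforcement: pip install without creating a venv")
    return findings


bad = "git clone https://example.com/tool.git\npip install -r requirements.txt\n"
good = (
    "git clone https://example.com/tool.git\n"
    "git checkout 0123456789abcdef0123456789abcdef01234567\n"
    "python3 -m venv .venv\n"
    ".venv/bin/pip install -r requirements.txt\n"
)
bad_findings = audit_skill(bad)    # two findings
good_findings = audit_skill(good)  # no findings
```

Checks this literal are easy to apply uniformly across every skill, which is where a tireless automated pass has the edge over a manual one.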
