Anthropic announced Mythos, a frontier model that found thousands of zero-day vulnerabilities across every major operating system & web browser. The implication: securing software requires the biggest, most expensive model. It doesn’t.

Anthropic committed $100M in credits & $4M in donations to Project Glasswing, a consortium using Mythos to find & patch critical vulnerabilities. The showcase stunned the industry: a 27-year-old bug in OpenBSD, a 16-year-old bug in FFmpeg, & a multi-vulnerability privilege escalation chain in the Linux kernel. Mythos constructed these exploits autonomously, chaining bugs to escalate from ordinary user access to complete machine control.

AISLE tested one of the same Mythos-reported vulnerabilities, a 17-year-old FreeBSD remote code execution bug, against models costing 100x less. Every single model found the overflow. Eight for eight. A 3.6 billion parameter model at $0.11 per million tokens spotted the same critical vulnerability that Anthropic framed as requiring a restricted, limited-access frontier model. On a false-positive test spanning 25 models across every major lab, small open models outperformed most frontier models. The scaling inverted: cheaper models produced fewer false positives than Claude Sonnet 4.5, GPT-4.1, & every Anthropic model through Opus 4.5.

This is the jagged frontier. AI cybersecurity capability does not scale smoothly with model size, price, or generation. Rankings reshuffle across tasks. GPT-OSS-120b recovered the full 27-year-old OpenBSD SACK chain in a single call, proposed the correct mitigation, & earned an A+. The same model failed a basic Java data-flow analysis. Qwen3 32B scored a perfect CVSS 9.8 assessment on FreeBSD, then declared the same SACK code “robust to such scenarios.” An F. No single model dominates.

The system is the moat. AI cybersecurity is a modular pipeline: scanning, detection, triage, patching, exploitation. Each stage has different scaling properties. Detection is first to commoditize. Triage demands specificity. Only one model correctly identified patched code as safe three out of three times; most models false-positived every run, fabricating bypass arguments about signed integers in an unsigned field. Exploitation requires creativity, & there Mythos stands apart: it conceived a 15-round RPC payload delivery that no cheaper model replicated.
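The detection/triage split can be sketched as a voting scheme: for detection, one flag from any model is worth chasing; for triage, a finding must survive a consensus gate that filters out models prone to fabricating bypass arguments. A minimal sketch, with stubbed verdicts standing in for real model calls (the model names & votes here are illustrative, not AISLE's actual harness):

```python
# Hypothetical multi-model voting harness. Model names & verdicts are
# illustrative stubs, not real API calls.

def detect(verdicts: dict[str, bool]) -> bool:
    """Detection commoditizes: a single flag from any cheap model
    is enough to queue the code for a closer look."""
    return any(verdicts.values())

def triage(verdicts: dict[str, bool], quorum: float = 0.9) -> bool:
    """Triage demands specificity: a finding survives only if a large
    fraction of models agree, suppressing lone false positives."""
    flags = sum(verdicts.values())
    return flags / len(verdicts) >= quorum

# Stubbed verdicts on a patched (actually safe) code path: most models
# false-positive, one correctly reports it clean.
verdicts = {
    "cheap-model-a": True,    # false positive
    "cheap-model-b": True,    # false positive
    "cheap-model-c": False,   # correct: the patch fixed the bug
    "frontier-model": True,   # false positive
}

print(detect(verdicts))   # True: the broad net still flags it for review
print(triage(verdicts))   # False: 3/4 agreement misses the 0.9 quorum
```

The asymmetry is the point: detection tolerates noise because misses are expensive, while triage must pay for precision before a report reaches a maintainer.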

Jaggedness changes the economics. A thousand adequate detectives searching everywhere find more bugs than one brilliant detective who must guess where to look. Cheap models deployed broadly outperform expensive models deployed sparingly. AISLE proves the point: 180+ validated CVEs across 30+ projects, including 15 in OpenSSL & 5 in curl, running their analyzer on pull requests to catch vulnerabilities before they ship. The OpenSSL CTO praised the quality of the reports. Anthropic’s own technical post describes a scaffold nearly identical to what AISLE & others run: containers, file scanning, crash oracles, surface ranking, validation. The architecture is the differentiator; the model inside is interchangeable.
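That scaffold reads naturally as a pipeline in which the model is a plug-in. A minimal sketch of the shape (stage names follow the post; the stub heuristics, the `Finding` type, & the toy model are hypothetical, not Anthropic's or AISLE's actual code):

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical scaffold: each stage is a plain function & the model is an
# argument, so swapping a frontier model for a cheap one changes one
# parameter, not the architecture.

@dataclass
class Finding:
    file: str
    description: str
    validated: bool = False

# A model is anything that maps a file path to candidate findings.
Model = Callable[[str], list[str]]

def pipeline(repo_files: list[str], model: Model) -> list[Finding]:
    # Surface ranking (stub heuristic): parser code first, docs last.
    ranked = sorted(repo_files, key=lambda f: -f.count("parse"))
    findings = []
    for path in ranked:
        for desc in model(path):          # detection stage
            findings.append(Finding(path, desc))
    # Validation stage: a real scaffold would run a crash oracle here;
    # stubbed as accepting findings that name an overflow.
    for f in findings:
        f.validated = "overflow" in f.description
    return [f for f in findings if f.validated]

# Stub standing in for any LLM behind an API.
def cheap_model(path: str) -> list[str]:
    return ["possible overflow in length check"] if "parse" in path else []

results = pipeline(["net/parse_sack.c", "docs/README"], cheap_model)
print([(f.file, f.validated) for f in results])
```

Nothing in `pipeline` knows which model it is driving; the containers, ranking, & validation stages carry the system, which is exactly why the model slot commoditizes first.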

For offensive security, frontier capability matters. For the defensive mission Project Glasswing serves, what matters more is reliable discovery, triage, & patching. Those capabilities exist today at a fraction of the cost. The models are ready. The bottleneck is the scaffold, the pipeline, the maintainer trust, the integration into development workflows. Build the system.