VaultFuzion BY KAPARDYN
Applied ML · 04 Jun 2026 · 10 min read

Why we stayed single-model on the LLM layer — and what we built instead

One good model in a 20-detector consensus beats two models carrying 60% of the verdict.

— VaultFuzion ML Team

A widely shared belief in email-security ML is that two independent language models, voting in consensus, catch attacks that either model misses alone. It is an intuitive argument: two opinions beat one, disagreement signals uncertainty, and redundancy buys resilience.

We tested that belief on our own pipeline during an April 2026 evaluation. The numbers said yes, in a small way. The operational cost said no, in a much larger way. This article walks through the trade-off and what we shipped instead.

The pitch for dual-LLM

Run a primary model (on-premise Ollama with Llama-family weights) and a secondary model (cloud API, different training corpus and architecture) in parallel on every scanned message. When they agree the message is malicious, the signal contributes strongly to the verdict. When they disagree, dampen both and let the other detectors carry the weight.
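The agree/disagree logic above can be sketched roughly as follows. This is an illustration only, not our production code: the `LlmVote` type, the `dual_llm_signal` function, and the `damping` factor of 0.3 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlmVote:
    """One model's opinion on a message (hypothetical structure)."""
    malicious: bool
    confidence: float  # 0.0 to 1.0


def dual_llm_signal(primary: LlmVote, secondary: LlmVote,
                    damping: float = 0.3) -> float:
    """Return the maliciousness signal the LLM pair passes to the other detectors.

    Assumed behaviour: agreement on "malicious" passes the stronger confidence
    through unchanged; disagreement scales it down by `damping`; agreement on
    "benign" contributes nothing.
    """
    p = primary.confidence if primary.malicious else 0.0
    s = secondary.confidence if secondary.malicious else 0.0
    if primary.malicious and secondary.malicious:
        return max(p, s)
    if primary.malicious != secondary.malicious:
        return damping * max(p, s)
    return 0.0
```

With these assumed values, two "malicious" votes at 0.9 and 0.8 pass 0.9 through, while a 0.9 "malicious" vote against a "benign" vote is dampened to roughly 0.27.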

In our internal evaluation, the dual configuration produced a measurable reduction in false negatives. It also produced a small increase in false-positive rate on ambiguous traffic. The net effect on our hardest bucket of BEC-style attempts was positive but modest.

The cost that killed it

Three factors made dual-LLM a bad trade for us.

First, infrastructure cost. Running two LLMs in parallel roughly doubles GPU time per scanned message. Across the scan volumes we project for MSPs at scale, the second model's compute cost outweighs the marginal detection gain.

Second, latency. The secondary-model call added material latency to our p95 scan time. Pre-delivery hold is only viable when detection stays under a hard budget measured in seconds; a regression here means holding too many messages and generating customer complaints.
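To see why a parallel second call still hurts the tail: when both models run side by side, each message waits for the slower of the two, so a few slow secondary calls land directly in the p95. The sketch below uses made-up latency samples purely for illustration; they are not our measurements, and `percentile` and `parallel_latencies` are hypothetical helpers.

```python
def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile for an integer `pct` between 1 and 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[rank - 1]


def parallel_latencies(primary_ms: list[int], secondary_ms: list[int]) -> list[int]:
    """Per-message latency when both calls run in parallel: the slower one wins."""
    return [max(p, s) for p, s in zip(primary_ms, secondary_ms)]


# Illustrative, invented samples: 20 messages, latencies in milliseconds.
primary_ms = list(range(100, 300, 10))      # 100, 110, ..., 290
secondary_ms = [50] * 17 + [800, 50, 900]   # usually fast, occasionally very slow

single_p95 = percentile(primary_ms, 95)
dual_p95 = percentile(parallel_latencies(primary_ms, secondary_ms), 95)
```

In this toy data the secondary model is faster than the primary on 18 of 20 messages, yet the two slow calls are enough to pull the p95 from 280 ms to 800 ms.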

Third, and most importantly, the LLM is one voice in a 20-detector consensus stack. Each other detector ships improvements every release: BEC graph, thread-hijack, ATO bridge, cross-feature rules. The compounding effect of those shipped improvements already outpaces the marginal dual-LLM gain inside a release cycle or two.

The meta-argument

In a consensus architecture, the right question is never "how good can we make this one detector?" It is "what adds the most signal to the ensemble per unit of cost?" Dual-LLM adds accuracy; the extra detectors we can build in the same engineering time add more.
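One way to make that question concrete is to rank candidate investments by estimated lift divided by cost. The sketch below is a simplification under stated assumptions: the `Candidate` type, the single scalar `cost`, and any numbers fed into it are hypothetical, and real estimates of lift are far noisier than one number suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A possible investment in the ensemble (hypothetical structure)."""
    name: str
    detection_lift: float  # estimated gain in ensemble detection, any consistent unit
    cost: float            # engineering plus running cost, any consistent unit


def rank_by_lift_per_cost(candidates: list[Candidate]) -> list[Candidate]:
    """Order candidates from most to least lift per unit of cost."""
    for c in candidates:
        if c.cost <= 0:
            raise ValueError(f"cost must be positive for {c.name!r}")
    return sorted(candidates, key=lambda c: c.detection_lift / c.cost, reverse=True)
```

A candidate with a larger absolute lift can still rank below a cheaper one, which is the shape of the dual-LLM decision described here.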

What we shipped instead

One LLM, running on our South African infrastructure for privacy and latency. Ollama-backed. The output is one vote in the consensus engine — never decisive alone, never ignored either.
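"Never decisive alone, never ignored" can be read as two constraints on a weighted vote: every detector has a positive weight, and no single detector's share of the total weight reaches the verdict threshold. The sketch below assumes a simple weighted average, which is an illustration rather than a description of our consensus engine; the weights, the threshold, and the function name are hypothetical.

```python
def consensus_verdict(scores: dict[str, float], weights: dict[str, float],
                      threshold: float = 0.5) -> tuple[float, bool]:
    """Combine per-detector scores (0.0 to 1.0) into a weighted-average verdict.

    Raises ValueError if any detector could be ignored (weight <= 0) or could
    cross the threshold on its own (weight share >= threshold).
    """
    total = sum(weights.values())
    for name, weight in weights.items():
        if weight <= 0:
            raise ValueError(f"{name} would be ignored")
        if weight / total >= threshold:
            raise ValueError(f"{name} could decide the verdict alone")
    combined = sum(weights[name] * scores[name] for name in weights) / total
    return combined, combined >= threshold


# Hypothetical weights for a four-detector subset of the stack.
weights = {"llm": 2.0, "bec_graph": 1.5, "thread_hijack": 1.5, "domain_age": 1.0}
```

With these example weights, an LLM score of 1.0 against zeros everywhere else yields about 0.33 and no block; the same LLM score plus a firing BEC graph crosses the threshold.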

With the engineering bandwidth saved, we shipped the capabilities that move the consensus ceiling more than a second LLM would: the BEC behavioural graph (5 patterns), thread-hijacking detector, ATO identity bridge, display-name spoof detector, domain-age detector, ARC auth-chain validator, and cross-feature detection rules that combine signals across detectors.
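As a rough picture of what a cross-feature rule looks like, the sketch below fires only when two independent detectors line up. The specific rules, the 30-day cutoff, and the `Signals` fields are invented for illustration and are not our shipped rule set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Outputs of individual detectors for one message (hypothetical fields)."""
    display_name_spoof: bool
    domain_age_days: int
    arc_chain_valid: bool


def cross_feature_hits(signals: Signals, young_domain_days: int = 30) -> list[str]:
    """Return the names of cross-feature rules that fire for this message."""
    hits = []
    # A spoofed display name is weak evidence alone; on a freshly registered
    # domain the combination is much more suspicious.
    if signals.display_name_spoof and signals.domain_age_days < young_domain_days:
        hits.append("spoofed_name_on_young_domain")
    # A spoofed display name arriving over a broken ARC chain.
    if signals.display_name_spoof and not signals.arc_chain_valid:
        hits.append("spoofed_name_with_broken_arc_chain")
    return hits
```

The point of rules like these is that they add signal without adding a new model: they reuse outputs the stack already computes.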

The composite detection lift from shipping those detectors is larger than what we measured from the dual-LLM experiment, at a fraction of the infrastructure cost, and with a latency profile that improved rather than regressed.

The fine-tuning question — briefly

A parallel question is whether to fine-tune a phishing-specific model. We did not. Not because the accuracy gain isn't there — it probably is, on paper — but because a fine-tuned model trained on 2025 phishing decays measurably against 2026 attack variants. Retraining cadence becomes a first-order cost, and the staff time is the same staff time that ships the other detectors.

A fine-tuned phishing LLM detector is on our future evaluation list. When we do ship one, it enters the consensus stack as another voice, never decisive — and it will run alongside the default model in shadow mode for a full observation window before we trust it to vote.
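Shadow mode, in the sense used here, means the new detector scores every message but its score is only recorded next to the live verdict, never folded into it. A minimal sketch, assuming an unweighted mean for the live score and hypothetical names throughout:

```python
from dataclasses import dataclass, field


@dataclass
class ShadowLog:
    """Collects shadow-detector scores next to the live score for later review."""
    records: list[tuple[str, str, float, float]] = field(default_factory=list)


def score_message(message_id: str, voting: dict[str, float],
                  shadow: dict[str, float], log: ShadowLog) -> float:
    """Return the live score from voting detectors only; log shadow scores."""
    if not voting:
        raise ValueError("at least one voting detector is required")
    live_score = sum(voting.values()) / len(voting)  # shadow scores excluded
    for name, score in shadow.items():
        log.records.append((message_id, name, score, live_score))
    return live_score
```

Comparing the logged shadow scores against live verdicts over the observation window is what tells us whether the candidate has earned a vote.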

Dual-LLM is not wrong. For a company without a strong consensus stack, it buys real resilience. For us, it was a premium feature whose cost exceeded the marginal gain over a 20-detector ensemble we already operate.

A different company, with different constraints, would reasonably choose differently. We chose the boring answer because it is the one that survives contact with the customer's infrastructure bill.

See what's shipping

Each article is paired with a release. For what's currently live, see the release notes. For what's in the pipeline, see what's coming next.