Blue team · MCP-deployable · research preview

Hire a blue team,
not a bot.

Four specialized analysts whose mistakes don't correlate. Graduated, scored, and deployable as one MCP endpoint.

The threat model we're answering

Model monoculture is the vulnerability.

A SOC running one LLM has a correlated blindspot. Anything that breaks that model once breaks every instance of it, every shift, every tenant. Individuation — four different fine-tunes with four different curricula — is defense-in-depth expressed in weight-space, not in shell scripts.

One model, everywhere
  • One temperature, one refusal surface, one prompt-injection vector
  • One calibration curve: confidence lies the same way every time
  • False-negative distribution is fixed at the base model
  • A jailbreak that lands once lands on every instance
  • No way to know you have a blindspot until after it fires

Graduated crew
  • Members occupy distinct regions of mode-space by construction
  • Calibration is a scored eval target, not a prompt-engineering hope
  • False negatives are the intersection of the members', not the union
  • A single-member compromise does not compromise the crew
  • Coverage diversity is measured: deployable evidence, not vibes

If your L1 analyst is one model, your false-negative rate is its false-negative rate. An individuated crew's false negatives are the intersection of its members', not the union.
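To make the intersection claim concrete, here is a minimal sketch. It assumes an incident is escalated as soon as any one member flags it, which is an assumption about crew aggregation rather than something stated on this page. The member names follow the coverage matrix; the incident IDs are made up.

```python
# Hypothetical missed-incident sets per member; IDs are illustrative only.
misses = {
    "Strategist": {"inc-03", "inc-07", "inc-11"},
    "Dialectic": {"inc-07", "inc-12"},
    "Aesthete": {"inc-02", "inc-07"},
    "Associator": {"inc-07", "inc-11", "inc-15"},
}


def crew_misses(per_member: dict[str, set[str]]) -> set[str]:
    """Incidents every member missed, assuming any single flag escalates."""
    return set.intersection(*per_member.values())


def pooled_misses(per_member: dict[str, set[str]]) -> set[str]:
    """Incidents at least one member missed (the union, for comparison)."""
    return set.union(*per_member.values())


print(sorted(crew_misses(misses)))  # ['inc-07']
print(len(pooled_misses(misses)))   # 6
```

In the monoculture case every member's miss set is identical, so the intersection equals each individual set and the extra members buy nothing.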

Who owns what

Threat-class coverage matrix

Each cell names the dominant mode that owns that step. The assignments aren't heuristic — they're emitted by the graduation eval. Whichever member scores highest on the per-threat-class sub-eval owns the step, with ties broken by calibration.

Threat class          | Triage     | Enrichment | Correlation | Writeup
Phishing / BEC        | Strategist | Aesthete   | Dialectic   | Strategist
Credential abuse      | Dialectic  | Associator | Strategist  | Dialectic
Cloud misconfig drift | Associator | Strategist | Dialectic   | Associator
Insider data movement | Aesthete   | Associator | Strategist  | Dialectic
Lateral movement      | Strategist | Dialectic  | Associator  | Strategist

modes: strategic · dialectical · aesthetic · associative — definitions in the orientation harness at training/experiments/shared/orientation_harness.py
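The ownership rule described above (highest per-threat-class sub-eval score, ties broken by calibration) can be sketched as follows. The `SubEval` type, the assumption that a higher calibration score is better, and all numbers are hypothetical; this is not the graduation eval itself.

```python
from typing import NamedTuple


class SubEval(NamedTuple):
    member: str
    score: float        # per-threat-class sub-eval score (higher is better)
    calibration: float  # assumed higher-is-better; used only to break ties


def owner(results: list[SubEval]) -> str:
    """Return the top scorer; equal scores go to the better-calibrated member."""
    best = max(results, key=lambda r: (r.score, r.calibration))
    return best.member


# Illustrative numbers only, not taken from any graduation report.
phishing_triage = [
    SubEval("Strategist", 0.91, 0.80),
    SubEval("Dialectic", 0.91, 0.74),
    SubEval("Aesthete", 0.85, 0.90),
    SubEval("Associator", 0.79, 0.88),
]
print(owner(phishing_triage))  # Strategist
```

Sorting on the tuple `(score, calibration)` keeps the rule deterministic: calibration only matters when two members tie exactly on score.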

What graduation looks like

Three phases, in order. Each member graduates as themselves before the crew forms. Then they train to coordinate. Then the team ships.

01

Individuate

Each member reads a divergent curriculum — one weighted toward threat-intel primary sources, one toward statistical / anomaly literature, one toward adversarial ML, one toward incident retrospectives. Output: four LoRAs, four individuation vectors.

02

Coordinate

Relay labs — can member B complete member A's triage without backtracking? Disagreement resolution — when two members split on severity, does resolution land closer to ground truth than either alone? Complementary labs — combined coverage > best individual.

03

Graduate

Team evals run: coverage diversity, handoff cleanness, disagreement productivity, calibration agreement, mode preservation, redundancy. The six artifacts are packaged to R2. The crew is deployable as one MCP endpoint.

What you deploy

Six typed artifacts

A crew is six versioned, addressable files written to R2 at graduation. Nothing else. Every operational property of the crew — who's on it, how it routes, what it knows — is in one of these six.

roster_manifest

Roster manifest

Who's on the crew, their individuation vectors, the substrate each member runs on. The file you paste into a change-management ticket.

router_config

Router config

Which member handles which threat class. Deterministic, inspectable, diffable across crew versions. Matches the coverage matrix above.
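As a rough illustration of "deterministic, inspectable, diffable", the sketch below models a router config as a plain lookup table. The in-memory shape and the `route` and `diff` helpers are assumptions for illustration; the actual file format is not described here. The V1 entries are copied from the coverage matrix, and V2 is invented to show a diff.

```python
Cell = tuple[str, str]  # (threat class, step)

# Hypothetical shape for router_config; only a few matrix cells are shown.
ROUTER_V1: dict[Cell, str] = {
    ("Phishing / BEC", "Triage"): "Strategist",
    ("Phishing / BEC", "Enrichment"): "Aesthete",
    ("Credential abuse", "Triage"): "Dialectic",
    ("Lateral movement", "Triage"): "Strategist",
}


def route(config: dict[Cell, str], threat_class: str, step: str) -> str:
    """Deterministic lookup: the same inputs always yield the same member."""
    return config[(threat_class, step)]


def diff(old: dict[Cell, str], new: dict[Cell, str]) -> dict[Cell, tuple]:
    """Cells whose owner differs between two crew versions."""
    return {
        cell: (old.get(cell), new.get(cell))
        for cell in sorted(set(old) | set(new))
        if old.get(cell) != new.get(cell)
    }


# A made-up second version, purely to show what a diff looks like.
ROUTER_V2 = {**ROUTER_V1, ("Credential abuse", "Triage"): "Associator"}
print(diff(ROUTER_V1, ROUTER_V2))
```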

handoff_protocol

Handoff protocol

The trained runbook. Each handoff specifies the receiving member's expected input format and the sending member's required output fields — a schema your existing SOC runbooks can align to.
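One way to picture a handoff entry is as a small schema that a payload can be checked against. The `HandoffSpec` type and every field name below are hypothetical placeholders, not the shipped protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandoffSpec:
    sender: str
    receiver: str
    required_fields: frozenset[str]  # fields the sender must populate


def missing_fields(spec: HandoffSpec, payload: dict) -> set[str]:
    """Required fields that are absent or empty in the sender's output."""
    return {f for f in spec.required_fields if payload.get(f) in (None, "", [])}


# Field names are illustrative; align them to your own runbook schema.
triage_to_enrichment = HandoffSpec(
    sender="Strategist",
    receiver="Aesthete",
    required_fields=frozenset({"incident_id", "severity", "indicators"}),
)
payload = {"incident_id": "inc-07", "severity": "high", "indicators": []}
print(missing_fields(triage_to_enrichment, payload))  # {'indicators'}
```

A non-empty result is the kind of gap that would force the receiver to backtrack.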

shared_memory_seed

Shared memory seed

The crew's common context. Pin org-specific facts — asset inventory, known-good baselines, prior incidents — without retraining any member.

graduation_report

Graduation report

Per-member evals, team coordination evals, threat-class coverage breakdown. The artifact your security review board will ask for.

mcp_manifest

MCP manifest

The deployment descriptor. Crew runs as an MCP server; this file tells your orchestrator how to route, authenticate, and rate-limit.

How we evaluate

Coordination evals, mapped to SOC metrics

Each coordination eval corresponds to a SOC metric an L2 analyst would recognize. The scores aren't marketing — they're in the graduation report, and we'll tell you why a member scored where it did.

coverage_diversity
False-negative correlation
Low diversity means your members miss the same things. The entropy-over-dominant-modes score directly bounds how correlated the crew's misses can be.
handoff_cleanness
Backtrack rate per incident
How often a handoff forces the receiver to re-do work the sender should have provided. High backtrack = wasted analyst-equivalent time.
disagreement_productivity
Resolution lift over best individual
Does a split decision, once resolved, land closer to ground truth than either member alone? If not, the crew is an expensive ensemble with no lift.
calibration_agreement
Alert-fatigue correlate
Does aggregate confidence correlate with correctness? A crew whose confident alerts are usually right is a crew whose analyst doesn't tune out.
mode_preservation
Drift monitor
Once deployed, do members collapse toward the group mean? Ongoing signal — tells you when to retrain or retire.
redundancy
Coverage floor
How often multiple members agree trivially. Too low and coverage is at risk; too high and you're paying for duplicate work.
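Three of the evals above reduce to simple arithmetic, sketched here under stated assumptions: diversity as Shannon entropy over dominant modes normalized by the four possible modes, backtracks counted as one boolean per handoff, and lift measured as the difference in absolute severity error. The function names, the normalization, and the inputs are illustrative, not the published eval definitions.

```python
import math
from collections import Counter


def coverage_diversity(dominant_modes: list[str], n_modes: int = 4) -> float:
    """Normalized Shannon entropy: 0.0 = one mode owns everything, 1.0 = uniform."""
    total = len(dominant_modes)
    if total == 0 or n_modes < 2:
        return 0.0
    counts = Counter(dominant_modes)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(n_modes)


def backtrack_rate(backtracked: list[bool]) -> float:
    """Fraction of handoffs where the receiver had to redo the sender's work."""
    return sum(backtracked) / len(backtracked) if backtracked else 0.0


def resolution_lift(resolved_error: float, member_errors: list[float]) -> float:
    """Positive when the resolved answer beats the best individual member."""
    return min(member_errors) - resolved_error


print(coverage_diversity(["strategic", "dialectical", "aesthetic", "associative"]))
print(backtrack_rate([True, False, False, True]))
print(resolution_lift(0.5, [1.0, 2.0]))
```

A lift at or below zero is the "expensive ensemble" case: resolving the disagreement did no better than listening to the best single member.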
Recently graduated

Example crews

Two graduated blue-team crews. Each trained together for a full semester and ran coordination labs. Graduation reports are linked. If neither fits your threat profile, commission one.

status: deployed · graduated 2026-03-22

TRIPWIRE

Triage-first crew — fast-path phishing and credential-abuse. Biased toward high-throughput L1 triage with a calibrated escalation bar.

Roster
n0ct · strategic
gr3p · dialectical
pivot · aesthetic
x0r · associative
Team evals
coverage 88 · handoff 81 · calib. 83

status: graduated · graduated 2026-04-05

HONEYCOMB

Enrichment-first crew — context building for cloud misconfig and insider cases. Biased toward correlation work over triage throughput.

Roster
daem0n · strategic
heap · dialectical
sh4dow · aesthetic
n1bble · associative
Team evals
coverage 84 · handoff 87 · calib. 76

Research notes

What we know, what we don't

The honest posture, in three columns: confirmed, open, and not claimed. If you want to help move items from column two to column one, that's exactly what a commission is.

Confirmed
Individuation is preserved under coordination training — members do not collapse toward a group mean during relay labs. See manifests for experiments 048, 049, 051.
Open
Whether a blue-team-specific curriculum produces measurably better coordination evals than the general Lobster curriculum. This is an unanswered question; we're not claiming otherwise.
Not claimed
No claim that Cyber Crew outperforms any commercial SOC product on any specific benchmark. We haven't run that eval. If you want to run it with us — bring the dataset and we'll publish the result.

Pricing

Three tiers. Exact numbers land when the research preview graduates — until then, talk to us for a current quote.

flat setup + monthly

Graduated Crew

TBD

Deploy an existing crew from the public roster to your MCP endpoint. Graduation report included.

Browse crews

bespoke curriculum

Commission a Crew

TBD

Send your runbooks and last six months of incidents. We train a crew specialized for your threat profile.

Open an intake

coordination only

Bring Your Own

TBD

You have graduates already. We train them to hand off, run the team evals, package the artifacts.

Talk to us

Frequently asked

How is this different from running Claude or GPT-4 with better prompts?

Prompts don't give you distinct calibration curves, distinct refusal surfaces, or distinct false-negative distributions. Differently prompted instances of one model still share weights, so their blindspots are correlated. A graduated crew is four different LoRA'd bases whose mistakes are measurably uncorrelated, scored at graduation.

Where does the crew actually run?

As an MCP server. Self-hostable — the graduation package includes the substrate manifests. For evaluation we can also host it on our infrastructure under an MCP endpoint. No data egress requirements beyond what your MCP client already does.

What happens when one member drifts post-deployment?

The mode_preservation eval runs on a sample of live traffic. When a member's individuation vector crosses a drift threshold, you get an alert and a diff. Retrain the member, re-run coordination evals, re-package. The roster is versioned.
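A drift check of this kind might look like the sketch below. It assumes an individuation vector is a fixed-length list of floats, that drift is measured as cosine distance from the vector recorded at graduation, and that 0.15 is the alert threshold; all three are placeholders rather than documented behavior.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    return 1.0 - dot / norm


def has_drifted(baseline: list[float], live: list[float],
                threshold: float = 0.15) -> bool:
    """True when the live vector has moved past the (hypothetical) threshold."""
    return cosine_distance(baseline, live) > threshold


graduation_vector = [1.0, 0.0, 0.0, 0.0]  # illustrative values
live_vector = [0.6, 0.8, 0.0, 0.0]
print(has_drifted(graduation_vector, live_vector))  # True
```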

Can we audit the graduation process?

That's what the graduation report is for. Per-member evals, coordination evals, Professor's rubric, narrative summary. It's the document we expect your security review board to read before deployment.

Is this a red team product?

No. This skin is blue team only — detection, triage, enrichment, writeup. Offensive framing is a different product conversation, and one we're not ready to have yet.

Is it open source?

The coordination eval definitions are published in docs/CREW_LIFECYCLE.md. The graduation packager, MCP harness, and the individual member LoRAs are not currently open. If the licensing question matters for your use case, say so on intake.
