The best way to find gaps in an automated broadcast system is simple: run a real race, watch the data that comes out, and ask whether an AI could have told the story a human lived through. This week we did exactly that — and the results reshaped the next phase of development on both the publisher and cloud sides.
This post documents the methodology we now use to validate the Director pipeline against real races: export, analyse, reconstruct, compare, file. It is also the design document behind Race Control issue #344.
The Test Loop
Our validation workflow follows a deliberate cycle:
```
┌─────────────────────────────────────────────────────────────────┐
│                    NARRATIVE VALIDATION LOOP                    │
│                                                                 │
│  1. RACE     Run a real race session with the full stack live   │
│       │                                                         │
│       ▼                                                         │
│  2. EXPORT   Pull all Cosmos containers for that session        │
│       │                                                         │
│       ▼                                                         │
│  3. ANALYSE  Reconstruct the race narrative from raw events     │
│       │                                                         │
│       ▼                                                         │
│  4. COMPARE  The driver tells us what actually happened         │
│       │                                                         │
│       ▼                                                         │
│  5. GAP MAP  Where does the event stream fail the narrative?    │
│       │                                                         │
│       ▼                                                         │
│  6. FILE     Issues on publisher (Director) and cloud           │
│       │                                                         │
│       └──────────────────────────────────────────────▶ REPEAT   │
└─────────────────────────────────────────────────────────────────┘
```
The loop is deliberately cheap. The export script costs a few Cosmos RU and runs in seconds. The analysis is Python over local JSON. The compare step is a conversation. The value is that every real race teaches us something the integration tests could not.
Step 1: Race — Weekend 911 GT3
The session in question was a live iRacing GT3 race: 54 cars, 69 minutes of green-flag racing, run on 16 May 2026. The Director's Publisher Rig was running the iRacing extension in Publisher Mode, posting structured RaceEvent[] to POST /api/telemetry/events at 5 Hz.
The stack was live end-to-end:
- Publisher rig `ff152b80-…` posting events to Race Control
- Cosmos `raceEvents` container receiving them under session `03efc9ef-c2e0-48a8-ac81-4dbd8f0667fd`
- The AI Executor selecting sequences from 17 Planner-generated templates
- The Director App executing sequences against OBS
Everything appeared to be working. The dashboard showed events flowing. Sequences were being served. The broadcast ran.
That is the most dangerous state for a system like this: working and not telling the story look identical from the outside.
Step 2: Export
Immediately after the session we ran the export script against the session ID:
```bash
npx tsx api/scripts/export-session.ts 03efc9ef-c2e0-48a8-ac81-4dbd8f0667fd
```

This pulls every Cosmos container scoped to the session into a local directory — one JSON file per container — and writes a `manifest.json` with counts and RU costs.
| Container | Documents | Size |
|---|---|---|
| raceEvents | 40,993 | 46.8 MB |
| sessionSequences (served to Director) | 128 | 1.9 MB |
| sequenceTemplates (Planner output) | 17 | 95.9 KB |
| storySnapshots | 36 | 65.0 KB |
| sessionCheckins | 1 | 29.9 KB |
| publisherCheckins | 2 | 1.2 KB |
40,993 events across 69 minutes and 54 cars. That sounds rich. The analysis would tell a different story.
Step 3: Analyse
Volume is not signal
The first thing we do with any export is break the event stream down by type. The result for this session:
| Event Type | Count | % |
|---|---|---|
| LAPPED_TRAFFIC_AHEAD | 17,939 | 43.8% |
| BEING_LAPPED | 17,938 | 43.8% |
| LAP_COMPLETED | 3,112 | 7.6% |
| STOPPED_ON_TRACK | 918 | 2.2% |
| Everything else | 1,086 | 2.6% |
| PIT_ENTRY / PIT_EXIT | 0 | 0% |
The top two types together account for 87.6% of all events. 17,938 lapping interactions for 54 cars in 69 minutes works out to 289 per car per hour — roughly one every 12 seconds. Physically, a car can only be lapped a handful of times per race. These detectors fire on every 5 Hz tick while the condition is true, not on the transition into it.
This matters for the AI Director because the Executor's "recent events" query pulls the last N events from raceEvents. If N=50 and 44 of those are LAPPED_TRAFFIC_AHEAD, the model's context window is almost entirely noise.
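The per-type breakdown above is a one-pass tally over the exported JSON. A minimal sketch, assuming the export writes `raceEvents.json` as an array of objects with a `type` field (the path and field name here are assumptions, not the actual script):

```ts
import { readFileSync } from "node:fs";

interface RaceEvent {
  type: string;
}

// Path is illustrative; adjust to wherever export-session.ts wrote the files.
const events: RaceEvent[] = JSON.parse(
  readFileSync("exports/03efc9ef/raceEvents.json", "utf8"),
);

// Tally events per type, then print descending by count.
const counts = new Map<string, number>();
for (const e of events) {
  counts.set(e.type, (counts.get(e.type) ?? 0) + 1);
}

for (const [type, count] of [...counts.entries()].sort((a, b) => b[1] - a[1])) {
  const pct = ((count / events.length) * 100).toFixed(1);
  console.log(`${type.padEnd(24)} ${String(count).padStart(7)}  ${pct}%`);
}
```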
The lap times tell the real story
With the flood events set aside, LAP_COMPLETED events carry remarkably good data. Every event includes:
```json
{
  "lapNumber": 14,
  "lapTime": 78.60,
  "position": 12,
  "classPosition": 4,
  "gapToLeaderSec": 58.3
}
```

Position, class position, and gap to leader on every lap, for every car. Over 58 laps per car. That is a full race trajectory — and we can diff consecutive laps to reconstruct position changes, detect fuel-save laps, and infer pit windows.
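The diffing is mechanical. Here is a sketch that walks one car's `LAP_COMPLETED` payloads (the shape shown above) and reports position changes and missing laps; treating a missing lap number as a probable pit stop is the same signal we lean on below for the pit window:

```ts
// LAP_COMPLETED payload, matching the example above.
interface LapCompleted {
  lapNumber: number;
  lapTime: number;
  position: number;
  classPosition: number;
  gapToLeaderSec: number;
}

// Walk one car's laps in order and report position changes and missing laps.
function positionChanges(laps: LapCompleted[]): string[] {
  const sorted = [...laps].sort((a, b) => a.lapNumber - b.lapNumber);
  const changes: string[] = [];
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    const curr = sorted[i];
    // A hole in the lap sequence is itself a signal: probably a pit stop.
    if (curr.lapNumber !== prev.lapNumber + 1) {
      changes.push(`lap ${prev.lapNumber + 1} missing: probable pit stop`);
    }
    if (curr.position !== prev.position) {
      changes.push(`lap ${curr.lapNumber}: P${prev.position} -> P${curr.position}`);
    }
  }
  return changes;
}
```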
The 3-car pack: reconstructed from data alone
Starting from scratch with no external knowledge of what happened, we identified a three-car pack from the LAP_COMPLETED position data:
- Car #8 (Paul Crofts): P12 at lap 2, steady through lap 17
- Car #26 (Simon Darch): P11 throughout laps 0–17
- Car #19 (Sergey Beams): P13 at lap 2, P12–P13 through lap 17
P11 → P12 → P13. Three cars, same gap for 15 laps of a 26-lap race. The BATTLE_ENGAGED events for all three fire at +0:06 after the green flag, confirming they were in range from the standing start.
At +6:55 (lap 5), an OVERTAKE event fires for Sergey. Car #8 drops from P12 to P13 on the same lap. That is the position swap, and it is unambiguous — even though the event's payload is empty (no competitor car number, no gap value).
The pit stop window
At lap 17 (+22.7 min) Simon is P10. At lap 19 (+27.1 min) Simon is P16. Lap 18 is missing entirely from his LAP_COMPLETED record. That gap is the pit stop. He dropped 6 places and left a blank in the sequence — the only evidence we have of the undercut attempt is an absence of data.
Paul's own in-lap is visible as an anomaly: lap 21 = 85.1 s (8% above his session median of 78.8 s — the fuel-save lap), lap 22 = 104.4 s (the in-lap itself, 32% above median). Sergey's lap 23 = 104.6 s — he pitted exactly one lap after Paul. Post-stop: Sergey P10, Paul P11. The undercut was held.
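That anomaly test is easy to mechanise. A sketch against the session median, with illustrative cutoffs rather than tuned ones:

```ts
// Classify a lap against the session median. The 5% and 25% cutoffs are
// illustrative; the manual analysis used the same idea by eye.
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function classifyLap(lapTime: number, sessionMedian: number): string {
  const delta = (lapTime - sessionMedian) / sessionMedian;
  if (delta > 0.25) return "probable in-lap or incident";
  if (delta > 0.05) return "off-pace (fuel save?)";
  return "normal";
}
```

With Paul's median of 78.8 s, `classifyLap(104.4, 78.8)` lands in the in-lap bucket and `classifyLap(85.1, 78.8)` in the off-pace bucket, matching the manual read.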
Step 4: Compare
After the analysis we asked the driver of car #8 what actually happened:
Qualified P11 average, P8 in class. Made a couple of early places from some crashes that put me in the top 6. Raced Simon (#26) with Sergey (#19) keeping tight — gap less than 1 second, one more car right behind Sergey. We had a little pack of cars in the first phase of the race.
Simon was falling behind — his gap to the car in front was getting to 1.4 seconds, out of draft. I started to push to close on him and raced 3–4 laps within 0.5 seconds. Tight, dramatic racing in a pack.
I made a small mistake allowing Sergey through. It was like that to the pit stop, where Simon attempted an early undercut 5 laps out. I waited until 1 lap of fuel remaining and pitted to attempt the counter-undercut with a better stop. I nailed the stop but Sergey held the place in dramatic fashion coming out of the pits. There were multiple lapped cars causing drama. The race was amazing.
We mapped every sentence against the events.
Step 5: Gap Map
Here is what was and was not evident in the event stream:
| Narrative beat | Evidence | Verdict |
|---|---|---|
| "Tight 3-car pack at the start" | BATTLE_ENGAGED for all three at +0:06; positions confirm P11/P12/P13 | ✅ |
| "Less than 1 second gap for 3–4 laps within 0.5s" | Battle events confirm range; gap values not present | ⚠️ |
| "Small mistake — let Sergey through" | OVERTAKE event at +6:55; position drop matches | ✅ |
| "Simon falling behind, gap opening to 1.4s" | BATTLE_BROKEN for Simon at +16.1 and +24.1 min | ✅ |
| "Simon's early undercut ~5 laps before me" | Missing lap 18 for Simon; drop from P10 to P16 | ⚠️ inferred from absence |
| "Waited for 1 lap of fuel, nailed the stop" | Lap 21 = 85.1 s slow; lap 22 = 104.4 s | ⚠️ inferred from lap-time anomaly |
| "Sergey held the place at pit exit" | Sergey lap 23 = 104.6 s; post-stop Sergey P10, Paul P11 | ⚠️ inferred from coincidence |
| "Drama in lapped-car traffic at pit exit" | LAPPED_TRAFFIC_AHEAD floods uniformly every minute | ❌ lost in noise |
| "Early crashes promoted me" | Pre-green STOPPED_ON_TRACK flood; no post-green incidents | ❌ invisible |
Seven out of nine beats are technically recoverable — but most of them require a human to connect dots across multiple events with domain knowledge. The Executor, running every 10–30 seconds with a truncated event window, would not connect them.
Three issues dominate the gaps:
1. **Pit events are completely absent.** `PIT_ENTRY`/`PIT_EXIT`: 0 events in 69 minutes across 54 cars. We inferred the pit stop from a 32% lap-time anomaly. The AI has to do the same, which is fragile: a safety car, an off-track excursion, or a mechanical issue produces a similar signature.

2. **Battle and overtake payloads are empty.** The `OVERTAKE` event for Sergey looks like this in Cosmos:

   ```json
   {
     "type": "OVERTAKE",
     "car": { "carNumber": "19", "driverName": "Sergey Beams" },
     "payload": {}
   }
   ```

   No passed car number. No gap. The event confirms something happened to Sergey but not what or to whom. We inferred the victim (Paul) from a position change on the same lap — pure coincidence it was unambiguous with 54 cars.

3. **State-based detectors flood the stream.** The `LAPPED_TRAFFIC_AHEAD` count for car #8 in the minute of his pit entry: 24 events in 60 seconds. Thirty seconds after he exits: 24 more. These fired all race long. We cannot tell whether there was an extraordinary lapped-car encounter at pit exit or just the usual background noise, because both look identical.
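The fix for the third issue is conceptually small: emit on the transition into a condition rather than on every tick while it holds. A minimal sketch of that edge-triggering, as the shape of the publisher change rather than its actual implementation:

```ts
// Edge-triggered emission: remember the last state per (car, condition) and
// report true only on the false -> true transition, not on every 5 Hz tick.
const lastState = new Map<string, boolean>();

function risingEdge(carNumber: string, condition: string, active: boolean): boolean {
  const key = `${carNumber}:${condition}`;
  const was = lastState.get(key) ?? false;
  lastState.set(key, active);
  return active && !was;
}
```

The first call for a given (car, condition) pair returns true; every subsequent tick while the condition holds returns false, so 24 events per minute collapse to one.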
These publisher-side issues are documented in detail in Director issue #196.
Step 6: What We're Building on the Cloud Side
The publisher fixes will close most of the data gaps, but there is a parallel roadmap of improvements that Race Control can make right now, before the publisher changes land. These are the core of Race Control issue #344.
Persist raceContext snapshots
Today the Director sends raceContext.drivers[] — authoritative positions, gaps, pit status — on every /sequences/next poll. We use it for the current LLM call and throw it away. Persisting a slimmed snapshot per poll into a new raceContextSnapshots container unlocks everything below. It is the foundational change.
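For concreteness, a sketch of what a slimmed snapshot document could look like. `onPitRoad` and `gapAheadSec` are fields the publisher already sends in `raceContext`; the surrounding shape is an assumption:

```ts
// Sketch of a slimmed per-poll document for the proposed raceContextSnapshots
// container. Field names beyond onPitRoad and gapAheadSec are assumptions.
interface DriverSnapshot {
  carNumber: string;
  driverName: string;
  position: number;
  classPosition: number;
  lapNumber: number;
  gapAheadSec: number; // gap to the car one position ahead, in seconds
  onPitRoad: boolean;
}

interface RaceContextSnapshot {
  id: string;          // e.g. `${sessionId}:${capturedAt}` (illustrative)
  sessionId: string;   // assumed partition key
  capturedAt: string;  // ISO timestamp of the /sequences/next poll
  drivers: DriverSnapshot[];
}
```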
Derive what the publisher isn't sending
With consecutive snapshots, Race Control can emit its own events with source: "cloud-derived":
- `PIT_ENTRY`/`PIT_EXIT` — from `onPitRoad` state transitions in snapshots
- `PIT_UNDERCUT_ATTEMPT`/`PIT_UNDERCUT_HELD`/`PIT_UNDERCUT_LOST` — from cross-referencing who pitted and in what order
- `GAP_CLOSING`/`GAP_OPENING` — from `gapAheadSec` deltas over ≥3 laps
- `POSITION_CHANGE` — with the competitor car number from the snapshot diff
- `STINT_PACE_DROP` — from rolling median lap time (the fuel-save signal)
The undercut narrative — Simon's early stop, Paul's counter, Sergey holding — would have produced seven derived events that frame the entire pit window story. All from position and gap diffs, no publisher changes required.
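As an example of the derivation, a sketch of the `PIT_ENTRY`/`PIT_EXIT` detector over two consecutive snapshots. It assumes the snapshot shape sketched earlier; the event envelope is hypothetical:

```ts
// Minimal slice of the snapshot shape sketched earlier.
interface PitSnapshot {
  capturedAt: string;
  drivers: Array<{ carNumber: string; onPitRoad: boolean }>;
}

// Hypothetical envelope for cloud-derived events.
interface DerivedPitEvent {
  type: "PIT_ENTRY" | "PIT_EXIT";
  carNumber: string;
  source: "cloud-derived";
  at: string;
}

// Compare onPitRoad between two consecutive snapshots and emit an event on
// each transition. No publisher changes required.
function derivePitEvents(prev: PitSnapshot, curr: PitSnapshot): DerivedPitEvent[] {
  const before = new Map(prev.drivers.map((d) => [d.carNumber, d.onPitRoad]));
  const events: DerivedPitEvent[] = [];
  for (const d of curr.drivers) {
    const was = before.get(d.carNumber);
    if (was === undefined || was === d.onPitRoad) continue; // no transition
    events.push({
      type: d.onPitRoad ? "PIT_ENTRY" : "PIT_EXIT",
      carNumber: d.carNumber,
      source: "cloud-derived",
      at: curr.capturedAt,
    });
  }
  return events;
}
```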
Focus drivers and rivals
The focusedCarNumber field on raceContext is already used in the Executor prompt, but it is ephemeral. We will persist a focusDrivers list per session and auto-detect rivals — any car within 2.0 s of a focus driver for ≥5 consecutive snapshots. Rivals get promoted in the Executor's priority stack. Every event involving a rival of the focus driver is more important than a random field event.
In the reference race, cars #19 and #26 would have been flagged as Paul's rivals from lap 2 onward. The OVERTAKE event for Sergey — today invisible in a sea of unattributed events — would have been instantly elevated to a high-priority RIVAL_POSITION_CHANGE arc.
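A sketch of that rival rule, using the 2.0 s and 5-snapshot thresholds from above. Approximating car-to-car gaps by accumulating `gapAheadSec` down the running order is an assumption about the snapshot data:

```ts
// Minimal slice of the snapshot shape sketched earlier.
interface GapSnapshot {
  drivers: Array<{ carNumber: string; position: number; gapAheadSec: number }>;
}

const GAP_THRESHOLD_SEC = 2.0; // "within 2.0 s of a focus driver"
const STREAK_REQUIRED = 5;     // "for >=5 consecutive snapshots"

const streaks = new Map<string, number>();
const rivals = new Set<string>();

// Approximate each car's gap to the leader by accumulating gapAheadSec down
// the running order (assumes the leader's gapAheadSec is 0).
function gapsToLeader(snapshot: GapSnapshot): Map<string, number> {
  const sorted = [...snapshot.drivers].sort((a, b) => a.position - b.position);
  const gaps = new Map<string, number>();
  let cumulative = 0;
  for (const d of sorted) {
    cumulative += d.gapAheadSec;
    gaps.set(d.carNumber, cumulative);
  }
  return gaps;
}

// Called once per persisted snapshot; promotes cars that stay in range.
function updateRivals(snapshot: GapSnapshot, focusCar: string): void {
  const gaps = gapsToLeader(snapshot);
  const focusGap = gaps.get(focusCar);
  if (focusGap === undefined) return;
  for (const [car, gap] of gaps) {
    if (car === focusCar) continue;
    const inRange = Math.abs(gap - focusGap) <= GAP_THRESHOLD_SEC;
    const streak = inRange ? (streaks.get(car) ?? 0) + 1 : 0;
    streaks.set(car, streak);
    if (streak >= STREAK_REQUIRED) rivals.add(car);
  }
}
```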
Narrative arcs instead of raw events
The biggest shift is changing what the Executor sees. Today it gets a flat list of recent raw events. Going forward, a deterministic arc-detector will maintain a set of named arcs per session:
| Arc | Active during the reference race |
|---|---|
| closing_for_battle | Laps 1–5, Paul→Simon; laps 13–17, Paul→Simon (again) |
| defending_under_pressure | Laps 1–5, Paul vs. Sergey |
| pit_window_open | +16 min onward |
| undercut_in_progress | Simon's stop to Paul's stop (+22–29 min) |
| undercut_held | Post-stop resolution |
| lapped_traffic_drama | Paul's out-lap |
Each arc carries a human-readable context.summary — e.g. "Crofts (#8) closing on Darch (#26), gap 1.2s, -0.4s/lap, lap 14 of 26" — that becomes the Executor's narrative spine. The LLM is no longer asked to invent the story; it is asked to film the one the arc engine has already identified.
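For illustration, a sketch of the arc object the Executor could consume, plus a summary builder that reproduces the example format. The exact shape is an assumption:

```ts
// Hypothetical arc object; names come from the table above.
type ArcType =
  | "closing_for_battle"
  | "defending_under_pressure"
  | "pit_window_open"
  | "undercut_in_progress"
  | "undercut_held"
  | "lapped_traffic_drama";

interface NarrativeArc {
  type: ArcType;
  carNumbers: string[];
  startedAtLap: number;
  context: { summary: string };
}

// Builds the human-readable spine for a closing_for_battle arc, matching the
// example format above.
function closingSummary(
  chaser: { name: string; car: string },
  target: { name: string; car: string },
  gapSec: number,
  gapDeltaPerLap: number, // negative when the gap is shrinking
  lap: number,
  totalLaps: number,
): string {
  return (
    `${chaser.name} (#${chaser.car}) closing on ${target.name} (#${target.car}), ` +
    `gap ${gapSec.toFixed(1)}s, ${gapDeltaPerLap.toFixed(1)}s/lap, lap ${lap} of ${totalLaps}`
  );
}
```

Calling `closingSummary({ name: "Crofts", car: "8" }, { name: "Darch", car: "26" }, 1.2, -0.4, 14, 26)` yields the summary quoted above.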
An event enricher at the boundary
Before any of the above runs, an event enricher at the POST /api/telemetry/events handler will:
- Drop pre-green `STOPPED_ON_TRACK` noise (phase-tagging)
- Deduplicate the `LAPPED_TRAFFIC_AHEAD` flood (collapse identical state events within a 30 s window)
- Backfill empty `OVERTAKE` and `BATTLE` payloads from the most recent context snapshot
This alone would have reduced today's 40,993-event stream to roughly 5,000 events with meaningfully higher signal density.
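The deduplication rule is the simplest of the three and worth sketching. Hypothetical handler-side logic that collapses identical (car, type) state events inside a 30-second window:

```ts
// Keep the first event of each (car, type) pair per 30 s window, drop the rest.
const DEDUP_WINDOW_MS = 30_000;
const lastSeen = new Map<string, number>(); // `${carNumber}:${type}` -> timestamp

function shouldKeep(carNumber: string, type: string, tsMs: number): boolean {
  const key = `${carNumber}:${type}`;
  const prev = lastSeen.get(key);
  if (prev !== undefined && tsMs - prev < DEDUP_WINDOW_MS) return false;
  lastSeen.set(key, tsMs);
  return true;
}
```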
The Metric We Were Missing
After a broadcast ends today we have no way to answer: did we tell a good story? We know sequences were served — 128 in today's session — but we do not know whether they were focused on the right cars, whether they captured the battles as they happened, or whether the pit window produced compelling coverage.
We will build a narrative coverage metric computed at session end:
- For each focus driver: what % of laps had an active arc?
- What % of served sequences referenced the focus driver?
- Which arc types fired, and which resolved positively vs. negatively for the focus car?
We will surface this in the Director UI as a post-mortem panel. The target for the next reference race is to look at that panel and recognise the race story in the numbers.
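As a sketch of what that panel would consume, a hypothetical report shape that maps directly onto the three questions above:

```ts
// Hypothetical session-end coverage report: one entry per focus driver plus
// an outcome tally per arc type.
interface NarrativeCoverageReport {
  sessionId: string;
  focusDrivers: Array<{
    carNumber: string;
    lapsWithActiveArcPct: number;          // % of laps with at least one active arc
    sequencesReferencingDriverPct: number; // % of served sequences featuring them
  }>;
  arcOutcomes: Array<{
    arcType: string;
    fired: number;
    resolvedPositively: number;
    resolvedNegatively: number;
  }>;
}
```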
The Feedback Loop in Practice
The point of this methodology is not that any single race teaches us the perfect set of fixes. It is that every race is a cheap, high-fidelity test of the full system that a synthetic test harness cannot replicate. Real iRacing incidents are weird. Real pit strategies diverge. Real traffic is non-deterministic.
The loop we ran this week took about 90 minutes from session end to filed issues:
- Ran `export-session.ts` — 4 minutes, 1,613 Cosmos RU
- Ran analysis scripts — 3 minutes
- Driver described what happened — 5 minutes of conversation
- Gap map — 30 minutes of structured comparison
- Filed Director #196 and Race Control #344 — 45 minutes
The next time we run this loop, we will have the replay tooling (replay-session.ts) to run derived event detection against the same export offline — so we can validate the new detectors against a race where we already know the ground truth before shipping them live.
Test. Export. Analyse. Compare. File. Repeat.
That is the engineering loop that will close the gap between technically running and actually broadcasting.