Citation chains: the trust UI for AI-drafted postmortems

2026-05-27 · ~5 minute read

The trust problem with AI-drafted postmortems

If an AI drafts a paragraph of your postmortem that says “the deploy at 09:32 triggered the cascading worker OOM,” how do you know it’s true?

That’s not a rhetorical question. AI text, even from frontier models, fabricates names, timestamps, and causal chains regularly, sometimes more than once within a single paragraph. A postmortem is a load-bearing document: it’s how the team agrees on what happened, who did what, and what changed because of it. If the AI invents a participant or shifts a timestamp by ten minutes, the document isn’t a postmortem any more. It’s fiction with the right shape.

The standard cloud-AI postmortem-drafting interface doesn’t help you tell the difference. You see drafted text. It reads plausibly. You don’t know which lines in the Slack export, the PagerDuty dump, or the log file each claim rests on. You either trust the whole draft or you don’t.

This is the problem citation chains solve.

What is a citation chain?

A citation chain is a verifiable path from any sentence in the drafted postmortem to the original source line that justifies it. Two clicks. No interpretation.

In IncidentScribe:

Each drafted section (Summary, Timeline, Root Cause, Contributing Factors, Action Items) carries metadata about which timeline events it cites.
Each timeline event carries metadata about which slice of the source (Slack message, log line, PagerDuty payload) it was extracted from.
Clicking any drafted claim opens the Citations inspector, which lists the cited events. Clicking any event opens the source slice with the actor’s name highlighted.

There’s no AI interpretation between the click and the source. The citation links are stored as stable IDs, not as paraphrases. The same drafted claim that links to event 14, slice 7 today will still link to event 14, slice 7 in six months when the new on-call engineer pulls up the postmortem trying to figure out what actually happened.
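One way to picture the chain is as three record types joined by stable IDs. The sketch below is an illustration under assumptions, not IncidentScribe’s actual storage model: the class names, field names, ID formats, and sample Slack line are all hypothetical. It shows that following a claim to its source is two dictionary lookups, with no model in between.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSlice:
    slice_id: str   # stable ID, never a paraphrase
    origin: str     # e.g. "slack", "pagerduty", "log"
    text: str       # the raw source text, stored verbatim


@dataclass(frozen=True)
class TimelineEvent:
    event_id: str
    slice_id: str   # which slice this event was extracted from
    actor: str
    timestamp: str
    summary: str


@dataclass(frozen=True)
class DraftedClaim:
    text: str
    cited_event_ids: tuple[str, ...]


def resolve_chain(claim, events, slices):
    """Follow claim -> cited events -> source slices using ID lookups only."""
    chain = []
    for event_id in claim.cited_event_ids:
        event = events[event_id]          # click one: claim to event
        source = slices[event.slice_id]   # click two: event to source slice
        chain.append((event, source))
    return chain


# Hypothetical sample data, invented for illustration.
slices = {
    "slice-7": SourceSlice("slice-7", "slack", "09:32 alice: deploying worker build to prod"),
}
events = {
    "event-14": TimelineEvent("event-14", "slice-7", "alice", "09:32", "Deploy started"),
}
claim = DraftedClaim(
    "The deploy at 09:32 triggered the cascading worker OOM.",
    ("event-14",),
)

for event, source in resolve_chain(claim, events, slices):
    print(event.event_id, "->", source.slice_id, "|", source.text)
```

Because the links are IDs and the records are immutable, resolving the same claim later returns the same event and the same slice, as long as the stored records are retained.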

The two-click invariant

The invariant the product holds is: any sentence the drafter wrote must be two clicks from its source.

That product constraint drives every architectural decision underneath it. It means:

The drafter can’t paraphrase events into the timeline. The timeline is the structured truth; the drafter writes prose against it.
The timeline can’t include events whose actor or timestamp didn’t appear verbatim in the source. Anything fabricated by the extraction pass gets dropped before reconciliation.
The Citations inspector can’t show “approximately this is where the claim came from.” Either there’s a specific cited event, or the section gets flagged as ungrounded.

When the chain breaks — when a section is drafted from too-sparse cited events, or no cited events at all — the section gets a visible warning chip. You see it in the draft UI before you sign off. You don’t have to find it; the broken chain finds you.
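The check behind a warning chip can be very small. Here is a minimal sketch, assuming a section stores the IDs of the events it cites. The `Section` class, the status names, and the threshold of two events are assumptions made for illustration; the product’s real rule for “too sparse” isn’t stated here.

```python
from dataclasses import dataclass, field

MIN_CITED_EVENTS = 2  # assumed threshold for "too sparse"; not taken from the product


@dataclass
class Section:
    name: str
    prose: str
    cited_event_ids: list = field(default_factory=list)


def grounding_status(section, known_event_ids, min_events=MIN_CITED_EVENTS):
    """Classify a section by how many of its citations resolve to real events."""
    resolved = [eid for eid in section.cited_event_ids if eid in known_event_ids]
    if not resolved:
        return "ungrounded"   # no cited event resolves: the chain is broken
    if len(resolved) < min_events:
        return "sparse"       # some evidence, but thin: show a warning chip
    return "grounded"


# Hypothetical sample data, invented for illustration.
known = {"event-12", "event-14", "event-15"}
sections = [
    Section("Summary", "...", ["event-12", "event-14"]),
    Section("Root Cause", "...", ["event-14"]),
    Section("Action Items", "...", ["event-99"]),  # cites an ID that does not exist
]

flagged = {s.name: grounding_status(s, known) for s in sections}
print(flagged)
```

Note that a citation to an unknown ID counts for nothing: only citations that resolve to a stored event contribute to the status.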

Why the drafter never sees raw text

A subtle but load-bearing decision: the drafter — the model that writes the prose — never sees the raw incident text. It only sees the validated, reconciled timeline.

This matters because the drafter is the highest-risk component for fabrication. Models that read raw text and write prose tend to confabulate, especially when the text is long and the prose is short. If the drafter never reads raw text, it can’t fabricate from it. It can only write prose against the structured timeline — which has already been validated against the source.

The pipeline looks like this:

1. Chunk: natural-boundary slicing of the raw input.
2. Extract: schema-constrained event extraction from each slice. Every event has a typed actor, typed timestamp, typed summary.
3. Validate: every event traced back to a specific slice. Events whose actor or timestamp can’t be located in the source get dropped.
4. Reconcile: per-slice timelines merged into a deduplicated chronological master.
5. Draft: model writes prose against the master timeline, with each claim citing the events that justify it.

Steps 1, 3, and 4 (Chunk, Validate, Reconcile) are pure code. Steps 2 and 5 (Extract, Draft) are the only places a model touches text. Step 2’s output is schema-constrained (it can only emit timeline-event records with the typed fields). Step 5’s input is the structured timeline (no raw text).
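To make “pure code” concrete, here is a rough sketch of what the Validate and Reconcile steps could look like. It is an illustration, not the shipped implementation: the `Event` fields, the plain substring check, the exact-match deduplication key, and the sample messages are all assumptions. It also assumes zero-padded `HH:MM` timestamps from a single day, so that string order matches chronological order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    slice_id: str
    actor: str
    timestamp: str   # as written in the source, e.g. "09:32"
    summary: str


def validate(events, slices):
    """Keep only events whose actor and timestamp appear verbatim in their slice."""
    kept = []
    for ev in events:
        text = slices.get(ev.slice_id, "")
        if ev.actor in text and ev.timestamp in text:
            kept.append(ev)
    return kept


def reconcile(events):
    """Merge per-slice events into one deduplicated, chronological list."""
    seen = set()
    master = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.timestamp, ev.actor, ev.summary)  # naive exact-match dedup
        if key not in seen:
            seen.add(key)
            master.append(ev)
    return master


# Hypothetical sample data, invented for illustration.
slices = {
    "s1": "09:32 alice: starting deploy of worker build",
    "s2": "09:41 bob: workers are OOMing",
}
extracted = [
    Event("s2", "bob", "09:41", "Workers reported OOM"),
    Event("s1", "alice", "09:32", "Deploy started"),
    Event("s1", "carol", "09:35", "Rollback approved"),  # not in the source: dropped
    Event("s1", "alice", "09:32", "Deploy started"),     # duplicate: merged
]

master = reconcile(validate(extracted, slices))
for ev in master:
    print(ev.timestamp, ev.actor, ev.summary)
```

A real validator would need something stricter than a substring test (a short actor name could match inside a longer word), but the shape is the point: neither step calls a model, so neither step can invent anything.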

This is what makes the citation chain reliable. It’s not a UI feature bolted on after the model wrote prose. It’s the architecture: the drafter literally cannot make a claim that isn’t grounded in a cited event, because cited events are the only thing it sees.
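The quarantine can be sketched as a function that builds the drafter’s input from structured fields only, plus a check on what comes back. Everything named here is hypothetical: the dictionary keys, the JSON payload shape, and the two helper functions are assumptions for illustration, and no real model API is called.

```python
import json


def build_drafter_input(master_timeline):
    """Serialise structured event fields only; raw incident text is left out."""
    payload = [
        {
            "event_id": ev["event_id"],
            "timestamp": ev["timestamp"],
            "actor": ev["actor"],
            "summary": ev["summary"],
        }
        for ev in master_timeline
    ]
    return json.dumps(payload, indent=2)


def unknown_citations(cited_ids, master_timeline):
    """Return cited IDs that do not correspond to any event in the timeline."""
    known = {ev["event_id"] for ev in master_timeline}
    return sorted(set(cited_ids) - known)


# Hypothetical sample data; "raw" stands in for the stored source slice text.
master_timeline = [
    {
        "event_id": "event-14",
        "timestamp": "09:32",
        "actor": "alice",
        "summary": "Deploy started",
        "raw": "09:32 alice: kicking off deploy now",
    },
]

drafter_input = build_drafter_input(master_timeline)
print(drafter_input)

# Suppose the drafter's output cited these IDs.
print(unknown_citations(["event-14", "event-99"], master_timeline))
```

The second function is a belt-and-braces check: even with constrained input, a cited ID that matches no timeline event can be rejected mechanically before the draft reaches a reader.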

What this looks like in the UI

Click any sentence in the drafted Summary. The Citations inspector mounts on the right. It lists the timeline events that justify that section — typically two or three. Click one. The source pane opens to the slice that event was extracted from, with the actor’s name highlighted in the original log line. The full Slack thread (or PagerDuty payload, or raw log block) is visible around it.

For the Timeline section specifically: the timeline is rendered deterministically from the reconciled events. The model doesn’t write it as prose; the renderer formats it from the structured data. That’s a hard guarantee: the rendered Timeline is a pure function of the reconciled events, so the same events always produce the same output, byte for byte. The drafter only sees Timeline output once it’s been rendered, and only uses it as input to the other four sections.
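A deterministic renderer of this kind is just a sort and a format string. The sketch below assumes events are dictionaries with `event_id`, `timestamp`, `actor`, and `summary` keys and invents its own line layout; none of that is IncidentScribe’s actual format.

```python
def render_timeline(events):
    """Format events into text; the same events always give the same string."""
    ordered = sorted(events, key=lambda e: (e["timestamp"], e["event_id"]))
    lines = [
        f'{ev["timestamp"]}  {ev["actor"]}: {ev["summary"]} [{ev["event_id"]}]'
        for ev in ordered
    ]
    return "\n".join(lines)


# Hypothetical sample data, invented for illustration.
events = [
    {"event_id": "event-15", "timestamp": "09:41", "actor": "bob", "summary": "Workers reported OOM"},
    {"event_id": "event-14", "timestamp": "09:32", "actor": "alice", "summary": "Deploy started"},
]

rendered = render_timeline(events)
print(rendered)
```

Sorting on `(timestamp, event_id)` rather than on timestamp alone means two events with the same timestamp still come out in a fixed order, so the output doesn’t depend on the order the events arrived in.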

When a section is drafted against sparse evidence — when there aren’t enough cited events to support the prose — a “low confidence” chip appears next to the section header. The reader can click through to see exactly which (few) events the section was drafted from. Sparse-evidence sections still appear in the draft; they just appear visibly flagged so the reader knows to fact-check more carefully.

Why cloud incident-management vendors didn’t ship citation chains

Cloud incident-management vendors that shipped AI postmortem drafting in 2024–2026 didn’t ship citation chains. They shipped LLM summaries.

The reason isn’t a design oversight. It’s that citation chains require the pipeline above — the chunk/extract/validate/reconcile/draft architecture with the drafter quarantined from raw text. Retrofitting that into a one-shot “summarise this Slack channel” call is more invasive than it looks. You’d have to:

Rebuild the extraction pass as schema-constrained.
Add the validation step.
Wire the citation IDs through the storage layer.
Rebuild the draft UI to expose them.
Convince the drafter to actually use them (most LLMs will happily generate “as documented in [3]” without the [3] mapping to anything real, unless you constrain the input to make that impossible).

That’s two or three quarters of ship-blocking platform work on top of an already-launched product. The market reward is opaque: the buyers who’d value it most are also the ones least likely to use the cloud product in the first place. So it didn’t happen.

When you’re building from scratch for the buyer slice that’s locked out of cloud AI anyway, the math is different. There’s no launched product to retrofit. The starting point is “every drafted claim must be two clicks from its source,” and the architecture follows from that constraint.

What you actually trust about the draft

A postmortem drafted by IncidentScribe is a structured assertion: given these source artefacts, these timeline events were extracted; given those events, these claims hold. You can verify the chain at every link. If the on-call engineer who lived the incident reads the draft and says “wait, that wasn’t what happened,” they can click through to the source line that justified the claim and either confirm the model misread it, or update their own memory. Either way, the conversation is grounded.

That’s the trust UI. It’s not “trust the AI.” It’s “verify the chain.”
