Executive summary:
A single fiber cut can trigger a cascading failure pattern that looks like multiple unrelated incidents: interface down, LOS alarms, routing adjacency flaps, BGP resets, congestion, latency, packet loss, and customer tickets far from the physical cut.
This scenario compares two AIOps approaches:
- LLM-based AIOps: strong at summarizing and clustering text-based alarms and tickets.
- GNN-based AIOps: built to reason over network topology and dependency graphs, where failures propagate along edges.
The operational outcome is a large gap in mean time to repair (MTTR): roughly 60 to 180 minutes for the LLM approach vs 5 to 15 minutes for the GNN approach.
The incident pattern: why fiber cuts create “domino” outages
A fiber cut is a physical-layer event, but the impact appears across layers and across geography:
- Transport interfaces drop (LOS).
- Routing adjacencies flap.
- BGP sessions reset.
- Traffic reroutes onto alternate paths.
- Congestion appears in places far from the cut.
- Customer-facing services degrade.
- Alert storms follow (latency, packet loss, application monitors).
The network operations center (NOC) challenge is not detecting that "something is wrong." The challenge is identifying the initiating event quickly, proving causality, and choosing the next best action.
Side-by-side timeline: LLM behavior vs GNN behavior
T+0s: Physical cut occurs
- Network reality: optical signal drops on one span.
- LLM behavior: no action until logs/alarms arrive.
- GNN behavior: detects physical-layer edge state change on the dependency graph.
- Outcome: incident begins.
T+10s: Interface down and LOS alarms fire
- Network reality: transport gear reports interface down.
- LLM behavior: ingests alarms/log lines; starts summarizing “interface down.”
- GNN behavior: anchors the first causal event to a specific edge (fiber/link object) in the graph.
- Outcome: candidate root cause is established early.
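Anchoring the earliest causal event can be sketched as a simple time-plus-layer ordering: the earliest event wins, and a physical-layer alarm outranks a logical symptom with the same timestamp. The event kinds and object names below are illustrative assumptions, not a real alarm schema:

```python
# Hedged sketch: anchor the earliest causal event to a specific graph object.
# Event kinds, ranks, and object names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Event:
    ts: float   # seconds since incident start
    kind: str   # e.g. "LOS", "IF_DOWN", "BGP_RESET"
    obj: str    # graph object the event is attached to

def anchor_root_candidate(events):
    """Return the earliest, lowest-layer event as the root-cause anchor."""
    # Physical-layer events outrank logical symptoms at the same instant.
    layer_rank = {"LOS": 0, "IF_DOWN": 1, "ADJ_FLAP": 2, "BGP_RESET": 3}
    return min(events, key=lambda e: (e.ts, layer_rank.get(e.kind, 99)))

events = [
    Event(30, "BGP_RESET", "bgp:r1-r7"),
    Event(10, "IF_DOWN", "if:r1/et-0/0/1"),
    Event(10, "LOS", "link:span-42"),
]
print(anchor_root_candidate(events).obj)  # prints link:span-42
```

The tie-break matters: interface-down and LOS often arrive in the same polling interval, and only the layer ordering keeps the anchor on the fiber object rather than a router symptom.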
T+30s: Routing adjacencies flap; BGP sessions reset
- Network reality: multiple routers show adjacency flaps and resets.
- LLM behavior: sees many BGP resets; may infer “BGP instability” as the primary issue.
- GNN behavior: traces propagation: fiber edge -> interface -> routing adjacency -> impacted prefixes/paths.
- Outcome: separates cause from downstream symptoms.
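The propagation trace above (fiber edge -> interface -> adjacency -> prefixes) can be sketched as a breadth-first walk over a dependency graph. The topology and object names below are made-up illustrations, not a real inventory:

```python
# Hedged sketch: trace propagation from a failed fiber edge through a
# dependency graph. Everything downstream of the root is a symptom.
from collections import deque

# Illustrative adjacency: object -> objects that depend on it.
DEPENDS_ON = {
    "link:span-42": ["if:r1/et-0/0/1", "if:r7/et-0/0/3"],
    "if:r1/et-0/0/1": ["adj:ospf:r1-r7"],
    "if:r7/et-0/0/3": ["adj:ospf:r1-r7"],
    "adj:ospf:r1-r7": ["bgp:r1-r7"],
    "bgp:r1-r7": ["prefix:203.0.113.0/24"],
}

def trace_propagation(root):
    """BFS from the failed object to every dependent object (its blast radius)."""
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for dep in DEPENDS_ON.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen - {root}

impacted = trace_propagation("link:span-42")
```

Every object returned by the walk is evidence for classifying a later alarm as downstream of the cut rather than an independent fault.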
T+1m: Traffic shifts; congestion appears
- Network reality: reroutes create congestion on alternate paths.
- LLM behavior: reads congestion alerts; may suggest capacity or DDoS as competing hypotheses.
- GNN behavior: computes path re-optimization impact; identifies which reroutes are consequences of the cut.
- Outcome: avoids misclassification as a “capacity event.”
T+2m: Customer-facing services degrade far from the cut
- Network reality: metros distant from the cut see degradation due to shared backbone dependency.
- LLM behavior: summarizes tickets and alarms; produces top 3 to 5 plausible causes.
- GNN behavior: maps service impact to dependency paths; explains why distant metros fail.
- Outcome: deterministic blast radius narrative.
T+3m: Alert storm across domains
- Network reality: interface, BGP, latency, packet loss, application monitors.
- LLM behavior: clusters alerts by text similarity; may split into multiple “incidents.”
- GNN behavior: collapses storm into one incident keyed to the earliest causal edge.
- Outcome: reduced noise and a single root cause thread.
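Collapsing the storm into one incident can be sketched as walking each alerting object up its dependency chain to a physical-layer origin; alerts sharing an origin fold into one incident. The parent mapping and names are illustrative assumptions:

```python
# Hedged sketch: collapse an alert storm into one incident keyed to its
# causal origin. PARENT maps each object to the object it depends on;
# all names are illustrative, not a real product API.
PARENT = {
    "prefix:203.0.113.0/24": "bgp:r1-r7",
    "bgp:r1-r7": "adj:ospf:r1-r7",
    "adj:ospf:r1-r7": "if:r1/et-0/0/1",
    "if:r1/et-0/0/1": "link:span-42",
}

def causal_root(obj):
    """Walk the dependency chain down to its physical-layer origin."""
    while obj in PARENT:
        obj = PARENT[obj]
    return obj

alerts = ["bgp:r1-r7", "prefix:203.0.113.0/24", "adj:ospf:r1-r7"]
incidents = {causal_root(a) for a in alerts}
print(incidents)  # prints {'link:span-42'}
```

Text-similarity clustering would group the BGP resets separately from the latency alerts; keying on the causal origin merges both into the same thread.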
T+5m: NOC triage begins; multiple teams engaged
- LLM behavior: provides a readable summary; still requires humans to pick the culprit.
- GNN behavior: produces a ranked causal chain with evidence: earliest event plus propagation proof.
- Outcome: faster ownership assignment.
T+7m: Engineers consider mitigations
- Network reality: reroute policy changes, manual traffic engineering.
- LLM behavior: suggests generic mitigations; cannot validate which action will actually address the root cause.
- GNN behavior: simulates/validates candidate mitigations against the dependency graph (what will improve, what will not).
- Outcome: prevents ineffective changes.
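Validating a mitigation against the graph can be sketched as a reachability check: remove the cut span, apply the candidate change, and test whether the affected endpoints reconnect. The topology and the proposed spare wavelength are illustrative assumptions:

```python
# Hedged sketch: validate a candidate mitigation by checking whether
# connectivity survives with the cut span removed. Topology is made up.
from collections import deque

def reachable(edges, src, dst):
    """BFS reachability over an undirected edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        n = queue.popleft()
        if n == dst:
            return True
        for m in adj.get(n, ()):
            if m not in seen:
                seen.add(m)
                queue.append(m)
    return False

topology = [("r1", "r7"), ("r1", "r3"), ("r3", "r5")]   # r1-r7 is the cut span
after_cut = [e for e in topology if e != ("r1", "r7")]

# Candidate mitigation: light the spare r5-r7 wavelength.
mitigated = after_cut + [("r5", "r7")]
print(reachable(after_cut, "r1", "r7"))   # prints False: policy tweaks alone cannot help
print(reachable(mitigated, "r1", "r7"))   # prints True: this change restores a path
```

Running the check before the change is what distinguishes "validated mitigation" from "plausible suggestion": a reroute-policy tweak on a partitioned graph would pass review yet change nothing.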
T+10m: Dispatch decision needed
- LLM behavior: cannot confidently justify dispatch vs wait; relies on human judgment.
- GNN behavior: recommends dispatch/repair on the cut span; provides impacted customers list.
- Outcome: decisive action with justification.
T+15m: Temporary reroute; partial recovery
- LLM behavior: summarizes “improving” signals; may attribute improvement incorrectly.
- GNN behavior: confirms recovery aligns with predicted dependency restoration.
- Outcome: validated progress tracking.
T+30m: Fiber repaired; routing converges
- LLM behavior: generates post-incident narrative from logs.
- GNN behavior: confirms causal closure: root edge restored; downstream symptoms resolve in correct order.
- Outcome: clean root cause analysis (RCA) and closure.
T+45m: Postmortem
- LLM behavior: produces plausible postmortem text; may miss topology-specific lessons.
- GNN behavior: outputs explainable causal graph, blast radius, and hardening recommendations (redundancy gaps).
- Outcome: stronger evidence for customer and internal review.
Why this gap exists: networks are graphs, not documents
LLMs are optimized for language. They can summarize what alarms say. But a network incident is a propagation problem across dependencies.
If you want deterministic root cause, you need:
- A dependency graph (links, adjacencies, services)
- Time-ordered event anchoring
- Propagation tracing
- Blast radius computation
- Validation of mitigations against the graph
That is what a GNN is designed to do.
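Blast radius computation is also why distant impact is explainable: two metros far apart can share one backbone span in their dependency paths. A minimal sketch, with a made-up service-to-path mapping:

```python
# Hedged sketch: explain distant-metro degradation via a shared backbone
# dependency. The service-to-path mapping is an illustrative assumption.
SERVICE_PATHS = {
    "svc:metro-east":  ["link:span-7", "link:span-42", "link:span-9"],
    "svc:metro-west":  ["link:span-42", "link:span-11"],
    "svc:metro-south": ["link:span-3"],
}

def blast_radius(cut_link):
    """Every service whose dependency path traverses the cut link."""
    return sorted(s for s, path in SERVICE_PATHS.items() if cut_link in path)

print(blast_radius("link:span-42"))  # prints ['svc:metro-east', 'svc:metro-west']
```

The same lookup answers both triage questions at once: which customers to notify, and why metro-south, despite its own alarms that day, is not part of this incident.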
What to ask when evaluating “LLM AIOps” for networking
- Can it anchor the earliest causal event to a specific link/interface object?
- Can it prove propagation across topology, not just correlate timestamps?
- Can it compute blast radius and explain distant impact via shared dependencies?
- Can it validate mitigations (what will improve, what will not) before changes are made?
NetAI perspective
NetAI GraphIQ is built around GNN-powered, topology-aware reasoning so network teams can move from narrative triage to deterministic RCA.