Executive summary:
This scenario shows a common failure mode in operations: a physical-layer degradation that does not hard-fail, but triggers reroutes and localized overload. The surface symptoms look like congestion, so teams chase capacity upgrades or generic traffic engineering. A topology-aware GNN approach anchors the investigation on the physical evidence, traces the causal chain across the network graph, and drives a targeted fix.
MTTR comparison: LLM-assisted 2 to 6 hours vs GNN-based 15 to 30 minutes.
The setup: capacity vs fault ambiguity
In a metro ISP network, an optical link begins to degrade. CRC errors and FEC corrections rise, but the link stays up. Traffic begins to reroute around the impaired edge. Downstream paths see queue growth, added latency, packet loss, and congestion alarms.
This is exactly the kind of incident that produces confident but wrong narratives:
- “We are out of capacity.”
- “We need to upgrade links.”
- “This is a traffic spike or DDoS.”
The problem is not that these are impossible. The problem is that they are not causally grounded.
Timeline walkthrough (what actually happens)
T+0s: Early physical degradation
- Network reality: Optics begin degrading; CRC errors and FEC corrections rise; link stays up.
- LLM-assisted behavior: Ingests the logs if they are present, but often treats them as low severity.
- GNN-based behavior: Detects rising physical error metrics on a high-centrality edge.
- Outcome: Early risk flag.
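The early risk flag at T+0s can be illustrated with a small sketch. This is not a GNN: it is a hand-written heuristic that combines the two signals named above, a rising physical error trend and a high-centrality edge. The topology, the FEC sample values, the thresholds, and every function name are hypothetical, and the count of shortest paths crossing a link is only a rough stand-in for edge betweenness centrality.

```python
from collections import deque
from itertools import combinations

# Hypothetical topology: a five-node ring (A-B-C-D-E) with a spur node F on A.
LINKS = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A"), ("A", "F")]

# Hypothetical FEC-corrected-error samples per link, one sample per polling interval.
FEC_SERIES = {
    ("A", "B"): [10, 12, 15, 19, 24],  # rising on a heavily used link
    ("C", "D"): [1, 3, 5, 7, 9],       # rising on a lightly used link
    ("E", "A"): [5, 5, 5, 5, 5],       # flat on a heavily used link
}

def build_adjacency(links):
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def shortest_path(adj, src, dst):
    """Breadth-first search; returns one shortest path, or [] if unreachable."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nbr in sorted(adj[node]):
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    if dst not in prev:
        return []
    path = [dst]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]

def edge_usage(links):
    """Count how many node-pair shortest paths cross each link (centrality proxy)."""
    adj = build_adjacency(links)
    usage = {frozenset(link): 0 for link in links}
    for src, dst in combinations(sorted(adj), 2):
        path = shortest_path(adj, src, dst)
        for a, b in zip(path, path[1:]):
            usage[frozenset((a, b))] += 1
    return usage

def slope(samples):
    """Least-squares slope of equally spaced samples."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def early_risk_flags(links, error_series, min_slope=1.0, min_usage=5):
    """Flag links whose error trend is rising AND that many paths depend on."""
    usage = edge_usage(links)
    flags = []
    for link, samples in error_series.items():
        if slope(samples) >= min_slope and usage[frozenset(link)] >= min_usage:
            flags.append(tuple(sorted(link)))
    return sorted(flags)

print(early_risk_flags(LINKS, FEC_SERIES))
```

With these made-up inputs only the A-B link is flagged: C-D is also trending upward but few paths depend on it, and E-A is heavily used but its error rate is flat.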
T+2m: Symptoms look like congestion
- Network reality: Effective throughput drops; queues build on alternate paths.
- LLM-assisted behavior: Sees congestion alarms; labels the incident a “capacity issue.”
- GNN-based behavior: Correlates congestion onset downstream of the impaired edge; flags “fault masquerading as overload.”
- Outcome: Correct classification.
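The classification at T+2m rests on three checks that can be written down explicitly. The sketch below is a rule-based illustration of that reasoning, not the model itself: the function name, the input shapes, and the 5% load tolerance are all assumptions made for the example.

```python
def classify_congestion(degraded_at, reroute_links, congestion_onsets,
                        offered_load_change, load_tolerance=0.05):
    """Decide whether congestion is explained by rerouting around an impaired link.

    degraded_at:          time (s) when physical errors started rising
    reroute_links:        links that carry traffic displaced from the impaired link
    congestion_onsets:    {link: time (s) when congestion started}
    offered_load_change:  fractional change in total offered traffic (0.02 = +2%)
    """
    congested = set(congestion_onsets)
    # 1. Every congested link sits on a path that absorbed rerouted traffic.
    on_reroute = congested <= set(reroute_links)
    # 2. Congestion started after the physical degradation, not before it.
    after_fault = all(t >= degraded_at for t in congestion_onsets.values())
    # 3. Total demand did not grow enough to explain the congestion by itself.
    demand_flat = abs(offered_load_change) <= load_tolerance
    if congested and on_reroute and after_fault and demand_flat:
        return "fault masquerading as overload"
    return "capacity or demand issue not ruled out"

# Hypothetical incident data: degradation at t=0 s, congestion about two minutes later.
verdict = classify_congestion(
    degraded_at=0,
    reroute_links={"B-E", "E-D"},
    congestion_onsets={"B-E": 115, "E-D": 120},
    offered_load_change=0.01,
)
print(verdict)
```

If any check fails, for example when offered load has jumped by 30%, the sketch deliberately refuses to rule out a genuine capacity or demand problem.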
T+5m: Customer experience degrades
- Network reality: Latency and loss increase; customer experience degrades.
- LLM-assisted behavior: Suggests adding capacity, shaping traffic, or investigating a DDoS.
- GNN-based behavior: Traces the dependency chain: impaired edge -> reroute -> congestion points -> impacted services.
- Outcome: Causal chain established.
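The dependency trace at T+5m is, at its core, a graph traversal from the impaired edge to the services it affects. A minimal sketch, assuming a hand-built impact graph with hypothetical node names (a real system would derive this graph from topology and routing state):

```python
from collections import deque

# Hypothetical impact graph: each key points at the things it directly affects.
IMPACT_GRAPH = {
    "link:B-C (impaired)": ["reroute:via E"],
    "reroute:via E": ["queue:B-E", "queue:E-D"],
    "queue:B-E": ["service:business-vpn"],
    "queue:E-D": ["service:business-vpn", "service:residential-broadband"],
}

def trace_impact(graph, root):
    """Breadth-first traversal; returns {node: shortest causal path from root}."""
    paths = {root: [root]}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in paths:
                paths[child] = paths[node] + [child]
                queue.append(child)
    return paths

paths = trace_impact(IMPACT_GRAPH, "link:B-C (impaired)")
for node, path in paths.items():
    if node.startswith("service:"):
        print(" -> ".join(path))
```

Each printed line is one causal chain in the form used above: impaired edge, then reroute, then congestion point, then impacted service.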
T+8m: Operators consider traffic engineering
- Network reality: Team considers TE changes to reduce impact.
- LLM-assisted behavior: Recommends generic TE steps; cannot validate which of them addresses the root cause.
- GNN-based behavior: Validates which TE change reduces impact while the physical fault persists.
- Outcome: Effective mitigation.
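Validating a TE change while the fault persists amounts to asking which candidate keeps the worst link utilization lowest once the impaired link's effective capacity is reduced. The sketch below shows that comparison with a simple load model; the link names, capacities, background loads, demand, and candidate splits are all invented for illustration.

```python
# Hypothetical capacities in Gbps; B-C is the impaired link, derated from 100 to 30.
CAPACITY = {"B-C": 30.0, "C-D": 100.0, "B-E": 100.0, "E-D": 100.0,
            "B-F": 100.0, "F-D": 100.0}
# Hypothetical background traffic already on each link, in Gbps.
BACKGROUND = {"B-C": 0.0, "C-D": 10.0, "B-E": 30.0, "E-D": 30.0,
              "B-F": 20.0, "F-D": 20.0}
# Candidate paths for the affected B-to-D demand.
PATHS = {"direct": ["B-C", "C-D"], "via E": ["B-E", "E-D"], "via F": ["B-F", "F-D"]}
DEMAND = 80.0  # Gbps that normally crosses the impaired link

# Each candidate maps path name -> share of the demand placed on that path.
CANDIDATES = {
    "keep direct": {"direct": 1.0},
    "all via E": {"via E": 1.0},
    "split E/F": {"via E": 0.4, "via F": 0.6},
}

def max_utilization(shares, demand, paths, background, capacity):
    """Worst-case link utilization after placing the demand according to shares."""
    load = dict(background)
    for path_name, share in shares.items():
        for link in paths[path_name]:
            load[link] += share * demand
    return max(load[link] / capacity[link] for link in load)

def rank_candidates(candidates, demand, paths, background, capacity):
    """Return (name, max utilization) pairs, best candidate first."""
    scored = {
        name: max_utilization(shares, demand, paths, background, capacity)
        for name, shares in candidates.items()
    }
    return sorted(scored.items(), key=lambda item: item[1])

ranked = rank_candidates(CANDIDATES, DEMAND, PATHS, BACKGROUND, CAPACITY)
for name, util in ranked:
    print(f"{name}: max utilization {util:.2f}")
```

In this made-up case, moving everything onto one alternate path still overloads it, while splitting across two paths keeps every link below capacity. That is the kind of distinction a generic "apply TE" recommendation cannot make.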
T+12m: The incident thrashes
- Network reality: Physical metrics worsen; intermittent micro-outages.
- LLM-assisted behavior: May pivot hypotheses repeatedly as symptoms change.
- GNN-based behavior: Maintains a stable root cause hypothesis anchored to physical edge evidence.
- Outcome: Avoids thrash.
T+15m: The decision point (spend money or fix the fault)
- Network reality: Decision: replace optics vs upgrade capacity.
- LLM-assisted behavior: May recommend upgrade based on congestion narrative.
- GNN-based behavior: Recommends targeted optics replacement; shows why capacity spend is unnecessary.
- Outcome: Avoids wrong spend.
T+25m: Deterministic closure
- Network reality: Optics replaced; physical errors drop.
- LLM-assisted behavior: Observes “improvement” but may not tie causality cleanly.
- GNN-based behavior: Confirms predicted recovery: physical errors drop first, then congestion clears, then KPIs normalize.
- Outcome: Deterministic closure.
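The closure check at T+25m can be stated as an ordering test on recovery timestamps. A minimal sketch, assuming each signal's recovery time has already been extracted from telemetry (the signal names and timestamps here are hypothetical):

```python
EXPECTED_ORDER = ("physical_errors", "congestion", "kpis")

def recovery_in_expected_order(recovery_times, expected=EXPECTED_ORDER):
    """True only if every signal recovered, in the order the causal chain predicts.

    recovery_times: {signal name: time (s) at which it returned to baseline},
                    with a missing key meaning the signal has not recovered yet.
    """
    times = [recovery_times.get(name) for name in expected]
    if any(t is None for t in times):
        return False
    return all(earlier <= later for earlier, later in zip(times, times[1:]))

# Hypothetical timestamps in seconds after the optics swap.
observed = {"physical_errors": 20, "congestion": 95, "kpis": 180}
print(recovery_in_expected_order(observed))
```

If congestion had cleared before the physical errors dropped, the check would fail, which would suggest that something other than the optics replacement drove the improvement.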
T+35m to T+1w: Prevention and recurrence control
- Network reality: Network returns to normal routing; finance asks what happened and what to invest in; the team must manage recurrence risk.
- LLM-assisted behavior: Generates an incident summary, but the narrative explanation may be vague; recurrence monitoring stays manual.
- GNN-based behavior: Produces RCA with evidence and a preventive watchlist for similar edges; quantifies avoided SLA cost; continuous detection of physical degradation signatures before congestion events.
- Outcome: Actionable business case and proactive operations.
Why LLM-only approaches struggle here
This incident is not a missing-information problem. It is a causality and dependency problem.
If an assistant is reasoning primarily from logs and symptom text, it will gravitate toward the most common pattern: congestion means a capacity shortfall. It can propose plausible actions, but it cannot reliably:
- Anchor the hypothesis to the correct physical edge.
- Trace reroutes and downstream congestion as a graph effect.
- Validate mitigation choices against the causal chain.
- Prove closure with the expected order of recovery signals.
Why topology-first GNN RCA wins
A GNN-based AIOps system reasons over the dependency graph and the telemetry attached to nodes and edges. That enables it to:
- Detect physical degradation early on high-centrality links.
- Connect downstream congestion to upstream impairment.
- Keep the root cause stable even as symptoms move.
- Recommend the lowest-cost fix (replace optics) instead of the highest-cost narrative (upgrade capacity).
What to take away
If your operations tooling cannot separate fault from capacity, you will:
- Burn hours on hypothesis churn.
- Apply mitigations that reduce pain but do not fix the cause.
- Justify the wrong capital spend.
Topology-aware, deterministic RCA changes the outcome: faster MTTR, cleaner closure, and better investment decisions.