Scenario 7: TTL Exceeded Is Late — A Topology-First Way to Catch Routing Loops Faster

Executive summary: A routing policy change creates a partial loop affecting only certain prefixes. Symptoms are ambiguous (latency shifts, asymmetric paths, intermittent loss), and by the time TTL exceeded logs appear, the incident has already burned hours and customer trust. This scenario contrasts an LLM-assisted workflow, which summarizes symptoms and suggests likely causes, with a topology-aware GNN approach, which anchors the change to the dependency graph, computes cyclic dependencies, and recommends a minimal, validated fix.

MTTR comparison: LLM-assisted 3 to 10 hours vs. GNN-based 30 to 60 minutes.

The setup: when a small policy change breaks only some routes

Network context: Enterprise + ISP hybrid network.

A routing policy or metric change is applied on one node/VRF. Nothing goes down immediately. Instead:

Certain prefixes start taking a longer path.
A subset of applications sees latency increases.
In the worst case, a loop forms for specific routes and TTL exceeded logs appear.

The operational trap: symptoms are partial and inconsistent. Engineers run traceroutes and see different results depending on source and destination. Congestion appears on links that are not normally hot. It can look like capacity, peering, or application issues.

Timeline walkthrough

T+0s: Change applied

Network reality: A routing policy/metric change is applied on one node/VRF.

LLM-assisted: If it sees the change ticket, it summarizes intent. Otherwise, it waits for symptoms.

GNN-based: Registers the change on a node/edge in the dependency graph and marks affected adjacencies and prefix sets.

Outcome: Change anchored to topology.
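
As a rough illustration of what "anchoring" a change could look like in data terms, the sketch below attaches a change to a node and marks the adjacencies that touch it. The source does not describe its data model, so ChangeRecord, anchor_change, the node names, the VRF name, and the prefix are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRecord:
    """Hypothetical record tying one change to its place in the topology."""
    change_id: str
    node: str
    vrf: str
    prefixes: set
    adjacencies: set = field(default_factory=set)


def anchor_change(adjacency, change_id, node, vrf, prefixes):
    """Attach a change to a node and mark every directed adjacency touching it."""
    outbound = {(node, nbr) for nbr in adjacency.get(node, ())}
    inbound = {(src, node) for src, nbrs in adjacency.items() if node in nbrs}
    return ChangeRecord(change_id, node, vrf, set(prefixes), outbound | inbound)


# Illustrative topology: node -> set of routing neighbors (made-up names).
adjacency = {
    "pe1": {"p1", "p2"},
    "p1": {"pe1", "p2"},
    "p2": {"pe1", "p1", "pe2"},
    "pe2": {"p2"},
}
record = anchor_change(adjacency, "CHG-001", "p1", "cust-a", ["10.20.0.0/16"])
print(sorted(record.adjacencies))
```

Later symptoms can then be checked against record.adjacencies and record.prefixes rather than matched by timestamp alone.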

T+30s: Path deviation begins

Network reality: Certain prefixes begin taking a longer path. Latency increases for a subset of applications.

LLM-assisted: Summarizes the latency increase and searches for similar incidents.

GNN-based: Detects path deviation for impacted flows and ties it to the changed policy node.

Outcome: Early causal link established.
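
One simple way to picture path-deviation detection is a comparison of each flow's current hop list against a stored baseline, keeping only the flows that cross the changed node. This is a minimal sketch under that assumption, not the product's method; deviating_flows and the sample flows are hypothetical.

```python
def deviating_flows(baseline, current, changed_node):
    """Return (flow, extra_hops) for flows whose path changed and that cross,
    or used to cross, the node where the change was applied."""
    flagged = []
    for flow, base_path in baseline.items():
        cur_path = current.get(flow)
        if cur_path is None or cur_path == base_path:
            continue  # no fresh observation, or path unchanged
        if changed_node in base_path or changed_node in cur_path:
            flagged.append((flow, len(cur_path) - len(base_path)))
    return sorted(flagged)


# Flows keyed by (source site, destination prefix); all values are made up.
baseline = {
    ("siteA", "10.20.1.0/24"): ["ce1", "pe1", "p1", "pe2"],
    ("siteB", "10.30.1.0/24"): ["ce2", "pe1", "p2", "pe3"],
    ("siteC", "10.40.1.0/24"): ["ce3", "pe3", "p4", "pe4"],
}
current = {
    ("siteA", "10.20.1.0/24"): ["ce1", "pe1", "p1", "p2", "p3", "pe2"],
    ("siteB", "10.30.1.0/24"): ["ce2", "pe1", "p2", "pe3"],
    ("siteC", "10.40.1.0/24"): ["ce3", "pe3", "p5", "pe4"],
}
suspects = deviating_flows(baseline, current, changed_node="p1")
print(suspects)
```

Here siteC also changed path, but it never touches p1, so it is not attributed to the change.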

T+1m: Loop forms for a subset of routes

Network reality: A loop forms for a subset of routes. TTL exceeded logs appear.

LLM-assisted: Recognizes TTL exceeded and suggests that a routing loop is likely.

GNN-based: Computes cyclic dependency patterns across the routing adjacency graph for the affected prefix/VRF.

Outcome: Loop confirmed deterministically.
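
The deterministic part of loop confirmation can be shown without any learned model. For a single prefix in a single VRF, and ignoring ECMP, each node has at most one next hop, so following next hops either reaches an egress or revisits a node. The sketch below assumes that simplified next-hop table; find_forwarding_loop and the node names are hypothetical, and this is plain graph traversal, not a GNN.

```python
def find_forwarding_loop(next_hop, start):
    """Follow next hops from `start`; return the cycle as a list of nodes,
    or None if forwarding reaches a node with no further next hop (egress)."""
    position = {}   # node -> index in `path` where it was first visited
    path = []
    node = start
    while node is not None and node not in position:
        position[node] = len(path)
        path.append(node)
        node = next_hop.get(node)
    if node is None:
        return None
    return path[position[node]:]


# Per-prefix next-hop tables (made-up). "pe2" is the egress and has no entry.
healthy = {"pe1": "p1", "p1": "p2", "p2": "pe2"}
looping = {"pe1": "p1", "p1": "p2", "p2": "p1"}

print(find_forwarding_loop(healthy, "pe1"))
print(find_forwarding_loop(looping, "pe1"))
```

The returned cycle names the loop edges directly, which is what makes the result checkable by a human.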

T+2m: Congestion and loss appear downstream

Network reality: Congestion appears on links not normally hot. Packet loss begins.

LLM-assisted: May misclassify as a capacity issue due to congestion alarms.

GNN-based: Shows congestion is downstream of the loop and suboptimal path. Identifies the loop edges.

Outcome: Correct classification.
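
To make the classification step concrete, the following sketch labels each congested link by its relationship to the loop: it is a loop edge, it carries traffic that was rerouted after the change, or it is unexplained and may be a real capacity problem. The labels, classify_congestion, and the sample paths are assumptions for illustration only.

```python
def path_edges(path):
    """Directed links traversed by a hop-by-hop path."""
    return set(zip(path, path[1:]))


def classify_congestion(congested_links, loop_edges, baseline, current):
    """Label each congested link by how it relates to the loop."""
    rerouted = set()
    for flow, cur_path in current.items():
        # Links a flow uses now but did not use before the change.
        rerouted |= path_edges(cur_path) - path_edges(baseline.get(flow, []))
    labels = {}
    for link in congested_links:
        if link in loop_edges:
            labels[link] = "loop edge"
        elif link in rerouted:
            labels[link] = "carrying rerouted traffic"
        else:
            labels[link] = "unexplained, check capacity"
    return labels


# Made-up observations: flowA loops, flowC takes a longer path, flowB is unchanged.
baseline = {
    "flowA": ["pe1", "p1", "pe2"],
    "flowB": ["pe1", "p3", "pe3"],
    "flowC": ["pe1", "p1", "pe2"],
}
current = {
    "flowA": ["pe1", "p1", "p2", "p1"],
    "flowB": ["pe1", "p3", "pe3"],
    "flowC": ["pe1", "p1", "p4", "p5", "pe2"],
}
loop_edges = {("p1", "p2"), ("p2", "p1")}
labels = classify_congestion(
    [("p1", "p2"), ("p4", "p5"), ("p3", "pe3")], loop_edges, baseline, current
)
print(labels)
```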

T+3m: Traceroute confusion

Network reality: Operators run traceroutes. Results vary by source and destination.

LLM-assisted: Interprets traceroute text. Can be misled by asymmetry and sampling.

GNN-based: Uses graph context to reconcile asymmetric paths and isolate the consistent cycle.

Outcome: Avoids false leads.
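
Traceroutes from different vantage points enter the same loop at different hops, which is part of why they look inconsistent. One way to reconcile them, sketched below, is to extract the repeating hop sequence from each trace, rotate it into a canonical form, and count how many traces agree. The functions and hop names are hypothetical, and "*" is assumed to mark a timed-out hop.

```python
from collections import Counter


def cycle_in_trace(hops):
    """Return the repeating hop sequence in one traceroute as a rotation-
    normalized tuple, or None if no hop repeats."""
    hops = [h for h in hops if h != "*"]  # drop timed-out hops
    first_seen = {}
    for i, hop in enumerate(hops):
        if hop in first_seen:
            cycle = hops[first_seen[hop]:i]
            k = cycle.index(min(cycle))   # rotate so the smallest name leads
            return tuple(cycle[k:] + cycle[:k])
        first_seen[hop] = i
    return None


def consistent_cycle(traces):
    """Return (cycle, count) for the cycle most traces agree on, or None."""
    counts = Counter(c for c in map(cycle_in_trace, traces) if c)
    return counts.most_common(1)[0] if counts else None


traces = [
    ["ce1", "pe1", "p1", "p2", "p1", "p2"],        # enters the loop at p1
    ["ce3", "pe3", "p2", "*", "p1", "p2", "p1"],   # enters the loop at p2
    ["ce2", "pe1", "p3", "pe4"],                   # unaffected destination
]
result = consistent_cycle(traces)
print(result)
```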

T+5m: Fix debate

Network reality: NOC debates rollback vs. targeted tweak. Risk of impacting other traffic.

LLM-assisted: Recommends caution. Cannot validate which knob fixes the loop.

GNN-based: Recommends the minimal policy change and predicts which prefixes and paths will normalize.

Outcome: Targeted fix plan.
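
The rollback-versus-tweak debate can be framed as a small search: apply each candidate fix to a copy of the forwarding state, check that every prefix is loop-free, and prefer the candidate that touches the fewest entries. The sketch below assumes candidate fixes can be expressed as next-hop overrides, which is a simplification; a real system would need to recompute routes from the policy itself. All names and values are hypothetical.

```python
def has_loop(next_hop):
    """True if following next hops from any node revisits a node."""
    for start in next_hop:
        seen = set()
        node = start
        while node in next_hop:
            if node in seen:
                return True
            seen.add(node)
            node = next_hop[node]
    return False


def pick_minimal_fix(tables, candidates):
    """tables: prefix -> {node: next_hop}.
    candidates: name -> {(prefix, node): new_next_hop}.
    Return the loop-free candidate touching the fewest entries, or None."""
    viable = []
    for name, overrides in candidates.items():
        patched = {prefix: dict(table) for prefix, table in tables.items()}
        for (prefix, node), new_hop in overrides.items():
            patched[prefix][node] = new_hop
        if not any(has_loop(table) for table in patched.values()):
            viable.append((len(overrides), name))
    return min(viable)[1] if viable else None


tables = {
    "10.20.0.0/16": {"pe1": "p1", "p1": "p2", "p2": "p1"},  # looping
    "10.30.0.0/16": {"pe1": "p2", "p2": "pe3"},             # healthy
}
candidates = {
    "broad-rollback": {("10.20.0.0/16", "p2"): "pe2", ("10.30.0.0/16", "p2"): "pe3"},
    "targeted": {("10.20.0.0/16", "p2"): "pe2"},
    "wrong-knob": {("10.20.0.0/16", "pe1"): "p2"},
}
choice = pick_minimal_fix(tables, candidates)
print(choice)
```

In this made-up case "wrong-knob" leaves the p1/p2 cycle in place, so it is rejected before anyone applies it.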

T+7m: Fix applied

Network reality: Fix applied (policy correction, metric adjustment, route-map change).

LLM-assisted: Notes change applied. Waits for symptoms to clear.

GNN-based: Monitors specific loop indicators: cycle break, adjacency stability, path length normalization.

Outcome: Fast validation.

T+9m: Closure

Network reality: TTL exceeded logs stop. Latency returns to baseline for affected applications.

LLM-assisted: Summarizes the incident as resolved. May not prove why.

GNN-based: Confirms closure: path normalization occurs in predicted order across the dependency graph.

Outcome: High-confidence RCA.
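
Checking "predicted order" can be as simple as comparing the order in which nodes were expected to normalize with the times at which they actually did. The sketch below assumes the prediction is an ordered list of node names and the observations are timestamps in seconds; both shapes, and the function name, are hypothetical.

```python
def recovery_matches_prediction(predicted_order, observed_times):
    """Return (ok, offenders). `offenders` lists nodes that never recovered,
    or nodes that recovered earlier than their predicted predecessor."""
    missing = [n for n in predicted_order if n not in observed_times]
    if missing:
        return False, missing
    times = [observed_times[n] for n in predicted_order]
    out_of_order = [
        predicted_order[i + 1]
        for i in range(len(times) - 1)
        if times[i + 1] < times[i]
    ]
    return not out_of_order, out_of_order


predicted = ["p2", "p1", "pe1"]  # made-up expected recovery sequence
as_predicted = {"p2": 421.0, "p1": 423.5, "pe1": 424.1}
surprising = {"p2": 425.0, "p1": 423.5, "pe1": 424.1}

print(recovery_matches_prediction(predicted, as_predicted))
print(recovery_matches_prediction(predicted, surprising))
```

A mismatch does not prove the RCA wrong, but it is a cue to keep the incident open rather than close on symptom silence alone.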

T+15m to T+1d: Postmortem and prevention

Network reality: Postmortem asks why the policy created a loop only for some prefixes. Pre-change checks are needed to prevent recurrence.

LLM-assisted: Generates a plausible explanation. May miss topology nuance. Suggests adding validation.

GNN-based: Provides an explainable dependency proof: which nodes, edges, and route policies interacted to form the cycle. Recommends graph-based pre-change simulation for routing policies on high-centrality nodes.

Outcome: Better engineering learning and preventive control.
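
The source does not say which centrality measure it means. As one possible reading, the sketch below uses degree centrality (neighbors divided by the maximum possible neighbors) to decide whether a change on a given node should be gated behind simulation. The threshold, function names, and topology are assumptions.

```python
def degree_centrality(adjacency):
    """Degree centrality per node: neighbor count / (node count - 1)."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


def requires_simulation(adjacency, node, threshold=0.6):
    """Gate a change behind pre-change simulation if the node is central."""
    return degree_centrality(adjacency).get(node, 0.0) >= threshold


# Illustrative topology: node -> set of routing neighbors (made-up names).
adjacency = {
    "pe1": {"p1", "p2"},
    "p1": {"pe1", "p2"},
    "p2": {"pe1", "p1", "pe2"},
    "pe2": {"p2"},
}
for node in sorted(adjacency):
    print(node, requires_simulation(adjacency, node))
```

Betweenness or another path-based measure may fit routing better than degree; the gate logic stays the same either way.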

Why LLM-only approaches struggle here

LLMs can recognize TTL exceeded and suggest a routing loop. But they lack a defensible way to:

Tie the incident to the specific policy change and affected prefix set.
Compute the actual cycle in the routing adjacency graph.
Separate downstream congestion from the upstream loop.
Recommend the minimal safe fix with predicted impact.

Why topology-first GNN RCA wins

A GNN-based system reasons over the dependency graph and routing adjacencies to:

Anchor changes to topology and prefix sets.
Detect path deviation early, before TTL exceeded appears.
Confirm loops deterministically via cyclic dependency computation.
Validate closure via expected recovery ordering.
Prevent recurrence with pre-change simulation on high-centrality nodes.

What to take away

Routing loops are a policy interaction problem. If your RCA is not topology- and policy-aware, you will:

Waste time on traceroute noise.
Roll back broadly under uncertainty.
Miss the specific conditions that triggered the loop.

Deterministic, graph-based RCA reduces MTTR and improves change safety.