Engineering Note
Diagnosing BGP session flaps systematically
A field checklist for BGP session flaps: reading reset reasons, hold-timer analysis, MTU blackholes, dampening, and TCP resets — Cisco and FRR commands.
TL;DR
A BGP session flap is a session cycling between Established and down — and the router already recorded why. The reset reason sorts every flap into one of four failure classes: keepalive loss, transport or MTU problems, deliberate resets, or interface instability. This is my working checklist: the exact Cisco IOS and FRR commands to pull the reset reason, prove or rule out a path-MTU blackhole, and check dampening before it starts penalizing you.
Start with the reset reason, not the config
The router already knows why the session dropped; it wrote the reason down. Read that before changing anything:
! Cisco IOS
show ip bgp neighbors 203.0.113.1 | include Last reset|flapped
show logging | include BGP-5-ADJCHANGE|BGP-3-NOTIFICATION
# FRR
vtysh -c 'show bgp neighbors 203.0.113.1 json' | jq -r '.["203.0.113.1"] | .lastResetDueTo, .connectionsEstablished, .connectionsDropped'
journalctl -u frr --since -2h | grep -iE 'bgp.*(down|reset|notif)'
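When there are many peers to check, the same three fields can be pulled out of the JSON in a script instead of by eye. This is a minimal sketch that assumes the output has already been captured to a string. The sample payload is hypothetical and contains only the keys the jq filter above reads, and `summarize_peer` is an illustrative name.

```python
import json

# Hypothetical sample shaped like `show bgp neighbors <peer> json`.
# Only the three keys used by the jq filter are assumed to exist.
SAMPLE = """
{"203.0.113.1": {"lastResetDueTo": "Hold Timer Expired",
                 "connectionsEstablished": 14,
                 "connectionsDropped": 13}}
"""


def summarize_peer(raw_json: str, peer: str) -> dict:
    """Return the reset reason and session counters for one peer."""
    neighbor = json.loads(raw_json).get(peer, {})
    return {
        "peer": peer,
        "last_reset": neighbor.get("lastResetDueTo", "unknown"),
        "established": neighbor.get("connectionsEstablished", 0),
        "dropped": neighbor.get("connectionsDropped", 0),
    }


summary = summarize_peer(SAMPLE, "203.0.113.1")
print(summary)
```

A large gap between established and dropped counts is not expected on a flapping peer; the two usually differ by one, depending on whether the session is currently up.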
The reset reason sorts you into a failure class immediately:
| Reset reason | Failure class | First check |
|---|---|---|
| Hold timer expired (locally) | You stopped hearing keepalives | Congestion, CoPP, unidirectional link |
| Peer sent NOTIFICATION / “peer closed session” | The other side reset | Their logs, their maintenance window |
| TCP connection reset | Transport layer | MD5 mismatch, firewall timeout, GTSM |
| Interface flap / BFD down | Physical or BFD | Errors, optics, BFD timers |
This is the same discipline as any structured network troubleshooting: let the evidence pick the branch.
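The table can be turned into a small lookup so that tickets get tagged with a failure class consistently. This is a sketch only: the match strings are assumptions, because reset-reason wording differs between platforms and software versions, and `classify_reset` is an illustrative name.

```python
# (substring, failure class, first check), mirroring the table above.
# The substrings are assumptions; adjust them to what your routers print.
RULES = [
    ("notification received", "peer reset", "their logs, their maintenance window"),
    ("peer closed", "peer reset", "their logs, their maintenance window"),
    ("hold time", "keepalive loss", "congestion, CoPP, unidirectional link"),
    ("bfd", "interface or BFD", "errors, optics, BFD timers"),
    ("interface", "interface or BFD", "errors, optics, BFD timers"),
    ("tcp", "transport", "MD5 mismatch, firewall timeout, GTSM"),
    ("connection reset", "transport", "MD5 mismatch, firewall timeout, GTSM"),
]


def classify_reset(reason: str) -> tuple[str, str]:
    """Map a reset-reason string to (failure class, first check)."""
    text = reason.lower()
    for needle, failure_class, first_check in RULES:
        if needle in text:
            return failure_class, first_check
    # Anything unrecognized needs a human to read the full output.
    return "unclassified", "read the full neighbor output and logs"


print(classify_reset("Hold Timer Expired"))
print(classify_reset("Peer closed the session"))
```

Rule order matters: the peer-reset patterns are checked first so that a NOTIFICATION received from the peer is attributed to the peer even when its text also mentions the hold time.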
Hold timer expiry: prove the keepalive path
Keepalives are small and cheap. If they’re getting lost, something is dropping control-plane traffic, and the config is not where you’ll find it. Check interface errors and control-plane policing on both ends:
show interfaces gi0/0/1 | include error|drops|CRC
show policy-map control-plane | include drop
If timers were “tuned” by a predecessor, verify both sides agree on what was negotiated — the session uses the lower hold time offered:
show ip bgp neighbors 203.0.113.1 | include hold time
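The negotiation rule is simple enough to sketch, which helps when checking whether a "tuned" value on one side is actually in effect. The function below follows the RFC 4271 rule that the smaller offered hold time wins. Deriving the keepalive as one third of the hold time is a simplifying assumption; real implementations may also honor a separately configured keepalive interval.

```python
def negotiate_timers(local_hold: int, peer_hold: int) -> tuple[int, int]:
    """Return (hold_time, keepalive_interval) in seconds.

    The session uses the smaller of the two offered hold times.
    A hold time of 0 disables keepalives; 1 and 2 are not allowed.
    """
    for offered in (local_hold, peer_hold):
        if offered < 0 or offered in (1, 2):
            raise ValueError(f"invalid hold time offered: {offered}")
    hold = min(local_hold, peer_hold)
    # Assumption: keepalive is one third of the negotiated hold time.
    keepalive = hold // 3
    return hold, keepalive


print(negotiate_timers(180, 30))
```

With 180 configured locally and 30 offered by the peer, the session runs on a 30-second hold time, so a loss burst that a 180-second timer would have absorbed now drops the session.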
The 90-second death loop: MTU
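If the session establishes, sits in Established briefly, then dies on hold timer, over and over, suspect a path-MTU blackhole. The cycle length tracks the negotiated hold time: about 90 seconds where the hold time is 90, about 180 with a 180-second hold time. Small packets (OPEN, keepalive) pass; the full-size segments of the initial UPDATE burst don’t. Because TCP delivers in order, later keepalives queue behind the stuck segment and never reach the peer's BGP process. Prove it with DF-set pings at full MTU: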
! Cisco
ping 203.0.113.1 size 1500 df-bit
# Linux/FRR host — 1472 payload + 28 header = 1500
ping -M do -s 1472 203.0.113.1
If 1500 fails and smaller succeeds, walk the sizes down to find the real path MTU. Check what MSS the session actually negotiated:
show ip bgp neighbors 203.0.113.1 | include segment
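Walking the sizes down by hand is a linear search; a binary search gets to the same answer in about ten probes. The sketch below takes the probe as a function so it can wrap whichever DF-set ping is available. `find_path_mtu` and `tcp_mss_for` are illustrative names, the 576-byte floor is an assumption, and the search assumes that if a size passes, every smaller size passes too.

```python
from typing import Callable


def find_path_mtu(probe: Callable[[int], bool], low: int = 576, high: int = 1500) -> int:
    """Largest size in [low, high] for which probe(size) succeeds.

    probe(size) should send one DF-set packet of `size` total IP bytes
    and return True if a reply came back. Returns 0 if `low` fails.
    """
    if not probe(low):
        return 0
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def tcp_mss_for(mtu: int) -> int:
    """MSS for IPv4 without options: MTU minus 20 (IP) and 20 (TCP)."""
    return mtu - 40


# Stand-in probe that simulates a path with a 1400-byte MTU.
def fake_probe(size: int) -> bool:
    return size <= 1400


path_mtu = find_path_mtu(fake_probe)
print(path_mtu, tcp_mss_for(path_mtu))
```

When wiring a real probe to the Linux ping shown earlier, remember that `-s` takes the payload size, so the probe should pass the total size minus 28.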
Fixes, in order of preference: repair the path MTU, clamp MSS on the session, or (on FRR) confirm how path-MTU discovery behaves on the BGP TCP socket. On links I run over tunnels or VPN paths, I clamp MSS proactively, because tunnel overhead is where these blackholes breed.
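The clamp value itself is plain subtraction. As a worked sketch, assuming IPv4 without IP or TCP options: take the underlay MTU, subtract the tunnel overhead, then subtract 40 bytes of IP and TCP headers. The overhead figure is an input you have to establish for your own encapsulation; the 24 bytes used here is plain GRE over IPv4 (20-byte outer IP header plus 4-byte GRE header), and `clamped_mss` is an illustrative name.

```python
IPV4_HEADER = 20  # bytes, no IP options
TCP_HEADER = 20   # bytes, no TCP options


def clamped_mss(link_mtu: int, tunnel_overhead: int, margin: int = 0) -> int:
    """MSS that keeps a full TCP segment inside the tunnel's inner MTU.

    `margin` is optional extra headroom, for example for encapsulations
    whose overhead varies from packet to packet.
    """
    inner_mtu = link_mtu - tunnel_overhead
    mss = inner_mtu - IPV4_HEADER - TCP_HEADER - margin
    if mss <= 0:
        raise ValueError("overhead leaves no room for a TCP payload")
    return mss


# Plain GRE over IPv4 on a 1500-byte underlay.
print(clamped_mss(1500, 24))
```

Encrypted tunnels add more overhead than plain GRE and the exact amount depends on the cipher and mode, so measure the path with DF-set pings rather than trusting a computed figure alone.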
TCP resets and dampening
Immediate TCP resets, where the session never reaches OpenSent, are transport problems: an MD5 password mismatch (%TCP-6-BADAUTH on IOS) or a ttl-security hops / GTSM mismatch. A stateful firewall between peers timing out the long-lived TCP 179 session looks different: the session establishes, then resets on a schedule. The flap interval matching the firewall’s idle timeout is the tell.
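Checking whether flaps line up with a firewall timeout is easy to do from the timestamps in the logs. This is a rough sketch: the timestamps are hypothetical, the 30-second tolerance is an arbitrary assumption, and `matches_timeout` is an illustrative name.

```python
from datetime import datetime
from statistics import median


def flap_intervals(timestamps: list[str]) -> list[float]:
    """Seconds between consecutive flaps, given ISO 8601 timestamps."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def matches_timeout(timestamps: list[str], timeout_s: float, tolerance_s: float = 30.0) -> bool:
    """True if the median flap interval sits within tolerance of timeout_s."""
    intervals = flap_intervals(timestamps)
    if len(intervals) < 2:
        return False  # too few flaps to call it a pattern
    return abs(median(intervals) - timeout_s) <= tolerance_s


# Hypothetical flap times taken from session-down log entries.
FLAPS = [
    "2024-05-01T00:00:00",
    "2024-05-01T01:00:05",
    "2024-05-01T02:00:03",
    "2024-05-01T03:00:10",
]

print(flap_intervals(FLAPS))
print(matches_timeout(FLAPS, timeout_s=3600))
```

The median is used rather than the mean so that one unrelated flap in the middle of the series does not hide an otherwise regular pattern.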
Dampening deserves one check even if you didn’t configure it, because an upstream may have:
! Cisco IOS
show ip bgp dampening flap-statistics
show ip bgp dampening dampened-paths
# FRR
vtysh -c 'show ip bgp dampening dampened-paths'
A prefix that looks “down” from the outside while your session is healthy is often just suppressed upstream.
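To estimate how long a suppressed prefix stays hidden, the decay can be worked out directly: the penalty halves every half-life, so the wait until it reaches the reuse limit is half_life × log2(penalty / reuse). The sketch below assumes the commonly documented Cisco IOS defaults (15-minute half-life, reuse 750, suppress 2000, 60-minute maximum suppress time, 1000 penalty per flap). An upstream may run different values, which you usually cannot see.

```python
import math

# Assumed defaults; confirm against the configuration actually in use.
HALF_LIFE_MIN = 15.0
REUSE_LIMIT = 750.0
SUPPRESS_LIMIT = 2000.0
MAX_SUPPRESS_MIN = 60.0
PENALTY_PER_FLAP = 1000.0


def decayed_penalty(penalty: float, minutes: float) -> float:
    """Penalty remaining after `minutes` of exponential decay."""
    return penalty * 2.0 ** (-minutes / HALF_LIFE_MIN)


def minutes_until_reuse(penalty: float) -> float:
    """Minutes for a suppressed path to decay to the reuse limit.

    Capped at the maximum suppress time.
    """
    if penalty <= REUSE_LIMIT:
        return 0.0
    wait = HALF_LIFE_MIN * math.log2(penalty / REUSE_LIMIT)
    return min(wait, MAX_SUPPRESS_MIN)


# Three flaps in quick succession, ignoring decay between them.
penalty = 3 * PENALTY_PER_FLAP
print(penalty > SUPPRESS_LIMIT, minutes_until_reuse(penalty))
```

Under these assumptions, three quick flaps cost about half an hour of suppression after the link is stable again, which is why a healthy session can coexist with an unreachable prefix.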
What I write down
For every eBGP session: negotiated timers, MSS, whether BFD is enabled and its timers, the peer’s NOC contact, and the flap history at handover. Half the diagnosis time on flap tickets is reconstructing what “normal” looked like — notes on BGP design decisions beat archaeology every time.
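One way to keep those notes comparable over time is a small structured record per session plus a diff against the current state. This is a sketch of one possible layout, not a prescribed schema; every field name and sample value here is hypothetical.

```python
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SessionBaseline:
    """What 'normal' looked like for one eBGP session at handover."""
    peer: str
    hold_time_s: int
    keepalive_s: int
    mss_bytes: int
    bfd_enabled: bool
    bfd_interval_ms: Optional[int] = None
    bfd_multiplier: Optional[int] = None
    noc_contact: str = ""
    flaps_at_handover: int = 0


def drift(baseline: SessionBaseline, current: SessionBaseline) -> dict:
    """Fields that differ, as {name: (baseline value, current value)}."""
    before, after = asdict(baseline), asdict(current)
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}


baseline = SessionBaseline(
    peer="203.0.113.1", hold_time_s=180, keepalive_s=60,
    mss_bytes=1460, bfd_enabled=False,
)
# Simulate today's reading: same session, smaller negotiated MSS.
today = replace(baseline, mss_bytes=1360)
print(drift(baseline, today))
```

A changed MSS or hold time in the diff points straight at the branch of the checklist to start with.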
Frequently asked questions
- What does 'hold timer expired' actually mean?
- Your router went a full hold time without receiving a KEEPALIVE or UPDATE from the peer, then tore the session down. The hold time is conventionally three keepalive intervals (180 seconds by default). The fault is on the receive path: congestion dropping keepalives, control-plane policing on either end, a unidirectional link, or the peer's control plane stalling. It is almost never a BGP configuration problem.
- Why does my BGP session establish and then die about 90–180 seconds later?
- Classic path-MTU blackhole. The OPEN and keepalive packets are small and pass fine, but the initial full-table UPDATE burst uses full-size TCP segments that get silently dropped somewhere with a smaller MTU and broken PMTUD. The session stalls until the hold timer expires, resets, and repeats. Test with DF-bit pings at full MTU.
- Should I enable BGP dampening?
- Rarely on eBGP edges today, and never on iBGP. Modern convergence is fast enough that dampening's suppression often causes longer outages than the flaps it hides. If you inherit it, check flap statistics before assuming a prefix is down — it may just be suppressed.