Frequent IPv4 outages due to ARP requests not being answered

19 posts Page 2 of 2
by conradpino » Sat Jan 18, 2025 11:58 pm
@bg212 recovers by unplugging the data cable from the ONT and then reconnecting it; next time, check whether you see the same behavior.
The time after that, leave the ONT data cable alone and power cycle the ONT instead; that may localize the fault to the ONT or the OLT (upstream).
Are you convinced your packet logs are substantially the same as @bg212's packet logs?

I believe Sonic can't fix this directly; it smells like an OLT/ONT bug, and Adtran is the vendor. Sonic is limited by what Adtran makes available to customers. My guess is that the Adtran TA5000 OLT is the earliest point where Sonic can monitor the data stream. You've already said that you and Sonic don't see the same data stream at your respective sides, and that is the crux of the matter.

Sonic is pretty forthcoming about what they are doing:
viewtopic.php?t=3389
viewtopic.php?t=16737
viewtopic.php?t=17231
I conclude all components between OLT and ONT are passive optical paths.
Ask about MAC Table entries in use on your circuit (a long shot).
by klui » Sun Jan 19, 2025 1:56 am
I don't have any issue with leases and my client starts renewing every ~7 minutes. My lease is 15 minutes.

EDIT: I looked at the wrong entries because I have another interface that retrieves a lease from my PV controller. My client for my WAN renews every 3 hours; lease is 6 hours. Sorry for the confusion.
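
The renewal intervals above (about 7 minutes on a 15-minute lease, 3 hours on a 6-hour lease) match the default DHCP timers, where a client starts renewing at half the lease time (T1) and rebinding at 87.5% of it (T2). A minimal sketch of that arithmetic, assuming the server does not override T1/T2 with its own values; the function name is made up for illustration:

```python
def renewal_timers(lease_seconds: int) -> tuple[float, float]:
    """Return (T1, T2) in seconds using the RFC 2131 default fractions."""
    t1 = 0.5 * lease_seconds    # client starts unicast renewal (RENEWING)
    t2 = 0.875 * lease_seconds  # client falls back to broadcast (REBINDING)
    return t1, t2


for label, lease in (("15-minute lease", 15 * 60), ("6-hour lease", 6 * 3600)):
    t1, t2 = renewal_timers(lease)
    print(f"{label}: renew at {t1 / 60:.1f} min, rebind at {t2 / 60:.1f} min")
```

For the 6-hour WAN lease this gives a renewal at 180 minutes, which is the 3-hour figure quoted above.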

What if you perform the equivalent of a shut/no shut on the WAN interface in VyOS (equivalent to unplugging/replugging cable)?

Maybe related to
https://supportportal.juniper.net/s/art ... uage=en_US
by gadams » Mon Jan 20, 2025 9:31 pm
Ah, yes, that Juniper bug does sound suspiciously similar! I don't have a Juniper login, though, so I can't read more about PR1646010. I wonder if Sonic is running an affected Junos version. The linked forum posts don't mention whether Sonic's network is using an EVPN architecture. (That should be a purely internal implementation detail, so it's no wonder they don't talk about that sort of thing.)

I wonder what the specifics of that bug are. If I could read about the bug, I could develop hypotheses, such as whether it would be less likely to appear if my router's ARP interval were longer than 30 seconds. Would increasing it to, say, 5 minutes help? I'll try that after the next outage, even though it's a total shot in the dark.

I definitely plan to try various tests including power-cycling the ONT during the next outage. Disconnecting and re-connecting the ONT-to-my-router ethernet didn't seem to have any effect, and my router didn't (to my surprise) choose to re-run the DHCP negotiation when I tried it, which would have kick-started ARP responses again.

Reading more of the Juniper docs, it seems that in EVPN configurations, the Juniper router snoops ARP and NDP packets from the customer router. That would explain why it never sends me any ARP or NDP requests. But the bug seems only to affect ARP; NDP (for IPv6) continues to work.

Thanks again for the pointers! I'll definitely report back when I have more information from my side.
by klui » Tue Jan 21, 2025 1:13 am
The PR says this:
PROBLEM

When "no-arp-suppression" is configured, L2ALM will not respond to ARP request for IRB IP and peer router cannot send traffic to IRB.

RELEASE NOTES

On all Junos platforms, when "no-arp-suppression" is configured, the Layer 2 Address Learning Manager (L2ALM) will not respond to Address Resolution Protocol (ARP) request for Integrated Routing and Bridging (IRB) IP, and traffic to IRB will be lost.

PRODUCT

ACX Series, PTX Series, QFX Series, MX Series

WORKAROUND

Delete "no-arp-suppression".

TRIGGERS

This issue might be seen if the following conditions are met:
* On all Junos and Junos Evolved platforms supporting EVPN-MPLS (From 19.1R1, on EX and QFX, no-arp-supression is deprecated and this issue is not applicable)

* When no-arp-suppression is configured
I've also been looking at IPv6 and persistent prefix delegations. Sonic's DHCPv6 server is a Juniper device, and there is a configuration option that allows for persistent PDs. Maybe that is something Sonic could consider for folks who reboot their router or manually renew their lease. Sonic support gave me the impression they are not ready to answer IPv6 questions yet, but I hope they will forward this to their network ops for consideration. I don't use IPv6, but if I did, I would not want my PD to change, since I would then have to change all the address assignments in my firewall. Maybe there is another way to re-assign a prefix; I am not familiar with IPv6 at all. But what if I want to expose an IPv6 address to the internet? One would need to use DDNS to cover any changes, but what about TLS certificates?

https://apps.juniper.net/feature-explorer/feature/2799

Strangely there is no formal documentation on the setting except this article talking about how it doesn't work over PPPoE. It has an aging-time which could be used to not preserve a PD for a "long" time. https://supportportal.juniper.net/s/art ... uage=en_US.

As much as I like JunOS, it can be a bit obtuse at times.
by gadams » Tue Feb 04, 2025 1:14 pm
It's been a couple weeks now since I've had an outage, so I don't have any specific evidence to report.

I did make one change shortly after the last outage, which was to make sure that my DHCP client sends DHCPREQUESTs every couple hours or so.

Based on reading the Juniper docs and seeing that the Sonic router has never sent an ARP request, it seems clear that the Juniper router is snooping some set of packets to fill its ARP cache. I would have expected it to be snooping ARP packets, but that doesn't explain the failure mode I'm observing. It seems that it is, instead, snooping DHCP requests.

That hypothesis is backed up by these data points:
  • My router's ARP requests continue, but the Sonic router stops answering them at some point.
  • My router sending a DHCP request causes the Sonic router to answer ARP requests again.
  • If my router sends DHCP renewal requests every two to three hours, the problem seems not to happen. Maybe.
I still think that last point may be coincidence. I have had several instances in the past of the ARP failure within 5 to 10 minutes after a full DHCP handshake (Discover, Offer, Request, Ack). The latest one was on Jan 5.

Jan 05 12:18:32 DHCPDISCOVER/DHCPOFFER/DHCPREQUEST/DHCPACK
Jan 05 12:25:07 Down (ARP requests go unanswered)
Jan 05 17:33:36 DHCPREQUEST/DHCPACK
Jan 05 17:33:36 Up (ARP requests answered again)
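
For reference, the gaps in that timeline can be computed directly from the log timestamps. This is only a sketch: the log lines carry no year, so 2025 is assumed here, and the helper name is made up.

```python
from datetime import datetime

ASSUMED_YEAR = 2025  # assumption: the log lines themselves carry no year
FMT = "%Y %b %d %H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Seconds elapsed between two 'Mon DD HH:MM:SS' log timestamps."""
    t0 = datetime.strptime(f"{ASSUMED_YEAR} {start}", FMT)
    t1 = datetime.strptime(f"{ASSUMED_YEAR} {end}", FMT)
    return int((t1 - t0).total_seconds())


handshake = "Jan 05 12:18:32"
down = "Jan 05 12:25:07"
up = "Jan 05 17:33:36"

print(seconds_between(handshake, down))  # 395 (6 min 35 s after the handshake)
print(seconds_between(down, up))         # 18509 (outage of 5 h 8 min 29 s)
```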

Also, I'm not sure what exactly the hypothetical DHCP snooping is doing. It's not just filling the ARP cache on the Juniper side, since the Juniper still sends IPv4 traffic to me even when it stops answering ARP requests. I'm guessing that there's an entry in some interior routing protocol that's falling out without sufficient DHCP requests coming in. And that's stopping even my ARP requests from making it to the IP next hop.

In any event, it clearly manifests as a bug on the Sonic side to stop answering ARP requests, regardless of how long it's been since the last DHCP request. (DHCP is supposed to layer above ARP.)
by gadams » Thu Feb 13, 2025 12:01 pm
I think I've figured out what's going on, and how to prevent it. Thanks to @conradpino for suggesting setting up a monitoring switch, which helped me notice a tiny detail during an outage a couple of days ago. I was able to corroborate that with my logs of past outages and even test the hypothesis.

Here's what's happening: Every once in a while, because of the timing of tearing down and setting up internal routes on my side unrelated to the Sonic link, my router leaks DHCP messages from another interface onto the Sonic link. These messages include a DHCP Discover that ends up being forwarded with my router's MAC address as the source, and Sonic's router answers with a DHCP Offer. Since my router's DHCP client didn't send the Discover, though, it doesn't respond with a DHCP Request, so that DHCP handshake is never completed. It seems that the DHCP snooping that the Juniper is doing notices this partial DHCP exchange and then drops some part of the route to or from me.
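
A rough sketch of how such a partial exchange could be picked out of a capture, for anyone chasing the same symptom. It assumes the DHCP messages seen on the WAN link have already been reduced to (time, transaction ID, message type) records; the `DhcpEvent` type, the function name, and the 60-second cutoff are all made up for illustration, not taken from any real tool.

```python
from typing import NamedTuple


class DhcpEvent(NamedTuple):
    time: float     # seconds since capture start
    xid: int        # DHCP transaction ID
    msg_type: str   # "DISCOVER", "OFFER", "REQUEST" or "ACK"


def incomplete_exchanges(events: list[DhcpEvent], timeout: float = 60.0) -> list[int]:
    """Return xids that saw an OFFER but no REQUEST within `timeout` seconds."""
    offers: dict[int, float] = {}   # xid -> time of first OFFER
    completed: set[int] = set()
    for ev in sorted(events, key=lambda e: e.time):
        if ev.msg_type == "OFFER":
            offers.setdefault(ev.xid, ev.time)
        elif ev.msg_type == "REQUEST" and ev.xid in offers:
            # A REQUEST answering an OFFER reuses the OFFER's xid.
            if ev.time - offers[ev.xid] <= timeout:
                completed.add(ev.xid)
    return [xid for xid in offers if xid not in completed]


events = [
    DhcpEvent(0.0, 0x1111, "DISCOVER"),
    DhcpEvent(0.1, 0x1111, "OFFER"),
    DhcpEvent(0.2, 0x1111, "REQUEST"),
    DhcpEvent(0.3, 0x1111, "ACK"),
    DhcpEvent(100.0, 0x2222, "DISCOVER"),  # leaked Discover
    DhcpEvent(100.1, 0x2222, "OFFER"),     # Offer that is never answered
]
print([hex(x) for x in incomplete_exchanges(events)])  # ['0x2222']
```

Renewal Requests that are not preceded by an Offer use a fresh xid, so they are simply ignored here.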

Why exactly the problem manifests as ARP requests/responses not being forwarded seems like it must be related to the particular setup within Sonic. Particularly interesting is that Sonic still routes traffic to my assigned IP address to my router's MAC address, but my ARP requests stop being forwarded to Sonic's router. I really don't understand enough about Sonic's wide-area ethernet architecture to deduce why that would happen. It's a very odd behavior.

In any event, I can solve this via policy routing rules (that I've now put in place) to make sure not to leak DHCP messages to the Sonic link. I can also probably reduce the load on Sonic's router by extending the ARP timeout on my side to something like 20 minutes (Juniper's default), although I think that shouldn't affect this issue at all.
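
To confirm that nothing leaks after a change like this, one option is to capture on the Sonic-facing interface and classify every DHCP message seen there. The sketch below only covers the classification step: it takes the raw BOOTP/DHCP payload (the bytes after the UDP header) and reads option 53, the DHCP message type. How the payloads are captured is left open, and the function name is made up for illustration.

```python
from typing import Optional

MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the DHCP options
MESSAGE_TYPES = {
    1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
    5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM",
}


def dhcp_message_type(payload: bytes) -> Optional[str]:
    """Return the DHCP message type name from a BOOTP/DHCP payload, or None."""
    # The fixed BOOTP header is 236 bytes, followed by the 4-byte magic cookie.
    if len(payload) < 240 or payload[236:240] != MAGIC_COOKIE:
        return None
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == 0:        # pad option: a single byte with no length field
            i += 1
            continue
        if code == 255:      # end option
            break
        if i + 1 >= len(payload):
            break            # truncated option header
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == 53 and len(value) == 1:
            return MESSAGE_TYPES.get(value[0])
        i += 2 + length
    return None


# Synthetic payload: empty BOOTP header, cookie, option 53 = 1, end option.
sample = bytes(236) + MAGIC_COOKIE + bytes([53, 1, 1, 255])
print(dhcp_message_type(sample))  # DISCOVER
```

Any Discover on the WAN link that the router's own DHCP client did not log sending would be a sign that the leak is still there.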

I'd still love for a Sonic engineer to weigh in on why a DHCP Offer from Sonic that isn't followed by a Request and Ack would then halt ARP traffic forwarding, but that would just be to satisfy my nagging curiosity.
by conradpino » Thu Feb 13, 2025 7:45 pm
Outstanding work; well done!!
by gadams » Fri Feb 14, 2025 8:28 am
Thanks! :-)

Oh, and I forgot to mention the thing that made it tricky to find: The ARP failure always occurs 5 minutes (well, between 311 and 337 seconds in my testing) after the incomplete DHCP exchange. So something Sonic-side seems to have a five-minute timeout.
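
That fixed delay is what makes the correlation checkable after the fact. A minimal sketch, assuming you already have the times of the incomplete DHCP exchanges and of the first unanswered ARP request as seconds on a common clock; the function name is made up, and the 311 to 337 second window is just the range observed in the testing above.

```python
def explained_by(
    incomplete_at: list[float],
    arp_failed_at: float,
    window: tuple[float, float] = (311.0, 337.0),
) -> list[float]:
    """Return the incomplete-exchange times whose delay to the ARP failure fits the window."""
    lo, hi = window
    return [t for t in incomplete_at if lo <= arp_failed_at - t <= hi]


# Two incomplete exchanges; only the second is about five minutes before the failure.
print(explained_by([1000.0, 5000.0], arp_failed_at=5320.0))  # [5000.0]
```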

So glad to have figured out the trigger!
by conradpino » Fri Feb 14, 2025 1:39 pm
CPE can trigger Sonic equipment into an ARP-level failure mode, which goes mostly unnoticed because most CPE does not trigger it.

Is that a fair summary of this topic?