Frequent IPv4 outages due to ARP requests not being answered

15 posts Page 1 of 2
by gadams » Wed Jan 15, 2025 11:36 am
Hello, all!

I'm having a serious problem with my 10 Gb fiber service. I have been unable to resolve this problem since the service was installed.

Every few days, my IPv4 connection drops for several hours. The symptom that I can see is that starting at some point, my router's ARP requests to the Sonic edge router go unanswered, so my router becomes unable to reach the Sonic router. This is only corrected once the DHCP lease is about to expire, and my DHCP client sends a DHCPREQUEST to the broadcast address. The Sonic DHCP server then responds with a DHCPACK, and simultaneously starts answering ARP requests again. Until it happens again a few days later.

I have called Sonic support to investigate, and the support tech told me that he could not see the ARP requests from my router--even as I watched the packets leave my router on the way to the ONT. I'm not exactly sure where he was watching the traffic, but he did tell me that he couldn't directly inspect traffic at the ONT.

This appears to be a bug or configuration problem in either the ONT or the Sonic edge router (the one that my CPE router talks to through the ONT). Either the ONT is failing to bridge the packets, or the Sonic edge router is ignoring them. I'm not able to determine any pattern to the cessation of ARP responses.

Things we have tried:
  • Replacing the CPE router with a different type of hardware
  • Rebuilding the CPE router config from scratch
  • Replacing the ONT
  • Manually sending other broadcast traffic during the outages
None of these had any effect on the problem.

Some additional details:
  • Sonic's DHCP server does not respond at all to unicast DHCPREQUESTs to refresh the lease. Ever. This seems odd.
  • Because the lease time is 6 hours, the outages last less than 6 hours. But they're often 4 or 5 hours, which is of course unacceptable.
  • IPv6 traffic is unaffected; NDP packets continue to work as expected. It's just ARP that fails.
  • IPv4 inbound traffic continues, but since our router's ARP cache entry has expired and ARP requests receive no replies, our router cannot send responses. We also can't initiate any outbound connections. The IPv4 link is effectively down.
  • There are no interface errors on the CPE router <-> ONT ethernet link.
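
The "less than 6 hours" ceiling above is consistent with standard DHCP timers. As a rough sketch, assuming the client uses the RFC 2131 default fractions (T1 at 50% of the lease for unicast renewal, T2 at 87.5% for broadcast rebinding) and that the server does not override them (an assumption; the timer values Sonic actually sends are not shown in this thread), the first broadcast DHCPREQUEST would go out about 5 hours 15 minutes into a 6-hour lease:

```python
from datetime import timedelta

def dhcp_timers(lease, t1_fraction=0.5, t2_fraction=0.875):
    """Return (T1, T2) as offsets from the start of the lease.

    The fractions are the RFC 2131 defaults; a server may override them,
    so treat them as assumptions for this link.
    """
    return lease * t1_fraction, lease * t2_fraction

lease = timedelta(hours=6)          # lease time reported in this post
t1, t2 = dhcp_timers(lease)

print(f"T1, unicast renewal attempts begin:   {t1}")
print(f"T2, broadcast rebinding begins:       {t2}")
```

If unicast renewals are never answered, the broadcast at T2 is the first message that can succeed, which would line up with outages lasting up to roughly 5 hours.
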
Here is an example of what the traffic looks like from the CPE router side as we recover the connection:

...
09:06:22.245883 ARP, Request who-has 192-184-176-1.fiber.dynamic.sonic.net tell 192-184-177-68.fiber.dynamic.sonic.net, length 28
09:06:23.276791 ARP, Request who-has 192-184-176-1.fiber.dynamic.sonic.net tell 192-184-177-68.fiber.dynamic.sonic.net, length 28
09:06:24.293870 ARP, Request who-has 192-184-176-1.fiber.dynamic.sonic.net tell 192-184-177-68.fiber.dynamic.sonic.net, length 28
09:06:25.318874 ARP, Request who-has 192-184-176-1.fiber.dynamic.sonic.net tell 192-184-177-68.fiber.dynamic.sonic.net, length 28

-- typed `renew dhcp int eth9` here --

09:06:26.357816 ARP, Request who-has 192-184-176-1.fiber.dynamic.sonic.net tell 192-184-177-68.fiber.dynamic.sonic.net, length 28
09:06:27.337164 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 20:7c:14:f5:90:59 (oui Unknown), length 300
09:06:27.529024 IP bng1.snfcca05.sonic.net.bootps > 192-184-177-68.fiber.dynamic.sonic.net.bootpc: BOOTP/DHCP, Reply, length 548
09:06:27.529205 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 20:7c:14:f5:90:59 (oui Unknown), length 300
09:06:27.553513 IP bng1.snfcca05.sonic.net.bootps > 192-184-177-68.fiber.dynamic.sonic.net.bootpc: BOOTP/DHCP, Reply, length 548
09:06:27.566775 ARP, Reply 192-184-176-1.fiber.dynamic.sonic.net is-at b4:f9:5d:35:2e:3c (oui Unknown), length 50
09:06:28.328917 ARP, Request who-has 192-184-176-1.fiber.dynamic.sonic.net tell 192-184-177-68.fiber.dynamic.sonic.net, length 28
09:06:28.333337 ARP, Reply 192-184-176-1.fiber.dynamic.sonic.net is-at b4:f9:5d:35:2e:3c (oui Unknown), length 50
What would cause the ONT to stop forwarding ARP requests at some point? Or what would cause the Sonic edge router to stop receiving and acting on them? Why does a DHCPREQUEST to the broadcast address trigger things to start working again?
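
One way to measure how long requests go unanswered in a saved capture is to scan the tcpdump text for runs of ARP requests with no reply in between. This is a minimal sketch that assumes the default tcpdump line format shown above; the function name is made up for illustration and the sample lines are copied from the capture.

```python
import re
from datetime import datetime

# Matches the timestamp and ARP operation at the start of a tcpdump text line.
ARP_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) ARP, (Request|Reply) ")

def longest_unanswered_run(lines):
    """Return (count, first_ts, last_ts) for the longest run of ARP
    requests that has no ARP reply in between."""
    best = (0, None, None)
    count, first = 0, None
    for line in lines:
        m = ARP_LINE.match(line)
        if not m:
            continue  # skip DHCP lines, notes, and anything else
        ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        if m.group(2) == "Request":
            if count == 0:
                first = ts
            count += 1
            if count > best[0]:
                best = (count, first, ts)
        else:
            count = 0  # a reply ends the current run
    return best

GW = "192-184-176-1.fiber.dynamic.sonic.net"
ME = "192-184-177-68.fiber.dynamic.sonic.net"
sample = [
    f"09:06:22.245883 ARP, Request who-has {GW} tell {ME}, length 28",
    f"09:06:23.276791 ARP, Request who-has {GW} tell {ME}, length 28",
    f"09:06:24.293870 ARP, Request who-has {GW} tell {ME}, length 28",
    f"09:06:25.318874 ARP, Request who-has {GW} tell {ME}, length 28",
    "-- typed `renew dhcp int eth9` here --",
    f"09:06:26.357816 ARP, Request who-has {GW} tell {ME}, length 28",
    f"09:06:27.566775 ARP, Reply {GW} is-at b4:f9:5d:35:2e:3c (oui Unknown), length 50",
    f"09:06:28.328917 ARP, Request who-has {GW} tell {ME}, length 28",
    f"09:06:28.333337 ARP, Reply {GW} is-at b4:f9:5d:35:2e:3c (oui Unknown), length 50",
]

count, first, last = longest_unanswered_run(sample)
print(count, "unanswered requests over", (last - first).total_seconds(), "seconds")
```

Timestamps here carry no date, so a capture that crosses midnight would need the date added (for example with tcpdump's `-tttt` option) before the durations are meaningful.
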

Also, why is this not causing massive problems for all Sonic customers? I'm not the only one seeing this, though: viewtopic.php?t=18064

Our CPE router is running a Linux-based routing OS, VyOS. This same hardware is currently working just fine with an AT&T connection and has no problems with Comcast, either. We've had no issues running in a data center connected to Hurricane Electric, as well. I've never seen a problem like this in any other installation.
by conradpino » Thu Jan 16, 2025 7:56 am
The VyOS documentation (quality unknown) at https://vyos.dev/w/user-guide/?v=2, in its Basic Connectivity Verification section, suggests that VyOS ping accepts both IP and MAC addresses.

Please run ping against both the IP and the MAC address, both when the interface works and when it fails; four (4) test cases.

ARP is a Layer 2 link level broadcast domain protocol.
  • Are there switches (basic or managed) between ONT interface and VyOS interface?
  • See VyOS documentation Bridging section; does VyOS have active bridge interface?
If either of the above is true, then consider a Spanning Tree Protocol port block due to a switching loop.

This is a hack. When service is up, query VyOS for its default route to Sonic IP and MAC addresses.
Create VyOS static persistent ARP entry with above IP and MAC.
Static ARP in VyOS makes Sonic ARP response irrelevant.
This hack breaks when Sonic changes IP and/or MAC.
by gadams » Thu Jan 16, 2025 10:51 am
Thanks for your response.

I think I can answer all of your questions now, even without waiting until the next outage. I've done a lot of testing.
  • There are no switches or anything other than a single Cat6 ethernet cable between my router and the ONT.
  • There are no bridge interfaces on the VyOS router at all. It is purely doing simple layer 3 routing.
  • The Sonic router answers pings to its address (192.184.176.1/21) when up, but naturally I am unable to send pings when there's no ARP table entry for Sonic's router.
  • Sonic's router never answers pings to the broadcast address (192.184.183.255/21) in any state.
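
As a quick sanity check on the addressing above, Python's standard `ipaddress` module can derive the /21 boundaries. The client address 192.184.177.68 is an assumption read off the reverse-DNS name in the earlier capture, not something confirmed elsewhere in the thread.

```python
import ipaddress

# Gateway address and prefix length as given in the list above.
gateway = ipaddress.ip_interface("192.184.176.1/21")
network = gateway.network

# Assumed from the hostname 192-184-177-68.fiber.dynamic.sonic.net.
client = ipaddress.ip_address("192.184.177.68")

print("network:   ", network)
print("broadcast: ", network.broadcast_address)
print("addresses: ", network.num_addresses)
print("client on same subnet as gateway:", client in network)
```

Since the client and gateway fall in the same /21, the router has to resolve the gateway with ARP directly, which is why the missing replies take the whole IPv4 link down.
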
I do not believe there is a layer 2 (ethernet) ping. ICMP is a layer 3 protocol, and I'm not sure what protocol would be similar to ping on layer 2. You may be seeing the IPv6 address notation in the ping command docs and mistaking that for a MAC address.

Indeed, I have tried setting a static ARP table entry containing the MAC address of the Sonic edge router. As you'd expect, this does bring connectivity back up.

But, as you say, it is a hack. I shouldn't count on the Sonic router having that exact IP address and MAC address; Sonic could change that at any time. Dane Jasper has previously stated that having the flexibility to do things just like that is exactly why they're not offering static addressing on their fiber services.

And, of course, as soon as I remove the static ARP entry, everything is broken again, because the Sonic router still doesn't answer ARP requests. That is the basic problem, and I don't think "working around it" is wise.

Sonic's ONT should really forward ARP requests, and Sonic's router should really answer them. Anything less is just broken. It's really hard to imagine how something so basic can be broken. I feel like I must be missing something, but I've been digging into this failure for a couple months, now, and I'm still at a loss.

I'll have to cancel the service if we can't resolve this problem; the service has yet to be useable.
by conradpino » Thu Jan 16, 2025 2:47 pm
gadams wrote: Thu Jan 16, 2025 10:51 am I do not believe there is a layer 2 (ethernet) ping. ...
ARP ping exists but may not be universal; see https://en.wikipedia.org/wiki/Arping
VyOS documentation cited shows command line options:

ping x.x.x.x
ping h:h:h:h:h:h:h:h
Where first form looks like IPv4 address and second a MAC address.

IMO, STP issues are ruled out, no bridge interface and no switches.

Since Sonic replaced ONT to no effect that tends to make ONT hardware moot.
Your VyOS with AT&T and Comcast assertion suggests narrow Sonic side bug.
IMO it's reasonable to assume wide bug affecting many customers isn't in play.
As only affected device and as a practical matter, you have a burden of proof.

Linux networking is generally sound, which suggests VyOS packet captures are accurate but ... there's room for doubt.
Do you have a managed switch to place between ONT and VyOS to verify that the packet capture seen so far is accurate?

BTW, my interest here is Sonic is deploying to Castro Valley and I want to preview the Sonic experience.
by gadams » Fri Jan 17, 2025 11:20 am
conradpino wrote: ARP ping exists but may not be universal; see https://en.wikipedia.org/wiki/Arping
Huh! TIL. Thanks for that. The arping utility (which is available on VyOS) allows repeatedly sending broadcast or unicast ARP requests.

I'll use this during the next outage to see if it can tell me anything I couldn't already tell by watching the ARP requests and lack of responses.

conradpino wrote: [ ... ] Where first form looks like IPv4 address and second a MAC address.
Those 8 hex words form an IPv6 address, not a 6-byte Ethernet MAC address. The ping command is for ICMP or ICMPv6.
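
For anyone who wants to tell the two notations apart mechanically, here is a small sketch. The `classify` helper is hypothetical; it treats six colon-separated hex byte pairs as a MAC address and defers everything else to the standard `ipaddress` parser.

```python
import ipaddress
import re

# Six groups of exactly two hex digits, colon separated.
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5}$")

def classify(text):
    """Return 'mac', 'ipv4', 'ipv6', or 'unknown' for an address string."""
    if MAC_RE.match(text):
        return "mac"
    try:
        return f"ipv{ipaddress.ip_address(text).version}"
    except ValueError:
        return "unknown"

for sample in ("192.184.176.1", "b4:f9:5d:35:2e:3c", "1:2:3:4:5:6:7:8"):
    print(sample, "->", classify(sample))
```

Eight colon-separated groups, as in the `h:h:h:h:h:h:h:h` form from the docs, parse as IPv6; a MAC address has only six.
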

As you suggested, I have now inserted a spare 10 GbE managed switch between my router and the ONT. I've set up port mirroring of the ONT-facing port, so that I can observe traffic without having to trust the Linux network stack. I'll report what I see during the next outage.

(There was an ARP failure early this morning that caused an outage, but it only lasted 25 minutes, and I wasn't able to investigate at the time.)
by conradpino » Fri Jan 17, 2025 12:03 pm
@gadams Thank you; I miscounted, seeing 6 hex bytes instead of the actual 8 groups; cognitive bias. I saw VyOS arping somewhere but misplaced it; thanks for your due diligence. I presume the arping test will target the Sonic MAC.

Have you seen Sonic preference on arping to broadcast (ff:ff:ff:ff:ff:ff) MAC? If it works and each customer has (A) isolated broadcast domain (VLAN) then a single response is correct whereas (B) shared broadcast domain will have many responses.

IMO your network skills are professional level, I presume mirrored port packet capture suggestions are redundant.
by gadams » Fri Jan 17, 2025 5:57 pm
Thanks; I've been working in internetworking for decades, so I kind of know my way around this stuff. (I've had my home on the 'net since 1992, and I helped build the first commercial ISP in the DC area back in the day.)

I'm not sure exactly what you're asking about a preference for broadcasting. When all is operating normally, my router sends unicast ARP requests, and the Sonic router sends unicast ARP replies. But once the Sonic router stops answering, the ARP entry expires from my router's cache, and it reverts to sending broadcast requests (which also go unanswered).
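
To make the unicast versus broadcast distinction concrete, here is a sketch that builds (but does not send) both kinds of ARP request as raw Ethernet frames using only the standard library. The MAC addresses are taken from the capture earlier in the thread; the client IP is assumed from its reverse-DNS name, and the target hardware address field is simply left zeroed in both cases.

```python
import socket
import struct

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"

def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))

def arp_request(src_mac, src_ip, target_ip, dst_mac=BROADCAST_MAC):
    """Build an Ethernet frame carrying an ARP request (not transmitted)."""
    ethernet = mac_bytes(dst_mac) + mac_bytes(src_mac) + struct.pack("!H", 0x0806)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6, 4,     # hardware and protocol address lengths
        1,        # operation: request
        mac_bytes(src_mac), socket.inet_aton(src_ip),
        bytes(6), socket.inet_aton(target_ip),  # target MAC zeroed in this sketch
    )
    return ethernet + arp

MY_MAC = "20:7c:14:f5:90:59"      # CPE router MAC from the DHCP lines
SONIC_MAC = "b4:f9:5d:35:2e:3c"   # edge router MAC from the ARP replies
MY_IP = "192.184.177.68"          # assumed from the reverse-DNS name
SONIC_IP = "192.184.176.1"

broadcast_frame = arp_request(MY_MAC, MY_IP, SONIC_IP)
unicast_frame = arp_request(MY_MAC, MY_IP, SONIC_IP, dst_mac=SONIC_MAC)

print("ARP payload bytes:", len(broadcast_frame) - 14)
print("broadcast dst:", broadcast_frame[:6].hex(":"))
print("unicast dst:  ", unicast_frame[:6].hex(":"))
```

The only difference between the two frames is the Ethernet destination address, so a filter somewhere in the ONT or OLT path that treats broadcast and unicast frames differently could, in principle, pass one and drop the other. The 28-byte ARP payload appears to match the "length 28" reported for the requests in the capture.
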

The Sonic router, on the other hand, doesn't send ARP requests very often at all. In fact, I've never caught it sending one. It's as if it uses just the DHCP requests to populate its ARP cache and keeps it there. (This may be consistent with the unsolicited ARP reply that the Sonic router sends after DHCPACK.) The Sonic router is a Juniper, and I believe Juniper's default ARP timeout is 20 minutes. That's quite different from the ~45 seconds of Linux, but more tellingly, since it doesn't send ARP requests even every ~20 minutes, I think the ARP configuration of Sonic's Junipers has been changed from default.

But still, that doesn't explain why it wouldn't be answering ARP broadcast requests.

Broadcasts on a GPON network are interesting. And there is definitely some filtering going on. I wonder if there's something amiss with the filtering.

A GPON network is broadcast by nature at the physical level; it's a shared fiber medium. And since I don't see all the traffic of my neighbors (even broadcasts, such as ARP or DHCPDISCOVER) on my side of the ONT, it's clearly filtering traffic that isn't relevant to me. It's possible that since XGS-PON uses time-division multiplexing, I'm not seeing neighbor traffic because the ONT only "sees" traffic within my allocated time slices. I'm honestly not familiar with how this part works.

At the IP level, too, there's a broadcast domain within the /21 that I've been assigned via DHCP, but again, I don't see anyone else's DHCP requests. (It's possible that the IP /21 membership corresponds closely with the PON, but it wouldn't need to.)

Clearly, something in the ONT or OLT is deciding that IP broadcast packets don't actually need to be forwarded to everyone; the ONTs are not acting as strict layer 2 bridges.

Is that mechanism deciding not to forward my router's broadcast ARP requests to Sonic's router?

I would love for a Sonic engineer to weigh in on what might be going wrong, here.
by conradpino » Fri Jan 17, 2025 7:36 pm
My XGS-PON (10G-PON) reading says TDM and TDMA are in play; multiple wavelengths, multiple time slots.
This suggests to me Sonic OLT sees all customer ONT and customer ONT sees just the OLT.
Customer ONT are logically invisible to each other (Layer 2) unless OLT is routing (Layer 3).
Each customer ONT is a distinct Layer 2 link level broadcast domain.

LAN ARP cache lifetimes are typically short to accommodate unscheduled cabling changes.

Sonic knows any customer equipment change must start with DHCP, making frequent ARP checks pointless. We know IPv6 traffic is good, meaning the fiber and PHY hardware on both sides can't be at fault; hence likely a firmware / software bug.

Consider making the WAN VyOS ARP cache entry lifetime equal to the DHCP lease time. I don't know how; just think it's reasonably safe given Sonic side behavior.

Sonic staff is here from time to time but since they've already done standard due diligence, I don't expect an opinion until definitive diagnostic evidence emerges.
by conradpino » Sat Jan 18, 2025 6:23 am
VyOS has a safe hack; extend ARP cache timeout on just WAN interface.
https://docs.vyos.io/en/latest/configur ... he-timeout
by gadams » Sat Jan 18, 2025 8:22 pm
Yes, I have considered doing exactly that. I could set the ARP timeout to be, say, the DHCP lease time (6 hours). But again, that seems like a hack (and the lease time could change any time without notice), and would just mask the problem. The Sonic equipment should answer ARP requests; not answering ARP defies the standards and the expectations of an ethernet link, and it cannot be intentional.

I'd definitely like someone from Sonic to weigh in here on this behavior before I consider hacking around an apparently broken configuration.

(And how much more definitive diagnostic evidence do you think is needed beyond showing the packet traces for ARP requests going unanswered, and user bg212 reporting the same behavior in viewtopic.php?t=18064? I imagine there are other users out there, too, who just find it reasonable to reboot their equipment occasionally, or otherwise kick-start their connection when it inexplicably goes down.)