sporadic WAN connection loss: ARP issues

by bg212 » Mon Sep 02, 2024 3:32 pm
Hi,

TL;DR: I am having ARP issues. When my router tries to resolve the MAC address of the Sonic gateway, the ARP requests time out. Unplugging the cable between the ONT and the router and plugging it back in restores the connection.

Details:
I have a 10 Gbps ONT box from Sonic, with my equipment behind it.
My equipment is all 1000 Mbps on the WAN side, so the link between my router and the ONT autonegotiates at 1000 Mbps.

Everything is great, except that every so often (anywhere from once every few days to twice a day) I lose my internet connection.

I don't lose the DHCP IP; it's still there. But the gateway's MAC entry has expired (it's no longer in the ARP table), and ARP requests towards Sonic go unanswered.
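For anyone who wants to check for the same symptom, here is a minimal Python sketch, assuming a Linux machine that exposes its neighbor table in the usual `/proc/net/arp` text format. The function name `gateway_arp_state` and the sample addresses (taken from the 192.0.2.0/24 documentation range) are illustrative assumptions, not values from this setup.

```python
# Classify the ARP state of a gateway from /proc/net/arp-style text.
# SAMPLE is made-up data using documentation addresses; on a real Linux
# box you could pass open("/proc/net/arp").read() instead.
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
192.0.2.1        0x1         0x2         00:00:5e:00:53:01     *        eth0
192.0.2.2        0x1         0x0         00:00:00:00:00:00     *        eth0
"""


def gateway_arp_state(arp_table_text, gateway_ip):
    """Return 'resolved', 'incomplete', or 'missing' for gateway_ip."""
    for line in arp_table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6 or fields[0] != gateway_ip:
            continue
        flags = int(fields[2], 16)
        # Bit 0x2 (ATF_COM) means the kernel has a completed entry,
        # i.e. a MAC address was actually learned for this IP.
        return "resolved" if flags & 0x2 else "incomplete"
    return "missing"


for ip in ("192.0.2.1", "192.0.2.2", "192.0.2.9"):
    print(ip, gateway_arp_state(SAMPLE, ip))
```

During an outage like the one described, the gateway would be expected to show up as "incomplete" or "missing" even though the DHCP lease is still valid.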

I tried two different residential routers, connected directly to the ONT (both 1Gbps uplinks).

I also put an unmanaged L2 switch (TP-Link) in front of my equipment, which is how I could monitor the uplink traffic and find these ARP issues.
Behind the switch, I had an SBC unit and the router (with two different DHCP IPs from Sonic), and they both experienced ARP timeouts.

So far, reps are saying there is nothing wrong on their end. When I spoke to a rep while the issue was happening, he said that from his end it looked like nothing was connected to the ONT.

For the record, ATT fiber has had 0 issues for me for years.

Any ideas?
by gadams » Tue Nov 19, 2024 7:38 pm
I am having this exact symptom right now, on my third day of Sonic 10Gb service.

Here are all the non-IPv6 packets flowing on the interface:

19:20:45.400499 ARP, Request who-has 157-131-120-1.fiber.dynamic.sonic.net tell 157-131-122-30.fiber.dynamic.sonic.net, length 28
19:20:46.423131 ARP, Request who-has 157-131-120-1.fiber.dynamic.sonic.net tell 157-131-122-30.fiber.dynamic.sonic.net, length 28
19:20:47.447155 ARP, Request who-has 157-131-120-1.fiber.dynamic.sonic.net tell 157-131-122-30.fiber.dynamic.sonic.net, length 28
19:20:47.536202 IP a.icmp-monitor.noc.sonic.net > 157-131-122-30.fiber.dynamic.sonic.net: ICMP echo request, id 895, seq 37585, length 64
19:20:48.472596 ARP, Request who-has 157-131-120-1.fiber.dynamic.sonic.net tell 157-131-122-30.fiber.dynamic.sonic.net, length 28
19:20:49.495142 ARP, Request who-has 157-131-120-1.fiber.dynamic.sonic.net tell 157-131-122-30.fiber.dynamic.sonic.net, length 28
19:20:50.519150 ARP, Request who-has 157-131-120-1.fiber.dynamic.sonic.net tell 157-131-122-30.fiber.dynamic.sonic.net, length 28
...
Interestingly, simply renewing the DHCP lease (restarting dhclient and letting it request a fresh lease) kicked the Sonic side into action. ARP works again, so I can reach the gateway and packets are flowing.

Interestingly, IPv6 was still working at the time, but this forum (and quite a few other sites, sadly!) only has an IPv4 address, so I couldn't even post here until fixing the problem.
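To quantify a capture like the one above, here is a rough Python sketch that counts ARP requests per target and drops any target that ever got a reply. It assumes tcpdump's default text output; the function name `unanswered_arp` is made up for illustration, and the sample lines are copied from the capture in this post.

```python
import re
from collections import Counter

# Patterns for tcpdump's default one-line ARP summaries.
REQUEST = re.compile(r"ARP, Request who-has (\S+)")
REPLY = re.compile(r"ARP, Reply (\S+) is-at ")


def unanswered_arp(lines):
    """Map each ARP target that never replied to its request count."""
    requests = Counter()
    replied = set()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            requests[match.group(1)] += 1
            continue
        match = REPLY.search(line)
        if match:
            replied.add(match.group(1))
    return {target: n for target, n in requests.items() if target not in replied}


capture = [
    "19:20:45.400499 ARP, Request who-has 157-131-120-1.fiber.dynamic.sonic.net tell 157-131-122-30.fiber.dynamic.sonic.net, length 28",
    "19:20:46.423131 ARP, Request who-has 157-131-120-1.fiber.dynamic.sonic.net tell 157-131-122-30.fiber.dynamic.sonic.net, length 28",
    "19:20:47.536202 IP a.icmp-monitor.noc.sonic.net > 157-131-122-30.fiber.dynamic.sonic.net: ICMP echo request, id 895, seq 37585, length 64",
    "19:20:48.472596 ARP, Request who-has 157-131-120-1.fiber.dynamic.sonic.net tell 157-131-122-30.fiber.dynamic.sonic.net, length 28",
]
print(unanswered_arp(capture))
```

A target with a steadily growing count and no reply is the signature of this problem: the requests leave the interface, but nothing ever answers them.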
by gadams » Wed Jan 01, 2025 8:08 pm
I've still been struggling with this problem. It's happened several more times, and each outage lasts until my DHCP lease is near expiration and my DHCP client falls back to broadcast requests. That can be quite a few hours.

The general sequence of events seems to be this:

- Traffic flows along fine.
- At some point, the next-hop (Sonic) router stops responding to ARP requests.
- Traffic continues to flow until the ARP cache entry goes stale.
- IPv4 traffic stops flowing. IPv6 traffic continues unabated.
- At some point hours later, the DHCP lease will be up for renewal. My DHCP client tries to send unicast DHCPREQUEST packets to the next-hop router (which I suspect is actually a DHCP relay, but may in fact be running a DHCP service). But it is unable to send these packets, because there's no valid ARP entry, and ARP requests are still being ignored.
- Eventually, the DHCP lease will be close to expiring, and my DHCP client will fall back to sending DHCPREQUEST to the broadcast address.
- The next-hop Sonic router immediately responds. The ARP cache entry is filled. (I haven't actually caught the ARP request and response on the wire yet, but I assume they must happen.) Something about this seems to trigger allowing ARP requests through again.
- IPv4 traffic starts flowing again.
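
The timing in the sequence above can be estimated with a small Python sketch. It assumes the client uses the RFC 2131 default timers (renewal by unicast at T1 = 50% of the lease, rebinding by broadcast at T2 = 87.5%); a server can override these with DHCP options 58 and 59. The 24-hour lease and the one-hour failure point are hypothetical numbers, not values observed on this connection.

```python
def dhcp_timers(lease_seconds, t1_fraction=0.5, t2_fraction=0.875):
    """Return (T1, T2) in seconds after the lease was obtained.

    T1: client starts unicast renewal (needs a working ARP entry).
    T2: client falls back to broadcast rebinding (no ARP needed).
    """
    return lease_seconds * t1_fraction, lease_seconds * t2_fraction


def seconds_until_broadcast(lease_seconds, arp_failure_at):
    """Time from the ARP failure until the client broadcasts at T2."""
    _, t2 = dhcp_timers(lease_seconds)
    return max(0.0, t2 - arp_failure_at)


lease = 24 * 3600          # hypothetical 24-hour lease
failure = 1 * 3600         # hypothetical: ARP stops being answered 1 hour in
t1, t2 = dhcp_timers(lease)
print("T1 (unicast renew) at", t1 / 3600, "h")
print("T2 (broadcast rebind) at", t2 / 3600, "h")
print("wait until broadcast:", seconds_until_broadcast(lease, failure) / 3600, "h")
```

If the broadcast at T2 really is what restores service, the outage length depends mostly on how early in the lease the ARP failure happens, which would fit with outages lasting many hours.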

I have talked with Sonic tech support about this a couple times. The first time, they updated the firmware in my ONT, but that didn't help. Apparently, there was some unrelated known issue with the ONT getting addresses, but I'm not sure what that was about.

The second time, the tech looked at the packets coming from my router, and saw that IPv6 was still flowing, but didn't see the ARP requests from my router, even as I watched my router send them (via snooping on the ethernet interface). So it seems like the ONT stops bridging ARP packets, for some reason? I'm perplexed.

I've tried two different routers with different configurations. The ethernet interface error counters are at 0. And IPv6 doesn't have any problems talking with the same next-hop router (with the same MAC address for both IPv4 and IPv6, so almost certainly a single, dual-stack router).

The next step appears to be to swap out the ONT. But this really seems like a software issue on the ONT or upstream, so I'm not sure if that will even help.
by gadams » Wed Jan 15, 2025 11:00 am
Unfortunately, replacing the ONT did not have any effect. The problem appears to be a software bug or misconfiguration in the ONT or the Sonic edge router (the last-hop router at Sonic that communicates through the ONT to the CPE router).

This thread seems to be getting no traction, so I think I'll start a new one. This problem is very frustrating, and makes the Sonic service unviable as a residential ISP connection.
by conradpino » Wed Jan 15, 2025 6:02 pm
This is a hack. While service is up, query the CPE router for its default route to get the Sonic gateway's IP address, then look up that gateway's MAC address.
Create a static, persistent ARP entry on the CPE router with that IP and MAC.
A static ARP entry in the CPE makes Sonic's failure to answer ARP irrelevant.
This hack breaks when Sonic changes the IP and/or MAC.
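A rough Python sketch of the steps above, assuming a Linux-based CPE router with iproute2. It only parses text in the format printed by `ip route show default` and `ip neigh show` and builds the `ip neigh replace ... nud permanent` command; it does not run anything. The helper names and the sample addresses (documentation ranges) are hypothetical.

```python
def parse_default_route(route_text):
    """Return (gateway_ip, device) from `ip route show default` output."""
    for line in route_text.splitlines():
        tokens = line.split()
        if tokens[:1] == ["default"] and "via" in tokens and "dev" in tokens:
            return tokens[tokens.index("via") + 1], tokens[tokens.index("dev") + 1]
    return None


def parse_gateway_mac(neigh_text, gateway_ip):
    """Return the MAC for gateway_ip from `ip neigh show` output, if learned."""
    for line in neigh_text.splitlines():
        tokens = line.split()
        if tokens[:1] == [gateway_ip] and "lladdr" in tokens:
            return tokens[tokens.index("lladdr") + 1]
    return None


def static_arp_command(gateway_ip, mac, device):
    """Build the iproute2 command that pins a permanent neighbor entry."""
    return ["ip", "neigh", "replace", gateway_ip,
            "lladdr", mac, "dev", device, "nud", "permanent"]


# Made-up sample output captured while service is up.
route_out = "default via 192.0.2.1 dev eth0 proto dhcp src 192.0.2.30 metric 100"
neigh_out = "192.0.2.1 dev eth0 lladdr 00:00:5e:00:53:01 REACHABLE"

gateway, dev = parse_default_route(route_out)
mac = parse_gateway_mac(neigh_out, gateway)
print(" ".join(static_arp_command(gateway, mac, dev)))
```

The printed command would need to be run as root on the router, and it has to be redone whenever Sonic changes the gateway IP or MAC.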
by gadams » Wed Feb 05, 2025 12:37 pm
Yeah, that is a hack, and as I mentioned in the other thread, I don't really have any interest in paying for a service where I have to hack around something that's fundamentally broken.

I'm exploring this more in the other thread, and still just collecting information and trying to characterize the problem. It seems more subtle and complex than you'd first expect.