Greetings, Tim,
Thanks for looking into this. To answer some of your questions:
I have a ZyXEL P-663HN-51 which is a combo dual/bonded-ADSL + 4-port LAN router + Wireless AP.
It is configured in bridged mode in accordance with your wiki article, except that it does not have a static IP assigned to it from my pool. It is configured with the standard 192.168.1.1 address for admin functions.
Since it is a combo-router there's no effective way to plug in directly to just the modem.
I've connected a non-vm machine to IP 173.228.5.243 (MAC:00:1a:92:78:2f:d4), plugged into LAN port #1 on the ZyXEL.
I have wireshark active and tracing the interface (eth1).
This machine has two interfaces, one of which goes to my internal network (eth0) and has the default route for the machine, and the other (eth1) temporarily connected to the ZyXEL.
For these tests eth1 is not the default route interface, so only specific traffic will get out through it. Thus, it's very quiet unless actively engaged for these tests.
When everything is fresh (i.e., I've just pinged out to the Sonic router), I run a traceroute test from
http://net.bluemoon.net to 173.228.5.243 on another machine. Here's the output:
traceroute to 173.228.5.243 (173.228.5.243), 64 hops max, 40 byte packets
1 gatekeeper (64.200.84.2) 1.435 ms 1.346 ms 1.118 ms
2 250.ATM1-0.GW10.NYC9.ALTER.NET (63.125.96.5) 26.729 ms 29.115 ms 30.596 ms
3 545.at-6-0-0.XR2.NYC9.ALTER.NET (152.63.24.234) 30.097 ms 29.661 ms 31.483 ms
4 0.so-4-0-1.XT2.NYC9.ALTER.NET (152.63.9.90) 23.507 ms 26.102 ms 24.486 ms
5 0.xe-5-1-0.BR2.NYC4.ALTER.NET (152.63.21.221) 24.162 ms 27.506 ms 23.957 ms
6 te9-2-0d0.cir1.nyc-ny.us.xo.net (206.111.13.125) 28.639 ms 24.097 ms 25.965 ms
7 207.88.14.185.ptr.us.xo.net (207.88.14.185) 30.779 ms 37.133 ms 51.152 ms
8 te-11-0-0.rar3.sanjose-ca.us.xo.net (207.88.12.69) 96.549 ms 104.921 ms 110.687 ms
9 207.88.14.226.ptr.us.xo.net (207.88.14.226) 96.227 ms 100.337 ms 101.597 ms
10 0.xe-4-1-0.gw3.equinix-sj.sonic.net (216.156.84.102) 100.906 ms 105.718 ms 95.892 ms
11 tengig2-1.cr1.snjsca11.sonic.net (64.142.0.106) 106.891 ms 108.801 ms 109.155 ms
12 gig1-1-1.gw.snjsca11.sonic.net (70.36.230.6) 102.648 ms 100.160 ms 99.083 ms
13 gig1-1-2.gw.lsatca11.sonic.net (70.36.243.10) 99.299 ms 100.429 ms 106.379 ms
14 173-228-5-243.dsl.static.sonic.net (173.228.5.243) 124.054 ms 103.771 ms 131.828 ms
Looks good, and I see packets coming in via wireshark.
When everything is fresh, I see UDP requests coming from 64.200.84.10, as you'd expect for traceroute.
The only 'who-has' requests I see, however, are from my host, and only when I actively make a connection out. E.g., here's one when I ping out to net.bluemoon.net (64.200.84.10):
232 6710.081292000 AsustekC_78:2f:d4 Broadcast ARP 42 Who has 64.200.84.10? Tell 173.228.5.243
233 6710.091931000 Cisco_8b:52:c6 AsustekC_78:2f:d4 ARP 64 64.200.84.10 is at 64:16:8d:8b:52:c6 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
Here's another ping out doing a who-has for the Sonic default router (173.228.5.1):
283 7127.295263000 AsustekC_78:2f:d4 Cisco_8b:52:c6 ARP 42 Who has 173.228.5.1? Tell 173.228.5.243
284 7127.307748000 Cisco_8b:52:c6 AsustekC_78:2f:d4 ARP 64 173.228.5.1 is at 64:16:8d:8b:52:c6 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
'AsustekC_78:2f:d4' is my computer.
'Cisco_8b:52:c6' appears to be 173.228.5.1 (the Sonic router IP I've been told to use for a default router for my subnet.)
The 'FRAME CHECK SEQUENCE INCORRECT' message is because this reply packet from the Sonic router does not have a correct frame check sequence (the Ethernet checksum); the field is all zeros:
Frame check sequence: 0x00000000 [incorrect, should be 0x9290099d]
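
As a rough illustration of what that check covers, here is a minimal Python sketch that rebuilds a who-has request like the one above and computes its frame check sequence. It assumes the standard Ethernet CRC-32 (the same polynomial zlib uses). The helper names are made up for this example, and the frame is a reconstruction from the addresses in this message, not bytes copied from the capture:

```python
import struct
import zlib

def mac_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))

def ip_bytes(ip: str) -> bytes:
    """Convert dotted-quad IPv4 text into 4 raw bytes."""
    return bytes(int(part) for part in ip.split("."))

def arp_request(src_mac: str, src_ip: str, target_ip: str) -> bytes:
    """Build a broadcast ARP who-has frame: Ethernet header + ARP payload, no FCS."""
    eth = mac_bytes("ff:ff:ff:ff:ff:ff") + mac_bytes(src_mac) + struct.pack("!H", 0x0806)
    # hardware type 1 (Ethernet), protocol 0x0800 (IPv4), lengths 6 and 4, opcode 1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac_bytes(src_mac) + ip_bytes(src_ip)
    arp += bytes(6) + ip_bytes(target_ip)  # target MAC is unknown, so all zeros
    return eth + arp

def ethernet_fcs(frame: bytes) -> bytes:
    """CRC-32 of the frame, packed low byte first as it is sent on the wire."""
    return struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF)

frame = arp_request("00:1a:92:78:2f:d4", "173.228.5.243", "173.228.5.1")
padded = frame.ljust(60, b"\x00")        # pad to the 60-byte Ethernet minimum
on_wire = padded + ethernet_fcs(padded)  # 4-byte FCS appended last
print(len(frame), len(on_wire))
```

The two lengths printed are 42 and 64, the same sizes Wireshark reports for the request and the reply above. A receiver can validate a frame by running the CRC over everything including the trailer, which yields a fixed residue only when the trailer is correct.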
I never see any 'who-has' requests coming from Sonic's side of things.
When everything magically times out (appears to be about 6 minutes with this machine) the above traceroute stops at hop #13:
...
12 gig1-1-1.gw.snjsca11.sonic.net (70.36.230.6) 105.351 ms 106.285 ms 101.635 ms
13 gig1-1-2.gw.lsatca11.sonic.net (70.36.243.10) 112.018 ms 104.457 ms 104.358 ms
14 * * *
15 * * *
16 * * *
17 * *
While a timeout was occurring, I started a traceroute and, in the middle of it, pinged out to the Sonic router. The result was this:
...
12 gig1-1-1.gw.snjsca11.sonic.net (70.36.230.6) 111.335 ms 98.924 ms 98.283 ms
13 gig1-1-2.gw.lsatca11.sonic.net (70.36.243.10) 98.439 ms 99.736 ms 106.464 ms
14 * * *
15 173-228-5-243.dsl.static.sonic.net (173.228.5.243) 103.684 ms 105.170 ms 105.414 ms
You had some comments that said:
> In normal operation, we should see an ARP who-has from your hosts at whatever interval they time out their cached entry of your default gateways MAC address.
I don't see how this is likely. My hosts would have to be doing something active outbound in order to generate a who-has request. They would have no reason to refresh their cache otherwise. Again, outbound is not the problem. If my machines are quiet, they never receive any inbound requests.
Perhaps you were talking about the Sonic router doing a who-has to discover my machines? I don't see any such requests coming over the DSL line.
> I've been watching your port for the last hour and I have not seen a single ARP request from you.
Hmm... I have a ping out running every 60 seconds from two of my hosts. It's possible that this is causing implicit cache refresh on both ends, but I'm doing that specifically because I don't want the hosts to become unreachable. I did have these pings running every 120 seconds, but I found that there was still a variable window in which my hosts become unreachable, hence I ping every 60 seconds now.
> Either you have a static ARP entry for your default gateway programmed into all of your hosts or something else is going on.
No static ARP entries. I do have a default IP route going to the Sonic router, but no static ARP entries. My understanding of network protocols is that routers do segment discovery to find their endpoints. Endpoints (hosts) only do discovery when they need to make an outbound connection. In this way, the routers know which host is on which segment and can route packets effectively. The hosts should not need to actively refresh their cache except when needing to send packets out for an unresolved IP. This is a bit simplistic, but essentially gets the point across.
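
To make that host-side behaviour concrete, here is a small Python model of an ARP cache. It is only a sketch of the simplified view described above. The ArpCache class, its method names, and the 60-second timeout are all hypothetical, and real network stacks are more elaborate than this:

```python
from dataclasses import dataclass, field

@dataclass
class ArpCache:
    timeout: float                               # seconds an entry stays valid (assumed value)
    entries: dict = field(default_factory=dict)  # ip -> (mac, time learned)
    requests_sent: int = 0                       # count of who-has requests emitted

    def learn(self, ip, mac, now):
        """Record or refresh a mapping, e.g. from an ARP reply."""
        self.entries[ip] = (mac, now)

    def send_packet(self, ip, now, answer_mac):
        """Resolve the next hop; only an outbound send can trigger a who-has."""
        cached = self.entries.get(ip)
        if cached is None or now - cached[1] > self.timeout:
            self.requests_sent += 1          # who-has goes out on the wire
            self.learn(ip, answer_mac, now)  # assume the gateway answers
        return self.entries[ip][0]

GATEWAY_IP = "173.228.5.1"
GATEWAY_MAC = "64:16:8d:8b:52:c6"

idle = ArpCache(timeout=60.0)   # a quiet host: never sends anything
busy = ArpCache(timeout=60.0)   # a host that sends at t = 0, 30 and 200 seconds
for t in (0, 30, 200):
    busy.send_packet(GATEWAY_IP, t, GATEWAY_MAC)

print(idle.requests_sent, busy.requests_sent)
```

In this model the idle host emits no who-has at all, while the busy host emits one for its first packet and one more after its entry has aged out, so a quiet host gives the other side nothing to observe.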
> During the same time period we have not lost your MAC entry in our table and your modem remains ATM pingable.
I suspect your cache is implicitly being refreshed by me because I don't want my hosts to become unaddressable. I have 4 IPs. Two are actively pinging out. One runs in stealth and only does NTP requests. The last one is this non-vm test host.
For your testing, I'll leave up my host on .243. Feel free to do some testing of your own.
Note that since this interface is separated, it has only one subnet that it knows about: 173.228.5.0
I had to add an explicit route for the bluemoon hosts subnet to ensure ping responses went back out through the same interface. I can add some additional explicit routes to any Sonic subnets if you need me to.
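
For illustration, the effect of that explicit route can be sketched as a longest-prefix-match lookup in Python. The /24 prefix lengths and the pick_interface helper are assumptions made for this example; only the addresses and interface names come from the setup described here:

```python
import ipaddress

def pick_interface(routes, destination):
    """Longest-prefix match: the most specific route containing the address wins."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, dev) for net, dev in routes if dest in net]
    return max(matches, key=lambda item: item[0].prefixlen)[1]

routes = [
    (ipaddress.ip_network("0.0.0.0/0"), "eth0"),       # default route via the internal network
    (ipaddress.ip_network("173.228.5.0/24"), "eth1"),  # connected subnet; /24 is assumed
]

# Without an explicit route, replies to the bluemoon host follow the default route.
before = pick_interface(routes, "64.200.84.10")

# Explicit route for the bluemoon subnet; the /24 prefix length is assumed.
routes.append((ipaddress.ip_network("64.200.84.0/24"), "eth1"))
after = pick_interface(routes, "64.200.84.10")

print(before, after)
```

Before the extra route is added, the lookup selects eth0; afterwards it selects eth1, which is why ping replies only return through the ZyXEL once the route is in place.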
Thanks in advance for anything you can discover about my problem!