Thoughts from a 6-day outage and a support protocol suggestion

by edbaskerville » Sun May 16, 2021 3:33 pm
To whom it may concern (Dane?),

The week before last, my Sonic fiber service went down for 6 days.

It turned out to be an extremely strange situation that seemed to affect only me, and ended up requiring rebooting an OLT card. Throughout the process, individual Sonic support personnel were respectful, helpful, and displayed knowledge at or beyond what I'd expect for their jobs.

At the system level, however, I think the support protocol failed, and ended up wasting the time of a lot of Sonic personnel. Not to mention my time—this was my choice, but I did spend probably 30+ hours debugging on my end.

With a slight change in protocol, I think the whole thing could have reasonably been wrapped up in 2 days rather than 6.

The change I'd suggest: if somebody is using their own router, and phone support decides to send someone out, the very first step should be for Sonic to plug in one of their SmartRG routers and see if it works. That's the simplest, most obvious troubleshooting step. (As it turns out, having a Sonic technician plug their Windows laptop into the ONT is not at all equivalent.)

The full details of my case are pretty long, but they matter for seeing where the protocol failed.

The whole thing went down like this:
  1. At about 3:15 AM on Saturday, May 1, my router started logging failed DNS lookups. When I woke up, my Internet connection was behaving strangely: the router was getting DHCP configuration, but neither TCP nor UDP was working. After the basic debugging step of plugging a couple of different laptops directly into the ONT, I called Sonic to see if something was up on their end.
  2. After several rounds of resetting MAC tables and plugging laptops directly into the ONT, phone support decided to send someone out.
  3. On Sunday, a technician came out to check things out. If he had started by testing with a Sonic SmartRG, things would probably have been escalated and fixed sometime on Monday or Tuesday. Instead, we tried plugging my laptops and my router directly into the ONT, none of which worked. Then came the huge red herring: he plugged his Sonic-issued Windows laptop directly into the ONT, and it worked, but this turned out to be because of a bug in Windows.
  4. Institutionally, Sonic hadn't yet recognized that the problem was on their end. Although they had not proven that my devices had all suddenly broken at 3:15 AM on Saturday, and although all individual support personnel seemed as baffled as I was, the support protocol seemed to have judged that this was my fault.
  5. The technician went home. I stayed and tried a few more devices, including an ASUS router reset to factory settings, none of which worked except for a Windows machine I had lying around.
  6. I called support and explained that only two Windows machines had worked—and that even an ASUS router reset to factory settings did not work. So either it was a Windows thing, or it was something strange like historical state tied to, say, MAC addresses.
  7. The support protocol did not allow escalation to a NOC or network engineer at this juncture. Instead, they decided to ship me an Eero, which they said would aid in debugging the problem. The Eero took three days to arrive.
  8. While I was waiting for the Eero, a friend of mine helped isolate the problem: we found that although TCP and UDP packets were coming in, they had 802.1Q VLAN tags. Presumably, internal Sonic infrastructure VLAN tags were leaking through past the ONT rather than getting removed one step upstream. Mac and Linux were correctly dropping the packets; Windows was not.
  9. The Eero arrived on Wednesday, May 5. It did not work. Not only did it not work, but it seemed to serve no troubleshooting purpose other than to prove to Sonic that I was not crazy. Phone support had me plug in my own laptop as the next step—the same thing I had started with more than four days earlier. They finally escalated to NOC, who decided to send out another technician.
  10. The technician came out on Thursday, May 6. She verified the Eero didn't work, verified my laptop didn't work, and verified that her Windows laptop did—all while sitting in my entryway, with me running Ethernet cables out to her, since pandemic protocol said she couldn't come in. Finally, she brought up a SmartRG from the van and had me plug it in. It did not work. After several hours, things finally got escalated to network engineers, who narrowed the problem down to either a SMART Multicast device or the OLT, which they couldn't test until the middle of the night in order to avoid impacting too many other customers.
  11. From 2 AM to 2:37 AM on Friday, May 7, Sonic network engineers reset the SMART Multicast device, which didn't seem to fix the problem, and a card on the OLT, which did. In the morning, I again had working Internet, with VLAN tagging appropriately missing from incoming packets.
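
For anyone who wants to check for the symptom described in item 8, here is a minimal Python sketch of spotting an 802.1Q tag in a raw Ethernet frame. It is an illustration only, not what was actually run during the outage: the thread does not say which capture tool was used, and the function name `parse_vlan_tag`, the sample frames, and VLAN ID 100 are all made up for the example. It relies only on the standard frame layout, in which a tagged frame carries the value 0x8100 where the EtherType normally sits (bytes 12-13), followed by a 2-byte tag whose low 12 bits are the VLAN ID.

```python
import struct

TPID_8021Q = 0x8100  # Tag Protocol Identifier that marks an 802.1Q-tagged frame


def parse_vlan_tag(frame: bytes):
    """Return (vlan_id, priority, inner_ethertype) for a tagged frame, else None.

    `frame` is a raw Ethernet frame starting at the destination MAC address.
    """
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 0-5: destination MAC, bytes 6-11: source MAC, bytes 12-13: EtherType/TPID.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != TPID_8021Q:
        return None  # untagged frame
    if len(frame) < 18:
        raise ValueError("truncated 802.1Q header")
    # Bytes 14-15: Tag Control Information, bytes 16-17: the real EtherType.
    tci, inner_ethertype = struct.unpack_from("!HH", frame, 14)
    vlan_id = tci & 0x0FFF  # low 12 bits
    priority = tci >> 13    # top 3 bits
    return (vlan_id, priority, inner_ethertype)


# Hypothetical sample frames (all-zero MAC addresses, made-up VLAN ID 100).
TAGGED = bytes(12) + struct.pack("!HHH", TPID_8021Q, 100, 0x0800) + b"payload"
UNTAGGED = bytes(12) + struct.pack("!H", 0x0800) + b"payload"

if __name__ == "__main__":
    print(parse_vlan_tag(TAGGED))    # (100, 0, 2048)
    print(parse_vlan_tag(UNTAGGED))  # None
```

If the presumption in item 8 is right, ordinary IPv4 traffic arriving on the customer side of the ONT should look like the untagged frame here, so a tagged result for that traffic would point at something upstream.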
I want to emphasize again that none of the individuals involved in this process did anything wrong as far as I could tell; they all seemed to be following protocol, and doing it in a kind, respectful way.

However, the protocol failed to execute the most obvious, simplest first step: if the customer doesn't use a Sonic router, test first with a Sonic router. Testing with a Sonic technician's Windows laptop adds an unnecessary variable that, in this case, turned out to be a horrific red herring due to a bug in Windows. If the simplest, most direct troubleshooting step had been taken first, I suspect this process would have wrapped up 4 days earlier.

Thanks for listening,
Ed

by ds_sonic_asif » Tue May 18, 2021 2:18 pm
Thoughtful, organized and well stated. I hope that Sonic takes notice.

At the one year anniversary of our installation I considered returning the Wifi/Router box, as in our environment it just feeds a box where I run a firewall and connectivity services for inside the house. I ended up keeping it because:
  • It seems like a low cost add on for improving Sonic's ability to support me
  • During PG&E shutoff events, directly using the WiFi is convenient, as there are far fewer things to keep powered (the ONT and the router) than in the more complex and distributed daily environment (firewall server, various switches, WiFi access points, etc.).
by amayfield » Tue May 18, 2021 3:51 pm
Thank you for such a comprehensive breakdown of the sequence of events and for the feedback you've provided. We hadn't seen this particular issue before, which is part of the reason it took so long to identify the root cause. We will be circling up with our Network Operations team to see if there are any steps we can take to involve them earlier in the process to avoid these kinds of situations in the future. Thanks again for the feedback, and I'm sorry it took us so long to get your issue resolved.
Andrew M.
Community & Escalations Manager
Sonic
by edbaskerville » Thu May 27, 2021 10:06 pm
Thanks for the response. (And sorry for the slow acknowledgment, just saw this.) I do hope it's useful to someone at Sonic, if only as an interesting case summary.

I'm still curious, though: how did this happen? This must have been a pretty strange problem for it to have affected only a single customer. My only hypothesis is cosmic rays.

Does someone at Sonic know what happened, or do they just know that resetting the OLT fixed it?