TLS timeouts from SFO Network

10 posts Page 1 of 1
by quinnypig » Wed May 12, 2021 9:27 pm
This has been going on for multiple days and was obnoxious to diagnose. After a series of packet captures, MSS/MTU adjustments, and other experiments, I've narrowed a profoundly annoying / confusing issue down:

"There's something in Sonic's SFO network that causes intermittent TLS handshake timeouts to some (but not all) endpoints in AWS's us-east-1 region."

You can replicate this yourself from within the SFO network via a quick loop:
for i in `seq 1 10`; do echo -n | timeout 2 openssl s_client -connect iam.amazonaws.com:443; done


Some of those ten attempts will connect, others will not. 

Help?
by quinnypig » Wed May 12, 2021 11:39 pm
Because this is intermittent, toss this in a terminal somewhere, whack enter, and go get a cup of coffee for ten minutes or so.

for i in `seq 1 1000`; do echo -n | timeout 2 openssl s_client -connect iam.amazonaws.com:443 &> /dev/null; if [[ $? != 0 ]]; then echo "`date` POOP";fi; done
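
For anyone who wants a failure count rather than a stream of timestamps, here is a minimal sketch of the same idea. The function names probe_once and count_failures are made up for illustration and are not part of any tool; the probe itself is the same openssl s_client call with a 2 second timeout used above, and it assumes GNU timeout is installed.

```shell
# Attempt one TLS handshake; a non-zero status means it failed or timed out.
# Arguments: host, port (default 443), timeout in seconds (default 2).
probe_once() (
    host=$1
    port=${2:-443}
    limit=${3:-2}
    # Closing stdin makes s_client exit once the handshake has completed.
    timeout "$limit" openssl s_client -connect "${host}:${port}" \
        </dev/null >/dev/null 2>&1
)

# Run a command N times and print how many runs returned non-zero.
# Arguments: attempt count, then the command and its arguments.
count_failures() (
    attempts=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@" || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
)

# Usage (not run automatically):
#   count_failures 1000 probe_once iam.amazonaws.com 443 2
```

Splitting the probe from the counter means the same loop can be pointed at a different check later, such as a plain HTTP request, without rewriting it.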
by paulcoldren » Thu May 13, 2021 9:37 am
I experience this exact same issue. I spoke with Sonic support about it and they advised I post here. I'd like to get a Sonic NetOps person engaged if possible -- I'm more than happy to help diagnose whatever way I can. This is an important issue for me and for others who are trying to administer or use AWS services in us-east-1.

This issue doesn't occur 24/7. It seems to occur more often in the evenings Pacific Time, although I experienced it just now (morning Pacific Time) as well.

This is not just a TLS issue. I can reproduce it with plaintext HTTP as well. When it's happening, try a bunch of rapid-fire requests as follows:


time curl -v -IXGET http://us-east-1.console.aws.amazon.com


You'll see that they fail intermittently.

I believe I've eliminated DNS as a culprit. This hostname (us-east-1.console.aws.amazon.com) currently resolves to 3 IPv4 addresses, regardless of which DNS server I use, and all 3 exhibit the same behavior when I force cURL to use them.
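
One way to pin a request to a specific address, as a rough sketch: curl's --resolve option takes HOST:PORT:ADDRESS and skips DNS for that host while still sending the original Host header. The helper names resolve_arg and probe_ip are hypothetical, and the 5 second limit is an arbitrary choice, not something tuned against this problem.

```shell
# Build the value for curl's --resolve option: HOST:PORT:ADDRESS.
resolve_arg() {
    printf '%s:%s:%s\n' "$1" "$2" "$3"
}

# Send one plaintext HTTP request to a specific address for the given host.
# Arguments: host, IPv4 address, timeout in seconds (default 5).
# Returns curl's exit status; 28 means the operation timed out.
probe_ip() (
    host=$1
    ip=$2
    limit=${3:-5}
    curl -sS -o /dev/null --max-time "$limit" \
        --resolve "$(resolve_arg "$host" 80 "$ip")" "http://${host}/"
)

# Usage (not run automatically): try every address the name resolves to.
#   host=us-east-1.console.aws.amazon.com
#   for ip in $(dig +short "$host" A | grep -E '^[0-9.]+$'); do
#       probe_ip "$host" "$ip" || echo "$(date) FAIL $ip"
#   done
```

If every address fails at a similar rate, that points away from DNS or a single bad backend and toward the path in between.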

I've tested from multiple networks throughout the Bay Area. I've found at least one other network where this happens (Monkeybrains). I've confirmed that the issue occurs on multiple Sonic fiber connections. I've confirmed that the issue does *not* happen on Comcast, nor does it happen on Level3 or Cogent enterprise fiber circuits.

I've also confirmed that the issue does *not* happen when connected to Sonic VPN. This one is strange to me, since presumably it's all the same routing out of Sonic whether it's a residential or VPN connection, but I've tested extensively. VPN "fixes" the issue.

Sonic support asked me for some traceroutes from broken and working networks. I've collected a bunch here.
https://fileshare.pmcc.net/RUjupsDu/BROKEN_sonic_residential_fiber.txt
https://fileshare.pmcc.net/xyqtjdzw/BROKEN_sonic_small_business_fiber.txt
https://fileshare.pmcc.net/hAsGWBJe/BROKEN_monkeybrains_residential.txt
https://fileshare.pmcc.net/eQrveVlu/WORKING_comcast_residential.txt
https://fileshare.pmcc.net/xjwBedkF/WORKING_level3_enterprise_fiber.txt
https://fileshare.pmcc.net/BasmeWjb/WORKING_cogent_enterprise_fiber.txt
https://fileshare.pmcc.net/RZXCdInL/WORKING_sonic_vpn.txt


The only commonality in upstream providers I've seen is that both Sonic and Monkeybrains use Cogent as their upstream for getting to AWS us-east-1, at least for ICMP. However, this doesn't tell the full story, because 1.) It doesn't happen from Sonic VPN, which also uses Cogent as an upstream, and 2.) It doesn't happen from an actual Cogent enterprise fiber circuit in downtown SF.

Let me know what else I can do to help troubleshoot this!
by bgile » Thu May 13, 2021 11:44 am
Hello
Sorry to hear about the issues with TLS. My team and I did some poking around, and it was pretty easy to replicate the random connection issue (thanks for the command, it made things pretty handy!). We could replicate this even on the Sonic VPN. We decided to make a routing change and moved traffic to Telia. So far it all looks clean: we can no longer reproduce any TLS handshake timeouts.

Currently, we are opening a ticket with Cogent to address the issue. Please retest and let us know if it is better for you as well.

I will add that we moved the following blocks; I am not 100 percent sure this will account for all the services:
52.94.224.0/20
52.46.128.0/19
If there is another destination you are trying to reach that is having an issue please let us know.
Thank you

-Brandon Gile
-Network Engineering
by paulcoldren » Thu May 13, 2021 4:06 pm
Hi Brandon -- thanks for investigating.

I've had issues with us-east-1.console.aws.amazon.com, which points to these three IPv4 addresses according to some global DNS checks:
54.239.30.25
54.239.31.91
54.239.31.83

This looks to be announced as 54.239.16.0/20.

Can you move all AWS routes or is that too much?

If you can at least move 54.239.16.0/20, I'll do some tests tonight and let you know if it seems better for me.
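
For reference, a quick way to sanity-check that the three addresses above really fall inside 54.239.16.0/20. This is only a sketch in plain shell arithmetic; ip_to_int and in_cidr are names invented here, and it assumes well-formed dotted-quad input with no leading zeros.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
    IFS=.
    # Word splitting on "." turns the address into four positional parameters.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if the address in $1 lies inside the CIDR block in $2.
in_cidr() (
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")    # part before the slash
    bits=${2#*/}                  # prefix length after the slash
    mask=$(( (4294967295 << (32 - $bits)) & 4294967295 ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
)

# Usage (not run automatically):
#   for ip in 54.239.30.25 54.239.31.91 54.239.31.83; do
#       in_cidr "$ip" 54.239.16.0/20 && echo "$ip is inside the /20"
#   done
```

A /20 starting at 54.239.16.0 spans 54.239.16.0 through 54.239.31.255, so all three addresses are covered.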
by psanford » Sun May 16, 2021 9:51 pm
I'm seeing timeouts again using Corey's one-liner above (and also in the browser).
by bgile » Mon May 17, 2021 8:37 am
Apologies, I was out Friday. It looks like Cogent did some work. I am not seeing any loss yet to those IPs listed as part of the /20 for us-east-1.console.aws.amazon.com. I'm going to do some more testing throughout the day and make sure it looks fine and move the block if needed. If it looks stable then we will probably move the other blocks back, otherwise, we will continue to pressure Cogent.
by bgile » Mon May 17, 2021 9:38 am
I moved 54.239.16.0/20 as well, since I started to see issues with it too. Now it's testing clean.
by paulcoldren » Thu May 20, 2021 8:54 pm
It's been working well from Sonic for the last several days since you made the change.

I've noticed that us-east-1.console.aws.amazon.com has been working reliably from Monkeybrains as well, but iam.amazonaws.com is still not working reliably from Monkeybrains (recall that Monkeybrains also uses Cogent as a transit provider for AWS us-east-1).

So it sounds like:
1.) Switching to Telia fixed it.
2.) Cogent did something that seems to have fixed it for us-east-1.console.aws.amazon.com, but it's still broken as of right now for iam.amazonaws.com.
by quinnypig » Thu May 20, 2021 9:07 pm
I saw a couple of timeouts today to the console, but they were transient. It's definitely better!