So there you are, sitting at home, when a customer rings to complain that they have lost all connectivity through their ExpressRoute... Panic!
You perform your standard troubleshooting, but the ER (ExpressRoute) seems fine, the switches are behaving as expected and all seems well. Even a ticket with MS themselves and with your chosen ER provider doesn’t solve the issue. Now what?!
We faced a similar issue with one of our customers. VMs in Azure could reach the internet and each other, as could the machines in the on-prem datacentre. The only thing they could not reach was each other over the ER.
After hours of troubleshooting we were informed by MS that they could see data reaching the router on their side, but the data didn’t seem to be reaching the ExpressRoute gateway for some reason. At that point the call was ended and MS would continue troubleshooting on their side.
Enter an MS engineer who had seen this type of behavior before and who believed he might know how to fix it.
The solution turned out to be so simple it blew me away, especially since it involved something I did not know we could do from our side: Resetting the network gateway.
The procedure is easy and requires PowerShell, as do most awesome Azure fixes.
Log in to your account, select the correct subscription and simply enter:
Reset-AzureVNetGateway -VNetName *Name of your vnet*
The command will run for a couple of minutes, and should return a “success” status afterwards.
In our case, the customer was on the phone reporting that all connectivity had been restored even before the command finished, and all was well once again.
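For reference, here is a rough sketch of the whole procedure from login to reset. The first part uses the classic (Service Management) cmdlets that match the command above. The second part is what I would expect the equivalent to look like with the newer Az module for Resource Manager gateways. That part is an assumption on my side and not what we ran during this incident. All names in quotes ('My Subscription', 'MyVNet', 'MyGateway', 'MyResourceGroup') are placeholders.

```powershell
# --- Classic (Service Management) module, as used in this post ---
# Interactive login to your Azure account
Add-AzureAccount

# Pick the subscription that holds the VNet (placeholder name)
Select-AzureSubscription -SubscriptionName 'My Subscription'

# Reset (reboot) the gateway attached to the VNet (placeholder name)
Reset-AzureVNetGateway -VNetName 'MyVNet'

# --- Assumed equivalent with the Az module (Resource Manager) ---
# Not part of the original fix; shown only as a sketch.
Connect-AzAccount
Set-AzContext -Subscription 'My Subscription'

# Fetch the gateway object first, then pass it to the reset cmdlet
$gw = Get-AzVirtualNetworkGateway -Name 'MyGateway' -ResourceGroupName 'MyResourceGroup'
Reset-AzVirtualNetworkGateway -VirtualNetworkGateway $gw
```

Use only the half that matches your deployment model. Either way, expect a short interruption while the gateway restarts, so warn the customer first if anything is still working.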
The root cause of the outage in this case turned out to be a security update applied to the gateway, which resulted in network routes not propagating to the gateway correctly. These gateways are (as I am told, correct me if I’m wrong) basically Windows RRAS servers, so it shouldn’t be much of a surprise that a reboot fixes all 😉
So ladies and gents, if the ExpressRoute stops routing, reboot thy gateway!
Have a cloudy day!