The OptionKey Blog: Troubleshooting Microsoft Failover Cluster Communication Errors Part 2

Monday, May 29, 2017

Troubleshooting Microsoft Failover Cluster Communication Errors Part 2

In my previous post I thought I had resolved all the issues with my Hyper-V cluster. I was WRONG. Annoyingly, the cluster worked fine for about 4 days and then had a massive communication error, and I lost 4 of my 6 nodes. So I went over the cluster with a fine-tooth comb and found that settings had been changed on the nodes.

I found several issues with the cluster. Some I could fix; others would have to wait for a replacement switch. First, all the networks that should be on the nodes were there, but the outages on the switch had caused some of them to be reclassified as private or public networks, and the firewall on those networks was blocking the cluster communications. The second issue had to do with multiple subnets and binding order: the network problems had caused a reset of the network adapters. Validating network communications continues to be an issue, and a setting in the cluster was causing some DNS issues. As part of these fixes we also did a full reboot of all of our switches, because our network monitoring system, Pathsolutions, had registered 6 of our main switches with a packet loss of 15% or greater. I also moved 4 of the nodes to 3 different switches in an effort to better distribute the network load.

Using the Cluster Validation Wizard, I was able to troubleshoot some of the cluster's network problems.

Firewall

On the networks that got reset, the firewall was blocking port 3343, so I grudgingly opened the port on all three firewall network profiles until I can find a better solution. A Microsoft Hyper-V cluster requires UDP port 3343 to communicate properly.
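As a rough sketch of what opening the port on each profile could look like, the Python snippet below builds (but does not run) one `netsh advfirewall firewall add rule` command per firewall profile. It assumes the three networks in question map to the Windows Domain, Private, and Public profiles; the rule name is a hypothetical label made up for this example, not something from my cluster.

```python
# Build, but do not execute, netsh commands that allow inbound UDP 3343.
# Assumption: the "three firewall networks" are the Domain, Private and
# Public profiles. RULE_PREFIX is a hypothetical name for illustration.
RULE_PREFIX = "Cluster-UDP-3343"
CLUSTER_PORT = 3343
PROFILES = ("domain", "private", "public")


def build_allow_rule(profile, port=CLUSTER_PORT):
    """Return the netsh argument list for one firewall profile."""
    if profile not in PROFILES:
        raise ValueError(f"unknown firewall profile: {profile}")
    return [
        "netsh", "advfirewall", "firewall", "add", "rule",
        f"name={RULE_PREFIX}-{profile}",
        "dir=in",
        "action=allow",
        "protocol=UDP",
        f"localport={port}",
        f"profile={profile}",
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before anyone runs them
    # from an elevated prompt on a node.
    for profile in PROFILES:
        print(" ".join(build_allow_rule(profile)))
```

Printing the commands rather than executing them keeps this a review step. Opening the port on the Public profile is the part I would most like to tighten up later.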

Multiple Subnets

The cluster has two AD controllers in failover mode, and they replicate changes to and from each other as expected. However, the cluster nodes were getting new IP addresses after the network failures because I had not reserved the IP addresses. That caused some issues, and other interfaces resetting to DHCP left the nodes with multiple external interfaces and DNS problems. This was compounded because the binding order also got scrambled.

This took the most work to fix. First you need to adjust the order in your Hyper-V switch manager, then you fix the IP address on the Hyper-V virtual NIC, which shows up as vEthernet.

The domain networks are both fully DHCP enabled and get their DNS settings from the DNS servers. vEthernet (01) is the primary cluster network, where the address has been reserved.

IP Example

IP address: 192.168.1.11
Subnet mask: 255.255.255.0
Default gateway: 192.168.1.1

DNS servers:
192.168.1.1
192.168.1.2

The unidentified networks are statically assigned, with no default gateway and no DNS servers defined.

IP address: 172.31.1.11
Subnet mask: 255.255.255.0
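To sanity-check an addressing plan like this before typing it into each adapter, something along these lines could help. It is a minimal Python sketch using the standard `ipaddress` module and the example addresses from this post. The adapter labels ("cluster", "private") and the three checks it performs are my own assumptions about what matters here: the gateway sits inside its subnet, only one adapter has a default gateway, and the subnets do not overlap.

```python
import ipaddress

# Example addresses from this post; the adapter labels are hypothetical.
ADAPTERS = {
    "cluster": {"ip": "192.168.1.11", "mask": "255.255.255.0",
                "gateway": "192.168.1.1"},
    "private": {"ip": "172.31.1.11", "mask": "255.255.255.0",
                "gateway": None},
}


def check_adapters(adapters):
    """Return a list of problems; an empty list means the plan looks sane."""
    problems = []
    networks = {}
    for name, cfg in adapters.items():
        iface = ipaddress.ip_interface(f"{cfg['ip']}/{cfg['mask']}")
        networks[name] = iface.network
        gateway = cfg["gateway"]
        # A default gateway must live inside the adapter's own subnet.
        if gateway is not None and ipaddress.ip_address(gateway) not in iface.network:
            problems.append(f"{name}: gateway {gateway} is outside {iface.network}")

    # Only one adapter should carry a default gateway.
    with_gateway = [n for n, c in adapters.items() if c["gateway"] is not None]
    if len(with_gateway) > 1:
        problems.append("more than one adapter has a default gateway: "
                        + ", ".join(sorted(with_gateway)))

    # Each adapter should sit on its own subnet.
    names = sorted(networks)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if networks[first].overlaps(networks[second]):
                problems.append(f"{first} and {second} share overlapping subnets")
    return problems


if __name__ == "__main__":
    issues = check_adapters(ADAPTERS)
    print("OK" if not issues else "\n".join(issues))
```

This only validates the plan on paper. It does not read the live adapter settings from a node, so the values still have to be compared against what each node actually reports.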

Validate Network Communications

Unfortunately there is nothing I can currently do about this, but the test is showing a packet loss of 10% or more.

After making all these fixes, I used the Cluster Validation tool to double-check that the issues had been resolved. Instead of a bunch of big red X's, I now get 2 warnings that I still need to work on resolving, but for now the cluster is back up and running great.

<< Back to Part 1
