The 6-node Hyper-V cluster I had set up started acting buggy after running rock solid for about 2 years. I keep this cluster fairly up to date with current system patches, and I also use Control Up to monitor the real-time status of the cluster and a standalone Hyper-V server. (On a side note, Control Up helped me diagnose an I/O issue with my standalone server. Read Post)
The symptoms were:
- slow access to the cluster manager
- cluster node timeouts/drops
- DNS Errors
- iSCSI Target Timeouts/Delayed writes
- Control Up alerts on NIC Packet Errors/Drops
- Cluster validation test -> Network failure
- Cluster Update Errors
At first glance, the only issue on the down node seemed to be a generic communication/TCP error, and rebooting that node seemed to resolve it. However, the fix only lasted for the work day; the next morning the communication errors between the nodes would show up again. Server Manager and Cluster Manager both had logs reporting communication failures and migration failures, but nothing more specific than that. So we just kept rebooting the systems in the morning to keep things going until I could come back and troubleshoot more thoroughly. I suspected that the problem was the switch the cluster was plugged into.
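Since the logs only pointed to generic communication failures, a quick TCP reachability check between nodes can help narrow things down before rebooting anything. Here is a minimal Python sketch of that idea. The node names and the port list are my assumptions for illustration (3343 is the port commonly documented for the Cluster Service and 445 is SMB), so adjust them to your environment.

```python
import socket

# Hypothetical defaults: replace with the ports that matter in your setup.
# 3343 is commonly documented for the Cluster Service; 445 is SMB.
DEFAULT_PORTS = (3343, 445)


def port_is_open(host, port, timeout=2.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        conn = connect((host, port), timeout)
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
    conn.close()
    return True


def check_nodes(nodes, ports=DEFAULT_PORTS, timeout=2.0,
                connect=socket.create_connection):
    """Return a list of (node, port) pairs that could not be reached."""
    failures = []
    for node in nodes:
        for port in ports:
            if not port_is_open(node, port, timeout, connect):
                failures.append((node, port))
    return failures
```

Calling `check_nodes(["hv-node1", "hv-node2"])` with your own node names (these two are made up) returns the node/port pairs that failed. Running it from each node in the morning, when the errors showed up, would show whether the failures follow one node or the whole switch.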
After the switch was rebooted, everything performed much better: all the nodes appeared to be happy, everything was running fast, and I was able to move servers onto different nodes. However, one issue came up afterwards. One of the nodes had been removed from DNS in Active Directory, which was causing problems with the other nodes being able to communicate. The only place I saw the issue was on a single node that had gotten its DNS updated; it showed the affected node as missing all IP addresses, and Microsoft highlights it in red, which is very handy.
After re-adding the missing node to DNS in AD, everything appears to be resolved. So if you are getting this kind of error, make sure you are using a switch that can handle the traffic, and double-check DNS on your Active Directory controllers and your DHCP server to make sure all nodes are getting the addresses they are supposed to get and are available on the network.
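The DNS part of that check is easy to script. The sketch below is one rough way to do it in Python, not the method I used at the time: it asks the system resolver for each node name and reports any name that comes back with no addresses. The node names in the usage note are hypothetical.

```python
import socket


def resolve_nodes(node_names, resolve=socket.getaddrinfo):
    """Map each node name to a sorted list of resolved IPs (empty if none)."""
    results = {}
    for name in node_names:
        try:
            infos = resolve(name, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # the IP address is the first element of sockaddr.
            results[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            # The name has no usable record in DNS.
            results[name] = []
    return results


def missing_nodes(results):
    """Return the node names that resolved to no addresses."""
    return [name for name, addrs in results.items() if not addrs]
```

For example, `missing_nodes(resolve_nodes(["hv-node1", "hv-node2", "hv-node3"]))` lists any node that has dropped out of DNS. Keep in mind this only tests the resolver used by the machine you run it on, so it is worth running on more than one node.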
Another issue popped up where the cluster was throwing this error: "Cluster network name resource 'cluster name' failed registration of one or more associated dns name(s) for the following reason: DNS Server Failure"
I did 2 things to repair this issue, both in the Failover Cluster Manager.
1) a repair of the cluster
- Right click on the server name and take it offline.
- Right click on the server name -> More Actions -> Repair
2) Move the server to another node with more resources
- Click on the cluster in the Cluster Core Resources
- In the Actions panel, click "More Actions" -> Move Core Cluster Resources -> Select Node or Best Possible Node
When you right-click the Cluster Name in the Cluster Core Resources and select Properties, you will see the window above. This is what a healthy cluster should look like.
Go to Part 2 >>