Friday, October 05, 2018

Storage Spaces Lost Communication or IO Error and Intel SSD Event ID 129

I had a major issue with our company's primary Hyper-V server. It was hosting our primary AD controller, user storage and shares, and two other VMs. The symptoms were heavy disk read/write I/O, reports of disconnects, and loss of access to AD and user files. According to our log files, we were getting iANSMiniport and Intel Nvmestor errors.

Here are some log file samples







According to the logs, the system started giving Event ID 129 warnings, which recurred every 10 seconds and affected our DHCP, DNS, and user logons. After a forced shutdown everything looked fine in the logs until later that morning, when users started to experience lag and login issues.
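If you want to confirm a pattern like this from an exported event log rather than by eye, here is a minimal Python sketch. It assumes you have already pulled the Event ID 129 timestamps out of the log yourself; the function name `longest_burst` and the 15 second gap threshold are my own choices for illustration, not anything from the Microsoft article.

```python
from datetime import datetime, timedelta


def longest_burst(timestamps, max_gap=timedelta(seconds=15)):
    """Return (start, end, count) for the longest run of events whose
    consecutive gaps are all <= max_gap, or None if there are no events."""
    ordered = sorted(timestamps)
    if not ordered:
        return None
    best = (ordered[0], ordered[0], 1)
    run_start, run_count = ordered[0], 1
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev <= max_gap:
            run_count += 1                    # still inside the same burst
        else:
            run_start, run_count = cur, 1     # gap too large, start a new run
        if run_count > best[2]:
            best = (run_start, cur, run_count)
    return best


# Made-up sample data: six warnings 10 seconds apart, then one stray event.
base = datetime(2018, 1, 1, 3, 0, 0)
sample = [base + timedelta(seconds=10 * i) for i in range(6)]
sample.append(base + timedelta(hours=5))
start, end, count = longest_burst(sample)
print(f"{count} warnings between {start} and {end}")
```

A long run with short gaps points to a sustained storage problem rather than a one-off timeout.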



You can read more about Event ID 129 Here

We sent the server in for diagnostics, and the Authorized Service Depot could not find anything out of date except for the BIOS.

Since this was a "Main Production Server," we decided to move the virtual AD controller from the production server to a dev server so the organization could keep running until the cause of the drive failure was determined.

After a second forced shutdown and boot, I was able to shut down the virtual machines running on the server and do a full export of the data to an external drive. This ensured we did not lose any data, but it did inconvenience some users, as the data had to be copied back from the dev server.

Unfortunately, I didn't find the Microsoft article about Event ID 129 until about a week after our issue, and by that point I had destroyed the Storage Spaces volume. I still found it useful, as it led me to install some additional software which seemed to help the server recognize the NVMe drives better. For that you need to install not just the Intel SSD Toolbox, but also the SSD Data Center Tool and the Data Center NVMe SSD drivers. The Event ID 129 errors described for systems before the Maintenance 8 release matched what we were experiencing too closely to be a coincidence.

Symptoms


When this issue occurs, your cluster may experience any of the following symptoms:
  • Slow workload performance
  • Virtual disks in the cluster that have an Operational Status value of Detached or No Redundancy.
  • Physical disks that report a status of Lost Communication or IO Error.
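To check a list of disks against these symptoms, here is a minimal Python sketch. It assumes you have already collected each disk's name and Operational Status as plain text (for example, copied from the Storage Spaces management tools); the record layout and the `flag_disks` name are hypothetical, and only the four status strings come from the list above.

```python
# Status strings quoted in the symptom list above.
BAD_VIRTUAL = {"Detached", "No Redundancy"}
BAD_PHYSICAL = {"Lost Communication", "IO Error"}


def flag_disks(records):
    """Return (kind, name, status) for every disk showing a symptom status.

    Each record is a dict with 'kind' ("virtual" or "physical"),
    'name', and 'status' keys. This layout is an assumption for the sketch.
    """
    flagged = []
    for rec in records:
        bad = BAD_VIRTUAL if rec["kind"] == "virtual" else BAD_PHYSICAL
        status = rec["status"].strip()
        if status in bad:
            flagged.append((rec["kind"], rec["name"], status))
    return flagged


# Made-up sample records for illustration only.
sample = [
    {"kind": "virtual", "name": "VMStore", "status": "Detached"},
    {"kind": "physical", "name": "NVMe-1", "status": "OK"},
    {"kind": "physical", "name": "NVMe-2", "status": "Lost Communication"},
]
for kind, name, status in flag_disks(sample):
    print(f"{kind} disk {name}: {status}")
```

Anything this flags is worth comparing against the Event ID 129 timestamps in the system log.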

I haven't had a chance to verify that the issues have been 100% corrected, but I have done some major stress testing on the Storage Spaces setup, using Hyper-V to do mass exports of VMs to the storage array, with no issues at all. HD Tune and CrystalDiskMark have also shown the Storage Spaces array to be in good shape.




