HPE MSA 2040 Unresponsive, Dead, or Failed Controller, Controller won’t boot

The Problem:

As always, I logged in to the SMU to shut down controller A (storage). It shut down, and the blue LED illuminated to indicate it was safe for removal. I then proceeded to remove it, clean it, and re-insert it. The controller came back online, and ownership of the applicable disk groups was successfully moved back. Controller A was now done. I continued with the same steps for controller B: I logged in and shut down controller B (storage). It shut down just like controller A, and the blue "safe to remove" LED illuminated. I was able to remove it, clean it, and re-insert it.

However, controller B did not come back online.

After inserting controller B, the status light was flashing (as if it was booting). I waited 20 minutes with no change. The SMU on controller B was responding to HTTPS requests, but you could not log on due to the error “system is initializing”. SSH was functioning and you could log in and issue commands, but any command to get information would return “Please wait while this information is pulled from the MC controller”, and ultimately fail. The SMU on controller A reported a controller fault on controller B, and did not provide any other information (including port status on controller B).
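If you want to check the same symptoms quickly from a workstation, a small script can test whether the HTTPS and SSH ports on each controller's management interface accept connections. This is only a rough sketch: the function names and the IP addresses in the comment are placeholders of mine, not anything from the MSA itself. Keep in mind that in my case both ports answered on controller B even though it was stuck, so open ports only tell you the management side is alive, not that the controller is healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_controller(host, ports=(443, 22), timeout=3.0):
    """Map each port (HTTPS and SSH by default) to True/False reachability."""
    return {port: port_open(host, port, timeout) for port in ports}


# Example usage (the addresses are hypothetical placeholders; substitute
# the management IPs of your own controllers):
#
#   for name, ip in {"A": "192.0.2.10", "B": "192.0.2.11"}.items():
#       print(name, probe_controller(ip))
```

If both ports are closed, the management side of the controller is not up at all. If both are open but you still cannot log on, you are likely looking at the same situation described above.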

I then tried to re-seat the controller with the array still running. I gave it plenty of time, with no effect.

I then removed the failed controller, shut down the unit, powered it back on (with only controller A installed), and re-inserted controller B. Again, no effect.

The Fix:

At this point I’m thinking the controller may have failed or died during the cleaning process. I was just about to call HPE support for a replacement until I noticed the “Power LED” light inside of the failed controller would flash every 5 seconds while removed.

This made me start to wonder if there was an issue writing the cache to the compact flash card, or if the controller was still running off battery power but had completely frozen.

I tried these 3 things on the failed controller while it was unplugged and removed:

1. Left the controller untouched for 1 hour out of the array (to maybe let it finish whatever it was doing while on battery power).

2. There’s an unlabeled button on the back of the controller. As a last resort (thinking it was a reset button), I pressed and held it for 20 seconds, waited a minute, then briefly pressed it for 1 second while it was out of the unit.

3. Removed the Compact Flash card from the controller for 1 minute, then re-inserted it. I was hoping this would fail the cache copy if it was stuck in the process of writing cache to compact flash.

I then re-inserted the controller, and it booted fine! It was now functioning and working (and came up very fast). Looking at the logs, it has no record of what occurred between the first shutdown and the final boot.

I hope this post helps someone else with the same issue. It can save you a support ticket, and time with a controller down.

Procedure taken from the website:

https://www.stephenwagner.com/2018/11/04/hpe-msa-2040-unresponsive-dead-or-failed-controller-controller-wont-boot/