ESXi and Proliant Weekend Woes

Carnival Boy

Did a routine power down of all our servers at the weekend, as the utility company were doing some maintenance on the electricity supply.

When I restarted the hosts, one of them failed to restart. Initially it hung at the ESXi boot screen on "loading power management". I did a forced reboot and it went past there but hung on "Loading module hpsa". I did another reboot and it hung at the same place ("Loading module hpsa"). I didn't get any HP boot errors. By this stage I was getting extremely worried.

I left it for an hour, and then did another reboot. This time it booted ok. Hallelujah.

Googling, I found one forum post where someone rebooted 5 times before it worked, but the thread ended without any solution or explanation of the cause.

I've read that there is a known problem with hpsa driver version 5.x.0.58-1. When I run:
esxcli software vib list | grep -i hpsa
I get:
scsi-hpsa 5.0.0-40OEM.500.0.0.472560
Not sure if this means my driver is ok. I'm reluctant to do anything in the short term that will require another reboot as I have zero confidence in this machine booting successfully.
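
A rough way to compare the installed version with the reported pattern is sketched below. It assumes the affected builds are the ones matching `5.x.0.58-1`, as described above, which is not confirmed here. `hpsa_version` and `is_suspect_hpsa` are hypothetical helper names. The sample line is the output quoted above; on the host you would pipe `esxcli software vib list` into `hpsa_version` instead.

```shell
#!/bin/sh
# Print the version column of the scsi-hpsa VIB, reading
# `esxcli software vib list` style output from stdin.
hpsa_version() {
    awk '$1 == "scsi-hpsa" { print $2 }'
}

# Return 0 if the version matches the pattern reported as problematic.
# ASSUMPTION: the pattern "5.x.0.58-1" comes from the post, not from HP or VMware.
is_suspect_hpsa() {
    case "$1" in
        5.*.0.58-1*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample line copied from the output quoted in the post.
sample='scsi-hpsa 5.0.0-40OEM.500.0.0.472560'
ver=$(printf '%s\n' "$sample" | hpsa_version)

if is_suspect_hpsa "$ver"; then
    echo "$ver matches the reported problem pattern"
else
    echo "$ver does not match the reported problem pattern"
fi
```

Listing VIBs is read-only, so a check like this should not need a reboot.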

This is a Proliant DL380 G6. We have two identical hosts, both with the same ESXi and hpsa version. The other host booted fine. There are no errors listed under Hardware Status in vSphere Client.

I've looked through vmkernel.log to see if I can see anything obvious, but it only covers the last, successful boot, not the previous boots that hung. I do have the following warning, which is repeated constantly:
WARNING: NFS: 221: Got error 2 from mount call
The host only uses local storage. However, in vSphere Client, under datastores, it is listing:
VeeamBackup_VEEAM1 (inactive) (unmounted).
What is this all about? The other hosts don't have this datastore and our Veeam server is powered on and has successfully backed up the host, so I'm not seeing any operational issues. I don't think this problem is related to my boot problems, but I'd like to get it cleaned up anyway.

I'm stuck on how to proceed. Any advice, please?

Carnival Boy

I can think of few things worse than being alone on a Sunday afternoon and having a host hang on boot. I literally had no idea what to do and no-one to call.

DustinB3403

What ESXi version are you running?

DustinB3403

Assuming you're running ESXi 5.0, 5.1 or 5.5, here is a link to an HP driver update. Follow the guides.

And here is a person who had the same exact issue, but on the G7 (same hardware from what I can tell as far as the RAID Controller goes)

scottalanmiller

Pinging @John-Nicholson

Dashrender

I had a similar issue with my 5.x host two years ago. I updated the driver without a hitch and was good to go.

Carnival Boy

@DustinB3403 said:

Assuming you're running ESXi 5.0, 5.1 or 5.5, here is a link to an HP driver update. Follow the guides.

And here is a person who had the same exact issue, but on the G7 (same hardware from what I can tell as far as the RAID Controller goes)

I am running 5.5. I have version 5.0.0-40OEM and it looks like the problem driver was 5.5.0.58-1OEM. So my version looks ok. Also, I didn't have a PSOD, it just hung. I'm not averse to updating the driver, apart from the fact that I really don't want to reboot the host until I can be sure it will boot successfully.

Dashrender

What about the firmware? Any known issues there?

DenisKelley

Re: VeeamBackup_VEEAM1 (inactive) (unmounted).
What is this all about? The other hosts don't have this datastore and our Veeam server is powered on and has successfully backed up the host, so I'm not seeing any operational issues. I don't think this problem is related to my boot problems, but I'd like to get it cleaned up anyway.

This just means that the NFS mount from Veeam, which is used as a connection for the backup, is unmounted. When you back up a VM, Veeam will create one of these and it will show up as a datastore. This has happened to me when I moved my backup software to a new server. I just remove the unmounted store. It will get recreated if necessary, and the files don't really sit on the host.

I'm pretty sure you are correct and that this is unrelated to your present problem.
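
If you'd rather do the cleanup from the ESXi shell than from vSphere Client, something along these lines should work. It is only a sketch: the `run` wrapper is a hypothetical helper that prints each command instead of executing it unless `DRY_RUN=0` is set, and the datastore name is the one quoted from the original post.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on the host to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Datastore name as shown in vSphere Client in the original post.
STALE_DS="VeeamBackup_VEEAM1"

# Show the NFS mounts the host knows about, then remove the stale one.
run esxcli storage nfs list
run esxcli storage nfs remove --volume-name="$STALE_DS"
```

Running it once in dry-run mode first lets you check the datastore name before anything is changed on the host.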

Carnival Boy

Yeah, I've unmounted the Veeam datastore but I'm certain that was a red herring.

I really don't know where to go with this, other than to buy a new server! I don't even know if it's an HP issue or a VMware issue, a hardware issue or a software issue.

I need a plan of action!

Carnival Boy

I think I need to create a persistent storage location for logs. At the moment, I can't view any errors in vmkernel.log because the file is overwritten on every boot (it is stored on non-persistent storage).

I'm not sure.

If I specify that logs are stored on the datastore (which is local storage), will it work? What happens if the boot hangs before the datastore is mounted? If it is hanging at "loading module hpsa", does this suggest it isn't mounting the datastore? And if that is the case, the log files won't be written and so won't help me.

Or am I barking up the wrong tree?

DustinB3403

It certainly wouldn't hurt to grab an older system and set it up to be your remote logging server for your ESXi infrastructure. It would at least give you something to look through while researching this issue.
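
As a rough sketch, pointing the host at a remote syslog server (and, optionally, at a log directory on a datastore) could look like the following. The address `192.0.2.10` and the path `/vmfs/volumes/datastore1/logs` are placeholders, not values from this thread, and `run` is a hypothetical wrapper that only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on the host to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# PLACEHOLDERS: replace with your own syslog server and datastore path.
LOGHOST="udp://192.0.2.10:514"
LOGDIR="/vmfs/volumes/datastore1/logs"

# Show the current syslog settings before changing anything.
run esxcli system syslog config get

# Option 1: send logs to a remote syslog server.
run esxcli system syslog config set --loghost="$LOGHOST"
# Allow outbound syslog traffic through the host firewall.
run esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true
run esxcli network firewall refresh

# Option 2: keep logs in a directory on a datastore.
run mkdir -p "$LOGDIR"
run esxcli system syslog config set --logdir="$LOGDIR"

# Apply the new configuration.
run esxcli system syslog reload
```

Neither option is guaranteed to capture a hang that happens before networking or the datastore is available, but either would at least keep logs across reboots.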

Have you asked on the ESXi forums if anyone has any input on this?