Deploying firmware updates on servers and testing...
-
Hi folks,
We have quite a few servers running outdated firmware. Due to an issue with the current firmware version, we have been going server by server updating to a newer BIOS firmware. These are Dell servers, all the same model and under warranty.
We have done about 20 servers so far and they went fine. However, the 21st server developed a flapping issue on one of the NIC interfaces, causing unplanned downtime for the VMs.
Management are asking us to identify which of the remaining servers will develop an issue after patching, such as a flapping NIC, so those can be done at a different time to lower the impact of an outage.
My view is that we cannot know whether a server will develop an issue from a patch before actually applying it, but they want a plan that tells them which servers to avoid.
Any advice on how we could accomplish this? My plan would be to apply the patch as scheduled, since the patch is valid for the server, then leave the server out of the cluster for 24 hours while monitoring for flapping, blue screens and so on, and only then put it back in the cluster. I do not think we can ever know beforehand whether any one server will develop an issue from a patch that is meant for it.
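To be clear about what I mean by monitoring: something along these lines is what I'd run during the 24-hour soak. It's only a rough sketch; the log path, message pattern and threshold are placeholders, and on Windows hosts it would be the equivalent check against the System event log or the switch logs instead.

```python
#!/usr/bin/env python3
"""Rough sketch: count NIC link up/down transitions in a syslog-style file
during the post-patch soak window. Log path and message pattern are assumptions;
adjust them to whatever your hosts or switches actually log."""
import re
import sys
from collections import Counter

LOG_FILE = sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog"  # placeholder path
LINK_EVENT = re.compile(r"(?P<iface>\beth\d+|\beno\d+|\bens\w+).*link (?:is )?(?:up|down)",
                        re.IGNORECASE)
FLAP_THRESHOLD = 5  # more transitions than this during the soak = keep it out of the cluster

transitions = Counter()
with open(LOG_FILE, errors="replace") as log:
    for line in log:
        match = LINK_EVENT.search(line)
        if match:
            transitions[match.group("iface")] += 1

for iface, count in sorted(transitions.items()):
    status = "FLAPPING?" if count > FLAP_THRESHOLD else "ok"
    print(f"{iface}: {count} link transitions [{status}]")
```

Run it against the slice of log covering the soak window before the server goes back into the cluster; anything over the threshold stays out for another look.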
-
@jimmy9008 said in Deploying firmware updates on servers and testing...:
We have been going server by server updating to a newer BIOS firmware. These are Dell servers, all the same model and under warranty. We have done about 20 servers so far and they went fine. However, the 21st server developed a flapping issue on one of the NIC interfaces, causing unplanned downtime for the VMs.
You said they are the same model. I would have a look at the serial numbers then. Make a list and put them in order, identify which series of numbers you have already updated without problems, and then begin by updating the remaining servers that fall within those successful series.
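Something like this is all I mean. The service tags below are invented, and whether consecutive tags really correspond to a purchase batch is a separate question, but it shows the bucketing:

```python
#!/usr/bin/env python3
"""Sketch of the 'order the serial numbers and find the safe series' idea.
All service tags below are invented placeholders, not real Dell tags."""

patched_ok  = ["5KJ0101", "5KJ0102", "5KJ0105", "5KJ0110"]   # updated fine (hypothetical)
patched_bad = ["8RT0301"]                                     # the one that flapped (hypothetical)
remaining   = ["5KJ0107", "5KJ0112", "8RT0302", "8RT0305", "9ZW0401"]

good_series = {tag[:3] for tag in patched_ok}   # crude "series" = first three characters
bad_series  = {tag[:3] for tag in patched_bad}

for tag in sorted(remaining):
    series = tag[:3]
    if series in bad_series:
        note = "same series as the problem box -- schedule separately"
    elif series in good_series:
        note = "same series as the successful updates -- do these first"
    else:
        note = "unknown series -- treat with caution"
    print(f"{tag}: {note}")
```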
If you encounter the same problem, I think you should be able to identify it pretty quickly. Maybe consider having a spare server that is ready to go; if you find a problem with one server, you could swap it in immediately and just move the drives over.
It's also possible that the flapping issue was pure coincidence, unless you verified it by going back to the old firmware and seeing the issue disappear.
-
@jimmy9008 A flapping NIC has usually come down to NIC firmware and driver updates in my experience. Did you do any of those updates at the same time, like Dell usually bundles with either their BIOS updates or the SUU tool?
-
For hardware / chassis management, Dell's OpenManage Enterprise is pretty useful. I'd suggest giving it a look if you've got a big Dell footprint to look after. It should be available for download when you check for downloads against any recent service tag. If you can't find it, let me know and I'll see if I can find a direct link.
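And if you just want a quick snapshot of current BIOS versions across the fleet before you carry on patching, a thin wrapper around remote racadm does the job too. Sketch below: the iDRAC IPs and credentials are placeholders, and the output filtering is a guess, so check what `getversion` actually prints on your generation of iDRAC.

```python
#!/usr/bin/env python3
"""Sketch: collect BIOS versions from a list of iDRACs via remote racadm.
IPs and credentials are placeholders; verify the `racadm getversion` output
format on your own iDRACs before trusting the filtering below."""
import subprocess

IDRACS = ["10.0.10.11", "10.0.10.12", "10.0.10.13"]   # placeholder iDRAC addresses
USER = "root"                                          # placeholder credentials
PASSWORD = "changeme"

for ip in IDRACS:
    try:
        result = subprocess.run(
            ["racadm", "-r", ip, "-u", USER, "-p", PASSWORD, "getversion"],
            capture_output=True, text=True, timeout=60, check=True,
        )
    except (subprocess.SubprocessError, FileNotFoundError) as exc:
        print(f"{ip}: racadm failed ({exc})")
        continue
    # Keep only the lines that mention BIOS; adjust the filter to your output.
    bios_lines = [line.strip() for line in result.stdout.splitlines() if "bios" in line.lower()]
    print(f"{ip}: " + ("; ".join(bios_lines) if bios_lines else "no BIOS line found"))
```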
-
@notverypunny said in Deploying firmware updates on servers and testing...:
For hardware / chassis management, Dell's OpenManage Enterprise is pretty useful.
Yeah, I installed that and got it running a few weeks ago, but I haven't gone through the whole thing yet.
At the moment we only have 5 servers; I just need to add our SANs and a few more servers once I've repurposed them.
-
@jimmy9008 said in Deploying firmware updates on servers and testing...:
Management are asking us to identify which of the remaining servers will develop an issue after patching, such as a flapping NIC, so those can be done at a different time to lower the impact of an outage.
While that sounds good, the answer is... there is no way for you to know. If there were, you'd have known already. Software is crazy complex, and without a massive testing fleet you can't even make an educated guess. Even if they gave you thousands of servers and loads of testing resources, you'd still only get to a reasonable degree of certainty.
-
@jimmy9008 said in Deploying firmware updates on servers and testing...:
My view is that we cannot know whether a server will develop an issue from a patch before actually applying it, but they want a plan that tells them which servers to avoid.
You are correct. Unless you can identify what caused the issue, you don't have anything to go on. And how will you identify the cause without risking several more machines... probably more machines than you have in your total fleet?
Ask them how they expect you to determine something of this nature. If they aren't providing you with the proper testing equipment and timeline (which would be insane; this could easily cost hundreds of thousands of dollars), where do they expect you to produce this information from? You don't even know what you are looking for. Even if you get lucky and guess what it is, it's still just a guess, and you'd have no confidence in predicting whether the update will work or not.
-
@scottalanmiller said in Deploying firmware updates on servers and testing...:
Ask them how they expect you to determine something of this nature. If they aren't providing you with the proper testing equipment and timeline (which would be insane; this could easily cost hundreds of thousands of dollars), where do they expect you to produce this information from?
Yeah, this sounds like a case of the bean-counters having unreasonable expectations. The best you can do is keep track of issues to identify machines that might be more problematic than others. A concrete example from my environment: we have a couple of machines (R730 or R730xd) that we know will fail to reboot cleanly about 40% of the time... we have to pull power completely and then they'll boot. Apparently it's not an uncommon issue with that generation of Dell servers, and the only fix is to swap the whole mainboard. In my case they're up for replacement and out of warranty, so we just make sure that someone's on-site when/if they need to be rebooted for updates, maintenance, etc.
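Keeping track doesn't need to be anything fancy, either. Even a flat CSV of patch outcomes per service tag is enough to spot the repeat offenders. Rough sketch, with the file name and field names made up:

```python
#!/usr/bin/env python3
"""Sketch: log patch outcomes per server and flag repeat offenders.
The file name and field names are invented for illustration."""
import csv
from collections import Counter
from datetime import date
from pathlib import Path

LOG = Path("patch_history.csv")   # hypothetical log file

def record(service_tag: str, action: str, outcome: str, note: str = "") -> None:
    """Append one patch event, e.g. record("5KJ0101", "BIOS 2.15", "ok")."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "service_tag", "action", "outcome", "note"])
        writer.writerow([date.today().isoformat(), service_tag, action, outcome, note])

def repeat_offenders(min_failures: int = 2) -> list[str]:
    """Service tags with at least `min_failures` non-ok outcomes on record."""
    if not LOG.exists():
        return []
    with LOG.open(newline="") as f:
        failures = Counter(
            row["service_tag"] for row in csv.DictReader(f) if row["outcome"] != "ok"
        )
    return [tag for tag, count in failures.items() if count >= min_failures]

if __name__ == "__main__":
    record("8RT0301", "BIOS 2.15", "nic-flap", "flapping NIC after the update")
    print(repeat_offenders(min_failures=1))
```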
-
@notverypunny said in Deploying firmware updates on servers and testing...:
Yeah, this sounds like a case of the bean-counters having unreasonable expectations.
That's even worse if it isn't management but accountants who don't understand math. That implies that they can't do accounting, either.