I've gone:
- 9 month contract.
- 2 years, one company, then promotion, 2 years again after promotion.
- 1 year at MSP.
- 1 year where I am now.
(I'd say for me, 1 - 1.5 years max anywhere).
@Dashrender said in What percentage of servers in your organization are Microsoft?:
@Jimmy9008 said in What percentage of servers in your organization are Microsoft?:
@scottalanmiller 99% Windows
you have 100 servers?
Haha, not yet. But close! 68 last time I looked.
@scottalanmiller "We accidentally restarted core servers, which brought many services offline and impacted many customers." - that is what I consider accurate here. "We experienced higher than usual error rates" is a cop-out.
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
If your business people start trying to get under the hood, whatever you do don't condescend and talk to them like they are children. Give them the facts, don't mislead them. If they ask if the server is down be clear - no, the server did not experience a failure, but there might be an issue with the application or some other part of the system, what issue are you asking about?
Playing the "management are children" game is dangerous because that's how we get dangerous situations like a manager asking "is the server down", getting the incorrect answer of "yes", then demanding that Dell be brought in to answer for something that has nothing to do with them and then randomly switching vendors on the next purchase because they were told that the server failed, when it did not. It makes them look foolish to staff, to vendors, undermines their vendor relationships and makes them impotent at business decisions because they think that they have info that they do not.
You can't control if management wants to take over IT, but don't try to abstract IT in a condescending way to enable this, it can't provide a good outcome.
If the server is being restarted, and the application has not been temporarily moved to another server, that server/service/application/whatever is down. If a tech said to me "Nah, the server is on, has power, and is just restarting so it's actually up, we just didn't bother to move your application to another node in the cluster first - but it's up." - they are wrong. It's down.
Not if it is during a greenzone, that would be misinformation. Is it down? yes, but it is supposed to be. Do you say that your car is "missing" when the police ask you, or do you admit that you sold it?
Like you said, you have to extrapolate and what they mean by "down" is that it is "down when it is supposed to be up", which is not true. It's down when it is supposed to be down.
Amazon's servers were down, rebooting, restarting etc, when they were supposed to be up. "Is the server down" = "The servers were rebooted by accident and all applications were unavailable.", not "No, we're just experiencing higher errors than usual" - Pfft
Well as I said to someone privately, an error is a form of outage.
'Error' is misleading. It's the wrong word to use. AWS lost a lot of respect in my eyes. It's an accident, a mistake, a Fu** Up. Not more errors than usual.
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
If your business people start trying to get under the hood, whatever you do don't condescend and talk to them like they are children. Give them the facts, don't mislead them. If they ask if the server is down be clear - no, the server did not experience a failure, but there might be an issue with the application or some other part of the system, what issue are you asking about?
Playing the "management are children" game is dangerous because that's how we get dangerous situations like a manager asking "is the server down", getting the incorrect answer of "yes", then demanding that Dell be brought in to answer for something that has nothing to do with them and then randomly switching vendors on the next purchase because they were told that the server failed, when it did not. It makes them look foolish to staff, to vendors, undermines their vendor relationships and makes them impotent at business decisions because they think that they have info that they do not.
You can't control if management wants to take over IT, but don't try to abstract IT in a condescending way to enable this, it can't provide a good outcome.
If the server is being restarted, and the application has not been temporarily moved to another server, that server/service/application/whatever is down. If a tech said to me "Nah, the server is on, has power, and is just restarting so it's actually up, we just didn't bother to move your application to another node in the cluster first - but it's up." - they are wrong. It's down.
Not if it is during a greenzone, that would be misinformation. Is it down? yes, but it is supposed to be. Do you say that your car is "missing" when the police ask you, or do you admit that you sold it?
Like you said, you have to extrapolate and what they mean by "down" is that it is "down when it is supposed to be up", which is not true. It's down when it is supposed to be down.
Amazon's servers were down, rebooting, restarting etc, when they were supposed to be up. "Is the server down" = "The servers were rebooted by accident and all applications were unavailable.", not "No, we're just experiencing higher errors than usual" - Pfft
@scottalanmiller Agree. Totally. When asked "Is the server down?" you can say "The server was rebooting as planned and, as the application was not part of a cluster, the application was offline". But to say "No, the server was fine." is misleading.
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
If your business people start trying to get under the hood, whatever you do don't condescend and talk to them like they are children. Give them the facts, don't mislead them. If they ask if the server is down be clear - no, the server did not experience a failure, but there might be an issue with the application or some other part of the system, what issue are you asking about?
Playing the "management are children" game is dangerous because that's how we get dangerous situations like a manager asking "is the server down", getting the incorrect answer of "yes", then demanding that Dell be brought in to answer for something that has nothing to do with them and then randomly switching vendors on the next purchase because they were told that the server failed, when it did not. It makes them look foolish to staff, to vendors, undermines their vendor relationships and makes them impotent at business decisions because they think that they have info that they do not.
You can't control if management wants to take over IT, but don't try to abstract IT in a condescending way to enable this, it can't provide a good outcome.
If the server is being restarted, and the application has not been temporarily moved to another server, that server/service/application/whatever is down. If a tech said to me "Nah, the server is on, has power, and is just restarting so it's actually up, we just didn't bother to move your application to another node in the cluster first - but it's up." - they are wrong. It's down.
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
@scottalanmiller When a business person says "Was the server down" - they don't mean the physical box. They mean whatever that box provides to them, or the customers, or whatever that box does... The fact that it's not off, it's rebooting, doesn't matter - 'what it does' was down... for four hours.
Yes and no. Was the server down? No, it was in a greenzone. I speak to business people only about this stuff, and "offline in a down time" is never considered "down". And an application down to the system team is "not down." Is the server down? "Nope, go look into your app."
I realize that in the SMB people tend to look at one person for all of the questions from networking to systems to apps, but in the enterprise, if someone asks if the server is down the question is about a server, not an application. If a business person wants to know about X, they ask about X - competent business people don't start digging under the hood and asking the wrong questions to "sound cool." If the car doesn't work, a competent business person asks "is the car working" to their mechanic, they do not say "check if there is enough oil" when they actually want to know if the car works - clearly not a relevant question.
Part of work is extrapolating what people mean, even if they asked the question in the wrong way. "Is the car working" covers fuel, seats, keys, doors etc, the same as "Is the server down" covers it being in a reboot state... yes, what that server does is down.
@scottalanmiller When a business person says "Was the server down" - they don't mean the physical box. They mean whatever that box provides to them, or the customers, or whatever that box does... The fact that it's not off, it's rebooting, doesn't matter - 'what it does' was down... for four hours.
@scottalanmiller True, I must say I really didn't think about it in depth
Exactly. We host everything at HQ on site, with a colo in Essex for DR purposes. If HQ was lost, staff are screwed until we restore from backups BUT... customers (the important part) are not really affected at all. We keep a hot copy of our websites and databases running a day out of date (which the business are fine with) in the Essex colo. We then use 'the cloud' to manage the failover process, which is a cheap solution compared to multiple cloud DCs hosting everything.
We have one VM in Azure, and one in AWS. Both check that our websites hosted at HQ are available on HTTP/HTTPS every second or so. If the sites are not responding, the VMs use the Cloudflare API to point DNS for all our websites to the hot running copies in the colo that are a day out of date. Pretty fast. When tested, it takes seconds and we're back online from a customer perspective. Our test: unplug our gateway firewall and see what happens... easy.
Yeah it can be better, but it meets our needs and other than Cloudflare (which does go down) we have no single point of failure... We're happy with that risk.
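For anyone wanting to try something similar, here is a rough Python sketch of the kind of check-and-repoint logic described above. It is not our actual script. The function names, the "three failed checks in a row" threshold and the zone/record IDs are all assumptions made up for illustration. The DNS update targets Cloudflare's v4 `dns_records` endpoint with a bearer token, which to my understanding accepts a partial `PATCH`, so check that against the current API docs before relying on it.

```python
import http.client
import json
import urllib.request

CLOUDFLARE_API = "https://api.cloudflare.com/client/v4"


def site_is_up(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if the URL can be fetched without an error.

    urlopen raises for 4xx/5xx responses, timeouts and connection
    failures, so any of those count as "down" here.
    `opener` is injectable purely so the function can be tested offline.
    """
    try:
        with opener(url, timeout=timeout):
            return True
    except (OSError, http.client.HTTPException):
        return False


def should_fail_over(results, threshold=3):
    """True when the last `threshold` checks all failed.

    The threshold is an assumption, used to avoid flipping DNS
    on a single dropped request.
    """
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)


def build_dns_update(zone_id, record_id, new_ip, api_token):
    """Build (but do not send) the request that repoints one DNS record.

    zone_id, record_id and api_token are placeholders for your own values.
    """
    body = json.dumps({"content": new_ip}).encode()
    return urllib.request.Request(
        f"{CLOUDFLARE_API}/zones/{zone_id}/dns_records/{record_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )
```

In use, each monitoring VM would append the result of `site_is_up(...)` to a list every second or so, and once `should_fail_over(...)` returns True, pass the built request to `urllib.request.urlopen` for each website's record. Whether both monitors should agree before failing over is a design choice this sketch leaves out.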
I don't blame AWS at all. If you use a service like AWS, use it properly and build for DR or take the risk! Think about the systems you use, and build properly.
Have Amazon said what caused this yet? 'Tech spilled coffee on core equipment...', 'Power cut and generator failure at the same time'... any info at all?
IMO... it's a mess and needs to be standardised. Management need to decide on a workable structure.
Check if the DHCP server has leased the IP which was statically assigned. Check if DNS has something else listed for that IP address. Sounds like the IP address for the server is taken, hence no response, but the VMs have different IP addresses and work as they don't have duplicates.
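If you would rather script those two checks than eyeball the consoles, something like the following Python sketch could work. It assumes you have already exported the DHCP leases and DNS A records into simple Python structures. The data shapes, the function name and the example hostnames are all hypothetical, and it only compares records; it does not query a DHCP or DNS server itself.

```python
def find_ip_conflicts(static_ip, hostname, dhcp_leases, dns_a_records):
    """List anything other than `hostname` that claims `static_ip`.

    dhcp_leases:   list of {"ip": ..., "hostname": ...} dicts (assumed export format)
    dns_a_records: dict mapping record name -> IP address (assumed export format)
    """
    wanted = hostname.lower()
    conflicts = []

    # Has DHCP handed the statically assigned address to another device?
    for lease in dhcp_leases:
        if lease["ip"] == static_ip and lease["hostname"].lower() != wanted:
            conflicts.append(f"DHCP lease: {static_ip} -> {lease['hostname']}")

    # Does DNS list a different name against the same address?
    for name, ip in dns_a_records.items():
        if ip == static_ip and name.lower() != wanted:
            conflicts.append(f"DNS A record: {name} -> {static_ip}")

    return conflicts


# Example with made-up data: a printer has leased the host's static address.
leases = [{"ip": "192.168.1.10", "hostname": "PRINTER-2"}]
records = {"hv-host01": "192.168.1.10", "old-nas": "192.168.1.10"}
for problem in find_ip_conflicts("192.168.1.10", "HV-HOST01", leases, records):
    print(problem)
```

An empty result does not prove there is no duplicate, since another device could have the same address set statically, but a hit in either list is a strong hint.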
Does the host have a static IP? Perhaps that has been taken by something else, hence the host cannot communicate but the VMs can?