Domain Controller Down (VM)

stacksofplates

@scottalanmiller said in Domain Controller Down (VM):

@Dashrender said in Domain Controller Down (VM):

@scottalanmiller said in Domain Controller Down (VM):

@stacksofplates said in Domain Controller Down (VM):

No one can support every hypervisor at that level.

Don't need to. They don't support every OS at that level either, right? You support at least one. That you can't support "all" doesn't matter. They lack ANY production deployment scenario at this point. Not just limited ones.

Hold the phone - this whole thing started because Dustin said that no vendor should be able to dictate which hypervisor you can or can't use. (FYI - I don't agree with Dustin; it's just where this started.)

I never saw that, only that they don't support any production deployment (virtualization) whatsoever.

No, it was that they only supported VMware.

DustinB3403

@stacksofplates said in Domain Controller Down (VM):

@scottalanmiller said in Domain Controller Down (VM):

@stacksofplates said in Domain Controller Down (VM):

@scottalanmiller said in Domain Controller Down (VM):

@stacksofplates said in Domain Controller Down (VM):

If you're running on something using PV drivers that they don't understand...

Then your critical app vendor is below the home line. THAT'S how scary this should be to companies.

When your "business critical support" lacks the knowledge and skills of your first-year help desk people, you need to be worried about their ability to support you. Sure, when nothing goes wrong, everything is fine. But if anything goes wrong, you are suggesting these people don't have even the most rudimentary knowledge of systems today. That's worrisome. And it's why so many systems simply have no support options: they rely on software and hardware that is out of support, so while the app might call itself supported, it depends on non-production systems, which makes the whole thing out of support by extension.

So when running with a preallocated qcow2 image, which caching mode do you use for your disk: writethrough, writeback, directsync, or none?

What about the I/O mode: native, threads, or default?

No one can support every hypervisor at that level.

Also, none of those things need to be supported by the app vendor. They just need to support the app and stop looking for meaningless excuses to block support. I understand some vendors want to support all the way down the stack, but if they don't know how to do that with virtualization, they don't know how to do it. The skills to support the stack would give them the skills to do it virtually even better (fewer variables). So that logic doesn't hold up.

You still haven't provided a single healthcare vendor that does any of what you say is appropriate.

We don't work in healthcare, so why would we know the vendors? The point is that software should be supportable in its environment (hypervisor-agnostic).
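For reference, the two settings stacksofplates asks about are per-disk options. On a QEMU/KVM guest managed by libvirt they appear as the `cache` and `io` attributes of the disk's `<driver>` element. The sketch below builds that element with Python's standard `xml.etree.ElementTree`. The function name `disk_driver_xml`, the image path, and the `vda`/virtio target are hypothetical, and the rule that `io='native'` needs `cache='none'` or `directsync` reflects our understanding of QEMU (native AIO relies on O_DIRECT), so check it against your QEMU version.

```python
import xml.etree.ElementTree as ET

# Cache modes libvirt accepts on <driver cache=...>.
CACHE_MODES = {"default", "none", "writethrough", "writeback", "directsync", "unsafe"}
# I/O modes; "default" is our shorthand here for omitting the io attribute.
IO_MODES = {"default", "native", "threads", "io_uring"}


def disk_driver_xml(image_path, cache="none", io="native"):
    """Build a libvirt <disk> element for a qcow2 image (hypothetical helper)."""
    if cache not in CACHE_MODES:
        raise ValueError(f"unknown cache mode: {cache}")
    if io not in IO_MODES:
        raise ValueError(f"unknown io mode: {io}")
    # Assumption: QEMU's native AIO needs O_DIRECT, which only
    # cache=none and cache=directsync provide.
    if io == "native" and cache not in ("none", "directsync"):
        raise ValueError("io='native' requires cache='none' or 'directsync'")

    disk = ET.Element("disk", type="file", device="disk")
    driver = ET.SubElement(disk, "driver", name="qemu", type="qcow2")
    if cache != "default":  # omitting the attribute leaves it to the hypervisor
        driver.set("cache", cache)
    if io != "default":
        driver.set("io", io)
    ET.SubElement(disk, "source", file=image_path)
    ET.SubElement(disk, "target", dev="vda", bus="virtio")  # illustrative target
    return ET.tostring(disk, encoding="unicode")


print(disk_driver_xml("/var/lib/libvirt/images/app.qcow2"))
```

Whichever combination you pick, it is a hypervisor-level choice the application itself never sees.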

Dashrender

@stacksofplates said in Domain Controller Down (VM):

No, it was that they only supported VMware.

Yep - John, who helped @wirestyle22, said that the EHR vendor only supported VMware, and then this exploded this morning when I mentioned that same fact.

scottalanmiller

@stacksofplates said in Domain Controller Down (VM):

No, it's not. It makes no sense to virtualize a system like that when, on bare metal, you're pegging the system at 100%. I can re-kickstart the system in the same amount of time it takes to copy the image back onto the hypervisor.

If you are pegging the system at 100%, you need a bigger system, virtual or physical. No app vendor is selling systems that need resources that can't be virtualized. The only market for that is latency-based, not capacity-based, and every latency-sensitive app of that nature that I've ever heard proposed is bespoke, with no vendor support.

Virtualization has effectively zero overhead. If you are at 100% capacity, virtualization isn't your problem. Moving to physical just makes scaling up and solving those problems a little harder. You are using failed capacity planning and a lack of support as reasons to do more things poorly - making mistakes to justify mistakes.

stacksofplates

@scottalanmiller said in Domain Controller Down (VM):

If you are pegging the system at 100%, you need a bigger system, virtual or physical. No app vendor is selling systems that need resources that can't be virtualized. The only market for that is latency-based, not capacity-based, and every latency-sensitive app of that nature that I've ever heard proposed is bespoke, with no vendor support.

Virtualization has effectively zero overhead. If you are at 100% capacity, virtualization isn't your problem. Moving to physical just makes scaling up and solving those problems a little harder. You are using failed capacity planning and a lack of support as reasons to do more things poorly - making mistakes to justify mistakes.

No, we don't need bigger systems. We have jobs that run across 20 nodes (multiple clusters), each at 100% with 256 GB of RAM and 20 E7 cores. They take 100% no matter what you run. And they run for as long as 6 months on one calculation. There is no scale issue. I plug a new node in and it's part of the cluster.

scottalanmiller

@stacksofplates said in Domain Controller Down (VM):

No, we don't need bigger systems. We have jobs that run across 20 nodes (multiple clusters), each at 100% with 256 GB of RAM and 20 E7 cores. They take 100% no matter what you run. And they run for as long as 6 months on one calculation. There is no scale issue. I plug a new node in and it's part of the cluster.

Ah, you are using a different type of virtualization: a multi-system virtualization platform. That's different in that it is semantically not what people normally mean by virtualization, but it is covered by what we mean. That's not a server; that's a node in a server. I ran the 10K-node compute cluster for a Wall St. firm. That was virtual, but the virtualization was at a higher layer, above the individual nodes, like you have there. It's one cluster, not individual servers.

Dashrender

@stacksofplates said in Domain Controller Down (VM):

You still haven't provided a single healthcare vendor that does any of what you say is appropriate.

I know Greenway didn't have a virtualization plan 3 years ago when we were looking at them. It's why I had to build a ridiculous $100K two-server failover system. Today the performance needed could be had for $25K.
The sad thing is that the vendor could not provide any IOPS requirements, etc. They only had this generic hardware requirement:
SQL: dual-processor Xeon, 4 cores each; 2-drive boot, 4-drive RAID 10 for SQL, 4-drive log
RDS: single-processor Xeon, 4 cores; 2-drive boot, 2-drive data
IIS application: dual-processor Xeon, 4 cores each; 2-drive boot, 6-drive RAID 10 data
etc.
etc.
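What the vendor could not provide is a way to size storage from workload numbers rather than drive counts. A rough sketch of the common RAID 10 rule of thumb (reads can be served by any member, each logical write costs two physical writes): the function name, the 150 IOPS per drive, and the 70/30 read/write mix are assumptions for illustration, not Greenway figures.

```python
def raid10_iops(drives: int, per_drive_iops: float, read_fraction: float) -> float:
    """Approximate front-end IOPS for a RAID 10 array (rule of thumb only).

    Reads can come from any drive; every write lands on both members of a
    mirror, so writes carry a penalty of 2.
    """
    if drives < 4 or drives % 2:
        raise ValueError("RAID 10 needs an even number of drives, at least 4")
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    raw = drives * per_drive_iops
    write_fraction = 1.0 - read_fraction
    return raw / (read_fraction + 2 * write_fraction)


# Hypothetical 150 IOPS per spindle and a 70% read workload.
for label, n in [("SQL data", 4), ("IIS data", 6)]:
    print(label, round(raid10_iops(n, 150, read_fraction=0.7)))
```

Working it the other way (start from the application's measured IOPS and solve for drives) is the kind of requirement a vendor would need to publish for anyone to size a system, physical or virtual.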

stacksofplates

@scottalanmiller said in Domain Controller Down (VM):

Ah, you are using a different type of virtualization: a multi-system virtualization platform. That's different in that it is semantically not what people normally mean by virtualization, but it is covered by what we mean. That's not a server; that's a node in a server. I ran the 10K-node compute cluster for a Wall St. firm. That was virtual, but the virtualization was at a higher layer, above the individual nodes, like you have there. It's one cluster, not individual servers.

It's just a job scheduler over the nodes. It's not really virtualization (I guess maybe if you consider the scheduler some kind of virtualization). The master node just runs on whatever we tell it. It can also just run on itself.

scottalanmiller

@stacksofplates said in Domain Controller Down (VM):

No, we don't need bigger systems. We have jobs that run across 20 nodes (multiple clusters), each at 100% with 256 GB of RAM and 20 E7 cores. They take 100% no matter what you run. And they run for as long as 6 months on one calculation. There is no scale issue. I plug a new node in and it's part of the cluster.

That's kind of a scale issue, though. You can get single servers bigger than that, which would likely do the calculation in a fraction of the time. Not seconds, still months, but potentially a bit faster and without needing a cluster (and all big iron servers are 100% virtual). You only have this semi-physical option because you are running below a certain scale and virtualizing at the cluster level.

scottalanmiller

@Dashrender said in Domain Controller Down (VM):

I know Greenway didn't have a virtualization plan 3 years ago when we were looking at them. It's why I had to build a ridiculous $100K two-server failover system. Today the performance needed could be had for $25K.
The sad thing is that the vendor could not provide any IOPS requirements, etc. They only had this generic hardware requirement:
SQL: dual-processor Xeon, 4 cores each; 2-drive boot, 4-drive RAID 10 for SQL, 4-drive log
RDS: single-processor Xeon, 4 cores; 2-drive boot, 2-drive data
IIS application: dual-processor Xeon, 4 cores each; 2-drive boot, 6-drive RAID 10 data
etc.
etc.

Because... no support

Dashrender

@stacksofplates said in Domain Controller Down (VM):

It's just a job scheduler over the nodes. It's not really virtualization (I guess maybe if you consider the scheduler some kind of virtualization). The master node just runs on whatever we tell it. It can also just run on itself.

But the work is spread out over the different compute nodes; the fact that the main program doesn't care where the compute comes from is what makes this a virtualized setup. Nodes versus servers: a node is useless on its own, like a CPU is useless outside the server, but a server is a complete thing, usable on its own.
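Dashrender's point is that the submitting program only hands work to a pool and never picks a machine. A minimal sketch of that shape using Python's `concurrent.futures`, with local threads standing in for compute nodes; `solve_chunk`, `run_job`, and the chunking scheme are hypothetical stand-ins, not the scheduler stacksofplates' team actually runs.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_chunk(chunk):
    """Hypothetical unit of work: sum of squares over [start, stop)."""
    start, stop = chunk
    return sum(i * i for i in range(start, stop))


def run_job(total, pieces, workers=4):
    """Split a job into pieces and let a worker pool decide who runs each one."""
    if total <= 0 or pieces < 1:
        raise ValueError("need total > 0 and pieces >= 1")
    step = -(-total // pieces)  # ceiling division so every index is covered
    chunks = [(s, min(s + step, total)) for s in range(0, total, step)]
    # Threads stand in for compute nodes here; a real cluster scheduler would
    # dispatch each chunk to whichever node is free. The caller never picks one.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(solve_chunk, chunks))


print(run_job(1_000_000, pieces=20))
```

Swapping the thread pool for a process pool, or for a real batch scheduler, changes where the chunks run but not the calling code, which is the sense in which the cluster, rather than any one node, is the "server".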

stacksofplates

@scottalanmiller said in Domain Controller Down (VM):

That's kind of a scale issue, though. You can get single servers bigger than that, which would likely do the calculation in a fraction of the time. Not seconds, still months, but potentially a bit faster and without needing a cluster (and all big iron servers are 100% virtual). You only have this semi-physical option because you are running below a certain scale and virtualizing at the cluster level.

It won't, though. This breaks the calculations up so each is manageable. CFD is kind of nuts. The acoustics solves are crazy also. They kind of have to be broken up into pieces.
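Breaking a CFD solve into pieces is usually done by domain decomposition: each node owns a block of the mesh and exchanges a thin layer of boundary ("halo") cells with its neighbours every step. A one-dimensional toy of the partitioning step, assuming contiguous blocks and a halo width of 1; the function `partition` is hypothetical, and real solvers partition 3D meshes with dedicated tools rather than anything this simple.

```python
def partition(n_cells: int, n_parts: int, halo: int = 1):
    """Split cells 0..n_cells-1 into n_parts contiguous blocks.

    Returns (owned_start, owned_stop, read_start, read_stop) per part: each
    part owns its block but also reads `halo` neighbouring cells on each
    side, which a real solver would exchange with adjacent nodes each step.
    """
    if n_parts < 1 or n_parts > n_cells:
        raise ValueError("need 1 <= n_parts <= n_cells")
    base, extra = divmod(n_cells, n_parts)
    parts, start = [], 0
    for p in range(n_parts):
        size = base + (1 if p < extra else 0)  # spread the remainder over the first parts
        stop = start + size
        parts.append((start, stop, max(0, start - halo), min(n_cells, stop + halo)))
        start = stop
    return parts


print(partition(10, 3))  # [(0, 4, 0, 5), (4, 7, 3, 8), (7, 10, 6, 10)]
```

The halo exchange is where inter-node traffic comes from, so the pieces have to be large enough that computing on them outweighs swapping their edges.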

stacksofplates

@stacksofplates said in Domain Controller Down (VM):

It won't, though. This breaks the calculations up so each is manageable. CFD is kind of nuts. The acoustics solves are crazy also. They kind of have to be broken up into pieces.

We've done a bit of research on the most we can get out of everything (for single jobs). We are at about the peak right now.

If we had 300 jobs running, we could get more, but we're at about the max right now (unless we want to spend a ton of money for very little more performance).
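One standard way to see why extra hardware buys so little near the peak is Amdahl's law: if a fraction p of a job parallelizes perfectly, the speedup on n workers is 1 / ((1 - p) + p / n). A short sketch; the 95% parallel fraction is a made-up figure, not a measurement from stacksofplates' workloads.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup of a fixed-size job on `workers` processors."""
    if not 0.0 <= parallel_fraction <= 1.0 or workers < 1:
        raise ValueError("parallel_fraction must be in [0, 1] and workers >= 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


# Hypothetical job that is 95% parallelizable; 400 cores matches 20 nodes x 20 cores.
for n in (100, 200, 400, 800):
    print(n, round(amdahl_speedup(0.95, n), 1))
```

With p = 0.95 the speedup can never exceed 1 / (1 - p) = 20, however many cores are added, so each doubling of hardware buys less than the last.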

Dashrender

@scottalanmiller said in Domain Controller Down (VM):

Because... no support

Eh? Yeah, Greenway didn't bother to do the right thing for their customers and support hypervisors! Shit, how can they really support their customers on bare metal if they don't know the IOPS requirements, etc.? Just keep stabbing at hardware until they "get lucky"?

scottalanmiller

@Dashrender said in Domain Controller Down (VM):

Eh? Yeah, Greenway didn't bother to do the right thing for their customers and support hypervisors! Shit, how can they really support their customers on bare metal if they don't know the IOPS requirements, etc.? Just keep stabbing at hardware until they "get lucky"?

That's my guess. Lacking support of VMs isn't exactly the big issue... it's WHY they lack that support that is the big issue.

scottalanmiller

@stacksofplates said in Domain Controller Down (VM):

We've done a bit of research on the most we can get out of everything (for single jobs). We are at about the peak right now.

If we had 300 jobs running, we could get more, but we're at about the max right now (unless we want to spend a ton of money for very little more performance).

Have they looked at other architectures like ARM, POWER, and SPARC? That's often where the big performance boosts are, but not always. It's very workload-dependent.

DustinB3403

@scottalanmiller said in Domain Controller Down (VM):

That's my guess. Lacking support of VMs isn't exactly the big issue... it's WHY they lack that support that is the big issue.

Why they don't support VMs is very important, almost as important as the question of what made them not want to support their application in a VM.

Dashrender

@scottalanmiller said in Domain Controller Down (VM):

That's my guess. Lacking support of VMs isn't exactly the big issue... it's WHY they lack that support that is the big issue.

LOL - Aside from someone like Epic, from what I can tell, they are mostly software developers who don't care about the hardware/VM it's running on. They don't approach the software holistically.

scottalanmiller

@stacksofplates said in Domain Controller Down (VM):

It won't, though. This breaks the calculations up so each is manageable. CFD is kind of nuts. The acoustics solves are crazy also. They kind of have to be broken up into pieces.

I understand how it works; that's why we used massive Monte Carlo clusters and such. Some calculations are pretty discrete, but there is still cluster coordination overhead.
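A toy model makes the coordination-overhead point concrete: assume the work divides perfectly across nodes but each additional node adds a fixed coordination cost. Both the 1/n work term and the linear overhead term are assumptions chosen for illustration, and the function names and numbers below are hypothetical.

```python
def runtime(work: float, nodes: int, overhead_per_node: float) -> float:
    """Toy model: perfectly divisible work plus coordination cost that grows with node count."""
    return work / nodes + overhead_per_node * (nodes - 1)


def best_node_count(work: float, overhead_per_node: float, max_nodes: int) -> int:
    # Brute-force the node count with the lowest modelled runtime.
    return min(range(1, max_nodes + 1), key=lambda n: runtime(work, n, overhead_per_node))


print(best_node_count(10_000, 1.0, 1000))  # 100
```

Under these made-up numbers, runtime bottoms out around 100 nodes (the square root of work divided by per-node overhead) and gets worse beyond that, which is one way a single larger machine can beat a wider cluster.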

stacksofplates

@scottalanmiller said in Domain Controller Down (VM):

Have they looked at other architectures like ARM, POWER, and SPARC? That's often where the big performance boosts are, but not always. It's very workload-dependent.

We used to have a lot of SPARC (we still have a lot around), but they found the performance was better on x86 (before my time).