Is this server strategy reckless and/or insane?
-
@tim_g said in Is this server strategy reckless and/or insane?:
@matteo-nunziati said in Is this server strategy reckless and/or insane?:
I've seen sata ssd x4 raid5 outperform 15k sas x4 raid 10.
x4 SSD any RAID level will outperform x4 15k HDD in any configuration
You're looking at a max of like 250ish realistic IOPS with 15k HDDs. Sure, you can get more at like 100% sequential reads, but not in typical use.
An SSD will give at least tens of thousands of IOPS per drive, up to hundreds of thousands per drive. There really is no comparison.
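As a rough sanity check on the ~250 IOPS figure, the mechanics can be sketched in a few lines. The seek and rotational numbers below are illustrative assumptions, not specs for any particular drive:

```python
# Back-of-envelope random IOPS for a 15k RPM HDD.
# Per-IO service time = average seek + average rotational latency.
rpm = 15000
avg_seek_ms = 3.4                              # assumed typical 15k SAS seek
rotation_ms = 60_000 / rpm                     # 4.0 ms per full revolution
avg_rot_latency_ms = rotation_ms / 2           # on average, half a revolution
service_time_ms = avg_seek_ms + avg_rot_latency_ms

iops = 1000 / service_time_ms                  # IOs the head can service per second
print(round(iops))                             # lands in the low hundreds
```

With slightly different seek assumptions you get anywhere from ~150 to ~250 IOPS, which is why the post's "250ish realistic" number is about the ceiling for random IO on spinning 15k disks.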
If you deactivate cache, SSD write perf is not so stellar on SATA. Read is still fast.
-
@tim_g said in Is this server strategy reckless and/or insane?:
x4 SSD any RAID level will outperform x4 15k HDD in any configuration
ehhhhh There are some low-end SSD's that are "read optimized" that, if I throw steady-state throughput writes at them, will fall over and implode (latency shoots through the roof, especially when garbage collection kicks in or the drive starts to get full).
If you don't have NVDIMM's doing write coalescing or a 2-tier design (a write-endurance tier to absorb the writes) you can get really unpredictable latency out of the lower-tier SATA SSD's.
-
@scottalanmiller said in Is this server strategy reckless and/or insane?:
And that's on SATA. Go to PCIe and you can breach a million per drive!
Technically the "Million IOPS" card is NVMe (also it's reads only and that is most CERTAINLY reporting numbers coming from the DRAM on the card and not the actual NAND).
-
@matteo-nunziati said in Is this server strategy reckless and/or insane?:
About bench. I've made some tests with my new server before deployment. Disabling controller and disk cache helped a lot in understanding the real perf of the disks.
I've seen sata ssd x4 raid5 outperform 15k sas x4 raid 10.
Enabling cache at controller level blends things, even with big files, making benches a bit more blurry.
Running benchmarks is a dark art. Especially with cache.
-
Some workloads are cache friendly (So a hybrid system of DRAM or NAND cache and magnetic drives will work the same as an "all flash").
-
There are a lot of caches... There are controller caches, there are DRAM caches inside of drives (you can't disable this on SSD's, and can only sometimes turn it off on SATA magnetic drives and others). Some SDS systems use one tier of NAND as a write cache also, some do read/write caches.
-
Trying to maximize drives when testing them for Throughput or IOPS is different than trying to profile steady state latency under low queue depth.
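One way to see the difference: Little's Law ties outstanding IOs, latency, and IOPS together, so the same device looks wildly different depending on queue depth. The 200 µs latency below is an assumed illustrative SSD figure:

```python
# Little's Law applied to storage: IOPS = outstanding IOs / per-IO latency.
# Low queue depth profiles latency; high queue depth chases peak throughput.
def iops(queue_depth, latency_s):
    return queue_depth / latency_s

per_io_latency = 200e-6            # 200 microseconds (assumed)

low_qd = iops(1, per_io_latency)   # QD=1: this experiment measures latency
high_qd = iops(128, per_io_latency)  # QD=128: this one measures throughput

print(round(low_qd), round(high_qd))
```

Same drive, same per-IO latency, two orders of magnitude apart in reported IOPS: which is why "maximize IOPS" and "profile steady-state latency at low queue depth" are different tests.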
99% of people I talk to who are testing something are doing something fairly terrible that doesn't test what they want. They are doing CrystalDisk or some desktop-class tool to test a single disk, on a single vHBA, on a single VM that's touching only a handful of the disks or a single cache device.
For benchmarking on VMware vSAN with HCIBench there is now a cloud analytics platform that will diagnose whether you are properly creating a workload and configuration that is truly trying to maximize something (throughput, latency, IOPS). If it's not optimized, it will suggest improvements (maybe stripe objects more, tune disk groups, generate more queued IO with your workers). This is actually pretty cool in that it helps make sure you are doing real benchmarking and not testing the speed of your DRAM.
-
-
@storageninja said in Is this server strategy reckless and/or insane?:
@tim_g said in Is this server strategy reckless and/or insane?:
x4 SSD any RAID level will outperform x4 15k HDD in any configuration
ehhhhh There are some low-end SSD's that are "read optimized" that, if I throw steady-state throughput writes at them, will fall over and implode (latency shoots through the roof, especially when garbage collection kicks in or the drive starts to get full).
If you don't have NVDIMM's doing write coalescing or a 2-tier design (a write-endurance tier to absorb the writes) you can get really unpredictable latency out of the lower-tier SATA SSD's.
Yup, there's always exceptions to everything.
-
@storageninja said in Is this server strategy reckless and/or insane?
- There are a lot of caches... There are controller caches, there are DRAM caches inside of drives (you can't disable this on SSD's, and can only sometimes turn it off on SATA magnetic drives and others). Some SDS systems use one tier of NAND as a write cache also, some do read/write caches.
I was aware of only 3 levels of cache: os, controller, disk.
Disk cache can be disabled according to my controller. Is the cache you are talking about another level inside the disk?
-
@matteo-nunziati said in Is this server strategy reckless and/or insane?:
I was aware of only 3 levels of cache: os, controller, disk.
If you use StarWind, you add two additional caches just with them... they have a RAM disk cache and an SSD cache layer, and both fit between the OS and the controller cache levels of your list.
-
@matteo-nunziati said in Is this server strategy reckless and/or insane?:
@storageninja said in Is this server strategy reckless and/or insane?
- There are a lot of caches... There are controller caches, there are DRAM caches inside of drives (you can't disable this on SSD's, and can only sometimes turn it off on SATA magnetic drives and others). Some SDS systems use one tier of NAND as a write cache also, some do read/write caches.
I was aware of only 3 levels of cache: os, controller, disk.
Disk cache can be disabled according to my controller. Is the cache you are talking about another level of the disk?
It's technically inside the disk controller (you can't put DRAM on a platter!).
Enough magnetic SATA drives will ignore the command that, combined with the performance being completely lousy, VMware stopped certifying them on the vSAN HCL.
As far as SSD's go, they use DRAM to de-amplify writes. If you didn't do this (absorb writes, compress them, compact them) the endurance on TLC would be complete and utter garbage, and the drive vendor would get REALLY annoyed replacing drives that either performed like sludge or were burning out the NAND cells. Some TLC drives will also use an SLC buffer for incoming writes beyond the DRAM buffer (and slide the NAND in and out of SLC mode as it can't take the load anymore, retiring it for write-cold data). SSD's are basically a mini storage array inside of the drive (which is why you see FPGA's and ASIC's and 4-core ARM processors on the damn things).
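The endurance math here is easy to sketch. Assuming a 1 TB TLC drive rated for ~3000 P/E cycles (illustrative numbers, not any vendor's spec), write amplification directly divides the host writes the drive can absorb, which is exactly what the DRAM/SLC buffering is fighting:

```python
# Rough TBW (terabytes written) estimate for a TLC SSD.
# Host endurance = (capacity * P/E cycles) / write amplification.
capacity_tb = 1.0          # 1 TB drive (assumed)
pe_cycles = 3000           # assumed TLC program/erase rating
raw_endurance_tb = capacity_tb * pe_cycles   # total NAND writes the cells survive

for wa in (1.5, 3.0, 6.0):                   # write amplification factor
    host_tbw = raw_endurance_tb / wa
    print(f"WA={wa}: ~{host_tbw:.0f} TB of host writes")
```

Halving write amplification doubles drive life, so absorbing and coalescing writes in DRAM before they hit the NAND is an endurance feature, not just a performance one.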
There are also hypervisor caches (Hyper-V has some kind of NTFS DRAM cache; ESXi has CBRC, a deduped RAM cache commonly used for VDI, and Client Cache, a data-local DRAM cache for VMware vSAN) and there are application caches (simple ones like SQL, more complicated ones like PVS's write overflow cache, which risks data loss to give you faster writes, so it must be used strategically).
On top of this there are just other places besides the disk for bottlenecks to occur, inside of various IO queues or elsewhere.
The vHBA, the clustered file system locking, weird SMB redirects used by CSV, LUN queues, target queues, server HBA queues (total, and max per-LUN queue, which are wildly different). Kernel-injected latency can cause IO to back up (no CPU cycles to process IO will cause storage latency) as well as the inverse (high disk latency, and CPU cycles get stuck waiting on IO!), which can lead to FUN race conditions. Sub-LUN queues (vVols), and even NFS has per-mount queues! IO filters (VAIO!) and guest OS filter driver layers can also add cache or impact performance. Throw in quirks of storage arrays (many don't use more than one thread for the array, or for a given feature per LUN or RAID group, like how FLARE ran for ages) and you could have a system that's at 10% CPU load, but it being a 10-core system, the one core that does cache logic is pegged and causing horrible latency.
You can even have systems that try to dynamically balance these queues to prevent noisy neighbor issues (SIOCv1!)
http://www.yellow-bricks.com/2011/03/04/no-one-likes-queues/
-
@storageninja ok smartphone here. Will be ultrashort.
0- really enlightening, thank you!
1- I was thinking about a simple layout for bench: OS, RAID controller, disks. No hypervisor, no apps on the system like network FS servers and so on.
I tested the machine w/ CentOS with iozone. So my fault: with controller I meant RAID controller.
2- yes, cache is on the disk controller board.
3- so when my RAID controller card asks me to disable the disks' onboard cache, and performance actually drops a lot on SSD, what actually happens? Is the DRAM still alive?
-
@matteo-nunziati said in Is this server strategy reckless and/or insane?:
@storageninja ok smartphone here. Will be ultrashort.
0- really enlightening, thank you!
1- I was thinking about a simple layout for bench: OS, RAID controller, disks. No hypervisor, no apps on the system like network FS servers and so on.
I tested the machine w/ CentOS with iozone. So my fault: with controller I meant RAID controller.
2- yes, cache is on the disk controller board.
3- so when my RAID controller card asks me to disable the disks' onboard cache, and performance actually drops a lot on SSD, what actually happens? Is the DRAM still alive?
Depends on the vendor and the drive, but I would suspect the DRAM cache is still being used (again, to protect endurance); it's just delaying the ACK until the write gets to the lower level. Now, some enterprise drives that have capacitors (so they can protect that DRAM completely on power loss) will sometimes still ACK a write in DRAM anyways (as nothing really changes, and it's why those drives can post giant performance numbers). On a drive that has full power loss protection built in, benching with the cache disabled is dumb, as we don't care what the RAW NAND can do, we care what the drive can do under a given load. You're better off in this case, if you want to stress the drives, doing two tests:
- 75% write small block (some drives fall over on mixed workload).
- 100% sequential write large block (256KB).
Even then, if your workload doesn't look like this (most don't), it's kinda pointless finding the break points of drives. The point of benchmarking is to make sure a system will handle your workload, not to find its break point.
Most people screw up and accidentally test a cupcake (a DRAM cache for reads somewhere), or try to break it with an unrealistic workload. Outside of engineering labs for SSD drives and storage products, there isn't a lot of use for this.
Another thing to note is you can capture and replay an existing workload using vSCSI trace capture and one of the VMware storage flings. You can even "accelerate" it or duplicate it several times over. This helps you know what your REAL workload will look like on a platform.
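The two stress tests suggested above map naturally onto fio job files. Here is a minimal sketch that generates them; the job names, queue depths, and runtimes are illustrative assumptions, so tune them to your gear:

```python
# Generate fio job-file text for the two suggested stress profiles:
#  1) 75% write, small block, random (mixed workloads break weak drives)
#  2) 100% sequential write, large block (steady-state throughput)
def fio_job(name, **opts):
    lines = [f"[{name}]"]
    lines += [f"{k}={v}" for k, v in opts.items()]
    return "\n".join(lines)

mixed_small = fio_job(
    "mixed-small-block",
    rw="randrw", rwmixwrite=75, bs="4k",     # 75% writes at 4 KiB
    iodepth=32, direct=1,                    # bypass the page cache
    runtime=300, time_based=1,               # 5 minutes, long enough to
)                                            # get past any SLC buffer

seq_large = fio_job(
    "seq-large-block",
    rw="write", bs="256k",                   # 100% sequential 256 KiB writes
    iodepth=8, direct=1, runtime=300, time_based=1,
)

print(mixed_small)
print()
print(seq_large)
```

Running each long enough to exhaust any DRAM/SLC front buffer is the point; a 30-second run just benchmarks the cache the thread is warning about.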
-
Another trend in benchmarking is using stuff like HCIBench or VM Fleet to test LOTS of workloads. A single worker in a single VM doesn't show what contention looks like at scale.
-
@storageninja best meme I've seen in a long time.
-
enterprise drives that have capacitors
This. I asked the reseller about this feature. Their answer: disable SSD cache anyway and use controller cache.
The former is a safer choice while the latter is too new/untested a feature...
-
You're better off in this case, if you want to stress the drives, doing two tests.
I did a test w/ random read and write simulating a thread per expected user.
-
@matteo-nunziati said in Is this server strategy reckless and/or insane?:
This. I asked the reseller about this feature. Their answer: disable SSD cache anyway and use controller cache.
The former is a safer choice while the latter is too new/untested a feature...
To be blunt, the reseller doesn't know what they are talking about. Every enterprise SSD in the modern era (using some sort of FTL) uses this design and has for years. They are configured this way even in big enterprise storage arrays, with the unique exception of Pure Storage, who re-writes their firmware to basically use drives as dumb NAND devices (and then has MASSIVE NVRAM buffers fronting the drives that do the same damn thing at a global level).
Some SDS systems want you to explicitly disable the front cache as it will coalesce data and prevent data proximity optimizations in the actual raw data placement. It also exists as yet another place that data can be lost or corrupted and for systems that want to "own" IO integrity end to end they want to know where stuff is.
Then again, what do I know...
-
@storageninja said in [Is this server strategy reckless and/or insane?]
Then again, what do I know...
According to a vendor, or anyone that's got a clue?
-
@travisdh1 My job is to fly drink and talk primarily
-
@storageninja said in Is this server strategy reckless and/or insane?:
@travisdh1 My job is to fly drink and talk primarily
I don't know how to "fly drink" or to "talk primarily"!
-
@scottalanmiller said in Is this server strategy reckless and/or insane?:
@storageninja said in Is this server strategy reckless and/or insane?:
@travisdh1 My job is to fly drink and talk primarily
I don't know how to "fly drink" or to "talk primarily"!
Your job description includes talking to people on web forums now, doesn't it? Also, when do you stop drinking?
-
@travisdh1 said in Is this server strategy reckless and/or insane?:
Your job description includes talking to people on web forums now, doesn't it? Also, when do you stop drinking?
Fly, Drink, Talk. There you go.
No, hanging out on web forums is not my job.
I actually didn't drink that much this weekend (was too hot, working on the beach house). My day job involves...
-
Flying to conferences and speaking. I have 11 conference presentations in the next 4 weeks. Crowd size is 200-800.
-
Flying to fun places and meeting with people. I'll be in India soon meeting with Customers, Partners, and SE's training them and taking questions, and collecting feedback for engineering.
-
Breaking things. I technically am classified as an R&D employee and have full access to our nightly builds, our BAT private cloud, and a dozen "fully loaded" servers for a lab. I test the new stuff, send feedback through my customer [0] team, and meet with engineers to capture the subtleties of what's coming out. I don't write the technical publications (core documentation), but I do draft thousands upon thousands of words for design, sizing, and usage guides, and blogs.
-
I host a podcast for the lols.
-