    ZFS Based Storage for Medium VMWare Workload

    Tags: zfs, storage, virtualization, filesystems, raid
    156 Posts 9 Posters 86.8k Views
    • scottalanmillerS
      scottalanmiller @donaldlandru

      @donaldlandru said:

      • Add a second SAN that replicates with the first (HP MSA easy to do, not so nice price tag)

      I've never seen someone do this successfully. That doesn't suggest that it doesn't work, but are you sure that the MSA series will do SAN mirroring with fault tolerance? I'm not confident that that is a feature (but certainly not confident that it isn't.) Double check that to be sure as I talk to MSA users daily and no one has ever led me to believe that this was even an option.

      I know that Dell's MD series cannot do this, only the EQL series.

      • donaldlandruD
        donaldlandru @scottalanmiller

        @scottalanmiller said:

        @donaldlandru said:

        OK, if we split this into two separate topics, the only unmitigated failure point in operations is the single SAN. Two options to mitigate the risk are:

        Not currently. You had said that your nodes do not have the tools or the overhead to absorb the load from a failed node, correct? That leaves the risk of those nodes failing unmitigated as well. You only have enough nodes to handle your capacity, not enough to use them for failure mitigation.

        In Operations, the two-node cluster, I said they do have the necessary resources to absorb the other node failing. It is the development "cluster that isn't a cluster" that cannot absorb a node failure.

        • scottalanmillerS
          scottalanmiller @Dashrender

          @Dashrender said:

          That would cost a lot more than his current $14,000 budget (assuming that number was a budget number).

          Yes, but it would cost far less than what he was proposing. My original recommendations were meant to lower his cost while improving reliability. Then he leapt to the Ferrari scenario, so I proposed another solution that beats that one, keeps the Ferrari features, and still costs only a fraction as much.

          • scottalanmillerS
            scottalanmiller @donaldlandru

            @donaldlandru said:

            In Operations, the two-node cluster, I said they do have the necessary resources to absorb the other node failing. It is the development "cluster that isn't a cluster" that cannot absorb a node failure.

            Oh okay. So mitigated where it matters, I assume, and unmitigated where it doesn't matter so much. That I was not clear about.

            • DashrenderD
              Dashrender @scottalanmiller

              @scottalanmiller said:

              Going to VSAN, Starwind, DRBD, etc. would be an "orders of magnitude leap" that is not warranted. It just can't make sense. What you have today and what you are talking about moving to are insanely "low availability." Crazy low. And no one had any worries or concerns about that, right?

              That's just it - the company probably thinks they have that super high level of availability and the fact that they've never had a failure feeds that fire of belief.

              This has always been the issue I've had when I try to redesign/spend more money on a new solution. I get the push back - "well, we did it that cheap way before and it worked for 8 years, why do I suddenly now need to use this other way, clearly the old cheap way works."

              • donaldlandruD
                donaldlandru @Dashrender

                @Dashrender said:

                @scottalanmiller said:

                Going to VSAN, Starwind, DRBD, etc. would be an "orders of magnitude leap" that is not warranted. It just can't make sense. What you have today and what you are talking about moving to are insanely "low availability." Crazy low. And no one had any worries or concerns about that, right?

                That's just it - the company probably thinks they have that super high level of availability and the fact that they've never had a failure feeds that fire of belief.

                This has always been the issue I've had when I try to redesign/spend more money on a new solution. I get the push back - "well, we did it that cheap way before and it worked for 8 years, why do I suddenly now need to use this other way, clearly the old cheap way works."

                This -- all day long! It worked for the 7 years before you got here, it will keep working long after. I fight this fight every day.

                • dafyreD
                  dafyre @scottalanmiller

                  @scottalanmiller From previous posts, it sounds like they are most concerned with the Dev environment right now, since the Ops cluster appears to be OK.

                  • coliverC
                    coliver @donaldlandru

                    @donaldlandru said:

                    @Dashrender said:

                    @scottalanmiller said:

                    Going to VSAN, Starwind, DRBD, etc. would be an "orders of magnitude leap" that is not warranted. It just can't make sense. What you have today and what you are talking about moving to are insanely "low availability." Crazy low. And no one had any worries or concerns about that, right?

                    That's just it - the company probably thinks they have that super high level of availability and the fact that they've never had a failure feeds that fire of belief.

                    This has always been the issue I've had when I try to redesign/spend more money on a new solution. I get the push back - "well, we did it that cheap way before and it worked for 8 years, why do I suddenly now need to use this other way, clearly the old cheap way works."

                    This -- all day long! It worked for the 7 years before you got here, it will keep working long after. I fight this fight every day.

                    All you need is one good outage and the tune changes. Or at least that's what I experienced at my last position.

                    • donaldlandruD
                      donaldlandru @scottalanmiller

                      @scottalanmiller said:

                      @donaldlandru said:

                      • Add a second SAN that replicates with the first (HP MSA easy to do, not so nice price tag)

                      I've never seen someone do this successfully. That doesn't suggest that it doesn't work, but are you sure that the MSA series will do SAN mirroring with fault tolerance? I'm not confident that that is a feature (but certainly not confident that it isn't.) Double check that to be sure as I talk to MSA users daily and no one has ever led me to believe that this was even an option.

                      I know that Dell's MD series cannot do this, only the EQL series.

                      In real life I am not sure if it works; on paper it does. It is a false sense of security, but the MSA does have active/active controllers built in (10Gb iSCSI), redundant power supplies, and of course the disks are in a RAID. The risks that are not mitigated by the single chassis are:

                      • Chassis failure (I am sure it can happen, but the only part in the chassis is the backplane and some power routing)
                      • Software bug -- most likely failure to occur
                      • Human error (oops I just unplugged the storage chassis)

                      All in all, I think operations is pretty well protected, minus the three risks listed above. It is two nodes that can absorb either node failing, and it is on redundant 10Gb top-of-rack switches and redundant 1Gb switches. Backups are also done and tested with Veeam. Am I missing something here?

                      Unless I am mistaken, and Scott please correct me if I am, it is the three-node development cluster that is in sorry shape.

                      • dafyreD
                        dafyre

                        In your Dev environment, you have 3 servers... with 288 GB of RAM, 64 GB of RAM, and 16 GB of RAM... Assume RAM compatibility... What happens if you balance out those three servers and get them at least close to having the same amount of RAM?

                        Does that help you at all? If that is a good idea, then why not look at converting them to XenServer and switching to Local Storage? You could then replicate the VMs to each of the three hosts, or you could set up HA-Lizard.
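
To put rough numbers on the balancing idea, here is a minimal sketch using the three RAM sizes quoted above. It assumes the DIMMs really can be moved freely between hosts (the "assume RAM compatibility" part) and ignores hypervisor overhead, CPU, and storage. The function names are made up for illustration.

```python
def surviving_ram_gb(host_ram_gb):
    """RAM left in the cluster if the largest host fails (the worst case)."""
    if len(host_ram_gb) < 2:
        raise ValueError("need at least two hosts to absorb a failure")
    return sum(host_ram_gb) - max(host_ram_gb)


def balanced(host_ram_gb):
    """Same total RAM, spread evenly across the same number of hosts."""
    total = sum(host_ram_gb)
    return [total / len(host_ram_gb)] * len(host_ram_gb)


current = [288, 64, 16]  # GB, from the post above

print(surviving_ram_gb(current))            # 80 GB left if the big host dies
print(surviving_ram_gb(balanced(current)))  # roughly 245 GB left after balancing
```

As it stands, losing the big host leaves only 80 GB for everything. With the same total spread evenly, losing any one host still leaves about two thirds of the RAM.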

                        • dafyreD
                          dafyre

                          If I am not mistaken, with planned outages you can actually migrate VMs from one XenServer host to another (this is also true for Hyper-V, IIRC) without having shared storage... So for maintenance, you can live migrate from one XenServer host to another (it would copy the storage too)... Whether or not that is feasible depends on the size of your VMs and the speed of your network... among other things.
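
For a feel of the "size of your VMs and speed of your network" trade-off, here is a back-of-the-envelope sketch. It only estimates the raw copy time for a VM's disk over the migration link. The `efficiency` factor is an assumed fudge for protocol overhead and contention, not a measured XenServer or Hyper-V figure, and the 500 GB disk size is a made-up example.

```python
def copy_time_seconds(disk_gb, link_gbps, efficiency=0.7):
    """Estimate seconds to copy disk_gb gigabytes over a link_gbps (gigabit/s) link.

    efficiency is an assumed fraction of line rate actually achieved.
    """
    if disk_gb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("disk_gb >= 0, link_gbps > 0, 0 < efficiency <= 1 required")
    usable_gbytes_per_sec = link_gbps / 8 * efficiency  # bits -> bytes, then derate
    return disk_gb / usable_gbytes_per_sec


# Hypothetical 500 GB VM over the 1Gb and 10Gb networks mentioned in this thread.
for gbps in (1, 10):
    minutes = copy_time_seconds(500, gbps) / 60
    print(f"{gbps:>2} Gb/s: about {minutes:.0f} minutes")
```

Under these assumptions the same VM takes around an hour and a half on 1Gb but under ten minutes on 10Gb, which is why the link speed matters as much as the VM size. Real migrations also have to re-copy blocks that change during the transfer, so treat this as a lower bound.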

                          • DashrenderD
                            Dashrender @coliver

                            @coliver said:

                            @donaldlandru said:

                            @Dashrender said:

                            @scottalanmiller said:

                            Going to VSAN, Starwind, DRBD, etc. would be an "orders of magnitude leap" that is not warranted. It just can't make sense. What you have today and what you are talking about moving to are insanely "low availability." Crazy low. And no one had any worries or concerns about that, right?

                            That's just it - the company probably thinks they have that super high level of availability and the fact that they've never had a failure feeds that fire of belief.

                            This has always been the issue I've had when I try to redesign/spend more money on a new solution. I get the push back - "well, we did it that cheap way before and it worked for 8 years, why do I suddenly now need to use this other way, clearly the old cheap way works."

                            This -- all day long! It worked for the 7 years before you got here, it will keep working long after. I fight this fight every day.

                            All you need is one good outage and the tune changes. Or at least that's what I experienced at my last position.

                            Of course, that part is pretty obvious. That outage also helps the company come to a more realistic understanding of its uptime needs, and what types of outages it can really handle. But many companies have never had to deal with one, so we are stuck where we are.

                            • donaldlandruD
                              donaldlandru @dafyre

                              @dafyre said:

                              In your Dev environment, you have 3 servers... with 288 GB of RAM, 64 GB of RAM, and 16 GB of RAM... Assume RAM compatibility... What happens if you balance out those three servers and get them at least close to having the same amount of RAM?

                              Does that help you at all? If that is a good idea, then why not look at converting them to XenServer and switching to Local Storage? You could then replicate the VMs to each of the three hosts, or you could set up HA-Lizard.

                              The two smaller servers pre-date my time with the company and were likely back of truck specials. Both of these are slated to be replaced next year with a single server with similar specs to the big server. The smallest one is already maxed out and the other one doesn't make sense to upgrade just to retire.

                              I also don't need HA on these (I don't have HA today on these), so I think this is an opportunity to move to a different platform.

                              • scottalanmillerS
                                scottalanmiller @donaldlandru

                                @donaldlandru said:

                                @Dashrender said:

                                @scottalanmiller said:

                                Going to VSAN, Starwind, DRBD, etc. would be an "orders of magnitude leap" that is not warranted. It just can't make sense. What you have today and what you are talking about moving to are insanely "low availability." Crazy low. And no one had any worries or concerns about that, right?

                                That's just it - the company probably thinks they have that super high level of availability and the fact that they've never had a failure feeds that fire of belief.

                                This has always been the issue I've had when I try to redesign/spend more money on a new solution. I get the push back - "well, we did it that cheap way before and it worked for 8 years, why do I suddenly now need to use this other way, clearly the old cheap way works."

                                This -- all day long! It worked for the 7 years before you got here, it will keep working long after. I fight this fight every day.

                                We average more than ten years from 'just a server'. High Availability is for when you need MORE than that.

                                • DashrenderD
                                  Dashrender @scottalanmiller

                                  @scottalanmiller said:

                                  We average more than ten years from 'just a server'. High Availability is for when you need MORE than that.

                                  I'm not really sure what this means?

                                  Any Tier one, and possibly most Tier two, servers should last 10 years. Is that 10 years without a single failure? I'm guessing not.

                                  • scottalanmillerS
                                    scottalanmiller @donaldlandru

                                    @donaldlandru said:

                                    In real life I am not sure if it works; on paper it does. It is a false sense of security, but the MSA does have active/active controllers built in (10Gb iSCSI), redundant power supplies, and of course the disks are in a RAID. The risks that are not mitigated by the single chassis are:

                                    Not active/active. It has codependent controllers that fail together. It's the opposite of what people expect when they say "redundant". It's the "two straw houses next door in a fire" scenario. Having two houses is redundant, but if they are both made of straw and there is a fire, the redundant house will provide zero protection while very likely making a fire that much more likely to happen or to spread. Active/Active controllers from HP start in the 3PAR line, not the MSAs.

                                    All that other redundant stuff is a red herring. EVERY enterprise server has all of that redundancy but without the cripplingly dangerous dual controllers, making any normal server MORE reliable than the MSA, not less. If anyone talks to you about the "redundant" parts in an MSA, you are getting a sales pitch from someone trying very hard to trick you, unless they point out that every server has those things so this is "just another server".

                                    • scottalanmillerS
                                      scottalanmiller @donaldlandru

                                      @donaldlandru said:

                                      @dafyre said:

                                      In your Dev environment, you have 3 servers... with 288 GB of RAM, 64 GB of RAM, and 16 GB of RAM... Assume RAM compatibility... What happens if you balance out those three servers and get them at least close to having the same amount of RAM?

                                      Does that help you at all? If that is a good idea, then why not look at converting them to XenServer and switching to Local Storage? You could then replicate the VMs to each of the three hosts, or you could set up HA-Lizard.

                                      The two smaller servers pre-date my time with the company and were likely back of truck specials. Both of these are slated to be replaced next year with a single server with similar specs to the big server. The smallest one is already maxed out and the other one doesn't make sense to upgrade just to retire.

                                      I also don't need HA on these (I don't have HA today on these), so I think this is an opportunity to move to a different platform.

                                      Something to consider here is Scale. It would be a forklift operation, more or less, but it would let you consolidate everything and get HA thrown in, all for one price. It would not be cheap, but you could move workloads over as needed. Start with three nodes and replace the Dev environment up front, then start moving over the Ops environment as you can.

                                      Easily doesn't fit, but it makes this model really easy and gives you all of the features that you want with essentially zero effort.

                                      • scottalanmillerS
                                        scottalanmiller @dafyre

                                        @dafyre said:

                                        If I am not mistaken, with planned outages you can actually migrate VMs from one XenServer host to another (this is also true for Hyper-V, IIRC) without having shared storage... So for maintenance, you can live migrate from one XenServer host to another (it would copy the storage too)... Whether or not that is feasible depends on the size of your VMs and the speed of your network... among other things.

                                        Yes, because XS includes "Storage vMotion" functionality for free.

                                        • scottalanmillerS
                                          scottalanmiller @donaldlandru

                                          @donaldlandru said:

                                          • Chassis failure (I am sure it can happen, but the only part in the chassis is the backplane and some power routing)
                                          • Software bug -- most likely failure to occur
                                          • Human error (oops I just unplugged the storage chassis)

                                          Chassis failure is uncommon, but common enough that it gets discussed on SW regularly as people have their units die. We only see that every so many months, but it does happen and is not to be ignored. This one issue puts this into a "blade" risk scenario, and we've seen, just this month, people lose entire blade enclosures because of backplane or control issues. It's a small risk in the relative sense but a very real one.

                                          Software bugs are huge on the MSA and any device in this class. They are magnified by the dual controllers, so they become extremely risky and cause outages at a pace that seems to dramatically outscale standard servers.

                                          Human error is big and I've seen some pretty dramatic ones. It's more likely on an MSA than on local storage.

                                          • DashrenderD
                                            Dashrender @scottalanmiller

                                            @scottalanmiller said:

                                            Human error is big and I've seen some pretty dramatic ones. It's more likely on an MSA than on local storage.

                                            Your example in the Scale panel at SWorld was pretty epic. The woman who deleted the wrong thing and had sudden health problems on top of the company being completely down.
