    Why HALizard and XenServer Failed so heavily

    IT Discussion
    39 Posts 7 Posters 6.3k Views
    • DustinB3403 @scottalanmiller

      @scottalanmiller said in Why HALizard and XenServer Failed so heavily:

      @Dashrender said in Why HALizard and XenServer Failed so heavily:

      This doesn't bode well for a change of lizard or DRBD...

      What makes you mention DRBD? No indication of any DRBD issue.

      In fact, DRBD was running and had no issues.

      This is why Host1 was acting as iSCSI storage for Host2 to run the VMs.

      That bond was working without issue. The system failed softly: the hardware was still functional, and the VMs were still functional.

      XAPI broke, along with the boot device, so we lost migration functionality, along with backup functionality.

      • scottalanmiller @Dashrender

        @Dashrender said in Why HALizard and XenServer Failed so heavily:

        Well he said his VMs were corrupt, but we don't know why.

        No, we don't know why, but there is zero reason to suspect DRBD. Nothing points to or suggests DRBD. Yet we know that they were forced off, which is likely to corrupt them. So there is a high chance of that, and the indicators point to it. DRBD replicating the corruption is DRBD doing its job correctly.

        So while there is no proof that DRBD wasn't involved, there isn't anything pointing to it.

        • scottalanmiller @Dashrender

          @Dashrender said in Why HALizard and XenServer Failed so heavily:

          If you feel that DRBD is completely blameless I'd love to hear why to add to my understanding of how the system is supposed to work.

          Think of DRBD as RAID 1. If you get corruption on your RAID 1 volume, do you suspect that the RAID system corrupted the data or that corrupted data was written to the array? And yet it is possible that the RAID itself was the issue, sure. But it's not likely. There are places where we would expect it to happen, and a scenario to cause that to happen was there. If DRBD was going to corrupt things, likely it would have done it while the system was running, not at that exact moment. It's way too much of a coincidence.
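To illustrate the RAID 1 analogy, here is a minimal Python sketch of a two-copy mirror. It is a toy model and not how DRBD is implemented; the `MirroredVolume` class and the block contents are made up for the example. It only shows that a mirror faithfully copies whatever it is handed, including a half-written block from a VM that was forced off.

```python
from dataclasses import dataclass, field


@dataclass
class MirroredVolume:
    """Toy model of a two-copy mirror (RAID 1 or DRBD-style replication)."""

    primary: dict = field(default_factory=dict)
    secondary: dict = field(default_factory=dict)

    def write(self, block: int, data: bytes) -> None:
        # The mirror copies whatever bytes it is handed. It has no notion
        # of whether the writer considers those bytes valid.
        self.primary[block] = data
        self.secondary[block] = data

    def in_sync(self) -> bool:
        # True when both copies hold identical blocks.
        return self.primary == self.secondary


volume = MirroredVolume()
volume.write(0, b"clean filesystem metadata")

# Hypothetical: a VM that is forced off leaves a truncated block behind.
volume.write(0, b"clean filesys")

copies_match = volume.in_sync()
corruption_replicated = volume.secondary[0] == b"clean filesys"
```

Both copies still match after the bad write, which is the point of the analogy: the mirror did its job, and the damage came from what was written to it.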

          • scottalanmiller

            What if you had just pulled out the cables so that it looked like a power failure? Wouldn't that have fixed things, except for the corrupted VMs? It might have protected against that too, but that's just random.

            • Dashrender

              Dustin, how did you discover the failure?

              Did things just stop working? I.e., did your VMs go read-only or just stop responding?

              • dafyre

                Stories like this are why I'll happily continue to run my hypervisor on spinning rust / RAID 1.

                Yeah, sure, it could happen to HDDs too, but that's why we have RAID 1. If he'd been able to come up with a way to RAID 1 two USB drives together, it might not have been an issue for him either.

                (Can you not use MDRAID on two USB sticks?)

                • scottalanmiller @Dashrender

                  @Dashrender said in Why HALizard and XenServer Failed so heavily:

                  Dustin, how did you discover the failure?

                  Did things just stop working? I.e., did your VMs go read-only or just stop responding?

                  If they went read only, that would cause us to look at DRBD. That's the kind of thing that a DRBD failure would look like.

                  • scottalanmiller @dafyre

                    @dafyre said in Why HALizard and XenServer Failed so heavily:

                    (Can you not use MDRAID on two USB sticks?)

                    Yup

                    • DustinB3403 @scottalanmiller

                      @scottalanmiller said in Why HALizard and XenServer Failed so heavily:

                      What if you had just pulled out the cables so that it looked like a power failure? Wouldn't that have fixed things, except for the corrupted VMs? It might have protected against that too, but that's just random.

                      We didn't know at the time that the boot drive was dead on host1, so we weren't certain of the status of the cluster.

                      We only knew that both systems had XAPI hung (the VMs were running fine until we touched the cluster).

                      • scottalanmiller @DustinB3403

                        @DustinB3403 said in Why HALizard and XenServer Failed so heavily:

                        @scottalanmiller said in Why HALizard and XenServer Failed so heavily:

                        What if you had just pulled out the cables so that it looked like a power failure? Wouldn't that have fixed things, except for the corrupted VMs? It might have protected against that too, but that's just random.

                        We didn't know at the time that the boot drive was dead on host1, so we weren't certain of the status of the cluster.

                        We only knew that both systems had XAPI hung (the VMs were running fine until we touched the cluster).

                        That's one of the risks of clusters: so much complexity. With a single host, you could have dropped to Xen directly and shut things down.

                        • DustinB3403 @scottalanmiller

                          @scottalanmiller Which is why we're going with the standalone servers using XO's continuous replication.

                          • Reid Cooper

                            Wow, that's crazy. Glad that it recovered.

                            • DustinB3403 @Reid Cooper

                              @Reid-Cooper said in Why HALizard and XenServer Failed so heavily:

                              Wow, that's crazy. Glad that you had a recovery solution planned out, great job!

                              I FTFY.

                              • FATeknollogee @DustinB3403

                                @DustinB3403 said in Why HALizard and XenServer Failed so heavily:

                                @scottalanmiller Which is why we're going with the standalone servers using XO's continuous replication.

                                Will you still use USB sticks for boot?

                                • DustinB3403 @FATeknollogee

                                  @FATeknollogee Yes, we have to.

                                  We just have to make sure we keep our backup drive current. We've moved away from the cluster approach and are using single servers with CR.

                                  • FATeknollogee

                                    @DustinB3403 "Why" do you have to?

                                    • DustinB3403 @FATeknollogee

                                      @FATeknollogee said in Why HALizard and XenServer Failed so heavily:

                                      @DustinB3403 "Why" do you have to?

                                      Well, at the moment I have to see if I can create two partitions on the same array with the equipment I have.

                                      As of last night I couldn't find a way to do it.

                                      I have to use LVM to create the partitions needed, but I'm not yet sure how I'll be able to do that.

                                      • scottalanmiller @DustinB3403

                                        @DustinB3403 said in Why HALizard and XenServer Failed so heavily:

                                        @FATeknollogee said in Why HALizard and XenServer Failed so heavily:

                                        @DustinB3403 "Why" do you have to?

                                        Well, at the moment I have to see if I can create two partitions on the same array with the equipment I have.

                                        As of last night I couldn't find a way to do it.

                                        I have to use LVM to create the partitions needed, but I'm not yet sure how I'll be able to do that.

                                        For at least the fifth time... we don't make partitions here, we make volumes. LVM makes volumes. I've corrected you every time you've used the word partitions. Partitions and volumes are not the same thing.

                                        • scottalanmiller @DustinB3403

                                          @DustinB3403 said in Why HALizard and XenServer Failed so heavily:

                                          I have to use LVM to create the volumes needed, but I'm not yet sure how I'll be able to do that.

                                          Resize what is there. Then lvcreate what you need.
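As a rough sketch of that "resize, then lvcreate" sequence, the Python helper below only builds the two command lines as strings and runs nothing. The volume group `vg0`, the logical volume names `root` and `data`, and the sizes are hypothetical placeholders, not values from this thread. It also assumes the existing volume holds a filesystem that `lvresize --resizefs` can shrink, which is not true of every filesystem, so check that and have a backup before trying anything like it.

```python
import shlex
from typing import List


def lvm_plan(vg: str, existing_lv: str, new_size: str,
             new_lv: str, new_lv_size: str) -> List[str]:
    """Return the planned commands as strings; nothing is executed."""
    steps = [
        # Step 1: resize the existing logical volume (and its filesystem)
        # to free up extents in the volume group.
        ["lvresize", "--resizefs", "-L", new_size, f"{vg}/{existing_lv}"],
        # Step 2: create the new logical volume in the freed space.
        ["lvcreate", "-L", new_lv_size, "-n", new_lv, vg],
    ]
    return [shlex.join(step) for step in steps]


# Hypothetical names and sizes, for illustration only.
plan = lvm_plan("vg0", "root", "20G", "data", "100G")
for command in plan:
    print(command)
```

Printing the plan first makes it easy to review the exact commands before anyone runs them by hand as root.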

                                          • Dashrender

                                            He doesn't have to, but in the emergency situation he was in yesterday, it was the fastest solution to getting himself back online.

                                            I'm pretty sure we could get him running, this time only on the single HDD that's presented by the RAID controller.
