Removing shared storage from VMware environment
-
Warning! Wall of text ahead. I tried to provide the same details I am considering so I can receive the best feedback. Please help me think about things I haven't already considered, or suggest additional ideas. Thanks!
Problem Statement: Operations (production) virtual environment is running in an inverted pyramid of doom. Performance of storage array is low due to oversubscription. Looking for advice to correct this.
Background: This is a continuation of my other thread (http://mangolassi.it/topic/6236/zfs-based-storage-for-medium-vmware-workload) where I was looking for feedback on our delivery "development" environment storage. Our operations network consists of the following hardware and dependency chain. (See diagram at bottom for overview)
VMHost hardware
- VMH-OPS1/2 VMware 5.5 essentials plus
- HP ProLiant DL360p G8
- 2x Intel Xeon E5-2667
- 128GB memory
- 2x 146GB SAS disk (VMware OS install)
- 8x total 2.5" drive bays
- 1x p420i controller with 1GB ram and BBU
- Internal SD card slot available not used
- 1x dual port 10Gb
- 1x quad port 1Gb
- 1x dual port 1Gb
Storage hardware:
- SAN-ARR0 is an HP MSA P2000 with dual 1Gb iSCSI controllers (4 ports each controller)
- each controller port pair A/B shares the same VLAN -- 4 VLANs for this storage network
Primary Network to VMHosts
- Serviced by two 10Gb switches (HP 5820X-24XG-SFP+)
- Each server has a single 10Gb dual port card (SPOF - single NIC)
Storage Network to VMHosts
- Serviced by two 1Gb switches (HP V1910-24g)
- Servers have 1x 4 port GNIC and 1x 2 port GNIC
- Two links from each switch go to one port on each card
- Each link is a separate VLAN
- MPIO round robin
Services currently hosted on the operations platform
- Active Directory/DNS (2008R2)- 2 servers (one on each host)
- DHCP - 1 server
- Exchange 2010 Standard - 1 server
- SharePoint 2010 Foundation - 1 server
- Windows File server (2008R2) 1.5TB data - 1 server
- SQL Server 2008 R2 (SharePoint,VMware) - 1 server
- dozen or so other low IO VMs for business applications, mostly CentOS
I acknowledge this setup should not have been deployed. Inexperience, coupled with an outside vendor pushing this solution, is what drove this implementation.
Opportunity: The business has decided to move to Office 365 next year on the E3 plan. This allows us to move Exchange/SharePoint off of the on-premise infrastructure and shrink our storage needs. Given the recent discussions around SSD and the likely return of RAID5, I set out to examine how to remove risks and dependencies in the chain.
Q1 and Q2 Goals:
- Migrate all operations Windows servers (that are not being eliminated by Office 365) to Server 2012 R2, or maybe 2016, though I don't think it will be ready in time.
- Migrate business to Office 365 (120 users)
- Eliminate P2000 hosted storage from operations environment
Plan:
- Reinstall VMware on embedded SD card slot to regain two SAS bays
- Add second 10Gb card to each server
- Install 8*SSD into RAID5 in each server (Currently looking at SDSSDXPS-480G-G25 and MZ7GE480HMHP-00003)
- Migrate data hosted on P2000 back to local storage
- If the business determines the file server requires reliability/redundancy, set up a second file server with DFSR
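As a rough sanity check on the RAID5 plan above, usable capacity in RAID5 is the capacity of all drives minus one drive's worth of parity. The sketch below is only an illustration in Python; the function name is made up for this example, and it works in decimal GB without accounting for controller or filesystem overhead.

```python
def raid5_usable_gb(drive_count: int, drive_size_gb: float) -> float:
    """Usable capacity of a RAID5 set: one drive's worth of space goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID5 needs at least three drives")
    return (drive_count - 1) * drive_size_gb


# Figures from the plan: 8 x 480 GB SSDs per server.
usable = raid5_usable_gb(8, 480)
print(f"Usable per host: {usable:.0f} GB")
```

That works out to 3360 GB per host before overhead, which can be compared against the 1.5TB file server plus whatever the remaining VMs need.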
-
I am assuming that you are planning to stay with VMware at this time, yes?
-
@dafyre said:
I am assuming that you are planning to stay with VMware at this time, yes?
At this time yes. We have the license, support, inhouse experience, and most recently we purchased Veeam Availability suite this year. Migration to a different hypervisor may be an option in the future, but for the moment is not.
-
I would suggest migrating to O365 and getting your Exchange and SharePoint servers shut down being the first step, even before upgrading the other VM OSes.
Are you able to run your entire infrastructure on a single server at this point? Say if VMH-OPS1 explodes or has a meltdown?
-
@dafyre said:
I would suggest migrating to O365 and getting your Exchange and SharePoint servers shut down being the first step, even before upgrading the other VM OSes.
Are you able to run your entire infrastructure on a single server at this point? Say if VMH-OPS1 explodes or has a meltdown?
Basis for the O365 first? Curious if there is benefit or other reasoning?
Yes, the system was designed to hold both servers' workload if required. Neither server is currently more than 30-40% utilized.
Which leaves us with even more leftover resources after the 365 migration. Which we may want to use for a 10 user RDS environment.
-
@donaldlandru said:
@dafyre said:
I would suggest migrating to O365 and getting your Exchange and SharePoint servers shut down being the first step, even before upgrading the other VM OSes.
Are you able to run your entire infrastructure on a single server at this point? Say if VMH-OPS1 explodes or has a meltdown?
Basis for the O365 first? Curious if there is benefit or other reasoning?
Yes, the system was designed to hold both servers' workload if required. Neither server is currently more than 30-40% utilized.
Which leaves us with even more leftover resources after the 365 migration. Which we may want to use for a 10 user RDS environment.
This gets your data off your shared storage. It also is one less thing to migrate when the time comes.
-
@donaldlandru said:
- Active Directory/DNS (2008R2)- 2 servers (one on each host)
- DHCP - 1 server
- Exchange 2010 Standard - 1 server
- SharePoint 2010 Foundation - 1 server
- Windows File server (2008R2) 1.5TB data - 1 server
- SQL Server 2008 R2 (SharePoint,VMware) - 1 server
- dozen or so other low IO VMs for business applications, mostly CentOS
Most of this looks like it can be removed. Go to Office 365 and you remove the Exchange, SharePoint and SQL servers. vCenter can run on SQL Server Express unless you have more than 5 VM hosts.
Then you just have your DC and a file server to worry about. Since they are Server 2008 I'd just build new ones from scratch.
-
@dafyre and @Jason - I see where you are coming from and this makes sense. What if I take a hybrid approach to this?
Steps:
1. Refresh VMH-OPS1
   a. Migrate all VMs to VMH-OPS2
   b. Shut down VMH-OPS1
   c. Remove SAS drives, replace with 8 * SSD and internal SD card for OS
   d. Create RAID5 in P420i
   e. Install VMware on SD card - rejoin to cluster
2. Refresh VMH-OPS2
   a. Migrate all VMs to VMH-OPS1
   b. Shut down VMH-OPS2
   c. Remove SAS drives, replace with 8 * SSD and internal SD card for OS
   d. Create RAID5 in P420i
   e. Install VMware on SD card - rejoin to cluster
   f. Rebalance VMs across cluster
3. Install and configure new Windows Server 2012 R2 for domain controllers
4. Remove Windows Server 2008 R2 domain controllers
5. Complete any other upgrades
Steps 1 and 2 could be done in a few hours and give me something to do before our Office 365 deployment, which is currently looking like a Q2 activity. I could then work on any remaining tasks in parallel with the Office 365 migration. This doesn't cause me to migrate any data unnecessarily; every VM I move gets the immediate bonus of better disk IO and no more IPOD, and I can do that sooner because I already have the budget for a storage upgrade this year.
Thoughts?
-
Since your current solution is designed to be able to run everything on a single server, after you migrate most of that load to O365 I don't see why you wouldn't retire the second server completely.
By running two servers you have:
- twice the cooling cost
- twice the number of servers to manage/update
- twice the power consumption
- twice the amount of UPS
And best of all, you'd have twice the storage to purchase and an extra 10 Gb card to buy.
According to Scott, these servers have something like 4 hours of downtime every 7-8 years, on average. Unless you really need to lower that downtime, the expense of those drives and everything else I listed is pretty high.
-
You mention that you're having performance issues today - do you know where those issues are coming from? Disk IO not enough? Production network not fast enough, etc?
-
@Dashrender said:
Since your current solution is designed to be able to run everything on a single server, after you migrate most of that load to O365 I don't see why you wouldn't retire the second server completely.
By running two servers you have:
- twice the cooling cost
- twice the number of servers to manage/update
- twice the power consumption
- twice the amount of UPS
And best of all, you'd have twice the storage to purchase and an extra 10 Gb card to buy.
According to Scott, these servers have something like 4 hours of downtime every 7-8 years, on average. Unless you really need to lower that downtime, the expense of those drives and everything else I listed is pretty high.
Interesting thought. It is really 1 of 7 servers in this location.
So a few bullet points to support the multiple servers:
- We are a 24/7 organization; we have users in multiple locations working at any time throughout the day. I will still need to service application and workstation authentication.
- Being 24/7 means I can't drop the whole thing for maintenance.
- The time managing 2-3 extra virtual machines is negligible
- This single server consumes 300 watts -- the cost that adds, in exchange for being able to service everything without maintenance downtime, is again in my opinion negligible
- The business is still undecided on whether same sign-on is sufficient for Office 365 vs single sign-on. I think same sign-on is sufficient, but if the business wants single sign-on then ADFS will need to be deployed and available to service O365 login requests.
I would agree with your solution in a smaller, single location business -- it just wouldn't jibe with the way we operate.
-
@Dashrender said:
You mention that you're having performance issues today - do you know where those issues are coming from? Disk IO not enough? Production network not fast enough, etc?
It is definitely the storage network that is slowing us down. I am sharing 8 SATA spindles among too many virtual machines. Plus, MPIO on the 1Gb side gets saturated quite frequently, but upgrading the controllers in the P2000 to 10Gb iSCSI costs more than the SSDs I referenced above.
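For a sense of scale on the link saturation, here is a small Python sketch of the theoretical throughput ceilings involved. It ignores iSCSI/TCP overhead, so real numbers will be lower, and the count of four active paths is an assumption for illustration, not a measured value from this environment.

```python
def link_ceiling_mb_s(gbit_per_s: float) -> float:
    """Theoretical ceiling of a link in MB/s (decimal), ignoring protocol overhead."""
    return gbit_per_s * 1000 / 8


ASSUMED_ACTIVE_PATHS = 4  # assumption: four 1Gb iSCSI paths in the round-robin set

single_1gb = link_ceiling_mb_s(1)
aggregate_1gb = ASSUMED_ACTIVE_PATHS * single_1gb
single_10gb = link_ceiling_mb_s(10)

print(f"One 1Gb path:        {single_1gb:.0f} MB/s")
print(f"{ASSUMED_ACTIVE_PATHS} x 1Gb paths (MPIO): {aggregate_1gb:.0f} MB/s")
print(f"One 10Gb link:       {single_10gb:.0f} MB/s")
```

Even with perfect round-robin distribution, the aggregate of several 1Gb paths stays well under a single 10Gb link, and local SSDs behind the P420i bypass the storage network entirely.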
-
@donaldlandru said:
Basis for the O365 first? Curious if there is benefit or other reasoning?
This will free up resources for the other VMs so that you're not running too close to the max with everything on one host.
Yes, the system was designed to hold both servers' workload if required. Neither server is currently more than 30-40% utilized.
Okay, so everything on one host isn't such a big concern.
-
Keep in mind that with your VMware license you should be able to do Storage vMotion, etc., from the shared storage up to the local storage on VMH-OPS1 after it gets rebuilt.
-
@donaldlandru said:
- We are a 24/7 organization; we have users in multiple locations working at any time throughout the day. I will still need to service application and workstation authentication.
Being 24/7 doesn't mean you can't afford downtime. @scottalanmiller has a lot of posts on this. It's about how much that costs you, not about how often you work. We are a Fortune 100 and we have downtime. Heck, we have pretty regular momentary (once a month or so) blips with our Exchange systems.
-
@Jason said:
@donaldlandru said:
- We are a 24/7 organization; we have users in multiple locations working at any time throughout the day. I will still need to service application and workstation authentication.
Being 24/7 doesn't mean you can't afford downtime. @scottalanmiller has a lot of posts on this. It's about how much that costs you, not about how often you work. We are a Fortune 100 and we have downtime. Heck, we have pretty regular momentary (once a month or so) blips with our Exchange systems.
Let's look at it from a different angle
- The hardware is already owned (apart from the $1600 for SSDs) and only 3 years old
- The software is already owned
- The "data center" is already built out and overcooled
To me, saying let's discard this server we already own and license in favor of now creating outages for maintenance does not make any sense.
-
@donaldlandru said:
- Being 24/7 means I can't drop the whole thing for maintenance.
How much maintenance do you do? What is the annual downtime caused by VMware? Only VMware and hardware maintenance is assisted by having the second server.
-
@donaldlandru said:
To me, saying let's discard this server we already own and license in favor of now creating outages for maintenance does not make any sense.
That might be true, but let's do a little napkin math...
- Why is it overcooled? That should be fixed regardless of anything else. Just wasting money.
- If you add heat, you still cool more, regardless of how much you cool now, correct? So that is more money.
- The power draw costs money.
- How much downtime does this prevent?
Add those together and see if it makes sense.
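To put rough numbers on that napkin math, here is a minimal Python sketch. The 300 W draw and the "4 hours every 7-8 years" downtime figure come from earlier in this thread; the electricity rate is a placeholder assumption, and cooling overhead is not modeled at all.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_energy_kwh(watts: float) -> float:
    """Energy used in a year by a device drawing a constant number of watts."""
    return watts * HOURS_PER_YEAR / 1000


def annual_power_cost(watts: float, rate_per_kwh: float) -> float:
    """Yearly electricity cost for that device at a flat rate per kWh."""
    return annual_energy_kwh(watts) * rate_per_kwh


def expected_downtime_hours_per_year(outage_hours: float, years_between: float) -> float:
    """Average yearly downtime if an outage of this length happens once per interval."""
    return outage_hours / years_between


ASSUMED_RATE_PER_KWH = 0.12  # hypothetical rate; substitute your own
SERVER_WATTS = 300           # figure quoted in the thread

print(f"Energy: {annual_energy_kwh(SERVER_WATTS):.0f} kWh/year")
print(f"Power cost at assumed rate: {annual_power_cost(SERVER_WATTS, ASSUMED_RATE_PER_KWH):.2f}/year")
print(f"Downtime avoided: {expected_downtime_hours_per_year(4, 7.5):.2f} hours/year")
```

Multiply the downtime hours by what an hour of outage costs the business, then compare that against the power, cooling, and hardware spend for the second server.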
-
@scottalanmiller said:
@donaldlandru said:
- Being 24/7 means I can't drop the whole thing for maintenance.
How much maintenance do you do? What is the annual downtime caused by VMware? Only VMware and hardware maintenance is assisted by having the second server.
Assuming a non-DFSR file server, that would be assisted by this as well.
@donaldlandru, you said you have 7 servers. Can't you install a DC on one of those? Are any of those virtualized or are they all bare metal?
-
@Dashrender said:
@scottalanmiller said:
@donaldlandru said:
- Being 24/7 means I can't drop the whole thing for maintenance.
How much maintenance do you do? What is the annual downtime caused by VMware? Only VMware and hardware maintenance is assisted by having the second server.
Assuming a non-DFSR file server, that would be assisted by this as well.
@donaldlandru, you said you have 7 servers. Can't you install a DC on one of those? Are any of those virtualized or are they all bare metal?
DFSR would do it on a single physical host for software upgrades, too.