Thursday, December 4, 2014

Software-Defined Storage – A new way of thinking

There are certainly plenty of blogs out there that talk about Software-Defined Storage, and Virtual SAN in particular. My goal in this post is not to rehash the same old information, but to share insights from my own experience.

Most recently I was challenged with getting up to speed on Virtual SAN and developing an architecture design for it. Having only heard the marketing details, I found it pretty intimidating at first; however, it truly does live up to the hype about being “radically simple”. It most definitely changed the way I think about storage. What I have found is that the more I work with Virtual SAN, the less concerned I become with the underlying storage.

This is a bit foreign to me. If being a part of support has taught me anything (other than that following instructions to the letter is important), it is that the array needs to be correctly configured to get the best performance out of your environment. All sorts of issues can occur otherwise.

To this end, I remember being skeptical at first because focusing on policies seemed so foreign. After having used it and tested it in customer environments, I can honestly say that my mind was changed by the sheer power it gives an administrator. I say this as if it happened in an instant, but it actually happened over the course of a couple of months. At the time, I was involved in several customer projects with Virtual SAN, and I saw that in every case there was a distinct set of things that always happened. From these experiences I was able to build the following workflow, which can be used when working through a Virtual SAN design:




Looking at it further, I generally break this flow chart down into four areas:
  1. Hardware Selection – In every environment I have worked in, there has been a hardware problem. I would guess that 75% of the problems I have seen when implementing Virtual SAN were the result of hardware selection or configuration. This includes things such as unsupported devices or incorrect firmware/drivers.

    Note: VMware does not provide support for devices that are not on the Virtual SAN Compatibility List. When selecting hardware, be sure that it is on the list!
  2. Software Configuration – The configuration is simple; rarely have I seen questions about actually turning it on. You merely click a check box and it configures itself (assuming, of course, that the underlying configuration is correct). If it is not, the results can be mixed, for example if the networking is not configured correctly or if the disks have not been presented properly.
  3. Storage Policy – The storage policy is at first a huge decision point. This is what gives Virtual SAN its power: the ability to configure the performance and availability characteristics of each virtual machine.
  4. Monitoring / Performance Testing / Failure Testing – The final area covers how you monitor and test the configuration.
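
To make the storage policy decision a little more concrete, here is a minimal sketch of the availability trade-off. It assumes a mirroring policy in which tolerating n failures requires n + 1 full copies of the data and at least 2n + 1 hosts. The post does not name specific policy settings, so the function, the class, and the 100 GB figure are illustrative assumptions, and witness and metadata overhead are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MirrorFootprint:
    replicas: int    # full copies of the data
    min_hosts: int   # hosts needed to place the copies plus witnesses
    raw_gb: float    # raw capacity consumed, ignoring overhead


def mirror_footprint(vmdk_gb: float, failures_to_tolerate: int) -> MirrorFootprint:
    """Estimate the footprint of one mirrored virtual disk.

    Assumption (not from the post): plain mirroring, where tolerating
    n failures needs n + 1 replicas and 2n + 1 hosts.
    """
    if vmdk_gb < 0:
        raise ValueError("vmdk_gb must be non-negative")
    if failures_to_tolerate < 0:
        raise ValueError("failures_to_tolerate must be non-negative")
    replicas = failures_to_tolerate + 1
    return MirrorFootprint(
        replicas=replicas,
        min_hosts=2 * failures_to_tolerate + 1,
        raw_gb=vmdk_gb * replicas,
    )


if __name__ == "__main__":
    # Hypothetical 100 GB virtual disk at three availability levels.
    for ftt in (0, 1, 2):
        print(ftt, mirror_footprint(100, ftt))
```

Running numbers like these per policy early in the design helps tie the policy choice back to hardware selection, since every extra failure tolerated costs another full copy of the data.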


All of these things should be taken into account in any design for Virtual SAN, or the design is not really complete. Now, I could talk through a lot of this for hours. Rather than doing that, I thought it would be better to post my top three gotchas and lessons learned from the projects I have been involved with.

Top 3 Gotchas from PSE

Here are my top three gotchas that I have run into with Virtual SAN: 

  1. Network Configuration – No matter what the networking team says, always validate the configuration. The “Misconfiguration detected” error is by far the most common thing I have seen:



    Normally this means that either the port group has not been successfully configured for Virtual SAN or that multicast has not been set up properly. If I were to guess, most of the issues I have seen are the result of multicast setup. On Cisco switches, unless an IGMP snooping querier has been configured OR IGMP snooping has been explicitly disabled on the ports used for Virtual SAN, it will generally fail. Leaving the switch in its default configuration means that multicast is simply not configured, so even if the network admin says that it is configured properly, double check it to avoid any pain.
  2. Network Speed – Although 1Gb networking is supported, and I have seen it operate effectively in small environments, 10Gb networking is highly recommended for most configurations. I don’t just say this because the documentation says so. From experience, what it really comes down to is not the regular everyday usage of Virtual SAN. Where people run into problems is when an issue occurs, such as during failures, or during periods of heavy virtual machine creation. Replication traffic during these periods can be substantial and cause huge performance degradation while it is occurring. The only way to know is to test what happens during a failure or during a peak provisioning cycle. This is critical, as it tells you what the expected performance will be. When in doubt, always use 10Gb networking.
  3. Storage Adapter Choice – Although seemingly simple, the queue depth of the controller should be at least 256 to ensure the best performance. This is not as much of an issue now as it was several months ago, because the VMware Virtual SAN compatibility list should no longer include any cards with a queue depth under 256. Be sure to verify, though. As an example, one card, when first released, had its queue depth artificially limited by the driver software. Performance was dramatically impacted until an updated driver was released.
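
To put the network speed gotcha into rough numbers, the sketch below estimates how long it takes to copy a given amount of data between hosts at different link speeds. It is back-of-the-envelope arithmetic only: the `usable_fraction` parameter (the share of line rate actually available for replication traffic) and the 2000 GB example are assumptions, not measurements, and real rebuild times also depend on disks, controllers, and competing VM traffic.

```python
def resync_seconds(data_gb: float, link_gbps: float, usable_fraction: float = 0.7) -> float:
    """Rough time, in seconds, to move data_gb gigabytes over one link.

    data_gb         -- amount of data to copy, in gigabytes
    link_gbps       -- nominal link speed, in gigabits per second
    usable_fraction -- assumed share of line rate available for this
                       traffic (hypothetical default, tune from testing)
    """
    if data_gb < 0:
        raise ValueError("data_gb must be non-negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    gigabits = data_gb * 8  # bytes to bits
    return gigabits / (link_gbps * usable_fraction)


if __name__ == "__main__":
    # Hypothetical failure that forces 2000 GB of data to be re-replicated.
    for speed in (1, 10):
        hours = resync_seconds(2000, speed) / 3600
        print(f"{speed} Gb link: about {hours:.1f} hours")
```

The point is less the exact figures than the ratio: the same failure keeps a 1Gb network saturated roughly ten times longer, which is exactly the window in which users notice the degradation. Actual failure testing is still the only way to confirm what your environment does.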

Top 3 Lessons Learned

Each of these lessons came at the price of a half or full day of troubleshooting. Here are my lessons learned:
  1. Always Verify Firmware/Driver Versions – This one always seems to be overlooked, but I am stating it because of experiences on site with customers. One example that comes to mind is where we had three identical servers, bought and shipped in the same order, that we were using to configure Virtual SAN. Two of them worked fine; the third just wouldn’t cooperate, no matter what we did. After investigating for several hours we found that not only would Virtual SAN not configure, but all drives attached to that host were read-only. The utility provided with the card itself showed that the card was a revision behind on the firmware. As soon as we upgraded the firmware, it came online and everything worked brilliantly. (Long story short, it turns out reading documentation is not one of my strong suits: we struggled with that firmware update until we realized that a COLD power off was required.)
  2. Pass-through/RAID0 Controller Configuration – It is almost always recommended to use a pass-through controller so that Virtual SAN owns the drives and has full control of them. In many cases, however, the controller offers only RAID0 mode. Proper configuration of this mode is required to avoid problems and to maximize performance for Virtual SAN. First, ensure any controller caching is set to 100% Read Cache. Second, configure each drive as its own “array” rather than one giant array of disks. As an example of an incorrect configuration that causes unnecessary overhead, several times I have seen all disks placed in a single RAID configuration at the controller. This shows up as a single disk to the operating system (ESXi in this case), which is not desired. To fix this you have to go into the controller and configure it correctly, but you also have to ensure that the partition table (if previously configured) is removed, which in many cases can involve zeroing out the drive if there is no option to remove the header.
  3. Performance Testing – The lesson learned here is that you can do an infinite amount of testing, so where do you start, and where do you stop? Wade Holmes from the Virtual SAN technical marketing team at VMware has an amazing blog series on this that I highly recommend reviewing for guidance. His methodology allows for both basic and more in-depth testing of your Virtual SAN configuration.
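
The firmware lesson above lends itself to a simple sanity check: collect the firmware and driver versions from every host and flag any host that disagrees with its peers. The sketch below assumes you have already gathered that inventory by whatever means you prefer; the host names, component names, and version strings are all hypothetical. It compares hosts against the most common version only, so the result still has to be checked against the Virtual SAN Compatibility List.

```python
from collections import Counter
from typing import Dict, List, Tuple

# host name -> {component name: version string}
Inventory = Dict[str, Dict[str, str]]


def find_version_outliers(inventory: Inventory) -> List[Tuple[str, str, str, str]]:
    """Return (host, component, found, expected) for each mismatch.

    "expected" is simply the most common version of that component
    across all hosts; a host that lacks the component reports "missing".
    """
    components = sorted({c for versions in inventory.values() for c in versions})
    outliers = []
    for component in components:
        seen = Counter(v[component] for v in inventory.values() if component in v)
        expected, _ = seen.most_common(1)[0]
        for host in sorted(inventory):
            found = inventory[host].get(component, "missing")
            if found != expected:
                outliers.append((host, component, found, expected))
    return outliers


if __name__ == "__main__":
    # Hypothetical inventory: three "identical" servers from the same order.
    hosts = {
        "esx01": {"controller_firmware": "1.1", "controller_driver": "6.0"},
        "esx02": {"controller_firmware": "1.1", "controller_driver": "6.0"},
        "esx03": {"controller_firmware": "1.0", "controller_driver": "6.0"},
    }
    for host, component, found, expected in find_version_outliers(hosts):
        print(f"{host}: {component} is {found}, peers run {expected}")
```

A check like this takes minutes and would have pointed straight at the third server in the story above, instead of several hours of investigation.
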
I hope that these pointers help in your evaluation and implementation of Virtual SAN. Before diving head first into anything, I always like to make sure that I am informed about the subject matter. Virtual SAN is no different. To be successful you need to make sure you have genuine subject matter expertise for the design, whether that is in-house or from a professional services organization. Remember, VMware is happy to be your trusted advisor if you need assistance with Virtual SAN or any of our other products!