Wednesday, January 27, 2016

A constant state of change...

I struggled with whether I wanted to write something formal on this or not, but at this point I think the benefits outweigh the drawbacks. As has no doubt flooded your Facebook feed over the last couple of days if you know me, or the IT industry news if not, there have been a lot of changes in the industry lately, driven by a multitude of factors: the economy, strategy, consolidation, the state of the market, and so on.

Whether there is truth to any of that really makes no difference. The fact is that the office I have personally worked out of for over 10 years was announced as closing after a 12-year run. I have been a part of the community at this office for the majority of my IT career, minus a couple of years at Microsoft right out of college. I call it a community because that is truly what it came to be to me after all these years... definitely much more than a workplace. The majority of the staff here were let go as a result of this action. I don't know any details, but from what I have heard everyone was treated incredibly well, which is great to hear.

For me personally, I was lucky enough to have had the opportunity to move to a remote, home-office-based team around three years ago. Although I still maintained a desk at the office, it also meant that I maintained a lot of the relationships I had built prior to moving. Now I move home for good, with the shortest commute ever: 15 feet from my bed. Seeing all of this happen in front of me was by far one of the hardest things I have gone through in my career. I honestly think I have a case of survivor's guilt as a result; however, the initial shock is gone, and I was finally able to get a full night's sleep last night. I am sure I will be in need of human interaction sooner rather than later.

I know that there have been way too many inspirational quotes posted, at least to my feeds, so I won't preach. What I will say is that every time I have seen an event like this, although it is painful at the time, it ends up working out. Take some time to let it sink in, recharge, and use the amazing amount of knowledge that you gained from the experience to move forward.

To everyone from the office: I will miss everything, from the numerous charity events we did together as a community, Big Bike being one of our favourites:





To things such as winning Diane the Dragon as a team at our last Christmas party:


I definitely will miss it all. I am sure that we will remain friends long after this week... I learned so much from some of you, which has led to my growing both personally and technically. I wish everyone the best... and never stop growing!

So I guess I will end this post by saying ... "So Long, and Thanks for all the Fish!"

Tuesday, January 19, 2016

Virtual SAN Stretch Clusters – Real World Design Practices (Part 2)


This is the second part of a two-part blog series, as there was just too much detail for a single blog. For part 1, see http://vmware.jmweb.ca/2016/01/virtual-san-stretch-clusters-real-world.html.

As I mentioned at the beginning of the last blog, I want to start off by saying that all of the details here are based on my own personal experiences. It is not meant to be a comprehensive guide to setting up stretch clustering for Virtual SAN, but rather a set of pointers to show the type of detail most commonly asked for. 
Hopefully it will help you prepare for any projects of this type.

Continuing on with the configuration, the next set of questions regarded networking!


Networking, Networking, Networking


With sizing and configuration behind us, the next step was to enable Virtual SAN and set up the stretch clustering. As soon as we turned it on, however, we got the infamous “Misconfiguration Detected” message for the networking.


In almost every engagement I have been a part of, this has been a problem, even when the networking team insisted everything was already set up and configured. This used to turn into a fight, but it gets easier with the new Health UI and its multicast checks. Generally, when multicast is not configured properly, you will see something similar to the screenshot shown below.



It definitely makes the process of going back to the networking team easier, and as an added bonus there is no messy command-line syntax needed to validate the configuration. I can honestly say the health interface is one of the best features introduced for Virtual SAN!

Once we had this configured properly, the cluster came online and we were able to complete the configuration, including stretch clustering, the proper vSphere High Availability settings, and the affinity rules.

The final question that came up on the networking side was about the recommendation that L3 be the preferred communication mechanism to the witness host. The big issue with using L2 is the potential for traffic to be redirected through the witness site in the case of a failure, and the witness link has a substantially lower bandwidth requirement. A great description of this concern is in the networking section of the Stretched Cluster Deployment Guide.
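As a rough planning aid, the witness link's bandwidth requirement scales with the number of vSAN components rather than with virtual machine I/O. A minimal sketch follows; the 2 Mbps per 1,000 components rule of thumb is from VMware's stretched-cluster sizing guidance as I recall it, so verify it against the deployment guide before relying on it:

```python
# Back-of-the-envelope witness-link sizing (my sketch, not a VMware tool).
# Assumption: roughly 2 Mbps of witness bandwidth per 1,000 components,
# per VMware's stretched-cluster sizing guidance as I recall it.
def witness_bandwidth_mbps(component_count, mbps_per_1000=2.0):
    return mbps_per_1000 * component_count / 1000.0

print(witness_bandwidth_mbps(15000))  # 30.0 Mbps for 15,000 components
```

Compare that with the site-to-site link, which must carry full write replication traffic; the difference is exactly why redirecting data traffic through the witness is a problem.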

In any case, the networking configuration is definitely more complex in stretched clustering because the network spans multiple sites. It is therefore imperative that it be configured correctly, not only to ensure that performance is at peak levels, but also to ensure there is no unexpected behavior in the event of a failure.

High Availability and Provisioning


All of this talk finally led to the conversation about availability. The beautiful thing about Virtual SAN is that with the “failures to tolerate” setting, you can ensure there are between one and three copies of the data available, depending on what is configured in the policy. Gone are the long conversations of trying to design this into a solution with proprietary hardware or software.

A difference with stretch clustering is that the maximum “failures to tolerate” is one. This is because there are only three fault domains: the two sites and the witness. Logically it makes sense: tolerating more failures is not possible with only three fault domains. The idea is that there is a full copy of the virtual machine data at each site, with components placed along site boundaries, which allows failover even if an entire site goes down.
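To illustrate why one failure is the ceiling, object availability reduces to a majority vote across the three fault domains. A minimal sketch (my illustration of the idea, not VMware code):

```python
# Each object has a data copy at site A, a data copy at site B, and a
# witness component at site C. It stays accessible while a majority of
# the three fault domains is reachable.
def object_accessible(site_a_up, site_b_up, witness_up):
    votes = sum([site_a_up, site_b_up, witness_up])
    return votes >= 2  # majority of 3 fault domains

assert object_accessible(True, False, True)       # one site lost: still up
assert not object_accessible(True, False, False)  # site plus witness lost: down
```

With only three voters, losing any two makes a majority impossible, which is exactly "failures to tolerate = 1".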

Of course, high availability (HA) needs to be aware of this. The way this is configured from a vSphere HA perspective is to assign the percentage of cluster resources allocation policy and set both CPU and memory to 50 percent:



This may seem like a LOT of resources, but when you think of it from a site perspective, it makes sense; if you have an entire site fail, resources in the failed site will be able to restart without issues.

The question came up as to whether more than 50 percent can be assigned to running workloads. It is possible, but if a site fails, not all virtual machines may be able to restart; this is why reserving 50 percent of resources is recommended. If you do want to run above that threshold, the usual approach is to set restart priorities so HA starts as many virtual machines as possible, beginning with the most critical ones. Personally, I recommend not going above 50 percent utilization in a stretch cluster.
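The arithmetic behind the recommendation can be sketched as follows (my own back-of-the-envelope check, assuming two identical data sites):

```python
# Why 50% admission control is the safe setting in a two-site stretch
# cluster: each site provides half of the cluster's capacity, so the
# consumed share must fit on the single surviving site after a failure.
def site_failure_safe(reserved_pct):
    usable_pct = 100 - reserved_pct  # what admission control lets VMs consume
    surviving_site_pct = 50          # capacity left after a full site failure
    return usable_pct <= surviving_site_pct

assert site_failure_safe(50)        # the recommended setting: everything restarts
assert not site_failure_safe(30)    # 70% consumed cannot fit on one site
```

Anything reserved below 50 percent means some portion of the workload may not restart after a full site failure, which is where the restart priorities come in.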

An additional question came up about using host and virtual machine affinity rules to control the placement of virtual machines. Unfortunately, assigning machines to these groups is not easy during the provisioning process, and it did not fit cleanly into the provisioning practices used in this environment. vSphere Distributed Resource Scheduler (DRS) does a good job of ensuring balance, but more control was needed than simply relying on DRS to balance the load. The end goal was that during provisioning, placement in the appropriate site would happen automatically rather than requiring manual steps from staff.

This discussion boiled down to the need for a change to provisioning practices. Currently, it is a manual configuration change, but it is possible to use automation such as vRealize Orchestrator to automate deployment appropriately. This is something to keep in mind when working with customers to design a stretch cluster, as changes to provisioning practices may be needed.
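To give a flavor of the kind of change discussed above, here is a hypothetical sketch of a provisioning-time placement helper. The group names and the selection rule are my own inventions, and in practice this logic would live inside an orchestration tool such as vRealize Orchestrator:

```python
# Hypothetical provisioning-time placement helper (not VMware code):
# map a requested site to the VM/host affinity group a new VM should
# join, falling back to alternating sites when no preference is given.
from itertools import cycle

SITE_GROUPS = {"siteA": "vm-group-siteA", "siteB": "vm-group-siteB"}
_round_robin = cycle(sorted(SITE_GROUPS))

def choose_affinity_group(preferred_site=None):
    site = preferred_site or next(_round_robin)
    if site not in SITE_GROUPS:
        raise ValueError(f"unknown site: {site}")
    return SITE_GROUPS[site]

assert choose_affinity_group("siteB") == "vm-group-siteB"
assert choose_affinity_group() in SITE_GROUPS.values()
```

The point is not the code itself but where it runs: the decision has to happen inside the provisioning workflow, which is why provisioning practices needed to change.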

Failure Testing


Finally, after days of configuration and design decisions, we were ready to test failures. This is always interesting because the conversation always varies between customers. Some require very strict testing and want to test every scenario possible, while others are OK doing less. After talking it over we decided on the following plan:
  • Host failure in the secondary site
  • Host failure in the primary site
  • Witness failure (both network and host)
  • Full site failure
  • Network failures
    • Witness to site
    • Site to site
  • Disk failure simulation
  • Maintenance mode testing 

This was a good balance of tests to show exactly what the different failures look like. Prior to starting, I always go over the health status windows for Virtual SAN as it updates very quickly to show exactly what is happening in the cluster.

The customer was really excited about how seamlessly Virtual SAN handles errors. The key is to prepare operationally and build a high comfort level with handling the worst-case scenario. Host and network failures look very similar when they first occur, and showing this is important, so I suggested running through several similar tests just to make sure the results could be distinguished reliably.

As an example, one of the most common failure tests requested (which many organizations don’t test properly) is simulating what happens if a disk fails in a disk group. Simply pulling a disk out of the server does not replicate what would happen if a disk actually fails, as a completely different mechanism is used to detect this. You can use the following commands to properly simulate a disk actually failing by injecting an error.  Follow these steps:
    1. Identify the disk device in which you want to inject the error. You can do this by using a combination of the Virtual SAN Health User Interface, and running the following command from an ESXi host and noting down the naa.<ID> (where <ID> is a string of characters) for the disk:

      esxcli vsan storage list

    2. Navigate to /usr/lib/vmware/vsan/bin/ on the ESXi host.
    3. Inject a permanent device error to the chosen device by running:

      python vsanDiskFaultInjection.pyc -p -d <naa.id>

    4. Check the Virtual SAN Health User Interface. The disk will show as failed, and the components will be relocated to other locations.
    5. Once the re-sync operations are complete, remove the permanent device error by running:

      python vsanDiskFaultInjection.pyc -c -d <naa.id>

    6. Once completed, remove the disk from the disk group and uncheck the option to migrate data. (This is not a strict requirement because data has already been migrated as the disk officially failed.)
    7. Add the disk back to the disk group.
    8. Once this is complete, all warnings should be gone from the health status of Virtual SAN.

      Note: Be sure to acknowledge and reset any alarms to green.
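Step 1 above lends itself to scripting. Here is a minimal sketch that pulls the naa.<ID> device names out of `esxcli vsan storage list` output; the sample below is an abbreviated approximation of the output format from memory, so adjust the pattern to your ESXi build:

```python
import re

# Extract naa.<ID> device names from "esxcli vsan storage list" output.
# The sample text approximates the real output format; verify on your host.
def list_vsan_devices(esxcli_output):
    return re.findall(r"^\s*Device:\s*(naa\.\S+)", esxcli_output, re.MULTILINE)

sample = """\
naa.500a07510f86d6b3
   Device: naa.500a07510f86d6b3
   Display Name: naa.500a07510f86d6b3
   Is SSD: true
naa.5000c5008e3d7d2f
   Device: naa.5000c5008e3d7d2f
   Is SSD: false
"""
print(list_vsan_devices(sample))
```

On an actual host you would feed this the real command output and then pass the chosen ID to the injection script in step 3.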

After performing all the tests in the above list, the customer had a very good feeling about the Virtual SAN implementation and their ability to operationally handle a failure should one occur.

Performance Testing


Last, but not least, was performance testing. Unfortunately, the 10G networking was not available while I was onsite. I would not recommend a gigabit network for most configurations, but since we were not yet in full production, we ran through many of the performance tests anyway and got an excellent baseline of what performance looks like on gigabit.

Briefly, because I could write an entire book on performance testing, the quickest and easiest way to test performance is with the Proactive Tests menu which is included in Virtual SAN 6.1:



It provides a really good mechanism to test different types of workloads that are most common – all the way from a basic test, to a stress test. In addition, using IOmeter for testing (based on environmental characteristics) can be very useful. 

In this case, to give you an idea of performance test results, we pretty consistently got a peak of around 30,000 IOPS on the gigabit network with 10 hosts in the cluster. I have since been told that once the 10G network was in place, this jumped to a peak of 160,000 IOPS for the same 10 hosts. Pretty amazing, to be honest.
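Some rough arithmetic shows why gigabit was the ceiling here. Assuming a 4 KB I/O size (my assumption; the post does not state the block size used), the replication traffic alone approaches line rate:

```python
# Convert an IOPS figure to network throughput, assuming 4 KB I/Os.
# vSAN must replicate writes across the inter-site link, so sustained
# IOPS is bounded by what that link can carry.
def throughput_mbps(iops, io_size_kb=4):
    return iops * io_size_kb * 8 / 1000.0  # KB -> kilobits, then -> Mbps

print(throughput_mbps(30000))   # 960.0 Mbps: essentially line rate for 1 GbE
print(throughput_mbps(160000))  # 5120.0 Mbps: comfortably inside 10 GbE
```

In other words, the 30,000 IOPS plateau was almost certainly the network saturating, not the storage.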

I will not get into the ins and outs of testing, as it very much depends on the area you are testing. I did want to show, however, that it is much easier to perform a lot of the testing this way than it was using the previous command line method. 

One final note I want to add in the performance testing area is that one of the key things (other than pure “my VM goes THISSSS fast” type tests), is to test the performance of rebalancing in the case of maintenance mode, or failure scenarios. This can be done from the Resyncing Components Menu: 



Boring by default perhaps, but when you either migrate data in maintenance mode or change a storage policy, this is where you can see the impact of resyncing components. Activity shows up either when an additional disk stripe is created for a disk, or when data is fully migrated off a host entering maintenance mode. The compliance screen will look like this:



Resyncing can represent a significant amount of time, and this view is incredibly useful when testing normal workloads, such as data migration during the enter-maintenance-mode workflow. Full migrations of data can be incredibly expensive, especially if the disks are large or if you are using gigabit rather than 10G networking. Convergence can often take a significant amount of time and bandwidth, so this view allows customers to plan for the amount of data to be moved during maintenance mode or in the case of a failure.
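For planning purposes, a quick estimate of how long a full migration might take can be sketched like this (my own arithmetic with an assumed link efficiency, not a VMware formula):

```python
# Estimate the wall-clock time for a full data migration off a host.
# The 70% efficiency factor is my assumption for protocol and
# contention overhead; tune it to observed resync rates.
def resync_hours(data_tb, link_gbps, efficiency=0.7):
    data_bits = data_tb * 1e12 * 8
    rate_bps = link_gbps * 1e9 * efficiency
    return data_bits / rate_bps / 3600

# Moving 10 TB off a host: roughly 32 hours on gigabit vs ~3 on 10 GbE.
print(round(resync_hours(10, 1), 1))
print(round(resync_hours(10, 10), 1))
```

Numbers like these make the gigabit-versus-10G discussion very concrete when planning maintenance windows.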


Well, that is what I have for this blog post. Again, this is obviously not an exhaustive list of every decision point; it's just where we had the most discussion and what I wanted to share. I hope this gives you an idea of the challenges we faced, and helps you prepare for the decisions you may face when implementing stretch clustering for Virtual SAN. It is truly a pretty cool feature and an excellent addition to the ways business continuity and disaster recovery plans can be designed for an environment.

Thursday, January 7, 2016

Virtual SAN Stretch Clusters – Real World Design Practices (Part 1)

(Also available on the VMware Consulting Blog: https://blogs.vmware.com/consulting/2016/01/virtual-san-stretch-clusters-real-world-design-practices-part-1.html)


This is part one of a two-part blog series, as there was just too much detail for a single blog. I want to start off by saying that all of the details here are based on my own personal experiences. It is not meant to be a comprehensive guide for setting up stretch clustering for Virtual SAN, but a set of pointers to show the type of detail that is most commonly asked for. Hopefully it will help prepare you for any projects that you are working on.

Most recently in my day-to-day work I was asked to travel to a customer site to help with a Virtual SAN implementation. It was not until I got on site that I was told that the idea for the design was to use the new stretch clustering functionality that VMware added to the Virtual SAN 6.1 release. This functionality has been discussed by other folks in their blogs, so I will not reiterate much of the detail from them here. In addition, the implementation is very thoroughly documented by the amazing Cormac Hogan in the Stretched Cluster Deployment Guide.


What this blog is meant to be is a guide to some of the most important design decisions that need to be made. I will focus on the most recent project I was part of; however, the design decisions are pretty universal. I hope that the detail will help people avoid issues such as the ones we ran into while implementing the solution.

A Bit of Background


For anyone not aware of stretch clustering functionality, I wanted to provide a brief overview. Most of the details you already know about Virtual SAN still remain true. What it really amounts to is a configuration that allows two sites of hosts connected with a low latency link to participate in a virtual SAN cluster, together with an ESXi host or witness appliance that exists at a third site. This cluster is an active/active configuration that provides a new level of redundancy, such that if one of the two sites has a failure, the other site will immediately be able to recover virtual machines at the failed site using VMware High Availability. 

The configuration looks like this:


This is accomplished by using fault domains and is configured directly from the fault domain configuration page for the cluster: 


Each site is its own fault domain which is why the witness is required. The witness functions as the third fault domain and is used to host the witness components for the virtual machines in both sites. In Virtual SAN Stretched Clusters, there is only one witness host in any configuration. 



For deployments that manage multiple stretched clusters, each cluster must have its own unique witness host. 

The nomenclature used to describe a Virtual SAN Stretched Cluster configuration is X+Y+Z, where X is the number of ESXi hosts at data site A, Y is the number of ESXi hosts at data site B, and Z is the number of witness hosts at site C. 

Finally, with stretch clustering, the current maximum configuration is 31 nodes (15 + 15 + 1 = 31 nodes). The minimum supported configuration is 1 + 1 + 1 = 3 nodes. This can be configured as a two-host virtual SAN cluster, with the witness appliance as the third node.
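These limits are easy to encode as a sanity check; a minimal sketch based on the figures above (the identical-data-sites requirement is discussed in the sizing section below):

```python
# Validate an X+Y+Z stretched-cluster layout: data sites between 1 and
# 15 hosts each and identical in size, with exactly one witness host.
def valid_stretch_config(x, y, z):
    return x == y and 1 <= x <= 15 and z == 1

assert valid_stretch_config(15, 15, 1)      # 31-node maximum
assert valid_stretch_config(1, 1, 1)        # 3-node minimum
assert not valid_stretch_config(10, 8, 1)   # data sites must match
assert not valid_stretch_config(15, 15, 2)  # only one witness per cluster
```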

With all these considerations, let’s take a look at a few of the design decisions and issues we ran into.

Hosts, Sites and Disk Group Sizing


The first question that came up, as it almost always does, was about sizing. This customer had initially used the Virtual SAN TCO Calculator for sizing, and the hardware was already delivered. Sounds simple, right? Well, perhaps, but it does get more complex when talking about a stretch cluster. The questions that came up regarded the number of hosts per site, as well as how the disk groups should be configured.

Starting off with the hosts, one of the big things discussed was the possibility of having more hosts in the primary site than in the secondary. For stretch clusters, an identical number of hosts in each site is a requirement. This makes it a lot easier from a decision standpoint, and when you look closer the reason becomes obvious: with a stretched cluster, you have the ability to fail over an entire site. Therefore, it is logical to have identical host footprints. 

With disk groups, however, the decision point is a little more complex. Normally, my recommendation here is to keep everything uniform. Thus, if you have 2 solid state disks and 10 magnetic disks, you would configure 2 disk groups with 5 disks each. This prevents unbalanced utilization of any one component type, regardless of whether it is a disk, disk group, host, network port, etc. To be honest, it also greatly simplifies much of the design, as each host/disk group can expect an equal amount of love from vSphere DRS. 
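The uniform layout described above can be sketched as a simple divider (my illustration; disk names are placeholders):

```python
# Split a host's disks into identical disk groups: one SSD fronting an
# equal share of the magnetic disks. Refuses uneven splits, which is
# exactly the situation where keeping a hot spare makes sense.
def disk_groups(ssds, hdds):
    if len(hdds) % len(ssds) != 0:
        raise ValueError("cannot split capacity disks evenly; keep a spare")
    per_group = len(hdds) // len(ssds)
    return [
        {"cache": ssd, "capacity": hdds[i * per_group:(i + 1) * per_group]}
        for i, ssd in enumerate(ssds)
    ]

groups = disk_groups(["ssd0", "ssd1"], [f"hdd{i}" for i in range(10)])
assert len(groups) == 2 and len(groups[0]["capacity"]) == 5
```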

In this configuration, though, it was not so clear because one additional disk was available, so the division of disks cannot be equal. After some debate, we decided to keep one disk as a “hot spare,” so there was an equal number of disk groups—and disks per disk group—on all hosts. This turned out to be a good thing; see the next section for details.

In the end, much of this is the standard approach to Virtual SAN configuration, so other than site sizing, there was nothing really unexpected. 

Booting ESXi from SD or USB 


I don’t want to get too in-depth on this, but briefly: when you boot an ESXi 6.0 host from a USB device or SD card, Virtual SAN trace logs are written to a RAMdisk, and they are not persistent. This actually serves to preserve the life of the device, as the amount of data written can be substantial. In this configuration the logs are automatically offloaded to persistent media during shutdown or a system crash (PANIC). However, if the hosts have more than 512 GB of RAM, these devices are generally not large enough to hold that volume of data, so logs, Virtual SAN trace logs, or core dumps may be lost or corrupted for lack of space, and the ability to troubleshoot failures will be greatly limited.

So, in these cases it is recommended to configure a drive for the core dump and scratch partitions. This is also the only supported method for handling Virtual SAN traces when booting an ESXi from a USB stick or SD card. 

That being said, when we were in the process of configuring the hosts in this environment, we saw the “No datastores have been configured” warning message pop up – meaning persistent storage had not been configured. This triggered the whole discussion; the error is similar to the one in the vSphere Web Client in this screenshot:


In the vSphere Client, this error also comes up when you click to the Configuration tab:


The spare disk turned out to be useful because we were able to use it to configure the ESXi scratch dump and core dump partitions. This is not to say we were seeing crashes, or even expected to; in fact, we saw no unexpected behavior in the environment up to this point. Rather, since this was a new environment, we wanted to ensure we’d have the ability to quickly diagnose any issue, and having this configured up-front saves significant time in support. This is of course speaking from first-hand experience.

In addition, syslog was set up to export logs to an external source at this time. Whether using the syslog service that is included with vSphere, or vRealize Log Insight (amazing tool if you have not used it), we were sure to have the environment set up to quickly identify the source of any problem that might arise. 

For more details on this, VMware's KB articles provide step-by-step instructions.

I guess the lesson here is that when you are designing your Virtual SAN cluster, remember that persistence for logs, traces, and core dumps is a best practice. If you have a large memory configuration, the easiest approach is to install ESXi to a hard drive and place the scratch and core dump partitions there. This also simplifies post-installation tasks and ensures you can collect all the information support might require to diagnose issues.


Witness Host Placement


The witness host was the next piece we designed. Officially, the witness must be in a distinct third site in order to properly detect failures. It can either be a full host or a virtual appliance residing outside of the virtual SAN cluster. The cool thing is that if you use an appliance, it actually appears differently in the Web client:



For the witness host in this case, we decided to use the witness appliance rather than a full host. This way, it could be migrated easily, because the networking to the third site was not set up yet. As a result, for the initial implementation while I was onsite, the witness was local to one of the sites and would be migrated as soon as the networking was ready. This is definitely not a recommended configuration, but for testing, or for a non-production proof of concept, it does work. Keep in mind that a site failure may not be properly detected unless the cluster is properly configured.

With this, I conclude Part 1 of this blog series; hopefully, you have found this useful. Stay tuned for Part 2!