This is Part 3, the final part of our series on the top things you should know about vSAN. For Part 1 of the article, click here.
#4 – Dealing with disk replacement
In vSAN, failed components can be in an absent or a degraded state. Depending on the component state, vSAN uses different approaches to recover virtual machine data.
A component is in degraded state if vSAN detects a permanent component failure and assumes that the component is not going to recover to working state.
It can be caused by:
- Failure of a flash caching device
- Magnetic or flash capacity device failure
- Storage controller failure
Each of these indicates a Permanent Device Loss (PDL) state.
vSAN starts rebuilding the affected components immediately.
A component is in absent state if vSAN detects a temporary component failure where the component might recover and restore its working state.
It can be caused by:
- Lost network connectivity
- Failure of a physical network adapter
- ESXi host failure
- Unplugged flash caching device
- Unplugged magnetic disk or flash capacity device
Each of these indicates an All Paths Down (APD) state.
vSAN starts rebuilding absent components if they are not available within a certain time interval. By default, vSAN starts rebuilding absent components after 60 minutes.
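The two recovery paths described so far can be sketched in a few lines of Python. This is only an illustration of the rules stated in the text, not vSAN code: the cause names, `component_state`, and `rebuild_start` are hypothetical, and the only value taken from the article is the 60-minute default delay.

```python
from datetime import datetime, timedelta

# Failure causes from the lists in the text, grouped by the state they lead to.
# The string identifiers themselves are made up for this sketch.
DEGRADED_CAUSES = {
    "flash_cache_failure",
    "capacity_device_failure",
    "storage_controller_failure",
}
ABSENT_CAUSES = {
    "lost_network_connectivity",
    "nic_failure",
    "host_failure",
    "unplugged_flash_cache",
    "unplugged_capacity_device",
}

# Default wait before absent components are rebuilt, as stated in the text.
DEFAULT_REPAIR_DELAY = timedelta(minutes=60)


def component_state(cause: str) -> str:
    """Map a failure cause to the component state described in the text."""
    if cause in DEGRADED_CAUSES:
        return "degraded"
    if cause in ABSENT_CAUSES:
        return "absent"
    raise ValueError(f"unknown failure cause: {cause}")


def rebuild_start(cause: str, detected_at: datetime,
                  repair_delay: timedelta = DEFAULT_REPAIR_DELAY) -> datetime:
    """Return the earliest time a rebuild would start for this failure."""
    if component_state(cause) == "degraded":
        return detected_at              # permanent failure: rebuild immediately
    return detected_at + repair_delay   # temporary failure: wait for recovery first
```

For example, a host failure detected at 12:00 would lead to a rebuild at 13:00 if the host has not come back by then, while a storage controller failure triggers a rebuild right away.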
vSAN monitors the performance of each storage device and proactively isolates unhealthy devices. It detects gradual failure of a storage device and isolates the device before congestion builds up within the affected host and the entire vSAN cluster.
If a disk experiences sustained high latencies or congestion, vSAN treats the device as a dying disk and evacuates or rebuilds its data. No user action is required unless the cluster lacks resources or has inaccessible objects.
There are two types of disk failure (SSD and magnetic), and each has a slightly different impact on the vSAN cluster. If an SSD fails, the disk group it fronts becomes unusable and all components on all magnetic disks within the disk group are marked as “degraded”.
If a magnetic disk fails, the disk group continues to function, but all components on the failed magnetic disk are marked as degraded.
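The difference between the two failure types can be illustrated with a small Python model. The `DiskGroup` class, the disk names, and `degraded_components` are hypothetical and exist only for this sketch; they do not correspond to any vSAN API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DiskGroup:
    """Toy model of a disk group: one cache SSD in front of capacity disks."""
    cache_ssd: str
    # Maps each capacity (magnetic) disk name to the components stored on it.
    capacity_disks: Dict[str, List[str]] = field(default_factory=dict)


def degraded_components(group: DiskGroup, failed_disk: str) -> List[str]:
    """Return the components marked degraded when `failed_disk` fails."""
    if failed_disk == group.cache_ssd:
        # SSD failure: the whole disk group is unusable, so every component
        # on every capacity disk in the group is degraded.
        return [c for comps in group.capacity_disks.values() for c in comps]
    if failed_disk in group.capacity_disks:
        # Magnetic disk failure: only the components on that disk are degraded.
        return list(group.capacity_disks[failed_disk])
    # The disk does not belong to this group, so nothing here is affected.
    return []
```

With a group of one SSD and two magnetic disks, losing the SSD degrades the components on both magnetic disks, whereas losing one magnetic disk degrades only the components stored on it.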
When vSAN detects a disk failure, it immediately creates a new mirror copy or witness on a different ESXi host or a different disk group, provided there are sufficient resources to store the new copy. If there are insufficient resources, vSAN waits until resources are added. Once you have added a new disk, or even a host, the recovery begins.
If a disk is removed (i.e. not a SMART disk failure), vSAN waits 60 minutes before rebuilding the component on another node. The 60-minute wait is configured on each ESXi host (advanced settings > VSAN.ClomRepairDelay) and can be changed if required. The setting must be the same on each ESXi host in the vSAN cluster.
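Because the repair delay has to match on every host, it can be worth checking for drift after changing it. The following Python sketch assumes you have already collected the VSAN.ClomRepairDelay value from each host into a dictionary; how the values are gathered is left out, and `find_mismatched_hosts` is a hypothetical helper, not part of any VMware tool.

```python
from collections import Counter
from typing import Dict


def find_mismatched_hosts(delays: Dict[str, int]) -> Dict[str, int]:
    """Return the hosts whose repair delay differs from the most common value.

    `delays` maps a host name to its VSAN.ClomRepairDelay value in minutes.
    The most common value is treated as the intended setting, which is an
    assumption of this sketch.
    """
    if not delays:
        return {}
    expected, _count = Counter(delays.values()).most_common(1)[0]
    return {host: value for host, value in delays.items() if value != expected}
```

For a cluster where two hosts report 60 and one reports 90, the function returns only the host set to 90. An empty result means all hosts agree.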
#3 – The impact of unicast on vSAN network topologies
One of the new features in vSAN 6.6 is unicast support. This makes it much easier to deploy VMware’s virtual storage product from a (cloud/multi-site) networking perspective. Until now, vSAN has used multicast for metadata updates between the hosts in the cluster. With unicast, these updates go through the vCenter Server, which becomes the single source of truth and eliminates the need for multicast.
To use unicast networking, we need vSAN 6.6. The first step is to update vCenter Server and the ESXi hosts to at least version 6.5d. In vSAN 6.6, all hosts communicate over unicast, and the vCenter Server becomes the source of truth for cluster membership. If you are upgrading from a previous version of vSAN, vSAN automatically switches to unicast once all hosts have been upgraded to vSAN 6.6.
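The upgrade rule above boils down to an "all hosts" check. As a minimal Python sketch, assuming each host's vSAN version is available as a `(major, minor)` tuple; `cluster_network_mode` is a hypothetical name, and the function ignores any other conditions a real cluster may have:

```python
from typing import Dict, Tuple

# First vSAN version with unicast support, as stated in the text.
UNICAST_MIN_VERSION = (6, 6)


def cluster_network_mode(host_versions: Dict[str, Tuple[int, int]]) -> str:
    """Return "unicast" only when every host runs vSAN 6.6 or later."""
    if host_versions and all(
        version >= UNICAST_MIN_VERSION for version in host_versions.values()
    ):
        return "unicast"
    # At least one host is still on an older version (or no hosts are known).
    return "multicast"
```

During a rolling upgrade the result stays "multicast" until the last host reaches 6.6, which mirrors the automatic switch described above.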
Removing multicast as a requirement makes vSAN deployments much easier from a networking perspective. It also simplifies the vSAN network design for local-site, multi-site, or cloud environments, and it can save you time when troubleshooting.
#2 – Getting the most out of Monitoring and Logging
Health Check is awesome, but it only shows you current information. What about historic data and trends? Monitor your environment closely with:
- Web Client
- vCenter VOBs
Or anything else that you want to use.
– vSAN Observer is an RVC (Ruby vSphere Console) utility with a graphical user interface that displays vSAN-related statistics from a vSAN client perspective. It is intended to provide deeper insight into vSAN performance characteristics and analytics.
The vSAN Observer user interface displays performance information for the following items:
- Statistics of the physical disk layer
- Deep-dive physical disk group details
- CPU usage statistics
- Consumption of vSAN memory pools
- Physical and in-memory object distribution across vSAN clusters
– The vSAN Management Pack for vRealize Operations Manager gives administrators a “single pane of glass” for monitoring their vSAN infrastructure. When the management pack is installed, a selection of new vSAN-related dashboards is available out of the box. A “troubleshooting” dashboard displays the vSAN topology as well as any vSAN-related health issues. This topology diagram includes the relationships between virtual machines, ESXi hosts, disk groups, SSDs, magnetic disks, network connections, etc. on a vSAN cluster.
The vSAN management pack is specifically designed to accelerate time to production with vSAN, optimize application performance for workloads running on vSAN and provide unified management for the Software Defined Datacenter (SDDC).
– vSAN content pack for vRealize Log Insight
VMware vRealize Log Insight is a log management and analytics solution that gives the data center administrator an easy way to see context, correlation, and meaning behind otherwise obfuscated log content. Log Insight can aggregate log data from a variety of different sources, and creates a time series database of events that can be easily mined using a very simple query mechanism, and interactive graphs.
The new content pack introduces a dashboard specifically for vSAN. The VMware – vSAN content pack for Log Insight provides deep knowledge and insight into VMware Virtual SAN logs. It contains various dashboards, queries, and alerts that give Virtual SAN administrators better diagnostics and troubleshooting capabilities.
The Log Insight content pack for vSAN includes the following dashboards:
- The host state information dashboard
- The disk group failures dashboard
- The networking dashboard
- The congestion dashboard
- The object configurations dashboard
- The configuration failures dashboard
- The health dashboard
- The object events dashboard
#1 – What is Congestion?
Congestion is a flow control mechanism used by vSAN. Whenever there is a bottleneck in a lower layer of vSAN (closer to the physical storage devices), vSAN uses this flow control (aka congestion) mechanism to relieve the bottleneck by reducing the rate of incoming I/O at the vSAN ingress, i.e. the vSAN clients (VM consumption). It does this by introducing an I/O delay at the ingress that is equivalent to the delay the I/O would have incurred due to the bottleneck at the lower layer. Thus, it is an effective way to shift latency from the lower layers to the ingress without changing the overall throughput of the system. vSAN measures congestion as a scalar value from 0 to 255.
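To make the idea of a congestion value driving an ingress delay more concrete, here is a minimal Python sketch. The article does not say how vSAN translates a congestion value into a delay, so the linear mapping, the `max_delay_ms` parameter, and the function name `ingress_delay_ms` are all assumptions made purely for illustration. Only the 0 to 255 range comes from the text.

```python
# Upper bound of the congestion scale, as stated in the text.
MAX_CONGESTION = 255


def ingress_delay_ms(congestion: int, max_delay_ms: float = 100.0) -> float:
    """Return the delay added to each incoming I/O at the ingress.

    ASSUMPTION: the delay grows linearly with the congestion value and
    reaches `max_delay_ms` at 255. The real vSAN mapping is not described
    in the article and may look quite different.
    """
    if not 0 <= congestion <= MAX_CONGESTION:
        raise ValueError("congestion must be between 0 and 255")
    return max_delay_ms * congestion / MAX_CONGESTION
```

With no congestion the function adds no delay, and the delay increases as the reported congestion value rises, so clients slow down instead of piling more I/O onto the bottlenecked lower layer.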
Possible causes of congestion:
- SSD LLOG/PLOG congestion
- High network errors (observable via vSAN Observer or, new in vSAN 6.6, with vCenter)
- vSAN software layer
- CPU contention
- CPU contention in the ESXi scheduler, seen when the BIOS is set up incorrectly