
vRealize Operations Manager (vROps) Troubleshooting – Lesson Learned

As I was recently delivering a VMware Professional Services engagement, I learned a valuable lesson concerning VMware vRealize Operations Manager (vROps). I should use vROps to troubleshoot vROps!

While working to enable a new vROps customer to monitor, analyze, and troubleshoot their business application workloads and infrastructure, I missed the opportunity to use vROps to solve an issue with vROps. After covering vROps features such as monitoring, optimization, resource reclamation, and compliance, we ran into issues activating the vRealize Application Monitoring management pack and using the activated Service Discovery Management Pack. We successfully enabled Service Discovery on the vCenter Server adapter, but the Manage Services page did not populate as expected. We expected it to display all the underlying VMs, including those where service monitoring was not yet working due to issues such as incorrect credentials or an unsupported VMware Tools version, as you see in this example from my (MEJEER, LLC) lab environment.

Instead, the rows in the Manage Services table were empty. Additionally, the Service Discovery Adapter instance indicated the number of Objects Collected was zero, as shown here.

We suspected the issue was related to the error we received when attempting to install the vRealize Orchestrator management pack and to the problem we encountered when attempting to activate the vRealize Application Monitoring management pack. We immediately started examining log files and trying other brute-force approaches.

If we had simply looked in vROps for any alerts related to the vROps node (virtual appliance), we would have discovered an alert for guest file system usage and quickly identified the root cause and solution. Specifically, the out-of-the-box alert named "One or more guest file systems of the virtual machine are running out of disk space" had been triggered days earlier, but it had gone unnoticed because the (test) environment had many alerts.

In the Service Discovery adapter instance's log file, we saw an error writing a file.

We analyzed the error and discovered that the root partition of the vROps node was 100% full. About an hour passed from the moment we began reviewing logs until we found the full root partition. If we had simply looked at the vROps alerts, we could have found it within minutes.

In my defense, the environment was new and was being used for proof of concept and user enablement. We had added an endpoint to collect data from an old vSphere environment, which triggered hundreds of out-of-the-box vROps alerts. (The customer intends to address all of the alerts in time.) So we were ignoring alerts while I provided informal hands-on training to the customer. I planned to guide the customer in creating a dashboard that provides a single pane of glass for observing the health, alerts, performance, and risk of their management cluster, including the vROps nodes. If the vROps issue had occurred after the engagement, the customer likely would have caught it proactively before the root partition reached 100% full (while the alert was still at the warning level).

In case you are wondering, the root cause was a known issue in vROps 8.0 that was fixed in a later version. The error is described in VMware KB 76154. The fix is to restart the rsyslog service (service rsyslog restart).
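
For reference, here is a minimal sketch of the check and the fix from an SSH session on the vROps node. The df command is standard Linux tooling; the rsyslog restart is the workaround described in the KB article above.

df -h /                   # confirm the root partition is at 100% usage
service rsyslog restart   # workaround from VMware KB 76154
df -h /                   # verify that space has been released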

NOTE: We learned that if we had caught the issue before the root partition filled, we might have been able to fix the problem by restarting the vROps appliance. But if we had restarted the vROps appliance after the root partition filled, it would have put the appliance in a very bad state and required us to open a VMware Service Request.

 

VCP-DCV 2020: vSphere 6.7 Exam Prep

As explained in this previous post (VCP-DCV 2019 Exam Prep), I posted some very rough material for preparing for the Professional vSphere 6.7 Exam 2019 (2V0-21.19), which can now be passed to earn VCP-DCV 2020 certification.

If you are preparing to take the vSphere 6.7 Exam and earn VCP-DCV 2020 certification, and you cannot find a polished preparation tool, you may want to download the vSphere 6.7 Appendix and use it with the VCP6-DCV Official Cert Guide (VMware Press). You may also want to use my sample vSphere 6.7 Exam Subtopics document, which, as explained previously, is my unofficial attempt to identify subtopics for each vSphere 6.7 exam topic.

 

 

4 Node VSAN Cluster: RAID-1 vs RAID-5

In the past few years, I have encountered a specific scenario several times with different customers who want to reduce VSAN storage consumption in a 4-node cluster by migrating VMs from the RAID-1 (Mirror) policy to a RAID-5 (Erasure Coding) policy. Here is a brief statement summarizing my opinion on the topic.

You should reexamine the requirements and decisions that were made during the design of the cluster. The decision to configure a 4-node cluster with a specific set of cache drives and capacity drives is typically based on requirements to deliver a specific amount of usable storage with a specific level of availability. It coincides with a decision to apply a VSAN RAID-1 policy to the VMs.

The VSAN RAID-1 policy means that Failures to Tolerate (FTT) = 1 and Fault Tolerance Method = RAID-1 (Mirroring), which is the performance option. VSAN RAID-1 means that for each data item written to capacity drives in one ESXi host, a duplicate is placed on a second host and a witness component is placed on a third host. The minimum number of hosts required in a VSAN RAID-1 cluster is therefore three: two data replicas plus a witness. VMware recommends having N+1 nodes (4) in a VSAN cluster to allow you to rebuild data (vSAN self-healing) in case of a host outage or extended maintenance. In other words, whenever a host is offline for a significant amount of time, you can rebuild data and remain protected in case of the failure of another host.

You can elect to use VSAN RAID-5 (Erasure Coding) on all or some VMs in the cluster to reduce the consumed VSAN space. VSAN RAID-5 means FTT = 1 and Fault Tolerance Method = RAID-5/6 (Erasure Coding), which is the capacity option. Its required minimum number of nodes is 4, which your cluster has. But VMware recommends at least 5 (N+1) nodes to allow you to rebuild data after a host outage or extended maintenance. Your cluster does not meet the VMware recommendation for RAID-5.
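
To put the capacity difference in rough numbers (ignoring witness and metadata overhead): a VM with 100 GB of data and FTT = 1 consumes about 200 GB of raw capacity under RAID-1, because every block is mirrored on a second host. Under RAID-5, the same data is written as three data segments plus one parity segment spread across four hosts, consuming about 133 GB of raw capacity, roughly a one-third saving.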

If you do elect to use VSAN RAID-5 in the 4-node cluster, be aware of the risk during a host outage or extended maintenance. In other words, whenever a host is offline for a significant amount of time, you will not be able to rebuild data and you are not fully protected in the event of the failure of another host. If you decide that the risk is acceptable for some subset of your VMs and not for others, you can apply the VSAN RAID-1 and RAID-5 policies accordingly. If you want the benefit of reduced storage consumption but want to maintain the current level of availability, consider adding a 5th node to the cluster prior to implementing VSAN RAID-5.

Reference:  https://blogs.vmware.com/virtualblocks/2018/05/24/vsan-deployment-considerations/

Examining Health of a vRA 7.x Instance

Your vRA 7.x instance can be configured to run System Tests daily.  You can navigate to Administration > Health to view details of the most recent system test.  In this example, five tests failed during the last run.

Picture1

Picture2

To troubleshoot specific issues or just to learn more about vRA health, you can examine its logs. Navigate to Infrastructure > Log.

Picture3

In this example, warnings and errors are occurring every minute.

Picture5

Prior to opening a vRA ticket with VMware Support, you should generate a support bundle that you can upload to VMware.  Use the VAMI (https://vRA-node-FQDN:5480) and navigate to Logs.  Click the Create Support Bundle button.

Picture6

Back on the Infrastructure page, just above the Log option, you can select the DEM Status option to verify that the DEMs are running.  In this example, both DEMs are Online.

Picture7

You can use SSH (PuTTY) to log on to a vRA appliance using the root account. There you can run this command:

vra-command list-nodes --components

This command produces a lot of output about the health of each node. You should run it and capture the results when everything in your environment is healthy, so you can use them as a baseline for comparison when health concerns arise.
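
For example, to capture a baseline you can simply redirect the output to a dated file on the appliance (the path below is just an example):

vra-command list-nodes --components > /root/vra-node-baseline-$(date +%Y%m%d).txt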

Here are example screenshots from a recent execution in a healthy environment.  The full results of the command are captured as separate screenshots and organized here by node type with some explanation.  In this example, we refer to each node’s name by a number, which is part of the node’s full name that is masked in the screenshot.

vRA appliances (nodes 41, 42, 43).  In this example, node 42 is currently the Master node, as indicated by the fact that its value for Primary is true.

Picture8

Picture9

Picture10

IaaS Web Server Nodes (81 and 82).  Notice that two components (Database and ModelManagerData) that appear for Node 81 do not appear for Node 82.

Picture11

 

Picture12

IaaS Server Nodes (83 and 84).  Notice that in this example, Node 84 is the Active node and Node 83 is the Passive node.

Picture13

Picture14

DEM / Agent nodes (85 and 86).

Picture16

Picture17

Installing vRealize Automation 8.0 in a Home Lab

vRealize Automation (vRA) 8.0 is a new animal.  It is an on-premises version of VMware Cloud Automation Services, rather than an upgrade from vRA 7.x.  It has four main components:  vRA Cloud Assembly (build and deploy blueprints), vRA Service Broker (deliver and consume service catalogs), vRA Code Stream (implement CI/CD), and vRA Orchestrator (develop custom workflows).

In comparison to vRA 7.6 (which involves virtual appliances, Windows-based IaaS components, an MSSQL database, etc.), vRA 8.0 has a very simple architecture.  Primarily, it consists of:

  • A three node cluster of vRA appliances (or a single node when high availability (HA) is not needed)
  • A three node cluster of VMware Identity Manager (vIDM) appliances (or a single node when high availability (HA) is not needed), or an existing vIDM instance.
  • A vRealize Suite Lifecycle Manager (LCM) appliance.

Based on my experience, you may struggle to find any VMware Hands-on Labs or other means to gain hands-on familiarity with vRA 8.0. (The Try for Free link in the my.vmware portal currently takes you to a vRA 7.x Hands-on Lab, not a vRA 8.0 lab.)  Like me, you may decide to deploy vRA 8.0 in your home lab.  I implemented a home lab based on the following:

  • Windows 10 Home running on a Dell XPS 8930 PC with an 8 core Intel i7-9700 3 GHz CPU, 64 GB RAM, 1 TB SSD, 1 TB HDD
  • VMware Workstation Pro 15.5
  • VMs running directly in VMware Workstation:
    • vCenter Server 6.7 Appliance: 2 vCPUs, 10 GB vRAM, about 25 GB SSD storage used (13 thin-provisioned vDisks configured for 280 GB total)
    • ESXi 6.7: 8 vCPUs, 48 GB vRAM, 310 GB SSD storage (2 thick-provisioned vDisks: 10 GB and 300 GB)
    • Windows 2012 R2 Server running DNS with static entries for each VM: 2 vCPUs, 4 GB vRAM, 60 GB thin-provisioned vDisk on SSD storage
  • After the vRA installation, these virtual appliances run nested in the ESXi VM (in a VMFS volume backed by the 300 GB SSD virtual disk):
    • vIDM 3.3.1, 2 vCPUs, 6 GB vRAM, 60 GB total thin-provisioned vDisks
    • LCM 8.0, 2 vCPUs, 6 GB vRAM, 48 GB total thin-provisioned vDisks
    • vRA 8.0, 8 vCPUs, 32 GB vRAM, 222 GB total thin-provisioned vDisks

NOTE: Your biggest challenge in deploying vRA 8.0 in a home lab may be the fact that the vRA 8.0 appliance requires 8 vCPUs and 32 GB of memory.  After deploying vRA 8.0, I tried reconfiguring the vRA appliance with fewer vCPUs and less memory, but I had to revert after experiencing performance and functional issues.

 

NOTE: If you are building a home lab for the first time, here is a great reference that I used:  https://www.nakivo.com/blog/building-vmware-home-lab-complete/

Fortunately, through my relationship with VMware, I have access to free product downloads and evaluation licenses.  If you do not, you may need to request a vRA 8.0 trial via your VMware Account Representative (currently, the my.vmware portal does not appear to provide a link to a free trial).

To get started, you should use the vRealize Easy Installer, which can install LCM, vIDM, and vRA.  It can also integrate with a previously deployed vIDM instance or migrate from earlier versions of vRealize Suite Lifecycle Manager.

In your first attempt, you could choose to use Easy Installer to deploy the minimum architecture, which includes a single LCM appliance and a single vIDM appliance, but no vRA appliances.  This enables you to verify that LCM and vIDM are deployed successfully and remediate any issues.  Next, you can use LCM to deploy a single vRA appliance.  Finally, you can use QuickStart to perform the initial vRA configuration.

To get familiar with the installation, I recommend that you review the How to Deploy vRA 8.0 article at VMGuru:  https://vmguru.com/2019/10/how-to-deploy-vrealize-automation-8/.

To learn what Easy Installer does, refer to this link:  https://docs.vmware.com/en/vRealize-Automation/8.0/installing-vrealize-automation-easy-installer/GUID-CEF1CAA6-AD6F-43EC-B249-4BA81AA2B056.html

Based on my experience, here is a summary of steps that you could use for installing vRA 8.0 in a home lab.  (Be sure to use the official documentation when you are installing vRA 8.0: https://docs.vmware.com/en/VMware-vRealize-Suite-Lifecycle-Manager/8.0/com.vmware.vrsuite.lcm.80.doc/GUID-1E77C113-2E6E-4425-9626-13172A14D327.html)

Installation Steps:

To get started, download the vRealize Easy Installer and run it on a Windows, Linux, or Mac system.  https://docs.vmware.com/en/vRealize-Automation/8.0/installing-vrealize-automation-easy-installer/GUID-1E77C113-2E6E-4425-9626-13172A14D327.html
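
Before launching the installer, it is worth confirming from a shell that forward and reverse DNS resolution works for every appliance FQDN and IP address you plan to use; this is exactly why my lab includes a DNS server with static entries for each VM. The hostnames and IP address below are placeholders for your own lab names:

nslookup lcm.lab.local       # LCM appliance FQDN (placeholder)
nslookup vidm.lab.local      # vIDM appliance FQDN (placeholder)
nslookup vra.lab.local       # vRA appliance FQDN (placeholder)
nslookup 192.168.1.50        # reverse lookup for the vRA IP (placeholder)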

Use Easy Installer to:

  1. Deploy the LCM appliance to your vSphere environment.  https://docs.vmware.com/en/vRealize-Automation/8.0/installing-vrealize-automation-easy-installer/GUID-4D23B793-4EC8-4449-8B3A-34CB1D9A8609.html
  2. Deploy the vIDM appliance to your vSphere environment. https://docs.vmware.com/en/vRealize-Automation/8.0/installing-vrealize-automation-easy-installer/GUID-1C15C31B-D51F-4881-9CD1-EFB29C683EFF.html
  3. Skip the vRA installation.

Complete the Easy Installer wizard and monitor the installation progress.

Use LCM to install vRA 8.0 into a new environment.
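
Once LCM reports that the vRA deployment has completed, it can be worth logging on to the vRA appliance (SSH as root) and confirming that its services are up before you move on to QuickStart. This is just a sanity check based on my lab; in my vRA 8.0 deployment, the application services run as Kubernetes pods in a namespace named prelude, and it can take a while for all of them to reach the Running state:

kubectl get pods -n prelude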

vRA provides QuickStart to simplify the initial vRA configuration, such as adding a vCenter Server cloud account, creating a project, creating a sample machine blueprint, creating policies, adding catalog items, and deploying a VM from the catalog.  You can only run QuickStart once, so get familiar with it before launching it. To learn more about what QuickStart does to vRA Cloud Assembly and vRA Service Broker, refer to Take me on a tour of vRealize Automation to see what the QuickStart did at https://docs.vmware.com/en/vRealize-Automation/8.0/Getting-Started-Cloud-Assembly/GUID-4090D3A8-49C5-4530-8359-C8265B784C80.html.

NOTE: If you choose not to use QuickStart or if something goes wrong, you can use the Guided Setup.

Use QuickStart to perform the initial vRA 8.0 configuration. https://docs.vmware.com/en/vRealize-Automation/8.0/Getting-Started-Cloud-Assembly/GUID-91597976-E472-493B-8017-2D37DC8DC0E5.html

 

 

VCP-DCV 2019 Exam Preparation – Presentation

Here is the slide deck that I used in the presentation on Preparing to Take and Pass the VCP-DCV Exam at the Boston VMUG UserCon 2019.

VCP-DCV Cert Prep Slides

 

VCP-DCV 2019: vSphere 6.7 Exam Prep

At the beginning of 2019, VMware released a new VCP-DCV 2019 certification and a new certification exam: Professional vSphere 6.7 Exam 2019 (2V0-21.19).

Steve, Owen, and I began drafting a new appendix for use in conjunction with the VCP6-DCV Official Cert Guide (VMware Press) when preparing for the vSphere 6.7 exam, but we determined that the changes in exam structure and content made the endeavor futile.  We realized that the effort would be huge and the product would not be ideal.  Instead, we decided to develop a new guide that we would release shortly after the next major releases of vSphere and the exam.

But, due to the absence of ideal VCP-DCV vSphere 6.7 exam preparation tools, I am posting this very rough draft of the vSphere 6.7 Appendix here.  Feel free to use it as-is:

Appendix-VCP-DCV-2019

NOTE: The Exam Preparation Guide for the Professional vSphere 6.7 Exam does not provide any details for the exam objectives.  Here is my quick attempt to identify some potential subtopics (this is not official; it is just based on my knowledge of the topics and information that I found in the official VMware documentation):

VCP-DCV-subtopics