Apr 25, 2023

Can we understand and fix VMware VM problems with a Little Help from ChatGPT?

By Dennis · 5 minute read

There is a lot of hype around ChatGPT and AI changing the way we work. I was curious about potential use cases for our customers in their daily VMware operations.

Two things immediately came to mind: getting a better understanding of VMware vSphere metrics and interpreting log files. I'll cover the metrics part in this blog post and will write another one about log file interpretation.

ChatGPT, a powerful language model developed by OpenAI, can assist with understanding VMware vSphere metrics by providing additional context and insights that might not be immediately apparent through metrics alone.

BUT - keep in mind that ChatGPT does not really understand what you are asking, and you need to provide proper context to get good results.

Full disclosure - I used the free plan of ChatGPT for this blog post. I'm going to do a follow-up post for ChatGPT 4 once it's available for everyone and not just on the paid plan.

Let's dig into it.

How ChatGPT can assist with understanding VMware vSphere metrics

One of the key benefits of using ChatGPT is its ability to interpret natural language queries and provide relevant answers in a conversational format. This means you can ask ChatGPT questions about your virtual environment, such as "What is causing high CPU usage on VM X?" or "What are the most common causes of storage latency issues?" and have a chance to receive helpful, actionable responses.

Based on the ChatGPT response, it's easy to get overexcited and start to believe the "AI" is really capable of understanding what's going on. Please keep in mind that we're working with a new generation of machine learning that is fed with enormous amounts of data. It matches your query with an answer by combining the text blocks that make the most sense. That is also why vector databases are getting a lot of attention: they are made for fast matching of plausible text patterns.

The more relevant context and additional information you feed into a system like ChatGPT, the higher the output quality.

That being said, ChatGPT has not been trained with a focus on VMware vSphere manuals, knowledge base articles, and forum discussions. It has been trained on a broad slice of Internet content, so it's more universal, but also less precise.

Collecting and checking metrics

To collect and check VMware vSphere metrics, you can use various tools provided by VMware such as vCenter Server, vSphere Web Client, and vSphere Client. These tools allow you to monitor performance metrics related to CPU, memory, disk, and network usage, among others.

You can also create customized dashboards and alerts based on these metrics to keep a close eye on the health and performance of your virtual environment.

In addition to the built-in tools provided by VMware, there are also many third-party solutions available for monitoring and analyzing VMware vSphere metrics.

One such solution is Opvizor, which is widely regarded as one of the easiest-to-use tools for monitoring and optimizing virtual environments. Opvizor provides a range of features for monitoring VMware vSphere metrics, including real-time monitoring of resource usage, automated alerts for potential issues, and detailed analytics and reporting for identifying performance trends and optimizing resource allocation.

With Opvizor, IT teams can quickly and easily monitor the health and performance of their virtual environment, ensuring that they are always aware of any potential issues and can take proactive steps to address them.

Guess what - the text in this section so far has been written by ChatGPT and I didn't change a thing - I only added this sentence. As proof I also asked what Deadpool thinks of Opvizor:

Photo by Ayo Ogunseinde on Unsplash

Well, well, well, if it isn't Opvizor - the coolest monitoring tool in the virtual world! Even I, Deadpool, can appreciate how easy it is to use and how it helps IT teams keep their virtual environments in tip-top shape. With real-time monitoring, automated alerts, and detailed analytics, Opvizor is like having a personal assistant for your virtual infrastructure. It's so cool, I might just have to add it to my arsenal of tools... right after my trusty swords, of course. So, don't be a fool, give Opvizor a try and see for yourself just how cool it is.

- Deadpool

Not sure if that is an official quote?!

ChatGPT and VMware metrics in practice

Enough introduction; let's start with a practical example.

As you can see, there is some CPU co-stopping that exceeded the set threshold (we played a bit with the resource settings and set a very low CPU limit to make that happen).
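A side note on reading the value: vSphere reports co-stop as a summation counter, meaning milliseconds accumulated per sampling interval, so a percentage is often easier to compare against a threshold. The commonly used conversion is value / (interval in seconds * 1000) * 100. Below is a minimal PowerShell sketch of that arithmetic, assuming the 20-second interval of the real-time charts. The function name Convert-CoStopToPercent and its parameters are made up for this example, and dividing by the vCPU count assumes you are looking at the VM-level aggregate rather than a single vCPU.

```powershell
# Hypothetical helper (not part of PowerCLI): converts a co-stop summation
# value in milliseconds into a percentage of the sampling interval.
function Convert-CoStopToPercent {
    param(
        [Parameter(Mandatory)] [double] $CoStopMs,  # summation value from the chart
        [double] $IntervalSeconds = 20,             # assumption: real-time chart interval
        [int] $VCpuCount = 1                        # assumption: 1 = value is for a single vCPU
    )

    # A summation counter accumulates milliseconds over one sampling interval.
    $intervalMs = $IntervalSeconds * 1000

    # Dividing by the vCPU count gives a per-vCPU average for VM-level values.
    [math]::Round(($CoStopMs / ($intervalMs * $VCpuCount)) * 100, 2)
}

# 1000 ms of co-stop within one 20 s real-time sample -> 5 (percent)
Convert-CoStopToPercent -CoStopMs 1000
```

For the past-day charts the interval would be 300 seconds instead of 20, so the same 1000 ms would be a much smaller percentage.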

The best way to use ChatGPT is to set a context and then add the question.

Question: As a VMware vSphere administrator I want to understand CPU-Co Stopping that is shown in Milliseconds

ChatGPT Answer:

As a VMware vSphere administrator, it's important to understand CPU co-stopping and how it's measured. CPU co-stopping refers to a situation where a virtual machine is waiting for CPU resources to become available on the host, even though there are free CPU cycles available. This can happen when multiple virtual machines are competing for CPU resources, and the hypervisor needs to allocate CPU time fairly between them.

CPU co-stopping is typically measured in milliseconds, which represents the amount of time that a virtual CPU (vCPU) within the virtual machine is waiting for CPU resources. The longer the co-stopping time, the greater the impact on the virtual machine's performance, as the vCPUs are not able to run at their full potential.

To minimize the impact of CPU co-stopping, it's important to ensure that virtual machines are properly sized for the available resources on the host, and that resource allocation is balanced appropriately between virtual machines. This can help prevent situations where one virtual machine is monopolizing CPU resources at the expense of others, leading to co-stopping and decreased performance. By monitoring CPU co-stopping metrics and taking steps to optimize resource allocation, VMware vSphere administrators can help ensure that their virtual environments are running at peak performance.

Not bad, but what about the details when it comes to sizing?

Just continue in the same chat to keep the context.

Question: what are the most common issues when it comes to sizing that causes high CPU Co-stopping?

Answer:

Question: How can I fix that?

Answer:

To fix issues related to CPU co-stopping caused by sizing problems, there are several steps that VMware vSphere administrators can take:

Right-size virtual machines - Ensure that virtual machines are properly sized to meet their workload demands without consuming unnecessary resources. Use tools like VMware vRealize Operations or Opvizor to help identify oversized virtual machines.
Evaluate host resources - Check that the physical host has enough CPU resources to meet the demands of the virtual machines running on it. Consider adding additional hosts or upgrading existing hosts if necessary.
Optimize resource allocation settings - Configure resource allocation settings such as CPU affinity and reservations to ensure that virtual machines are receiving the resources they need without monopolizing CPU resources.
Monitor resource usage - Regularly monitor resource usage using tools like vRealize Operations or Opvizor to identify potential issues before they cause problems.

By taking these steps, VMware vSphere administrators can help minimize CPU co-stopping and ensure that their virtual environments are running at peak performance.

Now we are talking - did you notice Opvizor in the answer?

Do we get to a fix?

Of course, ChatGPT has no information about the environment or more details of the virtual machine configuration and it's also not made to "troubleshoot" or "think" about a potential solution on its own.

But I know that the resource limit is causing the issue in my case and want to fix it automatically using PowerCLI.

Question: Create a PowerCLI script that removes all CPU limits of my virtual machines

Isn't that simply amazing? But ... not working. The PowerCLI experts likely noticed it immediately.

There is a simpler solution using Get-VMResourceConfiguration and Set-VMResourceConfiguration
The script uses the wrong methods on the wrong objects

In case you like to blindly copy and paste, you won't have a lot of fun.

Didn't ChatGPT claim to be conversational? Let's simply try it.

The closest result I got was this one, but Set-VMResourceConfiguration expects the configuration object returned by Get-VMResourceConfiguration, and CpuLimitMhz should be set to $null, not -1.

And then it gave up

As you can see, at some point you only get "Yes, you are right" as an answer. When you reach that point, it's better to start from scratch and choose a different way to get to your solution.
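For reference, the simpler approach mentioned above (Get-VMResourceConfiguration piped into Set-VMResourceConfiguration with CpuLimitMhz set to $null) can be sketched roughly like this. It is a minimal sketch, not a tested production script: it assumes VMware PowerCLI is installed and a Connect-VIServer session is already open, the wrapper name Remove-VMCpuLimit is made up for this example, and the filter relies on unlimited VMs reporting CpuLimitMhz as -1.

```powershell
# Hypothetical wrapper name; the cmdlets inside are the PowerCLI cmdlets
# named in this post. Assumes Connect-VIServer has already been run.
function Remove-VMCpuLimit {
    [CmdletBinding()]
    param()

    # 1. Get-VM returns the virtual machines of the connected vCenter/host.
    # 2. Get-VMResourceConfiguration returns their resource settings.
    # 3. The filter skips VMs that are already unlimited (assumed to show -1).
    # 4. Passing $null (not -1) as CpuLimitMhz removes the limit.
    Get-VM |
        Get-VMResourceConfiguration |
        Where-Object { $_.CpuLimitMhz -ne -1 } |
        Set-VMResourceConfiguration -CpuLimitMhz $null
}
```

Calling Remove-VMCpuLimit touches every VM that Get-VM returns, so it is worth narrowing the Get-VM call down (for example by name) and trying it on a test VM first.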

Conclusion

This was just a fun exercise, and there are plenty of examples out there that might work better or worse.

What did we learn?

ChatGPT can definitely help understand VMware vSphere metrics
We can get some good hints about where the issue's root cause could be
Some of the recommendations to fix the issue can be helpful
You can get an OK PowerCLI script structure but only sometimes a working script
There are clear limits to a natural language model and you should not be biased by the confidence of the response. After all, ChatGPT does not care about right or wrong; it's about giving you the best possible pattern match as a response. The human in front of the monitor is responsible for reviewing and proofreading.