We're taking a slightly different angle than what Facebook published, in that we're primarily using tool calling and observability data to run investigations.
What we've released really shines at surfacing up relevant observability data automatically, and we're soon planning to add the change-tracking elements mentioned in the Facebook post.
If anyone is curious, I did a webinar with PagerDuty on this recently.
Interesting. We're finding that it's more verbose than GPT-4o (and often ignores instructions to give output in a specific format), but does better when faced with ambiguity and potentially confusing situations.
Here is an example using http://github.com/robusta-dev/holmesgpt/ where we ask it to troubleshoot a payment processing workload in a Kubernetes cluster, with the GPT-4o output underneath:
holmes ask --model="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0" "what is wrong with the payment processor?"
AI: To investigate what's wrong with the payment processor, we'll need to gather more information about it. First, let's try to find the relevant Kubernetes resources related to the payment processor.
Running `kubectl get -A --show-labels pod | grep payment`
Thank you for your question about the payment processor. In my initial search, I've found two pods related to payment processing. Let's investigate the one that seems to be having issues:
1. In the default namespace, there's a pod named "payment-processing-worker-5bb5bd75cd-x4g99" which is in a CrashLoopBackOff state. This indicates that the pod is repeatedly crashing and restarting.
2. In the sock-shop namespace, there's a pod named "payment-ff86cd6f8-rs47b" which appears to be running normally.
Let's focus on the problematic pod in the default namespace. To get more information about what's wrong, we'll need to look at its logs and description.
Running `kubectl describe pod payment-processing-worker-5bb5bd75cd-x4g99 -n default`
AI: Based on the information gathered, it appears that there is an issue with the payment processor in the Kubernetes cluster. Let me explain what's wrong:
1 The payment processing worker pod is in a CrashLoopBackOff state:
• Pod name: payment-processing-worker-5bb5bd75cd-x4g99
• Namespace: default
• Status: CrashLoopBackOff
2 The pod is continuously restarting and failing:
• Restart Count: 242
• The container is in a "Waiting" state with the reason "CrashLoopBackOff"
3 The root cause of the issue appears to be related to an environment variable:
• The container is running a bash script that checks for the presence of the DEPLOY_ENV environment variable.
• If DEPLOY_ENV is not set, the container exits immediately after printing an error message.
• The command being run is:
if [[ -z "${DEPLOY_ENV}" ]]; then echo Environment variable DEPLOY_ENV is undefined ; else while true; do echo hello; sleep 10;done; fi
4 The environment section in the pod description shows:
Environment: <none>
This indicates that no environment variables are set for the container.
5 The container is exiting with a status of "Completed" and an exit code of 0, which suggests that it's running the script, finding that DEPLOY_ENV is undefined, printing the error message, and exiting.
To fix this issue:
1 The DEPLOY_ENV environment variable needs to be set for the payment-processing-worker deployment.
2 You should update the deployment configuration to include this environment variable.
3 After setting the environment variable, the pod should stop crashing and enter a running state where it prints "hello" every 10 seconds.
To resolve this, you'll need to update the deployment configuration for the payment-processing-worker. You can do this by editing the deployment and adding the DEPLOY_ENV environment variable with an appropriate value.
Would you like me to show you how to update the deployment with the required environment variable?
Note that it explicitly picks up on the ambiguity in the question, pointing out that there are two payment-related workloads and that one of them is healthy.
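For anyone wondering why the pod shows "Completed" with exit code 0 yet still crash-loops: you can reproduce the container's entrypoint logic locally (the bash one-liner is copied from the pod spec in the transcript above):

```python
import os
import subprocess

# The container's entrypoint, copied verbatim from the pod description above.
SCRIPT = ('if [[ -z "${DEPLOY_ENV}" ]]; then '
          'echo Environment variable DEPLOY_ENV is undefined ; '
          'else while true; do echo hello; sleep 10; done; fi')

# Run it without DEPLOY_ENV set, matching the pod's "Environment: <none>":
env = {k: v for k, v in os.environ.items() if k != "DEPLOY_ENV"}
result = subprocess.run(["bash", "-c", SCRIPT], env=env,
                        capture_output=True, text=True)

print(result.stdout.strip())  # Environment variable DEPLOY_ENV is undefined
print(result.returncode)      # 0 -- a "successful" exit, so Kubernetes sees the
                              # container terminate, restarts it, and you get
                              # CrashLoopBackOff despite no error status
```

The exit code of 0 is exactly why the symptom is confusing: nothing "fails", the process just ends immediately, and the restart policy does the rest.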
GPT-4o:
$ holmes ask "what is wrong with the payment processor?"
Running `kubectl get -A --show-labels pod | grep payment-processor`
Running `kubectl get -A --show-labels deployment | grep payment-processor`
Running `kubectl get -A --show-labels service | grep payment-processor`
Running `kubectl get -A --show-labels pod | grep payment`
Running `kubectl describe pod payment-processing-worker-5bb5bd75cd-x4g99 -n default`
Running `kubectl logs payment-processing-worker-5bb5bd75cd-x4g99 -n default --previous`
AI: payment-processing-worker-5bb5bd75cd-x4g99 in namespace default crashes due to missing DEPLOY_ENV environment variable.
Fix: Define DEPLOY_ENV in the deployment spec.
Example patch:
env:
- name: DEPLOY_ENV
  value: "production"
Most of the CLI tools just wrap an LLM but don't give it access to the data it needs to be useful. Aider is an exception of course - it gives great results because it feeds the LLM your source files.
We built http://github.com/robusta-dev/holmesgpt/ to investigate Prometheus/Jira/PagerDuty issues. We're able to get pretty good results (we benchmark extensively) because we use function-calling to give the LLM read access to relevant data. I think we're the only open source AIOps tool, and possibly the only AIOps tool period, that does something more complex than RAG + summarization.
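The pattern looks roughly like this (a minimal sketch, not HolmesGPT's actual code - the tool name and schema here are illustrative, in the OpenAI-style function-calling format): you expose read-only commands as tools, the model requests them, and you execute and feed the output back.

```python
import json
import subprocess

# Illustrative tool definition; HolmesGPT's real tool set differs, this just
# shows the shape of the function-calling pattern.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "kubectl_find_resource",
        "description": "Search pods across all namespaces for a keyword",
        "parameters": {
            "type": "object",
            "properties": {"keyword": {"type": "string"}},
            "required": ["keyword"],
        },
    },
}]

def run_tool(name: str, arguments_json: str) -> str:
    """Dispatch a model-requested tool call to a read-only kubectl command."""
    args = json.loads(arguments_json)
    if name == "kubectl_find_resource":
        cmd = f"kubectl get -A --show-labels pod | grep {args['keyword']}"
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return out.stdout or out.stderr  # returned to the model as the tool result
    raise ValueError(f"unknown tool: {name}")

# In the investigation loop, TOOLS is passed on each chat-completion call; every
# tool_call the model emits is routed through run_tool and the output is appended
# to the conversation, so the model decides what data to look at next.
```

The important part is that the tools are read-only: the model can pull in whatever observability data it needs, but can't mutate the cluster.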
If you're looking for something open source: https://github.com/robusta-dev/holmesgpt/