Article originally published on Forbes Technology Council
AIOps emerged in response to the overwhelming volume and complexity of data generated by modern IT systems. Operations teams needed these tools to keep up with the demands of large-scale distributed systems.
AIOps offered a promising way to automate routine tasks, detect anomalies, and provide proactive insights, but its limitations have become increasingly apparent. Many AIOps solutions rely heavily on machine learning algorithms that require extensive training data and can be prone to overfitting. Additionally, the black-box nature of some AI models can make it difficult to understand how they arrive at their conclusions, hindering troubleshooting efforts.
As a result, a more comprehensive and intelligent approach is needed to address these challenges.
Generative AI (GenAI) is one solution, offering a more powerful and flexible approach to the challenges of modern observability. GenAI models, such as large language models (LLMs), can process and correlate heterogeneous data types (logs, metrics, traces) more effectively than traditional AIOps tools, providing a more comprehensive view of system behavior.
Let’s look at how GenAI can transform observability and what it will take to lay the foundation for the transition from AIOps to a GenAI-enabled observability future.
How Observability Changes With Generative AI
AIOps and observability are like a detective and a crime scene investigator. Observability is the crime scene investigator, gathering evidence (telemetry data) and documenting the scene. AIOps is the detective, analyzing the evidence, identifying patterns and solving the mystery (or, in IT terms, identifying root causes and resolving issues). Together, they form a powerful team, with observability providing the foundation and AIOps using its analytical skills to uncover insights and drive action.
However, generative AI and observability are like a seasoned detective and a cutting-edge forensic lab. Observability is the forensic lab collecting and analyzing evidence (telemetry data) using advanced tools and techniques. Generative AI is the experienced detective, leveraging the lab’s findings to connect the dots, identify patterns and solve complex cases (or, in IT terms, pinpoint root causes and resolve complex issues).
In their blog post “Goodbye AIOps: Welcome AgentSREs—The Next $100B Opportunity,” Foundation Capital’s Ashu Garg and Jaya Gupta argue that “AIOps” is an overused term that does not capture AI’s full potential in observability. Garg and Gupta highlight the challenges of managing complex telemetry data in modern systems, which lead to slow troubleshooting and inefficient operations.
The Foundations Of Observability: Time Series Databases And Telemetry Data
While AIOps has made significant strides, generative AI can offer enterprise organizations several advantages over traditional AIOps approaches to observability:
• Unified data understanding
• Improved anomaly detection
• Enhanced root cause analysis
• Natural language interfaces
• Adaptability
• Explainability
Today, organizations need comprehensive observability to handle various data types. To achieve this, we must overcome some technical challenges related to time series databases (TSDBs) and telemetry data. Once we solve these problems, LLMs can provide a powerful way to make sense of all this data and gain valuable insights.
Managing telemetry data can be complex. The sheer volume and variety of data modern systems generate can overwhelm traditional data management tools. Additionally, ensuring data quality, security and privacy is essential but can be challenging.
Organizations need an innovative approach to managing telemetry data to address these challenges. This involves adopting modern tools and technologies that can handle the scale and complexity of modern data sets.
Reducing Hallucinations With RAG And Observability
Retrieval Augmented Generation (RAG) is a technique that combines a traditional information retrieval system with an LLM. This allows LLMs to access and incorporate factual information from external databases or knowledge graphs during their generation process, leading to more accurate and relevant outputs, particularly in factual recall tasks.
While RAG and observability are distinct concepts, they share the goal of leveraging data for improved decision making. They can be viewed as complementary tools. RAG can benefit from observability data to further enhance its understanding of the world, and observability can leverage RAG for better data analysis and interpretation.
While we cannot yet eliminate hallucinations in LLMs, RAG and observability can be employed together to reduce them.
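To make the retrieval step concrete, here is a minimal sketch of the RAG pattern in Python. It is an illustration under stated assumptions, not a production design: the runbook snippets, the function names (`retrieve`, `build_prompt`) and the simple word-overlap scoring are all hypothetical stand-ins. A real system would more likely use embeddings and a vector store, and would send the resulting prompt to an LLM, a step omitted here.

```python
import re
from collections import Counter

# Hypothetical knowledge base: short runbook and incident notes (illustrative only).
DOCUMENTS = [
    "Runbook: checkout-service returns HTTP 503 when the payments database connection pool is exhausted.",
    "Incident note: high CPU on cache nodes was caused by an expired TLS certificate retry loop.",
    "Runbook: disk usage alerts on log-collector are resolved by rotating archived logs.",
]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, document: str) -> int:
    """Count the tokens shared by the query and the document."""
    shared = Counter(tokenize(query)) & Counter(tokenize(document))
    return sum(shared.values())


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k documents with a non-zero overlap score, best first."""
    ranked = sorted(documents, key=lambda doc: score(query, doc), reverse=True)
    return [doc for doc in ranked[:top_k] if score(query, doc) > 0]


def build_prompt(query: str, documents: list[str], top_k: int = 2) -> str:
    """Assemble a prompt that asks the model to answer from retrieved context only."""
    context = retrieve(query, documents, top_k)
    context_block = "\n".join(f"- {doc}" for doc in context) or "- (no relevant context found)"
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    print(build_prompt("Why does checkout-service return 503?", DOCUMENTS))
```

Instructing the model to answer only from retrieved context, and to say when that context is insufficient, is one common way to discourage unsupported answers. It reduces hallucinations rather than eliminating them.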
Laying The Foundation For LLM-Powered Observability
As explained in Garg and Gupta’s blog post cited above, agent-based site reliability engineers (AgentSREs) are expected to play a crucial role in the future of observability, particularly in pairing observability with GenAI capabilities.
AgentSREs will help bridge the gap between observability and GenAI by serving as the intermediaries between the observability platform and the GenAI models. They will help ensure seamless integration and effective communication between the two. Additionally, they can help translate the observability data and requirements into a format that the GenAI models can understand and process effectively.
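As a rough illustration of that translation step, the sketch below flattens logs, metrics and traces into a single time-ordered block of plain text that could be placed in a model's prompt. The `TelemetryEvent` type, its fields and the `to_context` function are hypothetical names chosen for this example. They do not come from any particular observability platform or from Garg and Gupta's post.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TelemetryEvent:
    """A simplified, hypothetical record covering logs, metrics and traces."""
    timestamp: datetime
    kind: str      # "log", "metric" or "trace"
    service: str
    detail: str


def to_context(events: list[TelemetryEvent], max_events: int = 50) -> str:
    """Render the most recent events as time-ordered lines of plain text."""
    if max_events <= 0:
        return ""
    # Sort oldest to newest, then keep only the latest max_events entries.
    ordered = sorted(events, key=lambda e: e.timestamp)[-max_events:]
    return "\n".join(
        f"{e.timestamp.isoformat()} [{e.kind}] {e.service}: {e.detail}"
        for e in ordered
    )


if __name__ == "__main__":
    events = [
        TelemetryEvent(datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc),
                       "log", "checkout", "ERROR connection pool exhausted"),
        TelemetryEvent(datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
                       "metric", "payments-db", "active_connections=100"),
        TelemetryEvent(datetime(2024, 1, 1, 12, 0, 7, tzinfo=timezone.utc),
                       "trace", "checkout", "POST /pay took 30000 ms"),
    ]
    print(to_context(events))
```

Interleaving the three signal types on one timeline is a deliberate choice: it lets the model see that the database hit its connection limit just before the errors and slow requests appeared. The `max_events` cap keeps the context within a model's input limits.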
By leveraging AgentSREs, observability platforms should be better equipped to effectively navigate the challenges and opportunities presented by GenAI integration, ensuring that the insights generated are reliable, interpretable and aligned with the organization’s operational and strategic objectives.
AgentSREs can help organizations transition from AIOps to GenAI-enabled observability. In bridging the gap, however, organizations should watch out for some key challenges, including:
1. Bias And Accuracy Concerns: GenAI models can potentially introduce biases and inaccuracies in their insights, which could lead to flawed decision making and operational issues.
2. Interpretability And Explainability: GenAI models’ complexity can make it challenging to understand the reasoning behind their insights. Companies should prioritize the development of interpretable and explainable GenAI models to enable their IT teams to comprehend the underlying logic and make informed decisions.
3. Cybersecurity Risks: Integrating GenAI into observability platforms could introduce new attack vectors and security vulnerabilities that must be addressed. Companies should implement robust security measures—including access controls, data encryption and anomaly detection—to mitigate potential cybersecurity risks.
By being aware of these “watch-outs” and proactively addressing them, companies can more effectively navigate the transition from AIOps to GenAI-enabled observability, minimizing risks and maximizing the benefits of this emerging technology.