Closing the Observability Gap Between Data and AI: Q&A With Imply’s Eric Tschetter


The growing deployment of AI agents is exposing a flaw in observability: To cut costs, teams routinely filter or offload the very data those systems depend on. Without full historical context to validate outputs and understand patterns, AI performance degrades, and teams don’t understand why. The limiting factor for enterprise AI isn’t just the AI model; it’s also the data platform underneath it.

This observability gap exists because most enterprises are layering AI on top of legacy monitoring infrastructure. The lack of transparency into model reasoning creates cascading risks that include regulatory exposure, customer harm, and the inability to debug failures at scale. Observability for AI requires visibility into model reasoning, hallucination detection, token-level tracing, drift and degradation, and end-to-end latency breakdown.

Closing this gap requires purpose-built instrumentation, evaluation pipelines, and anomaly detection systems. For agentic workflows, organizations need execution tracing, decision logging, and cross-step consistency checks. Eric Tschetter, chief architect at Imply, believes that shifting to a central data layer that can power every part of the observability ecosystem, from Splunk to Grafana to AI copilots, can be done without silos or trade-offs.

Tschetter is one of the original creators of the open source Apache Druid project. He has been consistently working with Druid throughout his role as a fellow at Splunk and, before that, as a distinguished engineer at Yahoo. Imply, the “Data Layer for Observability, Security, and AI,” empowers organizations to keep more data, search it faster, and spend less—without changing their tools.

How is AI exposing fundamental gaps in how observability data is stored and accessed, and why?

Traditional observability systems were built for human-driven workflows operating under cost constraints. They assume limited data volumes, selective retention, and predefined queries.

That model is now under pressure. Data volumes have been growing for years, and AI is accelerating that trend. Most importantly, AI is exposing the limitations of existing architectures. AI systems depend on complete, full-fidelity context and fast access across large time ranges. When data is fragmented, delayed, or selectively retained, both humans and AI are forced to operate with partial information. The gap isn’t about getting better tools. Fixing it requires rethinking the entire architecture, specifically the way data storage is separated from the rest of the observability system.

What are companies doing about this, and what should they be doing?

As data volumes surge, driven in large part by AI, many companies are defaulting to object storage to keep costs under control. This improves economics, but introduces new trade-offs: higher latency, limited indexing, and slower query performance when speed matters most. In critical moments, yes, the data is there, but [it’s] almost unusable or hard to access.

The underlying issue is that observability workloads are not uniform. Detection requires always-on, low-latency access to recent data. Investigations are bursty and often require repeated analysis across long time ranges.

A single execution model cannot efficiently support all of these workloads. Organizations should consider decoupling storage from compute and applying the appropriate execution model to each use case. This allows teams to retain large volumes of telemetry cost-effectively while still supporting fast, flexible access when needed.
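To make the idea of matching an execution model to each workload concrete, here is a minimal sketch in Python. It is an illustration only, not how Imply or any specific product works: the `Workload` and `ExecutionModel` names, the `choose_execution_model` function, and the 24-hour `RECENT_WINDOW` cutoff are all hypothetical assumptions.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Workload(Enum):
    DETECTION = "detection"          # always-on alerting over recent data
    INVESTIGATION = "investigation"  # bursty analysis over long time ranges


class ExecutionModel(Enum):
    ALWAYS_ON = "always_on"  # dedicated compute kept warm for low latency
    ON_DEMAND = "on_demand"  # elastic compute started only when a scan runs


@dataclass(frozen=True)
class Query:
    workload: Workload
    time_range: timedelta  # how far back the query looks


# Hypothetical cutoff for what counts as "recent" data.
RECENT_WINDOW = timedelta(hours=24)


def choose_execution_model(query: Query) -> ExecutionModel:
    """Pick a compute tier; the stored data is the same in both cases."""
    if query.workload is Workload.DETECTION and query.time_range <= RECENT_WINDOW:
        return ExecutionModel.ALWAYS_ON
    # Long-range or investigative queries use compute that scales with demand.
    return ExecutionModel.ON_DEMAND
```

In this sketch, a five-minute detection query would be routed to always-on compute, while a 90-day investigation would be routed to on-demand compute. Both read from the same retained telemetry, which is the point of separating storage from compute.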

What should enterprises do as AI agents become more operationally embedded?

As AI agents become more embedded, the primary bottleneck shifts to data access. These systems aren’t limited by interface—they’re limited by what they can see. If the data is incomplete, delayed, or fragmented, an agent’s decision will be too.

Enterprises need to design for high-volume ingestion, long-term retention, and low-latency access across time. This is not about adding more AI capabilities. It is about ensuring AI systems can operate on complete, queryable datasets. Without that foundation, AI cannot be relied on to make decisions or take action.

How can Imply Lumi help address this problem?

Imply Lumi is an observability warehouse purpose-built to store massive volumes of telemetry without giving up query performance. By decoupling storage from compute, Lumi enables queries directly on compressed data without requiring traditional rehydration or preloading workflows. This allows teams to retain significantly more data while keeping it immediately accessible.

Instead of forcing trade-offs between cost and performance, this approach provides a foundation where both can scale independently. For AI systems, this is particularly important. They require access to complete datasets across time, not sampled or fragmented views. Lumi ensures that this data is not only retained, but [it’s] also usable in real time.

How will AI agents evolve in the observability space?

AI agents in observability will evolve from being narrow, assistive tools to integrated components of the operational pipeline. But their effectiveness will be gated by data access. If agents are constrained to partial datasets or delayed access, their role will remain limited. The shift occurs when they can operate on complete, real-time datasets across large time ranges. At that point, they can move beyond isolated analysis and begin to coordinate actions across systems. This transition is not driven by advances in AI alone. It depends on the underlying data architecture that makes that level of access possible.
