Dynamic Telemetry is a PROPOSAL: please provide feedback! :-)

Dynamic Telemetry is not an implementation; it is a request for collaboration that will lead to a shared understanding and, hopefully, one or more implementations.

Your feedback and suggestions on this document are highly encouraged!

Please:

Join us by providing comments or feedback on our Discussions page
Submit a PR with changes to this file (docs/Architecture.FlightRecorder.Overview.document.md)

Direct Sharing URL

http://microsoft.github.io/DynamicTelemetry/docs/Architecture.FlightRecorder.Overview.document/

Flight Recorder Overview

A Flight Recorder is essentially a ring buffer that stores logging data locally and uploads it to a backend only when instructed by a triggering Action. Unlike standard streaming telemetry, this telemetry is emitted only when instructed, usually on failure.

In this capacity, it serves as an alternative and complement to standard Streaming Observability. In fact, users of Flight Recorders typically report that their presence turns the frustrating act of debugging a live system into something a lot more fun and interactive, almost like a video game.

A quick recap:

A Probe is a form of logging that flows into OpenTelemetry
A Filter or Router is a piece of code situated in the middle of an OpenTelemetry pipe that routes or filters logs
A Flight Recorder is a circular log that is never emitted unless there's a reason
An Action provides the reason

Think of a Flight Recorder as a way to enhance your logging capabilities. It provides flexibility and additional options for addressing complex issues. Imagine needing a specific log when something goes wrong.

A Flight Recorder offers a unique approach where logs are collected but not uploaded unless a problem arises. The challenge and creative opportunity lie in defining the triggering Action for when the issue you're monitoring occurs.

Now we're playing cat and mouse with our bugs, and that is a lot of fun.

What is a Flight Recorder?

Technically, a Flight Recorder comprises a ring buffer with fixed capacity, typically residing in memory or (maybe) on disk, in which newer entries continually overwrite the oldest ones. This design is naturally lossy, yet it captures essential insights about system events and states in near real time.

What a Flight Recorder contains is simply redirected standard Logging; it can come from any Probe technology that emits into OpenTelemetry.

Think of a Flight Recorder as routed Observability that goes nowhere unless asked. If you imagine the pipe analogy in the Umbilical document, a Flight Recorder is just a cut Observability pipe that is redirected into a circular buffer instead of streamed directly to a backend.
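To make the ring buffer idea concrete, here is a minimal in-memory sketch in Python. The class name `FlightRecorder` and its methods `record` and `snapshot` are assumptions made for illustration; they are not part of any Dynamic Telemetry API.

```python
from collections import deque


class FlightRecorder:
    """Illustrative fixed-capacity ring buffer; oldest entries are overwritten first."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently discards the oldest item once it is full.
        self._buffer = deque(maxlen=capacity)

    def record(self, entry):
        self._buffer.append(entry)

    def snapshot(self):
        # Copy the current contents, oldest first, without clearing them.
        return list(self._buffer)


recorder = FlightRecorder(capacity=3)
for i in range(5):
    recorder.record(f"event-{i}")

print(recorder.snapshot())  # ['event-2', 'event-3', 'event-4']
```

The first two events are gone by the time the snapshot is taken, which is the lossy behavior described above.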

Special Characteristics of a Flight Recorder

In Dynamic Telemetry, on one machine, there can be hundreds, if not thousands, of Flight Recorders. Some are as small as a log record or two; others may contain megabytes of Logs. Any of them can be collected as instructed by an Action.

The most special characteristic of a Flight Recorder is that each can be uniquely identified by a dedicated tag or name. This allows for quick recognition among multiple data sources and ensures streamlined retrieval when logs need to be collected by an Action.

This on-demand egress is a core feature, enabling data extraction whenever deeper investigation is necessary. By preserving a snapshot of data that would otherwise be overwritten, the Flight Recorder helps diagnose issues by making past telemetry accessible for post-mortem analysis.

How to collect a Flight Recorder

A Flight Recorder augments standard streaming telemetry by capturing data from multiple probes, such as OpenTelemetry logging, ETW (Windows), user_events (Linux), or syslog, into a circular buffer. Logs remain local unless triggered to leave the machine, delivering deeper insights through real-time and localized data analysis.

By storing high-verbosity traces locally, a Flight Recorder retains critical details for post-event analysis. Logs remain accessible when needed, even if they might not be retained long-term.

This solution can provide both performance benefits and cost savings. To learn more, refer to the position papers on scarcity and triggered Flight Recorders.

The basic steps to collect a Flight Recorder are to learn its identifier through some mechanism and then to use a triggering Action to collect it.

Steps Involved in Collecting a Flight Recorder

Route high-volume Logging to a Flight Recorder
Note its Identifier
Use the Flight Recorder egress Action to collect the Flight Recorder
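The three steps above can be sketched with Python's standard `logging` module. Everything specific here is an assumption for illustration: the `RECORDERS` registry, the `FlightRecorderHandler` class, the `egress_action` function, and the identifier `"checkout-verbose"` are hypothetical names, and the upload is simulated with a local list rather than a real backend.

```python
import logging
from collections import deque

# Hypothetical registry: Flight Recorder identifier -> handler.
RECORDERS = {}


class FlightRecorderHandler(logging.Handler):
    """Keeps the most recent formatted records in memory; nothing is uploaded."""

    def __init__(self, identifier, capacity):
        super().__init__(level=logging.DEBUG)
        self.identifier = identifier
        self.records = deque(maxlen=capacity)
        RECORDERS[identifier] = self

    def emit(self, record):
        self.records.append(self.format(record))


def egress_action(identifier, upload):
    """Step 3: hand a snapshot of the named recorder to an upload callable."""
    snapshot = list(RECORDERS[identifier].records)
    upload(identifier, snapshot)
    return len(snapshot)


# Step 1: route high-volume logging to a Flight Recorder.
logger = logging.getLogger("checkout")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep verbose records away from other handlers

# Step 2: note its identifier.
handler = FlightRecorderHandler("checkout-verbose", capacity=100)
logger.addHandler(handler)

for i in range(250):
    logger.debug("processing item %d", i)

# Step 3: a triggering Action collects the recorder by identifier.
uploaded = []
count = egress_action("checkout-verbose", lambda name, logs: uploaded.extend(logs))
print(count, uploaded[0], uploaded[-1])
# 100 processing item 150 processing item 249
```

Only the last 100 of the 250 records leave the buffer, and only because the Action asked for them.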

Trace 'Horizons'

Flight Recorders often collect high-volume logs that remain local until a triggering event prompts upload. This approach introduces different trace horizons. One horizon might capture logs leading to a process crash or other diagnostic event. Because these logs can be high in volume, ring buffers overwrite older data frequently. This arrangement is commonly referred to as a "short-horizon" Flight Recorder.

In contrast, some logging applies only to specific failures that may take minutes or days to occur. Examples include Bluetooth sessions on a client operating system, long-running transactions, or writing data to a slow medium like tape. These scenarios require maintaining a Flight Recorder over an extended period, ensuring that all pertinent logs remain accessible when needed.

These lower-volume but long-duration Flight Recorders are known as "long-horizon" Flight Recorders. Because their buffers fill slowly, they can retain relevant logs over extended periods.
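As a rough rule of thumb, the horizon of a Flight Recorder is its capacity divided by the rate at which records arrive. The sketch below assumes a buffer bounded by record count and a steady logging rate; the function name and the numbers are hypothetical and chosen only for illustration.

```python
def horizon_seconds(capacity_records, records_per_second):
    """Approximate time span covered by a full ring buffer at a steady rate."""
    return capacity_records / records_per_second


# Same buffer size, very different logging rates (illustrative values).
short = horizon_seconds(capacity_records=100_000, records_per_second=20_000)
long = horizon_seconds(capacity_records=100_000, records_per_second=0.5)

print(f"short horizon: {short:.0f} s")       # short horizon: 5 s
print(f"long horizon: {long / 3600:.1f} h")  # long horizon: 55.6 h
```

The same capacity gives a window of seconds for verbose tracing and of days for a sparse, long-running scenario.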

Interesting Applications of Flight Recorders

Flight Recorders are an extremely interesting and fun concept. As you gain proficiency in using them, you'll find applications everywhere. They offer a unique way to capture and analyze telemetry data, providing insights that are not possible with traditional logging methods.

Below are some of our favorite applications:

Recording Information leading into a process crash

Imagine a long-running Flight Recorder that collects 100 times the logging that would normally stream through something similar to OpenTelemetry.

This logging would be very verbose, containing information like function entry and exit, web requests, queue lengths, open file pointers, file indexes, and so on.

Normally, this type of information would clutter up a backend database and be useless in most contexts.

However, when collected only on a process crash, this information is inexpensive enough to keep and can significantly boost productivity.

A Flight Recorder like this is not free; the logs still have to be written into a circular buffer, which costs CPU. But when done well, for example using something similar to ETW on Windows or user_events on Linux, this CPU load can be very low compared to other techniques.

When the process crashes, this log can be collected and will serve as a set of breadcrumbs leading to that process crash.

Pretty fantastic.
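The crash scenario can be approximated in a few lines of Python. This is a sketch under stated assumptions: an unhandled exception stands in for the process crash, the list `collected` stands in for an upload to a backend, and the names `trace` and `run_with_flight_recorder` are hypothetical. A real native crash would need an out-of-process collector, which this example does not attempt to show.

```python
from collections import deque

breadcrumbs = deque(maxlen=1000)  # verbose, local-only trace
collected = []                    # stand-in for an upload to a backend


def trace(message):
    breadcrumbs.append(message)


def run_with_flight_recorder(work):
    """Run work(); if it raises, egress the breadcrumbs, then re-raise."""
    try:
        return work()
    except Exception as exc:
        collected.extend(breadcrumbs)
        collected.append(f"crash: {exc!r}")
        raise


def handle_requests():
    for request_id in range(5000):
        trace(f"enter handle request={request_id}")
        if request_id == 4321:
            raise RuntimeError("queue overflow")
        trace(f"exit handle request={request_id}")


try:
    run_with_flight_recorder(handle_requests)
except RuntimeError:
    pass  # a real process would terminate here

print(len(collected), collected[-2], collected[-1])
# 1001 enter handle request=4321 crash: RuntimeError('queue overflow')
```

Thousands of trace records were written, yet only the final 1000 plus the crash marker ever leave the buffer.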

This approach also has a positive impact on the developer's mindset. Developers often struggle with the need to suppress logging messages due to cost, security, and privacy concerns imposed by business and finance teams.

With Flight Recorders available, developers can feel reassured. Knowing that, in the event of a process crash, they will have access to the critical logs leading up to the incident alleviates their concerns and allows them to focus on more productive tasks.

Tracking Memory Leaks

Even in managed languages, determining why memory is being consumed can be a complicated matter. We've all encountered a linked list that holds a pointer to memory and doesn't shrink as it should. While some may argue whether this constitutes a memory leak, the system inevitably starts to slow down and thrash as memory pressure exceeds the hardware's capabilities.

A powerful use of a Flight Recorder is to track insertions into and deletions from such a list, add-reference and release operations, or malloc() and free() calls in unmanaged languages.

This is achieved either through standard logging or by inserting a dynamic probe on the malloc() and free() calls.

By using a probe that indicates the amount of memory load on a machine, coupled with an action to collect this type of memory Flight Recorder, a developer can obtain a high-fidelity glimpse into the machine's memory usage over a long duration without negatively impacting performance.
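The following Python sketch simulates that combination of a memory-load probe and a collecting Action. It is illustrative only: the `MemoryFlightRecorder` class, its `on_alloc` and `on_free` hooks, the `unfreed_by_site` helper, and the 1000-byte threshold are all assumptions, and the allocations are simulated rather than intercepted from a real allocator.

```python
from collections import deque


class MemoryFlightRecorder:
    """Records alloc/free events locally; collects them once under memory pressure."""

    def __init__(self, capacity, threshold_bytes):
        self.events = deque(maxlen=capacity)
        self.live = {}        # address -> size of outstanding allocations
        self.live_bytes = 0   # running total used as the memory-load probe
        self.threshold_bytes = threshold_bytes
        self.collected = None

    def on_alloc(self, address, size, site):
        self.events.append(("alloc", address, size, site))
        self.live[address] = size
        self.live_bytes += size
        # Probe + Action: snapshot the recorder the first time the limit is crossed.
        if self.collected is None and self.live_bytes > self.threshold_bytes:
            self.collected = list(self.events)

    def on_free(self, address):
        self.events.append(("free", address, 0, None))
        self.live_bytes -= self.live.pop(address, 0)


def unfreed_by_site(snapshot):
    """Sum bytes allocated but not freed within the snapshot, per call site."""
    pending = {}
    for kind, address, size, site in snapshot:
        if kind == "alloc":
            pending[address] = (size, site)
        else:
            pending.pop(address, None)
    totals = {}
    for size, site in pending.values():
        totals[site] = totals.get(site, 0) + size
    return totals


recorder = MemoryFlightRecorder(capacity=10_000, threshold_bytes=1_000)

# Simulated workload: "parse" always frees, "cache_insert" never does.
for i in range(100):
    recorder.on_alloc(address=2 * i, size=8, site="parse")
    recorder.on_free(address=2 * i)
    recorder.on_alloc(address=2 * i + 1, size=16, site="cache_insert")

print(unfreed_by_site(recorder.collected))  # {'cache_insert': 1008}
```

The collected snapshot points straight at the call site whose allocations are never released, without any of the events having been streamed beforehand.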

Best of all, once the project is complete and the memory leak is understood, Dynamic Telemetry can disable all of this logging, including the Flight Recorder, allowing the machines to operate at high speed.