The DTrace wait provider

Lucian Carata

February 24, 2018

Introduction

This is a stateful provider, that is, a provider that maintains relevant state across multiple probe firings. It exposes details about the waiting that happens at all levels of the software stack (and, in the future, across activities spanning multiple machines). The data should be detailed enough to be useful in understanding performance variation, for example by characterising the resource contention and workload interference issues that can appear while multiple kernel subsystems interact.

Think of it as allowing you to set probes such as:

wait_queue:NFS:*:summary {...}
wait_state:TCP:t1:summary /p=="NFS" && state=="ESTABLISHED"/ {...}

We’ll discuss more about the types of probes and the data exposed by each of them, but for now it is sufficient to say that:

The results should be accessible in a configurable way: the consumer (the person or piece of software that uses DTrace to collect or store information about the system at runtime) needs to be able to decide the granularity at which measurements are taken, since this directly affects the sustained overhead and the resulting probe effect. Similarly, at least some of the probes exposed by the provider should let the consumer investigate aggregated, higher-level data without being exposed to the details of individual kernel functions being executed:

wait_sequence:NFS:*:all { /* aggregate data available */ }

Probe format and stability

In DTrace, the human-readable name of each probe has four parts and a fixed format.

    provider:module:function:name 

Each of those components has well-defined semantics for the wait provider, as described in the “Mapping” column of the table below.

Element   Mapping            Name stability  Data stability  Dependency class
Provider  wait_[probe type]  Evolving        Unstable        Common
Module    subsys             Unstable        Unstable        Unknown
Function  tracking_id        Unstable        Unstable        Common
Name      type-specific      Unstable        Unstable        Unknown

Table 1: Provider mapping of probe elements together with their stability
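
The mapping in Table 1 can be illustrated with a small sketch that splits a probe description into its four elements and checks it against a pattern. This is not part of the provider or of DTrace itself: the helper names (ProbeDescription, parse_probe, probe_type, matches) and the queue id q7 are hypothetical, and Python's fnmatch is used as a stand-in for DTrace's own pattern matching. Empty elements are treated as wildcards, as in DTrace probe descriptions.

```python
from fnmatch import fnmatchcase
from typing import NamedTuple


class ProbeDescription(NamedTuple):
    provider: str  # wait_[probe type], e.g. "wait_queue"
    module: str    # subsystem, e.g. "NFS"
    function: str  # tracking id, e.g. a queue id
    name: str      # type-specific, e.g. "summary"


def parse_probe(description: str) -> ProbeDescription:
    """Split 'provider:module:function:name' into its four elements."""
    parts = description.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated elements: {description!r}")
    return ProbeDescription(*parts)


def probe_type(probe: ProbeDescription) -> str:
    """Return the [probe type] suffix of a wait_[probe type] provider."""
    prefix = "wait_"
    if not probe.provider.startswith(prefix):
        raise ValueError(f"not a wait provider: {probe.provider!r}")
    return probe.provider[len(prefix):]


def matches(pattern: str, concrete: str) -> bool:
    """True if every element of `concrete` matches the glob in `pattern`.

    An empty pattern element matches anything.
    """
    return all(
        glob == "" or fnmatchcase(value, glob)
        for glob, value in zip(parse_probe(pattern), parse_probe(concrete))
    )
```

With these helpers, matches("wait_queue:NFS:*:summary", "wait_queue:NFS:q7:summary") holds, while the same pattern does not match a probe in the TCP module.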

Types of probes

There are two large categories of probes: wait measurement probes, which measure wait metrics specific to particular types of waiting, and wait combinator probes, which are responsible for storing and aggregating data from multiple wait measurement probes.

Measurement probes

  1. wait_queue

    This is a type of probe specialised in tracking metrics related to queueing. It is able to record state for individual items that waited in a queue, as well as general statistics about the resource servicing the queue (like saturation and utilisation).

    For this type of probe, the provider exposes the following names: enqueue, dequeue, summary

    wait_queue:NFS:[Qid]:summary { 
      //aggregate data available as arguments 
    }

    Here, Qid is the id given for a particular queue being tracked. In combination with the module element, this can be used to track activity for all queues in a given kernel subsystem:

    wait_queue:NFS:*:summary { ... }

    TODO(lc525): arguments available

  2. wait_state

    This type of probe specialises in tracking the waiting done through state machine-like abstractions: for example, one may want to determine the latency between the LISTEN and ESTABLISHED states of a TCP connection, or the time spent in the TIME-WAIT state.

    The provider exposes the following names:

          transition_to_[state-name], 
          between_[state-1]_and_[state-2], 
          summary

    For example,

    wait_state:TCP:[Sid]:between_TIME-WAIT_and_CLOSED

    Here, Sid is the id of the state machine being tracked.

    TODO(lc525): arguments available

  3. wait_return

    This type of probe tracks the wait for an activity to complete (either blocking or non-blocking). It can be used in a generic way when specialised probe types are not available in a given subsystem.

    The provider exposes the following names:

         start,
         end

    The only data available, passed as an argument to the end probe, is the duration of the wait:

    wait_return:*:[Wid]:end

    Here, Wid represents the id/name of the place where waiting happens. If that is on a blocking function, the name of the function is used:

    wait_return:*:read:end

    TODO(lc525): arguments available
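
As a rough illustration of the state that the wait_queue and wait_state probes described above would need to maintain, the sketch below models both in plain Python. It is an assumption about one possible bookkeeping scheme, not the provider's implementation: the class names (QueueTracker, StateTracker), the method names and the fields of the summary are all hypothetical, and resource-level statistics such as saturation and utilisation are left out.

```python
from collections import defaultdict


class QueueTracker:
    """State a wait_queue probe could keep for one queue (hypothetical)."""

    def __init__(self):
        self.enqueued_at = {}  # item id -> enqueue timestamp
        self.waits = []        # completed per-item waiting times

    def enqueue(self, item, now):
        self.enqueued_at[item] = now

    def dequeue(self, item, now):
        # The wait of an item is the time between its enqueue and dequeue.
        self.waits.append(now - self.enqueued_at.pop(item))

    def summary(self):
        count = len(self.waits)
        return {
            "count": count,
            "mean_wait": sum(self.waits) / count if count else 0.0,
            "max_wait": max(self.waits, default=0.0),
            "still_queued": len(self.enqueued_at),
        }


class StateTracker:
    """State a wait_state probe could keep for one state machine (hypothetical)."""

    def __init__(self, initial, now):
        self.state = initial
        self.entered_at = {initial: now}   # state -> most recent entry time
        self.time_in = defaultdict(float)  # state -> total time spent in it

    def transition_to(self, new_state, now):
        # Account for the time spent in the state being left.
        self.time_in[self.state] += now - self.entered_at[self.state]
        self.state = new_state
        self.entered_at[new_state] = now

    def between(self, first, second):
        """Latency from last entering `first` to last entering `second`."""
        return self.entered_at[second] - self.entered_at[first]
```

For instance, a StateTracker started in LISTEN at time 0 that moves to SYN-RECEIVED at time 3 and to ESTABLISHED at time 5 reports between("LISTEN", "ESTABLISHED") as 5. A wait_return probe needs even less state: a start timestamp per Wid, subtracted from the timestamp of the matching end.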

Combinator probes

  1. wait_sequence

    This type of probe is a state aggregator for sequences of waits. It is fired by the provider to allow consumers to read data collected from a number of measurement probes.

    The simple example below models at a coarse granularity the waiting that process A does when executing a blocking read() call. In order for the call to return, a sequence of two waits needs to be completed, in a specific order: (1) the I/O subsystem will wait for the disk data to be read (details captured by a wait_data probe); then, A is marked as RUNNABLE and waits (2) in the run queue of the scheduler to continue execution (details captured by a wait_queue probe).

    The combinator probe would be defined like this:

    wait_sequence:*:pid:read /.../{...}

    It aggregates data from all the waits happening between a call to read() and its return, for a given pid.

Example where tracking data about a sequence of waits is necessary: what is the \(\Delta\) between end of (1) and end of (2)?
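
The question in this example can be made concrete with a short sketch of what a sequence combinator might store. It assumes that each measurement probe reports a start and an end timestamp for its wait; the names WaitRecord, WaitSequence and delta_between_ends are hypothetical and only illustrate the idea of keeping ordered waits per pid and call.

```python
from dataclasses import dataclass, field


@dataclass
class WaitRecord:
    source: str   # measurement probe that captured the wait
    start: float  # timestamp at which the wait began
    end: float    # timestamp at which the wait completed

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class WaitSequence:
    """Ordered waits observed between a call and its return for one pid."""
    pid: int
    call: str
    waits: list = field(default_factory=list)

    def record(self, wait: WaitRecord) -> None:
        # A sequence is ordered: each wait must complete after the previous one.
        if self.waits and wait.end < self.waits[-1].end:
            raise ValueError("waits must be recorded in completion order")
        self.waits.append(wait)

    def total_wait(self) -> float:
        return sum(w.duration for w in self.waits)

    def delta_between_ends(self, first: int, second: int) -> float:
        """Time from the end of wait `first` to the end of wait `second`."""
        return self.waits[second].end - self.waits[first].end
```

For the read() example, recording wait (1) as lasting from time 0 to 8 and wait (2) from 8 to 11 gives delta_between_ends(0, 1) equal to 3: the time the process spent in the run queue after its data became available.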

  2. wait_dag

  3. wait_critical

Sample scripts

First, the measurement probes of interest need to be enabled, specifying what data should be collected:

// - Does low-overhead Off-CPU analysis without stack sampling
// - Off-CPU analysis doesn't tell you anything about things waiting
//   in a queue != runqueue
//
// The example here tracks the delta latency between the data
// becoming available on a read and the reading process resuming execution

// Enable tracking of particular types of wait;
// You can enable tracking for all waits by specifying
//   wait_any:*:*:summary {}

wait_queue:sched:run_queue:summary {}
wait_queue:TCP:dev_queue:summary {}
wait_return:VFS:*:end {}

Printing and using captured data

Second, the scripts should retrieve the data through the available combinator probes. These hold the state stored for the waits recorded by the probes enabled above, along with extra data about their interactions.

Implementation

Notes