# Introduction

This is a stateful provider (a provider that maintains relevant state across multiple probe firings) exposing details about the waiting that happens at all levels of the software stack (future: also across activities spanning multiple machines). The data should be detailed enough to be useful in understanding performance variation, for example by characterising the resource contention and workload interference issues that can appear when multiple kernel subsystems interact.

Think of it as allowing you to set probes such as:

wait_queue:NFS:*:summary {...}
wait_state:TCP:t1:summary /p=="NFS" && state=="ESTABLISHED"/ {...}

We’ll discuss more about the types of probes and the data exposed by each of them, but for now it is sufficient to say that:

• the first probe tracks how much time is spent waiting in queues of the NFS subsystem (for example, waiting on other NFS RPCs to complete);
• the second probe looks at how long it takes for TCP connections initiated by NFS to reach the ESTABLISHED state (t1 being the name of the TCP state machine being tracked).

The results should be accessible in a configurable way: the consumer (the person or piece of software that uses DTrace to collect/store information about the system at runtime) needs to be able to make decisions about the granularity at which measurements are taken, with a direct impact on the sustained overhead and the resulting probe effect. Similarly, at least some of the probes exposed by the provider should allow the consumer to investigate aggregated, higher-level data without being exposed to the details of individual kernel functions being executed:

wait_sequence:NFS:*:all { /* aggregate data available */ }

# Probe format and stability

In DTrace, the human-readable name of each probe has four parts and a fixed format.

    provider:module:function:name 

Each of those components has well-defined semantics for the wait provider, as described in the “mapping” column of the table below.

• we expose multiple provider names, depending on the type of waiting being tracked; the probe types are enumerated and discussed in detail below;
• subsys refers to the kernel subsystem targeted by the probe; this can be * for “any”;
• tracking_id refers to a probe-type-dependent id representing the place where waiting happens. For example, wait probes specialised in tracking queues will have an ID for each queue;
• the names exposed by the provider vary depending on the probe type.
| Element  | Mapping           | Name stability | Data stability | Dependency class |
|----------|-------------------|----------------|----------------|------------------|
| Provider | wait_[probe type] | Evolving       | Unstable       | Common           |
| Module   | subsys            | Unstable       | Unstable       | Unknown          |
| Function | tracking_id       | Unstable       | Unstable       | Common           |
| Name     | type-specific     | Unstable       | Unstable       | Unknown          |

Table 1: Provider mapping of probe elements together with their stability

# Types of probes

There are two large categories of probes: wait measurement probes, dealing with the measurement of wait metrics specific to particular types of waiting, and wait combinator probes, responsible for the storage and aggregation of data from multiple wait measurement probes.

## Measurement probes

1. wait_queue

This is a type of probe specialised in tracking metrics related to queueing. It is able to record state for individual items that waited in a queue, as well as general statistics about the resource servicing the queue (like saturation and utilisation).

For this type of probe, the provider exposes the following names: enqueue, dequeue, summary

wait_queue:NFS:[Qid]:summary {
    // aggregate data available as arguments
}

Here, Qid is the id given for a particular queue being tracked. In combination with the module element, this can be used to track activity for all queues in a given kernel subsystem:

wait_queue:NFS:*:summary { ... }
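To make the bookkeeping behind a wait_queue probe concrete, here is a minimal Python sketch of the state the provider would need to keep per queue (all names are illustrative; the real state lives kernel-side): a timestamp is recorded on enqueue, per-item latency is computed on dequeue, and the summary probe exposes aggregates over the completed waits.

```python
# Illustrative sketch of per-queue state for a wait_queue probe.
# Names are hypothetical; the real provider maintains this in the kernel.

class QueueTracker:
    def __init__(self):
        self.enqueue_ts = {}   # item id -> enqueue timestamp (ns)
        self.latencies = []    # completed per-item wait times (ns)

    def enqueue(self, item_id, now_ns):
        # fired on the enqueue probe: remember when the item started waiting
        self.enqueue_ts[item_id] = now_ns

    def dequeue(self, item_id, now_ns):
        # fired on the dequeue probe: latency is the time spent in the queue
        start = self.enqueue_ts.pop(item_id, None)
        if start is not None:
            self.latencies.append(now_ns - start)

    def summary(self):
        # the kind of aggregate data the summary probe would expose
        n = len(self.latencies)
        return {
            "count": n,
            "avg_ns": sum(self.latencies) / n if n else 0,
            "max_ns": max(self.latencies, default=0),
        }

q = QueueTracker()
q.enqueue("rpc-1", 100)
q.enqueue("rpc-2", 150)
q.dequeue("rpc-1", 400)  # waited 300 ns
q.dequeue("rpc-2", 250)  # waited 100 ns
print(q.summary())       # {'count': 2, 'avg_ns': 200.0, 'max_ns': 300}
```

Saturation and utilisation of the servicing resource could be derived from the same event stream (queue depth over time), at the cost of extra per-event state.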

TODO(lc525): arguments available

2. wait_state

Type of probe specialised in tracking the waiting done through state machine-like abstractions: for example, one may want to determine the latency between the LISTEN and ESTABLISHED states of a TCP connection, or the time spent in the TIME-WAIT state.

The provider exposes the following names:

    transition_to_[state-name],
    between_[state-1]_and_[state-2],
    summary

For example,

wait_state:TCP:[Sid]:between_TIME-WAIT_and_CLOSE

where Sid is the id of the state machine being tracked.
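The state a wait_state probe needs is essentially a timestamp per entered state; the between_* names then report differences between those timestamps. A hypothetical Python sketch of that bookkeeping (timestamps and state names are made up for illustration):

```python
# Illustrative sketch of the per-state-machine bookkeeping behind wait_state.

class StateTracker:
    def __init__(self):
        self.entered = {}  # state name -> timestamp (ns) when it was entered

    def transition_to(self, state, now_ns):
        # fired on each transition_to_[state-name] probe
        self.entered[state] = now_ns

    def between(self, s1, s2):
        # latency reported by the between_[state-1]_and_[state-2] probe
        return self.entered[s2] - self.entered[s1]

tcp = StateTracker()
tcp.transition_to("LISTEN", 1_000)
tcp.transition_to("ESTABLISHED", 4_500)
tcp.transition_to("TIME-WAIT", 9_000)
tcp.transition_to("CLOSE", 129_000)
print(tcp.between("LISTEN", "ESTABLISHED"))  # 3500
print(tcp.between("TIME-WAIT", "CLOSE"))     # 120000
```

This only tracks the most recent entry into each state; a real provider would have to decide how to treat state machines that revisit states.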

TODO(lc525): arguments available

3. wait_return

Type of probe that waits for activity to complete (either blocking or non-blocking). This can be used in a generic way when specialised probe types are not available in a given subsystem.

The provider exposes the following names:

    start,
    end

The only data available, passed as an argument to the end probe, is the duration of the wait:

wait_return:*:[Wid]:end

Here, Wid represents the id/name of the place where waiting happens. If that is on a blocking function, the name of the function is used:

wait_return:*:read:end

TODO(lc525): arguments available

## Combinator probes

1. wait_sequence

This type of probe is a state aggregator for sequences of waits. It is fired by the provider to allow consumers to read data collected from a number of measurement probes.

The simple example below models at a coarse granularity the waiting that process A does when executing a blocking read() call. In order for the call to return, a sequence of two waits needs to be completed, in a specific order: (1) the I/O subsystem will wait for the disk data to be read (details captured by a wait_data probe); then, A is marked as RUNNABLE and waits (2) in the run queue of the scheduler to continue execution (details captured by a wait_queue probe).

The combinator probe would be defined like this:

wait_sequence:*:pid:read /.../{...}

It will aggregate data from all the waits happening between a call to read() and its return, for a given pid.

2. wait_dag

3. wait_critical

# Sample scripts

First, the measurement probes of interest need to be enabled, specifying what data should be collected:

// - Does low-overhead Off-CPU analysis without stack sampling
// - Off-CPU analysis doesn't tell you anything about things waiting
//   in a queue != runqueue
//
// The example here is tracking the delta latency between the data
// becoming available on a read

// Enable tracking of particular types of wait;
// You can enable tracking for all waits by specifying
//   wait_any:*:*:summary {}

wait_queue:sched:run_queue:summary {}
wait_queue:TCP:dev_queue:summary {}
wait_return:VFS:*:summary {}

## Printing and using captured data

Second, the scripts should retrieve the data through the available combinator probes. Those have stored state for the waits that happened due to the probes enabled above, and have extra data about their interactions.

• The first example prints the details of waiting on a read() syscall for the process with PID 2096; arg[0] retains the state associated with the current wait_sequence. Out of that state, we extract the time delta between the end of the first wait and the end of the second by calling sequence(arg[0], delta, 1, 2):

// Trace the starting point of a wait
// provider   : subsys: pid  : name
wait_sequence: *     : 2096 :read {
    printf("wait after ready: %d ms", sequence(arg[0], delta, 1, 2));
}

Possible output (2 reads, might want to do aggregation across them):

pid 2096 wait read
VFS    aggregate
latency: 200 ms
SCHED  run_queue
latency: 20 ms,
requeue: 0,
top_3_wait_for: 1034, 956, 154

VFS    aggregate
latency: 800 ms
SCHED  run_queue
latency: 400 ms,
requeue: 0,
top_3_wait_for: 1035, 550, 569
wait after ready: 390 ms
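One way to read sequence(arg[0], delta, 1, 2) is as an accessor over the per-sequence state: given the end timestamps recorded for each wait in the sequence, it returns the difference between the ends of waits 1 and 2. A hypothetical Python model of that accessor (the timestamps below are invented to match the second read in the sample output):

```python
# Illustrative model of the sequence(state, delta, i, j) accessor.
# wait_ends stands in for the per-sequence state kept by the provider.

def sequence_delta(wait_ends, i, j):
    # wait_ends maps wait index (1-based) -> end timestamp (ms)
    return wait_ends[j] - wait_ends[i]

# Second read above: the VFS wait ends, then the run-queue wait ends
# 390 ms later ("wait after ready: 390 ms").
wait_ends = {1: 800, 2: 1190}
print(sequence_delta(wait_ends, 1, 2))  # 390
```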
• The second example looks at how wait results can be quantized:

wait_sequence: *     : 2096 :write {
    @distribution = quantize(sequence(arg[0], delta, 1, 2));
}

with possible output:

pid 2096 wait write
seq_delta(1,2) ------ Distribution --------- count
0  |                                         0
1  |                                         2
2  |@@                                       296
4  |@@@@@@@@@@@                              1879
8  |@@@@@@@@@@@@@@@@                         2719
16 |@@@@@@@@@@@                              1974
32 |                                         69
64 |                                         2
128|                                         0
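In the output above, quantize() buckets values into power-of-two ranges: a row labelled 8 counts values v with 8 <= v < 16. A rough Python equivalent of that bucketing for positive values (not the real DTrace implementation, which also keeps negative power-of-two buckets):

```python
# Rough sketch of DTrace quantize() bucketing for positive values.
from collections import Counter

def quantize_bucket(value):
    # place value v into the power-of-two bucket b such that b <= v < 2*b;
    # zero and negative values collapse into the 0 bucket here
    if value <= 0:
        return 0
    b = 1
    while b * 2 <= value:
        b *= 2
    return b

hist = Counter(quantize_bucket(v) for v in [1, 3, 7, 8, 15, 16, 31, 100])
for bucket in sorted(hist):
    print(bucket, "|" + "@" * hist[bucket], hist[bucket])
```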
• The third example looks at aggregating by the sequence of waits being performed:

wait_sequence: *     : 534  :read {
    @wait[sequence(arg[0])] = count();
}

Possible output:

pid 534 wait read
wait_VFS, wait_sched_runqueue          50
wait_TCP_queue, wait_sched_runqueue    764
• Combining aggregation by sequence of waits and quantization:

wait_sequence: *     : 534  :read {
    @wait[sequence(arg[0])] = quantize(sequence(arg[0], delta, 1, 2));
}

May yield an output such as:

pid 534 wait read
wait_VFS, wait_sched_runqueue
seq_count ---------- Distribution ---------- count
0  |                                         0
1  |                                         2
2  |@@                                       296
4  |@@@@@                                    1879
8  |                                         15
16 |@@@@@@@@@@@                              5096
32 |@                                        69
64 |                                         2
128|                                         0

wait_TCP_queue, wait_sched_runqueue
seq_count ---------- Distribution ---------- count
0  |                                         0
1  |                                         2
2  |@@@@@@@@@@@@@@@@                         2719
4  |@@                                       296
8  |@@@@@@@@@@@                              1879
16 |@@@@@@@@@@@                              1974
32 |                                         69
64 |                                         2
128|                                         0


# Implementation

## Notes

• data structure to hold “histogram of sequences”
• (discussion with ds815): what about things that are not a wait but can behave like one from the perspective of a user? (fast-path/slow-path or cache hit/miss)
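On the first note, one candidate shape for the “histogram of sequences” structure, sketched in Python under the assumption of power-of-two bucketing: a map keyed by the ordered tuple of wait sites, each entry holding a per-bucket counter, mirroring the combined @wait[sequence(...)] = quantize(...) aggregation from the sample scripts.

```python
# Hypothetical sketch of a "histogram of sequences" data structure:
# one latency histogram per distinct ordered sequence of wait sites.
from collections import defaultdict, Counter

class SequenceHistogram:
    def __init__(self):
        self.hist = defaultdict(Counter)  # (site, ...) -> bucket -> count

    def record(self, sites, value):
        # bucket = largest power of two <= value (values assumed positive)
        bucket = 1
        while bucket * 2 <= value:
            bucket *= 2
        self.hist[tuple(sites)][bucket] += 1

h = SequenceHistogram()
h.record(["wait_VFS", "wait_sched_runqueue"], 5)
h.record(["wait_VFS", "wait_sched_runqueue"], 6)
h.record(["wait_TCP_queue", "wait_sched_runqueue"], 17)
print(dict(h.hist[("wait_VFS", "wait_sched_runqueue")]))        # {4: 2}
print(dict(h.hist[("wait_TCP_queue", "wait_sched_runqueue")]))  # {16: 1}
```

A kernel implementation would want fixed-size storage (bounded bucket arrays and a bounded table of sequences) rather than dynamically growing maps.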