Tuesday, 26 July 2022

Intel Processor Trace (IPT) under Windows

My interest in IPT under Windows has been piqued more than once but, until now, my judgement of the effort versus benefit of interpreting IPT data tended to a “not worth it” decision: identifying and displaying the individual IPT packets seemed straightforward enough but interpreting the Taken/Not-taken (TNT) bits would require a disassembler and the binary code and load addresses of all the executable files that might have been executed. Mentions of IPT tracing and trace analysis found in the Web often mention very large volumes of trace data and very long trace analysis times.

The event that prompted a more detailed consideration of the effort required to develop a simple (and therefore slow) tool to analyse very short IPT traces was the discovery that Windows Performance Recorder (WPR) can be configured to cause “IPT” events to be generated as an accompaniment to kernel events. The Windows Performance Recorder Profile (WPRP) schema (https://docs.microsoft.com/en-us/windows-hardware/test/wpt/wprcontrolprofiles-schema) includes a description of the “HardwareCounter” element which can contain elements such as “LastBranch” (for Last Branch Recording (LBR)), “Counters” (for capturing Performance Monitoring Counters (PMC)) and “ProcessorTrace” (for Intel Processor Tracing (IPT)).

The configuration of ProcessorTrace is simple: one just has to specify three items:

  1. The “CodeMode” for the trace (user, kernel or user plus kernel).
  2. The (maximum) “BufferSize” of the IPT trace data (chosen from 4, 8, 16 or 32 kilobytes).
  3. A list of the kernel “Events” that cause a corresponding IPT event to be generated. The events can be chosen from the “SystemStackEnumeration” which is the list of event names that can be used in configuring stack traces for the system provider.

One thing that is missing is an equivalent of the “CustomStack” element for defining custom events, since not all of the kernel events are included in SystemStackEnumeration. A sample configuration could look like this:

<HardwareCounter Id="Perf">

  <ProcessorTrace>

    <BufferSize Value="4" />

    <CodeMode Value="Kernel" />

    <Events>

      <Event Value="SystemCallExit" />

    </Events>

  </ProcessorTrace>

</HardwareCounter>

 

Starting “small” (smallest buffer size and a single code mode) might be advisable until one has acquired experience in the analysis and interpretation of the data.

Out-of-the-box, there are no MOF classes describing most of the performance events; I added the following definitions for IPT on my system with the “mofcomp” utility:

[dynamic: ToInstance, Guid("{ff1fd2fd-6008-42bb-9e75-00a20051f3be}"), EventVersion(2), DisplayName("IntelProcessorTrace")]

class IPT_V2 : MSNT_SystemTrace
{
}; 

[dynamic: ToInstance, EventType{32}, EventTypeName{"ProcessorTrace"}]
class IPT_Event : IPT_V2
{
    [WmiDataId(1), read] uint64 EventTimeStamp;
    [WmiDataId(2), read] uint32 Process;
    [WmiDataId(3), read] uint32 Thread;
    [WmiDataId(4), read, format("x")] uint64 IptOption;
    [WmiDataId(5), read] uint32 TraceSize;
    [WmiDataId(6), read] uint32 TracePosition;
};

The actual trace data follows immediately after this header; I could not think of a way to include the variable length array of bytes in the MOF class definition.

The first three members are identical in meaning to a kernel stack trace event. A type for the IptOption value is available as a public type in the ipt.sys driver; Windows debuggers display it thus:

0:000> dt ipt!_IPT_OPTION
   +0x000 TraceMode        : Pos 0, 4 Bits
   +0x000 TimeMode         : Pos 4, 4 Bits
   +0x000 MTCFreq          : Pos 8, 4 Bits
   +0x000 CycThresh        : Pos 12, 4 Bits
   +0x000 BufferSize       : Pos 16, 4 Bits
   +0x000 TraceSessionMode : Pos 20, 3 Bits
   +0x000 TraceChild       : Pos 23, 1 Bit
   +0x000 TraceCodeMode    : Pos 24, 4 Bits
   +0x000 Reserved2        : Pos 28, 4 Bits
   +0x004 Reserved3        : Uint4B
   +0x000 Value            : Uint8B 

The TraceSize is the size of the trace data; if the size is less than the configured size, then the “entire” trace is available (all trace data from the last context switch until the triggering event occurred). If TraceSize is equal to the configured “BufferSize” then the trace has probably wrapped and “TracePosition” is the point in the (circular) buffer at which the next packet would have been written; one has to search the buffer in a circular fashion for a PSB (Packet Stream Boundary) packet, starting from the TracePosition.

By including “Loader”keyword events in a WPR trace (which enables loaded modules to be identified, along with their load address), one seems “in good shape” to interpret the IPT trace.

A disassembler is needed to interpret the IPT trace and fortunately one is readily available: the one used by the Windows debuggers, namely the Disassemble method of the IDebugControl interface. The disassembler is needed to identify relevant instructions (e.g. conditional branches) and instruction lengths. The “Disassemble” method does much more than this, formatting the instruction as a string and performing symbol look-up for memory references, so it is slow but it does the job and obviates the need to develop a purpose oriented replacement.

In a typical trace, code from many executable files may appear and the IDebugClient/IDebugControl interfaces is probably not well suited to simultaneously opening several separate executable files. “Fortunately”, I encountered another problem with this approach and the same “solution” resolved both problems.

This code is taken from the executable file; because it contains an indirect control transfer, a TIP (Target IP) would be needed in the IPT trace:

tcpip!TcpDeliverDataToClient+0x119:
call qword ptr [tcpip!_imp_KeAcquireSpinLockAtDpcLevel (00000001`c02331e8)]
nop  dword ptr [rax+rax]
cmp  r14d,0C000021Bh 

However no TIP was present and it turned out that the code in memory that was actually executed looks like this (direct control transfer):

tcpip!TcpDeliverDataToClient+0x119:
mov  r10,qword ptr [tcpip!_imp_KeAcquireSpinLockAtDpcLevel (fffff803`590d31e8)]
call nt!KeAcquireSpinLockAtDpcLevel (fffff803`534a2490)
cmp  r14d,0C000021Bh 

Import Optimization (https://techcommunity.microsoft.com/t5/windows-kernel-internals-blog/mitigating-spectre-variant-2-with-retpoline-on-windows/ba-p/295618) had been applied when building the executable and, whilst there is obviously sufficient metadata in the executable file to recognize and emulate the code modifications, it would be difficult to integrate this into the simple use of the disassembler.

The “solution” was to use a “dump” of the process (or a “live dump” of the kernel) to perform the analysis. This simplifies many things but also means that a standalone ETW (Event Tracing for Windows) file is not enough for an analysis with my simple tool (a dump is needed too).

The “conciseness” of the IPT trace data means that it is not easy to “check” whether an analysis is proceeding correctly. One of my many mistakes was in incorrectly handling “Indirect Transfer Compression for Returns” (the uncompressed cases), but “RET compression” was a big help in identifying problems: if a RET was compressed, then the Taken/Not-taken bit should be set and if it is clear then one knows that something has gone wrong. Another hint is if the “recorded” code path does not seem plausible; this is not always easy to judge, but I often found that my tool was analyzing the routine “KeBugCheckEx” – something that had patently not happened during the trace capture.

My “use case” for IPT tracing is as an additional aid in debugging/troubleshooting tricky/interesting problems. For this type of tracing to be useful, one needs to identify kernel events that occur after the code of interest has been executed and whose IPT data might include the path taken. The limited set of events in SystemStackEnumeration (lacking, for example, network events) is a hindrance, but the undocumented API to set additional custom events is relatively easy to deduce. IPT trace data attached to the “CSwitch” event is often useful; some traces are very short (a context switch from idle) and are useful for testing the TNT interpreter and some others are useful “backstops” for data gathering (especially if the context switch is the result of a natural “break” in execution, such as entering a wait state).

\Device\IPT IOCTL Interface

IPT can be used separately from ETW: the ipt.sys driver makes certain IPT operations available via an IOCTL style interface. This interface is not documented, but the ipt.sys driver is small and “straightforward”, so many of the features of the interface can be deduced. As someone who is almost exclusively interested in short IPT, it is a relief that the interface supports some of the IPT filtering mechanisms – most importantly filtering by IP (Instruction Pointer).

The current (undocumented and probably still evolving) interface allows IPT tracing to be enabled for a process and IP filtering to be configured per thread. Once tracing has been enabled for a process (which enables tracing on all threads in the process), tracing of individual threads can be suspended and resumed and IP filtering can be applied to individual threads. Threads created after tracing has started inherit the tracing options set for the process but start without any IP filtering.

I am often interested in tracing the path through short sections of code in service processes, where the thread which will execute the code cannot easily be predicted and might even be a newly created thread. I wanted to avoid “invading” the process to be traced (by attaching a “debugger”), but that is the only standard way of being informed of (and partially controlling) thread creation in a process. Initially, I thought that this would be simple: just receive the debug events, apply the IPT IP filter to any newly created threads and then resume the target. However applying an IPT IP filter to a thread that has just been created and is paused at the create thread debug event has no effect – it is necessary to arrange for the thread to proceed to the “LdrInitializeThunk” routine before applying the filter.

Most of the IPT tracing configurable via IOCTL traces to circular buffers; these buffers can be large and, with judicious filtering, they might not need to wrap. There is one operation that writes the trace data directly to a file, ensuring a complete trace; this operation just traces the user mode behaviour of a process and does not support IP filtering.

Summary

I am often interested with problems related to networking (for example, a potential minor problem in the Windows Filtering Platform, described in an earlier posting) and the frequently used troubleshooting tools are event tracing (including network packet) and user-mode debugging. Kernel debugging is possible but I use it only very occasionally (partly because disturbing the timing of things in the debugger disturbs the whole evolution of the debugging scenario). IPT tracing will hopefully be useful, when it can be applied. Often the “transmit” side of communication occurs in a predictable process (and a process for which a “handle” can be obtained) and here process based tracing can be effectively deployed. However, the “receive” side can occur in any context/process and I hope that combining ETW and IPT will help there. There are also common scenarios where the “transmissions” originate from the “System” process (e.g. SMB traffic) and the IOCTL interface, which uses handles rather than process ids to identify the target, can’t be used there.