The starting point for my interest in this topic was the analysis of a DPC_WATCHDOG_VIOLATION bugcheck, reported in a Microsoft Q&A forum. The top of the stack at the time of the crash looked like this:
nt!KeBugCheckEx
nt!KeAccumulateTicks
nt!KiUpdateRunTime
nt!KiUpdateTime
nt!KeClockInterruptNotify
nt!HalpTimerClockInterrupt
nt!KiCallInterruptServiceRoutine
nt!KiInterruptSubDispatchNoLockNoEtw
nt!KiInterruptDispatchNoLockNoEtw
nt!KeFlushMultipleRangeTb
I recognized nt!KeFlushMultipleRangeTb as a
routine that uses the Inter-Processor Interrupt (IPI) mechanism to
communicate/coordinate with all of the other processors and a spin loop
(including the “pause” instruction) to wait for replies. The debugger “!ipi”
command showed that the processor that initiated the crash was waiting on IPI
replies from 20 of the 24 (logical) processors in the system. The debugger
“!running –I –t” command reported “Idle Processors: (000000000000080c)” and could only retrieve
stack traces from two processors (the processor that initiated the crash and
one of the processors that had responded to the IPI and was now executing the
idle loop):
amdppm!ReadIoMemRaw
amdppm!ReadGenAddr
amdppm!C2Idle
amdppm!AcpiCStateIdleExecute
nt!PpmIdleExecuteTransition
nt!PoIdle
nt!KiIdleLoop
The “CurrentThread” in the _KPRCB for all
except the crash initiating processor was the “IdleThread”. The “Idle
Processors: (000000000000080c)”
information identified the processors for which the “NextThread” in the _KPRCB
was null.
At this point, I wanted to know a bit more
about the “state” of the processors. Ultimately, the following command and
result provided the reassurance that I wanted:
10: kd> dt nt!_KNODE poi(nt!KeNodeBlock)
-y Deep
+0x040 DeepIdleSet : 0xfffbff
All of the processors, except the crash
initiating processor, were recorded as being in a “deep” idle state. The
processor (AMD Ryzen 9 5900X 12-Core Processor) uses the AMD
“chiplet” architecture, and power management is separated from the
processors. My conclusion was that some problem at a hardware level prevented
processors from being woken from a deep idle state and that the crash dump
probably did not contain any more information that would help to more precisely
identify the cause of the problem.
Extracting DPC/ISR events from the
in-memory buffers of the “Circular Kernel Context Logger” ETW session and
viewing them in Windows Performance Analyzer shows that all of the processors
(except processor 11) stopped being “active” at nearly the same time (probably
the first time the processors entered a deep idle state after the
low-level/hardware problem occurred):
During the search for a method of
identifying the idle state of processors, I investigated debugger extensions
and Event Tracing for Windows (ETW) providers to see what information was
available and how it might be retrieved from a crash dump. This topic seemed to
be poorly documented, so I record my findings below.
Processor Power Management Debugger Extensions
The “Debugging Tools for Windows” help file
(debugger.chm) lists just 5 processor power management commands (!ppmidle,
!ppmidleaccounting, !ppmperf, !ppmperfpolicy, !ppmstate) but does not provide
any information about them. The on-line Kernel-Mode
Extensions help lists 8 commands (as before plus !ppmidlepolicy, !ppmlpscenarios,
!ppmsettings), but only provides a terse, one sentence description of the
command. The functions exported from kdexts.dll add the commands !ppm,
!ppmcheck, !ppmhelp, !ppmlatency, !poperf, !platformilde and
!platformidleaccounting to the list of easily discoverable commands.
The commands are probably little used
outside of Microsoft, since most of them don’t work with just the public
symbols available from the Microsoft symbol server. Sometimes necessary
structures are simply not present in the public symbol file and sometimes the
structure is present but is referred to by kdexts.dll using a “typedef” name
(e.g. “nt!PPM_IDLE_STATES” rather than “nt!_PPM_IDLE_STATES”) which is not
present. By “debugging the debugger”, it is sometimes possible to insert a
missing underscore when necessary and thereby get an impression of the
information that could be displayed.
Even when, after some coaxing, information
is displayed, there is still sometimes reason to doubt its accuracy. For
example, part of the output of the “!ppmidleaccounting” command is information
about “IdleTime Buckets” such as:
0 to 1ms: 46086
1 to 2ms:
3608
2 to 4ms:
1650
4 to 8ms:
739
8 to 16ms:
659
16 to 32ms:
531
[…]
However the
PlatformAccountingBucketIntervalsRundown event from the
Microsoft-Windows-Kernel-Processor-Power provider (which takes its data from nt!PpmIdleIntervalLimits)
suggests that the bucket intervals are:
0 to 0.10ms
0.10 to 0.25ms
0.25 to 0.50ms
0.50 to 0.75ms
0.75 to 1ms
1.00 to 2ms
[…]
Processor Power Management Event Tracing
There are three providers that are
particularly relevant to Process Power Management (PPM) event tracing: the “System
Power Provider”, “Microsoft-Windows-Kernel-Processor-Power” and “Microsoft.Windows.Kernel.Timer”.
System Power Provider
The WPR system keyword “Power”
(SYSTEM_POWER_KW_GENERAL / PERF_POWER) contributes two events that are relevant
to PPM: “Perf State Change” and “Idle State Change”. The “Idle State Change”
event records changes between different idle “target” states (i.e. the idle
state to enter when switching from the running state); the events contribute to
the WPA “CPU Idle States” graph (for the “Target” state). The “Perf State
Change” event records changes in CPU frequency and forms the basis of the WPA
“CPU Frequency” graph; the “Microsoft-Windows-Kernel-Processor-Power” generates
a similar event (ProcessorPerfStateChange) but with more information at the
same time, except during “rundown”.
The WPR system keyword “IdleStates” (SYSTEM_POWER_KW_PROCESSOR_IDLE
/ PERF_PROCESSOR_IDLE) records transitions between running and idle states. The
idle state enter event contains a wealth of information including the target
C-state, the expected idle duration and wake reason and contextual information
(e.g. clock owner). The IdleStates events are the main contributor to the WPA
“CPU Idle States” graph (for the “Actual” state).
The actual mapping of idle state names to
numbers in the raw event data is C1 → 0, C2 → 1, C3 → 2, and so on. C0 is
the “running” state (i.e. not an idle state). In order to include the running
state in the “CPU Idle States” graph, one is added to each idle state number
and C0 (running) is assigned the value 0. The “aggregation” of the numeric
state values performed by WPA when drawing the graph can make interpretation of
the results a non-trivial task.
The WPR system custom keyword “0x40200000”
(SYSTEM_POWER_KW_IDLE_SELECTION / PERF_IDLE_SELECTION) forms the basis of the
WPA “Idle State Selection” graph. The event includes information about
potential target C-states that were vetoed (along with the veto reason);
typically, a veto will change the target idle state for a processor and cause
an “Idle State Change” event to be generated.
There is a “Microsoft-Windows-Kernel-Processor-Power” event (BiosCStatesRundown) that provides information about the available C-states, including the latency to enter and exit the state (in microseconds) and the average power consumption (in milliwatts). On my PC, the values are:
|
State |
Latency (μs) |
Power (mW) |
|
C1 |
1 |
1000 |
|
C2 |
151 |
200 |
|
C3 |
1034 |
200 |
The WPR system custom keyword “0x44000000” (SYSTEM_POWER_KW_PPM_EXIT_LATENCY / PERF_PPM_EXIT_LATENCY) records the exit latency from a “platform idle” state; “platform idle” is a lower power state for the entire System on a Chip (SoC) and is used by Modern Standby (Always on Always connected, AoAc). In addition to setting the WPR keyword, a call to the routine NtPowerInformation to set an ExitLatencySamplingPercentage value is needed before events will be generated.
Microsoft.Windows.Kernel.Timer
The “Microsoft.Windows.Kernel.Timer” events
are not used by WPA and their emphasis is management of the clock; the events
include ProcessorEnterDeepIdle and ProcessorExitDeepIdle.
Microsoft-Windows-Kernel-Processor-Power
Many of the “Microsoft-Windows-Kernel-Processor-Power”
events are reported during the periodic execution of the PpmCheckRun DPC
routine.
The WPA “Core Parking Instantaneous
Concurrency” graph is mostly derived from the System Power Provider IdleStates
event, but is only available if ParkNodeRundown events are also present in the
trace. The word “Parking” in the graph title seems to be inappropriate: my
device does not support “parking” (i.e. removing a processor from the pool of
processors available for thread scheduling) and the graph shows what one might
call the “Idle Processor Instantaneous Concurrency”.
The WPA “Core Parking Concurrency” graph
shows essentially the same information as the “Core Parking Instantaneous
Concurrency” but is derived from sampled counts of idle processors rather than
exactly tracking idle transitions. The ParkNodeRecordedStats events are the
source for this graph. In addition to information about the number of
concurrent cores, the event also contains “concurrency histogram” information:
time spent (performance counter counts) since the system booted with 0 to “n”
concurrent cores (an array of n+1 values).
The WPA “Processor Performance” and
“Processor Frequency” are two views of the same data: mostly DeliveredPerfChange
events (if available, RecordedUtility events if not) with some context provided
by ProcessorIdRundown events. Both DeliveredPerfChange and RecordedUtility
carry information about both performance and frequency; on my PC the
performance and frequency values (which are expressed as percentages of a
nominal value) are equal. The “nominal” 100% performance/frequency values are
indeed “nominal”: on my PC, the maximum performance/frequency is twice the
nominal value and the y-axis of the graph covers the range 40% to 200%.
The WPA “Processor Utility” and “Processor
Utilization” graphs also draw their data from RecordedUtility (plus
ProcessorIdRundown) events. The RecordedUtility event includes items named
BusyTime, IdleTime and DeliveredPerformance. “Processor Utilization” is
calculated as:
BusyTime/(BusyTime+IdleTime)×100
“Processor Utility” is calculated as:
BusyTime/(BusyTime+IdleTime)×100×DeliveredPerformance
The DeliveredPerformance
value in the event is expressed as a percentage, so if IdleTime is 0 and DeliveredPerformance
is 100% then “Processor Utility” will be 10,000 (ten thousand, 10 K). This
explains why the y-axis range is so large:
The meaning of the WPA “Processor
Performance Tuning” graphs was not immediately clear to me. From the timing of
the events, one could determine that the graphs used data from events such as ProcessorPerfStateChange,
DomainPerfStateChange and QosClassPerfSelection; these events contain a
(Selected)State value, which is stored as a 32-bit value in the
*PerfStateChange events and as a 64-bit value in the QosClassPerfSelection
evente – only when I realized that the State value was actually the value of
the IA32_HWP_REQUEST_PKG MSR did the events start to make sense. Many of the
other values in the events are just values broken out of IA32_HWP_REQUEST_PKG
and normalized. The events report on the Hardware-Controlled Performance States
(HWP); HWP is an implementation of the ACPI-defined Collaborative Processor
Performance Control (CPPC). The “Epp” parameter visible in the graphs is Energy
Performance Preference
No comments:
Post a Comment