In Detail: 2021

Wednesday, 14 July 2021

Slow performance of IKEv2 built-in client VPN under Windows

Over time, I noticed several reports in technical forums of slow IKEv2 performance, with the observed performance often quoted as just 10% to 20% of what was expected. Troubleshooting network performance problems almost always requires making network traces and, on the few occasions that I offered to help with the analysis, there was an (understandable) unwillingness to share trace data. When someone did agree to share the data, the analysis proved to be trickier than expected.

The first network trace

The first trace used a combination of two ETW providers: Microsoft-Windows-PktMon (with keywords 0x10) and Microsoft-Windows-TCPIP (with keywords 0x200500000020 and level 16). I had used this combination before when troubleshooting problems with the TCP CUBIC congestion algorithm and out-of-sequence delivery of TCP packets (see here) and developed a simple tool to visualize interesting aspects of the captured data. This is what the tool showed:

The blue line is the TCP send sequence number; the green line is the size of the congestion window (not to scale on the Y-axis – it is its shape that is most interesting); the yellow line is the send window (to the same scale as the congestion window).

At 6 arbitrary Y-values, points are shown when “interesting” events occur; the events are:

· TcpEarlyRetransmit

· TcpLossRecoverySend Fast Retransmit

· TcpLossRecoverySend SACK Retransmit

· TcpDataTransferRetransmit

· TcpTcbExpireTimer RetransmitTimer

· D-SACK [RFC3708] arrival

The black points on the horizontal axis indicate when IPsec ESP sequence numbers are detected “out-of-sequence”. In contrast to the other data and events, which all refer to the same TCP connection, this information is not necessarily related to the same TCP connection but is “circumstantially” related. The reason for this is that it is difficult to know what is “inside” a particular ESP packet; by combining information from several ETW providers, it is often possible to infer what is inside, but that is not essential for this visualization tool – the temporal relationship between events often “suggests” whether a (causal) link possibly exists.

The traffic being measured in the trace/graph is 10 back-to-back sends of a one megabyte block to the “discard” port on the “server”, so there is no application protocol hand-shaking or other overheads.

I had initially expected to see something like a fragmentation problem, but the graph looked exactly like the TCP CUBIC out-of-sequence behaviour. This was surprising because the client and server were on the same network, which made it unlikely that network equipment was causing the disorder in the packet sequencing.

For comparison purposes, this is what the visualization of a “good” transfer looks like:

The first surprise

A more detailed look at the trace data revealed that (large) data-carrying ESP packets were sometimes leaving the sender with their ESP sequence numbers “disordered”. RFC 2406 (ESP) describes the generation and verification of sequence numbers, and one would certainly expect the transmitted sequence numbers for a particular SPI to simply count upwards monotonically.

Other discoveries

Some reports of this problem on the web note that the problem is not observed when using a wired connection. I also observed this and then tried testing a USB wireless adapter – this device did not exhibit the poor performance behaviour. This led to the conclusion that the physical network adapter was an element of the problem rather than a simple wired/wireless difference.

Another “discovery” (actually a confirmation, since it was tried after a theory about the cause had been formed) was that restricting the client to using just one processor (via BIOS settings) eliminated the problem.

Filtering the trace data for interesting ESP sequence number events often showed something like this:

2021-07-14T08:40:28.1867532Z FBDD70E2: 233 -> 235 (5928)
2021-07-14T08:40:28.1870094Z FBDD70E2: 235 -> 241 (5928)
2021-07-14T08:40:28.1870119Z FBDD70E2: 241 -> 243 (5928)
2021-07-14T08:40:28.1872059Z FBDD70E2: 243 -> 234 (5892)
2021-07-14T08:40:28.1872088Z FBDD70E2: 234 -> 236 (5892)
2021-07-14T08:40:28.1872338Z FBDD70E2: 240 -> 242 (5892)
2021-07-14T08:40:28.1872361Z FBDD70E2: 242 -> 244 (5892)

The format of the above is timestamp (used to correlate various views of the trace data), followed by SPI and then the discontinuity in the sequence numbers and finally the thread ID (in parentheses).

Some of the interesting features of the above trace data are that some of the “jumps” in the ESP sequence number are quite large (e.g. 9 backwards) and that multiple threads are sending interleaved ESP packets: TID 5928 sends Seq№ 241, TID 5892 sends Seq№ 242 and TID 5928 sends Seq№ 243. Either some synchronization mechanism is failing or some prior assumption has proved wrong.
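The filtered lines shown above are regular enough to be checked mechanically. The following is a minimal sketch in Python (not the tool used for the analysis) that parses lines in that format and reports the backward jumps and the set of thread IDs involved. The names Jump and parse_jumps are hypothetical, and the timestamp is deliberately kept as text because it is only used for correlation.

```python
import re
from typing import List, NamedTuple


class Jump(NamedTuple):
    timestamp: str  # kept as text; only used to correlate with other views
    spi: str
    prev_seq: int
    next_seq: int
    tid: int

    @property
    def delta(self) -> int:
        """Signed step between the two sequence numbers (+1 would be in order)."""
        return self.next_seq - self.prev_seq


# timestamp, SPI (8 hex digits), "a -> b", thread ID in parentheses
LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<spi>[0-9A-Fa-f]{8}):\s+"
    r"(?P<prev>\d+)\s*->\s*(?P<next>\d+)\s+\((?P<tid>\d+)\)$"
)


def parse_jumps(text: str) -> List[Jump]:
    """Parse every line that matches the format; silently skip the rest."""
    jumps = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            jumps.append(
                Jump(m["ts"], m["spi"], int(m["prev"]), int(m["next"]), int(m["tid"]))
            )
    return jumps


SAMPLE = """\
2021-07-14T08:40:28.1867532Z FBDD70E2: 233 -> 235 (5928)
2021-07-14T08:40:28.1870094Z FBDD70E2: 235 -> 241 (5928)
2021-07-14T08:40:28.1870119Z FBDD70E2: 241 -> 243 (5928)
2021-07-14T08:40:28.1872059Z FBDD70E2: 243 -> 234 (5892)
2021-07-14T08:40:28.1872088Z FBDD70E2: 234 -> 236 (5892)
2021-07-14T08:40:28.1872338Z FBDD70E2: 240 -> 242 (5892)
2021-07-14T08:40:28.1872361Z FBDD70E2: 242 -> 244 (5892)
"""

jumps = parse_jumps(SAMPLE)
backward = [j for j in jumps if j.delta < 0]   # sequence number went backwards
threads = sorted({j.tid for j in jumps})       # threads sending on this SPI
```

For the sample above, backward contains the single 243 -> 234 step and threads lists both thread IDs, which is the interleaving described in the text.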

The use of threads in the problematic case is complicated, so it is instructive to first see how they are used in the well-behaved case.

Threading in the well-behaved case

To recap the situation, ETW tracing is being performed on a client that is sending data to the TCP discard port (9). Although the client is “driving” the interaction (initiating the connection, sending data, etc.), it is useful to follow the use of threads from the arrival of a packet (probably a TCP acknowledgement-only packet) to the sending of a response packet (the next packet of data); often the client is waiting to send (normally because the congestion window is closed) and the arrival of an acknowledgement triggers the sending of more data.

When an incoming packet is first detected by the ETW trace, the stack looks like this:

ndis!PktMonClientNblLogNdis+0x2b
ndis!NdisMIndicateReceiveNetBufferLists+0x3d466
rt640x64.sys+0x1F79B
rt640x64.sys+0x29F9
ndis!ndisInterruptDpc+0x197
nt!KiExecuteAllDpcs+0x30e
nt!KiRetireDpcList+0x1f4
nt!KiIdleLoop+0x9e

rt640x64.sys is the device driver for a “Realtek PCIe GBE Family Controller” (no symbols available).

Interrupts for a particular device are often directed to a particular processor and the DPC, queued in the interrupt handling routine, typically executes on the same processor. When the indication of an incoming packet reaches AgileVpn.sys (the WAN Miniport driver for IKEv2 VPNs), AgileVpn dispatches the further processing of the message to one of its worker threads; the particular worker thread is chosen “deterministically”, using the current processor number to choose a thread that will run on the same processor.

Most of the subsequent processing (unpacking/decrypting the ESP packet, indicating the decrypted incoming packet to the VPN interface, releasing any pending packet that can be sent (congestion control permitting), encrypting/encapsulating the outgoing packets, etc.) takes place on the same thread. When the packet re-enters AgileVpn on a transmit path, it is again handed over to a different AgileVpn worker thread on the same processor (AgileVpn creates 3 worker threads per processor, for different tasks).

So, in summary (and neglecting the initial interrupt), all of the activity normally takes place in three threads, all with an affinity for the same processor. This normally ensures that IPsec packets are sent in the expected sequence, but it is not a guarantee (just “good enough”).

Using Performance Monitor to trace the send/indicate activity per processor during a “discard” test shows this distribution of load:

Threading in the badly-behaved case

Having understood the well-behaved case, the stack when an incoming packet is first detected by the ETW trace in the badly-behaved case already suggests the likely cause of the problem:

ndis!PktMonClientNblLogNdis+0x2b
ndis!NdisMIndicateReceiveNetBufferLists+0x3d466
wdiwifi!CPort::IndicateFrames+0x2d8
wdiwifi!CAdapter::IndicateFrames+0x137
wdiwifi!CRxMgr::RxProcessAndIndicateNblChain+0x7f7
wdiwifi!CRxMgr::RxInOrderDataInd+0x35a
wdiwifi!AdapterRxInorderDataInd+0x92
Netwtw06.sys+0x51D13
Netwtw06.sys+0x52201
ndis!ndisDispatchIoWorkItem+0x12
nt!IopProcessWorkItem+0x135
nt!ExpWorkerThread+0x105
nt!PspSystemThreadStartup+0x55
nt!KiStartSystemThread+0x28

The “indication” of the incoming packet (to PktMon and AgileVpn) does not occur in the context of the device driver DPC routine but rather in a system worker thread; the device driver DPC routine probably called NdisQueueIoWorkItem.

NdisAllocateIoWorkItem and NdisQueueIoWorkItem make no statements (and provide no guarantees) about which processor the worker thread will execute on. The subsequent handling of the packet is similar to the well-behaved case (transferring processing to an AgileVpn worker thread, etc.) but, since the initial processor number is essentially chosen at random for each incoming packet, AgileVpn worker threads on all processors are used.

A “snapshot” of the same Performance Monitor trace as above looks like this:

I used the word “snapshot” because the distribution of load varies quite a bit during a test – the essential message is clear though: many processors are sending packets on the same SPI concurrently and the risk of disorder in the ESP sequence numbers is clearly present.

TCP CUBIC congestion control

Sending packets out-of-sequence is not ideal (especially if the degree of disorder approaches the level where ESP Sequence Number Verification starts to have an effect) but higher level protocols that guarantee in-sequence delivery (such as TCP, HTTP/3, QUIC, etc.) have mechanisms for coping with out-of-sequence packets.
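To make the remark about ESP Sequence Number Verification concrete, here is a minimal sketch of a receiver-side sliding anti-replay window of the kind described in the ESP RFC. It is a simplified model, not the Windows implementation: the class name, the default window size of 64 and the absence of any integrity check (which a real receiver performs before updating the window) are assumptions made for illustration.

```python
class AntiReplayWindow:
    """Toy sliding-window replay check for one SPI (no integrity check modelled)."""

    def __init__(self, size: int = 64) -> None:
        self.size = size
        self.top = 0      # highest sequence number accepted so far
        self.bitmap = 0   # bit i set => sequence number (top - i) was already seen

    def check_and_update(self, seq: int) -> bool:
        """Return True if the packet would be accepted, False if dropped."""
        if seq == 0:
            return False                      # 0 is not used as a sequence number
        if seq > self.top:
            shift = seq - self.top            # window slides forward
            mask = (1 << self.size) - 1
            self.bitmap = ((self.bitmap << shift) | 1) & mask
            self.top = seq
            return True
        offset = self.top - seq
        if offset >= self.size:
            return False                      # too far behind the window
        if self.bitmap & (1 << offset):
            return False                      # duplicate
        self.bitmap |= 1 << offset            # late but still inside the window
        return True


window = AntiReplayWindow()
results = [window.check_and_update(s) for s in (241, 243, 234, 236, 234)]
```

With a 64-packet window, a packet arriving 9 behind the highest sequence number is still accepted, so disorder of the size seen in the trace slows TCP down long before it causes ESP drops; only much larger disorder (or a smaller window) would start discarding packets.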

In particular TCP, at one level, can handle out-of-sequence packets easily. However the particular congestion control mechanism used for a TCP connection can behave badly – and Windows’ current implementation of CUBIC does behave badly and is the cause of the slow transfer rate.

This is what happens in the “discard” test case:

· The client sends data bearing TCP packets out-of-sequence to the server.

· The server follows normal acknowledgement policies (e.g. at least one acknowledgement for every two received data packets) and includes SACKs where appropriate.

· The client receives the acknowledgements and sometimes receives a sequence of acknowledgements for the same sequence number (because the segment that would fill the gap and allow the acknowledged sequence number to be advanced has not yet arrived at, or just has not yet been processed by, the server – because it was sent out-of-sequence).

· The client triggers congestion control mechanisms, reducing the size of its congestion window and (fast) retransmitting the “missing” segment.

· The server receives the duplicate segment and includes a D-SACK in its next acknowledgement.

· The client ignores the D-SACK.

In a better CUBIC implementation the client, upon receiving a D-SACK, would “undo” the congestion window reduction caused by the spurious retransmission and adjust its parameters for detecting a suspected lost segment.
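The "undo" idea can be illustrated with a toy sender model. This is a sketch under stated assumptions, not CUBIC and not the Windows implementation: the class ToySender, the window measured in segments, the 7/10 reduction factor and the policy of raising the duplicate-ACK threshold by one are all illustrative choices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySender:
    cwnd: int = 40                        # congestion window, in segments
    dup_threshold: int = 3                # duplicate ACKs before fast retransmit
    last_ack: int = 0
    dup_acks: int = 0
    undo_cwnd: Optional[int] = None       # window saved before the reduction
    retransmitted: Optional[int] = None   # sequence number that was fast-retransmitted

    def on_ack(self, ack: int) -> None:
        if ack > self.last_ack:           # new data acknowledged
            self.last_ack, self.dup_acks = ack, 0
            return
        self.dup_acks += 1                # duplicate acknowledgement
        if self.dup_acks == self.dup_threshold and self.retransmitted is None:
            self.undo_cwnd = self.cwnd            # remember the state for a later undo
            self.cwnd = max(2, self.cwnd * 7 // 10)
            self.retransmitted = ack              # resend the "missing" segment

    def on_dsack(self, segment: int) -> None:
        """The receiver reports that it got this segment twice."""
        if segment == self.retransmitted and self.undo_cwnd is not None:
            self.cwnd = self.undo_cwnd    # the retransmission was spurious: undo
            self.dup_threshold += 1       # be less eager to suspect loss next time
            self.undo_cwnd = None
            self.retransmitted = None


sender = ToySender()
sender.on_ack(1000)
for _ in range(3):
    sender.on_ack(1000)                   # three duplicate ACKs caused by reordering
reduced = sender.cwnd
sender.on_dsack(1000)                     # D-SACK proves the segment was not lost
restored = sender.cwnd
```

A sender that ignores the D-SACK stays at the reduced window after every reordering event, which is the behaviour visible in the graph of the slow transfer.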

Summary and workarounds

A number of factors combine to cause the poor performance. One element is a network adapter that uses NdisQueueIoWorkItem, causing the initial NDIS “indication” of packet arrival to occur on random processors. Another element is how AgileVpn distributes the load based on the current processor number rather than, say, SPI (although I don’t know if information like SPI is available at the time this decision is made). The final element is the set of weaknesses in the TCP congestion control implementation.
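The difference between the two distribution strategies can be sketched as follows. This is purely illustrative: the worker count, the function names and the modulo mapping are assumptions, and nothing here reflects how AgileVpn is actually implemented (nor whether the SPI is available when the choice is made).

```python
import random

WORKERS = 8  # hypothetical: one send worker per processor


def worker_by_processor(current_cpu: int) -> int:
    """Pick the worker from the processor the indication happened to arrive on."""
    return current_cpu % WORKERS


def worker_by_spi(spi: int) -> int:
    """Pick the worker from the security association, wherever the packet arrived."""
    return spi % WORKERS


rng = random.Random(0)
spi = 0xFBDD70E2
# A driver that indicates packets from system worker threads: arrival CPU is arbitrary.
arrival_cpus = [rng.randrange(WORKERS) for _ in range(100)]

workers_used_by_cpu = {worker_by_processor(cpu) for cpu in arrival_cpus}
workers_used_by_spi = {worker_by_spi(spi) for _ in arrival_cpus}
```

With arbitrary arrival processors, the first strategy spreads one SPI across many workers (so ordering depends on scheduling), whereas the second keeps all packets of an SPI on a single worker at the cost of less parallelism per connection.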

There are no good workarounds. Using different network interfaces (if available) or different VPN protocols (if appropriate/possible) is obviously possible (but probably an unhelpful suggestion). “Hamstringing” the system so that it only uses one processor is not something that one could seriously propose.

Improvements in the TCP congestion control implementation have been announced but are not available in any mainstream Windows version yet.

Saturday, 12 June 2021

Mapped network drive reconnection failures

As a regular reader of forums discussing technical problems with Windows components, I have been interested in the number of problems reported with connections to SMB file shares. I did not have the problem myself and I could not think of a way to reproduce and troubleshoot the problem(s).

There is a Microsoft article entitled “Mapped network drive may fail to reconnect in Windows 10, version 1809” which says:

Microsoft is working on a resolution and estimates a solution will be available by the end of November 2018. Monitor the mapped drive topic in the Windows 10 1809 Update History KB 4464619.

The referenced KB article does not contain any relevant information about progress on the problem.

It may seem irrelevant at this stage, but there was a discussion on the answers.microsoft.com forum in 2020 about the “UseOptions” value in the registry key HKCU\Network\<DRIVELETTER> - it seemed to be causing problems with persistent connections and the value seemed to have been introduced in Windows 10, version 2004.

The authoritative source of information about SMB is the Microsoft specification "[MS-SMB2]: Server Message Block (SMB) Protocol Versions 2 and 3".

Section "3.2.4.2.2 Negotiating the Protocol" of this document says:

When a new connection is established, the client MUST negotiate capabilities with the server. The client MAY<111> use either of two possible methods for negotiation.

The first is a multi-protocol negotiation that involves sending an SMB message to negotiate the use of SMB2. If the server does not implement the SMB 2 Protocol, this method allows the negotiation to fall back to older SMB dialects, as specified in [MS-SMB].

The second method is to send an SMB2-Only negotiate message. This method will result in successful negotiation only for servers that implement the SMB 2 Protocol.

The reference <111> says:

The Windows-based client will initiate a multi-protocol negotiation unless it has previously negotiated with this server and the negotiated server's DialectRevision is equal to 0x0202, 0x0210, 0x0300, 0x0302, or 0x0311. In the latter case, it will initiate an SMB2-Only negotiate.
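The rule in note <111> amounts to a small decision function. The sketch below restates it in Python; the function name and the use of None for "no previous negotiation with this server" are assumptions made for illustration.

```python
from typing import Optional

# DialectRevision values listed in note <111>
SMB2_ONLY_DIALECTS = frozenset({0x0202, 0x0210, 0x0300, 0x0302, 0x0311})


def choose_negotiation(remembered_dialect: Optional[int]) -> str:
    """Return the negotiation method the client would start with."""
    if remembered_dialect in SMB2_ONLY_DIALECTS:
        return "smb2-only"
    return "multi-protocol"
```

The important consequence is that the choice depends on remembered state rather than on what the server can currently do, so stale or wrongly attributed state leads directly to an SMB2-Only negotiate being sent to a server that cannot handle it.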

It seems that older SMB servers (NAS devices and Windows Server 2003) don’t expect a new connection to start with an SMB2-Only negotiate. They can behave in various incorrect ways, such as returning an error message, not responding at all, breaking the connection, etc., and this results in different error messages being shown to the user.

It is often mentioned that there are anomalies when referring to the file share by server name or server IP address – this is caused by a dependency on which version of the path to the share has “remembered” SMB2 capabilities.

There are many reasons why it might be necessary to “reconnect” to a file share: the transport connections have an “idle timeout”, the client may move between different networks or many other types of network interruption may cause the connection to a file share to require reconnection. This means that problems can occur at unpredictable times.

Another reason to “reconnect” a share is to restore persistent file shares when a user logs in. The change in Windows 10, version 2004 seems to have been to “persist” the knowledge of the server capabilities to the registry (rather than just in-memory data structures in mrxsmb.sys). When persisting a share, Windows now queries the attributes of the share with a call to NtQueryInformationFile, with a FILE_INFORMATION_CLASS of FileRemoteProtocolInformation and stores this information in the UseOptions value of the key HKCU\Network\<DRIVELETTER>.

The information returned by NtQueryInformationFile is a FILE_REMOTE_PROTOCOL_INFORMATION structure and the ProtocolMajorVersion member contains the negotiated SMB major version number. This enables Windows to decide whether it can use SMB2-Only negotiation.

struct FILE_REMOTE_PROTOCOL_INFORMATION
{
    USHORT StructureVersion;     // 1 for Win7, 2 for Win8 SMB3, 3 for Blue SMB3, 4 for RS5
    USHORT StructureSize;           // sizeof(FILE_REMOTE_PROTOCOL_INFORMATION)
    ULONG Protocol;                    // Protocol (WNNC_NET_*) defined in winnetwk.h or ntifs.h.
    USHORT ProtocolMajorVersion;
    USHORT ProtocolMinorVersion;
    USHORT ProtocolRevision;
    USHORT Reserved;
    ULONG Flags;
    struct {
        ULONG Reserved[8];
    } GenericReserved;
    union {
        struct {
            struct {
                ULONG Capabilities;
            } Server;
            struct {
                ULONG Capabilities;
                ULONG CachingFlags;
                UCHAR ShareType;
                UCHAR Reserved0[3];
                ULONG Reserved1;
            } Share;
        } Smb2;
        ULONG Reserved[16];
    } ProtocolSpecific;
};
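As a cross-check of the layout, the fixed part of the structure can be decoded with Python's struct module. This is a sketch that assumes little-endian data and natural alignment (which introduces no padding for this member order); the function name and the returned dictionary keys are hypothetical.

```python
import struct

# StructureVersion, StructureSize, Protocol, ProtocolMajorVersion,
# ProtocolMinorVersion, ProtocolRevision, Reserved, Flags
HEADER = struct.Struct("<HHIHHHHI")
GENERIC_RESERVED = struct.Struct("<8I")     # GenericReserved.Reserved[8]
PROTOCOL_SPECIFIC = struct.Struct("<16I")   # union, sized by Reserved[16]
TOTAL_SIZE = HEADER.size + GENERIC_RESERVED.size + PROTOCOL_SPECIFIC.size


def parse_remote_protocol_info(blob: bytes) -> dict:
    """Decode the leading fields of a FILE_REMOTE_PROTOCOL_INFORMATION blob."""
    if len(blob) < TOTAL_SIZE:
        raise ValueError(f"expected at least {TOTAL_SIZE} bytes, got {len(blob)}")
    (version, size, protocol, major, minor, revision, _reserved, flags) = (
        HEADER.unpack_from(blob, 0)
    )
    # First ULONG of the union: ProtocolSpecific.Smb2.Server.Capabilities
    (server_caps,) = struct.unpack_from(
        "<I", blob, HEADER.size + GENERIC_RESERVED.size
    )
    return {
        "structure_version": version,
        "structure_size": size,
        "protocol": protocol,
        "major": major,
        "minor": minor,
        "revision": revision,
        "flags": flags,
        "smb2_server_capabilities": server_caps,
    }
```

The three parts add up to 116 bytes, and ProtocolMajorVersion sits at offset 8, which is the field of interest when deciding whether an SMB2-Only negotiate is attempted.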

Registry storage of remembered mapped network drives

The “remembered” mapped network drives are stored in the registry under the key HKCU\Network. For each “drive letter” subkey under this key, the following information can be stored:

ConnectFlags: a REG_DWORD value containing a bit mask of values constructed from some of the CONNECT_* definitions in winnetwk.h; in particular CONNECT_REQUIRE_INTEGRITY, CONNECT_REQUIRE_PRIVACY and CONNECT_WRITE_THROUGH_SEMANTICS.

ConnectionType: a REG_DWORD value containing a value taken from the RESOURCETYPE_* definitions in winnetwk.h; in particular RESOURCETYPE_DISK.

DeferFlags: a REG_DWORD value indicating whether interaction with the user is needed to restore the connection (e.g. to obtain a password); the value 1 means that a password from the user is needed, a value of 2 means that a password from the user might be needed and a value of 4 means that default/stored credentials can be used (i.e. no need to ask the user for a password).

ProviderFlags: a REG_DWORD value representation of a Boolean value (0/1), indicating whether the RemotePath refers to a DFS root. If the RemotePath is not a DFS root, this value is normally omitted.

ProviderName: a REG_SZ value containing the provider name; in particular “Microsoft Windows Network”.

ProviderType: a REG_DWORD value containing a value taken from the WNNC_NET_* definitions in wnnc.h; in particular WNNC_NET_SMB.

RemotePath: a REG_SZ value containing the UNC path of the mapped network drive.

UseOptions: a REG_BINARY value containing a sequence of Tag/Length/Value elements. The only “Tag” that I have observed is “DefC”, the value of which is a FILE_REMOTE_PROTOCOL_INFORMATION structure.

UserName: when needed, a REG_SZ value containing the username; when not needed, a REG_DWORD value containing 0.

Setting ProviderFlags as a partial workaround

Many reports can be found on the Internet that setting ProviderFlags to 1 for a remembered mapped network drive can help and this appears to be true. When ProviderFlags is set to 1, indicating that the RemotePath refers to a DFS root, more DFS operations take place. The DFS driver initially rewrites the RemotePath, replacing the share name with “IPC$” and then asks the SMB driver to connect to this path so that a FSCTL_DFS_GET_REFERRALS request can be sent to the server – the “remembered” SMB capabilities of the server are not made available to the SMB driver for this call, so the SMB driver performs a “multi-protocol negotiation”. The FSCTL_DFS_GET_REFERRALS request fails with STATUS_FS_DRIVER_REQUIRED (if the RemotePath is not a DFS root) and the SMB connection process continues – but, by now, the protocol has been negotiated (via multi-protocol negotiation) and the network drive is successfully mapped.

Most of the registry values for remembered mapped network drives are updated when used – except the ProviderFlags value: it is only created/set if the RemotePath is a DFS root. This allows misleading “workaround” information (ProviderFlags = 1) to persist in the registry.

Invisible references to an SMB server

Unfortunately, deleting all references to an SMB server from user mode is not guaranteed to remove all recollection of the server from mrxsmb.sys. Under these circumstances, attempts to unload mrxsmb.sys also fail/hang (such unload attempts normally succeed). The unload attempt gets stuck here:

nt!KeWaitForSingleObject+0x233
rdbss!RxSpinDownOutstandingAsynchronousRequests+0x9d
rdbss!RxUnregisterMinirdr+0x1fc
mrxsmb!MRxSmbInitUnwind+0x123
mrxsmb!MRxSmbUnload+0x4e
nt!IopLoadUnloadDriver+0xdc065

The WPP ETW provider for mrxsmb.sys allows the reference count for the SrvEntry for the server to be tracked, so it is possible (if difficult) to check whether “hanging” references to a SrvEntry are preventing the known workarounds from being effective.

Verifying whether this issue is active

One way of verifying whether this issue is active is to try the following:

Issue the command: logman start why -ets -p Microsoft-Windows-SMBClient Smb_Info -o why.etl

Try to access the mapped network drive.

Issue the command: logman stop why -ets

Issue the command: wevtutil qe /f:text /lf:true why.etl | findstr "SMB.send SMB.receive"

The final command will show selected items from the trace data; if there is a repeated sequence of “SMB send[0]: [NEGOTIAT]” items, then the client is repeatedly trying SMB2-Only negotiation, failing and retrying – this is the main characteristic of this problem.
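The check for repeated negotiate attempts can also be automated on the saved command output. The following sketch relies only on the two substrings quoted above ("SMB send" and "[NEGOTIAT"); the exact line format of the trace output is otherwise an assumption, and the function name is hypothetical.

```python
from typing import Iterable


def longest_negotiate_run(lines: Iterable[str]) -> int:
    """Longest run of consecutive 'SMB send' items that are negotiate requests."""
    longest = current = 0
    for line in lines:
        if "SMB send" not in line:
            continue                 # ignore receive items and unrelated lines
        if "[NEGOTIAT" in line:
            current += 1
            longest = max(longest, current)
        else:
            current = 0              # some other request was sent: run is over
    return longest
```

A connection that negotiates and then proceeds to other requests produces a short run, while the failure described here shows up as a long run of negotiate sends with nothing else in between. The threshold separating "short" from "long" is a matter of judgement when reading a particular trace.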

Prospects

This problem affects Windows Server 2003 (among other SMB servers) and there is no doubt in my mind that (parts of) Microsoft is fully aware of the problem, its causes and its potential remedies. There is, however, a dearth of easily findable authoritative information on this topic on the Internet.

The common workarounds for these problems include deleting and recreating shares when problems occur, setting ProviderFlags to 1 and deleting the UseOptions value for persistent shares (whenever it is (re-)created). Without an option to disable the “SMB2-Only” optimization, there are no ideal solutions.

Tuesday, 18 May 2021

Network Discovery and Name Resolution under Windows 10 in a Home Network (zero-configuration networking)

Network discovery (for example, the “discovery” performed by the “Network” item in the left (navigation) pane of “File Explorer” under Windows 10) in a home network should “just work” in the sense of discovering and displaying the network devices that are known to be in the home network. However, one often reads in technical support forums that “network discovery” is not working to some extent; sometimes this results from outdated expectations (for example, that the “net view” command is the full extent of “network discovery”) but sometimes also from old network equipment that does not support newer discovery mechanisms or from network equipment that has been configured not to respond to network discovery requests (perhaps for security reasons).

Let’s first consider how “network discovery” works and what can be done to influence its behaviour.

The Microsoft interface IFunctionDiscovery is the entry point into performing network discovery in the same style as File Explorer. The method CreateInstanceCollectionQuery of this interface is called first with either a “layered category” (e.g. "Layered\Microsoft.Networking.Devices") which will use a collection of providers appropriate to the layer or a “provider category” (e.g. “Provider\Microsoft.Networking.WSD”) which will use a specific provider/technology/protocol.

Some of the providers that are relevant to discovering networking devices are:

Provider\Microsoft.Networking.WSD
Provider\Microsoft.Networking.SSDP
Provider\Microsoft.Networking.Netbios

Network discovery can take some time, so the method that executes the discovery normally returns a “pending” status (E_PENDING) and delivers discovery results to its caller asynchronously (as they happen). The main work of discovery is performed in the “Function Discovery Provider Host” (fdPHost) service.

One piece of advice that one often sees on the Internet is to ensure that Windows services used in the discovery process are running and/or configured to run. This is not something that I would recommend. The relevant services (e.g. fdPHost, FDResPub, SSDPSRV) are normally configured as “demand” start; some may also include “trigger” configuration (e.g. FDResPub triggers on specific event values of the Microsoft-Windows-NetworkProfileTriggerProvider ETW provider); some are defined as “dependencies” for other services; some services explicitly start other services. The ability of a service to operate is also often dependent on Windows Firewall rules (that are also actively maintained and changed as system events occur). Manual interference should be a last step, guided by evidence that there is actually a misconfiguration, rather than a first/early troubleshooting step.

The progress of network discovery can be followed using ETW. A combination of the providers Microsoft-Windows-FunctionDiscovery, Microsoft-Windows-WFP (to check for firewall packet drops) and Microsoft-Windows-PktMon (or equivalent, to observe the actual network protocol interactions) is often a good combination.

Web Services Dynamic Discovery (WS-Discovery or WSD)

The Microsoft.Networking.WSD provider is the provider most likely to detect computers and file servers on the home network. During the discovery operation, the fdPHost service sends WSD Probe messages to the WSD IPv4 and IPv6 multicast addresses defined by the WSD protocol. If and when the fdPHost receives a ProbeMatch message, it sends a Get request to the responder (via TCP) to obtain a Get response. In the case of Windows computers, the responder is the FDResPub (Function Discovery Resource Publication) service.
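For readers who want to see what a Probe looks like on the wire, here is a minimal sketch that builds a WS-Discovery Probe envelope (April 2005 namespaces, SOAP 1.2). It only constructs the message; it does not send it. The function name is hypothetical and the Probe carries no Types element, so it is simpler than what fdPHost actually sends.

```python
import uuid
import xml.etree.ElementTree as ET

# Multicast destinations defined by WS-Discovery (UDP port 3702)
WSD_IPV4 = ("239.255.255.250", 3702)
WSD_IPV6 = ("ff02::c", 3702)

NS = {
    "soap": "http://www.w3.org/2003/05/soap-envelope",
    "wsa": "http://schemas.xmlsoap.org/ws/2004/08/addressing",
    "wsd": "http://schemas.xmlsoap.org/ws/2005/04/discovery",
}


def _q(prefix: str, local: str) -> str:
    """Build an ElementTree qualified name such as {namespace}Envelope."""
    return "{%s}%s" % (NS[prefix], local)


def build_probe() -> bytes:
    """Return a serialized Probe with a fresh MessageID."""
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)
    envelope = ET.Element(_q("soap", "Envelope"))
    header = ET.SubElement(envelope, _q("soap", "Header"))
    ET.SubElement(header, _q("wsa", "To")).text = (
        "urn:schemas-xmlsoap-org:ws:2005:04:discovery"
    )
    ET.SubElement(header, _q("wsa", "Action")).text = NS["wsd"] + "/Probe"
    ET.SubElement(header, _q("wsa", "MessageID")).text = "urn:uuid:%s" % uuid.uuid4()
    body = ET.SubElement(envelope, _q("soap", "Body"))
    ET.SubElement(body, _q("wsd", "Probe"))   # empty Probe: match any device type
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)
```

Sending the returned bytes in a UDP datagram to WSD_IPV4 (or WSD_IPV6) and listening on the same socket for ProbeMatch replies is the usual next step, and comparing such a hand-built exchange with a PktMon trace can help to show whether a firewall rule is dropping the replies.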

The key information in the Get response is contained within the wsdp:Relationship/wsdp:Host/pub:Computer element. As the [MS-PBSD] document says, if the computer is domain joined then the value will be of the form “<NetBIOS_Computer_Name>/Domain:<NetBIOS_Domain_Name>”, if the computer is in a workgroup then the value will have the form “<NetBIOS_Computer_Name>\Workgroup:<Workgroup_Name>”, otherwise it will have the form “<NetBIOS_Computer_Name>\NotJoined”.
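The three forms of the pub:Computer value are easy to tell apart programmatically. The sketch below is tolerant about the separator and accepts either "/" or "\" after the computer name, since both appear in the forms quoted above; the class and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class JoinInfo(NamedTuple):
    computer: str
    kind: str                # "Domain", "Workgroup" or "NotJoined"
    group: Optional[str]     # domain or workgroup name, None when not joined


_PUB_COMPUTER = re.compile(
    r"^(?P<computer>[^/\\]+)[/\\](?P<kind>Domain|Workgroup|NotJoined)"
    r"(?::(?P<group>.+))?$"
)


def parse_pub_computer(value: str) -> JoinInfo:
    """Split a pub:Computer value into computer name, join kind and group name."""
    m = _PUB_COMPUTER.match(value.strip())
    if m is None:
        raise ValueError("unrecognized pub:Computer value: %r" % value)
    return JoinInfo(m["computer"], m["kind"], m["group"])
```

A value whose kind is "NotJoined" is the case that matters for troubleshooting: the device answered the Probe and the Get request, so the protocol exchange worked, but the published join state is what decides whether it is listed.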

Network Discovery via IFunctionDiscovery finds all of these variants and File Explorer displays all of the results that represent domain joined or workgroup computers, but it does not display computers that report “not joined”. FDResPub uses the NetGetJoinInformation API to obtain workgroup/domain information; it normally obtains the information when the service starts, so if the LanmanWorkstation service (which serves the NetGetJoinInformation request) has not (completely) started when FDResPub calls NetGetJoinInformation, then the published information will state that the computer is “not joined”.

A workaround for the above problem is to add a service dependency to the FDResPub service on the LanmanWorkstation service. The problem could be called a “bug” and it has a simple source code fix. FDResPub calls NetGetJoinInformation specifying the name of the local computer as the system for which the information should be retrieved; if NetGetJoinInformation fails with RPC_S_SERVER_UNAVAILABLE and a system name was specified then a failure code is returned to the caller (NERR_WkstaNotStarted), but if no system name was specified (a null was passed as parameter, implying the local system) then NetGetJoinInformation uses other local mechanisms to obtain join information and returns a success code to the caller.

This discovery mechanism should discover all devices (Windows, Apple, Linux, Network Attached Storage (NAS), etc.) that support WS-Discovery, have a WS-Discovery publisher service running and are not blocking WS-Discovery messages via firewall mechanisms.

For Windows systems, the “Network and Sharing Center, Advanced sharing settings” dialog (on each Windows system in the home network) should be the only thing that needs to be checked to ensure that network discovery is correctly configured.

Simple Service Discovery Protocol (SSDP)

The Microsoft.Networking.SSDP provider “discovers” most of the printers, scanners, displays, etc. in the home network. The SSDPSRV service periodically multicasts SSDP M-SEARCH requests and observes SSDP NOTIFY announcements. When network discovery is started, fdPHost retrieves a list of responses from SSDPSRV via RPC. The fdPHost then retrieves detailed information about the service by querying the Location URL in the SSDP response. For services hosted on Windows systems (perhaps directly attached printers, music and video libraries, etc.), the upnphost (UPnP Device Host) service is normally the process that is listening at the Location URL.
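The SSDP exchange is simple enough to reproduce by hand when checking whether a device responds at all. The following sketch builds an M-SEARCH request and extracts the Location header from a response; it does not open any sockets. The function names and the default search target and MX value are illustrative choices.

```python
from typing import Optional

SSDP_ADDR = ("239.255.255.250", 1900)   # SSDP IPv4 multicast group and UDP port


def build_msearch(search_target: str = "ssdp:all", mx: int = 2) -> bytes:
    """Build an M-SEARCH request; mx is the maximum response delay in seconds."""
    lines = [
        "M-SEARCH * HTTP/1.1",
        "HOST: %s:%d" % SSDP_ADDR,
        'MAN: "ssdp:discover"',
        "MX: %d" % mx,
        "ST: %s" % search_target,
        "",
        "",                              # blank line terminates the header block
    ]
    return "\r\n".join(lines).encode("ascii")


def location_of(response: bytes) -> Optional[str]:
    """Return the Location URL from an SSDP response, or None if absent."""
    text = response.decode("ascii", "replace")
    for line in text.split("\r\n")[1:]:          # skip the status line
        name, sep, value = line.partition(":")   # split at the first colon only
        if sep and name.strip().lower() == "location":
            return value.strip()
    return None
```

The URL returned by location_of is the one that would then be fetched over HTTP to obtain the device description.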

NetBIOS

The Microsoft.Networking.Netbios provider essentially performs a classic “net view” command, using the WNetOpenEnum/WNetEnumResource/WNetCloseEnum API.

A prerequisite for this resolution mechanism is that NetBIOS over TCP/IP is enabled. By default, the relevant setting is set to “Use NetBIOS from the DHCP server. If static IP address is used or the DHCP server does not provide NetBIOS setting, enable NetBIOS over TCP/IP”.

If SMBv1 is installed, then this method should produce the classically expected results. If SMBv1 is not installed/enabled then this discovery method will only work if the computer has been elected as the “Master Browser” of a workgroup.

If the local computer is not the Master Browser, then the local computer will try to negotiate a connection with the Master Browser. Normally, the newest SMB protocol version available to both parties will be negotiated – typically SMBv3. From a network trace perspective, it seems as though the negotiation has been concluded successfully, but post processing by the client causes the connection to be disconnected.

The stack on the client (local computer) when a disconnection is initiated looks like this:

mrxsmb!SmbCeDisconnectServerConnections+0x2d6:
mrxsmb20!MRxSmb2HandOverSrvCall+0x2054:
mrxsmb!SubRdrClaimSrvCall+0x90:
mrxsmb!SmbCeCompleteSrvCallConstructionPhase2+0x146:
mrxsmb!SmbCeCompleteServerEntryInitialization+0x176:
mrxsmb!SmbCeCompleteNegotiatedConnectionEstablishment+0x155:
mrxsmb!SmbNegotiate_Finalize+0x5b:

Some code in mrxsmb20!MRxSmb2HandOverSrvCall decides that a disconnect is necessary and a quick look at that routine shows that the condition is ConnectionType == Tdi. Possible values for ConnectionType are Tdi (TDI - Transport Driver Interface), Wsk (Winsock Kernel), Rdma (Remote Direct Memory Access) and VMBUS.

TDI is a deprecated technology and is used by "NetBIOS over TCP/IP" (netbt.sys). It seems as though the client will refuse to use SMBv2/3 in conjunction with "NetBIOS over TCP/IP".

If the local computer is the Master Browser, it has access to the list of servers via local mechanisms and the results are made available to the user of IFunctionDiscovery. Users of IFunctionDiscovery, such as Windows File Explorer, typically recognize that some systems have been discovered by more than one mechanism (perhaps WSD and NetBIOS) and display just a single entry for such systems in their user interface.

Name Resolution

If network discovery fails to discover some resource (for example, a file server), it may still be possible to reference the resource by name (rather than by IP address; IP addresses are typically not permanently assigned but rather leased, so it is difficult to be certain of the IP address in the long term in a home network). Name resolution uses different protocols to network discovery and these may well work, even if discovery has failed.

Windows uses 3 mechanisms to resolve names: multicast DNS (mDNS), Link-Local Multicast Name Resolution (LLMNR) and NetBIOS Name Service (NBNS). Name resolution via all applicable mechanisms is normally started in parallel (i.e. the mechanisms are not tried sequentially, waiting for one method to fail before the next is tried). If NetBIOS over TCP/IP is disabled or the name being queried is not NetBIOS compatible (e.g. it is longer than 15 characters) then the NetBIOS Name Service resolution method is not used.
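The selection rule and the parallel start can be modelled in a few lines. This is a sketch only: the resolver callables are stand-ins supplied by the caller, only the 15-character rule for NetBIOS compatibility is modelled, and returning the first non-empty answer is an assumption about how the results are combined.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Optional, Tuple


def applicable_mechanisms(name: str, netbios_over_tcpip: bool) -> List[str]:
    """List the mechanisms that would be tried for this name."""
    mechanisms = ["mDNS", "LLMNR"]
    if netbios_over_tcpip and len(name) <= 15:   # NetBIOS names hold at most 15 characters
        mechanisms.append("NBNS")
    return mechanisms


def resolve_in_parallel(
    name: str, resolvers: Dict[str, Callable[[str], Optional[str]]]
) -> Optional[Tuple[str, str]]:
    """Start all resolvers at once; return (mechanism, answer) for the first answer."""
    if not resolvers:
        return None
    with ThreadPoolExecutor(max_workers=len(resolvers)) as pool:
        futures = {pool.submit(fn, name): label for label, fn in resolvers.items()}
        for future in as_completed(futures):
            answer = future.result()
            if answer:
                return futures[future], answer
    return None
```

For example, applicable_mechanisms("a-very-long-host-name", True) omits NBNS because the name is longer than 15 characters, even though NetBIOS over TCP/IP is enabled.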

Wednesday, 12 May 2021

PktMon

Judging by Web search results, Windows 10 has included a new network traffic capturing mechanism since October 2018; however, two and a half years later, it still seems to be largely unknown (I only discovered it a few days ago).

Microsoft provides and supports a “classic” NDIS Filter traffic capturing mechanism (NdisCap, its associated Microsoft-Windows-NDIS-PacketCapture ETW provider and a PowerShell cmdlet Add-NetEventPacketCaptureProvider) and previously also supported a Windows Filtering Platform capture mechanism (WFPCapture, its associated Microsoft-Pef-WFP-MessageProvider ETW provider and a PowerShell cmdlet Add-NetEventWFPCaptureProvider). The new mechanism (PktMon and its associated Microsoft-Windows-PktMon ETW provider) does not yet have a PowerShell cmdlet to add it to ETW tracing sessions.

According to the PktMon home page, “[PktMon] is especially helpful in virtualization scenarios, like container networking and SDN, because it provides visibility within the networking stack”. The mechanism that allows PktMon to intercept a packet at various points in its transition through the network stack is a set of additional “hooks” introduced into NDIS.sys. Some typical stack traces of the points at which PktMon is invoked are:

PktMon!PktMonPacketLogCallback+0x19
ndis!PktMonClientNblLog+0xbd
ndis!PktMonClientNblLogNdis+0x2b
ndis!ndisCallSendHandler+0x3ca4b
ndis!ndisInvokeNextSendHandler+0x10e
ndis!NdisSendNetBufferLists+0x17d

PktMon!PktMonPacketLogCallback+0x19
ndis!PktMonClientNblLog+0xbd
ndis!PktMonClientNblLogNdis+0x2b
ndis!ndisMIndicateNetBufferListsToOpen+0x3e95c
ndis!ndisMTopReceiveNetBufferLists+0x1bd
ndis!ndisCallReceiveHandler+0x61
ndis!ndisInvokeNextReceiveHandler+0x1df
ndis!ndisFilterIndicateReceiveNetBufferLists+0x3be91
ndis!NdisFIndicateReceiveNetBufferLists+0x6e

PktMon can be seen as an improvement on NdisCap. The main advantage (in my opinion) is that PktMon can be loaded and started without requiring rebinding of the network stack. As I have mentioned in other articles, rebinding the network stack can, under unfortunate circumstances, be a risky undertaking. The new ability to intercept packets at various points in the network stack is something that I have never personally had a need to use but is probably welcomed by those who have had difficulty in diagnosing network problems in “virtualization scenarios”.

One thing that PktMon cannot do is to trace loopback traffic, since the Windows loopback implementation does not use NDIS (WFP mechanisms and raw sockets can capture such traffic).

There are 3 main components of PktMon: the driver (PktMon.sys), a DLL (PktMonApi.dll) and an executable (PktMon.exe).

PktMon.sys

PktMon.sys is the core component. It is controlled via a small set of IOCTLs (to start, stop and query a capture; add, remove and list packet filters; list traceable components; reset trace counters) and the keywords used in the ETWENABLECALLBACK (Config, Rundown, NblParsed, NblInfo and Payload).

The information in the list of traceable components will seem familiar to anyone who has used the kernel debugger extensions “!ndiskd.miniports”, “!ndiskd.protocols” and “!ndiskd.filters”. The list of components is not only available via the IOCTL but is also included (in a different form) at the end of an ETW trace if the “Rundown” keyword is enabled.

The packet filtering possibilities are of the address/protocol-type/port type rather than what Microsoft sometimes calls OLP (Offset value, bit Length, and value Pattern). The output of the command “pktmon filter add help” accurately reflects the filtering possibilities. The filtering mechanism does not allow “negative” conditions to be expressed – for example, one can’t specify “ignore RDP” (as one might wish to do if one is logged onto a system via RDP).

Similar to both NdisCap and WFPCapture, PktMon must be explicitly loaded and/or started before it can generate any events; just starting an ETW trace session containing Microsoft-Windows-PktMon is not enough to capture trace data.

PktMonApi.dll

PktMonApi.dll currently has 9 exports, which are mostly just simple wrappings around IOCTLs to PktMon.sys:

PktmonAddFilter
PktmonGetComponentList
PktmonGetFilterList
PktmonGetStatus
PktmonRemoveAllFilters
PktmonResetCounters
PktmonStart
PktmonStop
PktmonUnload

PktMonApi.dll is not used by PktMon.exe (which contains its own simple wrappings around the IOCTLs).

PktMon.exe

PktMon.exe has several facets: it can configure and control PktMon.sys via its IOCTLs, it can manage ETW trace sessions, it can extract information from the Microsoft-Windows-PktMon ETL and save it in various formats (including “pcapng”), and it can perform simple “tcpdump”-style formatting of captured packets and display them in real time.

As mentioned, PktMon.sys must be explicitly managed in order for Microsoft-Windows-PktMon to capture data. It would be ideal (for me) if Microsoft-Windows-PktMon could just be included in a Windows Performance Recorder (WPR) Profile along with other providers and use of advanced ETW features (such as stack traces, SID information, etc.). Since that is not possible, one must be content with the limited ETW configuration options of PktMon.exe (provider, keywords and level) – a similar situation to that with NdisCap and “netsh trace”.

The PktMon home page currently says “Packet drops from Windows Firewall are not visible through Packet Monitor yet” (my emphasis of “yet”). I would not expect Windows Firewall drop detection to be included in PktMon.sys, since they deal with different technologies. By including both Microsoft-Windows-PktMon and Microsoft-Windows-WFP in ETW trace sessions, one can see both the drop types detected by PktMon and the drops caused by Windows Firewall. Perhaps the “yet” is just a nod to future improvements in the PktMon.exe user interface to present a unified view of the output of the two providers.

The configuration defaults of PktMon.exe cause packets to be logged at all interception points (--comp all); if the intent of a capture is just to analyse the traffic in a tool like Wireshark, this can cause each packet to be repeated twenty or more times in the network trace. One can tackle this by selecting specific components when exporting captured data to the pcapng format, but I prefer to use “PktMon start” with the “--comp nics” qualifier.

Microsoft Message Analyzer

Microsoft Message Analyzer (MMA) was discontinued more-or-less contemporaneously with the introduction of PktMon and the PktMon team provides no support for MMA. The OPN (Open Protocol Notation) below allows the ETL output of PktMon to be viewed comfortably in MMA (if one still has a copy installed).

The OPN is short and it works (for me), but it includes some “design” decisions and is incomplete (no support for the MBB (Mobile BroadBand) NDIS medium type, for example, and probably in many unknown/unanticipated ways).

module PktMon;

using Microsoft_Windows_PktMon;
using Standard;
using Ethernet;
using WiFi;

autostart actor PktMonPayload(ep_Microsoft_Windows_PktMon e)
{
    process e accepts m:Event_160 where m.PacketType == 1
    {
        dispatch endpoint Ethernet.Node accepts BinaryDecoder<Ethernet.Frame[m.LoggedPayloadSize < m.OriginalPayloadSize]>(m.Payload) as Ethernet.Frame;
    }

    process e accepts m:Event_160 where m.PacketType == 2
    {
        DecodeWiFiMessageAndDispatch(m.Payload);
    }
}

Saving this OPN in a file named %LOCALAPPDATA%\Microsoft\MessageAnalyzer\OPNAndConfiguration\OPNForEtw\CoreNetworking\PktMon.opn (for example) and restarting MMA should enable the functionality (MMA may take some time to completely start while it recompiles various OPN files).

Microsoft_Windows_PktMon

The Microsoft_Windows_PktMon ETW provider defines 5 keywords:

1. Config: PktMon.exe help says “Internal Packet Monitor errors”; I have never observed any.

2. Rundown: this causes the list of components to be logged to the ETL when PktMon is stopped.

3. NblParsed: this causes address, protocol type, port, etc. information for each packet to be logged. The same information could be extracted from the binary payload (if present), but it is not trivial to do this because of various options at each layer (data link, network, transport).

4. NblInfo: this seems to be a superset of the information logged as TcpipNlbOob by the Microsoft-Windows-TCPIP provider at level 17 (uninteresting for most people).

5. Payload: this causes the raw data of the packet to be logged. The data can be truncated, if desired (truncation length information is included in the start IOCTL to PktMon.sys).

pktmon etl2pcap

Converting pktmon packet events to PCAPNG format is, in principle, relatively straightforward. There is a GitHub Microsoft repository with the code of a utility that performs the slightly more difficult task of converting “netsh trace” packet data to PCAPNG format (https://github.com/microsoft/etl2pcapng).

Perhaps surprisingly, the “pktmon etl2pcap” command (“Convert pktmon log file to pcapng format”) only produces a useful PCAPNG file if the packets were captured on an Ethernet/802.3 link. If packets were captured on a WiFi/802.11 link, the resulting PCAPNG packet is recorded as having been captured on an Ethernet link; since the datalink headers are different in content and length, a tool like Wireshark “decodes” the packet data incorrectly. If most of the packets in the capture were obtained from a WiFi/802.11 link, the command “editcap -T ieee-802-11 <infile> <outfile>” should make the PCAPNG file usable.
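A converter that wanted to avoid this problem would have to choose the PCAPNG link type per packet (or per interface) instead of assuming Ethernet. The sketch below shows one way to do that. The LINKTYPE values are the standard pcap ones, but the mapping from the PktMon PacketType field (1 for Ethernet, 2 for WiFi) is an assumption taken from the OPN above, and the function name is hypothetical.

```python
LINKTYPE_ETHERNET = 1       # standard pcap/pcapng link type for Ethernet
LINKTYPE_IEEE802_11 = 105   # standard link type for 802.11 without a radio header

# Assumed mapping from the PktMon PacketType field to a link type
# (1 = Ethernet, 2 = WiFi, as used in the OPN above).
PACKET_TYPE_TO_LINKTYPE = {
    1: LINKTYPE_ETHERNET,
    2: LINKTYPE_IEEE802_11,
}


def linktype_for(packet_type: int) -> int:
    """Return the pcapng link type for a PktMon PacketType value."""
    try:
        return PACKET_TYPE_TO_LINKTYPE[packet_type]
    except KeyError:
        # Other media (for example Mobile BroadBand) are not handled here.
        raise ValueError(f"unsupported PktMon PacketType {packet_type}") from None
```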

Apart from correctly identifying 802.11 frames as such (LINKTYPE_IEEE802_11), there is one other additional step that might be needed when saving 802.11 packets to PCAPNG format: the “protected” flag in the 802.11 frame control field might need to be cleared. The “protected” flag indicates whether the frame was encrypted; packets carrying network layer data are protected/encrypted but, by the time received packets have reached the packet capture hooks, the protected/encrypted content has been decrypted – however the captured 802.11 packet header (frame control, etc.) is not necessarily updated (probably depends on the network interface driver). If the protected bit is not cleared when saving the capture then, when the capture is loaded into Wireshark, the packet is assumed to still be protected and is displayed as such (no attempt is made to decode the “encrypted” portion of the packet).
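Clearing the flag is a one-bit operation. In the 802.11 Frame Control field, the “protected” flag is bit 6 (mask 0x40) of the second octet. The following is a minimal sketch that assumes the buffer starts directly with the 802.11 MAC header (no radiotap or other capture header in front of it); the function name is hypothetical.

```python
PROTECTED_FLAG = 0x40  # bit 6 of the second Frame Control octet


def clear_protected_flag(frame: bytes) -> bytes:
    """Return a copy of an 802.11 frame with the 'protected' flag cleared.

    Assumes the buffer begins with the 2-octet Frame Control field.
    Buffers that are too short to contain it are returned unchanged.
    """
    if len(frame) < 2:
        return frame
    patched = bytearray(frame)
    patched[1] &= ~PROTECTED_FLAG & 0xFF
    return bytes(patched)
```

Whether the flag should be cleared at all depends on the driver: it only makes sense when the captured payload has already been decrypted.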

pktmon start --trace

PktMon can control (start/stop) other ETW trace providers, as can “netsh trace”, logman and wpr (Windows Performance Recorder) amongst others. However, PktMon differs from the other controllers in the default values for “keywords” and “level” passed to the providers. By default, most controllers use “maximal” values for keywords and level but PktMon uses a value of 0xFFFFFFFF for the keywords value (which is actually a 64-bit value, so the high 32 bits are set to zero) and 4 for the level (the maximum value is 255 and Microsoft-Windows-TCPIP, for example, logs some events at level 17).

Although the help/documentation (“pktmon start help”) mentions this, I have been caught out more than once puzzling over why certain expected events were missing from a trace.
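The effect of these defaults is easy to see with a simplified model of the check that decides whether an event is written. The sketch below considers only the level and the “match any” keyword mask (the “match all” mask is ignored), and the function name and the sample keyword values are purely illustrative.

```python
PKTMON_DEFAULT_KEYWORDS = 0xFFFFFFFF  # high 32 bits of the 64-bit mask are zero
PKTMON_DEFAULT_LEVEL = 4


def event_enabled(event_level: int, event_keyword: int,
                  session_level: int, match_any_keyword: int) -> bool:
    """Simplified ETW filter: level check plus 'match any' keyword check."""
    level_ok = session_level == 0 or event_level <= session_level
    # An event with no keyword bits set passes the keyword check.
    keyword_ok = event_keyword == 0 or (event_keyword & match_any_keyword) != 0
    return level_ok and keyword_ok


if __name__ == "__main__":
    # Keyword in the low 32 bits, level 4: logged with the PktMon defaults.
    print(event_enabled(4, 0x20, PKTMON_DEFAULT_LEVEL, PKTMON_DEFAULT_KEYWORDS))
    # Keyword only in the high 32 bits: silently missing from the trace.
    print(event_enabled(4, 0x200000000000, PKTMON_DEFAULT_LEVEL, PKTMON_DEFAULT_KEYWORDS))
    # Level 17 event: missing, whatever its keyword.
    print(event_enabled(17, 0x20, PKTMON_DEFAULT_LEVEL, PKTMON_DEFAULT_KEYWORDS))
```

Events whose keyword bits all lie in the high 32 bits, or whose level is above 4, are the ones that go missing unless keywords and level are specified explicitly.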

New Packet Monitor API (Windows 11 24H2, Windows Server 2025)

PktMonApi.dll now exports additional, documented routines:

PacketMonitorAddCaptureConstraint
PacketMonitorAddSingleDataSourceToSession
PacketMonitorAttachOutputToSession
PacketMonitorCloseRealtimeStream
PacketMonitorCloseSessionHandle
PacketMonitorCreateLiveSession
PacketMonitorCreateRealtimeStream
PacketMonitorEnumDataSources
PacketMonitorInitialize
PacketMonitorSetSessionActive
PacketMonitorUninitialize

The documentation highlights “Multisession” and “Packet-streaming” as new capabilities. The new capabilities are enabled via “Controlled Feature Rollout” (CFR) for the feature named "UxConfTest"; Microsoft notes:

Using CFR, features may be gradually rolled out, starting with devices that install the monthly optional non-security preview release. When we've validated that each feature is ready, we'll gradually roll it out to new devices, and eventually include it enabled-by-default in a subsequent monthly security update.

Two characteristics of the new routines immediately irritated me. The first irritation is the granularity of the captured packet timestamps; the documentation says:

TimeStamp – Timestamp when the packet was reported. This is retrieved using ‘KeQuerySystemTime’.

KeQuerySystemTime documentation includes the following remark:

System time is typically updated approximately every ten milliseconds.

My first test of the new routines was to capture packets and write them to a file in PCAPNG format. I was surprised, when looking at the packets in Wireshark, how many packets had identical timestamps (a lot of packets can be exchanged in ten to twenty milliseconds).

The second irritation is that the length of the packet is not included in the metadata describing a captured packet. The length of the captured data is, of course, available but if a TruncationSize (snapshot length) is specified, the original length of the packet is not available from the metadata (it might be possible to infer the original length of the packet from the packet contents). Since the original length and snapshot length are included in the PCAPNG packet block, I chose to capture full packets (TruncationSize = 9000).
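Both irritations show up when writing a PCAPNG Enhanced Packet Block, which carries a timestamp, the captured length and the original length. The following minimal sketch builds such a block. It assumes the timestamp is a Windows system time value (100-nanosecond units since 1601, as returned by KeQuerySystemTime) and that the interface uses the default PCAPNG timestamp resolution of microseconds; the function names are hypothetical and this is not the code I used.

```python
import struct

# Offset between 1601-01-01 (Windows epoch) and 1970-01-01, in 100 ns units.
EPOCH_DELTA_100NS = 116444736000000000


def filetime_to_unix_us(filetime: int) -> int:
    """Convert a Windows system time value to microseconds since 1970."""
    return (filetime - EPOCH_DELTA_100NS) // 10


def enhanced_packet_block(data: bytes, original_len: int, filetime: int,
                          interface_id: int = 0) -> bytes:
    """Build a little-endian pcapng Enhanced Packet Block without options."""
    ts = filetime_to_unix_us(filetime)
    padding = (-len(data)) % 4            # packet data is padded to 32 bits
    total_len = 32 + len(data) + padding  # 28-byte header + data + 4-byte trailer
    header = struct.pack(
        "<IIIIIII",
        6,                  # block type: Enhanced Packet Block
        total_len,
        interface_id,
        ts >> 32,           # timestamp, high 32 bits
        ts & 0xFFFFFFFF,    # timestamp, low 32 bits
        len(data),          # captured packet length
        original_len,       # original packet length
    )
    return header + data + b"\x00" * padding + struct.pack("<I", total_len)
```

When full packets are captured, original_len is simply len(data); with a smaller TruncationSize, the caller would have to infer it from the packet contents.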