In Detail: IKEv2


Wednesday, 14 July 2021

Slow performance of IKEv2 built-in client VPN under Windows

Over time I noticed several reports in technical forums of slow IKEv2 performance, with the observed throughput often quoted as just 10% to 20% of what was expected. Troubleshooting network performance problems almost always requires making network traces and, on the few occasions that I offered to help with the analysis, there was an (understandable) unwillingness to share trace data. When someone did agree to share the data, the analysis proved to be trickier than expected.

The first network trace

The first trace used a combination of two ETW providers: Microsoft-Windows-PktMon (with keywords 0x10) and Microsoft-Windows-TCPIP (with keywords 0x200500000020 and level 16). I had used this combination before when troubleshooting problems with the TCP CUBIC congestion algorithm and out-of-sequence delivery of TCP packets (see here) and developed a simple tool to visualize interesting aspects of the captured data. This is what the tool showed:

The blue line is the TCP send sequence number; the green line is the size of the congestion window (not to scale on the Y-axis – it is its shape that is most interesting); the yellow line is the send window (to same scale as congestion window).

At 6 arbitrary Y-values, points are shown when “interesting” events occur; the events are:

· TcpEarlyRetransmit

· TcpLossRecoverySend Fast Retransmit

· TcpLossRecoverySend SACK Retransmit

· TcpDataTransferRetransmit

· TcpTcbExpireTimer RetransmitTimer

· D-SACK [RFC3708] arrival

The black points on the horizontal axis indicate when IPsec ESP sequence numbers are detected “out-of-sequence”. In contrast to the other data and events, which all refer to the same TCP connection, this information is not necessarily related to the same TCP connection but is “circumstantially” related. The reason for this is that it is difficult to know what is “inside” a particular ESP packet; by combining information from several ETW providers, it is often possible to infer what is inside, but that is not essential for this visualization tool – the temporal relationship between events often “suggests” whether a (causal) link possibly exists.

The traffic being measured in the trace/graph is 10 back-to-back sends of a one megabyte block to the “discard” port on the “server”, so there is no application protocol hand-shaking or other overheads.
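A test of this kind is easy to reproduce. The following is a minimal sketch, assuming a server that accepts TCP connections on the discard port and silently drops whatever it receives; the function name and its parameters are illustrative and are not taken from the original test program.

```python
import socket
import time


def send_blocks(host, port=9, blocks=10, block_size=1024 * 1024):
    """Send `blocks` back-to-back blocks to a discard-style TCP service.

    Returns (bytes_sent, elapsed_seconds). The elapsed time covers the
    period until the last block has been handed to the local TCP stack and
    the socket closed, which is only an approximation of transfer time.
    """
    payload = bytes(block_size)  # content is irrelevant: the server discards it
    start = time.perf_counter()
    with socket.create_connection((host, port)) as sock:
        for _ in range(blocks):
            sock.sendall(payload)
    elapsed = time.perf_counter() - start
    return blocks * block_size, elapsed
```

Dividing the returned byte count by the elapsed time gives a rough throughput figure that can be compared with and without the VPN.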

I had initially expected to see something like a fragmentation problem, but the graph looked exactly like the TCP CUBIC out-of-sequence behaviour. This was surprising because the client and server were on the same network, which made it unlikely that network equipment was causing the disorder in the packet sequencing.

For comparison purposes, this is what the visualization of a “good” transfer looks like:

The first surprise

A more detailed look at the trace data revealed that data-carrying (large) ESP packets were leaving the sender with the ESP sequence numbers sometimes “disordered”. The ESP RFC 2406 describes the generation and verification of sequence numbers and one would certainly expect the transmitted sequence numbers to simply count upwards, one packet at a time, starting from one for a particular SPI (the sender’s counter starts at zero and is incremented before each packet is sent).

Other discoveries

Some reports of this problem on the web note that the problem is not observed when using a wired connection. I also observed this and then tried testing a USB wireless adapter – this device did not exhibit the poor performance behaviour. This led to the conclusion that the physical network adapter was an element of the problem rather than a simple wired/wireless difference.

Another “discovery” (actually a confirmation, since it was tried after a theory about the cause had been formed) was that restricting the client to using just one processor (via BIOS settings) eliminated the problem.

Filtering the trace data for interesting ESP sequence number events often showed something like this:

2021-07-14T08:40:28.1867532Z FBDD70E2: 233 -> 235 (5928)
2021-07-14T08:40:28.1870094Z FBDD70E2: 235 -> 241 (5928)
2021-07-14T08:40:28.1870119Z FBDD70E2: 241 -> 243 (5928)
2021-07-14T08:40:28.1872059Z FBDD70E2: 243 -> 234 (5892)
2021-07-14T08:40:28.1872088Z FBDD70E2: 234 -> 236 (5892)
2021-07-14T08:40:28.1872338Z FBDD70E2: 240 -> 242 (5892)
2021-07-14T08:40:28.1872361Z FBDD70E2: 242 -> 244 (5892)

The format of the above is timestamp (used to correlate various views of the trace data), followed by SPI and then the discontinuity in the sequence numbers and finally the thread ID (in parentheses).

Some of the interesting features of the above trace data are that some of the “jumps” in the ESP sequence number are quite large (e.g. 9 backwards) and that multiple threads are sending interleaved ESP packets: TID 5928 sends Seq№ 241, TID 5892 sends Seq№ 242 and TID 5928 sends Seq№ 243. Either some synchronization mechanism is failing or some prior assumption has proved wrong.
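Lines in this format are easy to post-process. The sketch below is not the tool used for the analysis; it simply parses the seven lines shown above (function names are illustrative) and summarizes which threads appear and how large the largest backward jump is.

```python
import re
from collections import Counter

LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<spi>[0-9A-Fa-f]{8}):\s+"
    r"(?P<prev>\d+)\s*->\s*(?P<next>\d+)\s+\((?P<tid>\d+)\)$"
)

SAMPLE = """\
2021-07-14T08:40:28.1867532Z FBDD70E2: 233 -> 235 (5928)
2021-07-14T08:40:28.1870094Z FBDD70E2: 235 -> 241 (5928)
2021-07-14T08:40:28.1870119Z FBDD70E2: 241 -> 243 (5928)
2021-07-14T08:40:28.1872059Z FBDD70E2: 243 -> 234 (5892)
2021-07-14T08:40:28.1872088Z FBDD70E2: 234 -> 236 (5892)
2021-07-14T08:40:28.1872338Z FBDD70E2: 240 -> 242 (5892)
2021-07-14T08:40:28.1872361Z FBDD70E2: 242 -> 244 (5892)
"""


def parse_discontinuities(text):
    """Turn each 'timestamp SPI: a -> b (tid)' line into a dict."""
    events = []
    for raw in text.splitlines():
        m = LINE.match(raw.strip())
        if m is None:
            continue  # ignore anything that is not a discontinuity line
        prev, nxt = int(m["prev"]), int(m["next"])
        events.append({
            "timestamp": m["ts"],
            "spi": m["spi"],
            "prev": prev,
            "next": nxt,
            "jump": nxt - prev,  # negative means the sequence went backwards
            "tid": int(m["tid"]),
        })
    return events


def summarize(events):
    """Count events per thread and find the largest backward jump."""
    backward = [e["jump"] for e in events if e["jump"] < 0]
    return {
        "threads": Counter(e["tid"] for e in events),
        "largest_backward_jump": min(backward) if backward else 0,
    }
```

Calling `summarize(parse_discontinuities(SAMPLE))` reports two thread IDs and a largest backward jump of -9, matching the observations above.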

The use of threads in the problematic case is complicated, so it is instructive to first see how they are used in the well-behaved case.

Threading in the well-behaved case

To recap the situation, ETW tracing is being performed on a client that is sending data to the TCP discard port (9). Although the client is “driving” the interaction (initiating the connection, sending data, etc.), it is useful to follow the use of threads from the arrival of a packet (probably a TCP acknowledgement-only packet) to the sending of a response packet (the next packet of data); often the client is waiting to send (normally because the congestion window is closed) and the arrival of an acknowledgement triggers the sending of more data.

When an incoming packet is first detected by the ETW trace, the stack looks like this:

ndis!PktMonClientNblLogNdis+0x2b
ndis!NdisMIndicateReceiveNetBufferLists+0x3d466
rt640x64.sys+0x1F79B
rt640x64.sys+0x29F9
ndis!ndisInterruptDpc+0x197
nt!KiExecuteAllDpcs+0x30e
nt!KiRetireDpcList+0x1f4
nt!KiIdleLoop+0x9e

rt640x64.sys is the device driver for a “Realtek PCIe GBE Family Controller” (no symbols available).

Interrupts for a particular device are often directed to a particular processor and the DPC, queued in the interrupt handling routine, typically executes on the same processor. When the indication of an incoming packet reaches AgileVpn.sys (the WAN Miniport driver for IKEv2 VPNs), AgileVpn dispatches the further processing of the message to one of its worker threads; the particular worker thread is chosen “deterministically”, using the current processor number to choose a thread that will run on the same processor.

Most of the subsequent processing (unpacking/decrypting the ESP packet, indicating the decrypted incoming packet to the VPN interface, releasing any pending packet that can be sent (congestion control permitting), encrypting/encapsulating the outgoing packets, etc.) takes place on the same thread. When the packet re-enters AgileVpn on a transmit path, it is again handed over to a different AgileVpn worker thread on the same processor (AgileVpn creates 3 worker threads per processor, for different tasks).

So, in summary (and neglecting the initial interrupt), all of the activity normally takes place in three threads, all with an affinity for the same processor. This normally ensures that IPsec packets are sent in the expected sequence, but it is not a guarantee (just “good enough”).

Using Performance Monitor to trace the send/indicate activity per processor during a “discard” test shows this distribution of load:

Threading in the badly-behaved case

With the well-behaved case understood, the stack at the point where an incoming packet is first detected by the ETW trace in the badly-behaved case already suggests the likely cause of the problem:

ndis!PktMonClientNblLogNdis+0x2b
ndis!NdisMIndicateReceiveNetBufferLists+0x3d466
wdiwifi!CPort::IndicateFrames+0x2d8
wdiwifi!CAdapter::IndicateFrames+0x137
wdiwifi!CRxMgr::RxProcessAndIndicateNblChain+0x7f7
wdiwifi!CRxMgr::RxInOrderDataInd+0x35a
wdiwifi!AdapterRxInorderDataInd+0x92
Netwtw06.sys+0x51D13
Netwtw06.sys+0x52201
ndis!ndisDispatchIoWorkItem+0x12
nt!IopProcessWorkItem+0x135
nt!ExpWorkerThread+0x105
nt!PspSystemThreadStartup+0x55
nt!KiStartSystemThread+0x28

The “indication” of the incoming packet (to PktMon and AgileVpn) does not occur in the context of the device driver DPC routine but rather in a system worker thread; the device driver DPC routine probably called NdisQueueIoWorkItem.

NdisAllocateIoWorkItem and NdisQueueIoWorkItem make no statements (and provide no guarantees) about which processor the worker thread will execute on. The subsequent handling of the packet is similar to the well-behaved case (transferring processing to an AgileVpn worker thread, etc.) but, since the initial processor number is essentially chosen at random for each incoming packet, AgileVpn worker threads on all processors are used.

A “snapshot” of the same Performance Monitor trace as above looks like this:

I used the word “snapshot” because the distribution of load varies quite a bit during a test – the essential message is clear though: many processors are sending packets on the same SPI concurrently and the risk of disorder in the ESP sequence numbers is clearly present.
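The effect can be illustrated with a toy model. This is only a sketch under invented assumptions: it does not reflect how AgileVpn is implemented, and the per-processor delays are arbitrary numbers chosen to stand in for scheduling differences between worker threads. Each packet receives its ESP sequence number at a fixed time and then reaches the wire after the delay of whichever processor handles it.

```python
import random


def wire_order(num_packets, choose_cpu, cpu_delay):
    """Return ESP sequence numbers in the order they reach the wire.

    Packet i is numbered i + 1 at time i and is then delayed by the worker
    on the processor that choose_cpu(i) selects.
    """
    departures = []
    for i in range(num_packets):
        seq = i + 1
        cpu = choose_cpu(i)
        departures.append((i + cpu_delay[cpu], seq))
    departures.sort()  # equal departure times resolve in sequence order
    return [seq for _, seq in departures]


def count_backward_steps(seqs):
    """Count positions where a sequence number is lower than its predecessor."""
    return sum(1 for a, b in zip(seqs, seqs[1:]) if b < a)


CPU_DELAY = [0.0, 2.5, 5.0, 7.5]  # invented latencies, one per processor
rng = random.Random(1)

# Well-behaved case: every packet is handled on the same processor.
same_cpu = wire_order(200, lambda i: 0, CPU_DELAY)

# Badly-behaved case: the processor is effectively random per packet.
random_cpu = wire_order(200, lambda i: rng.randrange(len(CPU_DELAY)), CPU_DELAY)
```

With a single processor the wire order equals the numbering order; with a random processor per packet, `count_backward_steps(random_cpu)` is greater than zero.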

TCP CUBIC congestion control

Sending packets out-of-sequence is not ideal (especially if the degree of disorder approaches the level where ESP Sequence Number Verification starts to have an effect) but higher level protocols that guarantee in-sequence delivery (such as TCP, HTTP/3, QUIC, etc.) have mechanisms for coping with out-of-sequence packets.
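For reference, the receiver-side check can be sketched as a sliding window over sequence numbers, in the style of the anti-replay window described in the ESP RFCs. This is a simplified illustration (a real implementation only updates the window after the packet's integrity check has succeeded), and the class name is hypothetical. The default of 64 follows the window size that RFC 2406 recommends.

```python
class ReplayWindow:
    """Simplified ESP-style anti-replay window."""

    def __init__(self, size=64):
        self.size = size
        self.highest = 0  # highest sequence number accepted so far
        self.bitmap = 0   # bit n set means (highest - n) has been seen

    def accept(self, seq):
        """Return True if seq is new and inside (or ahead of) the window."""
        if seq == 0:
            return False  # sequence number 0 is never transmitted
        if seq > self.highest:
            # Packet advances the window: shift the history and mark it seen.
            shift = seq - self.highest
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.highest = seq
            return True
        offset = self.highest - seq
        if offset >= self.size:
            return False  # too old: falls off the left edge of the window
        if self.bitmap & (1 << offset):
            return False  # already seen: a replay or duplicate
        self.bitmap |= 1 << offset
        return True
```

A packet that arrives 9 behind the highest sequence number seen, as in the trace above, is still inside a 64-packet window and is accepted; only disorder larger than the window, or a genuine duplicate, leads to a drop.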

In particular TCP, at one level, can handle out-of-sequence packets easily. However the particular congestion control mechanism used for a TCP connection can behave badly – and Windows’ current implementation of CUBIC does behave badly and is the cause of the slow transfer rate.

This is what happens in the “discard” test case:

· The client sends data bearing TCP packets out-of-sequence to the server.

· The server follows normal acknowledgement policies (e.g. at least one acknowledgement for every two received data packets) and includes SACKs where appropriate.

· The client receives the acknowledgements, and sometimes a run of acknowledgements for the same sequence number is received. This happens because the segment that would fill the gap, and so allow the acknowledged sequence number to advance, has not yet arrived at the server (or has not yet been processed by it), having been sent out-of-sequence.

· The client triggers congestion control mechanisms, reducing the size of its congestion window and (fast) retransmitting the “missing” segment.

· The server receives the duplicate segment and includes a D-SACK in its next acknowledgement.

· The client ignores the D-SACK.

In a better CUBIC implementation the client, upon receiving a D-SACK, would “undo” the congestion window reduction caused by the spurious retransmission and adjust its parameters for detecting a suspected lost segment.
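The difference between ignoring and honouring a D-SACK can be shown with a deliberately tiny model. This is a sketch only: it is not the Windows implementation, it ignores CUBIC's window growth function entirely, and the "undo" policy (restore the saved window and raise the duplicate-ACK threshold by one) is just one plausible response. The decrease factor of 0.7 is the value RFC 8312 specifies for CUBIC.

```python
from dataclasses import dataclass
from typing import Optional

BETA = 0.7  # CUBIC multiplicative decrease factor (RFC 8312)


@dataclass
class Sender:
    cwnd: float = 100.0         # congestion window, in segments
    dupthresh: int = 3          # duplicate ACKs that trigger fast retransmit
    undo_on_dsack: bool = False
    dup_acks: int = 0
    saved_cwnd: Optional[float] = None

    def on_duplicate_ack(self):
        """Return True when the caller should fast retransmit."""
        self.dup_acks += 1
        if self.dup_acks == self.dupthresh:
            self.saved_cwnd = self.cwnd  # remember the pre-reduction window
            self.cwnd *= BETA
            return True
        return False

    def on_new_ack(self):
        self.dup_acks = 0

    def on_dsack(self):
        """The receiver saw a duplicate: the retransmission was spurious."""
        if self.undo_on_dsack and self.saved_cwnd is not None:
            self.cwnd = self.saved_cwnd  # undo the needless reduction
            self.dupthresh += 1          # be more tolerant of reordering
        self.saved_cwnd = None


def spurious_retransmit(sender):
    """Three duplicate ACKs caused by reordering, followed by a D-SACK."""
    for _ in range(3):
        sender.on_duplicate_ack()
    sender.on_new_ack()
    sender.on_dsack()
    return sender
```

Running `spurious_retransmit` on a default `Sender` leaves the window reduced, while a `Sender(undo_on_dsack=True)` ends with its original window and a higher duplicate-ACK threshold.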

Summary and workarounds

A number of factors combine to cause the poor performance. One element is a network adapter that uses NdisQueueIoWorkItem, causing the initial NDIS “indication” of packet arrival to occur on random processors. Another element is how AgileVpn distributes the load based on the current processor number rather than, say, SPI (although I don’t know if information like SPI is available at the time this decision is made). The final element is the weakness of the TCP congestion control implementation.

There are no good workarounds. Using different network interfaces (if available) or different VPN protocols (if appropriate/possible) is obviously possible (but probably an unhelpful suggestion). “Hamstringing” the system so that it only uses one processor is not something that one could seriously propose.

Improvements in the TCP congestion control implementation have been announced but are not available in any mainstream Windows version yet.

Tuesday, 10 December 2019

Implementing an IKEv2 VPN client under Windows 10 VPN

I have previously written about the difficulties of implementing a Cisco AnyConnect VPN client using just the Extensible Authentication Protocol (EAP) framework and interfaces in Windows 10: most of what needs to be done to enable the establishment of VPN connections to an AnyConnect server can be implemented as an EAP authentication mechanism, but some requirements cannot be fulfilled within this framework. These requirements are: use of IKEv2 Vendor ID payloads, control of the IDi Identification payload and access to IKEv2 messages and derived keys for the AnyConnect authentication computations.

The first two messages in the establishment of an IKEv2 security association are sent in plain text, so one can see their content with a simple network sniffer, but subsequent messages are encrypted and if the “protocol” is not fully documented (e.g. the AnyConnect EAP mechanism) then one needs to somehow see these encrypted messages in plain text. There are several ways to do this, but if one has access to an original client and server then one way is particularly useful – develop server and client replacements in “steps”. By “steps” I mean implement the known steps of the protocol and then use the original client/server in conjunction with the new server/client to capture the next unknown packet in plain text.

For example, at the start of the process one does not know what the client will send in its second message (its first encrypted message), so one can use the original client to connect to the new server (that, at first, only knows how to handle the first, unencrypted, message exchange). After seeing what the original client sends, one can use the new client to send an equivalent message to an original server and see what it returns in plain text. With the exception of the detailed cryptographic computations used in the authentication process, one can derive an understanding of the full protocol exchanges using this technique.

At the end of the process, one has a full keying module for the protocol but, having served its purpose as a way of understanding the protocol, it is of little practical use without two other components: an implementation of the Encapsulating Security Payload (ESP) protocol and a mechanism for creating the operating system “objects” (such as network interfaces) that can be assigned IP addresses, appear in routing tables, etc.

It is perhaps not widely known, but the Windows Filtering Platform (WFP) Base Filtering Engine (BFE) contains all of the functionality necessary for the first task (using ESP to transport data) and is documented and fully supported. Two documentation entries in Using Windows Filtering Platform are particularly useful: Using Tunnel Mode and Manual SA Keying.

At this stage, it suffices to say that the first task can be implemented using documented APIs and the C# programming language (with quite a large investment of time in creating C# declarations of the WFP data structures). But before expanding on this topic, it would be reassuring to know that the second task (creating a “network interface”) can also be achieved. There appears to be no documented API for this, but the Microsoft IKEv2 implementation (part of IKEEXT) uses the DLL “vpnikeapi”. The DLL exports of vpnikeapi look very promising:

CancelProcessEapAuthPacket

CloseTunnel

CreateTunnel

FreeConfigurationPayloadBuffer

FreeEapAuthAttributes

FreeEapAuthPacket

FreeIDPayloadBuffer

FreeTrafficSelectors

GetConfigurationPayloadRequest

GetIDPayload

GetOptionalIDrPayload

GetServerEapAuthRequestPacket

GetTrafficSelectorsRequest

NewRasIncomingCall

ProcessAdditionalAddressNotification

ProcessConfigurationPayloadReply

ProcessConfigurationPayloadRequest

ProcessEapAuthPacket

ProcessTrafficSelectorsReply

ProcessTrafficSelectorsRequest

QueryEapAuthAttributes

RemoveTrafficSelectors

TunnelAuthDone

UpdateTunnel

This looks just like the set of routines that would be needed to integrate IKEv2 into an operating system as both a client and a server. The IKEv2 payloads Configuration, Identification, Traffic Selector and EAP are just those that have relevance to the operating system, “ProcessAdditionalAddressNotification” sounds as though it supports MOBIKE and there are routines to create and close a “tunnel”.

Unfortunately, there is one preparatory step that must be performed before using these routines and the implications of that step make the vpnikeapi DLL unsuitable for my purposes. Why that is the case will become apparent later.

Flow of control and data in the Windows IKEv2 VPN implementation

The RasDial routine is the entry point for the flow of control for the establishment of a VPN connection and %APPDATA%\Microsoft\Network\Connections\Pbk\rasphone.pbk is the source of the configuration data for the connection. RasDial communicates with the “RasMan” (Remote Access Connection Manager) service via RPC and most of the operating system specific “action” takes place in that service; RasMan communicates with the IKEEXT (IKE and AuthIP IPsec Keying Modules) service for IKE/IKEv2/AuthIP protocol specific functionality.

When the IKEEXT service starts up, it uses the undocumented WFP routine IPsecKeyModuleAdd0 to register itself as the keying module provider for IKE, IKEv2 and AuthIP; the registration includes call-back routines to acquire and expire Security Associations. When RasMan needs the services of a registered keying module, it uses the undocumented routine IPsecSaInitiateAsync0 to initiate the behaviour (by invoking one of the registered call-back routines) and the keying module uses the routine IPsecKeyModuleUpdateAcquire0 to feedback progress. IPsecKeyModuleDelete0 rounds out the small set of keying module routines.

Some information is passed to the keying module in the parameters of the call-back routine, but this information is largely references to information stored dynamically in WFP in the form of “provider contexts”. These “provider contexts” are created from information specified in the mainModePolicy, tunnelPolicy and keyModKey arguments to the FwpmIPsecTunnelAdd2 routine:

DWORD FwpmIPsecTunnelAdd2(
    HANDLE engineHandle,
    UINT32 flags,
    const FWPM_PROVIDER_CONTEXT2 *mainModePolicy,
    const FWPM_PROVIDER_CONTEXT2 *tunnelPolicy,
    UINT32 numFilterConditions,
    const FWPM_FILTER_CONDITION0 *filterConditions,
    const GUID *keyModKey,
    PSECURITY_DESCRIPTOR sd
);

The routines exported by vpnikeapi (but implemented by vpnike.dll in the RasMan service) are only available for use when a VPN establishment has been initiated by RasDial or similar. rasphone.pbk does not directly define the keying module GUID to be used (instead, it indicates (via VpnStrategy) which type of VPN is preferred and RasMan chooses the appropriate keying module GUID), so while it would be easy to register a parallel/alternative IKEv2 keying module, it would be difficult to route any work to it. To make the situation even clearer, vpnike contains an explicit test that its client is IKEEXT (via the IKEEXT service SID), so even if one did register a new keying module and managed to persuade RasMan to use it and initialize the data structures needed for vpnikeapi use, the new keying module would fail the security tests performed by the vpnike DLL.

Short aside: VPN Connection IPsec Configuration

By default, the Windows IKEv2 VPN client makes 18 Main Mode Security Association (SA) Proposals and 6 Quick Mode SA proposals. This default list of proposals can be strengthened and shortened by registry entries (such as NegotiateDH2048_AES256) or set (and limited) to one proposal by use of the Set-VpnConnectionIPsecConfiguration PowerShell cmdlet.

The flexibility of SA Proposals and Transforms in IKEv2 means that often two proposals (one proposal with GCM/CCM mode cipher algorithms and one with CBC mode cipher algorithms) are adequate. However the WFP data structures used to communicate this information between RasMan and IKEEXT were probably designed before IKEv2 was standardized and do not support its flexibility – they are better suited to the IKE proposal model.

The format of proposals stored in rasphone.pbk exhibits the same problem (one transform of each type per proposal). They are stored as a sequence of serialized ROUTER_CUSTOM_IKEv2_POLICY_0 structures with the name “CustomIPSecPolicies”; “NumCustomPolicy” records how many proposals there are. Set-VpnConnectionIPsecConfiguration always creates a single custom IPsec policy, but manual editing of rasphone.pbk can be used to add more.
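As an illustration of the “one transform of each type per proposal” shape, the sketch below decodes a sequence of such records. It rests on several assumptions that should be checked against a real phonebook file: that the value is a hexadecimal string, that each record is six little-endian DWORDs with no padding, and that the field names match those of the documented structure as I recall them. The numbers used to build a sample are arbitrary placeholders, not real algorithm identifiers.

```python
import struct

# Field names as recalled from the documented structure (an assumption).
FIELDS = (
    "dwIntegrityMethod",
    "dwEncryptionMethod",
    "dwCipherTransformConstant",
    "dwAuthTransformConstant",
    "dwPfsGroup",
    "dwDhGroup",
)
RECORD = struct.Struct("<6I")  # six little-endian DWORDs = 24 bytes


def decode_policies(hex_blob, count):
    """Decode `count` serialized policy records from a hex string."""
    data = bytes.fromhex(hex_blob)
    if len(data) != count * RECORD.size:
        raise ValueError(
            f"expected {count * RECORD.size} bytes for {count} policies, "
            f"got {len(data)}"
        )
    return [dict(zip(FIELDS, rec)) for rec in RECORD.iter_unpack(data)]


# Build a two-record sample from placeholder numbers and decode it again.
sample = RECORD.pack(1, 2, 3, 4, 5, 6).hex() + RECORD.pack(7, 8, 9, 10, 11, 12).hex()
policies = decode_policies(sample, count=2)
```

Each decoded record carries exactly one value per transform type, which is why expressing an IKEv2-style proposal with several alternative transforms needs several records.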

How to create a Network Interface?

This picture, taken from the RAS Architecture Overview, shows the search area for a solution to the problem of creating a network interface to “front up” the ESP-secured (VPN) connection that I hoped to establish (that is, to present it to the operating system as an interface):

The Windows 10 IKEv2 client is sometimes known as the “Agile VPN” client and there is an agilevpn.sys WAN Miniport Driver which will probably be needed by any solution built from existing components. The diagram shows two potential user-mode interfaces that might provide the required functionality: RAS and TAPI (the Telephony API). However, I could not find any set of documented RAS or TAPI routines that could make any effective contribution to solving the problem (the routines that RasMan uses are not exported from any DLL (they are internal to vpnike.dll, rasmans.dll, etc.)).

The next step was to assume that the interface between the user-mode abstractions and the kernel-mode functionality was implemented with Device I/O Controls (IOCTLs). Tracing control and data flows across this boundary during VPN connection establishment suggested that IOCTLs to four different device drivers would be necessary:

\Device\NDProxy

\Device\NdisWan

\Device\AgileVPN

\Device\WANARP

This turned out to be almost correct, but one thing further was needed to complete the task: a call to the undocumented routine “NsiSetAllParameters”.

The security descriptors on the NDProxy and AgileVPN devices permit only access by SYSTEM; NdisWan grants access to SYSTEM and NETWORK SERVICE and WANARP grants some access to “Authenticated Users”.

NDProxy

As suggested by the architecture overview diagram, and in practice, NDProxy is the first point of contact.

After opening the device, one can issue a “connect” IOCTL, which returns the number of devices (numbered from zero to N - 1) which can be used via NDProxy; this is typically 5: the WAN Miniports SSTP, IKEv2, L2TP, PPTP and PPPOE. One can then iterate over these miniports, looking for one with an NDIS_WAN_MEDIUM_SUBTYPE of NdisWanMediumAgileVPN (i.e. IKEv2).

Having identified and opened the miniport, the next step is to make a “call” on it via a “query info” IOCTL with OID_TAPI_MAKE_CALL. The type of information needed to make the call includes the IP address of the VPN server, the network interface index via which the VPN server is reachable (the GetBestInterfaceEx API routine for the VPN server address returns the needed value) and the “tunnel ID”.

The “tunnel ID” is a 64 bit number that can be freely chosen (it should uniquely identify the tunnel, but normally there will be only one tunnel). It is used to link the network interface to the BFE SA Contexts (using the IPSEC_VIRTUAL_IF_TUNNEL_INFO0 structure).

In general, TAPI call establishment can take some time and RasMan uses asynchronous I/O to perform the call. Synchronous I/O works too and is OK for test purposes. Even with synchronous I/O, the call is not necessarily established when the I/O completes. There is an event reporting mechanism (line events) which reports changes in call state, but polling the call state (waiting for LINECALLSTATE_CONNECTED) works too.
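The polling approach amounts to a loop with a timeout. In the sketch below, `get_call_state` is a hypothetical callable standing in for whatever IOCTL wrapper returns the current call state; only the constant value comes from the TAPI headers.

```python
import time

LINECALLSTATE_CONNECTED = 0x00000100  # value defined in the TAPI headers


def wait_for_connected(get_call_state, timeout=10.0, interval=0.1,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_call_state() until it reports CONNECTED.

    Raises TimeoutError if the state is not reached within `timeout` seconds.
    `sleep` and `clock` are injectable so the loop can be tested without
    real delays.
    """
    deadline = clock() + timeout
    while True:
        if get_call_state() == LINECALLSTATE_CONNECTED:
            return True
        if clock() >= deadline:
            raise TimeoutError("call did not reach LINECALLSTATE_CONNECTED")
        sleep(interval)
```

A production version would presumably also react to states that indicate failure rather than waiting out the timeout.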

Once the call has been established, an IOCTL with OID_TAPI_GET_ID can be used to retrieve the (connection) ID which is needed for a later step. This completes the initial interaction with NDProxy; nothing more needs to be done until the VPN connection is terminated, when one should drop (OID_TAPI_DROP) and close (OID_TAPI_CLOSE_CALL) the call and close the device (OID_TAPI_CLOSE).

AgileVPN

The call initiated via NDProxy triggers call set-up in AgileVPN, so AgileVPN is then ready for additional IKEv2 specific set-up – the first of which is informing AgileVPN of the IKEv2 Traffic Selectors.

The second, and final, IOCTL directly to AgileVPN is the command to create the VPN tunnel.

WANARP

A new network interface is still not visible at this point. One first has to issue an IOCTL to WANARP to obtain a network interface LUID for an interface of type IF_TYPE_PPP.

When the VPN connection is terminated and the network interface taken down, the LUID allocation should be freed (via another IOCTL).

NdisWan

Before performing the “activate route” IOCTL on NdisWan, it is necessary to choose a GUID for the network interface and associate this with a network compartment – this can be done via the undocumented NsiSetAllParameters routine.

The first IOCTL to NdisWan is used to map the connection ID obtained from NDProxy to a “bundle handle”. Once one has the bundle handle, one can activate the network interface and route to the network interface. In addition to the bundle handle, the IOCTL that activates the route requires the tunnel ID, the LUID obtained from WANARP, a name for the network interface, the GUID for the network interface and the IP address assigned by the VPN server to this client.

When the VPN connection is terminated and the network interface taken down, the route should be deactivated (via another IOCTL).

Finishing touches

Despite providing the assigned IP address to NdisWan, the network interface comes up with a link-local IP address (169.254.X.X). The correct address can be assigned via a call to the IP Helper API routine CreateUnicastIpAddressEntry.

The network interface comes up with a few routes preassigned (the VPN server, broadcast and multicast addresses), but it is useful to manually add either a route to the network reachable via the VPN or a default route via the VPN. This can be done with the IP Helper API routine CreateIpForwardEntry2.
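The overall ordering of the steps described in the preceding sections can be summarized as a skeleton. Nothing below talks to a driver: each step only records its name, and the step names are descriptive labels rather than real IOCTL codes. Tearing down in the reverse order of set-up is an assumption on my part; the text above only fixes the order of the three NDProxy clean-up operations.

```python
from contextlib import ExitStack


def bring_up(log):
    """Record the set-up steps in order and return a stack that undoes them."""
    teardown = ExitStack()

    def step(name, *undo):
        log.append(name)
        # Register undo actions so that they later run in the order given.
        for action in reversed(undo):
            teardown.callback(log.append, action)

    step("NDProxy: open, connect, select AgileVPN miniport",
         "NDProxy: OID_TAPI_CLOSE")
    step("NDProxy: OID_TAPI_MAKE_CALL and wait for connected",
         "NDProxy: OID_TAPI_DROP", "NDProxy: OID_TAPI_CLOSE_CALL")
    step("NDProxy: OID_TAPI_GET_ID")
    step("AgileVPN: set traffic selectors")
    step("AgileVPN: create tunnel")
    step("WANARP: allocate interface LUID", "WANARP: free interface LUID")
    step("NSI: NsiSetAllParameters (interface GUID, compartment)")
    step("NdisWan: map connection ID to bundle handle")
    step("NdisWan: activate route", "NdisWan: deactivate route")
    step("IP Helper: CreateUnicastIpAddressEntry")
    step("IP Helper: CreateIpForwardEntry2")
    return teardown


log = []
tunnel = bring_up(log)
setup_steps = list(log)
tunnel.close()  # runs the registered undo actions, most recent first
teardown_steps = log[len(setup_steps):]
```

Using a stack of callbacks means that a failure part-way through set-up can be handled by closing the stack, which undoes only the steps that had already succeeded.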

Windows Filtering Platform / Base Filtering Engine

The FwpmIPsecTunnelAdd routine is just a convenience – its functionality can be mimicked by several calls to other WFP routines. For the purposes of this application, most of its parameters can be null/zero; only the engineHandle, flags and tunnelPolicy need specific values: a valid engineHandle, a flags value of FWPM_TUNNEL_FLAG_ENABLE_VIRTUAL_IF_TUNNELING and a dummy IKEv2 Quick Mode Tunnel provider context. The provider context must include (at least) one IPsec proposal – although this is not used (it need not correspond to any actually offered or negotiated proposal).

Only a few other calls to WFP routines are needed. Calls to IPsecSaContextCreate1 and IPsecSaContextGetSpi1 create a Security Association (SA) context and retrieve the SPI (Security Parameter Index). Once keying material has been negotiated/derived, calls to IPsecSaContextAddInbound1 and IPsecSaContextAddOutbound1 make this information available for use by ESP (Encapsulating Security Payload).

Implementation statistics

The entire implementation of a “working” VPN is less than 7000 lines of C#, of which around 1700 lines are just definitions of WFP structures and over 300 lines are just definitions of Diffie-Hellman primes and curves. The IKEv2 implementation is very “bare bones”, with plenty of missing functionality (such as rekeying).