The Page Fault in NonPaged Pool Syndrome

The crash was a typical-looking bugcheck D1, “DRIVER_IRQL_NOT_LESS_OR_EQUAL”—which is the same as IRQL_NOT_LESS_OR_EQUAL except that the bugcheck routines noticed that the instruction pointer at the time of the exception was inside driver code. The exception happened inside a KMDF DMA support routine, which of course had been called by a KMDF driver.

6: kd> !analyze -v
[...]
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: fffffa8060cab028, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff88000ecb666, address which referenced memory
[...]

The analysis of the problem, and the fix, were more about studying the existing driver source code and implementing DMA correctly than using the debugger. But during the dump analysis I noticed two things that deserved a writeup:

One, I was looking at what seemed to be an impossible value in a page table entry.

And two, I was looking at a page fault to nonpaged pool. And as all Windows driver writers know, that just isn’t supposed to happen.

My, what a big pagefile you have

Here, from the output of .trap, is the instruction that raised the exception:

Wdf01000!FxDmaScatterGatherTransaction::StageTransfer+0x9a:
fffff880`00ecb666 418b5028        mov     edx,dword ptr [r8+28h] ds:fffffa80`60cab028=????????

This tells us that the exception was raised while trying to read the longword at fffffa80`60cab028.

One of the first things I usually do with an address that’s involved in a crash (provided it isn’t something obviously bogus, like 0 or 0x80000000`00000000 or 7) is to try the !pte command on it. PTE is of course short for “page table entry”. The !pte command shows the page table entry, page directory entry, etc., that describe a specified virtual address. Its output will usually tell us exactly what was wrong with the address… which is to say, why the attempted access to that address failed. (It’s one of my favorite debugger commands.)
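For the curious, here is a minimal sketch of how the addresses of those paging-structure entries follow from the virtual address. It assumes the fixed self-map base addresses that x64 Windows used before Windows 10 version 1607 began randomizing them; the constants and the function name `entry_addresses` are my own illustration, not anything taken from the debugger extension's source.

```python
# Assumption: fixed self-map bases (x64 Windows prior to Windows 10 1607).
PTE_BASE = 0xFFFFF68000000000
PDE_BASE = 0xFFFFF6FB40000000
PPE_BASE = 0xFFFFF6FB7DA00000
PXE_BASE = 0xFFFFF6FB7DBED000


def entry_addresses(va):
    """Return the virtual addresses of the PXE, PPE, PDE and PTE that map va."""
    va48 = va & 0xFFFFFFFFFFFF  # keep the 48 translated bits, drop sign extension
    # Each level consumes 9 more address bits; every entry is 8 bytes (<< 3).
    return {
        "PXE": PXE_BASE + ((va48 >> 39) << 3),
        "PPE": PPE_BASE + ((va48 >> 30) << 3),
        "PDE": PDE_BASE + ((va48 >> 21) << 3),
        "PTE": PTE_BASE + ((va48 >> 12) << 3),
    }


addrs = entry_addresses(0xFFFFFA8060CAB028)
for name, addr in addrs.items():
    print(f"{name} at {addr:016X}")
```

For the faulting address in this crash, the sketch yields the same four "at" addresses that !pte reports for it.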

A common case is that the PTE is zero, meaning the address has not been defined at all. Another case that !pte can clarify is an attempt to write to a read-only page.

6: kd> !pte fffffa80`60cab028
                                           VA fffffa8060cab028
PXE at FFFFF6FB7DBEDFA8    PPE at FFFFF6FB7DBF5008    PDE at FFFFF6FB7EA01830    PTE at FFFFF6FD40306558
contains 000000000224F863  contains 0000000002251863  contains 0000001341058863  contains 0082A73400000000
pfn 224f      ---DA--KWEV  pfn 2251      ---DA--KWEV  pfn 1341058   ---DA--KWEV  not valid
                                                                                  PageFile:  0
                                                                                  Offset: 82a734
                                                                                  Protect: 0

(You really need a wide-screen display for debugging x64!) “Not valid” over there underneath the PTE explains why the reference to this address raised an exception: The “valid” bit (bit 0 in the PTE) is clear. (Some refer to this as the “page resident” bit, or the “page present” bit. I have to agree that these are more evocative terms, if less succinct.) Any reference to memory that goes through a PTE (or a PDE, or any higher-level entry) with this bit clear will incur a page fault.
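The same reading can be done by hand. The following is a rough sketch that tests bit 0 of each of the four raw values from the !pte output and, for valid entries, pulls out the page frame number. The 36-bit PFN width (bits 12 through 47) is an assumption on my part, and `describe` is just an illustrative helper name.

```python
# Raw 64-bit entry contents copied from the !pte output.
ENTRIES = {
    "PXE": 0x000000000224F863,
    "PPE": 0x0000000002251863,
    "PDE": 0x0000001341058863,
    "PTE": 0x0082A73400000000,
}

VALID_BIT = 1 << 0
PFN_MASK = (1 << 36) - 1  # assumption: PFN occupies bits 12..47


def describe(raw):
    """Report whether an entry is valid and, if it is, its page frame number."""
    if raw & VALID_BIT:
        return f"valid, pfn {(raw >> 12) & PFN_MASK:x}"
    return "not valid"


for name, raw in ENTRIES.items():
    print(f"{name}: {describe(raw)}")
```

The three upper-level entries come out valid with PFNs 224f, 2251 and 1341058, and only the PTE itself has the valid bit clear.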

The bugcheck parameters show that the exception was raised at IRQL 2. Of course, page faults at IRQL 2 or above are forbidden and result in a bugcheck.

But a closer look at the !pte output shows something strange:

PageFile:  0
Offset: 82a734
Protect: 0

Usually, when the !pte command annotates a PTE this way, it means that a copy of the virtual page can be found in the pagefile. “PageFile: 0” means it’s in pagefile number 0 (that’s a four-bit field in the PTE, so up to 16 pagefiles are supported), and “Offset: 82a734” is the offset in pages from the start of the pagefile.

Fine, but that would mean the pagefile has to be at least 0x82a735 pages in size—over 35 gigabytes. That’s possible, but uncommon. The system in question turned out to have a quite ordinary-sized pagefile of 4 GB.
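The arithmetic is easy to check. This is only a back-of-the-envelope sketch; it assumes the usual 4 KiB page size and treats the "4 GB" pagefile as 4 GiB.

```python
PAGE_SIZE = 4096                 # assumption: standard 4 KiB pages
offset_pages = 0x82A734          # "Offset" reported by !pte

# A page at this offset implies at least offset + 1 pages in the file.
min_pages = offset_pages + 1
min_bytes = min_pages * PAGE_SIZE

actual_pagefile_pages = (4 * 2**30) // PAGE_SIZE  # assumption: 4 GiB pagefile

print(f"minimum pagefile size: {min_bytes / 10**9:.2f} GB "
      f"({min_bytes / 2**30:.2f} GiB)")
print(f"offset fits in actual pagefile: {offset_pages < actual_pagefile_pages}")
```

The implied minimum is a little over 35 GB in decimal units (between 32 and 33 GiB), and the offset is far past the end of a 4 GiB file.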

Let’s come back to that point. It turns out to be completely consistent with what happened, but we first need to look at another puzzling part of the situation.

Pagefault in… nonpaged pool?

Another command I’ll commonly try on most any kernel-space address that’s involved in a crash is !pool. Just about all data referred to by kernel mode code (and certainly by kernel mode drivers, the I/O subsystem, and KMDF) is going to be either in one of the pools or on the stack, and this address wasn’t close to the stack. So !pool should tell us something, right?

Well, in this case, one might be forgiven for thinking otherwise: the debugger has already told us that it can’t see the contents of the page (that’s the “????????” in the instruction decode output). After all, the page isn’t valid, which means that the physical page will not be in a kernel memory dump (which is what I had here). So no pool tags, etc., can possibly be displayed.

Nevertheless, !pool can be useful, if only to tell us whether the address is within one of the pools.

6: kd> !pool fffffa80`60cab028
Pool page fffffa8060cab028 region is Nonpaged pool
fffffa8060cab000 is not a valid large pool allocation, checking large session pool...

[...]

Pool page [ fffffa8060cab000 ] is __inVALID.

Sure enough, the suspect address is within the range assigned to nonpaged pool. Or perhaps we should say it is within the range assigned to what is supposedly nonpaged pool. But this simply raises another question: If it’s within nonpaged pool, how could it be invalid—which is to say, paged out?

Reclaiming kernel address space

The answer to that can’t be found by poking around in the debugger. (Although as we will see, once the situation is understood, the debugger can give us corroborating evidence.) It has to do with some fairly sweeping changes that were made to the Windows memory manager, starting back in Windows Vista.

Prior to Vista, once kernel address space + RAM was allocated to nonpaged pool, it was there to stay. Nonpaged pool could be allocated and then freed, but when it was freed it was simply freed back to the pool. The virtual addresses were still mapped and memory references to them would still work. And the !pool command would report the region as “free”, but not “not valid.”

As of Vista and later, though, certain types of allocations in nonpaged pool can be not just “freed,” as in ExFreePoolWithTag, but released completely from both virtual and physical memory. The call that does it is MiReturnSystemVa. (This is for internal use only, folks: do not try calling this at home!) ExFreePoolWithTag will call this routine in certain cases. (It’s called in other paths too, paths that have nothing to do with pool.) The pages’ virtual addresses are then unmapped, and the physical pages that were underneath them are released to the system-wide free page list. It is analogous to the Windows API VirtualFree.

This by the way is partly related to “dynamic kernel virtual address space,” which was introduced with Vista on x86. That doesn’t apply to Windows on x64, which still uses statically assigned ranges of kernel space to the various kernel-mode components (such as paged and nonpaged pool). But Windows on x64 nevertheless has the ability to release allocations of kernel space—including nonpaged pool—via MiReturnSystemVa.

One reason is that even though kernel address space is unlikely to be a scarce resource on x64 (especially since Windows 8.1 expanded it to 128 TiB), physical memory can still be.

We’re still researching the exact circumstances under which MiReturnSystemVa gets called, for nonpaged pool and otherwise. There are strong indications that, for nonpaged pool, it is done only for special pool allocations. For now, though, this at least explains how it is possible that nonpaged pool can show up as “invalid.”

It’s a pagefile offset. No, it’s a timestamp. No, it’s just a counter…

So what about that very large pagefile offset?

It turns out it’s not really a pagefile offset. It is a global translation buffer flush timestamp.

When a page unmapped by MiReturnSystemVa is later used for something else, it will no doubt be associated with a different physical page than before. That means that the PTE has to be updated with the new physical page number. At that time, or earlier, the translation buffer entry for the virtual address (if any) has to be flushed.

In modern processors this can be done for specific virtual addresses, but in the old days the only option was to flush the entire translation buffer. This is an expensive thing to do, so Windows doesn’t want to do it too often, most certainly not when it’s unnecessary… for example, when it’s already been done. So, Microsoft implemented the following algorithm:

There is a global longword counter called KiTbFlushTimeStamp. It does not really record a “time stamp”, but rather a simple count of global flushes of the translation buffer. Whenever the OS performs a global flush of the TB, it increments this counter.

Whenever the OS unmaps a page of kernel space (as it does in MiReturnSystemVa), it stores the value of this counter in the corresponding page table entry. (On 32-bit non-PAE systems, only the low 20 bits of the counter are stored, in the PTE’s PFN field. On 64-bit systems, the entire 32 bits of the counter are stored in the PTE’s high 32 bits.)

When a previously “returned” virtual page is to be used again, the memory management code examines the corresponding page table entry. If the value of KiTbFlushTimeStamp is the same as the value found in the PTE, then the OS assumes that the translation buffer has not been flushed since the page was unmapped—and performs the flush (and increments KiTbFlushTimeStamp).
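The three steps above can be sketched as a toy model. This is emphatically not the kernel's code: the class and method names are hypothetical, only the counter corresponds to the real KiTbFlushTimeStamp, and the model keeps the stamp in the high 32 bits of a 64-bit "PTE" as described for 64-bit systems.

```python
class TbFlushModel:
    """Toy model of the flush-timestamp scheme; all names are hypothetical."""

    def __init__(self):
        self.flush_time_stamp = 0  # stands in for KiTbFlushTimeStamp
        self.ptes = {}             # virtual page -> invalid PTE contents
        self.global_flushes = 0

    def flush_entire_tb(self):
        # A global TB flush bumps the 32-bit counter.
        self.global_flushes += 1
        self.flush_time_stamp = (self.flush_time_stamp + 1) & 0xFFFFFFFF

    def unmap(self, vpage):
        # Store the current counter in the high 32 bits; valid bit stays clear.
        self.ptes[vpage] = self.flush_time_stamp << 32

    def reuse(self, vpage):
        # Flush only if no global flush has happened since the page was unmapped.
        stamp = self.ptes.pop(vpage) >> 32
        if stamp == self.flush_time_stamp:
            self.flush_entire_tb()
            return True
        return False


model = TbFlushModel()
model.unmap(0x1000)
model.unmap(0x2000)
first = model.reuse(0x1000)   # stamps match, so a global flush is performed
second = model.reuse(0x2000)  # counter has moved on, so no second flush
print(first, second, model.global_flushes)  # True False 1
```

The point of the design shows up in the second reuse: two pages unmapped in the same "epoch" cost only one global flush between them.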

This was actually the subject of a bug that was described in an MSKB article, all the way back in Windows NT 4!

Since Windows added support for selectively flushing TB entries, more, much more, has been added to these mechanisms. But even as of Windows 8.1 on x64, Windows still stores the value of KiTbFlushTimeStamp in a PTE when it unmaps a kernel-space virtual address. There’s even a data type for PTEs being used this way: 

6: kd> dt _MMPTE_TIMESTAMP
nt!_MMPTE_TIMESTAMP
   +0x000 MustBeZero       : Pos 0, 1 Bit
   +0x000 PageFileLow      : Pos 1, 4 Bits
   +0x000 Protection       : Pos 5, 5 Bits
   +0x000 Prototype        : Pos 10, 1 Bit
   +0x000 Transition       : Pos 11, 1 Bit
   +0x000 Reserved         : Pos 12, 20 Bits
   +0x000 GlobalTimeStamp  : Pos 32, 32 Bits

It is just one of several possible formats of a PTE:

6: kd> dt _MMPTE -r1
nt!_MMPTE
   +0x000 u                : 
      +0x000 Long             : Uint8B
      +0x000 VolatileLong     : Uint8B
      +0x000 Hard             : _MMPTE_HARDWARE
      +0x000 Flush            : _HARDWARE_PTE
      +0x000 Proto            : _MMPTE_PROTOTYPE
      +0x000 Soft             : _MMPTE_SOFTWARE
      +0x000 TimeStamp        : _MMPTE_TIMESTAMP
      +0x000 Trans            : _MMPTE_TRANSITION
      +0x000 Subsect          : _MMPTE_SUBSECTION
      +0x000 List             : _MMPTE_LIST

If the valid bit is set, then the PTE must conform to the hardware’s expectations; it will be in the _MMPTE_HARDWARE format. But if the valid bit is clear, the CPU doesn’t care what’s in the other 31 or 63 bits. So Windows is free to use those bits as needed.

A PTE representing a page that’s in the pagefile is in the _MMPTE_SOFTWARE format:

6: kd> dt _MMPTE_SOFTWARE
nt!_MMPTE_SOFTWARE
   +0x000 Valid            : Pos 0, 1 Bit
   +0x000 PageFileLow      : Pos 1, 4 Bits
   +0x000 Protection       : Pos 5, 5 Bits
   +0x000 Prototype        : Pos 10, 1 Bit
   +0x000 Transition       : Pos 11, 1 Bit
   +0x000 UsedPageTableEntries : Pos 12, 10 Bits
   +0x000 InStore          : Pos 22, 1 Bit
   +0x000 Reserved         : Pos 23, 9 Bits
   +0x000 PageFileHigh     : Pos 32, 32 Bits

This is the form the debugger thinks our PTE is in. There are two things we can check to see that it’s really in _MMPTE_TIMESTAMP format. The definitive test is to look at the Protection field: If it’s zero, it’s an _MMPTE_TIMESTAMP PTE, meaning that the corresponding virtual address has been freed. Sure enough, the debugger in this case showed “Protect: 0”. (The !pte extension should probably be updated to annotate such PTEs correctly.)
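To make that test concrete, here is a small sketch that pulls the fields out of the raw PTE value using the bit positions from the two dt listings above. The helper names and the "timestamp"/"software" labels are mine, and the classification rule simply encodes the Protection-is-zero test described in this article.

```python
def bits(value, pos, width):
    """Extract `width` bits of `value` starting at bit position `pos`."""
    return (value >> pos) & ((1 << width) - 1)


def decode_invalid_pte(raw):
    """Decode the fields shared by _MMPTE_SOFTWARE and _MMPTE_TIMESTAMP."""
    fields = {
        "Valid": bits(raw, 0, 1),
        "PageFileLow": bits(raw, 1, 4),
        "Protection": bits(raw, 5, 5),
        "Prototype": bits(raw, 10, 1),
        "Transition": bits(raw, 11, 1),
        "High32": bits(raw, 32, 32),  # PageFileHigh or GlobalTimeStamp
    }
    # The article's test: invalid PTE with Protection == 0 is a timestamp PTE.
    if fields["Valid"] == 0 and fields["Protection"] == 0:
        fields["Format"] = "timestamp"
    else:
        fields["Format"] = "software"
    return fields


pte = decode_invalid_pte(0x0082A73400000000)
print(pte["Format"], hex(pte["High32"]))
```

For our PTE every low-order field is zero and the high 32 bits are 0x82a734, which is the value !pte presented as a pagefile offset.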

The second thing we can look at is not as definitive, but in my opinion is more interesting. If you check the value of the counter you’ll probably find it’s fairly close to that found in the PTE… or in this case, exactly the same:

6: kd> dd KiTbFlushTimeStamp L 1
fffff800`01e0fb80  0082a734

Since we don’t believe in four-billion-to-one coincidences, this confirms pretty strongly that the “nonpaged pool” had been unmapped. There’s simply no other mechanism that would copy KiTbFlushTimeStamp to a PTE.