This article is about some very low-level stuff. If you are interested in loading 32-bit libraries in 64-bit applications, then go away - this is not the kind of stuff you are looking for.
While working on improving Darling, I started looking for a way of executing 32-bit macOS software bootstrapped in a 64-bit Linux process. This approach has certain advantages, but also brings some complications. As of writing this article, it is still an emerging technology developed in a branch.
The primary property of the 32-bit code is that it's fully self-contained. It makes system calls on its own, and only much later will it need to interface with 64-bit ELF Linux libraries.
How I Started
In the beginning, I thought I could execute 32-bit code directly, because every 64-bit process can make 32-bit system calls on Linux. The only complication was allocating a new stack at an address reachable from 32-bit code (= somewhere in the first 4 GB of process memory).
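To make the stack requirement concrete, here is a minimal C sketch of one way to get memory below the 4 GB mark: pass low address hints to mmap() and reject any result that lands too high. The helper names alloc_below_4g and stack_top32, the hint values and the step size are made up for illustration, and this is not necessarily how Darling does it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper: try to map `len` bytes entirely below 4 GB by
 * passing address hints to mmap(). Without MAP_FIXED the kernel is free
 * to ignore a hint, so every result is checked before it is accepted. */
static void *alloc_below_4g(size_t len)
{
    const uint64_t limit = UINT64_C(1) << 32;
    assert(len > 0 && len < limit);

    /* The starting hint and the step are arbitrary illustration values. */
    for (uint64_t hint = 0x10000000; hint + len < limit; hint += 0x10000000) {
        void *p = mmap((void *)(uintptr_t)hint, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            continue;
        if ((uint64_t)(uintptr_t)p + len < limit)
            return p;        /* reachable from 32-bit code */
        munmap(p, len);      /* hint was ignored, try the next one */
    }
    return NULL;
}

/* Stacks grow down on x86, so the initial stack pointer is the top of the
 * mapping. Truncating to 32 bits is safe because the mapping ends below 4 GB. */
static uint32_t stack_top32(void *base, size_t len)
{
    return (uint32_t)((uintptr_t)base + len);
}
```

A caller would do something like `void *stack = alloc_below_4g(64 * 1024);` and hand `stack_top32(stack, 64 * 1024)` to the 32-bit side as its initial stack pointer.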
But I quickly learned that some instructions behave differently when executed as 64-bit. For example, the following two instructions assemble to the same opcode when built as 32-bit and 64-bit code respectively, i.e. the 64-bit form carries no prefix such as the 0x48 commonly seen on other 64-bit instructions.
    pushl %ebx
    pushq %rbx
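The following C sketch spells out the bytes involved, written out by hand rather than produced by an assembler. It contrasts the push, which is the single byte 0x53 in both modes, with a register-to-register mov, where the 64-bit form differs only by the leading 0x48 prefix. The array names and the same_bytes helper are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Machine code for each instruction, one array per mode. */
static const unsigned char push_ebx_32[] = { 0x53 };              /* pushl %ebx      (32-bit) */
static const unsigned char push_rbx_64[] = { 0x53 };              /* pushq %rbx      (64-bit) */
static const unsigned char mov_32[]      = { 0x89, 0xd8 };        /* movl %ebx, %eax (32-bit) */
static const unsigned char mov_64[]      = { 0x48, 0x89, 0xd8 };  /* movq %rbx, %rax (64-bit) */

/* True when both byte sequences are identical in length and content. */
static int same_bytes(const unsigned char *a, size_t alen,
                      const unsigned char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}

/* True when `b` is exactly `a` with a single 0x48 prefix byte in front. */
static int differs_only_by_0x48(const unsigned char *a, size_t alen,
                                const unsigned char *b, size_t blen)
{
    return blen == alen + 1 && b[0] == 0x48 && memcmp(a, b + 1, alen) == 0;
}
```

Since the byte 0x53 looks the same in both modes, the CPU mode alone decides whether it pushes 4 or 8 bytes.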
This means the CPU must somehow be told to run in 32-bit mode.
How I Made it Work
The Linux kernel maintains various segment descriptors in its GDT. Even though Linux uses paging (and not segmentation) for memory management, a few segments must still exist for things to work.
The trick is to jump into the 32-bit code and switch the segment selector at the same time. For code, the numbers to remember are 0x23 for 32-bit code and 0x33 for 64-bit code.
    // This code assumes %eax contains the 32-bit code's address in memory.
    subq $8, %rsp
    movl $0x23, 4(%rsp)
    movl %eax, (%rsp)
    lret    // long return will pop the address and segment selector
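For readers who prefer to see the layout spelled out, here is a small C sketch of the 8 bytes the snippet above places on the stack. It assumes lret runs with a 32-bit operand size, which is what the 8-byte frame implies. The struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What the far return pops, lowest address first. */
struct far_ret_frame {
    uint32_t eip;  /* at (%rsp):  target address, stored by movl %eax, (%rsp)  */
    uint32_t cs;   /* at 4(%rsp): code selector, stored by movl $0x23, 4(%rsp) */
};

/* The frame must match the 8 bytes reserved by subq $8, %rsp. */
static_assert(sizeof(struct far_ret_frame) == 8, "frame is 8 bytes");
static_assert(offsetof(struct far_ret_frame, cs) == 4, "selector sits above the address");

/* Build the frame for a jump into 32-bit code at `target`. */
static struct far_ret_frame make_frame(uint32_t target)
{
    struct far_ret_frame f = { .eip = target, .cs = 0x23 };
    return f;
}
```

Only the low 16 bits of the selector slot matter to the CPU; the upper half of that 4-byte slot is padding.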
You'll quickly notice your 32-bit code will now execute properly, except it will segfault on any memory access. To make things even more confusing, Linux will report 0 as the faulting address.
The explanation is simple. When the CPU runs in 64-bit mode, the DS register is effectively ignored: its segment base is treated as 0 and the register is not considered for memory access, so a 64-bit process on Linux normally runs with DS set to 0. This is, however, not the case in 32-bit mode, where a null DS makes every data access fault. Therefore, we need to load a valid selector into DS. As pushing and popping DS cannot be encoded in 64-bit mode, the following instructions need to be executed after switching into 32-bit mode.
    push $0x2b
    pop %ds
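The selector values used so far (0x23, 0x33 and 0x2b) are not arbitrary: each one packs a descriptor table index, a table indicator bit and a requested privilege level. The C sketch below decodes them; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields packed into an x86 segment selector. */
struct selector_fields {
    unsigned index;  /* descriptor table slot (bits 15..3)        */
    unsigned ti;     /* table indicator: 0 = GDT, 1 = LDT (bit 2) */
    unsigned rpl;    /* requested privilege level (bits 1..0)     */
};

static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f = {
        .index = (unsigned)sel >> 3,
        .ti    = ((unsigned)sel >> 2) & 1u,
        .rpl   = (unsigned)sel & 3u,
    };
    return f;
}
```

Decoded this way, 0x23 is GDT entry 4, 0x2b is GDT entry 5 and 0x33 is GDT entry 6, all with privilege level 3 (user mode).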
Now your code should start running... until it starts making system calls.
System Call Quirks
SYSENTER
If your code makes (32-bit) system calls using SYSENTER, you will quickly learn that after the system call is executed, the kernel jumps into the middle of nowhere, even if you are certain your code works fine when run in a 32-bit process.
The reason lies in the main deficiency of SYSENTER: it does not save the original EIP, so the kernel does not know where to return once the call is finished. This is normally not a problem, because the kernel jumps to a helper function inside the vDSO, which picks up the original EIP from the stack (where your code has stored it).
This will not work now, though, as your process only has a 64-bit vDSO. The solution is to perform system calls by executing the traditional int $0x80.
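If you want to check whether int $0x80 is usable from a 64-bit process on a given machine, a probe along these lines may help. It is only a sketch: it calls getpid through the i386 system call table (number 20) and compares the result with the regular getpid(). The probe runs in a forked child because the kernel may kill the caller when 32-bit emulation is unavailable. The function names and return codes are invented for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define I386_NR_GETPID 20  /* getpid in the i386 system call table */

#if defined(__linux__) && defined(__x86_64__)
/* Issue a system call without arguments through the legacy 32-bit gate.
 * r8-r11 are listed as clobbered because some kernels do not preserve them. */
static long int80_syscall0(long nr)
{
    long ret = nr;
    __asm__ volatile("int $0x80"
                     : "+a"(ret)
                     :
                     : "memory", "cc", "r8", "r9", "r10", "r11");
    return ret;
}
#endif

/* Returns 1 if int $0x80 works, 0 if it is unavailable or blocked,
 * 2 if it returned a wrong PID, -1 where the probe does not apply. */
static int probe_int80(void)
{
#if defined(__linux__) && defined(__x86_64__)
    pid_t child = fork();
    if (child < 0)
        return 0;
    if (child == 0) {
        long pid = int80_syscall0(I386_NR_GETPID);
        if (pid < 0)
            _exit(3);                          /* call was rejected */
        _exit(pid == (long)getpid() ? 0 : 2);
    }

    int status;
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return 0;                              /* child was killed by a signal */
    if (WEXITSTATUS(status) == 0)
        return 1;
    return WEXITSTATUS(status) == 2 ? 2 : 0;
#else
    return -1;
#endif
}
```

Keep in mind that int $0x80 always uses the i386 numbering and register convention (number in EAX, arguments in EBX, ECX, EDX, ...), even when issued from 64-bit code.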
mmap
The problem with mmap (and mmap2) is funny. The IA32 system call wrapper takes care of extending arguments etc., but a common mmap implementation is eventually called. Unless you set MAP_FIXED, this implementation checks if the process is 32-bit, and if it is, it makes sure the mapping is created in an address range accessible to a 32-bit process.
However, since our process is 64-bit, new mappings will probably get created above the 2^32 limit. The IA32 system call wrapper will then cut off the upper 32 bits, and your code will receive an incomplete address. The strace output will look very confusing: strace will show that you have successfully created a mapping at some address, but you will probably crash when accessing that very address shortly thereafter.
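The effect of that truncation is easy to reproduce in isolation. The C sketch below uses a made-up address of the kind a 64-bit process typically gets from mmap and shows what survives once it is squeezed through a 32-bit return value. The function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* What 32-bit code sees when a 64-bit address comes back through a
 * 32-bit register: only the low 32 bits survive. */
static uint32_t as_seen_by_32bit(uint64_t addr)
{
    return (uint32_t)addr;
}

/* True when the address is still correct after the round trip. */
static int survives_truncation(uint64_t addr)
{
    return (uint64_t)as_seen_by_32bit(addr) == addr;
}
```

For a hypothetical mapping at 0x7f3a12345000, the 32-bit caller receives 0x12345000, an address where nothing was mapped by this call.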
There is a simple fix: just add MAP_32BIT to the flags for mmap/mmap2. The flag's name is unfortunately not very fitting; I would personally call it MAP_30BIT. The man page says the mapping will be created within the first 2 GB of the address space (What is 32-bit about that? This is only 31 bits!). But it gets worse. If you check the kernel source code, you will notice the kernel also places a lower limit of 0x40000000 on the mapping, which effectively confines mappings to the range between 0x40000000 and 0x80000000. That's only 30 bits (1 GB) of memory!
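A minimal C sketch of the MAP_32BIT approach might look like this. The flag only exists on x86-64 Linux, so the helper returns NULL elsewhere. The helper name map_low and the two WINDOW_ constants are my own labels for the range described above, not kernel identifiers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* The window described in the text: [0x40000000, 0x80000000). */
#define WINDOW_START UINT64_C(0x40000000)
#define WINDOW_END   UINT64_C(0x80000000)

/* Anonymous read/write mapping requested with MAP_32BIT.
 * Returns NULL on failure or where the flag is not supported. */
static void *map_low(size_t len)
{
    assert(len > 0);
#if defined(MAP_32BIT) && defined(__x86_64__)
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#else
    return NULL;
#endif
}
```

The arithmetic confirms the complaint: WINDOW_END - WINDOW_START is 0x40000000, i.e. 2^30 bytes, and every 32-bit mapping in the process has to share that single gigabyte.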
So I searched for a different solution. A nice way would be to use personality() and set the ADDR_LIMIT_32BIT or ADDR_LIMIT_3GB flag. Sadly, neither of these flags has any effect at all. The whole personality() system call is a bad joke: it only takes effect across execve() and most flags are not implemented.
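If you want to repeat the experiment, the following Linux-only C sketch shows one way to set a persona flag and read it back. It only tells you whether the kernel recorded the flag, which says nothing about whether anything acts on it. The helper name and its return codes are invented for this example, and sandboxes may reject the call outright.

```c
#include <assert.h>
#include <sys/personality.h>

/* Passing this special value returns the current persona and changes nothing. */
#define PERSONA_QUERY 0xffffffffUL

/* Try to add `flag` to the persona, read it back, then restore the old value.
 * Returns 1 if the flag was recorded, 0 if it was silently dropped,
 * -1 if the call failed (for example when blocked by a sandbox). */
static int try_persona_flag(unsigned long flag)
{
    assert(flag != 0 && flag != PERSONA_QUERY);

    int old = personality(PERSONA_QUERY);
    if (old == -1)
        return -1;
    if (personality((unsigned long)old | flag) == -1)
        return -1;

    int now = personality(PERSONA_QUERY);
    personality((unsigned long)old);  /* put things back */
    return now != -1 && ((unsigned long)now & flag) == flag;
}
```

Calling `try_persona_flag(ADDR_LIMIT_32BIT)` and then creating a mapping without MAP_32BIT is a quick way to check on your own kernel whether the flag changes where mappings end up.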
Alas, the same guy who created the very broken MAP_32BIT is also the guy standing in the way of fixing these flags.
 
Just found the post, while reading about Darling.
You might be interested to know that right around the time you found the mmap() bug, patches were posted to LKML.
This revision was accepted:
https://lore.kernel.org/linux-mm/20170306141721.9188-1-dsafonov@virtuozzo.com/
Since you're interested in running 32-bit code from a 64-bit application, you may also be interested in other things, such as how to get the vDSO working. In CRIU, ia32 applications are restored from a 64-bit CRIU binary, so the most important issues were addressed at that time. There are still outstanding uprobes/oprofile issues, though.