D4D4

A co-worker of mine was looking at some disassembled ARM code the other day, and discovered something weird. Lots of d4d4 instructions, scattered about. LLVM’s objdump says this is a relative branch to -0x58. The weird part is that they were always unreachable.

Experiments

Here’s an example in a minimal reproducer I wrote:

00020100 <one>:
   20100: 4770               bx      lr
   20102: d4d4               bmi     0x200ae <__dso_handle+0x100ae> @ imm = #-0x58

That bx lr right before the d4d4 branches to the link register. In other words, it returns. Here’s the C code that goes with this function:

#include "mod.h"

static void one(void)
{
    return;
}

int main(void)
{
    void *fn;

    fn = one;
    use_ptr(fn);
    return 0;
}

The use_ptr function is declared in mod.h (defined in mod.c), and what it does with the pointer is not important. You can see that there’s a function called one, and that function just returns. Thus bx lr being the only thing. But why is there an extra d4d4 after it in the disassembled object code? My first thought was that it was there for alignment. Of course, Thumb instructions are 16 bits and maybe functions need to be 32-bit aligned. Weird that it would use a branch to a real relative address instead of a nop or something that would cause a fault, but let’s try expanding the experiment.

code:

static void one(void)
{
    return;
}

static void two(void)
{
    return;
}

int main(void)
{
    void *fn;

    fn = one;
    use_ptr(fn);
    fn = two;
    use_ptr(fn);
    return 0;
}

And the disassembly:

000200f4 <main>:
   200f4: b580               push    {r7, lr}
   200f6: 466f               mov     r7, sp
   200f8: 4803               ldr     r0, [pc, #0xc]          @ 0x20108 <main+0x14>
   200fa: f000 f80b          bl      0x20114 <use_ptr>       @ imm = #0x16
   200fe: 4803               ldr     r0, [pc, #0xc]          @ 0x2010c <main+0x18>
   20100: f000 f808          bl      0x20114 <use_ptr>       @ imm = #0x10
   20104: 2000               movs    r0, #0x0
   20106: bd80               pop     {r7, pc}
   20108: 11 01 02 00        .word   0x00020111
   2010c: 13 01 02 00        .word   0x00020113

00020110 <one>:
   20110: 4770               bx      lr

00020112 <two>:
   20112: 4770               bx      lr

00020114 <use_ptr>:
   20114: 4770               bx      lr

Not only does the compiler not feel the need to align functions to 32-bit boundaries, but adding a second single-instruction function actually got rid of the d4d4 entirely. And note how use_ptr (I made it just return also) doesn’t have a d4d4 in it. Curious.

What happens if I go to 3 functions?

code:

00020124 <one>:
   20124: 4770               bx      lr

00020126 <two>:
   20126: 4770               bx      lr

00020128 <three>:
   20128: 4770               bx      lr
   2012a: d4d4               bmi     0x200d6 <__dso_handle+0x100d6> @ imm = #-0x58

0002012c <use_ptr>:
   2012c: 4770               bx      lr

It’s back! But now only once. It seems like it aligns the end of main.o so that mod.o can start on a 32-bit boundary. Ok, maybe this makes sense. So let’s take a look at the compiler’s output directly. Here’s main.o:

00000028 <one>:
      28: 4770               bx      lr

0000002a <two>:
      2a: 4770               bx      lr

0000002c <three>:
      2c: 4770               bx      lr

Oh, the compiler didn’t put that in at all, it must have been the linker! So lld is rounding the end of an object file up to 32-bit alignment using this d4d4 instruction. If that’s true, then linking mod.o before main.o ought to move where the d4d4 is.

000200e4 <use_ptr>:
   200e4: 4770               bx      lr
   200e6: d4d4               bmi     0x20092 <__dso_handle+0x10092> @ imm = #-0x58

000200e8 <main>:
   200e8: b5d0               push    {r4, r6, r7, lr}
   200ea: af02               add     r7, sp, #0x8
   200ec: 4804               ldr     r0, [pc, #0x10]         @ 0x20100 <main+0x18>
   200ee: 4c05               ldr     r4, [pc, #0x14]         @ 0x20104 <main+0x1c>
   200f0: 47a0               blx     r4
   200f2: 4805               ldr     r0, [pc, #0x14]         @ 0x20108 <main+0x20>
   200f4: 47a0               blx     r4
   200f6: 4805               ldr     r0, [pc, #0x14]         @ 0x2010c <main+0x24>
   200f8: 47a0               blx     r4
   200fa: 2000               movs    r0, #0x0
   200fc: bdd0               pop     {r4, r6, r7, pc}
   200fe: bf00               nop
   20100: 11 01 02 00        .word   0x00020111
   20104: e5 00 02 00        .word   0x000200e5
   20108: 13 01 02 00        .word   0x00020113
   2010c: 15 01 02 00        .word   0x00020115

00020110 <one>:
   20110: 4770               bx      lr

00020112 <two>:
   20112: 4770               bx      lr

00020114 <three>:
   20114: 4770               bx      lr

It did! The extra instruction got moved up to align the beginning of the object code that contains main!

One more check: if I use the GNU linker, does it do the same thing?

00008000 <use_ptr>:
    8000: 4770               bx      lr
    8002: 0000               movs    r0, r0

No! GNU ld (2.44) inserts zeroes to align files. It’s not nop, though it may as well be. ARMv7-M actually has a nop instruction: bf00 or f3af8000 depending on encoding. You can see it in main, inserted by the compiler between the end of the function and its constants.

Conclusion

So now we know. LLD is inserting the weird d4d4 instructions, and it’s doing it to align across object file boundaries. Why did they pick such a weird constant, though? GNU ld went with zeroes, which seems benign.

Research

A little bit of checking out the code later, and we find this in ARM.cpp:

trapInstr = {0xd4, 0xd4, 0xd4, 0xd4};

That was actually way easier than I was expecting. The git blame path meanders a little before getting to this commit where Rui Ueyama explains:

Add trap instructions for ARM and MIPS.

This patch fills holes in executable sections with 0xd4 (ARM) or
0xef (MIPS). These trap instructions were suggested by Theo de Raadt.

llvm-svn: 306322

This appears to have been precipitated by this message, also from Rui Ueyama, to the llvm-bugs mailing list. In it, the question is asked whether LLD should use a trap instruction for ARM/AArch64 like x86 and x86-64’s 0xCC. I didn’t find any replies, though.

I couldn’t find any messages on the mailing list about this from Theo de Raadt, so I guess we just have to live with Ueyama’s testimony that he thought 0xd4 would be a good byte to repeat as a trap instruction. But a trap instruction is supposed to halt the processor, so what’s with the disassembler saying it’s a branch?

RTFM

Let’s take a look at the ARMv7-M Architecture Reference Manual, the ARM. First of all, it says that we’re using the Thumb instruction set, and most instructions are 16 bits. Any instructions that begin with 0b11101, 0b11110, or 0b11111 are the beginnings of 32-bit instructions, but all the rest are 16 bits. Since d4 is 0b11010100, we can safely assume that the instruction decoder will always treat it like a 16-bit instruction.

Next up, we have table A5-1, showing how 16-bit instructions are encoded. The first 6 bits are the opcode, followed by 10 bits of other stuff. As we established earlier, the bits we’re looking for are 0b110101. That matches conditional branch and supervisor call’s 0b1101xx. Well, so far it’s still looking like a branch..

Page A5-134 leads us to the statement that the encoding here is 0b1101 followed by a 4 bit opcode. 0xd is the 0b1101 and the next 4 bits are the number 4: 0b0100. This doesn’t match UDF, which seems like a reasonable choice for this purpose. Instead, any opcode not matching 111x is a conditional branch, explained under B on page A7-207, according to table A5-8.

This page tells us that the B instruction has several encodings, but only one that begins with a 0xd: T1.

../../_images/branch-encoding.png

With 0xd4d4, cond in this table is 0x4 or 0b0100. That doesn’t match UDF (again) or SVC. InITBlock() just checks if we’re within 4 instructions of an IT. IT is a weird instruction, but not important for our purposes. What’s important is that imm32 is now those least significant 8 bits, sign extended. That’s 0xffffffd4, or -44. That’s -0x2c in hex, not the -0x58 from objdump. However, immediate offsets count half-words, not bytes. So the offset is -88, or -0x58.

The other field, cond, is 0b0100, which means some bits in the condition registers have to be set in order for the branch to be taken. Which bits don’t particularly matter, since this code is supposed to be unreachable, but for completeness, it’s (ASPR.C == '1') && (ASPR.Z == '0').

More Conclusions

So it looks to me like objdump is behaving correctly, and Theo’s suggestion was a bad one, at least for Thumb. Instead of a trap, it’s a relative jump. In the tiny program I made above, it jumps right out of .text, but in a more substantial program, it will usually land somewhere in code. It should be unreachable, but the whole point of a trap instruction is to halt the processor, not send it off on a random walk through your codebase.

This feels like a bug in LLD to me. The linker’s inserting something called “trapInstr” but it’s nothing like a trap. In fact, it’s a (conditional) jump to a constant relative address.

So after a fun night of spelunking through source code, commit logs, and manuals, I feel like I’ve learned something. I guess maybe it’s time to file a bug report.

Previous: Magic File Names