Sort Out Your PODs

Some time ago I was helping a friend debug a crash in an application they were working on. It worked fine in production, but when running it on a smaller test file, it kept crashing. However, the crash was always on exit – the application itself seemed to run without issue. The crash was from a free, complaining about the pointer:

free(): invalid pointer
Aborted

GDB showed the offending free was from compiler-generated code, with no user code on the stack:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000007ff7ca0aa0 in __GI_abort () at abort.c:79
#2  0x0000007ff7ced280 in __libc_message (action=action@entry=do_abort, 
    fmt=fmt@entry=0x7ff7db09e0 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x0000007ff7cf47dc in malloc_printerr (
    str=str@entry=0x7ff7dac488 "free(): invalid pointer") at malloc.c:5347
#4  0x0000007ff7cf5844 in _int_free (av=0x7ff7defa98 <main_arena>, 
    p=0x555558b980 <things+170248>, have_lock=<optimized out>) at malloc.c:4173
#5  0x0000005555550a60 in __tcf_0 ()
#6  0x0000007ff7cb67ac in __run_exit_handlers (status=0, 
    listp=0x7ff7def6b8 <__exit_funcs>, 
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at exit.c:108
#7  0x0000007ff7cb693c in __GI_exit (status=<optimized out>) at exit.c:139
#8  0x0000007ff7ca0e1c in __libc_start_main (main=0x5555550a70 <main>, argc=1, 
    argv=0x7ffffff4f8, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=<optimized out>)
    at ../csu/libc-start.c:342
#9  0x0000005555550914 in _start ()

Address Sanitizer did not show any signs of heap trampling, and agreed with the GDB analysis that the offending free was from compiler-generated code:

=================================================================
==391229==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x005583113010 in thread T0
    #0 0x7fb8bd027c in operator delete(void*, unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:172
    #1 0x5583071058 in __tcf_0 (/home/user/a.out+0x1058)
    #2 0x7fb88197a8 in __run_exit_handlers stdlib/exit.c:108
    #3 0x7fb8819938 in __GI_exit stdlib/exit.c:139
    #4 0x7fb8803e18 in __libc_start_main ../csu/libc-start.c:342
    #5 0x5583070ed0  (/home/user/a.out+0xed0)

0x005583113010 is located 593424 bytes inside of global variable 'things' defined in 'example.cpp:12:7' (0x5583082200) of size 1200000
SUMMARY: AddressSanitizer: bad-free ../../../../src/libsanitizer/asan/asan_new_delete.cpp:172 in operator delete(void*, unsigned long)
==391229==ABORTING

One interesting detail is that the offending address being freed pointed into the inside of some global array. I had initially figured that this was the result of some sort of memory trampling, replacing e.g. a single byte, changing an unrelated malloc’ed pointer to 0x559F082200 with a pointer to 0x5583082200 that happened to live inside the (fairly large, so not that big a coincidence) global array. Address Sanitizer should have caught this, however, so that’s not likely what’s happening.

Let’s investigate this mysterious __tcf_0 more. What’s happening in there? (In reality, there was more code in this function, but I’m demonstrating with a smaller targeted reproduction example. Also, in reality, this was on x86, but the only convenient Linux machine I have on-hand is ARM.)

(gdb) disassemble 
Dump of assembler code for function __tcf_0:
   0x0000005555550a18 <+0>: stp x29, x30, [sp, #-32]!
   0x0000005555550a1c <+4>: mov x29, sp
   0x0000005555550a20 <+8>: stp x19, x20, [sp, #16]
   0x0000005555550a24 <+12>:    adrp    x19, 0x5555686000 <things+1195912>
   0x0000005555550a28 <+16>:    add x19, x19, #0xfe8
   0x0000005555550a2c <+20>:    adrp    x20, 0x5555562000 <rand@got.plt>
   0x0000005555550a30 <+24>:    add x20, x20, #0x78
   0x0000005555550a34 <+28>:    sub x20, x20, #0x10
   0x0000005555550a38 <+32>:    b   0x5555550a48 <__tcf_0+48>
   0x0000005555550a3c <+36>:    sub x19, x19, #0x28
   0x0000005555550a40 <+40>:    cmp x19, x20
   0x0000005555550a44 <+44>:    b.eq    0x5555550a64 <__tcf_0+76>  // b.none
   0x0000005555550a48 <+48>:    ldur    x0, [x19, #-16]
   0x0000005555550a4c <+52>:    cmp x0, x19
   0x0000005555550a50 <+56>:    b.eq    0x5555550a3c <__tcf_0+36>  // b.none
   0x0000005555550a54 <+60>:    ldr x1, [x19]
   0x0000005555550a58 <+64>:    add x1, x1, #0x1
   0x0000005555550a5c <+68>:    bl  0x5555550880 <_ZdlPvm@plt>
=> 0x0000005555550a60 <+72>:    b   0x5555550a3c <__tcf_0+36>
   0x0000005555550a64 <+76>:    ldp x19, x20, [sp, #16]
   0x0000005555550a68 <+80>:    ldp x29, x30, [sp], #32
   0x0000005555550a6c <+84>:    ret
End of assembler dump.

What’s going on here? First x19 is circuitously loaded with the address 0x5555686FD8 and x20 is loaded with 0x5555562078. What are these?

(gdb) p/a 0x5555562078
$1 = 0x5555562078 <things>
(gdb) p/a 0x5555686FD8
$2 = 0x5555686fe8 <things+1199984>
(gdb) p 1199984/40
$3 = 29999
(gdb) p 1199984%40
$4 = 24

x20 is the beginning of the global things array, and then we subtract 16 from it. x19 is almost the last element (sizeof(Thing) == 40). Then we have a loop. Each iteration, we check to see if *(x19 - 16) != x19. If so, we operator delete(void*, unsigned long)(*(x19 - 16), *x19 + 1). Then in either case, we go to the next-earlier element in things. OK, but what is this doing?

What are these offsets in Thing? When we refer to x19, what field is that? What about x19 - 16? Thing is defined as such:

struct Thing {
    int code;
    std::string message;
};

(In reality, this was a large struct with 30-or-so fields, not just two.) x19 points to thing.message._M_local_buf. x19 - 16 points to thing.message._M_dataplus._M_p. Really, std::string looks like this:

struct std::string {
    char *ptr;
    size_t length;
    char local_buf[16];
};

This is for the short-string optimization: If the string is 15 characters or less, the string is stored inline in the structure. Only if the string is longer than 15 characters is a heap allocation done. The code we’re looking at in __tcf_0 is the destructor for things; it is looping backwards through the array calling Thing::~Thing(), which has been inlined. Thing::~Thing() is calling string::~string(). string::~string() is comparing to see if ptr != local_buf before calling roughly delete[] ptr.

OK, so that makes sense, but then why is it failing? The pointer is not pointing at local_buf, but it’s also not pointing at the heap. Where is it pointing? It turns out: To the local_buf of a different thing.message in the things array. How did that happen?

Some more careful debugging with watchpoints later reveals: The pointers are being mucked about with inside of memcpy. Somewhere in the code base, there was code using memcpy to copy some Things around. A picture comes to light: Thing was originally a POD structure, with a fixed length char array, e.g.:

struct Thing {
    int code;
    char message[128];
};

At some point, an engineer comes along, figuring an arbitrary length limit isn’t cutting it – let’s just support any length. In std::string goes. In real life, all files had messages longer than 15 characters, so it acted as a char *, which is fine to memcpy. Then one day, when testing with a smaller file, it comes across its first instance of a blank message, and the assumptions fall apart – memcpying a std::string with the short string optimization breaks it.

Actually, the memcpy was not just in the application proper. It was actually inside of qsort. The things array was being sorted using the C standard library’s qsort, which uses memcpy under the hood. It does not know any better.

The final solution: Replace the qsort call with an std::sort call, rewriting the comparator as necessary to accommodate the change. std::sort takes proper C++ semantics into account when moving elements around. And with that, the crash disappeared.

As a takeaway, make sure that when adding a non-POD member to a POD type for the first time, that no code remains that depends on it being POD. Failure to do so may cause subtle bugs not detected until much later under different circumstances, long after the offending change was introduced.

Tags: ,

Leave a Reply