Debugging and profiling libtool binaries

libtool takes care of a lot of the details of building binaries and shared libraries for you, but one of the side effects of this is that until you install your binaries the binaries in your build directory are actually shell script wrappers. Running gdb or profiling tools on the shell script won’t work:

# gdb ./mybinary 
GNU gdb (GDB) Fedora 7.5.1-42.fc18
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
"/home/will/mybinary": not in executable format: File format not recognized
(gdb)

However, you can get libtool to help you:

# libtool --mode=execute gdb ./mybinary
GNU gdb (GDB) Fedora 7.5.1-42.fc18
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /home/will/.libs/lt-mybinary...done.
(gdb)

This trick will also work for other tools like gprof and perf.

A simple power law pseudo-random number generator

Today I needed some non-uniform random numbers. Most pseudo random number generators such as the rand(3) function return numbers that are drawn from a uniform distribution, so for example with rand(3) you have an approximately equal chance of getting and number between 0 and RAND_MAX. However in many real world situations some numbers occur much more frequently than others. For example, on average most programs make many more small memory allocations than large ones. Graphing the distribution of allocation sizes you get something that looks like a power law distribution. I don’t know if allocations always follow this type of pattern in all applications or whether it is in fact a true power law distribution, but it seems like a reasonable approximation.

So how can we generate random numbers that are drawn from a power law distribution? The easiest way is to use the standard uniform random number generator and transform the results. Below is some code in C that does that. It looks quite complex but gcc handles the constant propagation and the generated code is quite simple with only a single call to drand48.

#include <math.h>
#include <stdlib.h>

#define MIN_ALLOCATION_SIZE	4
#define MAX_ALLOCATION_SIZE	8192

size_t
get_block_size (void)
{
  /* Inverse square.  */
  double exponent = -2;
  /* Minimum value of distribution.  */
  double dist_min = MIN_ALLOCATION_SIZE;
  /* Maximum value of distribution.  */
  double dist_max = MAX_ALLOCATION_SIZE;

  double min_pow = pow (dist_min, exponent + 1);
  double max_pow = pow (dist_max, exponent + 1);

  return (size_t) pow((max_pow - min_pow) * drand48 ()
                      + min_pow, 1 / (exponent + 1));
}

How not to do indirect calls on ARM

The Thumb2 instruction encoding can reduce your code size by up to 30%, so it is nice if you’re writing assembler code to use the unified syntax which will allow your code to be encoded as either ARM or Thumb2 seamlessly. Or almost seamlessly – even though the unified syntax is the same for ARM and Thumb2 encodings, there are still some things you have to bear in mind when writing code for Thumb2.

For example, the following code implements an indirect call:

       mov     lr, pc         @ Save return address in lr
       ldr     pc, [sp], #8   @ Load function address

This code will not work correctly on Thumb2. Switching mode from ARM to Thumb happens when an address with the bottom bit set is jumped to, this can be by a branch, an ldr or a mov. However the pc does not store the Thumb mode bit, so moving from the pc as we do in the first instruction above will not capture the information that this is a return to a Thumb function. A better sequence would be:

       ldr     lr, [sp], #8   @ Load function address
       blx     lr             @ Call function

The blx instruction is only available on ARMv5 and above however, so if you need portability to historical ARM cores you may need to do a bit more work.

Debugging check unit tests with gdb

check is a unit testing framework for C code that provides a number of neat features, including running tests after a fork in order to catch any fatal signals that may be caused by a broken test. Unfortunately this can have the side effect of making debugging a failing test with gdb more painful than it needs to be.

There’s a simple way to get around this however – set the CK_FORK environment variable to disable the fork before running tests:

$ gdb my_check_test 
GNU gdb (GDB) Fedora (7.5.1-38.fc18)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /tmp/my_check_test...done.
(gdb) set environment CK_FORK=no
(gdb) run

Simple lock-free ring buffer implementation

Recently I was looking for some code to pass buffers between two threads in a memory allocator benchmark. The code had to be low overhead enough to ensure that it wouldn’t obscure the performance characteristics of the code being benchmarked, so I was looking for a data structure that was lock-free and reasonably simple and fast. It also had to avoid calling the allocator at any point.

The code below is what I came up with – it’s a blocking, single producer, single consumer lock-free ring buffer FIFO queue.

/*
 * Copyright (c) 2013, Linaro Limited
 *
 * Licensed under the 3 clause BSD license.
 */

#ifndef __LFQ_H__
#define __LFQ_H__

#include 
#include 

struct lfq {
  volatile unsigned int producer;
  volatile unsigned int consumer;
  size_t queue_size;
  void **data;
};

/* Initialize a struct lfq with a size and array of memory. */
static inline void lfq_init(struct lfq *q, size_t queue_size,
			    void **data)
{
  q->queue_size = queue_size;
  q->data = data;
  q->producer = 0;
  q->consumer = 0;
}

/* Internal function to backoff on contention. */
static inline void __lfq_backoff(void)
{
  sched_yield();
}

/* Dequeue an item from the ring. Spin waiting if it is empty. */
static inline void *lfq_dequeue(struct lfq *q)
{
  void *ret;
  while (q->producer == q->consumer)
    __lfq_backoff();
  ret = q->data[q->consumer % q->queue_size];
  q->consumer++;
  return ret;
}

/* Enqueue an item onto the ring. Spin waiting if it is full. */
static inline void lfq_queue(struct lfq *q, void *item)
{
  while (q->producer - q->consumer >= q->queue_size)
    __lfq_backoff();
  q->data[q->producer % q->queue_size] = item;
  __sync_synchronize();
  q->producer++;
}

/* Test is the queue is empty. */
static inline bool lfq_empty(struct lfq *q)
{
  return q->producer == q->consumer;
}
#endif

Using GNU indirect functions

GNU indirect functions are an extension of ELF that allows you to make a decision about which implementation of a function to use at link time. This allows you to choose the fastest code for a particular processor, which is very useful if you’re shipping your code as binaries. It involves an ELF symbol type (STT_GNU_IFUNC) and a new dynamic relocation (R_*_IRELATIVE) which have been added to the GNU ld and gold linkers and the glibc dynamic linker. Support exists for arm, powerpc, s390, sparc, i386 and x86_64 with aarch64 support coming soon.

Indirect functions aren’t used very widely at the moment, as far as I can tell only glibc uses them at the moment for things like string functions, but they can be used by any other application or library. I’ll try and explain with a short example how they can be used.

#include 
#include  /* Needed for HWCAP definitions. */

static void func1_neon(void)
{
	printf("NEON implementationn");
}

static void func1_vfp(void)
{
	printf("VFP implementationn");
}

static void func1_arm(void)
{
	printf("ARM implementationn");
}

/* This line makes func1 a GNU indirect function. */
asm (".type func1, %gnu_indirect_function");

void *func1(unsigned long int hwcap)
{
	if (hwcap & HWCAP_ARM_NEON)
		return &func1_neon;
	else if (hwcap & HWCAP_ARM_VFP)
		return &func1_vfp;
	else
		return &func1_arm;
}

This code defines a resolver function, func1, which returns a pointer to a function implementation based on the value of hwcap that is passed to it. The value of hwcap is architecture specific and on some architectures, like x86_64, it is not passed at all, and you have to determine hardware capabilities by hand.

int main(void)
{
	func1();
	return 0;
}

This main function must be defined in a separate file to prevent the compiler from getting the wrong type for func1. Somewhat confusingly the function called func1 from the caller’s perspective is of the type of its implementations rather than of the type of the indirect function. All indirect functions have the same prototype – they take an optional hwcap argument and return a pointer to a function.

Build these two pieces of code and link them together:


# gcc -O2 ifunc.c -c
# gcc -O2 main.c -c
# gcc main.o ifunc.o -o ifunc
# ./ifunc
NEON implementation
#

This is the result I get on my ARM Chromebook, your result will vary depending on which hardware you actually have. So why is this better than the alternatives? You could dispatch to the optimized routine dynamically yourself, but that adds an extra comparison and indirect call to every call to the function and you will also have to find the hardware capabilities yourself, which can be tricky. Alternatively you could build optimized libraries for every bit of hardware you support, but this takes up a lot of disk space and takes more significant infrastructure to dynamically open the correct library.

Indirect functions are not without their downsides however. They are currently supported only on Linux with the GNU toolchain and only on a subset of Linux supported architectures, and in some cases the tools are not very well tested yet – for example, at the moment this will only work correctly on ARM with the as yet unreleased glibc 2.18 due to a bug in the dynamic linker – but they are a powerful tool that deserves to be more widely used.

The usefulness of gcc -Wextra

gcc is really great tool for finding errors in your code, but it helps to pass it the right flags to make sure it doesn’t forget to tell you about them. I would always recommend enabling -Wall on whatever code you build, and if possible -Werror. However, sometimes this is not enough, for example, the code below has a subtle bug:

int func(char *buf)
{
    int i;

    for (i = 0; i < 16; i++)
    {
        if (buf[i] != 0xff)
            return 1;
    }

    return 0;
}

If we compile this with gcc:

# gcc -O2 -S -Wall func.c

The assembly output is given below.

        .text
        .globl  func
        .type   func, @function
func:
        movl    $1, %eax
        ret

It looks like all our code has disappeared! gcc has taken advantage of the fact the type information in the code to decide that the loop can be elided completely and optimized it out. This is because c is a pointer to signed char (on x86_64, other architectures define char as unsigned and will not optimize this out) and 0xff is an integer constant that will not fit in a signed char, hence the comparison will never be true and the loop has no effect.

This behaviour is perfectly fine with respect to the C standard, however I do think gcc should warn when it is performing this type of transformation. If we build with clang (the LLVM C compiler), we get:

# clang -O2 -S func.c
func.c:8:14: warning: comparison of constant 255 with expression of type 'char' is always true [-Wtautological-constant-out-of-range-compare]
                if (buf[i] != 0xff)
                    ~~~~~~ ^  ~~~~
1 warning generated.

Which seems much more appropriate than silence. gcc can be fixed to generate an error with -Wextra or -Wtype-limits:

gcc -O2 -S -Wall -Wextra func.c
func.c: In function ‘func’:
func.c:8:3: warning: comparison is always true due to limited range of data type [-Wtype-limits]

-Wextra can be problematic however as it contains a number of warnings that are quite noisy and uninteresting, for example -Wunused-parameter, so you may have to experiment a little to apply it to your codebase.