Using GNU indirect functions

GNU indirect functions are an ELF extension that lets you decide which implementation of a function to use at dynamic link time. This allows you to choose the fastest code for the processor you are actually running on, which is very useful if you’re shipping your code as binaries. It involves an ELF symbol type (STT_GNU_IFUNC) and a new dynamic relocation (R_*_IRELATIVE), which have been added to the GNU ld and gold linkers and the glibc dynamic linker. Support exists for arm, powerpc, s390, sparc, i386 and x86_64, with aarch64 support coming soon.

Indirect functions aren’t used very widely yet; as far as I can tell only glibc uses them at the moment, for things like optimized string functions, but any other application or library can use them. I’ll try to explain how with a short example.

#include <stdio.h>
#include <sys/auxv.h> /* Needed for HWCAP definitions. */

static void func1_neon(void)
{
	printf("NEON implementationn");
}

static void func1_vfp(void)
{
	printf("VFP implementationn");
}

static void func1_arm(void)
{
	printf("ARM implementationn");
}

/* This line makes func1 a GNU indirect function. */
asm (".type func1, %gnu_indirect_function");

void *func1(unsigned long int hwcap)
{
	if (hwcap & HWCAP_ARM_NEON)
		return &func1_neon;
	else if (hwcap & HWCAP_ARM_VFP)
		return &func1_vfp;
	else
		return &func1_arm;
}

This code defines a resolver function, func1, which returns a pointer to a function implementation based on the value of hwcap that is passed to it. The value of hwcap is architecture specific and on some architectures, like x86_64, it is not passed at all, and you have to determine hardware capabilities by hand.

/* The caller sees func1 with the type of its implementations. */
void func1(void);

int main(void)
{
	func1();
	return 0;
}

This main function must be defined in a separate file to prevent the compiler from getting the wrong type for func1. Somewhat confusingly, from the caller’s perspective func1 has the type of its implementations rather than the type written in the indirect function’s definition. The function you actually define (the resolver) always has the same prototype – it takes an optional hwcap argument and returns a pointer to a function.
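
Incidentally, on architectures where no hwcap argument is passed, such as x86_64, the resolver has to probe the CPU itself. Below is a minimal sketch of how that might look, using GCC’s __builtin_cpu_supports and the ifunc function attribute (an alternative to the asm directive above); func2 and its implementations are invented names for illustration:

#include <stdio.h>

static void func2_avx(void)
{
	printf("AVX implementation\n");
}

static void func2_sse2(void)
{
	printf("SSE2 implementation\n");
}

static void func2_generic(void)
{
	printf("Generic implementation\n");
}

/* The resolver: no hwcap is passed on x86_64, so ask the CPU directly. */
static void (*resolve_func2(void))(void)
{
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx"))
		return func2_avx;
	else if (__builtin_cpu_supports("sse2"))
		return func2_sse2;
	else
		return func2_generic;
}

/* Declare func2 as an indirect function resolved by resolve_func2. */
void func2(void) __attribute__((ifunc("resolve_func2")));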

Build these two pieces of code and link them together:


# gcc -O2 ifunc.c -c
# gcc -O2 main.c -c
# gcc main.o ifunc.o -o ifunc
# ./ifunc
NEON implementation
#

This is the result I get on my ARM Chromebook; your result will vary depending on which hardware you actually have. So why is this better than the alternatives? You could dispatch to the optimized routine yourself at run time, but that adds an extra comparison and indirect call to every call to the function, and you also have to find the hardware capabilities yourself, which can be tricky. Alternatively, you could build optimized libraries for every bit of hardware you support, but that takes up a lot of disk space and needs significant infrastructure to dynamically open the correct library.
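
For reference, the hand-rolled dynamic dispatch alternative might look something like the sketch below. It relies on getauxval (glibc 2.16 or later) to read the hwcap itself, it only builds on ARM where the HWCAP_ARM_* constants are defined, and every call goes through an extra indirect jump:

#include <stdio.h>
#include <sys/auxv.h> /* getauxval() and HWCAP definitions. */

static void func1_neon(void) { printf("NEON implementation\n"); }
static void func1_vfp(void)  { printf("VFP implementation\n"); }
static void func1_arm(void)  { printf("ARM implementation\n"); }

/* Hand-rolled dispatch: a function pointer chosen once at startup. */
static void (*func1_impl)(void);

static void func1_init(void)
{
	unsigned long hwcap = getauxval(AT_HWCAP);

	if (hwcap & HWCAP_ARM_NEON)
		func1_impl = func1_neon;
	else if (hwcap & HWCAP_ARM_VFP)
		func1_impl = func1_vfp;
	else
		func1_impl = func1_arm;
}

int main(void)
{
	func1_init();
	/* Every call now pays for an extra indirect jump. */
	func1_impl();
	return 0;
}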

Indirect functions are not without their downsides, however. They are currently supported only on Linux with the GNU toolchain, and only on a subset of Linux-supported architectures, and in some cases the tools are not very well tested yet – for example, at the moment the example above will only work correctly on ARM with the as yet unreleased glibc 2.18, due to a bug in the dynamic linker – but they are a powerful tool that deserves to be more widely used.

The usefulness of gcc -Wextra

gcc is a really great tool for finding errors in your code, but it helps to pass it the right flags to make sure it doesn’t forget to tell you about them. I would always recommend enabling -Wall on whatever code you build, and if possible -Werror. However, sometimes this is not enough; for example, the code below has a subtle bug:

int func(char *buf)
{
    int i;

    for (i = 0; i < 16; i++)
    {
        if (buf[i] != 0xff)
            return 1;
    }

    return 0;
}

If we compile this with gcc:

# gcc -O2 -S -Wall func.c

The assembly output is given below.

        .text
        .globl  func
        .type   func, @function
func:
        movl    $1, %eax
        ret

It looks like all our code has disappeared! gcc has taken advantage of the type information in the code to decide that the loop can be elided completely and has optimized it out. This is because buf is a pointer to signed char (on x86_64; other architectures define char as unsigned and will not optimize this out) and 0xff is an integer constant that will not fit in a signed char, hence buf[i] != 0xff is always true, the function returns 1 on the first iteration, and the rest of the loop has no effect.
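
One way to fix the code so that the comparison is meaningful on every architecture (and the loop survives) is to make the element type explicitly unsigned, for example:

int func(unsigned char *buf)
{
    int i;

    /* 0xff now fits in the element type, so the comparison can be false
       and gcc keeps the loop. */
    for (i = 0; i < 16; i++)
    {
        if (buf[i] != 0xff)
            return 1;
    }

    return 0;
}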

This behaviour is perfectly fine with respect to the C standard; however, I do think gcc should warn when it performs this kind of transformation. If we build with clang (the LLVM C compiler), we get:

# clang -O2 -S func.c
func.c:8:14: warning: comparison of constant 255 with expression of type 'char' is always true [-Wtautological-constant-out-of-range-compare]
                if (buf[i] != 0xff)
                    ~~~~~~ ^  ~~~~
1 warning generated.

This seems much more appropriate than silence. gcc can be made to warn too, by passing -Wextra or -Wtype-limits:

# gcc -O2 -S -Wall -Wextra func.c
func.c: In function ‘func’:
func.c:8:3: warning: comparison is always true due to limited range of data type [-Wtype-limits]

-Wextra can be problematic, however, as it enables a number of warnings that are quite noisy and uninteresting, for example -Wunused-parameter, so you may have to experiment a little before applying it to your codebase.
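
If -Wunused-parameter turns out to be the main source of noise, one approach is to enable -Wextra and then switch individual warnings back off, for example:

# gcc -O2 -S -Wall -Wextra -Wno-unused-parameter func.c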

calloc versus malloc and memset

The standard C library provides two ways to allocate memory from the heap, calloc and malloc. They differ superficially in their arguments but more fundamentally in the guarantees they provide about the memory they return – calloc promises to fill any memory returned with zeros but malloc does not.

It would appear that the following two code snippets are equivalent:

/* Allocating with calloc. */
void *ptr = calloc(1024, 1024);
/* Allocating with malloc. */
void *ptr = malloc(1024*1024);
memset(ptr, 0, 1024*1024);

Functionally the two snippets of code are the same – they allocate 1MB of memory full of zeros from the heap – but in one subtle way they may not be.

The standard allocator on Linux is ptmalloc, part of glibc, which for large allocations of 128kB or above may use the mmap system call to allocate memory. The mmap system call returns a mapping of zeroed pages, so ptmalloc skips the memset for calloc allocations made with mmap and avoids the overhead involved. So what does this mean?
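
You can see the effect with a rough experiment like the one below. It relies on the allocation being large enough that ptmalloc uses mmap, and on Linux reporting the peak resident set size in kilobytes through getrusage; the exact numbers will vary from system to system:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

static long peak_rss_kb(void)
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_maxrss; /* Peak resident set size, in kB on Linux. */
}

int main(void)
{
	size_t size = 64 * 1024 * 1024;
	char *ptr = calloc(size, 1);

	if (ptr == NULL)
		return 1;

	/* The calloc'd pages have not been touched yet. */
	printf("after calloc: %ld kB\n", peak_rss_kb());

	/* Writing to the buffer forces real physical pages to be allocated. */
	memset(ptr, 0xff, size);
	printf("after memset: %ld kB\n", peak_rss_kb());

	free(ptr);
	return 0;
}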

In order to return an allocation of zeroed pages the kernel plays a trick. Instead of allocating however many pages are required and writing zeros into them, it allocates a single page filled with zeroes, and every time it needs a page full of zeros it maps the same one.

[Figure: zero page mapping – every virtual page mapped to the single zero page (left) versus each written page backed by its own physical page (right)]

Only one page is allocated initially and subsequent pages are allocated as needed when they are written to (copy on write), saving memory and allowing the allocation to be completed quickly. The allocation starts out looking like the picture on the left and, as pages are written to, eventually ends up looking like the picture on the right.

Normally this behaviour is beneficial to the performance and memory usage of your system; however, it caught me out when running some code to benchmark memory copy routines. Allocating a source buffer for a copy benchmark with calloc is not the same as using initialized memory: the zero page is a single physical page, so a physically tagged cache can quite comfortably contain the whole of your source buffer, even if the buffer you allocated was much larger than the cache.
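
In that situation one simple workaround is to write a non-zero pattern into the source buffer before the timed runs, so that every page is backed by its own physical page; something along these lines (alloc_source_buffer is just an illustrative name):

#include <stdlib.h>
#include <string.h>

/* Allocate a benchmark source buffer and make sure every page is really
   populated, rather than all mapping the shared zero page. */
static void *alloc_source_buffer(size_t size)
{
	void *buf = malloc(size);

	if (buf != NULL)
		memset(buf, 0xa5, size); /* Any non-zero pattern will do. */

	return buf;
}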

So in special situations like benchmarking it is best to think carefully before using calloc, but in general it can be a useful tool to improve the memory usage and performance of your code.


ARM C++ constructors are different

Programmers who have, like me, come to ARM from other architectures may find one or two things surprising.

For example, the following code is quite simple C++:

class A {
public:
    A() {}
    ~A() {}
};

A a;

int main(void)
{
    return 0;
}

But if we compile it and examine it with gdb, there’s something a bit unexpected:

$ gcc -g constructor.cpp -o constructor
$ gdb ./constructor 
GNU gdb (GDB) 7.5.91.20130417-cvs-ubuntu
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/linaro/constructor...done.
(gdb) ptype A::A
type = class A {
  public:
    A(void);
    ~A(int);
} *(A * const)
(gdb) ptype A::~A
type = void *(A * const)
(gdb)

The types of the constructor and destructor are not quite what we might expect. Traditionally a C++ constructor or destructor does not return a value; on ARM, however, things are different – the constructor returns a pointer to class A, and the destructor returns a pointer to void.

Why is this the case? On ARM, constructors and destructors are specified differently in order to provide scope for optimizing calls to a chain of constructors or destructors while minimizing the pushing of stack frames (tail call optimization). There’s a very helpful document from ARM called the C++ ABI for the ARM Architecture which details the differences between the ARM ABI and the Generic GNU C++ ABI, including this little quirk.
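
In C terms, the difference boils down to the signature the compiler gives the generated constructor, roughly as in the sketch below (the names are invented for illustration). Because the ARM variant returns this, a constructor whose last action is to call another constructor on the same object can simply branch to it rather than setting up a fresh stack frame:

struct A;

/* Generic C++ ABI: the generated constructor returns nothing. */
void A_construct_generic(struct A *this_ptr);

/* ARM C++ ABI: the generated constructor returns the object pointer, so a
   caller that only needs that pointer back can tail-call it. */
struct A *A_construct_arm(struct A *this_ptr);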