scvalex.net

40. Variable Length Arrays

When I learned C back in high school, I was taught that all arrays declared on the stack had to have their size statically known at compile time. If we wanted variable-sized arrays, we had to allocate them on the heap. It turns out that, as of C99, that’s no longer true.

Let’s start with the known, and then move into the unknown. Here’s how we define an array inside a C function, and below is the -O1 assembly GCC generates for it.

void __attribute__ ((noinline)) exercise(int *arr, int n) {
    int i;
    for (i = 1; i < n; ++i) {
        arr[i] = arr[i - 1] ^ arr[i];
    }
}

void foo() {
    int arr[10];
    arr[3] = 1234;
    exercise(arr, 10);
}
foo:
        subq    $56, %rsp
        movl    $1234, 12(%rsp)
        movl    $10, %esi
        movq    %rsp, %rdi
        call    exercise
        addq    $56, %rsp
        ret

The function we care about is foo. The point of exercise() is to stop GCC from optimising the array away. The assembly was generated by gcc -O1 -S and then stripped of annotations. We’re not using -O2 because, at that level, GCC reorders some of the instructions, obfuscating the generated code too much.

Step by step, foo allocates 56 bytes of local storage, writes 1234 into the fourth element of arr, copies 10 and the address of arr into registers in preparation for the call to exercise, calls exercise, frees the 56 bytes of local storage, and finally returns. Strictly speaking, foo only needs 40 bytes of local storage since its only local variable is a 10 element array of 4 byte elements. A bit more is allocated because of the x86_64 calling convention, but the exact details aren’t important here. After the allocation, arr starts at the address held in %rsp, so its nth element is at %rsp + 4 * n. In GNU assembler notation, arr[3] is written as 12(%rsp).

Since the address of arr is known at compile time, GCC can reference its elements directly. For instance, arr[3] was translated to a direct address; GCC didn’t have to load the address of arr into a register, add 12 to it, then reference that in the assignment. This remains true even if we have multiple local arrays.

void bar() {
    int arr[10];
    int brr[12];
    arr[3] = 1234;
    brr[3] = 5678;
    exercise(arr, 10);
    exercise(brr, 12);
}
bar:
    subq    $104, %rsp
    movl    $1234, 12(%rsp)
    movl    $5678, 60(%rsp)
    movl    $10, %esi
    movq    %rsp, %rdi
    call    exercise
    movl    $12, %esi
    leaq    48(%rsp), %rdi
    call    exercise
    addq    $104, %rsp
    ret

Like before, we can tell that arr is at %rsp and that brr is at %rsp + 48 by looking at the assignments or at the calls to exercise. Again, we see that arr[3] and brr[3] were addressed directly.

Now, according to the GCC manual (or the C99 spec), arrays can have variable length.

These arrays are declared like any other automatic arrays, but with a length that is not a constant expression. The storage is allocated at the point of declaration and deallocated when the block scope containing the declaration exits.

Here’s what the generated assembly looks like.

void baz(int n) {
    int arr[n];
    arr[3] = 1234;
    exercise(arr, n);
}
baz:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, %esi
    movslq  %edi, %rax
    leaq    18(,%rax,4), %rax
    andq    $-16, %rax
    subq    %rax, %rsp
    leaq    3(%rsp), %rax
    shrq    $2, %rax
    leaq    0(,%rax,4), %rdi
    movl    $1234, 12(,%rax,4)
    call    exercise
    leave
    ret

Rewriting baz with malloc(3) would be considerably more verbose, so if we can afford to allocate the array on the stack, doing so leads to clearer code. However, this comes at a cost, as the assembly reveals.
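For comparison, here is roughly what baz looks like with heap allocation instead. This is a sketch: baz_malloc and its bail-out-on-failure policy are mine, not from the original code, and exercise() is repeated so the snippet compiles on its own.

```c
#include <stdlib.h>

/* Same exercise() as above, repeated so this compiles standalone. */
static void exercise(int *arr, int n) {
    for (int i = 1; i < n; ++i) {
        arr[i] = arr[i - 1] ^ arr[i];
    }
}

/* Heap-allocated version of baz: more code, and two extra obligations
   (allocation can fail, and the storage must be freed explicitly). */
void baz_malloc(int n) {
    int *arr = malloc(n * sizeof(*arr));
    if (arr == NULL) {
        return;
    }
    arr[3] = 1234;
    exercise(arr, n);
    free(arr);
}
```

The VLA version gets all of this for free: allocation is a subtraction from %rsp, and deallocation happens automatically when the scope exits.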

The generated code looks complicated mostly because GCC tries very hard to avoid repeated computations by reordering instructions. The first thing baz does is save the %rbp register on the stack and copy %rsp into it; both steps are undone at the end by the leave instruction.

The parameter n is passed in through %rdi, and this is copied into %rsi in preparation for the call to exercise. Previously, this was done right before the call to exercise.

The next few lines compute the amount of local storage needed in %rax (i.e. (%rdi * 4 + 18) & (-16)), and allocate it by subtracting that from %rsp. The formula just ensures that the amount allocated is a multiple of 16 bytes, as the x86_64 calling convention’s stack alignment requires. This took 4 instructions, where before it could be done in only one.
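To convince ourselves the formula does what we claim, we can evaluate it directly in C (vla_stack_bytes is just a name I made up for this sketch):

```c
/* Mirrors baz's allocation-size computation:
       leaq 18(,%rax,4), %rax     # %rax = n * 4 + 18
       andq $-16, %rax            # round down to a multiple of 16
   The result is always a multiple of 16 and at least n * 4 bytes. */
long vla_stack_bytes(long n) {
    return (n * 4 + 18) & -16;
}
```

For n = 10 this yields (40 + 18) & -16 = 48 bytes: enough for the ten 4-byte elements, rounded to the 16-byte granularity the calling convention requires.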

The next few lines do the assignment and finish setting up the call to exercise. The address of arr is in %rsp at this point, and that gets copied into %rax. The instructions for the assignment and for passing arr to exercise are the same as before, except that the address is first divided by 4, and then multiplied by 4 again when it’s used. Unfortunately, I don’t have a good explanation for why this is done, but it doesn’t affect the rest of this discussion.
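One guess: dividing by 4 after adding 3, and then rescaling by 4 in the addressing mode, rounds the base address up to the next multiple of 4, i.e. it 4-byte-aligns the array. Spelled out in C:

```c
#include <stdint.h>

/* The leaq 3(%rsp), %rax / shrq $2, %rax / (,%rax,4) sequence, spelled
   out: adding 3 and discarding the low two bits rounds an address up
   to the next multiple of 4. */
uintptr_t align4_up(uintptr_t p) {
    return ((p + 3) >> 2) << 2;
}
```

Already-aligned addresses pass through unchanged; anything else gets bumped up to the next 4-byte boundary.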

To summarise this last bit of code: once we added one variable length array, allocating local storage got slightly more expensive, but accessing array elements remained just as fast. But that’s just an artifact of the simplicity of our example. Consider what the stack looks like with one variable length array.

+-----------+
+    ...    +
+-----------+ <- %rbp
+    ...    +
+  arr[n-1] +
+    ...    +
+   arr[1]  +
+   arr[0]  +
+    ...    +
+-----------+ <- %rsp

As long as we have at most one such array, the compiler can still address any local variable as a static offset from either %rbp or %rsp. However, with even two variable length arrays, this becomes impossible: the second array starts wherever the first one ends, which is only known at runtime, so its address can only be computed at runtime.
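A minimal function that hits this case might look as follows (qux is my name for it; exercise() is repeated so the snippet compiles on its own):

```c
/* Same exercise() as above, repeated so this compiles standalone. */
static void exercise(int *arr, int n) {
    for (int i = 1; i < n; ++i) {
        arr[i] = arr[i - 1] ^ arr[i];
    }
}

/* Two variable length arrays: brr starts wherever arr ends, which
   depends on n, so its address must be computed at runtime. */
void qux(int n, int m) {
    int arr[n];
    int brr[m];
    arr[3] = 1234;
    brr[3] = 5678;
    exercise(arr, n);
    exercise(brr, m);
}
```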

Luckily, this performance penalty turns out not to matter in practice since compilers have gotten very good at optimising code. For instance, when adding more variable length arrays to baz, GCC just makes the allocation a bit more convoluted, but saves the addresses of the arrays in registers. Once it runs out of registers at around the 12th such array, it starts saving their addresses on the stack and loading them up on demand. Strictly speaking, this could make array accesses twice as expensive (one instruction to load the address, and one to index from it); however, to hit this case, we’d have to access the 12 arrays one at a time in round-robin fashion. In any other case, the extra cost incurred by the variable length arrays would be essentially constant, and would be dominated by the cost of the rest of the code.
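To make that worst case concrete, here is the shape of code that would trigger it; this is a sketch with four arrays standing in for a dozen, and round_robin is my name for it. Every iteration touches every array, so the compiler has to keep all the base addresses at hand at once.

```c
/* Interleaved writes to several VLAs; with enough arrays, their base
   addresses no longer all fit in registers and must be reloaded from
   the stack on each access. */
int round_robin(int n) {
    int a0[n], a1[n], a2[n], a3[n];
    for (int i = 0; i < n; ++i) {
        a0[i] = i;
        a1[i] = i + 1;
        a2[i] = i + 2;
        a3[i] = i + 3;
    }
    return a0[n - 1] + a1[n - 1] + a2[n - 1] + a3[n - 1];
}
```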