scvalex.net

37. An Optimisation Story

Premature optimisation may be the root of all evil, but it’s also dammed fun. I recently needed an OCaml library for affine transformations: essentially, I needed to multiply 3x3 matrices together.

Since the features I needed could be coded in 20 lines of code, it seemed like overkill to use a fully-featured matrix library. Following is a description of three implementations of 3x3 matrix multiplication and benchmark results.

The simple implementation just stores the matrices as nested arrays and does the multiplication with three nested loops.

type mat3 = float array array

let zero () =
  [| [| 0.0; 0.0; 0.0; |]
   ; [| 0.0; 0.0; 0.0; |]
   ; [| 0.0; 0.0; 0.0; |]
  |]
;;

val ( * ) : mat3 -> mat3 -> mat3
let ( * ) m1 m2 =
  let m3 = zero () in
  for i = 0 to 2 do
    for j = 0 to 2 do
      for k = 0 to 2 do
        m3.(i).(j) <- m3.(i).(j) +. m1.(i).(k) *. m2.(k).(j)
      done
    done
  done;
  m3
;;

There are several glaring problems with this code. The way it creates the result mat3 is inefficient: it first creates a matrix of zeroes, then sets each element to a new value. In addition to doing two memory writes when only one is needed, every second write will trigger the write barrier further slowing down the code. We can easily fix this by expanding the loops and populating the result mat3 in-place:

let ( * ) m1 m2 =
  [| [| m1.(0).(0) *. m2.(0).(0) +. m1.(0).(1) *. m2.(1).(0) +. m1.(0).(2) *. m2.(2).(0)
      ; m1.(0).(0) *. m2.(0).(1) +. m1.(0).(1) *. m2.(1).(1) +. m1.(0).(2) *. m2.(2).(1)
...
      ; m1.(2).(0) *. m2.(0).(2) +. m1.(2).(1) *. m2.(1).(2) +. m1.(2).(2) *. m2.(2).(2) |]
  |]
;;

Benchmarking the simple code and the expanded code (and the third version explained below), we see there’s a marked improvement in speed.

┌─────────────────────────────┬──────────┬────────────┐
│ Name                        │ Time/Run │ Percentage │
├─────────────────────────────┼──────────┼────────────┤
│ Array_mat.mat_mult          │ 107.74ns │    100.00% │
│ Array_mat.Expanded.mat_mult │  41.12ns │     38.17% │
│ Record_mat.mat_mult         │  13.75ns │     12.77% │
└─────────────────────────────┴──────────┴────────────┘

Another problem is that a mat3 is represented in memory as an array of 3 references to 3 other arrays of floats. This means that every array access will require following a pointer to a possibly distant memory location. We can get rid of the extra operation and achieve memory locality by representing a mat3 as a record of only float fields; the OCaml compiler will store this internally as an array of floats.

type mat3 = { m00 : float; m01 : float; m02 : float;
              m10 : float; m11 : float; m12 : float;
              m20 : float; m21 : float; m22 : float;
            }

let ( * ) m1 m2 =
  {
    m00 = m1.m00 *. m2.m00 +. m1.m01 *. m2.m10 +. m1.m02 *. m2.m20;
    m01 = m1.m00 *. m2.m01 +. m1.m01 *. m2.m11 +. m1.m02 *. m2.m21;
...
    m22 = m1.m20 *. m2.m02 +. m1.m21 *. m2.m12 +. m1.m22 *. m2.m22;
  }
;;

In addition to being eight times faster than the simple implementation, by representing matrices as records, we also rule out the possibility of structurally malformed matrices and get a compact notation for cells.


The full code used in the benchmark is available here. The next step would be to look at the generated code for ( * ) and fix any problems, but that’s too much premature optimisation even for me.