Premature optimisation may be the root of all evil, but it’s also dammed fun. I recently needed an OCaml library for affine transformations: essentially, I needed to multiply 3x3
matrices together.
Since the features I needed could be coded in 20 lines of code, it seemed like overkill to use a fully-featured matrix library. Following is a description of three implementations of 3x3
matrix multiplication and benchmark results.
The simple implementation just stores the matrices as nested arrays and does the multiplication with three nested loops.
type mat3 = float array array
let zero () =
[| [| 0.0; 0.0; 0.0; |]
; [| 0.0; 0.0; 0.0; |]
; [| 0.0; 0.0; 0.0; |]
|]
;;
val ( * ) : mat3 -> mat3 -> mat3
let ( * ) m1 m2 =
let m3 = zero () in
for i = 0 to 2 do
for j = 0 to 2 do
for k = 0 to 2 do
m3.(i).(j) <- m3.(i).(j) +. m1.(i).(k) *. m2.(k).(j)
done
done
done;
m3
;;
There are several glaring problems with this code. The way it creates the result mat3
is inefficient: it first creates a matrix of zeroes, then sets each element to a new value. In addition to doing two memory writes when only one is needed, every second write will trigger the write barrier further slowing down the code. We can easily fix this by expanding the loops and populating the result mat3
in-place:
let ( * ) m1 m2 =
[| [| m1.(0).(0) *. m2.(0).(0) +. m1.(0).(1) *. m2.(1).(0) +. m1.(0).(2) *. m2.(2).(0)
; m1.(0).(0) *. m2.(0).(1) +. m1.(0).(1) *. m2.(1).(1) +. m1.(0).(2) *. m2.(2).(1)
...
; m1.(2).(0) *. m2.(0).(2) +. m1.(2).(1) *. m2.(1).(2) +. m1.(2).(2) *. m2.(2).(2) |]
|]
;;
Benchmarking the simple code and the expanded code (and the third version explained below), we see there’s a marked improvement in speed.
┌─────────────────────────────┬──────────┬────────────┐
│ Name │ Time/Run │ Percentage │
├─────────────────────────────┼──────────┼────────────┤
│ Array_mat.mat_mult │ 107.74ns │ 100.00% │
│ Array_mat.Expanded.mat_mult │ 41.12ns │ 38.17% │
│ Record_mat.mat_mult │ 13.75ns │ 12.77% │
└─────────────────────────────┴──────────┴────────────┘
Another problem is that a mat3
is represented in memory as an array of 3 references to 3 other arrays of float
s. This means that every array access will require following a pointer to a possibly distant memory location. We can get rid of the extra operation and achieve memory locality by representing a mat3
as a record of only float
fields; the OCaml compiler will store this internally as an array of float
s.
type mat3 = { m00 : float; m01 : float; m02 : float;
m10 : float; m11 : float; m12 : float;
m20 : float; m21 : float; m22 : float;
}
let ( * ) m1 m2 =
{
m00 = m1.m00 *. m2.m00 +. m1.m01 *. m2.m10 +. m1.m02 *. m2.m20;
m01 = m1.m00 *. m2.m01 +. m1.m01 *. m2.m11 +. m1.m02 *. m2.m21;
...
m22 = m1.m20 *. m2.m02 +. m1.m21 *. m2.m12 +. m1.m22 *. m2.m22;
}
;;
In addition to being eight times faster than the simple implementation, by representing matrices as records, we also rule out the possibility of structurally malformed matrices and get a compact notation for cells.
The full code used in the benchmark is available here. The next step would be to look at the generated code for ( * )
and fix any problems, but that’s too much premature optimisation even for me.