• ## 37. An Optimisation Story

Premature optimisation may be the root of all evil, but it’s also dammed fun. I recently needed an OCaml library for affine transformations: essentially, I needed to multiply 3x3 matrices together.

Since the features I needed could be coded in 20 lines of code, it seemed like overkill to use a fully-featured matrix library. Following is a description of three implementations of 3x3 matrix multiplication and benchmark results.

The simple implementation just stores the matrices as nested arrays and does the multiplication with three nested loops.

type mat3 = float array array

let zero () =
[| [| 0.0; 0.0; 0.0; |]
; [| 0.0; 0.0; 0.0; |]
; [| 0.0; 0.0; 0.0; |]
|]
;;

val ( * ) : mat3 -> mat3 -> mat3
let ( * ) m1 m2 =
let m3 = zero () in
for i = 0 to 2 do
for j = 0 to 2 do
for k = 0 to 2 do
m3.(i).(j) <- m3.(i).(j) +. m1.(i).(k) *. m2.(k).(j)
done
done
done;
m3
;;

There are several glaring problems with this code. The way it creates the result mat3 is inefficient: it first creates a matrix of zeroes, then sets each element to a new value. In addition to doing two memory writes when only one is needed, every second write will trigger the write barrier further slowing down the code. We can easily fix this by expanding the loops and populating the result mat3 in-place:

let ( * ) m1 m2 =
[| [| m1.(0).(0) *. m2.(0).(0) +. m1.(0).(1) *. m2.(1).(0) +. m1.(0).(2) *. m2.(2).(0)
; m1.(0).(0) *. m2.(0).(1) +. m1.(0).(1) *. m2.(1).(1) +. m1.(0).(2) *. m2.(2).(1)
...
; m1.(2).(0) *. m2.(0).(2) +. m1.(2).(1) *. m2.(1).(2) +. m1.(2).(2) *. m2.(2).(2) |]
|]
;;

Benchmarking the simple code and the expanded code (and the third version explained below), we see there’s a marked improvement in speed.

┌─────────────────────────────┬──────────┬────────────┐
│ Name                        │ Time/Run │ Percentage │
├─────────────────────────────┼──────────┼────────────┤
│ Array_mat.mat_mult          │ 107.74ns │    100.00% │
│ Array_mat.Expanded.mat_mult │  41.12ns │     38.17% │
│ Record_mat.mat_mult         │  13.75ns │     12.77% │
└─────────────────────────────┴──────────┴────────────┘

Another problem is that a mat3 is represented in memory as an array of 3 references to 3 other arrays of floats. This means that every array access will require following a pointer to a possibly distant memory location. We can get rid of the extra operation and achieve memory locality by representing a mat3 as a record of only float fields; the OCaml compiler will store this internally as an array of floats.

type mat3 = { m00 : float; m01 : float; m02 : float;
m10 : float; m11 : float; m12 : float;
m20 : float; m21 : float; m22 : float;
}

let ( * ) m1 m2 =
{
m00 = m1.m00 *. m2.m00 +. m1.m01 *. m2.m10 +. m1.m02 *. m2.m20;
m01 = m1.m00 *. m2.m01 +. m1.m01 *. m2.m11 +. m1.m02 *. m2.m21;
...
m22 = m1.m20 *. m2.m02 +. m1.m21 *. m2.m12 +. m1.m22 *. m2.m22;
}
;;

In addition to being eight times faster than the simple implementation, by representing matrices as records, we also rule out the possibility of structurally malformed matrices and get a compact notation for cells.

The full code used in the benchmark is available here. The next step would be to look at the generated code for ( * ) and fix any problems, but that’s too much premature optimisation even for me.

• ## 36. Empirical Pi

You’ve finally snuck into the evil mastermind’s billiards room. You’re only one door away from his office and the Big Red Button that stops the moon laser from vaporizing the Great Barrier Reef.

What is the value of $$\pi$$?”, beeps the security panel.

“Huh. Easy”, you think as you type $$2.71$$.

“Mhm. Maybe it needs more digits?” You type $$2.71828$$.

You feel a cold sweat coming on as you scan the room for something to use. You could try to calculate $$\pi$$ with a series expansion, but you don’t trust yourself anymore. No, with one chance to go, you want something foolproof. Billiards table. Chairs. Dartboard. Liqueur cabinet. Dartboard!

A dartboard is just a square with an inscribed circle. If the side of the square is $$L$$, the radius of the circle is $$\frac{1}{2} L$$. So, the ratio of the surfaces of the circle and the square are:

$\frac{S_C}{S_S} = \frac{\pi * (\frac{1}{2} L)^2}{L^2} = \frac{1}{4}\pi$

But you can also find that ratio by throwing darts randomly and counting how many land in the circle. A simple program to do this is here:

#!/usr/bin/python

from __future__ import print_function
import math, random

acErr = 1e-5

def inCircle(x, y):
"""inCircle(x, y) iff (x, y) is in the circle of radius 1 centered at
origin"""
return (math.sqrt(x**2 + y**2) - 1.0) <= acErr

def piGen():
"""Generates increasingly accurate values for Pi"""
S_circle, S_square = 0.0, 0.0
while True:
x, y = random.random(), random.random()
S_circle += (1 if inCircle(x, y) else 0)
S_square += 1
yield (4*S_circle/S_square)

def run_pi():
random.seed()
print("Pi is 0.00000", end="")
i = 0
for p in piGen():
i += 1
if i % 25 == 0:
print("\rPi is %1.5f" % p, end="")

if __name__ == "__main__":
run_pi()

Alternatively, you could use the Gregory-Leibniz series which is easy to remember but much harder to derive:

$\frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots$