Every time I’ve needed a temporary file, I’ve used
mkstemp(3) or some variant of it. However, it wasn’t until recently that I wondered how it works, and more importantly, under what conditions it fails.
Let’s start from scratch: we need a temporary file, what do we do? If we had the name for the file, we could create it, so the problem reduces to generating a fresh unused filename. A simple way of doing this would be to use a sequence of integers as the filenames. For instance, we could name the first file “1”, the second “2”, and so on.
This introduces a new problem: given a directory, how do we know which numbers have already been used so that we can pick the next one? In a single process, we could keep a counter to represent all the numbers used so far, and increment it whenever a new name is needed. However, if we have multiple processes doing this in the same directory, it’s easy to see that they’ll trample over each other. We could synchronize the processes with some shared state. It could be a known file on disk, or shared memory, or a separate datastore process, but that’s clearly over-engineering the problem, and it still leaves the possibility that something else might be creating files with the same names.
So, we need to generate filenames according to some scheme, then we need to check if they’re already used, and if so, repeat until an unused filename is found. The issue is that checking if a filename is used involves a context switch and disk IO, both of which are costly in time. We can’t avoid checking once, but every collision will entail another check, and those can be avoided to some degree. To illustrate the problem, if we were to use consecutive integers starting at 0, the cost of generating a fresh filename would likely be linear with number of previously generated names.
Since the collisions can happen between filenames generated in the same process, in different runs of the program, or between multiple processes, we want to use as little state as possible. A simple way of achieving this is to make the sequence of filenames non-deterministic. For instance, we could use random 64-bit integers. By doing this, we’re pushing the burden of generating non-colliding names onto the random number generator, and even non-cryptographic ones are likely to be good enough for our purpose, especially given the huge number of possible 64-bit integers.
If we wanted to further reduce the possibility of collisions, we could include the program name, the pid, and a timestamp in the filename, in addition to the random number. In our case, the loss of readability caused by the longer filenames would outweigh the marginal gain in speed. However, if we were to take this line of reasoning further, we would end up with the various UUID schemes.
Going back to the original problem, the
mkstemp used in most Linux system is likely to be the one from
glibc (the function is called
__gen_tempname in that file). It works as described above, except that it uses a combination of the current time and pid instead of a random number, and it converts that to a six character alphanumeric string. The only ways it can fail, which aren’t caused by bad parameters, are IO errors and running out of attempts. Since it tries over 200K times by default, the latter failure case is exceedingly unlikely.