This is the debugging story of a strange 12-minute hang I’d get after logging in to my SFTP server. It wasn’t the usual culprits, and it wasn’t any of the likely cloud complications. Instead, it turned out to be an issue as old as POSIX itself.

The setup

Before we get bogged down in debugging, let’s set the scene. I run a FoundryVTT deployment for my D&D group. This consists of the FoundryVTT nodejs server itself, but also an OpenSSH SFTP server used to upload assets. This all runs on my small Kubernetes cluster on a bunch of cheap Hetzner servers.

FoundryVTT setup

Kubernetes networking is complicated, to say the least. I have kube-proxy configured in iptables mode, so user packets can arrive at any node. From there, netfilter firewall rules redirect each packet to the right pod. In the diagram above, the red circles are all sets of netfilter rules, not actual processes. This is efficient, since all the routing happens in kernel space without any copying, but it makes everything harder to reason about and introduces lots of uncertainty when debugging.

Kubernetes has a good story for handling incoming HTTPS connections with ingresses. When a player goes to the game server address, DNS resolves to any one of the three nodes, the nginx ingress on that node receives the HTTPS connection, and reverse-proxies it to the right pod. In practice, the ingress forwards the HTTP request to the service’s IP address, and netfilter rules on each host rewrite that to the actual IPs of the pods implementing the service. One nice thing about this setup is that the ingress pod is a clear point where internal Kubernetes networking ends and regular host networking begins. The ingress also has logs, which are a boon when debugging.

The story for handling incoming generic TCP connections is less good. More precisely, it’s absent. One could set up a HAProxy ingress, but there’s no defined Kubernetes way of configuring it, and it seems like a lot of work for little gain. In our case, the much simpler solution is to do away with the ingress and configure the SFTP service as a NodePort with a static port. When a GM connects to the SFTP port, the packets reach any of the three hosts and are then redirected by netfilter rules to the right pod. This works, and is even more efficient than having an ingress since there’s one less process mediating connections, but it also means there are fewer logs to look at when debugging a problem.

The case of the incredibly slow SFTP login

With the setup out of the way, let’s see what it looks like for a GM to connect to the SFTP server:

$ sftp gm@foundry-ftp.example.com
gm@foundry-ftp.example.com's password:
...🕒 12 minutes of waiting 🕒...
Connected to foundry-ftp.example.com.
sftp>

That 12 minutes of waiting is unexpected. It’s also happening consistently, which rules out solar flares as a cause. Is it taking that long to verify the password? No, because we see the following in the SFTP pod log without any delay:

Accepted password for gm from 10.32.2.0 port 45087 ssh2

It appears to just wait for minutes after accepting the authentication, and before giving the user a prompt.

ℹ️ FTP vs SFTP

FTP and SFTP are only related by name and by user interface. Under the hood, they’re completely different protocols.

  • The File Transfer Protocol (FTP) uses separate control and data connections, and has no built-in transport security. The two separate connections are a frequent cause of issues, and would be the first thing I’d look at when debugging.

  • The SSH File Transfer Protocol (SFTP) is essentially just a custom shell that OpenSSH can force users into. This shell exposes only some file transfer commands. So, for instance, the user would be allowed to copy files, but wouldn’t be able to run programs. Since SFTP is really just SSH, it uses the same authentication and transport security.

In this post, we’re essentially talking about an SSH server which would pause for 12 minutes between accepting authentication and showing a prompt.

Speaking from experience, pauses of 10 minutes or more are rare. Pauses of a second are not unusual when cryptography is involved. Pauses of a few seconds frequently happen if there are external services involved. These can easily stretch to a minute if said services are timing out, and there’s some retry logic involved. But a pause of 12 minutes that happens consistently and always completes? That’s very unusual.

For reference, the pause on my server wasn’t always this bad. When I originally set up the FoundryVTT instance a year and a half ago, the pause was only a few seconds long. Then it grew to around 4 minutes half a year ago. Finally, it hit 12 minutes one month ago. These increases didn’t coincide with FoundryVTT or atmoz/sftp upgrades. They did seem to correlate with base system upgrades, but since the SFTP server was in a container, surely that can’t be the cause. Of course, it turned out to be exactly that.

Searching the web doesn’t reveal a whole lot. It seems like the common explanations for delayed prompts after logging in involve DNS, but the container image already has UseDNS no configured.

There are also some forum posts mentioning broken network switches and TCP QoS flags. While we do have container networking involved, there’s no reason to expect it to drop random packets. It’s hard to be sure, though, because there are so many moving parts. We can try connecting to the SFTP server from one of the hosts running the Kubernetes nodes to rule out internet connectivity as a cause. We can also try connecting from the host of the node running the FoundryVTT pod to rule out physical networking altogether. Finally, and this did not occur to me until much later, we can kubectl exec into the actual SFTP container, and try connecting from there. The pause was always reproducible, so whatever the problem is, it’s not networking.

If there’s no canned solution, we have to figure it out ourselves. What’s sshd actually doing while the user is observing the pause? The easiest thing to do is strace(1) it, and see what syscalls it’s calling.

$ kubectl exec --stdin --tty -n foundryvtt foundryvtt-0 -c foundryvtt-ftp -- /bin/sh

### Snapshot of processes before login
# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0  13280  4256 ?        Ss   13:00   0:00 sshd: /usr/sbin/sshd -D -e [listener] 0 of 10-100 startups
root          46  0.0  0.0   2420   412 pts/0    Ss   13:00   0:00 /bin/sh
root          52  0.0  0.0   6700  1724 pts/0    R+   13:00   0:00 ps aux

### Start logging in to SFTP and enter password

### Processes while login is happening. There are three new processes.
# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0  13280  4296 ?        Ss   13:00   0:00 sshd: /usr/sbin/sshd -D -e [listener] 0 of 10-100 startups
root          46  0.0  0.0   2420   412 pts/0    Ss   13:00   0:00 /bin/sh
root          73  0.0  0.0  13516  4888 ?        Ss   13:14   0:00 sshd: gm [priv]
gm            75  0.0  0.0  13516  3644 ?        S    13:14   0:00 sshd: gm@notty
gm            76  108  0.0  13516  2516 ?        Rs   13:14   0:04 sshd: gm@notty
root          77  0.0  0.0   6700  1796 pts/0    R+   13:14   0:00 ps aux

# strace -ff -p 73
strace: Process 73 attached
restart_syscall(<... resuming interrupted read ...>

# strace -ff -p 75
strace: Process 75 attached
select(13, [4 5 10 12], [], NULL, NULL

# strace -ff -p 76
...
close(1064764445)                       = -1 EBADF (Bad file descriptor)
close(1064764446)                       = -1 EBADF (Bad file descriptor)
close(1064764447)                       = -1 EBADF (Bad file descriptor)
close(1064764448)                       = -1 EBADF (Bad file descriptor)
close(1064764449)                       = -1 EBADF (Bad file descriptor)
close(1064764450)                       = -1 EBADF (Bad file descriptor)
close(1064764451)                       = -1 EBADF (Bad file descriptor)
close(1064764452)                       = -1 EBADF (Bad file descriptor)
close(1064764453)                       = -1 EBADF (Bad file descriptor)
close(1064764454)                       = -1 EBADF (Bad file descriptor)
close(1064764455)                       = -1 EBADF (Bad file descriptor)
close(1064764456)                       = -1 EBADF (Bad file descriptor)
close(1064764457)                       = -1 EBADF (Bad file descriptor)
close(1064764458)                       = -1 EBADF (Bad file descriptor)
close(1064764459)                       = -1 EBADF (Bad file descriptor)
close(1064764460)                       = -1 EBADF (Bad file descriptor)
close(1064764461)                       = -1 EBADF (Bad file descriptor)
close(1064764462)                       = -1 EBADF (Bad file descriptor)
close(1064764463)                       = -1 EBADF (Bad file descriptor)
close(1064764464)                       = -1 EBADF (Bad file descriptor)
...

We see sshd spawning three processes: 73 (running as root), 75 (running as gm) and 76 (again gm). The root process is blocked in a read. It’s presumably waiting for one of the children to write something. Process 75 is waiting for input via select, perhaps on a socket.

What on earth is process 76 doing? Why is it attempting to close an fd over a billion? There’s no way it ever opened that many files. Indeed, the close is failing with EBADF because the fd was never opened. Why is it going through fds consecutively? Did it actually go through a billion fds before my strace started, and is it going to go through millions more? Forked processes inherit fds from their parents, so is this some security feature?
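
That last question is worth pinning down. Here’s a minimal sketch (my own illustration, not OpenSSH code) of the behaviour it rests on: any fd open at fork() time is still usable in the child, so a server that spawns user sessions has to close whatever it doesn’t want those sessions to see.

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Open any file in the parent; /etc/hostname is just a stand-in. */
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0)
        return 1;

    if (fork() == 0) {
        /* Child: the inherited fd is still valid and readable. */
        char buf[64];
        ssize_t n = read(fd, buf, sizeof buf);
        printf("child read %zd bytes from inherited fd %d\n", n, fd);
        _exit(0);
    }

    wait(NULL);
    close(fd);
    return 0;
}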

This behaviour explains the delay: it’s no surprise that executing on the order of a billion syscalls takes minutes. It also hints at why the delay was getting longer with system upgrades: the limit on open files must’ve been creeping up with each NixOS release. Now we need to figure out if there’s any way to avoid this.
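
To get a feel for how high that ceiling is, here’s a small standalone sketch (again mine, not sshd code) that prints the limit a brute-force close loop would have to walk up to. Run inside the SFTP container, the soft limit should come out just over a billion, matching the fd numbers in the strace above.

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    /* The per-process ceiling on fd numbers; a loop that closes every
     * possible fd has to walk all the way up to this. */
    long open_max = sysconf(_SC_OPEN_MAX);
    printf("sysconf(_SC_OPEN_MAX) = %ld\n", open_max);

    /* The same number, straight from the resource limit. */
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("RLIMIT_NOFILE soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    return 0;
}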

Forking SFTP

Let’s see what actually happens when a user connects to an OpenSSH server. We’re going to look at the 8.8p1 source code, which hasn’t changed in an interesting way from the 8.4p1 I was originally debugging.

We search for accept in the source, and find a promising hit in server_accept_loop at sshd.c:1255. After accepting the socket, it does some checks, forks at line 1326, and eventually returns to main. The server takes care of authentication and starts the session. Because we’re using the SFTP subsystem, control eventually arrives at server_input_channel_req at serverloop.c:865. Following the code, we first get to session_subsystem_req in session.c:1955, then to do_child at line 1515. This function changes the user with do_setusercontext, chroots if necessary, then calls the suspiciously named child_close_fds, which in turn calls the even more suspiciously named closefrom.

/*
 * Close any extra open file descriptors so that we don't have them
 * hanging around in clients.  Note that we want to do this after
 * initgroups, because at least on Solaris 2.3 it leaves file
 * descriptors open.
 */
closefrom(STDERR_FILENO + 1);

It appears that closefrom is a function available on the BSDs. It doesn’t exist on Linux, so we end up using the implementation from openbsd-compat/bsd-closefrom.c. The shape of the file is this (lots of code elided without mention):

#ifndef HAVE_CLOSEFROM

#ifndef HAVE_FCNTL_CLOSEM
/* Close all file descriptors greater than or equal to lowfd. */
static void closefrom_fallback(int lowfd) {
    long fd, maxfd;
    maxfd = sysconf(_SC_OPEN_MAX);
    for (fd = lowfd; fd < maxfd; fd++)
        (void) close((int) fd);
}
#endif /* HAVE_FCNTL_CLOSEM */

#ifdef HAVE_FCNTL_CLOSEM
void closefrom(int lowfd) {
    (void) fcntl(lowfd, F_CLOSEM, 0);
}

#elif defined(HAVE_LIBPROC_H) && defined(HAVE_PROC_PIDINFO)
void closefrom(int lowfd) {
    ...
}

#elif defined(HAVE_DIRFD) && defined(HAVE_PROC_PID)
void closefrom(int lowfd) {
    ...
    /* Check for a /proc/$$/fd directory. */
    len = snprintf(fdpath, sizeof(fdpath), "/proc/%ld/fd", (long)getpid());
    if (len > 0 && (size_t)len < sizeof(fdpath) && (dirp = opendir(fdpath))) {
        while ((dent = readdir(dirp)) != NULL) {
            fd = strtol(dent->d_name, &endp, 10);
            if (dent->d_name != endp && *endp == '\0' &&
                fd >= 0 && fd < INT_MAX && fd >= lowfd && fd != dirfd(dirp))
                (void) close((int) fd);
        }
        (void) closedir(dirp);
        return;
    }
    /* /proc/$$/fd strategy failed, fall back to brute force closure */
    closefrom_fallback(lowfd);
}

#else
void closefrom(int lowfd)
{
    closefrom_fallback(lowfd);
}

#endif /* HAVE_FCNTL_CLOSEM */
#endif /* HAVE_CLOSEFROM */

We see that there’s a waterfall of closefrom implementations:

  • The system closefrom is used on BSD.

  • If fcntl supports the F_CLOSEM flag, then that’s used. This seems to be another BSD-only feature.

  • If libproc’s proc_pidinfo is available (the macOS case), or if the /proc/PID/fd directory exists (the Linux case), then that’s used to enumerate the fds that are actually open, and only those fds are closed.

  • If all else fails, just try closing all fds up to the system limit.

Let’s look at our strace again to see if there’s any hint as to which case we’re hitting.

...
openat(AT_FDCWD, "/proc/76/fd", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
prlimit64(0, RLIMIT_NOFILE, NULL,  <unfinished ...>
<... prlimit64 resumed>{rlim_cur=1073741816, rlim_max=1073741816}) = 0
close(3)                    = 0
close(4)                    = -1 EBADF (Bad file descriptor)
close(5)                    = 0
close(6)                    = 0
close(7)                    = 0
close(8)                    = -1 EBADF (Bad file descriptor)
close(9)                    = -1 EBADF (Bad file descriptor)
close(10)                   = -1 EBADF (Bad file descriptor)
close(11)                   = -1 EBADF (Bad file descriptor)
...

It looks like we’re failing to open /proc/76/fd, then falling back to the brute-force approach. This could be because the process has already changed to a different user, but the error code is ENOENT, not EACCES, so that isn’t it. The only other possibility is that we’re chroot’d into /home/gm, so /proc doesn’t exist from the process’s point of view. The numbers check out, too: closing every fd up to a limit of 1073741816 at a rate of a million or two close() calls per second takes on the order of ten minutes.
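
To convince ourselves that the chroot alone is enough to break the /proc strategy, here’s a small sketch (mine, not OpenSSH’s, and it needs root) that chroots into the same directory and repeats the lookup:

#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* Stand-in for ChrootDirectory %h; adjust the path as needed. */
    if (chroot("/home/gm") != 0 || chdir("/") != 0) {
        perror("chroot");
        return 1;
    }

    /* The same check bsd-closefrom.c makes: is /proc/<pid>/fd there? */
    char fdpath[64];
    snprintf(fdpath, sizeof(fdpath), "/proc/%ld/fd", (long)getpid());
    DIR *dirp = opendir(fdpath);
    if (dirp == NULL) {
        /* Inside the chroot there is no /proc mount, so this prints
         * "No such file or directory", i.e. ENOENT. */
        printf("opendir(%s): %s\n", fdpath, strerror(errno));
        return 0;
    }
    closedir(dirp);
    printf("/proc is visible; the fast path would be taken\n");
    return 0;
}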

We test this out by removing ChrootDirectory %h from sshd_config, and indeed the pause is gone 🎉. Disabling the chroot like this is acceptable because we’re running sftp in an unprivileged container with just one user, so there isn’t much an attacker could do if they gained access.

That’s the workaround, but what’s the fix? In plain POSIX, I don’t think there is one; you have to somehow know which fds are open in order to close them, and there’s no way to mass-close fds. In the various BSDs, there’s closefrom and fcntl(_, F_CLOSEM, _). As of kernel 5.9, Linux gained the close_range(2) syscall, and OpenSSH is already using it on master. So, once that’s released, and once it filters down into distributions, we can re-enable the chroot.
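
For reference, here’s roughly what the close_range approach looks like. This is my own sketch against the raw syscall on a 5.9+ kernel (and kernel headers that define __NR_close_range), not the exact code that landed in OpenSSH.

/* Sketch of the close_range(2) approach on Linux >= 5.9: close every fd
 * from lowfd upwards in a single syscall instead of a billion close()s. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void closefrom_linux(unsigned int lowfd) {
#ifdef __NR_close_range
    /* ~0U means "up to the highest possible fd". */
    if (syscall(__NR_close_range, lowfd, ~0U, 0) == 0)
        return;
#endif
    /* Older kernel or filtered syscall: a real caller would fall back to
     * the /proc scan or the brute-force loop shown earlier. */
}

int main(void) {
    closefrom_linux(STDERR_FILENO + 1);
    printf("still alive with stdio intact\n");
    return 0;
}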

Conclusion

Ultimately, the “fix” is to disable the ChrootDirectory in sshd_config, and wait for an OpenSSH release with better behaviour. This isn’t the most interesting resolution, but at least logins don’t take forever anymore.