This is the debugging story of a strange 12-minute hang I’d get after logging in to my SFTP server. It wasn’t the usual culprits, and it wasn’t any of the likely cloud complications. Instead, it turned out to be an issue as old as POSIX itself.
The setup
Before we get bogged down in debugging, let’s set the scene. I run a FoundryVTT deployment for my D&D group. It consists of the FoundryVTT Node.js server itself, plus an OpenSSH SFTP server used to upload assets. This all runs on my small Kubernetes cluster on a bunch of cheap Hetzner servers.
Kubernetes networking is complicated, to say the least. I have kube-proxy configured in iptables mode, so user packets can arrive at any node. Then, there are netfilter firewall rules to redirect each packet to the right pod. In the diagram above, all the red circles are sets of netfilter rules, not actual processes. This is efficient, since all the routing is done in kernel space without any copying, but it makes everything harder to reason about, and it introduces lots of uncertainty when debugging.
Kubernetes has a good story for handling incoming HTTPS connections with ingresses. When a player goes to the game server address, DNS resolves to any one of the three nodes, the nginx ingress on that node receives the HTTPS connection, and reverse proxies it to the right pod. Practically, the ingress actually forwards the HTTP request to the service’s IP address, and there are netfilter rules on each host to rewrite that to the actual IPs of the pods implementing the service. One nice thing about this setup is that the ingress pod is a clear point where internal Kubernetes networking ends, and regular host networking begins. The ingress also has logs, which are a boon when debugging.
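To make that concrete, the HTTPS side boils down to an Ingress resource roughly like the following. This is only a sketch: the hostname, resource names, and port are illustrative, not the actual manifest.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: foundryvtt             # hypothetical name
  namespace: foundryvtt
spec:
  rules:
    - host: foundry.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: foundryvtt   # the ClusterIP service in front of the pod
                port:
                  number: 30000    # illustrative; FoundryVTT listens on 30000 by default

The nginx ingress controller on whichever node receives the connection terminates TLS and proxies to the service named here; the netfilter rules then rewrite the service IP to a pod IP.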
The story for handling incoming generic TCP connections is less good. More precisely, it’s absent. One could set up a HAProxy ingress, but there’s no defined Kubernetes way of configuring this, and it seems like a lot of work for little gain. In our case, the much simpler solution is to do away with the ingress, and configure the SFTP service as a NodePort with a static port. When a GM connects to the SFTP port, the packets reach any of the three hosts, and are then redirected by netfilter rules to the right pod. This works, and is even more efficient than having an ingress, since there’s one less process mediating connections, but it also means there are fewer logs to look at when debugging a problem.
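Concretely, the SFTP service is just a Service of type NodePort with a pinned nodePort, something like this (names and port numbers are made up for illustration):

apiVersion: v1
kind: Service
metadata:
  name: foundryvtt-sftp        # hypothetical name
  namespace: foundryvtt
spec:
  type: NodePort
  selector:
    app: foundryvtt            # matches the pod running the sshd container
  ports:
    - name: sftp
      port: 22                 # in-cluster port of the service
      targetPort: 22           # containerPort of the OpenSSH container
      nodePort: 32022          # static port exposed on every node (default range 30000-32767)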
The case of the incredibly slow SFTP login
With the setup out of the way, let’s see what it looks like for a GM to connect to the SFTP server:
$ sftp gm@foundry-ftp.example.com
gm@foundry-ftp.example.com's password:
...🕒 12 minutes of waiting 🕒...
Connected to foundry-ftp.example.com.
sftp>
That 12 minutes of waiting is unexpected. It’s also happening consistently, which rules out solar flares as a cause. Is it taking that long to verify the password? No, because we see the following in the SFTP pod log without any delay:
Accepted password for gm from 10.32.2.0 port 45087 ssh2
It appears to just wait for minutes after accepting the authentication before giving the user a prompt.
ℹ️ FTP vs SFTP
FTP and SFTP are only related by name and by user interface. Under the hood, they’re completely different protocols.
- The File Transfer Protocol (FTP) uses separate control and data connections, and has no built-in transport security. The two separate connections are a frequent cause of issues, and would be the first thing I’d look at when debugging.
- The SSH File Transfer Protocol (SFTP) is essentially just a custom shell that OpenSSH can force users into. This shell exposes only some file transfer commands. So, for instance, the user would be allowed to copy files, but wouldn’t be able to run programs. Since SFTP is really just SSH, it uses the same authentication and transport security.
In this post, we’re essentially talking about an SSH server which would pause for 12 minutes between accepting authentication and showing a prompt.
Speaking from experience, pauses of 10 minutes or more are rare. Pauses of a second are not unusual when cryptography is involved. Pauses of a few seconds frequently happen if there are external services involved. These can easily stretch to a minute if said services are timing out, and there’s some retry logic involved. But a pause of 12 minutes that happens consistently and always completes? That’s very unusual.
For reference, the pause in my server wasn’t always this bad. When I originally set up the FoundryVTT instance a year and a half ago, the pause was only a few seconds long. Then, it increased to around 4 minutes half a year ago. Finally, it increased to 12 minutes one month ago. These increases didn’t coincide with FoundryVTT or atmoz/sftp upgrades. They did seem to correlate with base system upgrades, but since the SFTP server was in a container, surely that can’t be the cause. Of course, it turned out to be exactly that.
Searching the web doesn’t reveal a whole lot. It seems like the common explanations for delayed prompts after logging in involve DNS, but the container image already has UseDNS no configured.
There are also some forum posts mentioning broken network switches and TCP QoS flags. While we do have container networking involved, there’s no reason to expect it to drop random packets. It’s hard to be sure, though, because there are so many moving parts. We can try connecting to the SFTP server from one of the hosts running the Kubernetes nodes to rule out internet connectivity as a cause. We can also try connecting from the host of the node running the FoundryVTT pod to rule out physical networking altogether. Finally, and this did not occur to me until much later, we can kubectl exec into the actual SFTP container, and try connecting from there. The pause was always reproducible, so whatever the problem is, it’s not networking.
If there’s no canned solution, we have to figure it out ourselves. What’s sshd
actually doing while the user is observing the pause? The easiest thing to do is strace(1)
it, and see what syscalls it’s calling.
$ kubectl exec --stdin --tty -n foundryvtt foundryvtt-0 -c foundryvtt-ftp -- /bin/sh
### Snapshot of processes before login
# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 13280 4256 ? Ss 13:00 0:00 sshd: /usr/sbin/sshd -D -e [listener] 0 of 10-100 startups
root 46 0.0 0.0 2420 412 pts/0 Ss 13:00 0:00 /bin/sh
root 52 0.0 0.0 6700 1724 pts/0 R+ 13:00 0:00 ps aux
### Start logging in to SFTP and enter password
### Processes while login is happening. There are three new processes.
# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 13280 4296 ? Ss 13:00 0:00 sshd: /usr/sbin/sshd -D -e [listener] 0 of 10-100 startups
root 46 0.0 0.0 2420 412 pts/0 Ss 13:00 0:00 /bin/sh
root 73 0.0 0.0 13516 4888 ? Ss 13:14 0:00 sshd: gm [priv]
gm 75 0.0 0.0 13516 3644 ? S 13:14 0:00 sshd: gm@notty
gm 76 108 0.0 13516 2516 ? Rs 13:14 0:04 sshd: gm@notty
root 77 0.0 0.0 6700 1796 pts/0 R+ 13:14 0:00 ps aux
# strace -ff -p 73
strace: Process 73 attached
restart_syscall(<... resuming interrupted read ...>
# strace -ff -p 75
strace: Process 75 attached
select(13, [4 5 10 12], [], NULL, NULL
# strace -ff -p 76
...
close(1064764445) = -1 EBADF (Bad file descriptor)
close(1064764446) = -1 EBADF (Bad file descriptor)
close(1064764447) = -1 EBADF (Bad file descriptor)
close(1064764448) = -1 EBADF (Bad file descriptor)
close(1064764449) = -1 EBADF (Bad file descriptor)
close(1064764450) = -1 EBADF (Bad file descriptor)
close(1064764451) = -1 EBADF (Bad file descriptor)
close(1064764452) = -1 EBADF (Bad file descriptor)
close(1064764453) = -1 EBADF (Bad file descriptor)
close(1064764454) = -1 EBADF (Bad file descriptor)
close(1064764455) = -1 EBADF (Bad file descriptor)
close(1064764456) = -1 EBADF (Bad file descriptor)
close(1064764457) = -1 EBADF (Bad file descriptor)
close(1064764458) = -1 EBADF (Bad file descriptor)
close(1064764459) = -1 EBADF (Bad file descriptor)
close(1064764460) = -1 EBADF (Bad file descriptor)
close(1064764461) = -1 EBADF (Bad file descriptor)
close(1064764462) = -1 EBADF (Bad file descriptor)
close(1064764463) = -1 EBADF (Bad file descriptor)
close(1064764464) = -1 EBADF (Bad file descriptor)
...
We see sshd spawning three processes: 73 (running as root), 75 (running as gm) and 76 (again gm). The root process is blocked in a read. It’s presumably waiting for one of the children to write something. Process 75 is waiting for input via select, perhaps on a socket.
What on earth is process 76 doing? Why is it attempting to close an fd over a billion? There’s no way it ever opened that many files. Indeed, the close is failing with EBADF because the fd was never opened. Why is it going through fds consecutively? Did it actually go through a billion fds before my strace started, and is it going to go through millions more? Forked processes inherit fds from their parents, so is this some security feature?
This behaviour explains the delay: it’s no surprise that executing on the order of a billion syscalls takes minutes. It also hints at why the delay was getting longer with system upgrades – the limit on open files must’ve been creeping up with each NixOS release. Now we need to figure out if there’s any way to avoid this.
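As an aside, it’s easy to check how high that limit is inside the container, using the same pod and container names as the kubectl exec above. ulimit -n and /proc/PID/limits are standard, though the exact output format varies; PID 1 here is sshd itself.

# RLIMIT_NOFILE as seen inside the SFTP container
$ kubectl exec -n foundryvtt foundryvtt-0 -c foundryvtt-ftp -- sh -c 'ulimit -n'
$ kubectl exec -n foundryvtt foundryvtt-0 -c foundryvtt-ftp -- grep 'open files' /proc/1/limits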
Forking SFTP
Let’s see what actually happens when a user connects to an OpenSSH server. We’re going to look at the 8.8p1 source code, which hasn’t changed in an interesting way from the 8.4p1 I was originally debugging.
We search for accept in the source, and find a promising hit in server_accept_loop at sshd.c:1255. After accepting the socket, it does some checks, then forks at line 1326, and eventually returns to main. The server takes care of authentication and starts the session. Because we’re using the SFTP subsystem, control eventually arrives at server_input_channel_req at serverloop.c:865. Following the code, we first get to session_subsystem_req in session.c:1955, then to do_child at line 1515. This function changes the user with do_setusercontext, chroots if necessary, then calls the suspiciously named child_close_fds, which in turn calls the even more suspiciously named closefrom.
/*
* Close any extra open file descriptors so that we don't have them
* hanging around in clients. Note that we want to do this after
* initgroups, because at least on Solaris 2.3 it leaves file
* descriptors open.
*/
closefrom(STDERR_FILENO + 1);
It appears that closefrom is a function found on the BSDs. It doesn’t exist on Linux, so we end up using the implementation from openbsd-compat/bsd_closefrom.c. The shape of the file is this (lots of code elided without mention):
#ifndef HAVE_CLOSEFROM
#ifndef HAVE_FCNTL_CLOSEM
/* Close all file descriptors greater than or equal to lowfd. */
static void closefrom_fallback(int lowfd) {
long fd, maxfd;
maxfd = sysconf(_SC_OPEN_MAX);
for (fd = lowfd; fd < maxfd; fd++)
(void) close((int) fd);
}
#endif /* HAVE_FCNTL_CLOSEM */
#ifdef HAVE_FCNTL_CLOSEM
void closefrom(int lowfd) {
(void) fcntl(lowfd, F_CLOSEM, 0);
}
#elif defined(HAVE_LIBPROC_H) && defined(HAVE_PROC_PIDINFO)
void closefrom(int lowfd) {
...
}
#elif defined(HAVE_DIRFD) && defined(HAVE_PROC_PID)
void closefrom(int lowfd) {
...
/* Check for a /proc/$$/fd directory. */
len = snprintf(fdpath, sizeof(fdpath), "/proc/%ld/fd", (long)getpid());
if (len > 0 && (size_t)len < sizeof(fdpath) && (dirp = opendir(fdpath))) {
while ((dent = readdir(dirp)) != NULL) {
fd = strtol(dent->d_name, &endp, 10);
if (dent->d_name != endp && *endp == '\0' &&
fd >= 0 && fd < INT_MAX && fd >= lowfd && fd != dirfd(dirp))
(void) close((int) fd);
}
(void) closedir(dirp);
return;
}
/* /proc/$$/fd strategy failed, fall back to brute force closure */
closefrom_fallback(lowfd);
}
#else
void closefrom(int lowfd)
{
closefrom_fallback(lowfd);
}
#endif
#endif /* HAVE_CLOSEFROM */
We see that there’s a waterfall of closefrom
implementations:
-
The system
closefrom
is used on BSD. -
If
fcntl
supports theF_CLOSEM
flag, then that’s used. This seems to be another BSD-only feature. -
If
libproc
is available (probably as part oflibprocps
), or if the/proc/PID
filesystem exists, then that’s used to find the largest open fd. The code then closes all fds up to the largest one. -
If all else fails, just try closing all fds up to the system limit.
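To get a feel for that last strategy, here’s a small self-contained sketch that mimics closefrom_fallback: it calls close on every fd up to sysconf(_SC_OPEN_MAX), almost none of which are open. The program is an illustration I wrote for this purpose, not code from OpenSSH.

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
	/* Same ceiling closefrom_fallback uses */
	long maxfd = sysconf(_SC_OPEN_MAX);
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long fd = 3; fd < maxfd; fd++)
		(void) close((int) fd);   /* almost every call just returns EBADF */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("attempted to close %ld fds in %.1f seconds\n", maxfd - 3, secs);
	return 0;
}

With a limit around a billion, even a cheap syscall that immediately fails with EBADF adds up to minutes of wall-clock time.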
Let’s look at our strace again to see if there’s any hint as to which case we’re hitting.
...
openat(AT_FDCWD, "/proc/76/fd", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
prlimit64(0, RLIMIT_NOFILE, NULL, <unfinished ...>
<... prlimit64 resumed>{rlim_cur=1073741816, rlim_max=1073741816}) = 0
close(3) = 0
close(4) = -1 EBADF (Bad file descriptor)
close(5) = 0
close(6) = 0
close(7) = 0
close(8) = -1 EBADF (Bad file descriptor)
close(9) = -1 EBADF (Bad file descriptor)
close(10) = -1 EBADF (Bad file descriptor)
close(11) = -1 EBADF (Bad file descriptor)
...
It looks like we’re failing to open /proc/76/fd, then falling back to the brute-force approach. This could be because the process has already changed to a different user, but the error code is ENOENT, not EACCES, so that isn’t it. The only other possibility is that we’re chroot’d into /home/gm, so /proc wouldn’t exist for the process.
We test this out by removing ChrootDirectory %h from sshd_config, and indeed the pause is gone 🎉. Disabling the chroot like this is acceptable because we’re running sftp in an unprivileged container with just one user, so there isn’t much an attacker could do if they gained access.
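In sshd_config terms, the workaround amounts to dropping the ChrootDirectory directive from the Match block sketched earlier. Again, this is a sketch; the actual atmoz/sftp configuration may be generated differently.

Subsystem sftp internal-sftp

Match User gm
    ForceCommand internal-sftp
    # ChrootDirectory %h   removed: it hides /proc and triggers the brute-force closefrom
    AllowTcpForwarding no
    X11Forwarding no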
That’s the workaround, but what’s the fix? In plain POSIX, I don’t think there is one; you have to somehow know which fds are open in order to close them, and there’s no way to mass-close fds. In the various BSDs, there’s closefrom and fcntl(_, F_CLOSEM, _). As of kernel 5.9, Linux gained the close_range(2) syscall, and OpenSSH is already using it on master. So, once that’s released, and once it filters down into distributions, we can re-enable the chroot.
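For reference, a Linux-native closefrom built on close_range(2) is essentially a one-liner. This is just a sketch of the interface, not OpenSSH’s actual code; it calls syscall(2) directly, assuming headers new enough to define SYS_close_range (glibc only gained a close_range wrapper in 2.34).

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Close every fd >= lowfd in a single syscall (Linux >= 5.9). */
int closefrom_linux(unsigned int lowfd)
{
	/* last = ~0U means "up to the highest possible fd"; flags = 0 */
	return (int) syscall(SYS_close_range, lowfd, ~0U, 0);
}

Crucially, the kernel already knows which fds are open, so this works regardless of chroot, /proc, or the RLIMIT_NOFILE ceiling.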
Conclusion
Ultimately, the “fix” is to disable the ChrootDirectory in sshd_config, and wait for an OpenSSH release with better behaviour. This isn’t the most interesting resolution, but at least logins don’t take forever anymore.