An oddity delaying kernel shutdown
A couple of years ago Dan Walsh introduced the SELinux sandbox, a way to confine the resources an application can access using SELinux and the Linux filesystem namespace functionality. Meanwhile we developed sVirt in libvirt to confine QEMU virtual machines, and QEMU itself has gained support for passing host filesystems straight through to the guest operating system, using a VirtIO based transport for the 9p filesystem. This got me thinking about whether it was now practical to create a sandbox based on QEMU, or rather KVM, by booting a guest with a root filesystem pointing to the host's root filesystem (readonly of course), combined with a couple of overlays for /tmp and /home, all protected by sVirt.
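To give a flavour of the setup, the QEMU invocation involved looks roughly like the sketch below. The paths, mount tags and kernel/initrd names here are made up for illustration, and the readonly export option needs a recent enough QEMU; in practice this would be run under libvirt so that sVirt can confine the QEMU process.

qemu-kvm \
    -kernel /boot/vmlinuz \
    -initrd sandbox-initrd.img \
    -append "console=ttyS0 quiet" \
    -virtfs local,path=/,mount_tag=hostroot,security_model=passthrough,readonly \
    -virtfs local,path=/var/lib/sandbox/tmp,mount_tag=sandboxtmp,security_model=passthrough \
    -virtfs local,path=/var/lib/sandbox/home,mount_tag=sandboxhome,security_model=passthrough \
    -nographic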
One prominent factor in the practicality is how much time the KVM and kernel startup sequences add to the overall execution time of the command being sandboxed. From Richard Jones' work on libguestfs I know that it is possible to boot to a functioning application inside KVM in < 5 seconds. The approach I take with 9pfs has a slight advantage over libguestfs because it does not incur the initial (one-time only per kernel version) delay for building a virtual appliance based on the host filesystem, since we're able to access the host filesystem directly from the guest. The fine details will have to wait for a future blog post, but suffice it to say, a stock Fedora kernel can be made to boot to the point of exec()ing the 'init' binary in the ramdisk in ~0.6 seconds, and the custom 'init' binary I use for mounting the 9p filesystems takes another ~0.2 seconds, giving a total boot time of 0.8 seconds.
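The core of what that custom 'init' does boils down to a few mount(2) calls along these lines. This is a heavily simplified sketch: the "hostroot" mount tag, the use of chroot rather than a proper root switch, and the final exec target are all illustrative assumptions, not the real code.

#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Staging directory inside the ramdisk for the host root export. */
    mkdir("/sysroot", 0755);

    /* Mount the host root filesystem exported over virtio-9p, readonly.
     * "hostroot" is the mount_tag configured on the QEMU side and is an
     * assumption for this sketch. */
    if (mount("hostroot", "/sysroot", "9p", MS_RDONLY,
              "trans=virtio,version=9p2000.L") < 0) {
        perror("mount 9p root");
        return 1;
    }

    /* The real init also mounts the writable /tmp and /home exports
     * here in the same way before handing over control. */

    if (chroot("/sysroot") < 0 || chdir("/") < 0) {
        perror("chroot");
        return 1;
    }

    execl("/bin/sh", "/bin/sh", (char *)NULL);
    perror("execl");
    return 1;
}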
Boot up time, however, is only one side of the story. For some application sandboxing scenarios, the shutdown time might be just as important as startup time. I naively thought that the kernel shutdown time would be unmeasurably short. It turns out I was wrong, big time. Timestamps on the printk messages showed that the shutdown time was in fact longer than the bootup time! The telling messages were:
[ 1.486287] md: stopping all md devices.
[ 2.492737] ACPI: Preparing to enter system sleep state S5
[ 2.493129] Disabling non-boot CPUs ...
[ 2.493129] Power down.
[ 2.493129] acpi_power_off called
which point a finger towards the MD driver. I was sceptical that the MD driver could be to blame, since my virtual machine does not have any block devices at all, let alone MD devices. To be sure though, I took a look at the MD driver code to see just what it does during kernel shutdown. To my surprise the answer is blindingly obvious:
static int md_notify_reboot(struct notifier_block *this,
                            unsigned long code, void *x)
{
        struct list_head *tmp;
        mddev_t *mddev;

        if ((code == SYS_DOWN) || (code == SYS_HALT) || (code == SYS_POWER_OFF)) {

                printk(KERN_INFO "md: stopping all md devices.\n");

                for_each_mddev(mddev, tmp)
                        if (mddev_trylock(mddev)) {
                                /* Force a switch to readonly even array
                                 * appears to still be in use. Hence
                                 * the '100'.
                                 */
                                md_set_readonly(mddev, 100);
                                mddev_unlock(mddev);
                        }
                /*
                 * certain more exotic SCSI devices are known to be
                 * volatile wrt too early system reboots. While the
                 * right place to handle this issue is the given
                 * driver, we do want to have a safe RAID driver ...
                 */
                mdelay(1000*1);
        }
        return NOTIFY_DONE;
}
In other words, regardless of whether you actually have any MD devices, it'll impose a fixed 1 second delay on your shutdown sequence :-(
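The fix is conceptually trivial: only pay for the delay when an MD array was actually found and switched to readonly. Something roughly along these lines does the job (an illustrative sketch only, not necessarily the exact patch that went upstream):

static int md_notify_reboot(struct notifier_block *this,
                            unsigned long code, void *x)
{
        struct list_head *tmp;
        mddev_t *mddev;
        int need_delay = 0;

        if ((code == SYS_DOWN) || (code == SYS_HALT) || (code == SYS_POWER_OFF)) {

                printk(KERN_INFO "md: stopping all md devices.\n");

                for_each_mddev(mddev, tmp)
                        if (mddev_trylock(mddev)) {
                                /* Force a switch to readonly even array
                                 * appears to still be in use. Hence
                                 * the '100'.
                                 */
                                md_set_readonly(mddev, 100);
                                mddev_unlock(mddev);
                                /* Remember that we actually touched an array. */
                                need_delay = 1;
                        }
                /*
                 * Only wait for volatile SCSI devices to settle if an
                 * array was actually stopped above; a system with no
                 * MD devices gets no delay at all.
                 */
                if (need_delay)
                        mdelay(1000*1);
        }
        return NOTIFY_DONE;
}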
With this kernel bug fixed, the total time my KVM sandbox spends running the kernel is reduced by more than 50%, from 1.9s to 0.9s. The biggest delay is now down to SeaBIOS & QEMU, which together take 2s to get from the start of QEMU's main() to finally jumping into the kernel entry point.
Doh! Thanks for sharing, hope you continue to write up posts like this in the future :).
Did you check the git blame history to see when and for which specific devices that unconditional delay was added?
Unfortunately the change predates the import of the kernel tree into GIT, so there’s no useful blame info to be found in GIT AFAICT.