It all began with some CloudLab machines I needed to use for “research”. CloudLab only offers CentOS, Ubuntu or FreeBSD images to provision new machines from. I have been daily driving Arch Linux for 3 years and very much enjoy its simplicity, so I wondered if it would be possible to put Arch Linux on the CloudLab machines. I vaguely remembered coming across a tutorial on installing NixOS from another running distro (turns out to be this), so it should also be doable for Arch.
The research was very boring and I didn’t like it at all, but by digging down this rabbit hole I learned some cool new system calls and got some more insight into Linux, which I am going to share here : )
This technique is also useful for VPSes. Some have a “custom ISO” feature where you can download any installer image, attach it to your VPS, then boot from it, while others don’t. Depending on your host of choice, it may or may not be possible to install your favourite distro in the regular way, but this nuclear option always works as long as you have root access to a running Linux system.
Those who have installed Arch Linux themselves should be very familiar with the procedures already, but for clarity I will go over it:
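Partition and format the disk, then mount the new root filesystem at /mnt. Make sure to mount sub-filesystems as well, e.g. /boot/efi.
Install the base system with the distro’s bootstrap tool. Arch uses pacstrap (basically pacman with a chroot) and Debian uses debootstrap.
Generate /etc/fstab for the newly installed system, then chroot into it to do some setup (e.g. locale, timezone and root password).
Install a bootloader.
Reboot into the new system.
We will largely be following the same flow, but cutting the last two steps in favor of kexec, for reasons we will cover later.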
We will need some space for the Arch Linux installer, whether it’s unused disk space (which you can create by shrinking existing ext4 partitions) or RAM with tmpfs. In my case, I am using the latter.
If you are doing this in a memory-bound environment (e.g. low-spec VPS), use disk space.
For one VPS I own, the host allows booting from a pre-selected array of installer images, but does not allow custom ones. They do provide netboot.xyz but it needs abundant memory to work. In this case I used a similar procedure, replacing the running installer with the Arch Linux installer. As CDROM is mounted read-only, the Arch installer has to fit somewhere else, and main memory is insufficient for it.
I always use LVM for my VPSes, so my way around that was to save a few gigs of free space in the VG for a temporary LV where the installer will reside. Later after the installation process I can reclaim this space if I wish, by deleting the installer LV and expanding the system LV.
A quick sanity check to make sure I have the necessary amount of RAM:
root@node0:~# free -g
               total        used        free      shared  buff/cache   available
Mem:             251           2         250           0           0         249
Swap:              7           0           7
We have more than enough! Good to proceed. Now we create a tmpfs to hold the Arch Linux environment that we will run the installation from.
root@node-0:~# mount -t tmpfs -o size=50G tmpfs /mnt
root@node-0:~# df -h | grep /mnt
tmpfs 50G 0 50G 0% /mnt
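The sanity check and the mount above can be combined into a small guard. This is only a sketch: the function names are made up for illustration, and the optional file argument exists purely so the logic can be exercised without touching the real /proc/meminfo.

```shell
# Print MemAvailable in whole GiB, read from a meminfo-style file.
# The file argument (default: /proc/meminfo) is a testing convenience.
mem_available_gib() {
    awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed only if the requested tmpfs size (in GiB) fits in available memory.
tmpfs_fits() {
    want_gib=$1
    have_gib=$(mem_available_gib "${2:-/proc/meminfo}")
    [ "$have_gib" -ge "$want_gib" ]
}

# Usage (as root):
#   tmpfs_fits 50 && mount -t tmpfs -o size=50G tmpfs /mnt
```

Keep in mind that a tmpfs does not reserve its size up front, so the check only tells you the space is available right now.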
With the new tmpfs, now it’s time to populate it with the Arch Linux rootfs files. The latest bootstrap tarball can be downloaded from any mirror listed on the Arch Linux website.
curl -O https://mirrors.edge.kernel.org/archlinux/iso/latest/archlinux-bootstrap-x86_64.tar.zst
tar --zstd -xvf archlinux-bootstrap-x86_64.tar.zst
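Since this tarball is about to become our root filesystem, it may be worth verifying it before extracting. The sketch below assumes the mirror publishes a sha256sums.txt next to the tarball (an assumption, check your mirror's directory listing); verify_download is a hypothetical helper name.

```shell
# Verify downloaded files against a sha256sums-style list located in the
# same directory. --ignore-missing skips entries for files we did not
# download; sha256sum still fails if no file at all could be verified.
verify_download() {
    sums_file=$1
    (
        cd "$(dirname "$sums_file")" &&
        sha256sum --check --ignore-missing --quiet "$(basename "$sums_file")"
    )
}

# Usage (the sha256sums.txt URL is assumed, not taken from the steps above):
#   curl -O https://mirrors.edge.kernel.org/archlinux/iso/latest/sha256sums.txt
#   verify_download ./sha256sums.txt && tar --zstd -xf archlinux-bootstrap-x86_64.tar.zst
```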
Extracting the tarball should give us the folder root.x86_64. It contains a complete filesystem needed for an Arch Linux environment.
root@node0:~# ls root.x86_64/
bin boot dev etc home lib lib64 mnt opt proc root run sbin srv sys tmp usr var version
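We will copy everything in this folder to our tmpfs, using -a to preserve ownership, permissions and symlinks (the trailing /. also picks up hidden files):
cp -a root.x86_64/. /mnt/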
At this point, we should already be able to chroot into /mnt and see that we have pacman available.
root@node-0:~# chroot /mnt pacman -V
.--. Pacman v7.0.0 - libalpm v15.0.0
/ _.-' .-. .-. .-. Copyright (C) 2006-2024 Pacman Development Team
\ '-. '-' '-' '-' Copyright (C) 2002-2006 Judd Vinet
'--'
This program may be freely redistributed under
the terms of the GNU General Public License.
But before we actually jump into the chroot, we have to make sure to properly handle special filesystems:
/run and /run/lock
/dev, /dev/pts and /dev/shm
/proc
/sys
You should be familiar with these if you have had to repair your Linux installation manually. Usually when you try to recover from a non-booting situation, you would mount your normal filesystems, bind mount these special ones from the live CD environment to the root fs of your broken installation, then chroot.
root@node-0:~# for fs in /run /dev /dev/shm /dev/pts /sys /proc; do mount -o bind $fs /mnt$fs; done
It is preferable to use mount --move here instead of bind mounts, as we will later need to move some of these filesystems anyway.
In this case, the command would be
for fs in /run /dev /sys /proc; do mount --move $fs /mnt$fs; done
Moving a mount also moves any child mounts within it, so we only need to move the top-level ones.
This fails unless the mountpoints are first made private. Read on for more details.
Here, instead of chroot-ing, we can use pivot_root, a utility wrapping the syscall of the same name. It changes the root mount, such that a specified directory becomes the new root and the old root is moved into a subdirectory of the new one. In the following command, we attempt to make /mnt the new root, and put the old Ubuntu root in /mnt/mnt, which becomes /mnt after pivoting.
root@node0:~# pivot_root /mnt /mnt/mnt
pivot_root: failed to change root from `/mnt' to `/mnt/mnt': Invalid argument
The command failed because of “Invalid argument”. In the man page pivot_root(2), EINVAL can indicate several different problems which we can check for. It turns out that in our case, this is the reason:
EINVAL Either the mount point at new_root, or the parent mount of that mount point, has propagation type MS_SHARED.
man 2 pivot_root
By “default”, the root filesystem (and all mountpoints within it) is marked as “shared”. This is because
systemd(1) automatically remounts all mounts as MS_SHARED on system startup. Thus, on most modern systems, the default propagation type is in practice MS_SHARED.
man 7 mount_namespaces
We have to first mark it as “private”. The r in --make-rprivate makes it recursive, so that the new root also becomes a private mount:
mount --make-rprivate /
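To see whether a given mount point is still shared, the optional fields in /proc/self/mountinfo can be inspected. The following is a rough sketch; mount_propagation is a hypothetical helper, and the second argument only exists so that the parsing can be tried on a sample file.

```shell
# Print "shared" or "not-shared" for a mount point, based on the optional
# fields of a mountinfo-style file (field 7 onwards, up to the "-" separator).
# If several mounts are stacked on the same mount point, the last one wins.
mount_propagation() {
    awk -v mp="$1" '
        $5 == mp {
            state = "not-shared"
            for (i = 7; i <= NF && $i != "-"; i++)
                if ($i ~ /^shared:/) state = "shared"
        }
        END { if (state) print state }
    ' "${2:-/proc/self/mountinfo}"
}

# Usage:
#   mount_propagation /       # expect "shared" before --make-rprivate
#   mount_propagation /mnt
```

If util-linux's findmnt is available, findmnt -o TARGET,PROPAGATION shows the same information for every mount.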
Now if we retry the pivot_root call, it will succeed, but we will not see any difference at surface level. This is because
pivot_root() changes the root directory and the current working directory of each process or thread in the same mount namespace to new_root if they point to the old root directory. (See also NOTES.)
On the other hand, pivot_root() does not change the caller’s current working directory (unless it is on the old root directory), and thus it should be followed by a chdir("/") call.
man 2 pivot_root
So the calling process (in our case, bash) effectively has an inconsistent view of the filesystem: it still believes we are working from within /root, but its CWD is now actually /mnt/root. We can fix this by running cd to any directory, e.g. cd /.
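Put together, the pivot boils down to three commands. The sketch below wraps them in a hypothetical run helper that only prints what it would do unless DRY_RUN=0 is set, which makes it easier to review the sequence before committing to it.

```shell
# Echo commands instead of running them unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; else echo "+ $*"; fi
}

# Make all mounts private, pivot into the new root, then fix up the CWD.
pivot_into() {
    new_root=$1
    run mount --make-rprivate /
    run pivot_root "$new_root" "$new_root/mnt"
    run cd /
}

# Usage:
#   pivot_into /mnt              # dry run, prints the commands
#   DRY_RUN=0 pivot_into /mnt    # actually pivots (as root)
```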
Now to inspect the mount namespace:
root@node0:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 126G 0 126G 0% /dev
tmpfs 26G 1.6M 26G 1% /run
/dev/sda3 63G 3.3G 56G 6% /mnt
tmpfs 126G 0 126G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/sda1 256M 8.7M 248M 4% /mnt/boot/efi
tmpfs 26G 12K 26G 1% /mnt/run/user/<REDACTED>
tmpfs 50G 633M 50G 2% /
Our / is now the tmpfs we previously created, and the Ubuntu root now resides in /mnt. It is now safe to unmount the old root…
root@node0:~# umount -R /mnt
umount: /mnt/run/user/<REDACTED>: target is busy.
… but no. I tried to figure out what was preventing the umount, but got no definitive answer. The best guess I have is anonymous inodes, as there are a lot of them. You cannot see whether these inodes are correlated with the particular filesystem you are trying to unmount though.
After all, there are still a bunch of running processes using resources from the old rootfs. We need to kill all of them to free up the old rootfs before we can unmount it. But be careful! The SSH daemon, sshd, is among these processes and is responsible for the session we are using to connect to the server. Killing sshd would lock us out of the system, bringing unnecessary trouble. Thus, we will treat sshd a bit differently, restarting it instead of killing it.
Before that, we must initialize pacman to install sshd in the Arch installer.
Standard procedure to initialize the keyring:
pacman-key --init && pacman-key --populate archlinux
At this point I realized that no mirror is enabled for pacman, so pacman cannot synchronize with the package repositories. Unfortunately the Arch bootstrap environment does not have an editor (none of vim, vi, emacs or nano). I ended up using a sed command to uncomment the Server lines (note that this enables every mirror in the list, not just the first):
sed -i "s/#Server/Server/" /etc/pacman.d/mirrorlist
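The sed command above uncomments every #Server line in the file. If you would rather enable only the first mirror, GNU sed's 0,/regex/ address form can limit the substitution to the first match. A minimal sketch, with enable_first_mirror being a made-up helper name:

```shell
# Uncomment only the first "#Server" line of a mirrorlist-style file.
# The 0,/re/ address is a GNU sed extension; the empty pattern in s//.../
# reuses the regex from the address.
enable_first_mirror() {
    sed -i '0,/^#Server/s//Server/' "$1"
}

# Usage:
#   enable_first_mirror /etc/pacman.d/mirrorlist
```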
Alternatively, use nano from the old Ubuntu rootfs (/mnt/usr/bin/nano) to edit the mirrorlist.
Now we can install OpenSSH:
pacman -Sy openssh
I also installed neovim so that I can edit /etc/ssh/sshd_config more comfortably. It is highly recommended to install an editor anyway.
The mechanism we will use to restart sshd is to send it a SIGHUP.
sshd rereads its configuration file when it receives a hangup signal, SIGHUP, by executing itself with the name and options it was started with, e.g. /usr/sbin/sshd
man 8 sshd
To make sure this would succeed, check how sshd was launched:
root@node0:/# ps -ax | grep sshd
1549 ? Ss 0:00 sshd: /usr/sbin/sshd -D [listener] 1 of 10-100 startups
*irrelevant entries omitted*
and double check that we now have /usr/sbin/sshd. Before we actually do this, edit the configuration file as necessary. I copied the host keys over from the old Ubuntu installation, so that I don’t need to re-verify them when I later reconnect:
cp /mnt/etc/ssh/ssh_host_* /etc/ssh/
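Because a failed re-exec would lock us out, a small pre-flight check before sending SIGHUP does not hurt. The sketch below pulls the executable path out of the process title shown by ps and checks that it exists under the current root; both function names are hypothetical, and the optional root prefix is only there for experimenting.

```shell
# Extract the executable path from an sshd listener process title as
# printed by ps, e.g. "sshd: /usr/sbin/sshd -D [listener] ...".
sshd_path_from_title() {
    sed -n 's|^sshd: \(/[^ ]*\).*|\1|p'
}

# Succeed if that path exists and is executable under the given root prefix
# (empty prefix means the current root).
sshd_reexec_ok() {
    path=$(printf '%s\n' "$1" | sshd_path_from_title)
    [ -n "$path" ] && [ -x "${2:-}$path" ]
}

# Usage:
#   sshd_reexec_ok "sshd: /usr/sbin/sshd -D [listener] 1 of 10-100 startups" \
#       && echo "safe to send SIGHUP"
```

Running sshd -t as well checks the configuration file and host keys for validity before the daemon is restarted.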
I also added my own public key to /root/.ssh/authorized_keys. Make sure the permissions are good! Finally, run
killall -HUP sshd
to actually restart it. The connection will be dropped, and we now need to reconnect using the root account and the corresponding private key. Now that sshd is again safely running under the Arch environment, we can kill off the remaining processes that still use /mnt:
lsof -F p +f -- /mnt | sed 's/^p//g' | xargs kill
But we still cannot unmount /mnt:
[root@node0 ~]# umount -R /mnt
umount: /mnt/run: target is busy.
This is because even after we have attempted to kill all processes using anything under /mnt, there is one process we won’t be able to kill: init, a.k.a. systemd. The kernel shields PID 1 from any signal it has not installed a handler for, so SIGKILL never reaches it, and systemd responds to SIGTERM by re-executing itself rather than exiting. Attempts to kill it will thus return normally but without the desired effect. We can confirm that systemd is the only process still keeping some special filesystems busy.
[root@node0 ~]# lsof /mnt/run/
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
systemd 1 root 56u FIFO 0,25 0t0 718 /mnt/run/dmeventd-server
systemd 1 root 58u FIFO 0,25 0t0 719 /mnt/run/dmeventd-client
Although it is impossible to unmount these filesystems, they merely reside in /mnt and have nothing to do with the filesystem underneath /mnt itself. We only need to move them out of /mnt to unmount /mnt. Fortunately, there is a tool we can leverage:
mount --move /mnt/dev /dev
It does exactly what we want: move the mount without unmounting and remounting it. Do the same for /run and /sys, and /mnt should now be safe to unmount.
umount -R /mnt # should succeed!!
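If the unmount still complains, it helps to list what remains mounted at or below /mnt. A small sketch that reads /proc/self/mountinfo (mounts_under is a hypothetical helper; the file argument is only for trying it on sample data):

```shell
# List mount points at or below a given directory, using the fifth field
# (mount point) of a mountinfo-style file.
mounts_under() {
    awk -v p="$1" '$5 == p || index($5, p "/") == 1 { print $5 }' \
        "${2:-/proc/self/mountinfo}"
}

# Usage:
#   mounts_under /mnt    # anything printed besides /mnt itself is a candidate for mount --move
```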
There are two ways to proceed from here. On one hand, we can wipe Ubuntu entirely, removing its partition or reformatting it for Arch as we wish; on the other hand, we can install Arch parallel to Ubuntu.
Normally, we would want the former, but CloudLab is a bit special. It embeds a few services of its own in the Ubuntu images. These services communicate with CloudLab’s control plane and enable managing the machines from CloudLab’s web UI. In particular, the “status” shown here is reported via such a mechanism. If we break it, CloudLab will think the host has failed and periodically attempt to physically reboot it. This effectively renders the host unusable.
Thus, I am keeping the Ubuntu installation intact and installing Arch in a separate partition. CloudLab machines do not use all available disk space for the OS, so there is plenty of free space available.
[root@node0 ~]# fdisk -l
Disk /dev/sdb: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: ST91000640NS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/sda: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: ST91000640NS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 916BB3F3-230A-44F6-8AFC-B0EE55667823
Device          Start        End    Sectors  Size Type
/dev/sda1        2048     526335     524288  256M EFI System
/dev/sda2      526336     528383       2048    1M BIOS boot
/dev/sda3      528384  134746111  134217728   64G Linux filesystem
/dev/sda99 1936746496 1953523711   16777216    8G Linux swap
To lay out the plan: I am not touching Ubuntu’s bootloader, so Ubuntu will boot up as usual each time the machine starts. CloudLab’s healthcheck service will then run and report the status properly. Luckily, CloudLab does not require heartbeats, so one single report is enough to make it happy.
But how are we going to use the new Arch installation then? We could use a chroot, but it does not give us the full power, as we cannot use our own kernel. Enter kexec, Linux’s syscall that allows replacement of the running kernel without a reboot.
Here is the small snippet I use to automate the installation process, but it’s basically just following the official installation guide.
# Create new partition next to the Ubuntu system partition
sys_drive=/dev/sda # change to yours if it differs
fdisk_output=$( (echo n; echo; echo; echo; echo w) | fdisk $sys_drive)
new_part="$sys_drive"$(echo $fdisk_output | grep -o -P "Created a new partition \K\d+")
# Format and mount the new partition
yes | mkfs.ext4 $new_part
mkdir /target
mount $new_part /target
# Install Arch Linux
echo nameserver 1.1.1.1 >> /etc/resolv.conf
pacstrap /target linux-lts grub base base-devel zsh tmux git dhcpcd openssh neovim
mkdir -p -m 700 /target/root/.ssh # pacstrap does not create this directory
cp /root/.ssh/authorized_keys /target/root/.ssh/
# Enter chroot to do some setup
arch-chroot /target bash <<EOF
set -x
echo nameserver 1.1.1.1 >> /etc/resolv.conf
echo $HOSTNAME > /etc/hostname
systemctl enable dhcpcd sshd
useradd -m -s /usr/bin/zsh -G wheel saltyfish
sed -i "s/# %wheel ALL=(ALL:ALL) NOPASSWD/%wheel ALL=(ALL:ALL) NOPASSWD/" /etc/sudoers
mkdir /home/saltyfish/.ssh
chown -R saltyfish:saltyfish /home/saltyfish/.ssh
EOF
kexec
Now we are finished with the (half-complete) installation. We can try booting into it. We will be using the kexec system call. There is a utility wrapping around it with the same name. In both Arch Linux and Ubuntu, install the package kexec-tools to have it.
To run a Linux kernel, you need at least three things: (1) the kernel itself, (2) the initramfs, and (3) kernel parameters. You will use kexec -l to load these before running kexec -e.
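Between the two steps, the kernel exposes whether an image is currently loaded through /sys/kernel/kexec_loaded, which reads 1 once kexec -l has succeeded. A tiny guard along these lines (the function name is made up, and the path argument is only there for experimenting) avoids firing kexec -e with nothing loaded:

```shell
# Succeed if a kexec image is currently loaded. Reads the kernel's
# kexec_loaded flag; an alternative path can be passed for testing.
kexec_image_loaded() {
    [ "$(cat "${1:-/sys/kernel/kexec_loaded}" 2>/dev/null)" = 1 ]
}

# Usage:
#   kexec_image_loaded && kexec -e
```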
Back in the installer environment, the kernel we want to run is /target/boot/vmlinuz-linux-lts (because I installed linux-lts) and the initramfs is /target/boot/initramfs-linux-lts.img. These are easy to find and fill in, but the kernel parameters are a bit trickier. Let’s start by looking at Ubuntu’s parameters:
[root@node0 ~]# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0-53-generic root=UUID=ccb0832d-c80e-430f-b248-d117889f1992 emulabcnet=34:17:eb:e5:59:d5 ro console=ttyS1,115200
BOOT_IMAGE: Points to the kernel in use. Presumably there to enable certain userspace programs to identify the kernel. We should change it to /boot/vmlinuz-linux-lts, the kernel’s location from the new installation’s POV.
root: Points to the partition containing the root filesystem. The initramfs will later pivot_root (or switch_root, see man page pivot_root(2) for details) to it to finalize the boot procedure. We should set it correspondingly.
emulabcnet: Looks like CloudLab-specific stuff. Safe to drop.
ro: Mount the root filesystem read-only during early boot. mkinitcpio’s init already defaults to ro when neither ro nor rw is given, so it does not matter much and is safe to drop.
console=ttyS1,115200: Enables serial console so that CloudLab’s “console” feature works. Safe to drop, but better keep it as-is.
To find the correct root UUID, run
blkid -o value -s UUID $new_part
where $new_part is your partition containing the root filesystem. Append the printed UUID to UUID= to make the value for the root parameter.
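The parameter-by-parameter reasoning above can also be expressed as a small filter over the old command line. This is just a sketch under the same decisions (drop BOOT_IMAGE, root, emulabcnet and ro, keep the rest); rewrite_cmdline is a hypothetical helper and the kernel path is the linux-lts one used in this post.

```shell
# Rewrite an existing kernel command line for the new installation:
# drop BOOT_IMAGE, root, emulabcnet and ro, keep everything else
# (e.g. console=), and prepend the new BOOT_IMAGE and root values.
# Relies on plain word splitting of the old command line.
rewrite_cmdline() {
    old=$1
    uuid=$2
    kept=
    for arg in $old; do
        case $arg in
            BOOT_IMAGE=*|root=*|emulabcnet=*|ro) ;;
            *) kept="$kept $arg" ;;
        esac
    done
    echo "BOOT_IMAGE=/boot/vmlinuz-linux-lts root=UUID=$uuid$kept"
}

# Usage:
#   rewrite_cmdline "$(cat /proc/cmdline)" "$(blkid -o value -s UUID "$new_part")"
```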
The final parameters to use for our new installation are thus BOOT_IMAGE=/boot/vmlinuz-linux-lts root=UUID=<YOUR_UUID> console=ttyS1,115200. It’s time to put everything together:
kexec -l /target/boot/vmlinuz-linux-lts --initrd=/target/boot/initramfs-linux-lts.img --command-line="BOOT_IMAGE=/boot/vmlinuz-linux-lts root=UUID=<YOUR_UUID> console=ttyS1,115200"
If this finishes successfully, we can now kexec -e. Your SSH session should freeze after you issue this command because the kernel is abruptly replaced, not giving sshd time to gracefully shut down the connection. You can kill the ssh process or wait for it to time out. It should not take long to boot up the new kernel, so you should shortly be able to log in again. After you log in, you can confirm you’re on the Arch-packaged kernel using uname -a.
From here, simply start enjoying the power of Arch. If you ever need to reboot, the machine will still reboot into Ubuntu, but you can again use kexec to switch over to Arch. (Attention: you will need to mount the Arch partition to find the kernel and initramfs files.)
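For those later reboots, the switch-over could be scripted roughly as follows. This is a sketch, not a tested tool: boot_arch and run are hypothetical helpers, the partition and UUID are placeholders you pass in, and by default it only prints the commands (set DRY_RUN=0 to execute them).

```shell
# Echo commands instead of running them unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; else echo "+ $*"; fi
}

# Mount the Arch partition, load its kernel and initramfs, then jump into it.
# The image stays loaded in memory, so the partition can be unmounted first.
boot_arch() {
    part=$1
    uuid=$2
    run mkdir -p /target
    run mount "$part" /target
    run kexec -l /target/boot/vmlinuz-linux-lts \
        --initrd=/target/boot/initramfs-linux-lts.img \
        --command-line="BOOT_IMAGE=/boot/vmlinuz-linux-lts root=UUID=$uuid console=ttyS1,115200"
    run umount /target
    run sync
    run kexec -e
}

# Usage (from Ubuntu, as root; partition and UUID are examples):
#   boot_arch /dev/sda4 "$(blkid -o value -s UUID /dev/sda4)"
#   DRY_RUN=0 boot_arch /dev/sda4 "$(blkid -o value -s UUID /dev/sda4)"
```

Since kexec -e skips the normal shutdown sequence, syncing and unmounting whatever you can beforehand is a cheap precaution.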
It is, in theory, possible to port the necessary CloudLab components to Arch and create a systemd unit for them so we no longer need Ubuntu, but I will leave it to Future Work™ to figure out how.