Setting up CUDA GPU Passthrough in Linux Containers (LXC)

8.17.2015

This is a technical article about how to get CUDA passthrough working in a particular Linux container implementation, LXC. We will demonstrate GPU passthrough for LXC, with a short CUDA example program.

Linux containers can be used for many things. We are going to set up something which is like a light-weight virtual machine. This can then be used to help with clean builds, testing, or to help with deployment. Linux containers can also support limiting resource access, resource prioritization, resource accounting and process freezing.

Linux Container Implementation

Linux containers are built on two Linux kernel features: cgroups (https://en.wikipedia.org/wiki/Cgroups) and namespace isolation. Several projects build on these kernel features to make them a bit easier to use. We are going to use one called LXC. You can read about it here: https://linuxcontainers.org/.

How to Get CUDA in a Container

Here is how to get CUDA working in a container on Ubuntu Server 15.04. First install Ubuntu server 15.04 in the usual way.

Update, and install LXC

We want to use the latest LXC:

sudo add-apt-repository ppa:ubuntu-lxc/lxc-stable

Then we can update the system like this:

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install lxc

After doing this, it is probably a good idea to reboot, to avoid problems with the systemd upgrade bug described below.

If you get a problem with systemd or ‘Connection timed out’ when running apt-get, try rebooting the machine and re-running apt-get. You might need to run one or both of the following

sudo apt-get install -f

sudo dpkg --configure -a

after rebooting to finish the apt-get install. I think this is because of a bug in the Ubuntu package upgrade for one of the dependencies of systemd. If ‘sudo reboot’ hangs, you can suspend it with ctrl-z (it might eventually time out with an error after several minutes), then run ‘sync;sync;sync;’ and do a hard reset. The system should recover OK after you run the apt-get/dpkg recovery commands above.

Install Nvidia CUDA driver

Next we’ll install the Nvidia driver on the host operating system. We will install from the Nvidia driver .run installer.
You will probably have an issue where the Nouveau kernel module has been loaded by Ubuntu. We don’t want this because it conflicts with the Nvidia driver kernel module.
Let’s fix this issue. Create this file, ‘/etc/modprobe.d/nvidia-installer-disable-nouveau.conf’, with these contents:

blacklist nouveau
options nouveau modeset=0

Then we should reboot so that we are running without the Nouveau module loaded.
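After the reboot it is worth confirming that Nouveau really is gone before running the Nvidia installer. Here is a minimal sketch of such a check. The function name module_loaded is made up for this example, and the optional second argument exists only so the function can be tried against a sample file. If the module is still listed, it may be loaded from the initramfs, in which case regenerating it (on Ubuntu, ‘sudo update-initramfs -u’) and rebooting again might help.

```shell
#!/bin/sh
# Report whether a kernel module appears in a /proc/modules-style listing.
# module_loaded is a hypothetical helper name, not part of any package.
# The second argument is optional and defaults to /proc/modules.
module_loaded() {
    name=$1
    listing=${2:-/proc/modules}
    # The module name is the first whitespace-separated field of each line.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$listing"
}

# Example use after rebooting:
#   if module_loaded nouveau; then echo "nouveau is still loaded"; fi
```

The function returns success (0) when the module is listed, so it can be used directly in an if statement.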
Here is where you can get the driver from Nvidia: http://www.nvidia.com/object/unix.html.
I used Linux x86_64/AMD64/EM64T – Latest Long Lived Branch version: 352.21, which has the filename ‘NVIDIA-Linux-x86_64-352.21.run’. This driver is compatible with CUDA 7.0.
The installer compiles the kernel module on the host, so we’ll need gcc and make.

sudo apt-get install gcc make

Then install it:

sudo sh ./NVIDIA-Linux-x86_64-352.21.run

We don’t care about the Xorg stuff on a server. When the installer asks about it, just ignore it or tell the installer to do nothing.
You can check that CUDA is working on the host machine at this point by installing the CUDA SDK and compiling and running a simple CUDA program. There is an example program at the bottom of this post. There is also a precompiled exe linked at the bottom of the post; if it works on your system, you can avoid having to install the CUDA SDK on the host at this time.

Instead of using the Nvidia .run installer, you could try to use the Ubuntu packages. The benefits of installing Nvidia via the distribution packages are:

for a production machine you don’t need to install g++, etc., and then clean up g++ and its dependencies after the driver is installed;
you can update the driver more easily;
the packages should avoid any conflicts with the Nouveau driver;
if you run Xorg, it should be less error prone to get it working than when you are using the Nvidia .run installer.

There can be problems with installing from the Ubuntu packages:

The default Nvidia packages are not kept up to date enough so they are often not new enough for the latest released CUDA version. This isn’t the case right now with Ubuntu 15.04 and CUDA 7.0 – the versions are compatible if you install the nvidia-346-updates package in Ubuntu.
Often, the dependencies for packages for the Nvidia drivers/CUDA pull in huge amounts of Xorg, Gnome and other stuff which we are not interested in when we just want to run CUDA apps on a server.
Packaged drivers sometimes have poor permissions which makes the Nvidia driver only usable by root so you can’t run CUDA as a normal user. This can be fixed e.g. by changing the group on the /dev/nvidia* files and then adding your user to this group. You can debug issues with permissions by seeing if a CUDA program will run as root, but not as a normal user. Sometimes, you will not be able to run CUDA as a normal user after booting, but you will be able to run as a normal user after running a single CUDA program as root first.
Packaged drivers sometimes put files in weird places. I only noticed this with Ubuntu, which puts libnvidia-ml.so and some other files in a really useless place. You can work around this by e.g. symlinking these files to /usr/local/lib. You often won’t need these files so you might not notice this issue.

These issues will sometimes apply to installing the Nvidia drivers on other Linux distributions.

Prepare for unprivileged containers

We are going to run an unprivileged container. This means our container will be created and run under our normal user and not under the root user. We need to do a little manual set up to make this work:
Edit /etc/lxc/lxc-usernet and add the line:

your-username veth lxcbr0 10

Replace your-username with the user you are using. This is to support networking to the container.
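If you script this step, it helps to make it safe to run more than once. The sketch below appends a line to a file only when that exact line is not already there. The helper name ensure_line is made up for this example, and writing to /etc/lxc/lxc-usernet needs root, so the real call would have to run in a root shell.

```shell
#!/bin/sh
# Append a line to a file only if an identical line is not already present.
# ensure_line is a hypothetical helper name used for illustration.
ensure_line() {
    file=$1
    line=$2
    # -x matches whole lines, -F treats the pattern as a fixed string.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example use (as root):
#   ensure_line /etc/lxc/lxc-usernet "your-username veth lxcbr0 10"
```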
Do the following:

mkdir -p ~/.config/lxc
cp /etc/lxc/default.conf ~/.config/lxc/default.conf

and add these lines to ‘~/.config/lxc/default.conf’:

lxc.id_map = u 0 100000 65536
lxc.id_map = g 0 100000 65536

These should match the numbers in /etc/subuid, /etc/subgid for your user.
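Rather than copying the numbers by hand, you can generate the two lines from the subuid and subgid files. This is a minimal sketch: the function name id_map_lines is made up for this example, and it assumes each line of those files has the usual name:first_id:count form and that your user has a single entry in each.

```shell
#!/bin/sh
# Print the lxc.id_map lines for a user from subuid/subgid-style files.
# Each line of those files is assumed to have the form  name:first_id:count
# id_map_lines is a hypothetical helper name used for illustration.
id_map_lines() {
    user=$1
    subuid=${2:-/etc/subuid}
    subgid=${3:-/etc/subgid}
    # Use the first matching entry in each file.
    awk -F: -v u="$user" '$1 == u { print "lxc.id_map = u 0 " $2 " " $3; exit }' "$subuid"
    awk -F: -v u="$user" '$1 == u { print "lxc.id_map = g 0 " $2 " " $3; exit }' "$subgid"
}

# Example use:
#   id_map_lines "$(id -un)" >> ~/.config/lxc/default.conf
```

If the function prints nothing, your user has no entry in those files and one needs to be added first.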

Create the LXC Container

lxc-create -t download -n mycontainer -- --dist ubuntu --release vivid --arch amd64

Add the Nvidia devices to the container: edit the file ‘~/.local/share/lxc/mycontainer/config’ and add these lines to the bottom:

lxc.mount.entry = /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry = /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
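On a machine with more than one GPU there will be further device files (/dev/nvidia1 and so on), and each one needs its own entry. The sketch below prints an entry in the same format as the lines above for every device path it is given. The function name nvidia_mount_entries is made up for this example.

```shell
#!/bin/sh
# Print one lxc.mount.entry line per device path given as an argument.
# nvidia_mount_entries is a hypothetical helper name used for illustration.
nvidia_mount_entries() {
    for dev in "$@"; do
        # The mount target is relative to the container root filesystem,
        # so the leading slash is dropped for the second field.
        printf 'lxc.mount.entry = %s %s none bind,optional,create=file\n' "$dev" "${dev#/}"
    done
}

# Example use:
#   nvidia_mount_entries /dev/nvidia[0-9]* /dev/nvidiactl /dev/nvidia-uvm \
#       >> ~/.local/share/lxc/mycontainer/config
```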

Set up the LXC container for access via ssh, and with a normal user which can use sudo:

lxc-start -n mycontainer
lxc-attach -n mycontainer

After running lxc-attach, the console you are on is a root prompt in the container. Run

apt-get install openssh-server
adduser myuser
usermod -a -G sudo myuser

Use ctrl-d to exit the container back to the host system.
We can get the IP address of the container so we can log in with ssh:

lxc-info -n mycontainer
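If you want the address in a script, you can filter it out of the lxc-info output. This is a minimal sketch that assumes the output contains a line starting with ‘IP:’ followed by an IPv4 address, which is what I would expect from lxc-info but may differ between versions. The function name container_ip is made up for this example.

```shell
#!/bin/sh
# Read lxc-info output on stdin and print the first IPv4 address found on
# a line whose first field is "IP:". The output layout is an assumption.
# container_ip is a hypothetical helper name used for illustration.
container_ip() {
    awk '$1 == "IP:" && $2 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { print $2; exit }'
}

# Example use:
#   ip=$(lxc-info -n mycontainer | container_ip)
```

An empty result means the container has no address yet, for example because it is not running.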

Use the IP address from the output of lxc-info in the two following commands (10.0.3.333 is a placeholder).
First copy the Nvidia driver installer into the container:

scp NVIDIA-Linux-x86_64-352.21.run myuser@10.0.3.333:

Log into the container:

ssh 10.0.3.333 -l myuser

The basic driver setup includes adding a kernel module, and adding a bunch of .so files and a few extra bits. Inside the container we don’t want to try to add the kernel module since we are using the host kernel with the module already loaded. We can install without the kernel module like this:

sudo sh ./NVIDIA-Linux-x86_64-352.21.run --no-kernel-module

(We don’t need to install g++ or make in the container to run this because we are not installing the kernel module.)
At this point, if you have a simple CUDA test exe, you can scp it into the container and check it runs OK.
Now you can install the CUDA SDK using the .run from Nvidia inside the container, or copy your CUDA binaries into the container and you are good to go. If you install the CUDA SDK from the .run, make sure you don’t try to install/upgrade/replace the Nvidia driver during the CUDA SDK installation. You can use something like this:

# make sure we have the prerequisites to install the SDK, and g++ so
# we can use the SDK
sudo apt-get update
sudo apt-get install perl-modules g++
# install the sdk without the driver
# ($HOME is used because the shell does not expand ~ inside this option)
sh ./cuda_7.0.28_linux.run -toolkit -toolkitpath=$HOME/cuda-7.0 -silent -override

You probably don’t want to do normal development in a container, and you definitely want to avoid leaving the CUDA SDK or g++, make, etc. installed on either the host or container for production. One good use for installing the CUDA SDK into a container is to create a convenient way to do repeatable production builds of your CUDA exes.

I have had some intermittent problems with the permissions on the /dev/nvidia* files inside the container which I haven’t got to the bottom of yet. I think this is because the Nvidia driver tries to automatically create or update the /dev/ files when you run a CUDA program and this doesn’t work correctly in the container (partly because our container runs without root permissions). Sometimes this can be fixed by restarting the container. To restart the container, run this on the host system:

lxc-stop -n mycontainer
lxc-start -n mycontainer

This doesn’t always work though. You can check if the permissions have gone weird using this inside the container:

~$ ls /dev/nvidia* -l
crw-rw-rw- 1 nobody nogroup 195,   0 Jul 15 08:24 /dev/nvidia0
crw-rw-rw- 1 nobody nogroup 195, 255 Jul 15 08:24 /dev/nvidiactl
crw-rw-rw- 1 nobody nogroup 248,   0 Jul 15 08:24 /dev/nvidia-uvm

If the permissions, owners, or device major and minor numbers are different to the above, or any of the files are missing, then there is a problem.
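The same check can be scripted. The sketch below reads the ls output and reports any of the three device files that is missing or does not have crw-rw-rw- permissions. It does not compare owners or major and minor numbers, so those still need a look by eye. The function name check_nvidia_nodes is made up for this example.

```shell
#!/bin/sh
# Read "ls -l /dev/nvidia*" output on stdin and check that each expected
# device node is listed as a character device with mode rw-rw-rw-.
# check_nvidia_nodes is a hypothetical helper name used for illustration.
check_nvidia_nodes() {
    awk '
        # $1 is the mode string, the last field is the path.
        $1 ~ /^crw-rw-rw-/ { ok[$NF] = 1 }
        END {
            n = split("/dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm", want, " ")
            bad = 0
            for (i = 1; i <= n; i++)
                if (!(want[i] in ok)) { print "problem: " want[i]; bad = 1 }
            exit bad
        }'
}

# Example use inside the container:
#   ls -l /dev/nvidia* | check_nvidia_nodes && echo "device files look OK"
```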
If restarting the container doesn’t fix the issue, you could try the following in various orders:

make sure the permissions on the host dev files are correct and accessible to your user
restart the host
run a cuda exe on the host as root
run a cuda exe on the host as normal user
run a cuda exe in the container as root
run a cuda exe in the container as normal user

I didn’t see any problems like this following the instructions in this post directly, but only when experimenting and trying different things.

I have an LXC Container already

If you already have an LXC container, you can do the following:

make sure you have Nvidia driver installed in the host system, and the /dev files have the right permissions
edit the container config file to add entries for the Nvidia devices
restart the container to make the Nvidia devices appear in it
install the Nvidia driver in the container without the kernel module
install the CUDA SDK without the driver or use your CUDA binaries in the container

This should work for privileged containers also.
For non-LXC containers, you will need to figure out how to make the Nvidia device files on the host available in the container. You will also need the Nvidia driver installed on the host, and the driver files available in the container, either by installing the driver there without the kernel module or by exposing the files from the host.

Notes

Maybe you want to try running the container on something other than Ubuntu 15.04.
You can install the latest stable LXC release from source on your distribution of choice, install Nvidia driver on the host system, then create a container as above. On different systems, the big difference is likely to be in the networking setup for the container. Also, on some systems you will have to add some entries to the /etc/subuid and /etc/subgid files.
One thing to be aware of: I think the Nvidia driver files (.so files etc.) have to match the kernel module version, so you need to make sure the versions are exactly the same in the host and the container. This might be tricky, e.g. if you install the Nvidia driver on the host using the host packaging system and then try to run a different Linux distribution in the container. The CUDA SDK version doesn’t need to match the Nvidia driver version; it just needs to be a compatible version. Running a CUDA program will tell you if the Nvidia driver you have is compatible with your CUDA exe or not.
The other issues are the possible problems with Nvidia permissions on the host (easily solved), and the device/permissions issues mentioned in the sidebar above.
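To compare the driver versions mentioned above, one option is to read the kernel module version on the host and compare it with the version suffix of the driver libraries in the container. The sketch below does the text handling only. It assumes that /proc/driver/nvidia/version contains a line of the form ‘NVRM version: ... Kernel Module  352.21  ...’ and that the driver library is named like libcuda.so.352.21. Both function names are made up for this example.

```shell
#!/bin/sh
# Extract the driver version from /proc/driver/nvidia/version-style text.
# Assumed layout: "NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.21  <date>"
# kernel_module_version is a hypothetical helper name used for illustration.
kernel_module_version() {
    src=${1:-/proc/driver/nvidia/version}
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p' "$src"
}

# Print the version suffix of a library file name such as libcuda.so.352.21.
# library_version is a hypothetical helper name used for illustration.
library_version() {
    printf '%s\n' "${1##*.so.}"
}

# Example use: run kernel_module_version on the host, run library_version
# on the full-version libcuda.so.* file in the container, and compare.
```

Note that library_version only makes sense for the fully versioned file name, not for a symlink such as libcuda.so.1.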

CUDA example test executable

Here is a small CUDA test program which can be used to check if CUDA is working on a system. The expected output is:

16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46

If there is a problem, you will almost certainly get an error message, so you shouldn’t need to go through the output to make sure the numbers match!
Paste the below into hellocuda.cu and then compile with

nvcc hellocuda.cu -o hellocuda

You have to have the CUDA SDK installed to compile this.

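#include <cstdlib>
#include <iostream>
using namespace std;
// Print the CUDA error with its source location and exit if a call failed.
void cudaCheckImpl(cudaError_t err, const char *file, int line) {
   if (err != cudaSuccess) {
       cerr << "cuda error: " << cudaGetErrorString(err)
            << " at " << file << ":" << line << endl;
       exit(-1);
   }
}
// Wrap CUDA runtime calls so that failures are reported where they occur.
#define cudaCheck(ans) { cudaCheckImpl((ans), __FILE__, __LINE__); }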
// Each thread adds one pair of elements; launched below as one block of N threads.
__global__  void add(int *a, int *b, int *c)
{
   c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
int main()
{
    const int N = 16;
    int a[N] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
    int b[N] = {16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31};
    const size_t sz = N * sizeof(int);
    int *da;
    cudaCheck(cudaMalloc(&da, sz));
    int *db;
    cudaCheck(cudaMalloc(&db, sz));
    int *dc;
    cudaCheck(cudaMalloc(&dc, sz));
    cudaCheck(cudaMemcpy(da, a, sz, cudaMemcpyHostToDevice));
    cudaCheck(cudaMemcpy(db, b, sz, cudaMemcpyHostToDevice));
    add<<<1, N>>>(da, db, dc);
    cudaCheck(cudaGetLastError());
    int c[N];
    cudaCheck(cudaMemcpy(c, dc, sz, cudaMemcpyDeviceToHost));
    cudaCheck(cudaFree(da));
    cudaCheck(cudaFree(db));
    cudaCheck(cudaFree(dc));
    for (unsigned int i = 0 ; i < N; ++i) {
        cout << c[i] << " ";
    }
    cout << endl;
    return 0;
}
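If you want to run this check from a script, for example after every container restart, you can compare the program’s output with the expected numbers. The sketch below is one way to do it. The function name hellocuda_ok is made up for this example, and the comparison ignores spacing because the program prints a trailing space after the last number.

```shell
#!/bin/sh
# Compare a program's output on stdin with the expected sums, ignoring
# differences in spacing.
# hellocuda_ok is a hypothetical helper name used for illustration.
hellocuda_ok() {
    expected='16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46'
    # Collapse all runs of whitespace to single spaces and trim both ends.
    actual=$(tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//')
    [ "$actual" = "$expected" ]
}

# Example use:
#   ./hellocuda | hellocuda_ok && echo "CUDA works"
```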
