PowerAI Through Docker Containers Inside IBM Spectrum LSF

The overall process consists of the following steps:

  • Install Docker.
  • Install NVIDIA Docker plugin.
  • Create base NVIDIA Docker images.
  • Create PowerAI Docker image.

Components Overview

Component         Version
CUDA Toolkit      8.0.61-1
cuDNN             6.0.21-1
Docker            17.06.1-ce
IBM Spectrum LSF  10.1.0.1
NVIDIA Docker     1.0.1
NVIDIA Drivers    375.51-1
PowerAI           4.0.0

The NVIDIA Docker plugin requires a system featuring:

  • NVIDIA driver version 361.93.03 or higher.
  • CUDA version 8.0.44 or higher.
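The two minimum versions above can be checked before installing the plugin. The following is a minimal sketch, assuming GNU `sort -V` is available (as it is on RHEL 7); the helper names `version_ge` and `check_requirement` are hypothetical and not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# Sketch: compare installed versions against the plugin requirements.

# version_ge A B succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_requirement LABEL FOUND REQUIRED prints a one-line verdict.
check_requirement() {
    local label="$1" found="$2" required="$3"
    if version_ge "$found" "$required"; then
        echo "OK: ${label} ${found} >= ${required}"
    else
        echo "FAIL: ${label} ${found} < ${required}"
        return 1
    fi
}

# Example usage on a GPU host (requires the NVIDIA driver to be installed):
#   driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   check_requirement "NVIDIA driver" "$driver" 361.93.03
```

With the versions from the components table, both `version_ge 375.51 361.93.03` and `version_ge 8.0.61 8.0.44` succeed.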

Install & Configure Docker

The Docker installation is performed through the following procedure:

  • Set up the Docker repository from Unicamp:

    [unicamp]
    baseurl=http://ftp.unicamp.br/pub/ppc64el/rhel/7/docker-ppc64el/
    enabled=1
    gpgcheck=0
    name=Unicamp
    
  • Install the Docker package from the Docker repository:

    $ yum install docker-ce
    
  • Create the default configuration file for the Docker daemon with the following content:

    /etc/docker/daemon.json
    
    • Declare the local Docker registry (to be created):
      {
       "insecure-registries": ["registry:5000"]
      }
      
  • Validate the Docker installation:

    • Check that the Docker daemon is running by issuing the following command:

      $ systemctl status docker
      
    • If not already running, start the Docker daemon:

      $ systemctl start docker
      
    • Open a command-line terminal and run some basic Docker commands:

    • Display the Docker client and server version information:

      $ docker version
      
    • Search for available ppc64le images on the official Docker public registry:

      $ docker search ppc64le
      
    • Try running an Ubuntu container on your RHEL host with the following command:

      $ docker run --rm -it ppc64le/ubuntu /bin/bash
      root@6bfb0623bce7:/# cat /etc/lsb-release
      DISTRIB_ID=Ubuntu
      DISTRIB_RELEASE=16.04
      DISTRIB_CODENAME=xenial
      DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"
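
The two configuration files from the steps above can also be generated from a script. This is a minimal sketch under stated assumptions: the repository file name `unicamp.repo` and the function names are hypothetical, and the optional first argument is a staging prefix introduced here so the files can be inspected before being written to the real system.

```shell
#!/usr/bin/env bash
# Sketch: write the yum repository file and the Docker daemon configuration.
# Pass a staging directory as the first argument, or omit it to write
# directly under / (which requires root).

write_docker_repo() {
    local root="${1:-}"
    mkdir -p "${root}/etc/yum.repos.d"
    # Assumption: the repository file is named unicamp.repo.
    cat > "${root}/etc/yum.repos.d/unicamp.repo" <<'EOF'
[unicamp]
name=Unicamp
baseurl=http://ftp.unicamp.br/pub/ppc64el/rhel/7/docker-ppc64el/
enabled=1
gpgcheck=0
EOF
}

write_daemon_json() {
    local root="${1:-}"
    mkdir -p "${root}/etc/docker"
    # Declares the local registry as insecure (plain HTTP), as in the step above.
    cat > "${root}/etc/docker/daemon.json" <<'EOF'
{
  "insecure-registries": ["registry:5000"]
}
EOF
}

# Example usage:
#   write_docker_repo && write_daemon_json && systemctl restart docker
```

The Docker daemon reads daemon.json at startup, so it has to be restarted after the file changes.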
      

Install & Configure NVIDIA Docker

To use the host GPUs from within Docker containers, the NVIDIA Docker plugin must be installed through the following additional steps:

  • Get the latest NVIDIA Docker source package for the ppc64le architecture from GitHub:

    $ wget https://github.com/NVIDIA/nvidia-docker/archive/ppc64le.zip
    
  • Extract the source package:

    $ unzip ppc64le.zip
    
  • Build NVIDIA Docker:

    $ cd nvidia-docker-ppc64le
    $ make
    $ make install
    
  • When the build completes, the following two binaries are available inside the tools/bin subdirectory:

    • nvidia-docker
    • nvidia-docker-plugin

The build process runs inside a Docker image that is retrieved from the Docker public registry. Access to this remote registry is therefore required.

  • Check the NVIDIA Docker version:

    $ bin/nvidia-docker-plugin -v
    NVIDIA Docker plugin: 1.0.1
    
  • Create the NVIDIA Docker service by adapting the following configuration file:

    $ vi /usr/lib/systemd/system/nvidia-docker.service
    

  • Start the NVIDIA Docker service:
    $ systemctl daemon-reload
    $ systemctl start nvidia-docker
    

Install & Configure Docker Registry

  • Install the Docker Registry package from the Docker repository:

    $ yum install docker-distribution
    
  • Adjust the default configuration by modifying the following configuration file:

    /etc/docker-distribution/registry/config.yml
    
    • Specify the custom storage location for Docker images:
      /install/custom/registry
      
  • Start the private registry:

    $ systemctl start docker-distribution
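
The storage location change can be scripted. The sketch below assumes the registry configuration uses the filesystem storage driver, whose location is set by the `rootdirectory` key under `storage.filesystem`; the function name `set_registry_root` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: point the registry's filesystem storage at a custom directory.

# set_registry_root CONFIG DIR rewrites the value of the rootdirectory key,
# preserving its indentation and the rest of the file.
set_registry_root() {
    local config="$1" dir="$2" tmp
    tmp="$(mktemp)"
    sed "s|^\([[:space:]]*rootdirectory:\).*|\1 ${dir}|" "$config" > "$tmp" \
        && cat "$tmp" > "$config"
    rm -f "$tmp"
}

# Example usage (as root, before starting the registry):
#   mkdir -p /install/custom/registry
#   set_registry_root /etc/docker-distribution/registry/config.yml /install/custom/registry
```

Writing through `cat` rather than moving the temporary file keeps the ownership and permissions of the original configuration file.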
    

Generate CUDA + cuDNN Images

Once NVIDIA Docker has been enabled, a first baseline of Docker images needs to be created:

  • CUDA Image: Operating System + CUDA Toolkit.
  • cuDNN Image: Operating System + CUDA Toolkit + cuDNN Library.

CUDA Image

The process for generating CUDA images is the following:

  • Create the following Dockerfile:

  • Launch image generation using the above Dockerfile:

    $ docker build -f Dockerfile.cuda-devel-8.0 -t cuda/devel:8.0.61-1 .
    

    Remote access to both the Docker public registry and the CUDA repository is required.

  • Check image existence:

    $ docker images cuda/devel
    REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
    cuda/devel          8.0.61-1            3f2ae272ee0f        8 weeks ago         1.65 GB
    

cuDNN Image

The process for generating cuDNN images is the following:

  • Download the NVIDIA cuDNN 6.0 Developer Library Debian packages:

  • Create the following Dockerfile:

  • Launch image generation using the above Dockerfile:

    $ docker build -f Dockerfile.cudnn-devel-6 -t cudnn/devel:6.0.21-1 .
    
  • Check image existence:

    $ docker images cudnn/devel
    REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
    cudnn/devel         6.0.21-1            4cefb6f3b28d        4 days ago          6.01 GB
    
  • Test the newly-created image:

    $ nvidia-docker run --rm cudnn/devel:6.0.21-1 /bin/bash -c "nvidia-smi"
    

Generate PowerAI Docker Image

The following procedure makes it possible to create a PowerAI Docker image:

  • Download the PowerAI repository Debian package:

    $ wget https://public.dhe.ibm.com/software/server/POWER/Linux/mldl/ubuntu/mldl-repo-local_4.0.0_ppc64el.deb
    
  • Create a Dockerfile in the same directory where the PowerAI repository Debian package resides:

  • Build the Docker image from the previously created Dockerfile:

    $ docker build -f Dockerfile.powerai-base-4 -t powerai/base:4.0.0 .
    
  • Validate the PowerAI Docker image:

    • Start a container with the image previously created:

      $ docker run -it powerai/base:4.0.0 /bin/bash
      
    • Install some prerequisite packages:

      apt-get update && apt-get install -y vim wget
      
    • Activate the Caffe framework:

      source /opt/DL/caffe/bin/caffe-activate
      
    • Install the Caffe example:

      caffe-install-samples $HOME/caffe
      
    • Download the MNIST data from the MNIST website:

      cd /root/caffe/
      ./data/mnist/get_mnist.sh
      
    • Convert the downloaded data to LMDB format. After running this script there should be two datasets, mnist_train_lmdb and mnist_test_lmdb:

      ./examples/mnist/create_mnist.sh
      
    • Train the model over 1000 iterations. Based on the solver settings, the training loss is printed every 100 iterations and the network is tested every 500 iterations:

      ./examples/mnist/train_lenet.sh
      
    • More details can be found on the "Training LeNet on MNIST with Caffe" page (http://caffe.berkeleyvision.org/gathered/examples/mnist.html).
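
The Spectrum LSF application definition refers to the image as registry:5000/powerai/base:4.0.0, so the validated image has to be published to the private registry. A minimal sketch using the standard `docker tag` and `docker push` commands follows; the helper names `registry_ref` and `publish_image` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: publish a locally built image to the private registry.

# registry_ref REGISTRY IMAGE prints the registry-qualified image name.
registry_ref() {
    printf '%s/%s\n' "$1" "$2"
}

# publish_image REGISTRY IMAGE tags the local image and pushes it.
publish_image() {
    local registry="$1" image="$2" target
    target="$(registry_ref "$registry" "$image")"
    docker tag "$image" "$target" && docker push "$target"
}

# Example usage (requires the registry service to be running):
#   publish_image registry:5000 powerai/base:4.0.0
```

The push only succeeds over plain HTTP because registry:5000 is declared under insecure-registries in /etc/docker/daemon.json.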

Configure IBM Spectrum LSF

In order for Docker containers to be instantiated through Spectrum LSF, the following changes must be applied to the Spectrum LSF configuration:

  • Add the following settings to the lsf.conf configuration file:

    LSB_RESOURCE_ENFORCE="cpu memory"
    LSF_LINUX_CGROUP_ACCT=Y
    LSF_PROCESS_TRACKING=Y
    
  • Add the following resource definition to the lsf.shared configuration file:

    Begin Resource
    RESOURCENAME  TYPE    INTERVAL INCREASING CONSUMABLE DESCRIPTION  # Keywords
    docker        Boolean ()       ()         ()         (Docker container)
    End Resource
    
  • Assign the docker resource to the Compute Nodes in the lsf.cluster configuration file:

    Begin Host
    HOSTNAME  model  type  server  r1m  mem  swp  RESOURCES
    [...]
    host01    !      !     1       3.5  ()   ()   (docker)
    [...]
    End Host
    
  • Define a Docker application in the lsb.applications configuration file:

    Begin Application
    CONTAINER = docker[image(registry:5000/powerai/base:4.0.0) options(--device=/dev/infiniband --device=/dev/nvidia0 --device=/dev/nvidia1 --device=/dev/nvidia2 --device=/dev/nvidia3 --device=/dev/nvidia-uvm --device=/dev/nvidia-uvm-tools --device=/dev/nvidiactl --ipc=host --network=host --rm --ulimit memlock=819200000:819200000 --volume /etc/group:/etc/group:ro --volume /etc/passwd:/etc/passwd:ro --volume /gpfs/home:/gpfs/home --volume /gpfs/scratch:/gpfs/scratch --volume-driver=nvidia-docker --volume=nvidia_driver_375.:/usr/local/nvidia:ro)]
    DESCRIPTION = PowerAI
    NAME = powerai
    End Application
    

Note 1: The filesystems that need to be visible inside the container are to be specified as arguments to the Docker start command through the -v option.

Note 2: A second application declaration is required if the Docker image is expected to be used interactively.
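
The CONTAINER line above passes one --device flag per GPU. On hosts with a different GPU count, that part of the line can be generated rather than typed. This is a minimal sketch, assuming the GPUs appear as /dev/nvidia0, /dev/nvidia1, and so on; the helper name `gpu_device_flags` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build the per-GPU --device flags for the CONTAINER definition.

# gpu_device_flags COUNT prints "--device=/dev/nvidiaN" for N in 0..COUNT-1.
gpu_device_flags() {
    local count="$1" i=0 flags=""
    while [ "$i" -lt "$count" ]; do
        flags="${flags}${flags:+ }--device=/dev/nvidia${i}"
        i=$((i + 1))
    done
    printf '%s\n' "$flags"
}

# Example usage on a GPU host (nvidia-smi -L prints one line per GPU):
#   gpu_device_flags "$(nvidia-smi -L | wc -l)"
```

For a four-GPU host, `gpu_device_flags 4` reproduces the four GPU device flags used in the definition above.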

  • Reconfigure LSF Batch daemons:

    $ badmin reconfig
    $ lsadmin reconfig
    
  • Check that the docker resource is correctly associated with the Compute Nodes:

    $ lshosts host01
    HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
    host01      LINUXPP   POWER8 250.0   160   128G      -    Yes (docker)
    

The user that owns the Docker process instantiated by Spectrum LSF must also belong to the docker user group. By default, unless overridden in the Spectrum LSF application definition, this user is lsfadmin; it should therefore be added to the docker group:

docker:x:596:lsfadmin
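
The group membership can be verified, and added if missing, from a script. The sketch below parses the group file directly; the helper name `group_has_member` and its optional third argument (an alternative group file, used for testing) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check whether a user is listed as a member of a group.

# group_has_member GROUP USER [FILE] succeeds when USER appears in the
# comma-separated member list (fourth field) of GROUP in FILE.
group_has_member() {
    local group="$1" user="$2" file="${3:-/etc/group}"
    awk -F: -v g="$group" -v u="$user" '
        $1 == g {
            n = split($4, members, ",")
            for (i = 1; i <= n; i++) if (members[i] == u) found = 1
        }
        END { exit !found }
    ' "$file"
}

# Example usage (as root):
#   group_has_member docker lsfadmin || usermod -aG docker lsfadmin
```

This only inspects the local group file; accounts managed through a directory service would need a different lookup.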

Current Issues / Limitations

  • Restrictive User Mask Prevents Docker-Based LSF Job Startup

    • Status:
      • Service Request opened: 54109,661,706
      • Fix to be implemented in IBM Spectrum LSF 10.1.0.3
  • Handling of Bash Functions in User Environment at Docker Startup

    • Status:
    • Workaround:
      • Prevent Bash functions from being transferred to Docker through the -env option of the bsub command:
        $ bsub -app powerai -env "all,~BASH_FUNC_ml(),~BASH_FUNC_module()" < job.sh
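
The exclusion list in the workaround above names two functions explicitly. As a minimal sketch, the same list can be built from every exported Bash function in the submitting shell. It assumes the `BASH_FUNC_name()` environment variable naming shown above (other Bash builds use a different suffix); the helper name `lsf_env_without_bash_funcs` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build a bsub -env value that excludes all exported Bash functions.

lsf_env_without_bash_funcs() {
    # declare -Fx prints one "declare -fx NAME" line per exported function;
    # each NAME becomes a ",~BASH_FUNC_NAME()" exclusion appended to "all".
    printf 'all%s\n' "$(declare -Fx | awk '{ printf ",~BASH_FUNC_%s()", $3 }')"
}

# Example usage:
#   bsub -app powerai -env "$(lsf_env_without_bash_funcs)" < job.sh
```

When no functions are exported, the helper prints just `all`, which leaves the environment transfer unchanged.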
        
