GPU Device Exploitation

GPU exploitation is managed at three distinct levels:

  • Sharing of the GPU device between multiple processes.
  • Resource requirement expressed through an IBM Spectrum LSF directive.
  • GPU device visibility management through the CUDA_VISIBLE_DEVICES environment variable.

GPU Device Configuration

Compute Mode

A single GPU device can be:

  • Dedicated to one single user process.
  • Shared between multiple user processes.

This configuration is statically defined at the GPU level through the notion of Compute Mode:

Compute Mode         Purpose
Shared               GPU device can be accessed concurrently by multiple user processes
Exclusive Process    GPU device can be accessed by one single user process at a time
Prohibited           GPU device is not accessible to user processes

By default, GPU devices are configured as Shared. Changing the Compute Mode of a GPU device requires root access to the system, and is normally not directly manageable by the application users.
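
The current Compute Mode of each GPU device can be inspected without root access. The following is a minimal sketch, assuming the nvidia-smi utility is installed on the compute node; the function name show_compute_mode is only illustrative. Note that nvidia-smi reports the Shared mode under the name "Default".

```shell
#!/bin/bash
# Minimal sketch: list the Compute Mode of every GPU device on the node.
# show_compute_mode is an illustrative helper name, not part of any tool.
show_compute_mode() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found on this node"
        return 0
    fi
    # compute_mode is reported as Default (Shared), Exclusive_Process or Prohibited
    nvidia-smi --query-gpu=index,name,compute_mode --format=csv
}

show_compute_mode

# Changing the mode requires root access, for instance for GPU device 0:
#   nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
```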

NVIDIA Multi-Process Service (MPS)

Whenever the GPU device needs to be shared between multiple user processes, a specific mechanism named NVIDIA Multi-Process Service (MPS) comes into play. MPS takes care of optimizing the sharing of the resource between multiple user processes.

GPU Device Resource Allocation Request

At the IBM Spectrum LSF level, both the number of GPU devices to be allocated to the job and the use of MPS must be explicitly defined. This is done through the following directives inside the submission script:

  • Since Spectrum LSF 10.1.0.3, GPU resource requirement is expressed through one single-line, dedicated directive:

    #BSUB -gpu num=<num_gpus>:gtile=<tile_num>:mode={exclusive_process|shared*}:mps={no*|yes}:j_exclusive={no|yes*}:nvlink={no*|yes}
    

    where:

    • num_gpus: Requested number of GPU devices
    • tile_num: Number of GPU devices per socket

cf. IBM Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_command_ref/bsub.gpu.1.html

  • Prior to Spectrum LSF 10.1.0.3, the following limitations exist:

    • Compute Mode of the GPUs is not dynamically manageable through Spectrum LSF
    • Two distinct directives are required for GPU resource requirement:

      • Number of GPU devices + Compute Mode:

        #BSUB -R "rusage[{ngpus_excl_p|ngpus_shared}=<num_gpus>]"
        
      • MPS:

        #BSUB -env all,LSB_START_JOB_MPS=Y
        

> The following environment variables show the requested resource requirement and the effective resource requirement, respectively:

  • LSB_SUB_RES_REQ
  • LSB_EFFECTIVE_RSRCREQ
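
As an illustration, the single-line directive can be combined with these variables in a submission script. This is a minimal sketch for Spectrum LSF 10.1.0.3 or later, not a reference script: the job name, the task count, the output file name and the requested values (4 GPU devices in Exclusive Process mode, without MPS) are all assumptions chosen for the example.

```shell
#!/bin/bash
# Hypothetical job name, task count and output file (%J expands to the job ID)
#BSUB -J gpu_job
#BSUB -n 4
#BSUB -o gpu_job.%J.out
# Request 4 GPU devices in Exclusive Process mode, without MPS
#BSUB -gpu "num=4:mode=exclusive_process:mps=no:j_exclusive=yes"

# Print what was requested and what LSF effectively granted.
# Outside of an LSF job the variables are unset and "<not set>" is printed.
report_gpu_request() {
    echo "Requested resource requirement: ${LSB_SUB_RES_REQ:-<not set>}"
    echo "Effective resource requirement: ${LSB_EFFECTIVE_RSRCREQ:-<not set>}"
    echo "CUDA_VISIBLE_DEVICES:           ${CUDA_VISIBLE_DEVICES:-<not set>}"
}

report_gpu_request
```

Such a script would typically be submitted with bsub < job_script.sh, where job_script.sh is a placeholder file name.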

GPU Device Visibility

The set of GPU devices that are visible to a given process is managed through the environment variable: CUDA_VISIBLE_DEVICES.

If a proper GPU resource requirement has been expressed in the job submission script, IBM Spectrum LSF assigns a default value to this variable. This value corresponds to the GPU devices that have been allocated to the job:

  • ngpus_excl_p=2 -> CUDA_VISIBLE_DEVICES=1,0
  • ngpus_excl_p=4 -> CUDA_VISIBLE_DEVICES=0,1,2,3

However, in practice it is almost always necessary to manually redefine the value of this variable, so that each MPI task can address only one single GPU device.

> CUDA_VISIBLE_DEVICES='' (null value) means that no GPU device is assigned to the task. This triggers error messages the first time the task tries to access the GPU resource.
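
One possible way to redefine the variable per MPI task is a small wrapper script placed between the MPI launcher and the application. The sketch below is only an illustration under stated assumptions: the helper name pick_gpu is hypothetical, and the node-local rank is assumed to be exposed as OMPI_COMM_WORLD_LOCAL_RANK, which is the case for Open MPI based launchers but differs for other MPI implementations. Each task keeps exactly one device from the list assigned by LSF, chosen round-robin by local rank.

```shell
#!/bin/bash
# pick_gpu LOCAL_RANK DEVICE_LIST
# Prints the device of the comma-separated DEVICE_LIST assigned to LOCAL_RANK
# (round-robin). Returns 1 if the list is empty. Hypothetical helper name.
pick_gpu() {
    local local_rank=$1
    local -a devices
    IFS=',' read -r -a devices <<< "$2"
    local count=${#devices[@]}
    if [ "$count" -eq 0 ]; then
        return 1
    fi
    echo "${devices[$(( local_rank % count ))]}"
}

# Assumption: the launcher exports the node-local rank in this variable.
local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}

# Restrict this task to one single device of the LSF allocation, if any.
if gpu=$(pick_gpu "$local_rank" "${CUDA_VISIBLE_DEVICES:-}"); then
    export CUDA_VISIBLE_DEVICES=$gpu
fi

# Run the real application, if one was given on the command line.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

Assuming the wrapper is saved as bind_gpu.sh (a placeholder name), it would be used as: mpirun ./bind_gpu.sh ./my_application. With CUDA_VISIBLE_DEVICES=1,0, local rank 0 then sees device 1 and local rank 1 sees device 0.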
