Hyperion Cluster: GPU Programming with CUDA

In October 2010, SNS Computing added four CUDA-capable servers to the computing environment. Each server has one Tesla C2050 GPU, one 8-core AMD Opteron 6134 processor, and 16 GB of RAM.

Two of the servers, node65.hyperion and node66.hyperion, are part of the Hyperion cluster and are only available for batch jobs submitted to SGE with the qsub command. You can request a CUDA node by specifying '-l cuda' as part of your qsub command.
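For instance, a minimal submission could look like the line below (myjob.sh stands in for your own job script):

$ qsub -l cuda myjob.sh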

The other two nodes, cuda01.sns.ias.edu and cuda02.sns.ias.edu, are available for interactive access so that you can develop and test your programs before submitting them to the cluster as batch jobs.
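You can log in to either node directly with ssh:

$ ssh cuda01.sns.ias.edu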

On a CUDA-capable node, you need to make sure your PATH and LD_LIBRARY_PATH environment variables include the proper settings for CUDA. This should be done automatically, but the correct commands are shown below for completeness.

bash syntax:

PATH=/usr/local/cuda/bin:$PATH
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib
export PATH LD_LIBRARY_PATH

csh syntax:

setenv PATH /usr/local/cuda/bin:$PATH
setenv LD_LIBRARY_PATH /usr/local/cuda/lib64:/usr/local/cuda/lib

You can check that your environment is set up correctly by using the 'which' command to verify that the CUDA compiler, nvcc, is in your PATH:

[prentice@cuda02 ~]$ which nvcc
/usr/local/cuda/bin/nvcc
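
You can also confirm which version of the CUDA toolkit is installed by asking the compiler itself:

$ nvcc --version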


Once this is done, you can compile some sample CUDA code, like the example below. Be sure to save it with the ".cu" extension, or else nvcc won't handle it correctly.

#include <stdio.h>

#define N 10

__global__ void add(int *a, int *b, int *c) {
  int tid = blockIdx.x;
  if (tid < N)
    c[tid] = a[tid] + b[tid];
}

int main (int argc, char *argv[]) {

  int count;
  int i; 
  cudaError_t err;
  cudaDeviceProp prop;

  int a[N], b[N], c[N];
  int *dev_a, *dev_b, *dev_c;

  // How many CUDA devices are there? 
  err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    printf("Error getting device count\n");
    return(1);
  }	

  if (count == 0) {
    printf("No CUDA devices found\n");
    return(0);
  }
  printf("\nCUDA Device count = %i\n\n", count);

  // List the properties for each device
  for (i = 0; i < count; i++) {
    err = cudaGetDeviceProperties(&prop, i);
    if (err != cudaSuccess) {
      printf("Error getting device properties for device %i\n", i);
      return(1);
    }
    printf("Some properties of CUDA device %i:\n", i);
    printf("=================================\n");
    printf("Name: %s\n", prop.name);
    printf("Compute capability: %i.%i\n", prop.major, prop.minor);
    printf("Number of multiprocessors: %i\n", prop.multiProcessorCount);
    printf("Total global memory: %ld bytes\n", prop.totalGlobalMem);
    printf("Shared Mem/Block: %ld bytes\n", prop.sharedMemPerBlock);
    printf("\n");
  }

  // Do a simple vector addition, a + b = c

  // allocate memory on the GPU (device)
  err = cudaMalloc( (void**)&dev_a, N*sizeof(int));
  if (err != cudaSuccess) {
    printf ("Error allocating memory on device\n");
    return(1);
  }
  err = cudaMalloc( (void**)&dev_b, N*sizeof(int));
  if (err != cudaSuccess) {
    printf ("Error allocating memory on device\n");
    return(1);
  }
  err = cudaMalloc( (void**)&dev_c, N*sizeof(int));
  if (err != cudaSuccess) {
    printf ("Error allocating memory on device\n");
    return(1);
  }
  
  //fill the vectors with data
  for (i = 0; i < N; i++) {
    a[i] = i;
    b[i] = i;
  }

  //copy data to device
  err = cudaMemcpy(dev_a, a,  N*sizeof(int),cudaMemcpyHostToDevice);
  if (err != cudaSuccess) {
    printf("Error copying data to device\n)");
    return(1);
  }
  err = cudaMemcpy(dev_b, b, N*sizeof(int),cudaMemcpyHostToDevice);
  if (err != cudaSuccess) {
    printf("Error copying data to device\n)");
    return(1);
  }
  
  // Call CUDA kernel to do the actual work
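  // <<<N,1>>> launches N blocks of one thread each; in the kernel,
  // blockIdx.x selects which vector element a given block handles.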
  add<<<N,1>>>(dev_a, dev_b, dev_c);
  
  //copy results from device to host
  err = cudaMemcpy(c, dev_c, N*sizeof(int),cudaMemcpyDeviceToHost);
  if (err != cudaSuccess) {
    printf("Error copying data from device\n)");
    return(1);
  }
  
  printf("Results of vector addition on GPU:\n");
  printf("==================================\n");
  for (i=0; i<N; i++) {
    printf("%i + %i = %i\n", a[i], b[i], c[i]);
  }

  //Free allocated memory on GPU
  cudaFree(dev_a);
  cudaFree(dev_b);
  cudaFree(dev_c);

  printf("\nSample CUDA program completed successfully\n");
  return(0);
  
}
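
The example above repeats the same error-handling boilerplate after every CUDA call. A common idiom is to wrap the check in a macro; the sketch below shows one way to do it (CUDA_CHECK is our own name, not part of the CUDA toolkit), using cudaGetErrorString() to turn the error code into a readable message:

#include <stdio.h>
#include <stdlib.h>

// Hypothetical convenience macro -- wraps any call that returns a
// cudaError_t and aborts with a readable message on failure.
#define CUDA_CHECK(call)                                        \
  do {                                                          \
    cudaError_t e = (call);                                     \
    if (e != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
              __FILE__, __LINE__, cudaGetErrorString(e));       \
      exit(1);                                                  \
    }                                                           \
  } while (0)

// Example usage, replacing the malloc/check pairs above:
//   CUDA_CHECK(cudaMalloc((void**)&dev_a, N*sizeof(int)));
//   CUDA_CHECK(cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice));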

Compile it with the nvcc command:

$ nvcc -o example example.cu
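
The Tesla C2050 has compute capability 2.0, so you can also tell nvcc to target that architecture explicitly (optional here; sm_20 is the architecture name for compute capability 2.0):

$ nvcc -arch=sm_20 -o example example.cu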

Then run the resulting program, "example". If everything goes right, the output should look like the sample below:

$ ./example 

CUDA Device count = 1

Some properties of CUDA device 0:
=================================
Name: Tesla C2050
Compute capability: 2.0
Number of multiprocessors: 14
Total global memory: 2817982464 bytes
Shared Mem/Block: 49152 bytes

Results of vector addition on GPU:
==================================
0 + 0 = 0
1 + 1 = 2
2 + 2 = 4
3 + 3 = 6
4 + 4 = 8
5 + 5 = 10
6 + 6 = 12
7 + 7 = 14
8 + 8 = 16
9 + 9 = 18

Sample CUDA program completed successfully


Ta-da! You've just successfully run a CUDA program. To run it in batch mode on the cluster, you just need to wrap it in a qsub script, just like the other examples above. The only difference is that we request the 'cuda' and 'exclusive' resources using the -l switch in the submission script:

#!/bin/bash

#$ -l cuda
#$ -l exclusive
#$ -V 
#$ -cwd 

./example

If you saved the above shell script as 'cuda.sh', you can then run it on the cluster with the qsub command:

$ qsub cuda.sh
Your job 552495 ("cuda.sh") has been submitted
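
While the job is waiting or running, you can check its status with SGE's qstat command:

$ qstat -u $USER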

If all goes well, the standard output file for this job (by default named cuda.sh.o552495, written to the directory you submitted from) should be identical to the output shown above from running the program on the command line.