Apollo GPU Nodes

Hardware

Each Apollo node runs Springdale Linux 8 and has 8 NVIDIA A100 40 GB GPUs, 2 x 64-core AMD EPYC 7742 processors, 1024 GB of RAM, and 15 TB of local scratch space.

Configuration

All nodes mount the same /home and /data filesystems as the other computers in SNS. Scratch space locations are named to make it clear whether storage is local or on the network: /scratch/lustre is the mount point for the parallel (Lustre) file system, and /scratch/local/ is for node-local storage.
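For example, a job can stage temporary files on node-local scratch while keeping shared data on the parallel file system. A minimal sketch (the per-user subdirectory is illustrative, not an established convention):

  # node-local scratch: fast, but only visible on this node
  mkdir -p /scratch/local/$USER
  cp input.dat /scratch/local/$USER/

  # parallel (Lustre) scratch: shared across all nodes
  ls /scratch/lustre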

Scheduler

Job queuing is provided by SLURM; the following hosts have been configured as SLURM submit hosts for the Apollo nodes:

  • apollo-login1.sns.ias.edu

Access to the Apollo nodes is restricted and requires a cluster account.

Submitting / Connecting to Apollo Nodes

You can submit jobs to the Apollo nodes from apollo-login1.sns.ias.edu by requesting a GPU resource. A job submit script will automatically assign your job to the appropriate queue. At this time we are enforcing a maximum of four GPUs per job.

GPU resources can be requested with --gpus=1, --gres=gpu:1, or --gpus-per-node=1:
  srun --time=1:00 --gpus=1 nvidia-smi
  srun --time=1:00 --gres=gpu:1 nvidia-smi
  srun --time=1:00 --gpus-per-node=1 nvidia-smi
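
The same options work in a batch script submitted with sbatch; a minimal sketch (the job name and output file name are illustrative):

  #!/bin/bash
  #SBATCH --job-name=gpu-test        # illustrative name
  #SBATCH --time=10:00               # 10 minutes of wall time
  #SBATCH --gpus=1                   # one GPU, same as the srun examples above
  #SBATCH --output=gpu-test-%j.out   # %j expands to the job ID

  # report the GPU assigned to this job
  nvidia-smi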

You can ssh to an Apollo node once you have an active job or allocation on that node:
  apollo-login1$> salloc --time=5:00 --gpus=1
    salloc: Granted job allocation 134
    salloc: Waiting for resource configuration
    salloc: Nodes apollo01 are ready for job
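
Once the allocation is granted, you can ssh to the assigned node (apollo01 in the example above) and work interactively:
  apollo-login1$> ssh apollo01
  apollo01$> nvidia-smi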

Checking GPU Usage

You can check GPU usage with the nvidia-smi command. Note that nvidia-smi must be run on a GPU node: either ssh to a node where you have an active job, or attach to an already allocated job with srun --jobid=<JOBID> nvidia-smi. You can ssh interactively to any node where you have an active job assigned.
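For example, assuming your job is already running, you can look up its job ID with squeue and attach to it:
  apollo-login1$> squeue -u $USER
  apollo-login1$> srun --jobid=<JOBID> nvidia-smi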

Figure: nvidia-smi output on an Apollo GPU node