Typhon Cluster


Overview
For large parallel computations and batch jobs, IAS has a 64-node Beowulf cluster named Typhon. Each node has four 24-core 64-bit Intel Cascade Lake processors, for a total of 6144 processor cores, and 384 GB of RAM (4 GB/core). For low-latency message passing, all nodes are interconnected with HDR100 InfiniBand.

Job queuing is provided by Slurm. The operating system is Springdale Linux 8.

All nodes mount the same /home and /data filesystems as the other computers in SNS. Scratch space is split by location to make it clear which storage is local and which is network-attached: /scratch/lustre is the mount point for the parallel file system, and /scratch/local/ is used for node-local storage.

Accessing
Access to the Typhon cluster is restricted. If you would like to use it, please contact the computing staff.

Submitting Jobs/Interactive Use
Jobs can be submitted to the Typhon cluster from the login nodes. For an overview of submitting jobs, please refer to: Submitting Jobs with Slurm
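
A minimal batch script looks like the sketch below; the job name, core count, and executable are placeholders to adapt to your own work.

    #!/bin/bash
    #SBATCH --job-name=example
    #SBATCH --ntasks=96              # one node's worth of cores
    #SBATCH --time=24:00:00          # requested wall time
    #SBATCH --output=example-%j.out  # %j expands to the job ID

    srun ./my_mpi_program            # hypothetical MPI executable

Submit it from a login node with "sbatch example.slurm" and check its status with "squeue -u $USER".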

Login Nodes
The primary login nodes, typhon-login1 and typhon-login2, should be used for interactive work such as compiling programs and submitting jobs. Please remember that these are shared resources for all users.
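
Assuming the login nodes are reachable by the hostnames above (the fully qualified domain is not given on this page), a typical connection from an SNS machine looks like:

    ssh username@typhon-login1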

The /data, /home, and /scratch file systems are available on all login and cluster nodes.

All nodes have access to our parallel filesystem through /scratch/lustre.
600 GB of local scratch is available on each node in /scratch/local/.
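
For I/O-heavy work it can help to stage data into node-local scratch and copy results back at the end of the job. The sketch below assumes a per-user, per-job subdirectory and a hypothetical serial executable; Slurm does not create this directory for you.

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --time=04:00:00

    # Stage input to node-local scratch (directory layout is an assumption).
    WORKDIR=/scratch/local/$USER/$SLURM_JOB_ID
    mkdir -p "$WORKDIR"
    cp /data/$USER/input.dat "$WORKDIR/"
    cd "$WORKDIR"

    ./my_program input.dat           # hypothetical executable

    # Copy results back to network storage and clean up before the job ends.
    cp results.dat /data/$USER/
    rm -rf "$WORKDIR"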

Job Scheduling
The cluster determines job scheduling and priority using Fair Share. Each user receives a score based on recent usage: the more you have run recently, the lower your score, and hence your scheduling priority, will temporarily be.
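
To see your own fair-share standing, Slurm's sshare utility reports recent usage and the resulting fair-share factor (a quick check; the exact columns shown depend on the site's Slurm configuration):

    sshare -u $USER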

Jobs will be assigned a quality of service (QOS) based on the length of time requested for the job.

QOS      Time Limit    Cores Available
short    24 hours      6144
medium   72 hours      3072
long     168 hours     1536
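
For example, a job submitted with "#SBATCH --time=48:00:00" would fall into the medium QOS per the table above. You can confirm the QOS assigned to your queued jobs with standard squeue output options (the format string below simply selects job ID, name, QOS, and time limit):

    squeue -u $USER -o "%i %j %q %l"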

The current maximum allowed time is 168 hours (7 days). Users who need to run for longer than this should build checkpoint/restart capability into their jobs so that each submission stays within these limits, as in the sketch below.
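
One common pattern is a job that resumes from a checkpoint file when one exists and resubmits itself until the work is done. In the sketch below, the solver name, its restart flag, and the completion marker are all hypothetical; replace them with your application's own mechanism.

    #!/bin/bash
    #SBATCH --ntasks=96
    #SBATCH --time=168:00:00

    # Resume from a checkpoint if the previous run left one behind
    # (the flag and file name are assumptions for illustration).
    if [ -f checkpoint.dat ]; then
        srun ./my_solver --restart checkpoint.dat
    else
        srun ./my_solver
    fi

    # Resubmit this script unless the run has signalled completion
    # (the marker file is an assumption; use your own completion test).
    if [ ! -f finished.marker ]; then
        sbatch "$0"
    fi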

Current Utilization
You can view the current utilization of the cluster here.