BIMSA Compute Cluster Usage Guide

Introduction
The BIMSA Compute Cluster (BCC) is a mid-range, shared, scheduled cluster that serves BIMSA's academic and research activities. In November 2023, the BIMSA IT department surveyed faculty compute needs. It then consolidated the requirements, evaluated possible solutions, secured a budget from management, prepared and ran a fair tender, selected the optimal solution, obtained approval, and completed the procurement process. The cluster was subsequently acquired, configured, and deployed.

Basic Server Configuration
1. Compute Servers (6 sets)

Hardware
  CPU:      Dual Intel Xeon Gold 6444Y (3.6GHz, 16C/32T)
  Memory:   768GB (24 x 32GB RDIMM 4800MT/s)
  Storage:  480GB SSD mounted on /
            100TB parallel shared file system mounted on /home
  Network:  10Gbps uplink (to login nodes)
            25Gbps uplink (to storage)
  GPU:      Dual Nvidia L40 (PCIe)

Software
  OS:              Ubuntu 22.04 LTS
  System Packages: build-essential
  GPU Driver:      Nvidia 550.90.07
  CUDA:            11.8
  cuDNN:           9.2.1.18
  Anaconda:        2024.06-1
  PyTorch:         2.4.0

2. Login Servers (2 sets)

Hardware
  CPU:      Dual Intel Xeon Gold 6442Y (2.6GHz, 24C/48T)
  Memory:   128GB (4 x 32GB RDIMM 4800MT/s)
  Storage:  480GB SSD mounted on /
            100TB parallel shared file system mounted on /home
  Network:  10Gbps uplink (to user network)
            25Gbps uplink (to storage)

Software
  OS:              Ubuntu 22.04 LTS
  System Packages: build-essential
  Anaconda:        2024.06-1

Basic Cluster Usage Guide
Prerequisites
Users of this cluster should be familiar with the following:
1. The Linux (especially Ubuntu) operating system and command-line operations
2. The SLURM job-scheduling system (sample tutorial)
3. Building your own compute environment in your home directory without root privileges

Access BCC
Upon successful account application, you may use any SSH client to log in to sls.bimsa.net on port 22. If you are connecting from outside BIMSA, make sure you connect to the BIMSA VPN service first.
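
For example, from a terminal (your_username is a placeholder for the account name you were issued):

$ ssh your_username@sls.bimsa.net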

Build Your Compute Environment
The whole BCC cluster (including login and compute nodes) shares the same user home directory. Anaconda is provided cluster-wide, so you may build your own virtual environments in your home directory to perform computation.
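
A minimal sketch using the provided Anaconda installation (the environment name myenv and the Python version are illustrative choices, not requirements):

$ conda create -n myenv python=3.10
$ conda activate myenv
$ conda install numpy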

Alternatively, you may compile programs yourself and install them in your home directory.
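
For instance, a typical autotools-style build installed under your home directory (mytool-1.0 is a hypothetical package; adjust the names and paths to your software):

$ tar xf mytool-1.0.tar.gz
$ cd mytool-1.0
$ ./configure --prefix=$HOME/.local
$ make && make install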

Use SLURM 
The Simple Linux Utility for Resource Management (SLURM) is the resource-management and job-scheduling system of BCC. All compute jobs must be submitted through SLURM.

Here is a sample script (test.sh):

#!/bin/bash
#SBATCH -o job.%j.out            # write job output to job.<jobid>.out (%j = job ID)
#SBATCH -J myFirstGPUJob         # job name shown in the queue
#SBATCH --nodes=1                # run on a single node
#SBATCH --ntasks-per-node=6      # request six tasks on that node
#SBATCH --gres=gpu:1             # request one GPU
nvidia-smi                       # show the GPU allocated to this job

Then, you may submit your job with

$ sbatch test.sh
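
While the job is pending or running, you may check its status with SLURM's queue listing (the -u flag limits the output to your own jobs):

$ squeue -u $USER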

And once the job completes, you may read its output via

$ cat job.[nn].out

where [nn] is the job ID that SLURM assigned (the %j in the script above).
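
As a further sanity check, since the compute nodes provide PyTorch 2.4.0, you could replace the nvidia-smi line in test.sh with a one-line probe; this is a sketch that assumes python resolves to an environment containing the provided PyTorch:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"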