Setting Up a Deep Learning Environment on an HPC Cluster

If you already know the basic Slurm workflow, the next practical step is getting a usable deep learning environment running on the cluster. The exact setup may change as the platform is updated, and some issues only show up after real use, so it helps to treat this as a working checklist rather than a fixed recipe.

Check which queue you can use

Before submitting anything, find out which partition is available to your account:

whichpartition

If the output looks like this:

PartitionName=染念最帅

then that is the queue where your jobs should be submitted.

Once you know the partition, you can choose among the usual Slurm submission methods: salloc, srun, or sbatch.
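As a rough illustration of the sbatch route, the sketch below writes a minimal single-node batch script to a file so you can inspect it before submitting. The helper name write_job_script, the partition name my_partition, the job name demo, the file names job.sh and train.py, and the --gres=dcu:N resource name are all assumptions made for this example; check your cluster's documentation for the real resource name.

```shell
#!/bin/bash
# Write a minimal single-node batch script; nothing is submitted here.
# Usage: write_job_script <partition> <cards> <outfile>
# Assumption: accelerators are requested as a generic resource named "dcu".
write_job_script() {
    local partition="$1" cards="$2" outfile="$3"
    cat > "$outfile" <<EOF
#!/bin/bash
#SBATCH -J demo
#SBATCH -p ${partition}
#SBATCH -N 1
#SBATCH --ntasks-per-node=${cards}
#SBATCH --gres=dcu:${cards}
#SBATCH -o %j.out
#SBATCH -e %j.err

# Load modules and activate your environment here.
srun python train.py
EOF
}

# Placeholder values; replace with your own partition and card count.
write_job_script my_partition 4 job.sh
```

After checking the generated file, it would be submitted with sbatch job.sh.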

Using the preinstalled software stack

Many clusters already provide ready-made software environments. You can inspect them with:

module avail

For example, if you see something like apps/Pytorch/llama_py38, you can load it directly with module load and start using it right away.

Building your own Conda environment

A preinstalled environment is convenient, but it is not the only option. You can also create your own environment using the cluster-provided apps/anaconda3/5.2.0, or install Miniconda under your home directory and manage everything yourself.

One practical lesson is worth calling out: if you need to compile Python packages from source, installing your own Miniconda is usually the safer route. Using the shared cluster environment can lead to permission-related build failures.

Creating a custom environment from the shared Anaconda module

A typical setup looks like this:

1. Load the shared Anaconda module: module load apps/anaconda3/5.2.0
2. Create the environment: conda create -n torch python=3.10. Python 3.10 is recommended, and many packages are only available for 3.10.
3. Activate it: source activate torch (use source deactivate to leave it).
4. Check /public/software/apps/DeepLearning for the wheel package you need. When choosing the DTK version, newer is generally better here; this is different from the usual NVIDIA habit where older combinations may sometimes be more stable.

Because these commands need to be run every time you start a shell, it is convenient to put them into a script such as env.sh:

#!/bin/bash
module purge
# Adjust these to the current platform state: on first login, run `module list` to see the default modules, then adapt from there.
module load compiler/devtoolset/7.3.1 mpi/hpcx/2.11.0/gcc-7.3.1 compiler/dtk/24.04 apps/anaconda3/5.2.0
source activate torch

Run it with:

source env.sh

Using your own Miniconda

The main reason to choose this approach is the package compilation issue mentioned above. Download the installer into your home directory and install Miniconda there.
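A minimal sketch of that download-and-install step is shown below. It assumes an x86_64 login node with outbound network access and wget available; the function name install_miniconda and the default prefix ~/miniconda3 are choices made for this example.

```shell
#!/bin/bash
# Download the Miniconda installer into the home directory and install it.
# Assumptions: x86_64 Linux, wget installed, network access from this node.
# Installer flags: -b runs in batch mode without prompts, -p sets the prefix.
install_miniconda() {
    local installer="Miniconda3-latest-Linux-x86_64.sh"
    local prefix="${1:-$HOME/miniconda3}"
    wget -O "$HOME/${installer}" \
        "https://repo.anaconda.com/miniconda/${installer}" || return 1
    bash "$HOME/${installer}" -b -p "${prefix}"
}

# Run once on a login node, for example:
# install_miniconda
```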

One important warning: do not run conda init on the cluster yourself. Instead, activate it explicitly with a script like this:

#!/bin/bash
module purge
module load compiler/devtoolset/7.3.1 mpi/hpcx/2.11.0/gcc-7.3.1 compiler/dtk/24.04
source ~/miniconda3/bin/activate
conda activate llama

Notice that the final activation command here is conda activate, not source activate. You can also execute the same steps manually in the shell if you prefer.

Common issues you may run into

1. Requested node configuration is not available

If you see:

sbatch: error: Batch job submission failed: Requested node configuration is not available

then the resources you requested exceed what the nodes in that partition can provide, or what your account is allowed to request.

2. The hardware shown on the platform homepage is not always what you can request per node

On the AC platform, seeing resources such as 64 CPU cores and 8 accelerator cards listed as available does not mean you can request all of that in a single-node job. You also need to check what your accessible queue actually supports.

For example, if the queue is listed as 7285-32C-128G-4卡 (卡 means accelerator cards), then the maximum single-node configuration is 32 cores and 4 cards. In that case, a single-node job can reasonably use 1, 2, 3, or 4 cards.

For multi-node jobs, the layout matters. If you are using 2 nodes, then 8 cards total is usually the sensible choice. If you ask for 5 cards, you may end up needing 5 nodes with 1 card per node, which wastes resources badly.

In practice, it is usually best to test and run with 1, 2, 4, 8, and other powers of two so that the allocated hardware is used more cleanly.
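The layout arithmetic above can be sketched as a small helper. It assumes that every node in a job must hold the same number of cards, which is what the 5-card example implies; the function name card_layout is made up for this illustration.

```shell
#!/bin/bash
# Print "<nodes> <cards_per_node>" for a requested total card count.
# Assumption: each node in the job must hold the same number of cards,
# so we pick the largest per-node count that divides the total evenly.
card_layout() {
    local total="$1" max_per_node="$2" per_node
    for (( per_node = max_per_node; per_node >= 1; per_node-- )); do
        if (( total % per_node == 0 )); then
            echo "$(( total / per_node )) ${per_node}"
            return 0
        fi
    done
    return 1
}

card_layout 8 4   # prints "2 4": 2 nodes with 4 cards each
card_layout 5 4   # prints "5 1": 5 nodes with 1 card each
```

With a 4-card queue, 8 cards fill two nodes completely, while 5 cards spread across five nodes that each use only a quarter of their accelerators.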

3. Odd GPU counts can trigger device ordinal errors

If you run with an odd number of cards, it is easy to hit errors such as:

RuntimeError: HIP/CUDA error: invalid device ordinal

This typically means the program cannot correctly identify the available GPU device count.
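An invalid device ordinal is raised when a process selects a device index that is not below the number of devices it can see. The sketch below checks that condition for one local rank. It assumes the visible devices are given as a comma-separated list, as in the HIP_VISIBLE_DEVICES variable; whether your platform sets that variable for each job is something to verify, and the function name rank_has_device is hypothetical.

```shell
#!/bin/bash
# Return 0 if a local rank maps to one of the visible devices.
# visible is a comma-separated list such as "0,1,2".
rank_has_device() {
    local local_rank="$1" visible="$2"
    local -a ids
    IFS=',' read -r -a ids <<< "$visible"
    (( local_rank < ${#ids[@]} ))
}

rank_has_device 2 "0,1,2" && echo "rank 2: ok"
rank_has_device 3 "0,1,2" || echo "rank 3: no matching device"
```

Inside a job step, a call such as rank_has_device "$SLURM_LOCALID" "$HIP_VISIBLE_DEVICES" would perform the same check before the training script starts, assuming both variables are set.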

4. CPU requests do not match the job layout

If you get:

More processors requested than permitted

or your job stays in the queue for a long time, a common reason is that the number of CPU cores you requested does not match what the task configuration allows.
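One quick sanity check before submitting is to multiply tasks per node by CPUs per task and compare the result with the cores a node in your queue offers. The sketch below does that arithmetic; the function name cpu_request_fits is made up for this example, and the 32-core limit is taken from the queue example earlier.

```shell
#!/bin/bash
# Return 0 if tasks_per_node * cpus_per_task fits within cores_per_node.
cpu_request_fits() {
    local tasks_per_node="$1" cpus_per_task="$2" cores_per_node="$3"
    local requested=$(( tasks_per_node * cpus_per_task ))
    if (( requested > cores_per_node )); then
        echo "requested ${requested} cores per node, limit is ${cores_per_node}" >&2
        return 1
    fi
    echo "ok: ${requested} of ${cores_per_node} cores per node"
}

cpu_request_fits 4 8 32            # prints "ok: 32 of 32 cores per node"
cpu_request_fits 4 16 32 || true   # reports the overrun on stderr
```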

5. Log file paths can also block scheduling

If the default Slurm logs are too messy, you can redirect them from:

#SBATCH -o %j.out

and

#SBATCH -e %j.err

to something like:

#SBATCH -o ./logs/%j.out

and

#SBATCH -e ./logs/%j.err

so the files are written into a dedicated directory.

However, the AC platform can be inconsistent here: sometimes a job remains queued simply because that target directory could not be created. Sometimes it creates the directory automatically, sometimes it does not. Creating the directory yourself in advance is the safer option.
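One way to make this a habit is to create the log directories from the batch script's own directives right before submitting. The sketch below assumes the directives are written in the short form shown above (#SBATCH -o PATH and #SBATCH -e PATH) and that you submit from the directory the relative paths refer to; the function name prepare_log_dirs is made up for this example.

```shell
#!/bin/bash
# Create the directories named in the -o/-e directives of a batch script.
# Only handles the "#SBATCH -o PATH" / "#SBATCH -e PATH" form.
prepare_log_dirs() {
    local script="$1" path
    awk '$1 == "#SBATCH" && ($2 == "-o" || $2 == "-e") { print $3 }' "$script" |
    while IFS= read -r path; do
        mkdir -p "$(dirname "$path")"
    done
}

# Typical use, run from the submission directory:
# prepare_log_dirs job.sh && sbatch job.sh
```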

6. pip version problems during installation

If pip install fails with:

Local version label can only be used with == or != operators

then downgrade pip with:

pip install pip==24.0

7. Missing shared libraries such as `libglog.so` or `libgalaxyhip.so`

If these libraries are missing, check whether the DTK version loaded in your environment (as reflected in your environment variables) matches the DTK version the installed packages were built for.

Small but useful tricks

Once a job is submitted, if your shell script writes logs, you can monitor progress in real time with:

tail -f logfile

You can also SSH into the compute node and watch accelerator memory usage with:

watch rocm-smi

By default watch refreshes every 2 seconds. To change the interval, use:

watch -n x rocm-smi

where x is the refresh interval in seconds.

If you have forgotten the compute node name, check the NODELIST column in the squeue output.
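For example, a small wrapper like the one below lists only your own jobs with their state and node list. The function name my_job_nodes is made up for this sketch; the squeue options are -u for the user, -h to drop the header, and -o for the output format (%i job ID, %T state, %N node list).

```shell
#!/bin/bash
# List the current user's jobs as "<job id> <state> <nodelist>".
my_job_nodes() {
    squeue -u "$USER" -h -o "%i %T %N"
}

# Run on a login node, then SSH to one of the listed nodes:
# my_job_nodes
```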

And one last reminder: compute time is expensive. If you do not shut things down properly, the bill will teach the lesson for you.