If you already know the basic Slurm workflow, the next practical step is getting a usable deep learning environment running on the cluster. The exact setup may change as the platform is updated, and some issues only show up after real use, so it helps to treat this as a working checklist rather than a fixed recipe.
Check which queue you can use
Before submitting anything, find out which partition is available to your account:
whichpartition
If the output looks like this:
PartitionName=染念最帅
then that is the queue where your jobs should be submitted.
Once you know the partition, you can submit work with any of the usual Slurm commands: salloc, srun, or sbatch.
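As a minimal sketch of the batch route, the block below writes a small job script and submits it only when sbatch is available. The partition name my_partition, the file name job.sh, and the resource numbers are placeholders rather than values from this cluster, and the --gres=dcu:1 accelerator syntax is an assumption that may differ on your site.

```shell
#!/bin/bash
# Hypothetical partition name: replace it with the one reported by whichpartition.
PARTITION="${PARTITION:-my_partition}"

# Write a minimal batch script (job.sh is an illustrative file name).
cat > job.sh <<EOF
#!/bin/bash
#SBATCH -J demo
#SBATCH -p ${PARTITION}
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=dcu:1
#SBATCH -o %j.out
#SBATCH -e %j.err

hostname
EOF

# Submit only when Slurm is actually available in this shell.
if command -v sbatch >/dev/null 2>&1; then
    sbatch job.sh
else
    echo "sbatch not found; job.sh was written but not submitted"
fi
```

The same resource flags can be passed to salloc for an interactive allocation or to srun to run a single command, for example srun -p my_partition -N 1 hostname.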
Using the preinstalled software stack
Many clusters already provide ready-made software environments. You can inspect them with:
module avail
For example, if you see something like apps/Pytorch/llama_py38, you can load it directly with module load and start using it right away.
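The module list can be long, so it often helps to filter it. The sketch below assumes that module avail writes its listing to stderr, which is common but not universal; the helper name find_modules is made up for this example.

```shell
#!/bin/bash
# List available modules whose name matches a keyword (case-insensitive).
# "module avail" often writes to stderr, so merge it into stdout before grep.
find_modules() {
    module avail 2>&1 | grep -i -- "$1"
}

# Only run on a machine where the module command exists.
if type module >/dev/null 2>&1; then
    find_modules pytorch || echo "no module matching 'pytorch'"
    # Load the example environment named in the text, then confirm it:
    # module load apps/Pytorch/llama_py38
    # module list
fi
```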
Building your own Conda environment
A preinstalled environment is convenient, but it is not the only option. You can also create your own environment using the cluster-provided apps/anaconda3/5.2.0, or install Miniconda under your home directory and manage everything yourself.
One practical lesson is worth calling out: if you need to compile Python packages from source, installing your own Miniconda is usually the safer route. Using the shared cluster environment can lead to permission-related build failures.
Creating a custom environment from the shared Anaconda module
A typical setup looks like this:
- module load apps/anaconda3/5.2.0
- conda create -n torch python=3.10 to create the environment. Python 3.10 is recommended, and many packages are only available for 3.10.
- source activate torch to activate it (source deactivate to leave it)
- Check /public/software/apps/DeepLearning for the wheel package you need. When choosing the DTK version, newer is generally better here; this is different from the usual NVIDIA habit where older combinations may sometimes be more stable.
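To narrow down the wheel directory, a small search helper can be useful. This sketch assumes the wheels follow the standard naming scheme, where a CPython 3.10 build carries the cp310 tag, and that they may sit in subdirectories; list_wheels is a hypothetical helper, not something the cluster provides.

```shell
#!/bin/bash
# Print every wheel under a directory whose file name contains a Python tag.
list_wheels() {
    local wheel_dir="$1" py_tag="$2"
    find "$wheel_dir" -type f -name "*${py_tag}*.whl" 2>/dev/null | sort
}

# Search the shared wheel directory for Python 3.10 builds, if it exists here.
WHEEL_DIR="/public/software/apps/DeepLearning"
if [ -d "$WHEEL_DIR" ]; then
    list_wheels "$WHEEL_DIR" cp310
fi

# After picking a file from the list, install it inside the activated env:
# pip install /full/path/to/the/chosen.whl
```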
Because these commands need to be run every time you start a shell, it is convenient to put them into a script such as env.sh:
#!/bin/bash
module purge
# Adjust to the current state of the cluster: on first login, run module list to see the default modules, then adapt the line below accordingly
module load compiler/devtoolset/7.3.1 mpi/hpcx/2.11.0/gcc-7.3.1 compiler/dtk/24.04 apps/anaconda3/5.2.0
source activate torch
Run it with:
source env.sh
Using your own Miniconda
The main reason to choose this approach is the package compilation issue mentioned above. Download the installer into your home directory and install Miniconda there.
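The download and install step can be sketched as follows. It assumes x86_64 nodes, outbound network access from the login node, and the install prefix ~/miniconda3; the function name install_miniconda is only illustrative. Batch mode is used so the installer does not prompt and does not edit shell startup files.

```shell
#!/bin/bash
# Download Miniconda into the home directory and install it under ~/miniconda3.
install_miniconda() {
    local installer="Miniconda3-latest-Linux-x86_64.sh"
    local prefix="$HOME/miniconda3"
    wget -P "$HOME" "https://repo.anaconda.com/miniconda/${installer}" || return 1
    # -b: batch mode (no prompts, no shell rc changes); -p: install prefix
    bash "$HOME/$installer" -b -p "$prefix"
}

# Run it explicitly when you are ready:
# install_miniconda
```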
One important warning: do not run conda init on the cluster yourself. Instead, activate Miniconda explicitly with a script like this:
#!/bin/bash
module purge
module load compiler/devtoolset/7.3.1 mpi/hpcx/2.11.0/gcc-7.3.1 compiler/dtk/24.04
source ~/miniconda3/bin/activate
conda activate llama
Notice that the final activation command here is conda activate, not source activate. You can also execute the same steps manually in the shell if you prefer.
Common issues you may run into
1. Requested node configuration is not available
If you see:
sbatch: error: Batch job submission failed: Requested node configuration is not available
then the job requests more resources per node (cores, memory, or cards) than the partition allows.
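One way to see what a partition's nodes actually offer is to ask sinfo. The sketch below uses standard sinfo format fields; the helper name show_partition_limits and the partition name my_partition are placeholders.

```shell
#!/bin/bash
# Print per-node limits for a partition:
# %P partition, %D node count, %c CPUs per node, %m memory in MB, %G generic resources
show_partition_limits() {
    sinfo -p "$1" -o "%P %D %c %m %G"
}

# Only query the scheduler when sinfo is available in this shell.
if command -v sinfo >/dev/null 2>&1; then
    show_partition_limits "${PARTITION:-my_partition}"
fi
```

Comparing these numbers with the values in your #SBATCH lines usually shows which request is too large.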
2. The hardware shown on the platform homepage is not always what you can request per node
On the AC platform, having available resources such as 64 CPU cores and 8 accelerator cards does not mean you can request all of that in a single node job. You also need to check what your accessible queue actually supports.
For example, if the queue is listed as 7285-32C-128G-4卡 (卡 means "cards"), then the maximum single-node configuration is 32 cores and 4 cards. In that case, a single-node job can reasonably use 1, 2, 3, or 4 cards.
For multi-node jobs, the layout matters. If you are using 2 nodes, then 8 cards total is usually the sensible choice. If you ask for 5 cards, you may end up needing 5 nodes with 1 card per node, which wastes resources badly.
In practice, it is usually best to test and run with 1, 2, 4, 8, and other powers of two so that the allocated hardware is used more cleanly.
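The arithmetic behind this advice can be sketched in a few lines. The helper below is hypothetical and assumes every node must receive the same number of cards: it picks the largest per-node count, up to the node maximum, that divides the total evenly.

```shell
#!/bin/bash
# Given a total card count and the per-node maximum, print the even layout
# that uses the fewest nodes.
even_layout() {
    local total="$1" max_per_node="$2" per_node
    for (( per_node = max_per_node; per_node >= 1; per_node-- )); do
        if (( total % per_node == 0 )); then
            echo "nodes=$(( total / per_node )) cards_per_node=${per_node}"
            return 0
        fi
    done
}

even_layout 8 4   # nodes=2 cards_per_node=4
even_layout 5 4   # nodes=5 cards_per_node=1
```

With a 4-card node, 8 cards fill two nodes completely, while 5 cards can only be split evenly as one card on each of five nodes.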
3. Odd GPU counts can trigger device ordinal errors
If you run with an odd number of cards, it is easy to hit errors such as:
RuntimeError: HIP/CUDA error: invalid device ordinal
This typically means the program tried to use a device index that does not exist among the devices visible to it, usually because it miscounted the available GPUs.
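A quick diagnostic is to check how many devices the job environment exposes. This sketch assumes the visible devices are controlled through HIP_VISIBLE_DEVICES or CUDA_VISIBLE_DEVICES, which may not be how your launcher or scheduler sets things up; count_visible_devices is a hypothetical helper.

```shell
#!/bin/bash
# Count the devices listed in HIP_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES).
# Prints "unrestricted" when neither variable is set.
count_visible_devices() {
    local list="${HIP_VISIBLE_DEVICES:-${CUDA_VISIBLE_DEVICES:-}}"
    if [ -z "$list" ]; then
        echo "unrestricted"
        return 0
    fi
    local -a devices
    IFS=',' read -r -a devices <<< "$list"
    echo "${#devices[@]}"
}

echo "visible devices: $(count_visible_devices)"
```

Device indices start at 0, so a process that asks for an index equal to or larger than this count will fail with an invalid device ordinal.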
4. CPU requests do not match the job layout
If you get:
More processors requested than permitted
or your job stays in the queue for a long time, a common reason is that the number of CPU cores you requested does not match what the task configuration allows.
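One plausible reading of this mismatch is that tasks per node multiplied by CPUs per task exceeds the cores on a node. The check below is a sketch under that assumption, using the 32-core node from the earlier example; cpu_request_fits is a hypothetical helper.

```shell
#!/bin/bash
# Succeed when tasks_per_node * cpus_per_task fits into cores_per_node.
cpu_request_fits() {
    local tasks_per_node="$1" cpus_per_task="$2" cores_per_node="$3"
    (( tasks_per_node * cpus_per_task <= cores_per_node ))
}

# Example with a 32-core node: 4 tasks x 8 cores fits, 4 tasks x 16 does not.
for cpus in 8 16; do
    if cpu_request_fits 4 "$cpus" 32; then
        echo "4 x ${cpus} fits in 32 cores"
    else
        echo "4 x ${cpus} exceeds 32 cores"
    fi
done
```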
5. Log file paths can also block scheduling
If the default Slurm logs are too messy, you can redirect them from:
#SBATCH -o %j.out
and
#SBATCH -e %j.err
to something like:
#SBATCH -o ./logs/%j.out
and
#SBATCH -e ./logs/%j.err
so the files are written into a dedicated directory.
However, the AC platform can be inconsistent here: sometimes a job remains queued simply because that target directory could not be created. Sometimes it creates the directory automatically, sometimes it does not. Creating the directory yourself in advance is the safer option.
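Creating the directory first takes one line. In this sketch, job.sh is a placeholder for your own job script, and the -o and -e options on the command line stand in for the matching #SBATCH lines.

```shell
#!/bin/bash
# Create the log directory before submitting so Slurm never has to.
mkdir -p logs

# Submit only when Slurm and the (hypothetical) job script are both present.
if command -v sbatch >/dev/null 2>&1 && [ -f job.sh ]; then
    sbatch -o ./logs/%j.out -e ./logs/%j.err job.sh
else
    echo "logs/ is ready; submit your job script with sbatch when available"
fi
```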
6. pip version problems during installation
If pip install fails with:
Local version label can only be used with == or != operators
then downgrade pip with:
pip install pip==24.0
7. Missing shared libraries such as libglog.so or libgalaxyhip.so
If these libraries are missing, check whether your DTK version in the environment variables matches the packages you installed.
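To see exactly which libraries fail to resolve, ldd can be run on the shared object that raises the error. The helper name missing_libs and the example path are hypothetical; use the .so file named in your own error message.

```shell
#!/bin/bash
# Print the names of shared libraries that the dynamic linker cannot resolve.
missing_libs() {
    ldd "$1" 2>/dev/null | awk '/not found/ {print $1}'
}

# Show which DTK module is loaded, if the module command exists here.
if type module >/dev/null 2>&1; then
    module list 2>&1 | grep -i dtk || echo "no DTK module loaded"
fi

# Example (replace the path with the library from your error message):
# missing_libs /path/to/some_extension.so
```

If the list names libglog.so or libgalaxyhip.so, compare the loaded DTK module with the DTK version your wheel was built for.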
Small but useful tricks
Once a job is submitted, if your shell script writes logs, you can monitor progress in real time with:
tail -f logfile
You can also SSH into the compute node and watch accelerator memory usage with:
watch rocm-smi
By default, watch refreshes every 2 seconds. To change the interval, use:
watch -n x rocm-smi
where x is the refresh interval in seconds.
If you forgot the compute node ID, check the nodelist field in squeue.
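The lookup can be sketched with standard squeue format fields. The helper name my_job_nodes is made up for this example, and whether you may SSH into a compute node depends on site policy.

```shell
#!/bin/bash
# Print "<jobid> <state> <nodelist>" for the current user's jobs, without a header.
my_job_nodes() {
    squeue -u "${USER:-$(id -un)}" -h -o "%i %T %N"
}

# Only query the scheduler when squeue is available in this shell.
if command -v squeue >/dev/null 2>&1; then
    my_job_nodes
    # For a running job, log in to one of the listed nodes and monitor it:
    # ssh <node-name>
    # watch -n 5 rocm-smi
fi
```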
And one last reminder: compute time is expensive. If you do not shut things down properly, the bill will teach the lesson for you.