Introduction
Setonix is the most powerful supercomputer in the Southern Hemisphere — and the most unstable one I’ve ever used. It crashes about once a month: firmware updates, Lustre filesystem issues, you name it.
Because it uses AMD Instinct MI250X GPUs, the PyTorch backend is based on ROCm. This setup introduces all kinds of weird problems.
If you’re training deep learning models, stay away from AMD cards. They’ve ruined my youth.
PyTorch Environment: Unable to Install Packages
First, load the module and create a virtual environment:
|
|
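Something along these lines should work; the module name and version below are assumptions on my part, so check `module avail pytorch` for what is actually installed:

```bash
# See which PyTorch modules exist, then load one (version string is a placeholder)
module avail pytorch
module load pytorch/<version>

# Create the virtual environment; --system-site-packages lets it see the
# module's preinstalled packages (drop it if you want a fully isolated venv)
python3 -m venv --system-site-packages venv
```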
After this, a `venv` directory will appear in your current working directory. Go to `venv/bin` and check with `ls -l`. You’ll see that `python3` is linked to the wrong path:
|
|
To fix it:
|
|
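A sketch of the fix, assuming the interpreter you want is the `python3` found on your `PATH` after loading the module:

```bash
cd venv/bin
# Replace the broken symlink with one pointing at the module's interpreter
ln -sf "$(which python3)" python3
ls -l python3   # verify the new target
```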
Now it’s correctly linked:
|
|
The `pip` executable is missing in this environment. Install it manually:
|
|
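One way to bootstrap it, assuming `ensurepip` is available in this Python build (if it isn’t, the official `get-pip.py` installer works too):

```bash
source venv/bin/activate

# Bootstrap pip inside the venv
python3 -m ensurepip --upgrade
python3 -m pip install --upgrade pip

# Fallback if ensurepip is disabled in this build:
# curl -sSL https://bootstrap.pypa.io/get-pip.py | python3
```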
Then install `jupyterlab` and other packages:
|
|
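With pip in place this is the usual routine; the extra packages below are just examples:

```bash
pip install jupyterlab
# plus anything else your project needs, e.g.:
pip install numpy pandas matplotlib
```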
If your home partition quota is full, move `.local` to a project directory:
|
|
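A common way to do this is to move the directory and leave a symlink behind so tools still find `~/.local`; the project path below is a placeholder:

```bash
# Move ~/.local to project storage and symlink it back
mv ~/.local /path/to/your/project/.local
ln -s /path/to/your/project/.local ~/.local
```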
After installing Jupyter, add it to your `PATH` by appending this line to `~/.bashrc`:
|
|
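The line is the standard user-install path export:

```bash
export PATH="$HOME/.local/bin:$PATH"
```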
If Jupyter still fails to start:
|
|
Edit the file `~/.local/bin/jupyter` and change the first line (the shebang) so it points to the Python interpreter in your virtual environment:
|
|
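After the edit, the first line should look something like this (the path is a placeholder for your own venv’s interpreter):

```bash
#!/path/to/your/venv/bin/python3
```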
Now `jupyter lab` should work correctly.
Here’s the SLURM batch script; save it as `batch_script`:
|
|
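A minimal sketch of such a script, assuming a single-GPU job; the partition, account, module version, paths, login hostname, and port are all placeholders you’ll need to adapt:

```bash
#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --partition=gpu          # placeholder: use the GPU partition on your system
#SBATCH --account=your-project   # placeholder: your project/allocation code
#SBATCH --nodes=1
#SBATCH --gres=gpu:1             # placeholder: request however many GPUs you need
#SBATCH --time=04:00:00

# Load the same module the environment was built against
module load pytorch/<version>

# Activate the virtual environment (replace with your own path)
source /path/to/your/venv/bin/activate

# Print the SSH tunnel command into the SLURM output file
node=$(hostname)
port=8888
echo "Run this on your local machine:"
echo "ssh -N -L ${port}:${node}:${port} ${USER}@setonix.pawsey.org.au"

# Start Jupyter Lab on the compute node
jupyter lab --no-browser --ip="${node}" --port="${port}"
```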
Replace the virtual environment path with your own in the `source` command.
Submit the job: `sbatch batch_script`
Check the status: `squeue --me`
Once the job is running, check the `slurm-{task_id}.out` file. It will include output like this:
|
|
Run the `ssh -L` command on your local machine and open the provided URL in your browser. Done!
To cancel the job, run `scancel {task_id}`, or shut down from File > Shut Down in Jupyter Lab.
Running a Model Immediately Fails
|
|
No solution turned up online, and my support ticket got no reply, so I figured it out myself.
It turns out that only some nodes have this issue, so I consider it an environment problem on those nodes.
To avoid these problematic nodes, use the `--exclude` parameter in your `sbatch` command:
|
|
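For example (the node names here are placeholders; use the hostnames from your own failing jobs, which you can see in `squeue` or the SLURM output file):

```bash
sbatch --exclude=nid001234,nid001235 batch_script
```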
Works Fine on NVIDIA, Loss Goes NaN on AMD
While testing an open-source model, I added a simple `nn.Linear` layer and the loss immediately became `NaN`.
The same code worked fine on NVIDIA cards.
I investigated further: I manually set the layer’s weights and checked the output values. It turned out that `nn.Linear` on the AMD card was producing incorrect values.
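The check itself is worth keeping around. This is a minimal sketch of the idea (sizes and values are made up), comparing the layer’s GPU output against a reference computed on the CPU:

```python
import torch
import torch.nn as nn

# ROCm builds of PyTorch also expose the GPU as "cuda"
device = "cuda" if torch.cuda.is_available() else "cpu"

layer = nn.Linear(4, 3, bias=False).to(device)
with torch.no_grad():
    layer.weight.fill_(0.5)    # fix the weights to a known constant

x = torch.randn(8, 4, device=device)

out = layer(x).detach().cpu()                        # what the GPU backend computes
expected = x.cpu() @ layer.weight.detach().cpu().T   # reference computed on the CPU

# On a healthy backend these should match closely
print(torch.allclose(out, expected, atol=1e-5))
```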
By lowering the `batch_size` and `lr`, I found that the bug disappeared when `batch_size <= 2`.
Still puzzled, I later discovered the code enabled AMP (automatic mixed precision). Disabling AMP fixed everything.
Although AMD officially supports AMP, this bug made me too scared to use it in the future.
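For reference, this is the usual way AMP is wired into a PyTorch training loop, with a single switch to turn it off; the model, data, and hyperparameters here are placeholders, not the code I was testing:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = False   # disabling AMP is what made the NaN losses go away for me

model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(32, 16, device=device)
y = torch.randn(32, 1, device=device)

for step in range(10):
    optimizer.zero_grad()
    # autocast runs the forward pass in mixed precision when enabled
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()   # scaling is a no-op when AMP is disabled
    scaler.step(optimizer)
    scaler.update()
```

If the loss only goes to `NaN` with `use_amp = True`, AMP (or a kernel it dispatches to) is the likely culprit.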