Introduction
Setonix is the most powerful supercomputer in the Southern Hemisphere — and the most unstable one I’ve ever used. It crashes about once a month: firmware updates, Lustre filesystem issues, you name it.
Because it uses AMD Instinct MI250X GPUs, the PyTorch backend is based on ROCm. This setup introduces all kinds of weird problems.
If you’re training deep learning models, stay away from AMD cards. They’ve ruined my youth.
PyTorch Environment: Unable to Install Packages
First, load the module and create a virtual environment:
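(the module name and version below are placeholders; check module avail pytorch for what is actually installed)

```bash
# Load the ROCm build of PyTorch provided on the system (placeholder version string)
module load pytorch/<version>

# Create the virtual environment in the current directory
python3 -m venv venv
```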
After this, a venv directory will appear in your current working directory. Go to venv/bin and check with ls -l. You’ll see that python3 is linked to the wrong path:
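(for example; the exact output depends on the module you loaded, the point being that the symlink target is wrong)

```bash
cd venv/bin
ls -l python3
# python3 -> /some/internal/path/bin/python3   (illustrative target only)
```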
To fix it:
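(a sketch: re-point the symlink at the interpreter the loaded module actually provides, checking the target with which python3 first)

```bash
# Replace the broken symlink with one pointing at the module's python3
ln -sf "$(which python3)" venv/bin/python3
```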
Run ls -l again and python3 should now point to the correct interpreter.
The pip executable is missing in this environment. Install it manually:
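(a sketch, run inside the activated venv; ensurepip ships with Python, and get-pip.py is the official fallback)

```bash
source venv/bin/activate

# Bundled installer
python3 -m ensurepip --upgrade

# Fallback if ensurepip is unavailable:
# curl -sS https://bootstrap.pypa.io/get-pip.py | python3
```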
Then install jupyterlab and other packages:
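(for example; swap in whatever packages your project actually needs)

```bash
pip install jupyterlab
pip install numpy pandas matplotlib
```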
If your home partition quota is full, move .local to a project directory:
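(one way to do it: move the directory and leave a symlink behind so anything expecting ~/.local still finds it; the project path is a placeholder)

```bash
mv ~/.local /software/projects/<project>/<username>/.local
ln -s /software/projects/<project>/<username>/.local ~/.local
```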
After installing Jupyter, add it to your PATH by appending this line to ~/.bashrc:
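(the standard PATH extension for ~/.local/bin)

```bash
export PATH="$HOME/.local/bin:$PATH"
```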
If Jupyter still fails to start, the launcher script is probably pointing at the wrong Python interpreter.
Edit the file ~/.local/bin/jupyter and change the first line to point to the Python interpreter in your virtual environment:
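(the shebang should end up looking something like this, with the path replaced by your own venv location)

```bash
#!/path/to/your/venv/bin/python3
```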
Now jupyter lab should work correctly.
Here’s the SLURM batch script, save it as batch_script:
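(a minimal sketch rather than the exact script; the partition, account string, GPU request, walltime, port, and login hostname are assumptions you should adapt to your project)

```bash
#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --partition=gpu          # GPU partition name (assumption; check your site's docs)
#SBATCH --account=<project>-gpu  # placeholder; Setonix GPU allocations use a -gpu suffix
#SBATCH --nodes=1
#SBATCH --gres=gpu:1             # or whatever GPU request syntax your site documents
#SBATCH --time=04:00:00

# Load the same module the venv was built against, then activate the venv
module load pytorch/<version>    # placeholder version
source /path/to/your/venv/bin/activate

# Print the tunnel command so it shows up in the slurm-*.out file
port=8888
echo "On your local machine, run: ssh -N -L ${port}:$(hostname):${port} ${USER}@setonix.pawsey.org.au"

jupyter lab --no-browser --ip="$(hostname)" --port="${port}"
```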
Replace the virtual environment path with your own in the source command.
Submit the job: sbatch batch_script
Check the status: squeue --me
Once it is running, check the slurm-{job_id}.out file: it will contain the ssh -L tunnel command and the Jupyter URL. Run the ssh -L command on your local machine, then open the URL in your browser. Done!
To cancel the job: scancel {job_id}, or shut it down from File > Shut Down in Jupyter Lab.
Running a Model Immediately Fails
The job crashed as soon as the model started running. I found no solution online and got no reply to my support ticket, so I figured it out myself. It turns out only some nodes have this issue, so I treat it as an environment problem rather than a bug in my code.
To avoid these problematic nodes, use the --exclude parameter in your sbatch command:
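(for example; the node names are placeholders, so exclude whichever nodes your failing jobs landed on, as shown by squeue or sacct)

```bash
sbatch --exclude=nid001234,nid001236 batch_script
```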
Works Fine on NVIDIA, Loss Goes NaN on AMD
While testing an open-source model, I found that adding a simple nn.Linear layer caused the loss to become NaN immediately.
Same code worked fine on NVIDIA cards.
I investigated further: I manually set the layer's weights and checked the output values, and it turned out that nn.Linear on the AMD cards was producing incorrect results.
By lowering the batch_size and lr, I found the bug disappeared when batch_size <= 2.
Still puzzled, I later discovered the code enabled AMP (automatic mixed precision). Disabling AMP fixed everything.
Although AMD officially supports AMP, this bug made me too scared to use it in the future.