Mounting Large Dataset Archives on Supercomputers

Background

I recently worked with the AffectNet dataset, which contains over a million files. I was using the Setonix supercomputer, which utilises the Lustre file system. A temporary storage partition at /scratch was provided, but it had a file count limit of 1 million. When I tried to extract the dataset, I ran into errors due to exceeding this limit.
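To see in advance whether an archive would blow past a file count limit, you can count its entries without extracting anything. This is a minimal sketch: `count_tar_entries` is a helper name made up for this post, and it relies only on the standard `tar -t` listing mode.

```shell
# Count the entries in a tar archive without extracting it.
# Note: directories stored in the archive are listed as entries too,
# so the result can be slightly higher than the number of regular files.
count_tar_entries() {
    tar -tf "$1" | wc -l
}
```

For example, `count_tar_entries train_set.tar` prints the number of entries. On Lustre, `lfs quota` reports current usage against the limits, though the exact flags depend on how the site assigns quotas.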

The standard solution is to submit a ticket requesting a quota increase. I did that, but as anyone familiar with Australian bureaucracy might expect, it was slow: I submitted the request on Monday and only got approval on Friday, after a few rounds of back-and-forth questions.

Rather than wait on someone else, I looked for a workaround I could apply myself.

Concept

The dataset was packaged as a tar archive. Strictly speaking, a .tar file is an archive rather than a compressed file, so I’ll call it an “archive” throughout. It turns out that such archives can be mounted directly as directories without extracting them.

The key technology here is FUSE (Filesystem in Userspace), which lets users mount filesystems without root privileges. After some research, I found a Python tool called ratarmount that uses FUSE to mount archives.

Step-by-Step Guide

Since I’m on a supercomputer, I first need to load the module that provides Python, then activate my virtual environment:

module load pytorch/2.2.0-rocm5.7.3
source /home/liyumin/software/venv/bin/activate

If you already have access to Python without loading modules, you can skip the steps above.

Next, install ratarmount using pip:

pip install ratarmount
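Before mounting, it can help to confirm that both tools used in this guide are actually on the PATH. The helper below is a small sketch; `require_cmd` is a name invented here and is not part of ratarmount.

```shell
# Return 0 if the named command is on PATH; otherwise report it and return 1.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing: $1" >&2
    return 1
}
```

A typical check would be `require_cmd ratarmount && require_cmd fusermount`. If the second one fails, FUSE may not be available on that node.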

Once installed, you can mount the archive with the following command:

ratarmount train_set.tar train_set

This mounts the train_set.tar archive at the train_set directory, which ratarmount creates if it does not already exist.

You might encounter an error like the following:

Created mount point at: /scratch/pawsey1001/liyumin/datasets/AffectNet8/train_set
fusermount: mounting over filesystem type 0x0bd00bd0 is forbidden

This means fusermount refuses to mount on top of that file system: 0x0bd00bd0 is the type code for Lustre, which /scratch uses. You can resolve this by choosing a mount point on a different file system, such as the /tmp directory:

ratarmount train_set.tar /tmp/train_set
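You can check up front whether a candidate mount point sits on Lustre, instead of finding out from the fusermount error. This sketch assumes GNU coreutils `stat`, whose `-f -c '%t'` prints the filesystem type in hex without the `0x` prefix or leading zeros, so the code in the error above appears as `bd00bd0`. The function names are made up for this post.

```shell
# Print the filesystem type of a path as hex (GNU stat; no "0x" prefix).
fs_type_hex() {
    stat -f -c '%t' "$1"
}

# Return 0 if the path is on Lustre, whose type code is 0x0bd00bd0.
is_lustre() {
    [ "$(fs_type_hex "$1")" = "bd00bd0" ]
}
```

For example, `is_lustre /scratch && echo "choose a mount point elsewhere"` would print the hint on a Lustre-backed /scratch.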

If there are no errors, the mount was successful. You can now access the contents of train_set.tar simply by browsing the /tmp/train_set directory, with no extraction needed.
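To double-check that the directory really is a live mount before pointing a training job at it, something like the following can be used. It is a sketch that assumes the `mountpoint` utility from util-linux is installed. `is_mounted` and `preview_mount` are hypothetical helper names.

```shell
# Return 0 if the directory is currently a mount point (util-linux mountpoint).
is_mounted() {
    mountpoint -q "$1"
}

# Show the first few entries of a mounted archive (default: 5).
preview_mount() {
    if is_mounted "$1"; then
        ls "$1" | head -n "${2:-5}"
    else
        echo "not mounted: $1" >&2
        return 1
    fi
}
```

Running `preview_mount /tmp/train_set` lists a handful of entries if the mount is active, and reports an error otherwise.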

When you’re done, don’t forget to unmount the directory:

fusermount -u /tmp/train_set
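In a batch job it is easy to forget the unmount, especially when the job fails halfway. One way to handle this is a wrapper that mounts, runs a command, and unmounts whatever the outcome. This is only a sketch under a few assumptions: `with_mounted_archive` is a made-up name, ratarmount is assumed to return once the mount is ready (its default background behaviour), and the `MOUNT_CMD` and `UNMOUNT_CMD` variables are hypothetical overrides added here so the wrapper can be dry-run without FUSE.

```shell
# Mount an archive, run a command against it, and unmount afterwards,
# even if the command fails. Returns the command's exit status.
# MOUNT_CMD / UNMOUNT_CMD are hypothetical overrides for this sketch.
with_mounted_archive() {
    local archive=$1 mount_dir=$2
    shift 2
    mkdir -p "$mount_dir"
    "${MOUNT_CMD:-ratarmount}" "$archive" "$mount_dir" || return 1
    local status=0
    "$@" || status=$?          # run the user's command, remember its status
    "${UNMOUNT_CMD:-fusermount}" -u "$mount_dir"
    return "$status"
}
```

Usage might look like `with_mounted_archive train_set.tar /tmp/train_set ls /tmp/train_set`, with `ls` replaced by whatever the job actually runs.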