Phenomenon
During neural network training, the training accuracy (Train ACC) starts off very high, but gradually decreases. See the example below:
[Figure: Train ACC and Val ACC curves over training epochs]
The key issue: Train ACC starts off very high while Val ACC is very low, and as epochs increase, Train ACC decreases while Val ACC stays almost unchanged.
Debugging Process
I first removed the data augmentation part of the code, but the problem remained. Since I was using distributed training, I also tried running with only one process, but the issue was still there.
Finally, I dug out my old single-machine training code and debugged it piece by piece. In the end, I found the root cause: the `DataLoader` had `shuffle=False`. Once I enabled shuffling, training became normal.
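For reference, a minimal single-machine loader with shuffling enabled might look like this (the dataset here is a random placeholder standing in for the real one):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random placeholder data standing in for the real dataset.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                        torch.randint(0, 10, (1000,)))

# shuffle=True reorders the samples every epoch; leaving it False was the bug.
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)
```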
Why was shuffle disabled before? I can't really remember… In distributed training, shuffling is handled by the `DistributedSampler` rather than by the `DataLoader` (the `sampler` and `shuffle` options are mutually exclusive, so the loader's `shuffle` must stay `False`), and the sampler itself can be constructed with `shuffle=False`.
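A minimal sketch of that setup (the dataset is again a random placeholder, and `num_replicas`/`rank` are passed explicitly only so the sketch runs outside an actual distributed launch; in a real job they are inferred from the initialized process group):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Random placeholder data standing in for the real dataset.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                        torch.randint(0, 10, (1000,)))

# shuffle=False here reproduces the bug; shuffle=True is the fix.
sampler = DistributedSampler(dataset, num_replicas=1, rank=0, shuffle=True)

# The DataLoader must NOT also set shuffle=True once a sampler is given;
# PyTorch rejects the combination.
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reseeds the shuffle order each epoch
    for images, labels in loader:
        pass  # training step goes here
```

Note that `set_epoch` must be called at the start of every epoch; otherwise the sampler reuses the same ordering even with `shuffle=True`.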