Phenomenon
During neural network training, the training accuracy (Train ACC) starts off very high and then gradually decreases. See the example below:
[Figure: Train ACC and Val ACC curves over epochs]
The key symptom: Train ACC starts off very high while Val ACC is very low, and as the epochs go on, Train ACC decreases while Val ACC stays almost flat.
Debugging Process
I first removed the data augmentation part of the code, but the problem remained. Since I was using distributed training, I also tried running with only one process; the issue was still there.
Finally, I dug out my old single-machine training code and compared the two part by part. In the end, I found the root cause: the DataLoader had been created with shuffle=False. Once I enabled shuffling, training returned to normal.
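Why would shuffle=False produce this exact curve? A plausible explanation (assuming the dataset is stored grouped by class, which is common) is that each unshuffled batch is dominated by a single class, so per-batch training accuracy looks great while the model generalizes to nothing. The fix itself is one flag; here is a minimal sketch with a dummy dataset standing in for the real one:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset as a placeholder for the real training set.
train_dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 10, (1000,)))

# The one-line fix: shuffle=True. With shuffle=False, batches arrive in
# dataset storage order, so a class-sorted dataset yields near-single-class batches.
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
```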
Why was shuffle disabled in the first place? I can’t really remember… The likely reason is the distributed setup: when a DistributedSampler is used, shuffling is delegated to the sampler, so the DataLoader itself must be created with shuffle=False (in PyTorch, passing a sampler together with shuffle=True is an error). The code looks roughly like this:
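The original snippet did not survive formatting; below is a minimal sketch of the usual DistributedSampler pattern, assuming a standard PyTorch DDP setup (the dataset, batch size, and epoch count are placeholders, not values from the original code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Dummy data; the real script builds its own Dataset.
dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 10, (1000,)))

# In a real DDP run, initialize the process group first (e.g. under torchrun)
# and drop num_replicas/rank so the sampler reads them from the group; they
# are given explicitly here only so the sketch runs standalone.
sampler = DistributedSampler(dataset, num_replicas=1, rank=0, shuffle=True)

# shuffle stays False on the DataLoader itself: a sampler and shuffle=True
# are mutually exclusive, and shuffling is now the sampler's job.
loader = DataLoader(dataset, batch_size=64, sampler=sampler, shuffle=False)

for epoch in range(3):  # placeholder epoch count
    # Without set_epoch, every epoch replays the same shuffled order.
    sampler.set_epoch(epoch)
    for features, labels in loader:
        pass  # training step goes here
```

The trap is that the shuffle=False on the DataLoader is only safe while the sampler is doing the shuffling; copy that line into a single-machine script without the sampler and the data stops being shuffled at all.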