
Conversation

@filipekstrm (Collaborator)

Trying to make training faster by improving dataloading. This has been developed on an RTX 3090 and an A100, so the observations below come from those systems:

  • Added an argument --num_workers which is passed to the dataloaders. The default is 0 (also the default in DataLoader, and hence what is used currently), but increasing it can make training substantially faster
  • Added an argument --pin_memory which is passed to the dataloaders. The default is True, which is the opposite of the PyTorch default
  • Added an argument --persistent_workers which is passed to the dataloaders if --num_workers > 0. The default is True, which is the opposite of the PyTorch default
  • Creating a new instance of Data when getting the data (i.e., in the get method of WyckoffDataset). This instance contains only the bare minimum of information (e.g., it does not include the matrix x, as that is not used); see the sketch after this list
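
For concreteness, here is a rough sketch of what these changes could look like. This is not the actual WyckoffDiff code: the dataset fields (wyckoff_species, edge_index, y) and the build_dataloader helper are placeholders. It only illustrates forwarding the dataloader options and a get method that builds a stripped-down Data object without the unused x matrix.

```
# Rough sketch only -- field names and dataset internals are placeholders,
# not the actual WyckoffDiff implementation.
from torch_geometric.data import Data, Dataset
from torch_geometric.loader import DataLoader


class WyckoffDataset(Dataset):
    def __init__(self, entries):
        super().__init__()
        self.entries = entries  # assumed: a list of preprocessed structures

    def len(self):
        return len(self.entries)

    def get(self, idx):
        entry = self.entries[idx]
        # Build a fresh Data object containing only the fields the model consumes;
        # the unused matrix x is deliberately left out.
        return Data(
            wyckoff_species=entry["wyckoff_species"],  # placeholder field
            edge_index=entry["edge_index"],            # placeholder field
            y=entry["y"],                              # placeholder field
        )


def build_dataloader(dataset, batch_size, num_workers=0, pin_memory=False,
                     persistent_workers=False):
    kwargs = dict(batch_size=batch_size, shuffle=True,
                  num_workers=num_workers, pin_memory=pin_memory)
    if num_workers > 0:
        # persistent_workers is only valid when worker processes exist
        kwargs["persistent_workers"] = persistent_workers
    return DataLoader(dataset, **kwargs)
```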

I am a little unsure about using True as the default for --pin_memory and --persistent_workers, since False is the default in PyTorch. On the other hand, they do help speed up training in our use case, so I think True is reasonable in our codebase. For --num_workers, the suitable value is system-specific, so I left 0 as the default; I did, however, include it in the training command example in the README, together with a note.

Base automatically changed from docstring_fix to main September 4, 2025 07:46
@oke464 (Collaborator) left a comment


After discussions with @filipekstrm, we think setting the parameters persistent_workers, pin_memory, and num_workers to the PyTorch defaults and adding a description in the README might be the best option. @rartino, what do you think?

"epochs": 1000,
"val_interval": 1,
"num_workers": 0,
"pin_memory": True,

Suggested change
"pin_memory": True,
"pin_memory": False,

"val_interval": 1,
"num_workers": 0,
"pin_memory": True,
"persistent_workers": True,

Suggested change
"persistent_workers": True,
"persistent_workers": False,

Warning: using logger ```none``` will not save any checkpoints (or anything else), but can be used for, e.g., debugging.

This command will use the default values for all other parameters, which are the ones used in the paper.
This command will use the default values for all other parameters, which are the ones used in the paper. **Note: It is not strictly necessary to set ```num_workers```, and if not, it will default to 0. However, in our experience, increasing it can substantially speed up training.**

Suggested change
This command will use the default values for all other parameters, which are the ones used in the paper. **Note: It is not strictly necessary to set ```num_workers```, and if not, it will default to 0. However, in our experience, increasing it can substantially speed up training.**
This command will use the default values for all other parameters, which are the ones used in the paper. **Note: It is not strictly necessary to set ```num_workers```, ```persistent_workers```, or ```pin_memory```. However, in our experience, increasing ```num_workers``` and setting ```persistent_workers=True``` and ```pin_memory=True``` can substantially speed up training.** The optimal ```num_workers``` value depends on your system; we have used the maximum value suggested by the PyTorch warning.

To train a WyckoffDiff model on WBM, a minimal example is
```
python main.py --mode train_d3pm --d3pm_transition [uniform/marginal/zeros_init] --logger [none/model_only/local_only/tensorboard/wandb]
python main.py --mode train_d3pm --d3pm_transition [uniform/marginal/zeros_init] --logger [none/model_only/local_only/tensorboard/wandb] --num_workers [NUM_WORKERS]

Suggested change
python main.py --mode train_d3pm --d3pm_transition [uniform/marginal/zeros_init] --logger [none/model_only/local_only/tensorboard/wandb] --num_workers [NUM_WORKERS]
python main.py --mode train_d3pm --d3pm_transition [uniform/marginal/zeros_init] --logger [none/model_only/local_only/tensorboard/wandb] --num_workers [NUM_WORKERS] --persistent_workers [True/False] --pin_memory [True/False]
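
For reference, here is a minimal sketch of how these command-line options could be parsed and forwarded to the dataloader, assuming an argparse-based main.py; the str2bool helper is illustrative, not the actual implementation. Note that persistent_workers is only forwarded when num_workers > 0, since PyTorch rejects it otherwise.

```
# Illustrative sketch only; the real main.py argument handling may differ.
import argparse


def str2bool(value):
    # Allow flags of the form --pin_memory True / --pin_memory False.
    return str(value).lower() in ("1", "true", "yes")


parser = argparse.ArgumentParser()
parser.add_argument("--num_workers", type=int, default=0)
parser.add_argument("--pin_memory", type=str2bool, default=False)
parser.add_argument("--persistent_workers", type=str2bool, default=False)
args = parser.parse_args()

loader_kwargs = {"num_workers": args.num_workers, "pin_memory": args.pin_memory}
if args.num_workers > 0:
    # DataLoader raises an error if persistent_workers=True with num_workers=0.
    loader_kwargs["persistent_workers"] = args.persistent_workers
```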
