- Project Overview
- Project Hierarchy
- Project Prerequisites
- Setting Up the Datasets
- NPC Attribute Utilities
- Publications
- Acknowledgements
- License
- Contact
This codebase provides utilities for processing and manipulating datasets for the Neural Probabilistic Circuit (NPC) project. The NPC project focuses on the following four datasets:
- Animals with Attributes 2 (AwA2)
- CelebFaces Attributes (CelebA)
- German Traffic Sign Recognition Benchmark (GTSRB)
- Modified National Institute of Standards and Technology (MNIST)
Scripts are provided for each dataset to process and organize them into the format and structure required by the NPC pipeline. For the NPC project, the MNIST dataset is further processed into the MNIST Addition dataset, as described in this paper.
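As a rough illustration of the idea (the actual processing script may differ), MNIST Addition pairs up digit images and labels each pair with the sum of its two digits. The helper name and pairing strategy below are illustrative assumptions, not the project's code:

```python
import random

def make_mnist_addition(samples, seed=0):
    """Pair up (image, digit) samples; label each pair with the digit sum.

    `samples` is a list of (image, digit) tuples. The pairing strategy
    (shuffle, then walk two at a time) is an illustrative choice only.
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    pairs = []
    # Take the shuffled list two at a time; a trailing odd sample is dropped.
    for (img1, d1), (img2, d2) in zip(shuffled[::2], shuffled[1::2]):
        pairs.append(((img1, img2), d1 + d2))
    return pairs

# Tiny demo with placeholder "images" (None stands in for pixel arrays).
demo = [(None, d) for d in [3, 5, 1, 9]]
pairs = make_mnist_addition(demo)
```

Each resulting label lies in 0–18, and downstream components only ever see the two images and the sum, never the individual digit labels.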
Certain datasets, such as GTSRB, lack attribute labels. To address this, the codebase includes the NPC Attribute Utilities, which allow users to create and examine attribute labels for any dataset. A verification script is also included to check the dataset configuration and the consistency and correctness of user-created labels.
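Conceptually, the verification step amounts to checks like the following sketch. The function name, arguments, and binary-label assumption are illustrative; the project's actual script may check more (for example, file paths and parameters in header.py):

```python
def verify_attribute_labels(class_names, attribute_names, matrix):
    """Check a class-by-attribute label matrix for shape and value errors.

    Illustrative only: `matrix` is assumed to hold one row per class and
    one binary (0/1) entry per attribute.
    """
    errors = []
    if len(matrix) != len(class_names):
        errors.append(f"expected {len(class_names)} rows, got {len(matrix)}")
    for cls, row in zip(class_names, matrix):
        if len(row) != len(attribute_names):
            errors.append(
                f"{cls}: expected {len(attribute_names)} columns, got {len(row)}"
            )
        for attr, value in zip(attribute_names, row):
            if value not in (0, 1):  # binary attribute labels assumed
                errors.append(f"{cls}/{attr}: non-binary value {value!r}")
    return errors

# A well-formed 2-class, 2-attribute example produces no errors.
errs = verify_attribute_labels(
    ["stop", "yield"], ["red", "octagon"], [[1, 1], [1, 0]]
)
```

Running checks like these before training catches labeling mistakes early, before they propagate into the PC datasets.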
In addition, utility scripts are available for splitting datasets into training, validation, and testing subsets, and for generating Probabilistic Circuit (PC) datasets from those splits. These PC datasets are then used by the learnspn project to construct and generate PCs.
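A hedged sketch of such a split follows; the ratios, seed handling, and helper name are assumptions (the project's scripts likely read these settings from header.py instead of arguments):

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into train/val/test by ratio.

    Illustrative helper: deterministic given `seed`, so repeated runs
    produce the same split for reproducibility.
    """
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
```

Fixing the seed keeps the train/validation/test partition stable across runs, which matters because the PC datasets generated from these splits must stay aligned with them.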
This project is part of the NPC pipeline. To ensure compatibility and maintain consistent references across the pipeline, organize the project directories as follows:
npc
├── datasets
│ ├── awa2
│ ├── celeba
│ ├── gtsrb
│ └── mnist
├── learnspn
├── npc-dataset-utils
├── npc-models
└── venv
All subsequent instructions assume the above project hierarchy.
The npc/datasets directory does not exist by default. Create the initial directory structure as follows:
cd npc
mkdir -pv datasets/awa2 datasets/celeba datasets/gtsrb datasets/mnist
Before running the project, review header.py and ensure that all relevant parameters are set to the desired values. More detailed instructions on certain parameters are provided in later sections.
This project requires the following system packages:
Ubuntu:
apt install libgl1-mesa-dev python3.10 python3-venv unzip
Arch Linux:
yay -S mesa python310 unzip
This project was developed on Ubuntu and tested on both Ubuntu and Arch Linux. Other Linux distributions, macOS, or Windows Subsystem for Linux (WSL) may also work with additional setup. However, these platforms are not officially supported.
This project is designed to run within a simple Python virtual environment. Create and activate the environment as follows:
cd npc
deactivate
python3.10 -m venv venv
source venv/bin/activate
python3.10 -m pip install -r npc-dataset-utils/requirements.txt
Always ensure the virtual environment is activated before running the project.
The NPC pipeline relies on multiple datasets, each requiring setup steps to ensure compatibility and reproducibility. To clarify this process, dedicated setup guides are provided for each dataset, covering directory structures, file handling, processing, configuration, and PC dataset generation.
Detailed instructions for setting up and organizing dataset contents within npc/datasets are provided in the following documents:
- Animals with Attributes 2 (AwA2)
- CelebFaces Attributes (CelebA)
- German Traffic Sign Recognition Benchmark (GTSRB)
- Modified National Institute of Standards and Technology (MNIST)
The NPC Attribute Utilities are a set of Qt-based graphical tools designed for labeling and examining dataset attributes within the NPC pipeline. These utilities are especially important for datasets that lack native attribute annotations, such as GTSRB, enabling users to define, edit, and review attributes in a consistent format that downstream components expect.
Together, the utilities streamline both the creation of new attribute labels and the examination of labeled data across classes, helping ensure the correctness, consistency, and validity of the attribute labels created for datasets used in the NPC project.
Detailed usage instructions are available in the following documents:
Preview of the NPC attribute utilities:
If you use this project, please cite the relevant publications listed below:
@article{chen2025neural,
title={Neural probabilistic circuits: Enabling compositional and interpretable predictions through logical reasoning},
author={Chen, Weixin and Yu, Simon and Shao, Huajie and Sha, Lui and Zhao, Han},
journal={arXiv preprint arXiv:2501.07021},
year={2025}
}
@inproceedings{chenneural,
title={Neural Probabilistic Circuits: An Overview},
author={Chen, Weixin and Yu, Simon and Shao, Huajie and Sha, Lui and Zhao, Han},
booktitle={Eighth Workshop on Tractable Probabilistic Modeling}
}
@inproceedings{xian2017zero,
title={Zero-shot learning--the good, the bad and the ugly},
author={Xian, Yongqin and Schiele, Bernt and Akata, Zeynep},
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
pages={4582--4591},
year={2017}
}
@inproceedings{liu2015deep,
title={Deep learning face attributes in the wild},
author={Liu, Ziwei and Luo, Ping and Wang, Xiaogang and Tang, Xiaoou},
booktitle={Proceedings of the IEEE international conference on computer vision},
pages={3730--3738},
year={2015}
}
@article{stallkamp2012man,
title={Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition},
author={Stallkamp, Johannes and Schlipsing, Marc and Salmen, Jan and Igel, Christian},
journal={Neural networks},
volume={32},
pages={323--332},
year={2012},
publisher={Elsevier}
}
@article{lecun2010mnist,
title={MNIST handwritten digit database},
author={LeCun, Yann and Cortes, Corinna and Burges, Chris and others},
year={2010},
publisher={Florham Park, NJ, USA}
}
@article{manhaeve2018deepproblog,
title={Deepproblog: Neural probabilistic logic programming},
author={Manhaeve, Robin and Dumancic, Sebastijan and Kimmig, Angelika and Demeester, Thomas and De Raedt, Luc},
journal={Advances in neural information processing systems},
volume={31},
year={2018}
}
Special thanks to Rahim Khan, Tommy Tang, Alex Tanthiptham, and Trusha Vernekar for their contributions to the implementation, testing, and experiments involved in this project.
This codebase is released under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license; see LICENSE for the full text.
For questions, feedback, or comments, open an issue or reach out to Simon Yu.
Written by Simon Yu.

