PROV-IO is an I/O-centric provenance management framework for scientific data. It provides an interface for data provenance tracking and stores provenance as RDF triples. PROV-IO data model follows W3C PROV-DM and is an extension of it. PROV-IO has been integarted with HDF5 vol-provenance connector to track provenance of scientific data in HDF5 applications. PROV-IO is tested on Ubuntu 18.04 and Cray Linux.
Please cite the following paper if your project uses PROV-IO:  
PROV-IO: An I/O-Centric Provenance Framework for Scientific Data on HPC Systems (HPDC'22) [Bibtex] 
Other pulications:  
Towards A Practical Provenance Framework for Scientific Data on HPC Systems (poster@FAST'22) 
PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems (preprint)
The easiest way of trying out PROV-IO is through docker. PROV-IO docker image is available now at rzhan/prov-io. The image is based on Debian 11 with Python 3.9 installed. Download the basic PROV-IO docker image:
docker pull rzhan/prov-io:1.0
We also publish the docker image of Megatron-LM instrumented with PROV-IO as an example use case. Download the instrumented Megatron-LM docker image:
docker pull rzhan/prov-io:megatron-lm
This section is for building PROV-IO from scratch.
PROV-IO library needs to be built with libtool. Install it by: 
sudo apt-get install -y gcc make
sudo apt-get install -y autoconf automake libtool pkg-config gtk-doc-tools 
PROV-IO's RDF schema is based on Redland librdf (including raptor2-2.0.15, rasqal-0.9.33, librdf-1.0.17) and its Python binding (redland-bindings-1.0.17.1). Install their dependencies first: 
sudo apt-get install -y libltdl-dev libxml2 libxml2-dev flex bison swig uuid uuid-dev
We provide specific releases of librdf at: https://github.com/hpc-io/prov-io/tree/master/packages. Unzip and install them in the sequence of raptor2-2.0.15->rasqal-0.9.33->librdf-1.0.17->redland-bindings-1.0.17.1. 
For example, install raptor2-2.0.15 and export path:
cd raptor2-2.0.15
./autogen.sh
./configure --prefix=<your_prov_io_path>/lib/lib-raptor
make && make install
export PKG_CONFIG_PATH=<your_prov_io_path>/lib/lib-raptor/lib/pkgconfig:$PKG_CONFIG_PATH
Then, install rasqal-0.9.33 and librdf-1.0.17 using similar commands with correct path. 
Finally, install the Python binding (redland-bindings-1.0.17.1):
./autogen.sh
./configure --with-python 
make && make install
PROV-IO Python library tracks provenance information defined in PROV-IO Extensible class. Follow instructions in /python to use it.
PROV-IO C library tracks low level I/O information. Build PROV-IO C library and export path:
cd c/provio
make
export LD_LIBRARY_PATH=<your_prov_io_path>/c/provio
To run a basic PROV-IO test, in the same directory:
export PROVIO_CONFIG=<your_prov_io_path>/doc/example_config/prov.cfg
./provio_test
Check out the provenance file (prov.turtle) and stat file (prov.stat) generated by PROV-IO.
PROV-IO HDF5 Lib Connector is used to track HDF5 I/O. Follow instructions to build it:
- Build and install HDF5 (provided in the repo: https://github.com/hpc-io/prov-io/tree/master/packages).
- Make sure HDF5 is correclty installed and its path is correctly exported, and build HDF5 vol-connector (dynamic library) instrumented with PROV-IO:
cd c/vol-provenance
make
- To redirect HDF5 I/Os to provenance vol-connector, set these environment variables:
export HDF5_VOL_CONNECTOR="provenance under_vol=0;under_info={};path=<trace_file_path>/my_trace.log;level=2;format="           
export HDF5_PLUGIN_PATH=<hdf5_vol_connector_path>                                                                                     
Note: HDF5_VOL_CONNECTOR contains the original provenance file (plain text) configurations of HDF5 provenance vol-connector. PROV-IO configuration is in it's own configuration file under $PROVIO_CONFIG. <hdf5_vol_connector_path> is the path that holds libh5prov.so.
- Run a testcase application (VPIC) under the same directory:
./vpicio_uni_h5.exe ./my_data.dat 2 2 1 ./my_trace.log
You may compare the default plain text provenance file generated by vol-connector with the RDF provenance file generated by PROV-IO.  
More provenance traces tracked from real workflows in paper are provided at: https://github.com/hpc-io/prov-io/tree/master/example_provenance.
PROV-IO Syscall wrapper is developed to track high frequency POSIX I/O APIs such as open,write,fsync,etc. It's based on LLNL's GOTACH project. 
Syscall Wrapper is still under testing, stay tuned!
Check out query engine at: https://github.com/hpc-io/prov-io/tree/master/user_engine/query.
Check out visualizer at: https://github.com/hpc-io/prov-io/tree/master/user_engine/visualizer. An example of visualized RDF provenance is also given.
If you run into issues when using PROV-IO, please email me at hanrz AT iastate DOT edu. I'm more than happy to help.