Machine learning frameworks such as PyTorch, TensorFlow, and JAX are designed to scale numerical computations and data processing from single- to multi-GPU and distributed computing environments, while abstracting much of the underlying complexity of parallelisation. Amid the growing use of machine learning in traditional high-performance computing application areas, for example protein modeling and materials discovery, researchers and practitioners increasingly employ these frameworks to integrate ML algorithms into their scientific computing applications, such as those involving the solution of partial differential equations or eigenvalue problems.
Linear algebra operations, which are the building blocks of countless computational problems, often constitute major performance bottlenecks. The HPC community invests significant effort in developing architecture-specific optimised kernels, such as those provided by the BLAS and LAPACK libraries, to accelerate these operations. Thanks to high-level frameworks, users no longer need to engage in the error-prone and time-consuming process of directly invoking such kernels; instead, they can express computations using a high-level, mathematics-like syntax that closely mirrors textbook notation, without having to manage the low-level implementation details themselves. However, it is not guaranteed that these frameworks will automatically exploit knowledge of linear algebra to accelerate computations in the way expert-tuned implementations would.
This benchmark consists of several test expressions, categorised into the following five test sets, that evaluate the linear algebra awareness of the high-level Python interfaces of a given installation of PyTorch, TensorFlow, or JAX.
Test-set 1: Matrix multiplications. Matrix-matrix multiplications are among the most expensive operations in numerical computing. The execution times of matrix multiplications invoked through the high-level APIs of the frameworks are compared against those of an optimised reference implementation. For general dense matrices, execution time is compared with the BLAS kernel gemm. For triangular matrices and symmetric rank-k updates, which require only about half the floating-point operations of gemm, execution speed is compared against the specialised BLAS kernels trmm and syrk, respectively. For multiplication with a tri-diagonal matrix (where only a small fraction of the matrix elements are non-zero), the execution can be orders of magnitude faster than a multiplication that assumes general dense matrices. Here, the execution times are compared against those of a reference implementation that uses a compressed sparse row format.
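As an illustration of the tri-diagonal case, the sketch below (using NumPy and SciPy for illustration only; it is not part of the benchmark itself) contrasts a dense product with a multiplication that exploits the band structure via a compressed sparse row format:

```python
import numpy as np
from scipy.sparse import diags

n = 500
rng = np.random.default_rng(0)

# Build a tri-diagonal matrix, both as a dense array and in CSR format.
main, off = rng.random(n), rng.random(n - 1)
T_dense = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
T_sparse = diags([off, main, off], offsets=[-1, 0, 1], format="csr")

B = rng.random((n, n))

# The dense product performs O(n^3) work; the sparse product only O(n^2),
# since each row of T has at most three non-zeros.
C_dense = T_dense @ B
C_sparse = T_sparse @ B

assert np.allclose(C_dense, C_sparse)
```

Both products yield the same result, but only the sparse variant avoids the cubic cost of a general dense multiplication.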
Test-set 2: Common sub-expression elimination. In general, sub-expressions within an expression that evaluate to the same value can be computed once and reused via a temporary reference in subsequent instances. For example, in an expression such as `A.T @ A @ B + A.T @ A @ C`, the common sub-expression `A.T @ A` needs to be evaluated only once. This test evaluates whether the frameworks detect and eliminate such redundant computations.
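A minimal NumPy sketch of manual common sub-expression elimination (the operands `A`, `B`, `C` are hypothetical, chosen only to illustrate the idea):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((300, 200))
B = rng.random((200, 50))
C = rng.random((200, 50))

# Naive evaluation computes the product A.T @ A twice.
naive = A.T @ A @ B + A.T @ A @ C

# With CSE, the common sub-expression is evaluated once and reused.
G = A.T @ A
cse = G @ B + G @ C

assert np.allclose(naive, cse)
```

The benchmark checks whether a framework performs this reuse automatically when the expression is written in its naive form.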
Test-set 3: Matrix chains. Given a product of three or more matrices, the result is the same for any parenthesisation, but the number of floating-point operations, and hence the execution time, can differ by orders of magnitude depending on the order of evaluation. This test evaluates whether the frameworks select an efficient parenthesisation.
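The effect of parenthesisation can be sketched with a matrix-matrix-vector chain in NumPy (illustrative dimensions only):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((1000, 1000))
B = rng.random((1000, 1000))
x = rng.random((1000, 1))

# Left-to-right evaluation forms the O(n^3) product A @ B first.
left = (A @ B) @ x

# Right-to-left evaluation needs only two O(n^2) matrix-vector products.
right = A @ (B @ x)

assert np.allclose(left, right)
```

NumPy's `np.linalg.multi_dot` chooses an efficient evaluation order for such chains automatically; the benchmark probes whether the frameworks do the same for naively written chains.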
Test-set 4: Expression rewrites. Algebraic properties can be used to rewrite an expression in several alternative ways, some of which result in accelerated evaluations. For example, an expression of the form `inv(A) @ b` can be evaluated by solving the linear system `A x = b` instead of explicitly forming the inverse, which is both faster and numerically more stable.
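A sketch of one such rewrite in NumPy (the well-conditioned matrix `A` and vector `b` are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
# Diagonal shift keeps A well-conditioned so both variants agree closely.
A = rng.random((500, 500)) + 500 * np.eye(500)
b = rng.random(500)

# Naive: explicitly form the inverse, then multiply.
x_inv = np.linalg.inv(A) @ b

# Rewrite: solve the linear system directly.
x_solve = np.linalg.solve(A, b)

assert np.allclose(x_inv, x_solve)
```

The benchmark measures whether the frameworks apply such rewrites when the user writes the naive variant.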
Test-set 5: Code motion. Some operations, when moved within the program, can yield improved performance. This test evaluates whether the frameworks can identify loop-invariant code and partial operand accesses to improve performance.
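Loop-invariant code motion can be sketched in NumPy as follows (operand sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((200, 200))
B = rng.random((200, 200))
xs = [rng.random(200) for _ in range(10)]

# Naive loop: the invariant product A @ B is recomputed in every iteration.
ys_naive = [(A @ B) @ x for x in xs]

# Hoisted loop: the invariant product is computed once, outside the loop.
AB = A @ B
ys_hoisted = [AB @ x for x in xs]

assert all(np.allclose(a, b) for a, b in zip(ys_naive, ys_hoisted))
```

The benchmark checks whether a framework recognises and hoists such invariant computations on its own.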
Each test expression in a test-set is invoked using one or more high-level interfaces provided by the framework. For each such invocation (which we refer to as an input), the execution times are measured 10 times and compared against those of an optimised reference implementation. The fraction by which the input is slower than the reference implementation, referred to as the slowdown, is computed as follows:

    slowdown = (t_input - t_ref) / t_ref

Here, `t_input` and `t_ref` denote the measured execution times of the input and of the reference implementation, respectively.
The slowdown values across different test expressions need not be on a comparable scale.
For instance, a slowdown value of 0.5 may indicate a substantial loss for expression A, yet represent only a minor loss for expression B.
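A sketch of how a slowdown value can be computed from raw timings (the function name and variables are illustrative, not the benchmark's actual code):

```python
def slowdown(t_input, t_ref):
    """Fraction by which the input is slower than the reference."""
    return (t_input - t_ref) / t_ref

# An input taking 1.5 s against a 1.0 s reference is 50% slower.
assert abs(slowdown(1.5, 1.0) - 0.5) < 1e-12
```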
Therefore, for each input, we also consider a negative reference, which is a deliberately sub-optimal implementation, and measure its execution times in the same way. For each input, the slowdown of the negative reference indicates how large the performance loss can become for that expression, and thus provides a scale against which the input's slowdown can be interpreted.
If you want to run the benchmark, refer here.
If you just want to access the results, the evaluation reports from previously run benchmarks are available in the reports/ directory.
An example evaluation report for the installation of PyTorch/2.1.2-foss-2023a on the HPC system Kebnekaise, hosted at Umeå University in Sweden, is available here.
More details about this benchmark can be found in the following publication:
@inproceedings{sankaran2022benchmarking,
title={Benchmarking the Linear Algebra Awareness of TensorFlow and PyTorch},
author={Sankaran, Aravind and Alam, Shourov and Karlsson, Lars and Bientinesi, Paolo},
booktitle={Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)},
pages={924--933},
year={2022},
address={Lyon, France},
publisher={IEEE},
doi={10.1109/IPDPSW55747.2022.00150}
}

