Single-threaded BLAS/LAPACK calls create race conditions with OpenBLAS #3329

@david-cortes-intel

Some sections of oneDAL call BLAS and LAPACK functions from inside parallel regions. For those, we have separate wrapper functions that run the BLAS/LAPACK calls single-threaded to avoid nested parallelism, for example:

static void xxsyrk(const char * uplo, const char * trans, const DAAL_INT * p, const DAAL_INT * n, const double * alpha, const double * a,

When building with MKL as backend for BLAS/LAPACK, the threading is controlled through functions that have effects only within the thread that calls them:

int old_nthr = mkl_set_num_threads_local(1);

https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2025-2/mkl-set-num-threads-local.html

Hence, it works in such a way that every time a single-threaded call is needed, the thread executing it queries its current config, configures itself to not spawn more threads, and then restores its previous config, leaving everything as it was.
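As a rough illustration of that save / set / restore pattern, here is a minimal sketch. It does not call MKL: `tl_num_threads`, `fake_set_num_threads_local` and `run_single_threaded` are hypothetical names, and the stand-in only mimics the shape of `mkl_set_num_threads_local` (set the calling thread's value, return the previous one) so that the snippet compiles on its own.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a thread-local thread-count setting, so that the
// sketch compiles without MKL. 0 means "no thread-local override".
thread_local int tl_num_threads = 0;

// Mimics the shape of mkl_set_num_threads_local: sets the value for the
// calling thread only and returns the value that was set before.
int fake_set_num_threads_local(int n) {
    int previous = tl_num_threads;
    tl_num_threads = n;
    return previous;
}

// Runs `call` with the calling thread limited to one thread, then restores
// whatever this thread had configured before, even if `call` throws.
template <typename F>
auto run_single_threaded(F&& call) -> decltype(call()) {
    struct Restore {
        int old_value;
        ~Restore() { fake_set_num_threads_local(old_value); }
    } restore{fake_set_num_threads_local(1)};
    return std::forward<F>(call)();
}

// Usage: the setting is 1 only for the duration of the wrapped call.
void example_usage() {
    fake_set_num_threads_local(4);
    int seen_inside = run_single_threaded([] { return tl_num_threads; });
    assert(seen_inside == 1);
    assert(tl_num_threads == 4);
}
```

Because the state is thread-local, other threads doing the same thing at the same time never observe or overwrite this thread's saved value.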

However, when oneDAL is built with OpenBLAS as backend, it instead calls functions that have global effects beyond the thread that executes them, since OpenBLAS doesn't have thread-local config functions the same way MKL does:

previous_thread_count = openblas_get_num_threads();

This has two issues:

  • There's a potential bottleneck from multiple threads modifying the same global state at the same time, which very likely involves a mutex.
  • It's unlikely that the global number of threads will be restored correctly at the end of all of this, because only the last restore has a lasting effect. There's no guarantee that the thread which saw the initial global state (all threads) will be the one that finishes last, and there are multiple calls to these functions under each thread, so the configuration could be left globally single-threaded afterwards.
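To make the second point concrete, the sketch below replays one possible interleaving of two workers sequentially, so the outcome is deterministic. It does not use OpenBLAS: `g_num_threads`, `fake_get_num_threads` and `fake_set_num_threads` are hypothetical stand-ins for a process-wide setting, and the specific interleaving is an assumption chosen for illustration.

```cpp
// Hypothetical stand-ins for a process-wide thread-count setting, so that the
// sketch compiles without OpenBLAS.
int g_num_threads = 8;
int fake_get_num_threads() { return g_num_threads; }
void fake_set_num_threads(int n) { g_num_threads = n; }

// Replays, step by step, one interleaving of two workers A and B that each do
// "save current value, set to 1, run BLAS call, restore saved value".
// Returns the global setting left behind once both workers are done.
int replay_interleaving(int initial) {
    fake_set_num_threads(initial);

    int saved_a = fake_get_num_threads();  // A saves the initial value
    fake_set_num_threads(1);               // A goes single-threaded

    int saved_b = fake_get_num_threads();  // B saves 1, not the initial value
    fake_set_num_threads(1);               // B goes single-threaded

    fake_set_num_threads(saved_a);         // A finishes first and restores
    fake_set_num_threads(saved_b);         // B finishes last and "restores" 1

    return fake_get_num_threads();
}
```

With this ordering, `replay_interleaving(8)` returns 1: worker B saved the value that worker A had already lowered, so its restore overwrites A's. Note also that between A's restore and B's restore, B's call would be running under a multi-threaded global setting, which is the nested parallelism the wrappers were meant to avoid.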

I guess one potential solution could be to have a separate version of these wrapper classes that would handle the configuration and restoration within its constructor and destructor, and place this appropriately where the single-threaded calls are needed.
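One way such a constructor/destructor wrapper might look is sketched below. This is not existing oneDAL code: `SingleThreadedBlasGuard` is a hypothetical name, the nesting counter is one design option among several, and the `fake_*` functions are stand-ins for the global OpenBLAS getter/setter so that the snippet compiles without OpenBLAS.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins for the process-wide OpenBLAS thread-count calls.
int g_num_threads = 8;
int fake_get_num_threads() { return g_num_threads; }
void fake_set_num_threads(int n) { g_num_threads = n; }

// Hypothetical RAII guard: the first live guard saves the global setting and
// sets it to 1; the last guard to be destroyed restores the saved value.
// Guards in between only adjust the counter, so restores cannot overwrite
// each other regardless of the order in which threads finish.
class SingleThreadedBlasGuard {
public:
    SingleThreadedBlasGuard() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (depth_++ == 0) {
            saved_ = fake_get_num_threads();
            fake_set_num_threads(1);
        }
    }

    ~SingleThreadedBlasGuard() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(depth_ > 0);
        if (--depth_ == 0) {
            fake_set_num_threads(saved_);
        }
    }

    SingleThreadedBlasGuard(const SingleThreadedBlasGuard&) = delete;
    SingleThreadedBlasGuard& operator=(const SingleThreadedBlasGuard&) = delete;

private:
    static std::mutex mutex_;  // protects depth_ and saved_
    static int depth_;         // number of guards currently alive
    static int saved_;         // global setting seen by the first guard
};

std::mutex SingleThreadedBlasGuard::mutex_;
int SingleThreadedBlasGuard::depth_ = 0;
int SingleThreadedBlasGuard::saved_ = 0;
```

This sketch still takes a lock on every construction and destruction, and it assumes nothing else changes the global setting while a guard is alive. Placing a single guard around the whole parallel region, rather than one per call, would presumably avoid most of that contention.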

Another potentially better way of doing this (not 100% sure if it's feasible though) would be to set a custom threading backend for OpenBLAS using the TBB thread pool that oneDAL uses for parallel executions, which should now be supported as of OpenBLAS>=0.3.30, but might be quite complicated.

CC @keeranroth @rakshithgb-fujitsu
