BWDW JIT 256 error reproducible through gtests, but not through benchdnn #4124

@gassan-arm

Summary

Depthwise backward-weights on AArch64 SVE-256 produces incorrect results for strided, padded cases (e.g., C=24, Kh=3, Sh=2, Ph=1). The PyTorch test TestConvolutionNN.test_Conv2d_OneDNN fails, while benchdnn does not flag the issue. A regression gtest that compares the legacy blocked-oh path against a new per-row path exposes the defect.
Fix merged in PR #4081.

cc. @Sqvid

Version

  • oneDNN v3.9.1 (commit 80a3a8e745d2f0186e674b0af9332fd6e074c94f)
  • Also reproduced with oneDNN v3.7.1

Environment

  • CPU: AArch64 SVE (256-bit) (Neoverse V1)
  • oneDNN runtime: OpenMP, nthr=32
  • PyTorch (arm/aarch64 build) using oneDNN backend
  • Python 3.10

Steps to reproduce

1) PyTorch unit test (fails)

# Set ONEDNN_VERBOSE=all to capture the implementation name and commit
export ONEDNN_VERBOSE=all
python pytorch/test/nn/test_convolution.py TestConvolutionNN.test_Conv2d_OneDNN

Typical verbose snippet at failure:

onednn_verbose,v1,info,oneDNN v3.9.1 (commit 80a3a8e...)
onednn_verbose,v1,primitive,exec,cpu,convolution,jit_dw:sve_256,forward_training,...
onednn_verbose,v1,primitive,exec,cpu,convolution,jit_dw:sve_256,backward_weights,...
g24mb1_ic24oc24_ih6oh3kh3sh2ph1_iw6ow3kw3sw2pw1
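To sanity-check a problem descriptor like the one in the verbose line, the shape arithmetic can be sketched in a few lines of Python. This is only an illustration: `parse_desc` and `out_size` are hypothetical helper names, not part of oneDNN or benchdnn, and the formula assumes symmetric padding and no dilation.

```python
import re


def parse_desc(desc):
    """Split a conv descriptor such as 'g24mb1_ic24...' into {name: int}.

    Hypothetical helper: it simply pairs each run of letters with the
    run of digits that follows it, ignoring underscores.
    """
    pairs = re.findall(r"([a-z]+)(\d+)", desc.replace("_", ""))
    return {name: int(value) for name, value in pairs}


def out_size(in_size, kernel, stride, pad):
    """Output extent of a convolution (symmetric padding, no dilation)."""
    return (in_size + 2 * pad - kernel) // stride + 1


d = parse_desc("g24mb1_ic24oc24_ih6oh3kh3sh2ph1_iw6ow3kw3sw2pw1")

# The descriptor's oh/ow should agree with the computed output extents.
oh_ok = out_size(d["ih"], d["kh"], d["sh"], d["ph"]) == d["oh"]
ow_ok = out_size(d["iw"], d["kw"], d["sw"], d["pw"]) == d["ow"]

# Depthwise weights hold one kh x kw filter per group.
n_weights = d["g"] * d["kh"] * d["kw"]
```

For this descriptor `n_weights` evaluates to 24 * 3 * 3 = 216, the same element count that the PyTorch assertion reports for the weight gradient.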

2) Detailed C++ gtest reproduction steps

Start from oneDNN v3.9.1 (commit 80a3a8e745d2f0186e674b0af9332fd6e074c94f) on AArch64 SVE-256 (Neoverse V1).

Prerequisites

  • Replace tests/gtests/test_convolution_backward_weights_dw_compare.cpp and src/cpu/aarch64/jit_uni_dw_convolution.cpp with the supplied versions (attachments)
  • File: tests/gtests/test_convolution_backward_weights_dw_compare.cpp (attachment)
  • The test compares the legacy and new AArch64 depthwise BWD_W paths, selected by an environment variable:
    • ONEDNN_AARCH64_DW_BWDW_USE_OLD=1 → legacy path
    • unset → new per-row path
  • Descriptor used: g24mb1_ic24ih8iw8_oc24oh4ow4_kh3kw3_sh2sw2_ph1pw1

Build configuration

# Configure with tests enabled
cmake -S . -B build -DDNNL_BUILD_TESTS=ON

# Rebuild so both the kernel and gtest pick up changes
cmake --build build --target all -- -j$(nproc)

Run regression test

cd build && ctest -V -R test_convolution_backward_weights_dw_compare

Optional: benchdnn verification

ONEDNN_VERBOSE=all ./build/tests/benchdnn/benchdnn --conv --dir=BWD_W --fast-ref=false g24mb1_ic24ih8iw8_oc24oh4ow4_kh3kw3_sh2sw2_ph1pw1

Logs & diff evidence

  • Each run writes depthwise_bwdw_compare.log next to the binary (build/tests/gtests/depthwise_bwdw_compare.log)
  • Header shows both impl IDs, benchdnn descriptor, and replay command (see tests/gtests/test_convolution_backward_weights_dw_compare.cpp:186-201)

Observed behavior

  • PyTorch test failure:
    AssertionError: Tensor-likes are not close!
    Mismatched elements: 72 / 216 (33.3%)
    Greatest absolute difference: 3.0
    
  • oneDNN chooses jit_dw:sve_256 for both FWD and BWD_W on the above config.
  • The gtest A/B comparison shows that the legacy path accumulates extra bottom-row contributions in strided, padded cases (duplicate accumulation at tile boundaries). The new per-row path matches a naïve reference.
  • benchdnn did not reproduce the mismatch (even with --fast-ref=false and buffer replay).
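The mismatch statistics in the PyTorch failure can be reproduced for any pair of buffers with a small helper. The sketch below is a simplification: `mismatch_report` is a hypothetical name, and it uses a single absolute tolerance, whereas PyTorch's closeness check also applies a relative tolerance. The synthetic data is made up purely to illustrate the pattern and is not taken from the failing run.

```python
def mismatch_report(actual, expected, atol=1e-5):
    """Return (mismatched, total, greatest_abs_diff) for two flat sequences.

    Assumption: only an absolute tolerance is applied here.
    """
    diffs = [abs(a - e) for a, e in zip(actual, expected)]
    mismatched = sum(1 for d in diffs if d > atol)
    return mismatched, len(diffs), max(diffs, default=0.0)


# Synthetic example: 24 channels of 3x3 weights where only the last
# kernel row of every channel is off by 3.0 (illustrative data only).
channels, k = 24, 3
expected = [0.0] * (channels * k * k)
actual = list(expected)
for c in range(channels):
    for kw in range(k):
        actual[c * k * k + (k - 1) * k + kw] += 3.0

report = mismatch_report(actual, expected)
```

With this synthetic input the helper returns `(72, 216, 3.0)`. Since 216 = 24 * 3 * 3 and 72 = 24 * 3, the reported figures would be consistent with one kernel row per channel being wrong, although the failure output alone does not prove which row is affected.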

Workaround validated: removing the AArch64 JIT BWD_W (SVE-256) path from the CPU convolution implementation list avoids the failure. The fallback path then passes, as it already does on Neoverse N1 and Neoverse V2.

Expected behavior

Backward-weights results should match the naïve reference (and mkldnn-disabled PyTorch path) with zero elementwise diffs for these configs.
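For readers unfamiliar with what a naïve reference computes here, the following is a minimal pure-Python sketch of depthwise backward-weights for a single image. It is not the reference used by the gtest or by benchdnn; the function name `dw_bwd_weights_ref`, the nested-list layout, and the restriction to square kernels with equal stride and padding in both dimensions are assumptions made for brevity.

```python
def dw_bwd_weights_ref(src, diff_dst, k, stride, pad):
    """Naive depthwise backward-weights for one image (mb=1).

    src:      nested lists shaped [C][IH][IW]
    diff_dst: nested lists shaped [C][OH][OW]
    returns:  nested lists shaped [C][k][k]
    """
    C, IH, IW = len(src), len(src[0]), len(src[0][0])
    OH, OW = len(diff_dst[0]), len(diff_dst[0][0])
    diff_w = [[[0.0] * k for _ in range(k)] for _ in range(C)]
    for c in range(C):
        for kh in range(k):
            for kw in range(k):
                acc = 0.0
                for oh in range(OH):
                    ih = oh * stride - pad + kh
                    if not 0 <= ih < IH:
                        continue  # row falls into the padding region
                    for ow in range(OW):
                        iw = ow * stride - pad + kw
                        if 0 <= iw < IW:
                            # each (oh, ow) contributes exactly once
                            acc += diff_dst[c][oh][ow] * src[c][ih][iw]
                diff_w[c][kh][kw] = acc
    return diff_w


# All-ones inputs make each weight gradient equal to the number of
# in-bounds (oh, ow) positions for that kernel tap.
ones_src = [[[1.0] * 6 for _ in range(6)] for _ in range(2)]
ones_dst = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
grad = dw_bwd_weights_ref(ones_src, ones_dst, k=3, stride=2, pad=1)
```

With all-ones inputs and the 6x6, k=3, stride 2, pad 1 shape, each channel of `grad` is `[[4, 6, 6], [6, 9, 9], [6, 9, 9]]`: the first kernel row and column touch the top and left padding, so they see fewer valid positions. A kernel that accumulates a row twice at a tile boundary would show up as a larger count in the affected row.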

Additional notes

Attachments

Related PR

Metadata

Assignees

No one assigned

Labels

  • component:tests (Codeowner: @oneapi-src/onednn-arch)
  • help wanted
  • platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64)
  • sighting (Suspicious library behavior. Should be promoted to a bug when confirmed)
