Skip to content

Conversation

@mdhaber
Copy link
Contributor

@mdhaber mdhaber commented Nov 6, 2025

Reference issue

Closes gh-22794
Supersedes gh-22894. This PR addresses the concerns expressed in gh-22794, is more general (adds weights support for all methods except 'harwell-davis' rather than just inverted_cdf), and include more extensive tests.
data-apis/array-api-extra#494

What does this implement/fix?

This adds support for frequency weights to scipy.stats.quantile.

Results agree with NumPy for method='inverted_cdf'

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
rng = np.random.default_rng(23458924473496)
x = rng.random(100)
weights = rng.random(100)
p = np.linspace(0., 1., 300)
res = stats.quantile(x, p, weights=weights, method='inverted_cdf')
ref = np.quantile(x, p, weights=weights, method='inverted_cdf')
plt.plot(p, res, '-', label='SciPy')
plt.plot(p, res, '--', label='NumPy')
plt.xlabel('p')
plt.ylabel('quantile(p)')
plt.legend()
plt.show()
image

For non-negative integer weights, stats.quantile(x, p, weights=counts, ...) is equivalent to stats.quantile(np.repeat(x, counts), p, ...).

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
rng = np.random.default_rng(23458924473496)
x = rng.random(10)
counts = rng.integers(10, size=10)
p = np.linspace(0., 1., 10000)
res = stats.quantile(x, p, weights=counts, method='linear')
ref = stats.quantile(np.repeat(x, counts), p, method='linear')
plt.plot(p, res, '-', label='Using weights')
plt.plot(p, res, '--', label='Using repetitions')
plt.xlabel('p')
plt.ylabel('quantile(p)')
plt.legend()
plt.show()

The documentation specifies that weights that do not satisfy the requirements (non-negative integers) are accepted but may not have valid statistical interpretations.

image

Additional information

Support for weights in scipy.stats has come up several times before, most notably in gh-6931. As mentioned in #22794 (comment), an important discussion point in gh-6931 was the different meanings of "weights" in statistics (e.g., see here), in other statistical software, and possibly within scipy.stats. Here are my responses to my own concerns.

I found twelve functions/classes in scipy.stats that accept a weights argument: gmean, hmean, pmean, cumfreq, relfreq, wasserstein_distance_nd, wasserstein_distance, energy_distance, expectile, combine_pvalues, kde, Mixture. While it's not always documented clearly, it seems that the first nine of these produce valid results when used for frequency weights1.

import numpy as np
from scipy import stats

rng = np.random.default_rng(285494985394583)
nx, ny = 100, 100
x = rng.integers(1, 11, size=nx) / 10.
y = rng.integers(1, 11, size=ny) / 10.
x_vals, x_weights = np.unique_counts(x)
y_vals, y_weights = np.unique_counts(y)

ref = stats.gmean(x)
res = stats.gmean(x_vals, weights=x_weights)
np.testing.assert_allclose(res, ref)

ref = stats.hmean(x)
res = stats.hmean(x_vals, weights=x_weights)
np.testing.assert_allclose(res, ref)

ref = stats.pmean(x, 3)
res = stats.pmean(x_vals, 3, weights=x_weights)
np.testing.assert_allclose(res, ref)

ref = stats.cumfreq(x)
res = stats.cumfreq(x_vals, weights=x_weights)
np.testing.assert_equal(res, ref)

ref = stats.relfreq(x)
res = stats.relfreq(x_vals, weights=x_weights)
# np.testing.assert_equal(res, ref)  # see gh-23931

ref = stats.wasserstein_distance(x, y)
res = stats.wasserstein_distance(x_vals, y_vals, x_weights, y_weights)
np.testing.assert_equal(res, ref)

ref = stats.energy_distance(x, y)
res = stats.energy_distance(x_vals, y_vals, x_weights, y_weights)
np.testing.assert_equal(res, ref)

ref = stats.expectile(x, 0.7)
res = stats.expectile(x_vals, 0.7, weights=x_weights)
np.testing.assert_equal(res, ref)

The weights parameter of combine_pvalues does not support frequency weights. (I haven't looked into what it is doing.)

ref = stats.combine_pvalues(x[x != 1], method='stouffer')
res = stats.combine_pvalues(x_vals[x_vals != 1], method='stouffer', 
                           weights=x_weights[x_vals != 1])
np.testing.assert_equal(res, ref)  # fails - both statistic and p-value are different
# AssertionError: 
# Items are not equal:
#  ACTUAL: np.float64(0.4734226456871095)
#  DESIRED: np.float64(0.41556093437208813)

I haven't dug into it, but the results of using weights and repeating observations affects the smoothing of gaussian_kde. I'm not sure whether this is intentional - it might be.

import matplotlib.pyplot as plt
x_ = np.linspace(-1, 2, 300)
ref = stats.gaussian_kde(x)(x_)
res = stats.gaussian_kde(x_vals, weights=x_weights)(x_)
# np.testing.assert_allclose(res, ref)
plt.plot(x_, res, x_, ref)  # affects smoothing
image

Mixture only accepts relative weights that sum to one, and they don't really represent weights of observations at all, so I don't think a "frequency weights" interpretation would be valid.

That said, it is pretty common in statistical software (e.g. SAS, SPSS, Stata) to support different types of weights in different functions, and we are not alone in being imperfect about documenting what kind of weights are implemented2.

Frequency weights are currently the most widely supported in scipy.stats, and they will probably remain the most common in the future because:

  • they are conceptually simple,
  • it is always possible (although probably not advisable) to add a trivial implementation for counting number-valued weights using repeats,
  • a more efficient implementation is often easy, and
  • correctness is easy to verify with property-based testing.

With all that in mind, I think it's OK to implement frequency weights in quantile using the keyword argument weights. I don't think it needs to have a more specific name (Stata uses fweights to distinguish them from pweights, aweights, and iweights.) because we can give the other kinds of weights more specific names if/when the need arises. As for combine_pvalues and other places where weights does NOT refer to frequency weights, we at least need to document how the weights can be interpreted, and we can consider renaming the parameters2.

Footnotes

  1. In some cases (e.g. sample mean), multiply weight types may lead to the same weighted version of a statistic. I know that these functions support frequency weights, but the results might still be valid for other types of weighting.

  2. I'll submit a separate PR to improve this, though. 2

@mdhaber mdhaber added scipy.stats enhancement A new feature or improvement labels Nov 6, 2025
Copy link
Contributor Author

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relies on _xp_searchsorted (gh-23930), so that should merge first. I'd also much prefer to get gh-22644 merged first and resolve the merge conflicts here rather than the other way around. There is also a third PR (yet another ECDF function) that I will post shortly; I'd also like to wait until that's in to merge this.



@xp_capabilities(skip_backends=[("dask.array", "No take_along_axis yet.")])
def _xp_searchsorted(x, y, *, side='left', xp=None):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being reviewed separately in gh-23930.

Comment on lines +376 to +378
n_int = xp.asarray(n, dtype=xp.int64)
n_int = xp.broadcast_to(n_int, cumulative_weights.shape[:-1] + (1,))
total_weight = xp.take_along_axis(cumulative_weights, n_int-1, axis=-1)
Copy link
Contributor Author

@mdhaber mdhaber Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For nan_policy='omit' and marray, n is an array the is the number of valid entries in a slice; that's why we can't just take the last element. (marray tests are skipped right now because I'm running into a mutability issue. I think it's an marray issue and the implementation will probably "just work" when that gets resolved because we're already accounting for n depending on the slice.)

with pytest.raises(ValueError, match=message):
stats.quantile(x, xp.asarray([0.5, 0.6]), keepdims=False)

def _get_weights_x_rep(self, x, axis, rng):
Copy link
Contributor Author

@mdhaber mdhaber Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this function is to generate integral weights and use them to produce an x_rep array that is x with elements repeated weights times. The somewhat tricky thing is to ensure that the weights have the same sum along axis so we don't have a ragged x_rep array. I thought about using random_table... maybe that or another function would help here? It felt very clumsy, but I didn't really try to optimize this.

Note that for methods ``inverted_cdf`` and ``averaged_inverted_cdf``, only the
relative proportions of tied observations (and relative weights) affect the
results; for all other methods, the total number of observations (and absolute
weights) matter.
Copy link
Contributor Author

@mdhaber mdhaber Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that if I were to include weights in an example, this is what I'd show.

I think I'll be asked to include the information about coverage conditions, etc., that appears in the NumPy documentation, so I'll address that now: I would prefer not to, but I can add references. I don't think the math underlying the weights needs to be spelled out because it's exactly the same math as repeated samples; it's only the implementation that is generalized. Now, that same generalized implementation happens to work for non-integer weights, too, so if there's a reference for valid statistical interpretations of non-integral weights, I'd be interested in including that.

@ogrisel
Copy link
Contributor

ogrisel commented Nov 6, 2025

I haven't dug into it, but the results of using weights and repeating observations affects the smoothing of gaussian_kde. I'm not sure whether this is intentional - it might be.

I think this is what caused this PR in scikit-learn to get stalled: https://github.com/scikit-learn/scikit-learn/pull/27971/files#r1745872881 and https://github.com/scikit-learn/scikit-learn/pull/27971/files#r1763478046 (I was worried about not being able to implement weight/repetition equivalence in our scikit-learn estimator).

@mdhaber
Copy link
Contributor Author

mdhaber commented Nov 6, 2025

@ogrisel thanks - replied there. Let's get that addressed.

@ogrisel
Copy link
Contributor

ogrisel commented Nov 7, 2025

I built scipy with this PR's branch and patched scikit-learn to use this instead of our embedded sklearn.utils.stats._weighted_percentile function:

diff --git a/sklearn/utils/stats.py b/sklearn/utils/stats.py
index 453b0ab122..c35a2b4523 100644
--- a/sklearn/utils/stats.py
+++ b/sklearn/utils/stats.py
@@ -90,7 +90,27 @@ def _weighted_percentile(
         If `array` is 2D and `percentile_rank` is 1D, returns a 2D array
             of shape `(array.shape[1], percentile_rank.shape[0])`
     """
+
+    from scipy.stats import quantile
     xp, _, device = get_namespace_and_device(array)
+
+    array = xp.asarray(array, device=device)
+    if not xp.isdtype(array.dtype, "real floating"):
+        array = xp.asarray(array, dtype=xp.asarray([0.0]).dtype, device=device)
+
+    p = xp.asarray(percentile_rank, dtype=array.dtype, device=device) / 100
+    sample_weight = xp.asarray(sample_weight, dtype=array.dtype, device=device)
+
+    results = quantile(
+        array.T if array.ndim == 2 else array,
+        p=p,
+        axis=-1 if array.ndim == 2 else None,
+        nan_policy="omit",
+        weights=sample_weight.T if sample_weight.ndim == 2 else sample_weight,
+        method="averaged_inverted_cdf" if average else "inverted_cdf",
+    )
+    return results
+
     # `sample_weight` should follow `array` for dtypes
     floating_dtype = _find_matching_floating_dtype(array, xp=xp)
     array = xp.asarray(array, dtype=floating_dtype, device=device)

Then I ran the full scikit-learn tests suite. Once #23930 (comment) is fixed, the remaining errors seems to be related to the fact that 0 weight data points are not ignored (as they should according to frequency weight semantics).

In particular, the following scikit-learn test fails:

https://github.com/scikit-learn/scikit-learn/blob/c8a5f807a365a069b4d96da1a84318ce6253eab7/sklearn/utils/tests/test_stats.py#L112-L132

There are two other tests failures, but I suspect they stem from the same cause.

EDIT: to make this more directly related to scipy, here is a minimal reproducer. The frequency semantics seem to be violated for the "(averaged_)inverted_cdf" methods:

>>> from scipy.stats import quantile
>>> import numpy as np
>>> data = np.array([0, 1, 2, 3, 4, 5, 6])
>>> weights = np.array([0, 0, 1, 1, 0, 1, 0])
>>> quantile(data, p=[0.0, 0.5, 1.0], weights=weights, method="averaged_inverted_cdf")
array([1. , 3. , 5.5])
>>> quantile(data[weights != 0], p=[0.0, 0.5, 1.0], method="averaged_inverted_cdf")
array([2., 3., 5.])
>>> quantile(data, p=[0.0, 0.5, 1.0], weights=weights, method="inverted_cdf")
array([0., 3., 5.])
>>> quantile(data[weights != 0], p=[0.0, 0.5, 1.0], method="inverted_cdf")
array([2., 3., 5.])

The "linear" method seems to not be impacted by this bug:

>>> quantile(data, p=[0.0, 0.5, 1.0], weights=weights, method="linear")
array([2., 3., 5.])

@mdhaber
Copy link
Contributor Author

mdhaber commented Nov 7, 2025

Hmm there is a test with zero weights that compares against NumPy; surprised it didn't find this. I'll take a look. Thanks for the MRE.

@mdhaber
Copy link
Contributor Author

mdhaber commented Nov 7, 2025

@ogrisel I thought zero-weights were being tested, but apparently the RNG gods spared me the frustration of debugging opportunity to fix that earlier ; ) The simple fix is to treat them like omitted NaNs - move them to the end of the sorted array and don't include them in the count. I didn't include your test case specifically, but updated an existing test. There's an argument for adding a special test for this, though, since it is not testing all the methods.
I noticed but didn't handle the case of all-zero weights yet. I think that will interpolate between the first two elements along the slice instead of returning NaN like it should. A full slice of NaNs with nan_policy='omit' works correctly, so I can generalize that code to fix zero weights, too. However, I'd like to address some other cleanups regarding nan handling at the same time, so if it doesn't break sklearn tests, I'd like to save that for a follow-up rather than complicating this PR.

@mdhaber
Copy link
Contributor Author

mdhaber commented Nov 7, 2025

Added more extensive tests of zero-weights. Skipped CI for now - when gh-23930 merges, we can merge main and run again, and hopefully then the sklearn tests will pass, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement A new feature or improvement scipy.stats

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ENH: add weights param to stats.quantile similarly to NumPy

2 participants