ENH: stats.quantile: add support for frequency weights #23941
Conversation
mdhaber left a comment
This relies on _xp_searchsorted (gh-23930), so that should merge first. I'd also much prefer to get gh-22644 merged first and resolve the merge conflicts here rather than the other way around. There is also a third PR (yet another ECDF function) that I will post shortly; I'd also like to wait until that's in to merge this.
```python
@xp_capabilities(skip_backends=[("dask.array", "No take_along_axis yet.")])
def _xp_searchsorted(x, y, *, side='left', xp=None):
```
This is being reviewed separately in gh-23930.
```python
n_int = xp.asarray(n, dtype=xp.int64)
n_int = xp.broadcast_to(n_int, cumulative_weights.shape[:-1] + (1,))
total_weight = xp.take_along_axis(cumulative_weights, n_int-1, axis=-1)
```
For nan_policy='omit' and marray, n is an array that is the number of valid entries in each slice; that's why we can't just take the last element. (marray tests are skipped right now because I'm running into a mutability issue. I think it's an marray issue, and the implementation will probably "just work" when that gets resolved because we're already accounting for n depending on the slice.)
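Here is a small, hypothetical NumPy illustration of that point (the names `cumulative_weights` and `n_int` mirror the snippet above; the data are made up):

```python
import numpy as np

# With nan_policy='omit', `n` is the number of valid entries per slice,
# and cumulative weights past index n-1 include omitted (NaN)
# observations, so the total weight is not simply the last element.
weights = np.array([[1., 2., 3.],    # all three entries valid: n = 3
                    [4., 5., 7.]])   # the 7. pairs with an omitted NaN: n = 2
cumulative_weights = np.cumsum(weights, axis=-1)
n_int = np.array([[3], [2]])

total_weight = np.take_along_axis(cumulative_weights, n_int - 1, axis=-1)
print(total_weight)  # [[6.], [9.]]; the last column would wrongly give [[6.], [16.]]
```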
```python
with pytest.raises(ValueError, match=message):
    stats.quantile(x, xp.asarray([0.5, 0.6]), keepdims=False)
```

```python
def _get_weights_x_rep(self, x, axis, rng):
```
The purpose of this function is to generate integral weights and use them to produce an x_rep array, which is x with each element repeated weights times. The somewhat tricky part is ensuring that the weights have the same sum along axis so we don't end up with a ragged x_rep array. I thought about using random_table... maybe that or another function would help here? It felt very clumsy, but I didn't really try to optimize this.
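For reference, one possible approach as a hypothetical sketch (not the PR's actual helper): a multinomial draw guarantees the same weight sum along the axis.

```python
import numpy as np

def get_weights_x_rep(x, axis, rng, total=10):
    # Hypothetical sketch: a multinomial draw yields nonnegative integer
    # weights that sum to `total` in every slice, so repeating each
    # element of `x` `weights` times gives a non-ragged `x_rep`.
    x = np.moveaxis(x, axis, -1)
    n = x.shape[-1]
    weights = rng.multinomial(total, np.full(n, 1/n), size=x.shape[:-1])
    x_rep = np.stack([np.repeat(xi, wi) for xi, wi
                      in zip(x.reshape(-1, n), weights.reshape(-1, n))])
    x_rep = x_rep.reshape(x.shape[:-1] + (total,))
    return np.moveaxis(weights, -1, axis), np.moveaxis(x_rep, -1, axis)

rng = np.random.default_rng(1234)
weights, x_rep = get_weights_x_rep(np.arange(12.).reshape(3, 4), axis=-1, rng=rng)
assert weights.sum(axis=-1).tolist() == [10, 10, 10]
```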
> Note that for methods ``inverted_cdf`` and ``averaged_inverted_cdf``, only the relative proportions of tied observations (and relative weights) affect the results; for all other methods, the total number of observations (and absolute weights) matters.
I was thinking that if I were to include weights in an example, this is what I'd show.
I think I'll be asked to include the information about coverage conditions, etc., that appears in the NumPy documentation, so I'll address that now: I would prefer not to, but I can add references. I don't think the math underlying the weights needs to be spelled out because it's exactly the same math as repeated samples; it's only the implementation that is generalized. Now, that same generalized implementation happens to work for non-integer weights, too, so if there's a reference for valid statistical interpretations of non-integral weights, I'd be interested in including that.
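For instance, a minimal sketch of that property (hypothetical; it assumes a build with this PR's `weights` support, and the printed values follow from the repetition equivalence):

```python
import numpy as np
from scipy import stats  # assumes a build with this PR's `weights` support

x = np.array([1., 2., 3., 4.])
w = np.array([1, 2, 1, 1])

# `inverted_cdf` depends only on the relative proportions of the
# weights, so scaling them by a constant changes nothing:
print(stats.quantile(x, 0.9, weights=w, method='inverted_cdf'))     # 4.0
print(stats.quantile(x, 0.9, weights=10*w, method='inverted_cdf'))  # 4.0

# `linear` depends on the absolute counts, so `w` and `10*w`
# generally give different results:
print(stats.quantile(x, 0.9, weights=w, method='linear'))     # 3.6
print(stats.quantile(x, 0.9, weights=10*w, method='linear'))  # 4.0
```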
I think this is what caused this PR in scikit-learn to get stalled: https://github.com/scikit-learn/scikit-learn/pull/27971/files#r1745872881 and https://github.com/scikit-learn/scikit-learn/pull/27971/files#r1763478046 (I was worried about not being able to implement weight/repetition equivalence in our scikit-learn estimator).
@ogrisel thanks - replied there. Let's get that addressed.
I built scipy with this PR's branch and patched scikit-learn to use this instead of our embedded `_weighted_percentile`:

```diff
diff --git a/sklearn/utils/stats.py b/sklearn/utils/stats.py
index 453b0ab122..c35a2b4523 100644
--- a/sklearn/utils/stats.py
+++ b/sklearn/utils/stats.py
@@ -90,7 +90,27 @@ def _weighted_percentile(
         If `array` is 2D and `percentile_rank` is 1D, returns a 2D array
         of shape `(array.shape[1], percentile_rank.shape[0])`
     """
+
+    from scipy.stats import quantile
     xp, _, device = get_namespace_and_device(array)
+
+    array = xp.asarray(array, device=device)
+    if not xp.isdtype(array.dtype, "real floating"):
+        array = xp.asarray(array, dtype=xp.asarray([0.0]).dtype, device=device)
+
+    p = xp.asarray(percentile_rank, dtype=array.dtype, device=device) / 100
+    sample_weight = xp.asarray(sample_weight, dtype=array.dtype, device=device)
+
+    results = quantile(
+        array.T if array.ndim == 2 else array,
+        p=p,
+        axis=-1 if array.ndim == 2 else None,
+        nan_policy="omit",
+        weights=sample_weight.T if sample_weight.ndim == 2 else sample_weight,
+        method="averaged_inverted_cdf" if average else "inverted_cdf",
+    )
+    return results
+
     # `sample_weight` should follow `array` for dtypes
     floating_dtype = _find_matching_floating_dtype(array, xp=xp)
     array = xp.asarray(array, dtype=floating_dtype, device=device)
```

Then I ran the full scikit-learn test suite. Once #23930 (comment) is fixed, the remaining errors seem to be related to the fact that zero-weight data points are not ignored (as they should be, according to frequency-weight semantics). In particular, one scikit-learn test fails; there are two other test failures, but I suspect they stem from the same cause.

EDIT: to make this more directly related to scipy, here is a minimal reproducer. The frequency semantics seem to be violated for the `averaged_inverted_cdf` and `inverted_cdf` methods:

```python
>>> from scipy.stats import quantile
>>> import numpy as np
>>> data = np.array([0, 1, 2, 3, 4, 5, 6])
>>> weights = np.array([0, 0, 1, 1, 0, 1, 0])
>>> quantile(data, p=[0.0, 0.5, 1.0], weights=weights, method="averaged_inverted_cdf")
array([1. , 3. , 5.5])
>>> quantile(data[weights != 0], p=[0.0, 0.5, 1.0], method="averaged_inverted_cdf")
array([2., 3., 5.])
>>> quantile(data, p=[0.0, 0.5, 1.0], weights=weights, method="inverted_cdf")
array([0., 3., 5.])
>>> quantile(data[weights != 0], p=[0.0, 0.5, 1.0], method="inverted_cdf")
array([2., 3., 5.])
```

The `linear` method, however, gives the expected result:

```python
>>> quantile(data, p=[0.0, 0.5, 1.0], weights=weights, method="linear")
array([2., 3., 5.])
```
Hmm, there is a test with zero weights that compares against NumPy; I'm surprised it didn't find this. I'll take a look. Thanks for the MRE.
@ogrisel I thought zero weights were being tested, but apparently the RNG gods spared me.
Added more extensive tests of zero weights. Skipped CI for now - when gh-23930 merges, we can merge main and run again, and hopefully then the sklearn tests will pass, too.
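A sketch of the kind of zero-weight property being tested (hypothetical; the PR's actual tests are more extensive, and this assumes a build that includes the fix):

```python
import numpy as np
from scipy import stats  # assumes a build with this PR and the fix above

# Observations with weight 0 should behave exactly as if they were
# absent from the sample.
rng = np.random.default_rng(42)
x = rng.standard_normal(20)
w = rng.integers(0, 3, size=20)  # frequency weights, including zeros
p = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

for method in ["inverted_cdf", "averaged_inverted_cdf", "linear"]:
    res = stats.quantile(x, p, weights=w, method=method)
    ref = stats.quantile(x[w > 0], p, weights=w[w > 0], method=method)
    np.testing.assert_allclose(res, ref)
```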
Reference issue
Closes gh-22794
Supersedes gh-22894. This PR addresses the concerns expressed in gh-22794, is more general (adds weights support for all methods except `harrell-davis`, rather than just `inverted_cdf`), and includes more extensive tests.

data-apis/array-api-extra#494
What does this implement/fix?
This adds support for frequency weights to `scipy.stats.quantile`.

- Results agree with NumPy for `method='inverted_cdf'`.
- For non-negative integer weights, `stats.quantile(x, p, weights=counts, ...)` is equivalent to `stats.quantile(np.repeat(x, counts), p, ...)`.
- The documentation specifies that weights that do not satisfy the requirements (non-negative integers) are accepted but may not have valid statistical interpretations.
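A minimal sketch of that equivalence (assuming a build that includes this PR):

```python
import numpy as np
from scipy import stats  # assumes a build that includes this PR

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
counts = rng.integers(1, 4, size=8)  # positive integer frequency weights
p = np.array([0.25, 0.5, 0.75])

# Frequency weights behave like explicit repetition of the observations:
res = stats.quantile(x, p, weights=counts)
ref = stats.quantile(np.repeat(x, counts), p)
np.testing.assert_allclose(res, ref)
```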
Additional information
Support for weights in `scipy.stats` has come up several times before, most notably in gh-6931. As mentioned in #22794 (comment), an important discussion point in gh-6931 was the different meanings of "weights" in statistics (e.g., see here), in other statistical software, and possibly within `scipy.stats`. Here are my responses to my own concerns.

I found twelve functions/classes in `scipy.stats` that accept a `weights` argument: `gmean`, `hmean`, `pmean`, `cumfreq`, `relfreq`, `wasserstein_distance_nd`, `wasserstein_distance`, `energy_distance`, `expectile`, `combine_pvalues`, `kde`, `Mixture`. While it's not always documented clearly, it seems that the first nine of these produce valid results when used with frequency weights[^1].

- The `weights` parameter of `combine_pvalues` does not support frequency weights. (I haven't looked into what it is doing.)
- I haven't dug into it, but using `weights` and repeating observations affect the smoothing of `gaussian_kde` differently; I'm not sure whether this is intentional - it might be.
- `Mixture` only accepts relative weights that sum to one, and they don't really represent weights of observations at all, so I don't think a "frequency weights" interpretation would be valid.

That said, it is pretty common in statistical software (e.g. SAS, SPSS, Stata) to support different types of weights in different functions, and we are not alone in being imperfect about documenting what kind of weights are implemented[^2].
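A quick check of the `gaussian_kde` point above, using the released API; the difference traces to the effective sample size used by the bandwidth rule:

```python
import numpy as np
from scipy import stats

x = np.array([0., 1., 2., 3.])
w = np.array([1., 3., 1., 1.])

kde_weighted = stats.gaussian_kde(x, weights=w)
kde_repeated = stats.gaussian_kde(np.repeat(x, w.astype(int)))

# Scott's rule uses the effective sample size neff = 1/sum(w_norm**2)
# for weighted data but the raw count n for repeated data, so the
# bandwidth factors (and hence the smoothing) differ:
print(kde_weighted.factor)  # 3**(-1/5), about 0.803
print(kde_repeated.factor)  # 6**(-1/5), about 0.699
```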
Frequency weights are currently the most widely supported in `scipy.stats`, and they will probably remain the most common in the future because they are equivalent to repeats.

With all that in mind, I think it's OK to implement frequency weights in `quantile` using the keyword argument `weights`. I don't think it needs a more specific name (Stata uses `fweights` to distinguish them from `pweights`, `aweights`, and `iweights`) because we can give the other kinds of weights more specific names if/when the need arises. As for `combine_pvalues` and other places where `weights` does NOT refer to frequency weights, we at least need to document how the weights can be interpreted, and we can consider renaming the parameters[^2].

Footnotes
[^1]: In some cases (e.g. sample mean), multiple weight types may lead to the same weighted version of a statistic. I know that these functions support frequency weights, but the results might still be valid for other types of weighting.
[^2]: I'll submit a separate PR to improve this, though.