- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.3k
[WIP] ENH: Resample additional arrays apart from X and y #463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[WIP] ENH: Resample additional arrays apart from X and y #463
Conversation
| Hello @glemaitre! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 29, 2018 at 09:59 Hours UTC | 
| Sampling extra arrays could be easy in the case of prototype selection. However, what would be a good and meaningful default when new sample weight need to be created. Right now, I created an  | 
| Ups I forgot to ping @jnothman in my last comment | 
| Codecov Report
 @@            Coverage Diff             @@
##           master     #463      +/-   ##
==========================================
- Coverage   98.92%   98.57%   -0.36%     
==========================================
  Files          85       75      -10     
  Lines        5324     4633     -691     
==========================================
- Hits         5267     4567     -700     
- Misses         57       66       +9
 Continue to review full report at Codecov. 
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I don't think returning non-X,y will work with the current handling of Pipeline.fit's kwargs. We really need sample prop routing to handle that case. Currently, the handling would be unambiguous if one of:
- the resampler is the last step, in which case we return any additional sample props like weights
- the resampler is the second-last step, and there is no fit_param called last_step_name__sample_weight, in which case we pass all sample props into the last step'sfit, I think.
Otherwise, it's unclear where to pass the returned sample_weight given that **fit_params intends to prescribe this at the time Pipeline.fit is called.
| y_resampled : ndarray, shape (n_samples_new,) | ||
| The corresponding label of `X_resampled`. | ||
| sample_weight_resampled : ndarray, shape (n_samples_new,) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather have a dict of non-X,y returned. (Optionally? In scikit-learn I would rather this be mandatory so we don't need to handle both cases.)
| ``sample_weight`` was not ``None``. | ||
| idx_resampled : ndarray, shape (n_samples_new,) | ||
| Indices of the selected features. This output is optional and only | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the selected samples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
| Resampled sample weights. This output is returned only if | ||
| ``sample_weight`` was not ``None``. | ||
| idx_resampled : ndarray, shape (n_samples_new,) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain why this should be returned from fit_resample, rather than stored as an attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it was some original design (before it was in scikit-learn). But actually it would be better to keep it as an attribute with the single fit_resample.
| 
 If I understand well and from what I could see, Pipeline does not support  | 
| You're right  | 
bbf2b12    to
    513203c      
    Compare
  
    f1bc189    to
    8f87307      
    Compare
  
    eae6c6b    to
    ffdde80      
    Compare
  
    | @glemaitre #460 is closed but #457 is still open and probably relevant. Could we close this PR in favor of new one in the future? It is two years old. | 
Implement the last point of #462 and should be merged after it.
Partially addressing #460