CommonGridBinner #651
base: develop
Conversation
376c5e6 to 345da0f
Ok, this requires a major rework.
Most important thing: normalize the data and heavily reduce the number of different code paths. With this alone, the code could probably be reduced to half, or even a quarter, of what it is now.
Second important thing: vectorization. Using appropriate vectorization you can reduce the actual binning code to a couple of well-thought-out function calls, achieving not just more performance, but also shorter and much clearer code.
Please try to modify the code with these ideas in mind, and bring the topic to the weekly discussion with Alberto, as I am sure that he can also help you with this.
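A minimal sketch of the normalization idea, under the assumption that the transformer takes per-dimension parameters such as `n_bins` and `domain_range` (hypothetical names, not necessarily the PR's actual ones): coerce everything to per-dimension tuples once, up front, so every later step runs the same n-dimensional code path.

```python
import numpy as np

def _normalize_params(n_bins, domain_range, dim):
    # Coerce scalar inputs to per-dimension tuples once, so no later
    # code ever needs a separate "if dim == 1" branch.
    if np.isscalar(n_bins):
        n_bins = (int(n_bins),) * dim
    if np.isscalar(domain_range[0]):
        domain_range = (tuple(domain_range),) * dim
    return tuple(n_bins), tuple(domain_range)

# Both calls produce the same normalized representation:
print(_normalize_params(10, (0.0, 1.0), dim=1))
print(_normalize_params((10,), ((0.0, 1.0),), dim=1))
```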
@@ -10,5 +10,6 @@
    "missing",
    "registration",
    "smoothing",
    "binning",
It does not seem likely that we will have more methods in this category, right? I would try to think of a more general category, or just place this transformer in the "preprocessing" namespace, without any subcategory.
##############################################################################

@pytest.fixture(
I have the impression that you are abusing fixtures here. You do not need to use fixtures for everything, even less parameterized fixtures, which are very hard to read. Why not make each of these cases into a separate test?
In general, tests should be easy to read. Fixtures are good for keeping initialization code that you use in several tests in one place, but they should be used sparingly, or you risk the test code being more difficult to read than the code we want to test.
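For instance, something in this spirit (`bin_means` is a toy stand-in here, not the PR's actual code): one small, readable test per case instead of a single parameterized fixture that mixes all of them together.

```python
import numpy as np

def bin_means(values, edges):
    """Toy stand-in for the binner: mean of the values in each bin."""
    idx = np.searchsorted(edges, values, side="right") - 1
    return np.array(
        [values[idx == i].mean() for i in range(len(edges) - 1)]
    )

def test_single_bin_mean():
    result = bin_means(np.array([1.0, 3.0]), np.array([0.0, 4.0]))
    np.testing.assert_allclose(result, [2.0])

def test_two_bins_mean():
    result = bin_means(np.array([1.0, 3.0]), np.array([0.0, 2.0, 4.0]))
    np.testing.assert_allclose(result, [1.0, 3.0])
```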
    Optional[str],
    Optional[ArrayNDimType],
]
TupleOutputGrid = Tuple[None, Tuple[NDArray[np.float64], ...]]
We have several types already defined for the whole project, which may be helpful to you.
class GridBinner(  # noqa: WPS230
    BaseEstimator,
    TransformerMixin[FDataGrid, FDataGrid, object],
Do we not also accept FDataIrregular? I thought that was the main case.
Note: if a value falls in the limit of two bins, it will be included in
the bin on the right.

Parameters:
We are using Google syntax for the docstrings, not NumPy syntax. Please change your docstrings accordingly.
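For reference, a Google-style docstring uses `Args:`/`Returns:` sections; the method and parameter descriptions below are only illustrative:

```python
def fit(self, X, y=None):
    """Compute the bin edges from the domain of the input data.

    Args:
        X: Functional data whose domain range is used to build the bins.
        y: Ignored, kept for scikit-learn API compatibility.

    Returns:
        The fitted transformer.
    """
```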
dim_codomain = X.dim_codomain

if (
    self.dim == 1
Again, do not special-case for 1D.
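A minimal sketch of how the special case disappears (hypothetical helper, not the PR's code): if the bin edges are always stored as a tuple of per-dimension arrays, the same code covers `dim == 1` and `dim > 1`.

```python
import numpy as np

def bin_edges(domain_range, n_bins):
    # domain_range: tuple of (lo, hi) pairs; n_bins: tuple of ints.
    # The comprehension is identical whether there is one dimension or many.
    return tuple(
        np.linspace(lo, hi, n + 1)
        for (lo, hi), n in zip(domain_range, n_bins)
    )

print(bin_edges(((0.0, 1.0),), (4,)))               # 1D
print(bin_edges(((0.0, 1.0), (0.0, 2.0)), (4, 2)))  # 2D, same code path
```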
    np.nan,
)

for sample_idx in range(n_samples):
Ideally we should attempt to vectorize algorithms, and not use a for loop over the samples (or worse, over the discretization points!), as this makes algorithms slower and does not scale well.
Iterating over dimensions is not as problematic (although often it can also be avoided).
Idea: you can try to find the index of the corresponding bin for a particular dimension, for all points of all samples, using one call to np.searchsorted. Then it is just a matter of finding a way to aggregate them in a vectorized way.
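A hedged sketch of that idea, with made-up shapes (5 samples with 100 evaluation points each, 10 bins): a single vectorized call assigns every point of every sample to a bin.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 1, size=(5, 100))  # 5 samples, 100 points each
edges = np.linspace(0, 1, 11)              # edges of 10 bins on [0, 1]

# side="right" sends a point lying exactly on a boundary to the bin on
# its right, matching the rule stated in the docstring above.
bin_idx = np.searchsorted(edges, points, side="right") - 1
bin_idx = bin_idx.clip(0, len(edges) - 2)  # top edge belongs to the last bin
assert bin_idx.shape == points.shape       # one bin index per point, no loops
```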
'right' or a tuple of numpy arrays with the grid points for each
dimension, which must fit within the output bins.
bin_aggregation: Method to compute the value of the bin. The available
    methods are: 'mean', 'median'.
How is the median computed for n dimensions? I think this is probably not worth it, at least for now.
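One plausible reading, stated here only as an assumption about the PR: the median is taken component-wise over the codomain values that fall in each bin, which is well defined for any `dim_codomain` but is a marginal median rather than a true multivariate (e.g. geometric) median.

```python
import numpy as np

# Codomain values of the points that fell into one bin (dim_codomain = 2).
values_in_bin = np.array([[0.0, 1.0], [2.0, 5.0], [4.0, 3.0]])
print(np.median(values_in_bin, axis=0))  # [2. 3.], the marginal median
```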
)

# Vectorized binning
for i in range(self.n_bins_[0]):
Vectorized? You are using a loop!
    np.nan,
)

for k, combination in enumerate(points_in_bin_combinations):
Again, this should be vectorized using np.searchsorted.
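Continuing the earlier sketch (still hypothetical names and shapes): once every point has a bin index, the per-bin aggregation itself can also be done without a Python loop, e.g. with `np.add.at`.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=(5, 100))            # one observed value per point
bin_idx = rng.integers(0, 10, size=(5, 100))  # as obtained from searchsorted

# Accumulate per-sample, per-bin sums and counts in one vectorized pass.
sums = np.zeros((5, 10))
counts = np.zeros((5, 10))
rows = np.arange(5)[:, None]                  # broadcasts against bin_idx
np.add.at(sums, (rows, bin_idx), values)
np.add.at(counts, (rows, bin_idx), 1)

with np.errstate(invalid="ignore"):
    means = sums / counts                     # NaN where a bin is empty
```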
Describe the proposed changes
New CommonGridBinner class that integrates with scikit-learn Transformer objects and allows the user to bin FDataGrid and FDataIrregular objects on a common grid to "discretize" them, similar to grouping points in a histogram. It includes tests and documentation for the module.
Checklist before requesting a review