Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CommonGridBinner #651

Open
wants to merge 17 commits into
base: develop
Choose a base branch
from
Open

CommonGridBinner #651

wants to merge 17 commits into from

Conversation

aleexarias
Copy link
Contributor

Describe the proposed changes

New CommonGridBinner class that integrates with scikit-learn Transformer objects, and allows user to bin FDataGrid and FDataIrregular on a common grid to "discretize" them, similar to grouping points in a histogram. It includes tests and documentation for the module

Checklist before requesting a review

  • I have performed a self-review of my code
  • The code conforms to the style used in this package
  • The code is fully documented and typed (type-checked with Mypy)
  • I have added thorough tests for the new/changed functionality

Sorry, something went wrong.

@aleexarias aleexarias marked this pull request as ready for review February 28, 2025 17:40
Copy link
Member

@vnmabus vnmabus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this requires a major rework.

Most important thing: normalize data and reduce heavily the number of different code paths. With this alone, the code could probably be reduced to the half, or even a quarter or what it is now.

Second important thing: vectorization. Using appropriate vectorization you can reduce the actual code of binning to a couple of well-though function calls, achieving not just more performance, but also more reduced and much more clear code.

Please, try to modify the code with these ideas in mind, and bring the topic to the weekly discussion with Alberto, as I am sure that he can also help you with this.

@@ -10,5 +10,6 @@
"missing",
"registration",
"smoothing",
"binning",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not seem likely that we will have more methods in this category, right? I would try to think of a more general category, or just place this transformer in the "preprocessing" namespace, without any subcategory.

##############################################################################


@pytest.fixture(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the impression that you are abusing fixtures here. You do not need to use fixtures for everything, even less parameterized fixtures, which are very hard to read. Why not making each of these cases into a separate test?

In general, tests should be easy to read. Fixtures are good to have initialization code that you use in several tests in one place, but they should be used sparingly, or you risk the test code being more difficult to read that the code we want to test.

Optional[str],
Optional[ArrayNDimType],
]
TupleOutputGrid = Tuple[None, Tuple[NDArray[np.float64], ...]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have several types already defined for the whole project, which may be helpful to you.


class GridBinner( # noqa: WPS230
BaseEstimator,
TransformerMixin[FDataGrid, FDataGrid, object],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not also accept FDataIrregular? I thought that was the main case.

Note: if a value falls in the limit of two bins, it will be included in
the bin on the right.

Parameters:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using Google syntax for the docstrings, not NumPy syntax. Please change your docstrings accordingly.

dim_codomain = X.dim_codomain

if (
self.dim == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, do not special-case for 1D.

np.nan,
)

for sample_idx in range(n_samples):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we should attempt to vectorize algorithms, and not use a for loop over the samples (or worse, over the discretization points!), as this makes algorithms slower and does not scale well.

Iterating over dimensions is not as problematic (although often can also be avoided).

Idea: You can try to find the index of the corresponding bin for a particular dimension, for all points of all samples, using one call to np.searchsorted. Then it is just a matter of finding the way to aggregate them in a vectorized way.

'right' or a tuple of numpy arrays with the grid points for each
dimension, which must fit within the output bins.
bin_aggregation: Method to compute the value of the bin. The available
methods are: 'mean', 'median'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the median computed for n dimensions? I think this probably does not worth it, at least for now.

)

# Vectorized binning
for i in range(self.n_bins_[0]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vectorized? You are using a loop!

np.nan,
)

for k, combination in enumerate(points_in_bin_combinations):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this should be vectorized using np.searchsorted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants