CommonGridBinner #651
base: develop
Conversation
376c5e6 to 345da0f
Ok, this requires a major rework.
Most important thing: normalize the data and heavily reduce the number of different code paths. With this alone, the code could probably be reduced to half, or even a quarter, of what it is now.
Second important thing: vectorization. Using appropriate vectorization you can reduce the actual binning code to a couple of well-thought-out function calls, achieving not just more performance, but also shorter and much clearer code.
Please try to modify the code with these ideas in mind, and bring the topic to the weekly discussion with Alberto, as I am sure that he can also help you with this.
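A minimal sketch of the normalization idea, under the assumption that the transformer takes per-dimension parameters such as `n_bins` and `domain_range` (hypothetical names, not necessarily the PR's actual ones): coerce everything to per-dimension tuples once, up front, so every later step runs the same n-dimensional code path.

```python
import numpy as np

def _normalize_params(n_bins, domain_range, dim):
    # Coerce scalar inputs to per-dimension tuples once, so no later
    # code ever needs a separate "if dim == 1" branch.
    if np.isscalar(n_bins):
        n_bins = (int(n_bins),) * dim
    if np.isscalar(domain_range[0]):
        domain_range = (tuple(domain_range),) * dim
    return tuple(n_bins), tuple(domain_range)

# Both calls produce the same normalized representation:
print(_normalize_params(10, (0.0, 1.0), dim=1))
print(_normalize_params((10,), ((0.0, 1.0),), dim=1))
```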
@@ -10,5 +10,6 @@
    "missing",
    "registration",
    "smoothing",
    "binning",
It does not seem likely that we will have more methods in this category, right? I would try to think of a more general category, or just place this transformer in the "preprocessing" namespace, without any subcategory.
##############################################################################

@pytest.fixture(
I have the impression that you are abusing fixtures here. You do not need to use fixtures for everything, even less parameterized fixtures, which are very hard to read. Why not make each of these cases into a separate test?
In general, tests should be easy to read. Fixtures are good for keeping initialization code that you use in several tests in one place, but they should be used sparingly, or you risk the test code being more difficult to read than the code we want to test.
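For instance, something in this spirit (`bin_means` is a toy stand-in here, not the PR's actual code): one small, readable test per case instead of a single parameterized fixture that mixes all of them together.

```python
import numpy as np

def bin_means(values, edges):
    """Toy stand-in for the binner: mean of the values in each bin."""
    idx = np.searchsorted(edges, values, side="right") - 1
    return np.array(
        [values[idx == i].mean() for i in range(len(edges) - 1)]
    )

def test_single_bin_mean():
    result = bin_means(np.array([1.0, 3.0]), np.array([0.0, 4.0]))
    np.testing.assert_allclose(result, [2.0])

def test_two_bins_mean():
    result = bin_means(np.array([1.0, 3.0]), np.array([0.0, 2.0, 4.0]))
    np.testing.assert_allclose(result, [1.0, 3.0])
```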
    Optional[str],
    Optional[ArrayNDimType],
]
TupleOutputGrid = Tuple[None, Tuple[NDArray[np.float64], ...]]
We have several types already defined for the whole project, which may be helpful to you.
class GridBinner(  # noqa: WPS230
    BaseEstimator,
    TransformerMixin[FDataGrid, FDataGrid, object],
Do we not also accept FDataIrregular? I thought that was the main case.
Note: if a value falls in the limit of two bins, it will be included in
the bin on the right.

Parameters:
We are using Google syntax for the docstrings, not NumPy syntax. Please change your docstrings accordingly.
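For reference, a Google-style docstring uses `Args:`/`Returns:` sections; the method and parameter descriptions below are only illustrative:

```python
def fit(self, X, y=None):
    """Compute the bin edges from the domain of the input data.

    Args:
        X: Functional data whose domain range is used to build the bins.
        y: Ignored, kept for scikit-learn API compatibility.

    Returns:
        The fitted transformer.
    """
```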
dim_codomain = X.dim_codomain

if (
    self.dim == 1
Again, do not special-case for 1D.
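A minimal sketch of how the special case disappears (hypothetical helper, not the PR's code): if the bin edges are always stored as a tuple of per-dimension arrays, the same code covers `dim == 1` and `dim > 1`.

```python
import numpy as np

def bin_edges(domain_range, n_bins):
    # domain_range: tuple of (lo, hi) pairs; n_bins: tuple of ints.
    # The comprehension is identical whether there is one dimension or many.
    return tuple(
        np.linspace(lo, hi, n + 1)
        for (lo, hi), n in zip(domain_range, n_bins)
    )

print(bin_edges(((0.0, 1.0),), (4,)))               # 1D
print(bin_edges(((0.0, 1.0), (0.0, 2.0)), (4, 2)))  # 2D, same code path
```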
    np.nan,
)

for sample_idx in range(n_samples):
Ideally we should attempt to vectorize algorithms, and not use a for loop over the samples (or worse, over the discretization points!), as this makes algorithms slower and does not scale well.
Iterating over dimensions is not as problematic (although often it can also be avoided).
Idea: you can try to find the index of the corresponding bin for a particular dimension, for all points of all samples, using one call to np.searchsorted. Then it is just a matter of finding a way to aggregate them in a vectorized way.
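A hedged sketch of that idea, with made-up shapes (5 samples with 100 evaluation points each, 10 bins): a single vectorized call assigns every point of every sample to a bin.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 1, size=(5, 100))  # 5 samples, 100 points each
edges = np.linspace(0, 1, 11)              # edges of 10 bins on [0, 1]

# side="right" sends a point lying exactly on a boundary to the bin on
# its right, matching the rule stated in the docstring above.
bin_idx = np.searchsorted(edges, points, side="right") - 1
bin_idx = bin_idx.clip(0, len(edges) - 2)  # top edge belongs to the last bin
assert bin_idx.shape == points.shape       # one bin index per point, no loops
```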
'right' or a tuple of numpy arrays with the grid points for each
dimension, which must fit within the output bins.
bin_aggregation: Method to compute the value of the bin. The available
    methods are: 'mean', 'median'.
How is the median computed for n dimensions? I think this is probably not worth it, at least for now.
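One plausible reading, stated here only as an assumption about the PR: the median is taken component-wise over the codomain values that fall in each bin, which is well defined for any `dim_codomain` but is a marginal median rather than a true multivariate (e.g. geometric) median.

```python
import numpy as np

# Codomain values of the points that fell into one bin (dim_codomain = 2).
values_in_bin = np.array([[0.0, 1.0], [2.0, 5.0], [4.0, 3.0]])
print(np.median(values_in_bin, axis=0))  # [2. 3.], the marginal median
```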
)

# Vectorized binning
for i in range(self.n_bins_[0]):
Vectorized? You are using a loop!
    np.nan,
)

for k, combination in enumerate(points_in_bin_combinations):
Again, this should be vectorized using np.searchsorted.
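Continuing the earlier sketch (still hypothetical names and shapes): once every point has a bin index, the per-bin aggregation itself can also be done without a Python loop, e.g. with `np.add.at`.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=(5, 100))            # one observed value per point
bin_idx = rng.integers(0, 10, size=(5, 100))  # as obtained from searchsorted

# Accumulate per-sample, per-bin sums and counts in one vectorized pass.
sums = np.zeros((5, 10))
counts = np.zeros((5, 10))
rows = np.arange(5)[:, None]                  # broadcasts against bin_idx
np.add.at(sums, (rows, bin_idx), values)
np.add.at(counts, (rows, bin_idx), 1)

with np.errstate(invalid="ignore"):
    means = sums / counts                     # NaN where a bin is empty
```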
Describe the proposed changes
New CommonGridBinner class that integrates with scikit-learn Transformer objects and allows the user to bin FDataGrid and FDataIrregular objects on a common grid to "discretize" them, similar to grouping points in a histogram. It includes tests and documentation for the module.
Checklist before requesting a review