Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MLFlow dataset with modern kedro dataset breaks #598

Closed
Calychas opened this issue Oct 15, 2024 · 8 comments · Fixed by #599
Closed

MLFlow dataset with modern kedro dataset breaks #598

Calychas opened this issue Oct 15, 2024 · 8 comments · Fixed by #599
Assignees
Labels
bug Something isn't working

Comments

@Calychas
Copy link
Contributor

Calychas commented Oct 15, 2024

Description

Using mlflow datasets that wrap a modern dataset results in an error

Context

Cannot run the pipeline successfully. Breaks on saving the data (possibly on loading too?)

Steps to Reproduce

  1. Wrap json.JSONDataset with MLFlowArtifactDataset
    my_dataset:
      type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
      dataset:
        type: json.JSONDataset
        filepath: data/08_reporting/my_dataset.json
      artifact_path: model
    
  2. Run the pipeline

Expected Result

Should save successfully.

Actual Result

Errors while trying to save on that dataset


 /Users/project/.venv/lib/python3.10/site-packages/kedro/io/core.py:270 in save                                                                               
                                                                                                  
   267                                                                                         
   268          try:                                                                           
   269             self._logger.debug("Saving %s", str(self))                                 
 ❱ 270             save_func(self, data)                                                      
   271          except (DatasetError, FileNotFoundError, NotADirectoryError):                  
   272             raise                                                                      
   273          except Exception as exc:                                                       
                                                                                                  
 /Users/project/.venv/lib/python3.10/site-packages/kedro_mlflow/io/artifacts/mlflow_artifact_dataset.py:68 in _save                                           
                                                                                                  
    65             if getattr(super().save, "__savewrapped__", False):  # modern dataset      
    66                super().save.__wrapped__(self, data)                                   
    67             else:  # legacy dataset                                                    
 ❱  68                super()._save(data)                                                    
    69                                                                                        
    70             if self._logging_activated:                                                
    71                if self.run_id:   

AttributeError: 'super' object has no attribute '_save'

---

DatasetError: Failed while saving data to dataset 
MlflowJSONDataset(filepath=/Users/project/data/08_reporting/my_dataset.json, protocol=file).
'super' object has no attribute '_save'

Workaround

Experimented with changing __savewrapped__ to __wrapped__ and it gets rid of the error

Your Environment

  • kedro and kedro-mlflow version used (pip show kedro and pip show kedro-mlflow): 0.19.9 and 0.13.1
  • kedro-datasets: 5.0.0
  • Python version used (python -V): 3.10.14
  • Operating system and version: MacOS M2 Pro 15.0.1

Does the bug also happen with the last version on master?

Yes

@stroblme
Copy link

stroblme commented Oct 15, 2024

Can confirm, happens also with various other datasets 😢

Edit: Probably found the problematic line. In mlflow_artifact_dataset.py line 65f:

if getattr(super().save, "__savewrapped__", False):  # modern dataset
    super().save.__wrapped__(self, data)
else:  # legacy dataset
    super()._save(data)

should be

if getattr(super().save, "__wrapped__", False):  # modern dataset
    super().save.__wrapped__(self, data)
else:  # legacy dataset
    super()._save(data)

@Galileo-Galilei
Copy link
Owner

(@deepyaman FYI)

Hi, just to confirm, are you both using kedro-datasets==5.0.0 released a couple of days ago? Very likely, the faulty PR has been merged on October, 5th : kedro-org/kedro-plugins#829.

Not sure about your fix @stroblme, does it solve it for you? This looks like a bug but I'm afraid the problem is deeper than this unfortunately. I'll have a look very quick.

@Galileo-Galilei Galileo-Galilei self-assigned this Oct 15, 2024
@Galileo-Galilei Galileo-Galilei added the bug Something isn't working label Oct 15, 2024
@stroblme
Copy link

@Galileo-Galilei , yes, this fixed it for me.
Package versions are as follow:

kedro==0.19.8
kedro-datasets==5.0.0
kedro-mlflow==0.13.1

@deepyaman
Copy link
Contributor

https://github.com/kedro-org/kedro/blob/0.19.7/kedro/io/core.py#L259 should set __savewrapped__. My understanding is that super() should refer to json.JSONDataset, and this should have it set.

Trying to think through this, is it possible __init_subclass__ is never getting called?

It's interesting that changing it to __wrapped__ does work, since it's the same part of the code in Kedro core that wraps (setting __wrapped__) and manually sets the __savewrapped__ attribute...

It will be hard for me to prioritize looking into this in the next few days, but let me see if I get a chance.

@Galileo-Galilei
Copy link
Owner

Galileo-Galilei commented Oct 15, 2024

Can you make a PR @stroblme ? I am on my phone so it's not easy but I can release it immediately if someone submits the PR

@Galileo-Galilei
Copy link
Owner

Galileo-Galilei commented Oct 15, 2024

@deepyaman it's intellectually interesting to understand why, but if the fix works, don't worry and don't bother wasting time to investigate unless you really want it ;)

@deepyaman
Copy link
Contributor

if getattr(super().save, "__wrapped__", False):  # modern dataset
    super().save.__wrapped__(self, data)
else:  # legacy dataset
    super()._save(data)

If want to go with this fix, I think it should be ideally:

if hasattr(super().save, "__wrapped__"):  # modern dataset
    super().save.__wrapped__(self, data)
else:  # legacy dataset
    super()._save(data)

(as __savewrapped__ should be a bool, while checking __wrapped__ is testing function presence)

@Galileo-Galilei
Copy link
Owner

Some follow up :

  • In debugging mode, the cls.save has the __savewrapped__ attribute and __init__subclass is properly called, but super().save does not have the attribute.
  • Many tests fails with the most recent version of kedro-datasets, but all passes with above fix ✅

➡️ I release the fix and won't investigate more for now

Galileo-Galilei added a commit that referenced this issue Oct 15, 2024

Verified

This commit was signed with the committer’s verified signature.
Galileo-Galilei Yolan Honoré-Rougé
…atasets without private _load and _save (#598)
Galileo-Galilei added a commit that referenced this issue Oct 15, 2024
…atasets without private _load and _save (#598)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: ✅ Done
4 participants