Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to blacklist params. also sanitise keys #595

Merged

Conversation

pascalwhoop
Copy link
Contributor

@pascalwhoop pascalwhoop commented Oct 2, 2024

Description

  • special chars need to be sanitized before sending to MLflow

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

  • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.

  • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.

  • I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

@Galileo-Galilei
Copy link
Owner

Hi, I definitely understand why you don't want to log secrets and I am not strictly against this change, but I think the main issue at stake here is that they should not be params in the first place. Avoid logging in mlflow is likely the least of your problems : if some secrets are prams, they will be logged to your remote version control forge alongside with configuration. Even if you use it in a local folder, this means these secrets will be required in a non crypted file in production.

Can you elaborate on why you need to have sensitive data in the params in the first place?

I suggest you use environment variable to retrieve these sensitive data directly in the node, so you have control over what is logged.

@pascalwhoop
Copy link
Contributor Author

@Galileo-Galilei let me get back on the secrets one, I think you have a point although I think kedro works this way (I agree ENV variables injection is better and we do do that but we still pass them through the catalog into the python functions to keep them pure)

On the sanitization you're OK? The error was that we had properties in a data-fabricator that were something like customer_ids:STRING[] as we fabricate the real systems data structures. But that failed because MLFLow API doesn't like [ in a key, while in yaml it's OK.

@pascalwhoop
Copy link
Contributor Author

pascalwhoop commented Oct 3, 2024

Clearly an active discussion, spotted your name again here
https://github.com/kedro-org/kedro/wiki/Credentials-in-kedro

@merelcht @Galileo-Galilei given this plugin takes all params and sends off to 3rd party service, but kedro recommends passing credentials through parameters into nodes, wdyt is best here?

Maybe just use Pydantic Settings and create a Secrets singleton?
https://docs.pydantic.dev/latest/concepts/pydantic_settings/#usage

It does feel like a broader "what's kedros' stance" here. I like the recommendation of not passing secrets through the catalog by Galileo but then we need to also articulate an alternative in kedro docs.

@pascalwhoop
Copy link
Contributor Author

I removed the secrets removal from the parameters as that feels like a separate and much larger problem than the one where the parameters can have special chars in the key in kedro but not on MLflow. I hope this smaller change is less controversial

@Galileo-Galilei
Copy link
Owner

Galileo-Galilei commented Oct 6, 2024

Hello @pascalwhoop definitely ok to sanitize parameters like this, but I'd like to ensure consistency with mlflow. I was pretty sure it has existed at some point in the past, not sure why it's no longer the case.

This issue mlflow/mlflow#9425 refers to the code with the exact regex mlflow uses under the hood (and explains the rationale behind this) , and we should use the same one. The implementation should import the mlflow function that performs validation so we don't have to mirror the evolution of the behaviour along with mlflow. Could you update that?

EDIT: The regex is available here: https://github.com/mlflow/mlflow/blob/e40e782b6fcab473159e6d4fee85bc0fc10f78fd/mlflow/utils/validation.py#L140C1-L148C44 , and we can spot a slight difference between Windows and other OS. We cannot import this function, but we should mimic it and force replacement. Optionnaly, this should be an entry in the mlflow.yml to let user choose the conversion character

@pascalwhoop
Copy link
Contributor Author

done @Galileo-Galilei

Copy link
Owner

@Galileo-Galilei Galileo-Galilei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, just a couple of questions / suggestions, but I'm ok to merge it without any change if you disagree.

Comment on lines +104 to +108
bootstrap_project(kedro_project)
with KedroSession.create(
project_path=kedro_project,
) as session:
context = session.load_context()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nitpick, but can we just ignore all this? Since you invoke the hook manually, I don't think it's necessary creating a whole session and contexte, unless I miss something?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, this is needed to call the after_context_created hook, which format and stores the entire configuration.

) as session:
context = session.load_context()
mlflow_node_hook = MlflowHook()
mlflow_node_hook.after_context_created(context)
Copy link
Owner

@Galileo-Galilei Galileo-Galilei Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure you need after_context_created since you already set mlflow_tracking_uri up just above.

Overall really like the test👍

pyproject.toml Outdated
@@ -1,5 +1,5 @@
[tool.pytest.ini_options]
addopts = "--cov=kedro_mlflow --cov-report=html tests/"
Copy link
Owner

@Galileo-Galilei Galileo-Galilei Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't it necessary anymore? Don't remember exactly but I wonder if this is used by codecov to upload the coverage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah my bad, I had removed this to be able to call pytest ..path..to.test manually, the tests/ at the end blocked that from being possible

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just for the record, I guess you're searching pytest -k "regex" to execute only some tests based on a regex)

return name
else:
# Replace invalid characters with underscore
return re.sub(r"[^/\w.\- :]", "_", name)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a logger.warning to explain that the parameter is renamed? This might be confusing.

@Galileo-Galilei
Copy link
Owner

Almost good to go, I just don't understand why he test fails with "valid:param" -> "valid:param". I get mlflow.exceptions.MlflowException: Invalid value "valid:param" for parameter 'key' supplied: Names may only contain alphanumerics, underscores (_), dashes (-), periods (.), spaces ( ), colon(:) so it is likely on mlflow's side, but i'd like to understand.

I'll resume tomorrow, else it's good to go.

@Galileo-Galilei
Copy link
Owner

It's fixed, it was a windows issue as usual :) I'll merge it tomorrow

@Galileo-Galilei Galileo-Galilei merged commit fc584c9 into Galileo-Galilei:master Oct 23, 2024
9 of 15 checks passed
@pascalwhoop
Copy link
Contributor Author

hey @Galileo-Galilei thanks, I was a bit swamped and failed to respond. Appreciate the merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants