Fix platform instance support on Druid ingestion #12716
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅ (plus 10 files with indirect coverage changes).
Force-pushed from 11307c2 to 201db8e
if self.platform_instance
else f"{table}"
)
return f"{table}"
This is restoring the original implementation from 2021 (https://github.com/datahub-project/datahub/pull/2284/files),
which was updated in 2022 (https://github.com/datahub-project/datahub/pull/3996/files#diff-d2b27025056d73e6bf1508f17c065c63dd4249caacf26de6199aaeaf446eec60).
Have we been producing "duplicated" platform instance since then? 🤔
Yes, pretty sure that is the case, or something along the way changed.
But you can see that the default SQLAlchemy implementation has no mention of platform_instance:
https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py#L576
If you check any other SQL ingestion with platform_instance support, e.g. mssql, you can see that there, too, there is no need to specify the platform_instance at this level:
https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/mssql/source.py#L731
The Druid ingestion is probably not used a lot, because you will find other big inconsistencies in it. For example, the Druid table name is always hardcoded to druid/v2/sql,
because the SQLAlchemy implementation by default uses the URI to compute the table name. This is probably something that should be fixed at some point.
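To illustrate why the instance showed up twice: the shared SQL code prepends the platform instance when building the dataset URN, so an identifier that already carries it gets duplicated. A simplified sketch (`make_urn` is a hypothetical stand-in for DataHub's URN construction, not its actual API):

```python
from typing import Optional

def make_urn(platform: str, instance: Optional[str], identifier: str) -> str:
    # Simplified mimic of the downstream URN construction: the shared
    # SQL code prepends the platform instance (when set) to the identifier.
    name = f"{instance}.{identifier}" if instance else identifier
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},PROD)"

# With the buggy get_identifier, the identifier already carries the
# instance, so the URN ends up with it twice:
print(make_urn("druid", "prod", "prod.mytable"))
# -> urn:li:dataset:(urn:li:dataPlatform:druid,prod.prod.mytable,PROD)

# With the fix, the identifier is bare and the instance appears once:
print(make_urn("druid", "prod", "mytable"))
# -> urn:li:dataset:(urn:li:dataPlatform:druid,prod.mytable,PROD)
```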
Integration tests for druid source were missed; I just added them here #12717
And indeed, the golden file shows the platform instance appears twice in the generated URNs for datasets.
Let's merge the PR with the integration tests first, and then yours. Thanks for raising the issue and the fix.
I will wait for #12717 to get merged, and rebase this PR on your changes.
Thanks a lot for the support!
PR #12717 merged
If the platform instance was specified in a Druid ingestion, it was always set twice, because it was overridden in get_identifier and then added again down the line. This may break existing users who already rely on the broken platform instance implementation, but they could still manually set it to the old pattern in their dhub file.
👍 Thanks for the contrib!
One of the builds seems to have failed on an unrelated test:
Not sure if it's possible to restart this single job instead of the whole pipeline.
No worries. Those kafka errors in …
Merged commit 5f5e395 into datahub-project:master
Co-authored-by: rasnar <11248833+Rasnar@users.noreply.github.com>
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
If the platform instance was specified in a Druid ingestion, it was always set twice, because it was overridden in get_identifier and then added again down the line (see #11639).
This may break existing users who already rely on the broken platform instance implementation, but they could still manually set it to the old pattern in their dhub file.
Checklist
Druid dataset URN is not generated correctly #11639
Druid ingestion using platform_instance creates a duplicate in the dataset URN #12546