Druid ingestion using platform_instance creates a duplicate in the dataset URN #12546

Closed
Rasnar opened this issue Feb 4, 2025 · 1 comment
Labels
bug Bug report

Comments

@Rasnar
Contributor

Rasnar commented Feb 4, 2025

Describe the bug
When using the druid source with platform_instance set, the platform instance name is added twice to the final dataset URN.

To Reproduce
Run the ingestion with the DataHub CLI: datahub ingest -c <path to yaml content shown below>.

Using the following recipe:

source:
  type: druid
  config:
    username: ${USER}
    password: ${PASS}
    host_port: ${DRUID_ENDPOINT}
    scheme: druid+https
    env: PROD
    platform_instance: "team1"
    # Ignore internal tables
    schema_pattern:
      deny: ["^(lookup|sysgit|view|sys).*"]
    stateful_ingestion:
      enabled: True # False by default

Suppose Druid has a database named druid-metrics. I then get a URN like this:

'urn:li:dataset:(urn:li:dataPlatform:druid,team1.team1.druid-metrics,PROD)'

We can see the team1 entry being set twice in the dataset path.

If I remove the platform_instance, the path is like this, which is correct:

'urn:li:dataset:(urn:li:dataPlatform:druid,druid-metrics,PROD)'

Expected behavior

Based on the example above, the URN should contain the platform_instance only once in the name, which would result in something like this:

'urn:li:dataset:(urn:li:dataPlatform:druid,team1.druid-metrics,PROD)'
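
To check already-ingested URNs for this pattern, here is a minimal sketch in plain Python (no DataHub imports). It splits a dataset URN into platform, name, and env, then reports whether the name starts with the platform instance twice. The names `DatasetUrn`, `parse_dataset_urn`, and `has_duplicate_instance` are hypothetical helpers written for this example and are not part of the DataHub API. The regular expression is a simplification that only covers the URN shape shown in this issue.

```python
import re
from typing import NamedTuple

# Simplified pattern for the URN shape shown in this issue:
# urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
_DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),(.+),([^,)]+)\)$"
)


class DatasetUrn(NamedTuple):
    platform: str
    name: str
    env: str


def parse_dataset_urn(urn: str) -> DatasetUrn:
    """Split a dataset URN string into its three parts (hypothetical helper)."""
    match = _DATASET_URN.match(urn)
    if match is None:
        raise ValueError(f"not a dataset URN: {urn!r}")
    return DatasetUrn(*match.groups())


def has_duplicate_instance(urn: str, platform_instance: str) -> bool:
    """Return True if the dataset name starts with '<instance>.<instance>.'."""
    prefix = f"{platform_instance}."
    return parse_dataset_urn(urn).name.startswith(prefix + prefix)


reported = "urn:li:dataset:(urn:li:dataPlatform:druid,team1.team1.druid-metrics,PROD)"
expected = "urn:li:dataset:(urn:li:dataPlatform:druid,team1.druid-metrics,PROD)"

print(has_duplicate_instance(reported, "team1"))
print(has_duplicate_instance(expected, "team1"))
```

With the two URNs from this report, the first call flags the duplicate and the second does not.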

Desktop (please complete the following information):
We are currently using v0.14.1, but I also tested with CLI v0.15.1 and the result is the same: the platform_instance is added twice to the final URN.

Additional context
In the druid case, the platform_instance is added here:
https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/druid.py#L54

It may also be added a second time somewhere further up the call chain, but I don't know the code well enough to say where.
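
One way a double prefix like this can arise is when two layers each prepend the instance: a source-specific identifier hook and a generic URN builder. The sketch below illustrates that hypothesis in plain Python. `get_identifier` and `build_dataset_urn` are simplified stand-ins written for this example, not the actual DataHub code, and whether DataHub behaves this way has not been verified against the source.

```python
from typing import Optional


def get_identifier(table: str, platform_instance: Optional[str]) -> str:
    """Stand-in for a source-level hook that prefixes the instance itself."""
    return f"{platform_instance}.{table}" if platform_instance else table


def build_dataset_urn(
    platform: str, name: str, env: str, platform_instance: Optional[str] = None
) -> str:
    """Stand-in for a generic URN builder that also prefixes the instance."""
    if platform_instance:
        name = f"{platform_instance}.{name}"
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Both layers add the prefix: the instance appears twice.
doubled = build_dataset_urn(
    "druid", get_identifier("druid-metrics", "team1"), "PROD", "team1"
)

# Only the URN builder adds the prefix: the instance appears once.
single = build_dataset_urn(
    "druid", get_identifier("druid-metrics", None), "PROD", "team1"
)

print(doubled)
print(single)
```

If this is what happens, the fix would be to have only one of the two layers add the prefix.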

@Rasnar Rasnar added the bug Bug report label Feb 4, 2025
@Rasnar
Contributor Author

Rasnar commented Feb 4, 2025

Oh I just found a duplicate:
#11639
