
Support for schema evolution in Delta table sink #1

Closed
richiesgr opened this issue Mar 21, 2023 · 2 comments


@richiesgr

Hi Vitor

I'm trying to use your idea for schema evolution in my use case, but I can't get it to work; maybe you have an idea for me.
Basically, we read data from Kafka (Avro format) using Schema Registry.
This data is written directly to a Delta table; we work in Databricks.
After using your code I can validate that the input data really is transformed using the new schema at each version, BUT nothing is reflected in the table sink.
I suspect that once Spark has built the query plan/model it's not able to change it on the fly, and only restarting the process will pick up the new version of the schema.

However, you did it in the case of Kafka as a sink, because there you're able to transform the data back to Avro.

Do you think it's possible to do this? The guys from the ABRiS project claim that it's impossible at all, as mentioned here.

Thanks for any help

@Orpheuz
Owner

Orpheuz commented Mar 27, 2023

Hey @richiesgr, it is indeed not possible to support schema evolution in streaming mode. If you look at SchemaEvolution.scala#L29, I'm mimicking a batch job by using Trigger.Once with the streaming API; I'm running 3 different jobs in the example.
The issue is not related to the sink but to the way Spark streaming works. The query execution plan is generated before the stream runs, which means it only knows the schema fields that were fetched from the schema registry at plan generation; any later updates will be ignored because the plan stays the same. There is no way of doing it seamlessly unless you build something that listens for schema updates and manages your jobs accordingly.
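The "listens to schema updates and manages your jobs" idea above could be sketched as a small supervisor loop: poll the registry for the latest schema version and, whenever it changes, launch a fresh batch-like run (e.g. a Trigger.Once Spark job), so each run builds its plan against the current schema. This is a minimal illustration, not code from this repository; `fetchLatestVersion` and `runBatchLikeJob` are hypothetical placeholders for a Schema Registry lookup and a job launcher.

```scala
// Sketch of the restart-on-schema-change pattern: the streaming plan is
// fixed when the query starts, so an external supervisor must detect a new
// Schema Registry version and relaunch the job. All names here
// (fetchLatestVersion, runBatchLikeJob) are hypothetical placeholders.
object SchemaSupervisor {
  // Placeholder for a Schema Registry lookup (e.g. via its REST API).
  type VersionFetcher = () => Int

  /** Polls the registry `polls` times and starts a new run whenever the
    * fetched schema version differs from the one the last run used.
    * Returns the versions that triggered a run, in order. */
  def superviseRuns(fetchLatestVersion: VersionFetcher,
                    polls: Int,
                    runBatchLikeJob: Int => Unit): List[Int] = {
    var started = List.empty[Int]
    var current = -1 // no run yet
    for (_ <- 1 to polls) {
      val latest = fetchLatestVersion()
      if (latest != current) {   // new schema published since the last run
        runBatchLikeJob(latest)  // e.g. submit a Trigger.Once Spark job
        current = latest
        started = started :+ latest
      }
    }
    started
  }
}
```

With a simulated registry that reports versions 1, 1, 2, 2, 3 across five polls, the supervisor launches exactly three runs, one per distinct version. In practice the polling loop would be a long-running service (or a registry webhook/listener), and each launched run would re-fetch the schema itself at plan time.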

@richiesgr
Author

Thanks for your help and great job
