Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to handle deserialization issues in from_avro? #182

Closed
nsanglar opened this issue Feb 8, 2021 · 4 comments
Closed

How to handle deserialization issues in from_avro? #182

nsanglar opened this issue Feb 8, 2021 · 4 comments
Labels
question Further information is requested

Comments

@nsanglar
Copy link

nsanglar commented Feb 8, 2021

Hello!

I am currently facing the following issue:

  1. We get avro records from a topic that we read with spark streaming (2.4.x)
  2. One of the avro record contains some malformed byte array (the type is bytes with logical type decimal)
  3. This makes the deserialization fail, and the job cannot commit the processed offset since it aborts.
  4. Upon restart, the job re-reads the faulty data and cannot go further

I would like to be able to ignore such cases where deserialization fails, but am struggling to find a nice solution.
Would you have any idea?

@cerveada
Copy link
Collaborator

Hello, sorry right now there is no option in Abris to solve that. I created a ticket for it #183. For now the only option is to detect and replace/fix that row before Abris is called.

@cerveada cerveada added the question Further information is requested label Feb 10, 2021
@moyphilip
Copy link

@nsanglar hey do you have a solution for your problem? I ran into a similar issue.

@nsanglar
Copy link
Author

@moyphilip I currently have a fork of the project in which I apply a different logic here:

case NonFatal(e) => throw new SparkException("Malformed records are detected in record parsing.", e)

I just don't throw an exception but return an empty row and log some error.
I guess this is quite specific to my use case, so I am not sure this would be appropriate to incoporate it upstream.

@ScaddingJ
Copy link
Contributor

@nsanglar how did you manage to return an empty row in your use case? I have a reason to do the same in mine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants