Parquet reader: option to pass INT96 as bytes instead of as Timestamp #7220
Comments
This makes sense to me -- if the reader wants to read an INT96 as FixedSizeBinary(12) that seems quite reasonable to me.
@paleolimbot mentioned there is a similar thing in arrow-cpp: https://github.com/apache/arrow/blob/784aa6faf69f5cf135e09976a281dea9ebf58166/cpp/src/parquet/arrow/schema_internal.cc#L205-L206
This looks to just influence what TimeUnit it coerces to, e.g. milliseconds, nanoseconds, etc...
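For reference, an INT96 value packs an 8-byte little-endian nanoseconds-of-day field followed by a 4-byte little-endian Julian day number, so the coercion that option controls essentially amounts to choosing which unit the combined value is scaled to. A minimal sketch in Rust, illustrative only and not the crate's actual code:

```rust
// Illustrative sketch of INT96 decoding; not the parquet crate's implementation.
// Layout (as written by Spark/Impala/Hive): bytes 0..8 = nanoseconds within the
// day (little-endian u64), bytes 8..12 = Julian day number (little-endian u32).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day number of 1970-01-01
const SECONDS_PER_DAY: i64 = 86_400;

/// `units_per_second` picks the target resolution: 1_000_000_000 for
/// nanoseconds, 1_000_000 for microseconds, and so on.
fn int96_to_timestamp(bytes: &[u8; 12], units_per_second: i64) -> i64 {
    let nanos_of_day = u64::from_le_bytes(bytes[..8].try_into().unwrap()) as i64;
    let julian_day = u32::from_le_bytes(bytes[8..].try_into().unwrap()) as i64;
    let days_since_epoch = julian_day - JULIAN_DAY_OF_EPOCH;
    days_since_epoch * SECONDS_PER_DAY * units_per_second
        + nanos_of_day / (1_000_000_000 / units_per_second)
}
```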
My 2 cents is that whilst possible, this results in an unfortunate UX. IMO we should support Int96 to the best of our ability, rather than forcing every downstream to reproduce this logic. Whilst it may be somewhat depressing that Spark is STILL writing a type that has been deprecated for almost a decade, it is where we are at and we should support it. That being said, I would suggest we split this issue into two parts:
I suspect most users only actually care about the first of these: the number of people writing dates pre-1900 is likely small, and the number doing so with a half-decade-old version of Spark or Hive is likely even smaller. We can likely leave that as an issue for someone to pick up if they have a use case for it.
This topic came up on the DataFusion call this week. I think @mbutrovich has the use case for handling this in Spark. While not opposed to adding a Spark-specific rebase mode (or whatever we will call it 🤮), I also think adding the ability in general for the parquet reader to pass out uninterpreted bytes (in this case FixedSizeBinary(12)) is a good addition, as it allows downstream crates an "escape hatch" until / if we implement a more holistic solution in this crate.
I'm good with this approach. As I mentioned, I wasn't sure how much Spark-specific logic we wanted to bring down to the Parquet reader level, but I can work with this design. I might ask some follow-up questions about how to expose options that far into the Parquet reader, since most of the API seems to be encoded with …
One challenge I had with an implementation that doesn't copy the data twice is that …
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We are adapting DataFusion Comet (a Spark accelerator) to use DataFusion's native Parquet scan backed by arrow-rs. Spark defaults to writing timestamps in Parquet as INT96 (a la Hive, Impala, and other systems), which most systems infer as a timestamp despite the Parquet spec having a separate timestamp type. In arrow-rs's case, it converts to a Timestamp(TimeUnit::Nanosecond, None). The nanosecond precision renders the data type unable to represent the same range of dates as what Spark wrote to the file originally.
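For context on the range loss: an Arrow timestamp is a single i64 in the chosen unit, so nanosecond precision only reaches roughly the years 1677–2262, while microseconds cover several hundred thousand years. A quick back-of-the-envelope check (approximate, ignoring leap years):

```rust
fn main() {
    // Approximate year length; good enough to show the orders of magnitude.
    const NANOS_PER_YEAR: i64 = 1_000_000_000 * 60 * 60 * 24 * 365;
    // i64 nanoseconds since 1970: about ±292 years, i.e. roughly 1677..=2262.
    println!("±{} years at nanosecond precision", i64::MAX / NANOS_PER_YEAR);
    // i64 microseconds since 1970: about ±292,000 years.
    println!("±{} years at microsecond precision", i64::MAX / (NANOS_PER_YEAR / 1_000));
}
```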
Describe the solution you'd like
An opt-in feature that passes the unmodified INT96 bytes through for each value, perhaps as FixedSizeBinary(12).
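As a rough sketch of how a downstream engine could consume that escape hatch, assuming the reader hands back a FixedSizeBinaryArray of 12-byte values; the policy shown here (plain rebasing to microseconds) is just an example, not Comet's actual logic:

```rust
use arrow_array::{Array, FixedSizeBinaryArray, TimestampMicrosecondArray};

const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const MICROS_PER_DAY: i64 = 86_400 * 1_000_000;

/// Reinterpret raw INT96 bytes coming out of the reader as microsecond timestamps.
fn int96_bytes_to_micros(col: &FixedSizeBinaryArray) -> TimestampMicrosecondArray {
    (0..col.len())
        .map(|i| {
            if col.is_null(i) {
                return None;
            }
            let v = col.value(i); // 12 bytes per value
            let nanos_of_day = u64::from_le_bytes(v[..8].try_into().unwrap()) as i64;
            let julian_day = u32::from_le_bytes(v[8..].try_into().unwrap()) as i64;
            Some((julian_day - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanos_of_day / 1_000)
        })
        .collect()
}
```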
Describe alternatives you've considered
Timestamp(TimeUnit::Microsecond, None) would support a larger range of dates. However, I do not think it's reasonable to push Spark-specific options into arrow-rs. Another option would be Time64 and Date32 Arrow types, which are essentially what an INT96 timestamp represents (see the sketch below); however, I take the same issue with this as with the previous point.
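For that second alternative the mapping would be direct, since an INT96 value already stores the date part and the time-of-day part separately. Only a sketch of the per-value arithmetic, not a proposed API:

```rust
/// Split one INT96 value into the two parts it already encodes:
/// a Date32 (days since the Unix epoch) and a Time64 value in nanoseconds-of-day.
fn int96_to_date32_time64(bytes: &[u8; 12]) -> (i32, i64) {
    const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day number of 1970-01-01
    let time64_nanos = u64::from_le_bytes(bytes[..8].try_into().unwrap()) as i64;
    let julian_day = u32::from_le_bytes(bytes[8..].try_into().unwrap()) as i64;
    ((julian_day - JULIAN_DAY_OF_EPOCH) as i32, time64_nanos)
}
```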
Additional context
DataFusion's SchemaAdapter gives us a lot of control over how to adjust data coming out of its Parquet scan. However, because this "lossy" conversion from INT96 to an Arrow type happens in arrow-rs, it's too late for us to fix it in a custom SchemaAdapter. If we implement this feature, we will be able to handle all of the Spark-specific quirks in a SchemaAdapter.