Describe the usage question you have. Please include as many useful details as possible.
I'm reading about the various methods to read a pyarrow.dataset.Dataset in the large-dataset case (.to_table is excluded).
It seems that it is impossible to read a dataset chunk by chunk in an ordered manner; to_batches doesn't offer any guarantees about the order of the retrieved batches.
The best I've come up with is to list the fragments of the dataset, read each one individually, and then sort the partial outputs, as sketched below.
However, if that's the case, I'm losing the benefit of pyarrow loading data in the background.
(I'm using Parquet stored in S3 as the backend; that doesn't seem to be relevant, though.)
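For reference, a minimal sketch of the fragment-by-fragment workaround I'm describing. The local path and the process() consumer are placeholders (my real data is Parquet on S3), and I'm assuming sorting fragments by file path gives the ordering I want:

```python
import pyarrow.dataset as ds

# Placeholder path; in my case this would be an S3 URI with a filesystem argument.
dataset = ds.dataset("data/", format="parquet")

# Sort fragments by file path so the iteration order is deterministic.
fragments = sorted(dataset.get_fragments(), key=lambda f: f.path)

for fragment in fragments:
    # Reading one fragment at a time keeps the order I need, but it is
    # sequential, so I lose the readahead a full dataset scan would do.
    for batch in fragment.to_batches():
        process(batch)  # hypothetical consumer of each RecordBatch
```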
Component(s)
Python
Parquet