[Python] Order of reading pyarrow.dataset.Dataset #45855

bwi-earth · 2025-03-18T20:54:16Z

Describe the usage question you have. Please include as many useful details as possible.

I'm reading about the various ways to read a pyarrow.dataset.Dataset when the dataset is large (so .to_table is excluded).

It seems impossible to read a dataset chunk by chunk in an ordered manner: to_batches doesn't offer any guarantees about the order of the retrieved batches.

The best I've come up with is to list the fragments of the dataset and read each one individually, then sort partial outputs.
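A minimal sketch of that workaround, to make the question concrete. It assumes the fragments are file fragments exposing `.path`, that sorting by path gives the order I want, and that a single fragment fits in memory. The helper name `iter_tables_in_fragment_order` and the `sort_key` parameter are made up for illustration; `get_fragments`, `to_table` and `sort_by` are the pyarrow calls I'm relying on.

```python
def iter_tables_in_fragment_order(dataset, sort_key=None):
    """Yield one table per fragment, in a deterministic fragment order.

    `dataset` is expected to behave like a pyarrow.dataset.Dataset.
    """
    # Assumption: sorting fragments by file path matches the desired order.
    fragments = sorted(dataset.get_fragments(), key=lambda f: f.path)
    for fragment in fragments:
        # Each fragment is read on its own, one at a time.
        table = fragment.to_table()
        if sort_key is not None:
            # Optionally sort the partial output, e.g. by a column name.
            table = table.sort_by(sort_key)
        yield table
```

With a dataset opened via `pyarrow.dataset.dataset(...)`, this would be used as `for table in iter_tables_in_fragment_order(dataset, sort_key="timestamp")`, where `timestamp` is a placeholder column name.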

However, in that case I'm losing the benefit of pyarrow loading data in the background.

(I'm using Parquet stored in S3 as the backend, though that doesn't seem to be relevant.)
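One way I could imagine getting some background loading back while keeping the order is to prefetch the next few fragments in a thread pool and yield the results in submission order. This is only a sketch under assumptions: fragments sorted by `.path` are in the right order, `prefetch` is at least 1, each fragment fits in memory, and the reads actually overlap when run in threads (I have not verified that). The helper name `iter_tables_prefetched` is hypothetical.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def iter_tables_prefetched(dataset, prefetch=2):
    """Yield per-fragment tables in order while reading ahead in threads."""
    # Assumption: sorting fragments by file path matches the desired order.
    fragments = iter(sorted(dataset.get_fragments(), key=lambda f: f.path))
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        pending = deque()
        # Prime the window with up to `prefetch` reads.
        for fragment in fragments:
            pending.append(pool.submit(fragment.to_table))
            if len(pending) >= prefetch:
                break
        while pending:
            # Futures are consumed in submission order, so output order is fixed.
            table = pending.popleft().result()
            next_fragment = next(fragments, None)
            if next_fragment is not None:
                pending.append(pool.submit(next_fragment.to_table))
            yield table
```

At most `prefetch` fragments are in flight beyond the one being consumed, so memory use stays bounded by roughly `prefetch + 1` fragments.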

Component(s)

Python
Parquet

bwi-earth added the Type: usage label Mar 18, 2025

github-actions bot added the Component: Python label Mar 18, 2025
