[Python] Order of reading pyarrow.dataset.Dataset #45855

Open
bwi-earth opened this issue Mar 18, 2025 · 0 comments
Labels
Component: Python Type: usage Issue is a user question


bwi-earth commented Mar 18, 2025

Describe the usage question you have. Please include as many useful details as possible.

I'm reading about the various ways to read a pyarrow.dataset.Dataset when the dataset is large (so .to_table is excluded).

It seems impossible to read a dataset chunk by chunk while preserving order: to_batches doesn't offer any guarantees about the order of the retrieved batches.

The best I've come up with is to list the fragments of the dataset and read each one individually, then sort partial outputs.

However, if that's the case, I'm losing the benefit of pyarrow loading data in the background.
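
One possible way to get some background loading back while keeping order is to submit the per-fragment reads to a thread pool and consume the results strictly in submission order. The sketch below uses only the standard library; `ordered_prefetch`, `load`, and `max_ahead` are hypothetical names, and `slow_load` merely simulates reads that finish out of order. It is not a pyarrow feature.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def ordered_prefetch(items, load, max_ahead=4):
    """Yield load(item) in input order, keeping up to max_ahead loads in flight."""
    with ThreadPoolExecutor(max_workers=max_ahead) as pool:
        pending = deque()
        for item in items:
            pending.append(pool.submit(load, item))
            if len(pending) >= max_ahead:
                # Block on the oldest load only; later ones keep running.
                yield pending.popleft().result()
        while pending:
            yield pending.popleft().result()


def slow_load(n):
    # Stand-in for reading one fragment; larger n takes longer,
    # so loads complete out of order.
    time.sleep(n * 0.01)
    return n * 10


results = list(ordered_prefetch([3, 1, 2, 0], slow_load, max_ahead=2))
```

With a real dataset, `items` could be the sorted fragments and `load` something like `lambda frag: frag.to_table()`; at most `max_ahead` fragments are then held in flight at once, so that value trades memory for read-ahead.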

(I'm using Parquet stored in S3 as the backend; that doesn't seem to be relevant, though.)

Component(s)

  • Python
  • Parquet
@bwi-earth bwi-earth added the Type: usage Issue is a user question label Mar 18, 2025