[Python][Parquet] Data corruption when writing partitioned parquet dataset with nested lists depending on number of rows #45765

Open
martinstuder opened this issue Mar 13, 2025 · 1 comment

martinstuder commented Mar 13, 2025

Describe the bug, including details regarding any error messages, version, and platform.

Environment

  • Windows 11 (build 22631.4890)
  • pyarrow 19.0.1
  • Python 3.11.10

Issue Description
Data in nested list columns can be corrupted when writing a partitioned Parquet dataset. Whether the corruption occurs appears to depend on the number of rows in the dataset.

Reproducible example

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

filename = 'repro.parquet'

n_rows = 11_000  # works with 10_000 rows
n_partitions = 7
n_nested_size = 100_000

lst = list(range(n_nested_size))
r = range(n_rows)

table = pa.table({
    'id': list(r),
    'partition': [i % n_partitions for i in r],
    'nested': [np.random.permutation(lst) for _ in r],
})

pq.write_to_dataset(table, filename, partition_cols=['partition'])  # works without partitioning

df = pq.read_pandas(filename)
for i in r:
    assert len(set(df['nested'][i])) == n_nested_size  # assertion error is triggered
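
The assertion loop above stops at the first bad row. To see how widespread the corruption is, a minimal sketch like the following could collect every affected row instead. The helper names `find_corrupted_rows` and `check_dataset` are hypothetical (they are not part of pyarrow), and the sketch assumes the dataset was written by the reproducer above, with hive-style partition directories and `id` and `nested` columns.

```python
def find_corrupted_rows(ids, nested_rows, expected_size):
    """Return (id, n_distinct, n_total) for every row that is not a
    permutation of `expected_size` distinct values."""
    bad = []
    for row_id, values in zip(ids, nested_rows):
        values = list(values)
        n_distinct = len(set(values))
        if len(values) != expected_size or n_distinct != expected_size:
            bad.append((row_id, n_distinct, len(values)))
    return bad


def check_dataset(path, expected_size, batch_size=64):
    """Stream the written dataset in small batches and report corrupted rows.

    Assumes hive-style partitioning and the column names used by the
    reproducer ('id', 'nested').
    """
    import pyarrow.dataset as ds  # imported lazily; only this helper needs it

    dataset = ds.dataset(path, format="parquet", partitioning="hive")
    bad = []
    for batch in dataset.to_batches(columns=["id", "nested"], batch_size=batch_size):
        cols = batch.to_pydict()
        bad.extend(find_corrupted_rows(cols["id"], cols["nested"], expected_size))
    return sorted(bad)
```

With the reproducer's variables, `check_dataset(filename, n_nested_size)` would return an empty list for an intact dataset. Reading in small batches keeps the conversion to Python lists from holding every nested list in memory at once.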

Findings

  • Issue does not appear with only 10,000 rows (`n_rows = 10_000`) in the example above
  • Issue does not appear without partitioning
  • Issue does not appear with pyarrow 10.0.1
  • Issue also happens in pyarrow 11.0.0, 15.0.1, 16.1.0, 17.0.0 and 18.1.0
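
Since the failure depends on the row count, it may help to look at the sizes involved. The sketch below is plain arithmetic on the reproducer's parameters. The function name `describe_sizes` is hypothetical, and comparing against powers of two is only an assumption about where to look. The report does not establish the cause.

```python
import math


def describe_sizes(n_rows, n_nested_size=100_000, n_partitions=7):
    """Summarize element counts for the reproducer's parameters."""
    total_values = n_rows * n_nested_size
    # Rows are assigned with i % n_partitions, so the largest partition
    # holds ceil(n_rows / n_partitions) rows.
    max_rows_per_partition = math.ceil(n_rows / n_partitions)
    return {
        "total_values": total_values,
        "max_rows_per_partition": max_rows_per_partition,
        "max_values_per_partition": max_rows_per_partition * n_nested_size,
        "total_exceeds_2_30": total_values > 2**30,
        "total_exceeds_2_31": total_values > 2**31,
    }


for n in (10_000, 11_000):
    print(n, describe_sizes(n))
```

With 100,000 values per row, 10,000 rows give 1,000,000,000 nested values in total and 11,000 rows give 1,100,000,000, so the two cases fall on either side of 2**30 (1,073,741,824). Whether that boundary is actually related to the corruption would need to be confirmed by testing other combinations of `n_rows` and `n_nested_size`.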

Component(s)

Python

martinstuder (Author) commented

Also reproduced the issue in the following environment:

  • Linux 6.13.5-2-MANJARO
  • Python 3.12.2
  • pyarrow 19.0.1
