Describe the bug, including details regarding any error messages, version, and platform.
Environment
- Windows 11 (build 22631.4890)
- pyarrow 19.0.1
- Python 3.11.10
Issue Description
Data in nested list columns can be corrupted when writing partitioned parquet datasets, seemingly depending on how many rows the dataset has.
Reproducible example
```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

filename = 'repro.parquet'
n_rows = 11_000  # works with 10_000 rows
n_partitions = 7
n_nested_size = 100_000
lst = list(range(n_nested_size))

r = range(n_rows)
table = pa.table({
    'id': list(r),
    'partition': [i % n_partitions for i in r],
    'nested': [np.random.permutation(lst) for _ in r],
})

pq.write_to_dataset(table, filename, partition_cols=['partition'])  # works without partitioning
df = pq.read_pandas(filename)
for i in r:
    assert len(set(df['nested'][i])) == n_nested_size  # assertion error is triggered
```
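A pandas-free read can help localize the bug. The following diagnostic sketch is not from the original report; it assumes the repro above has just run, and compares the written data against the in-memory table:

```python
# Diagnostic sketch (assumption, not part of the original repro): read the
# dataset back as a plain pyarrow Table, bypassing the pandas conversion.
# Rows are re-sorted by 'id' because a partitioned read does not preserve
# row order.
read_back = pq.read_table(filename).sort_by('id')
original = table.sort_by('id')
# A mismatch here would point at the dataset writer rather than at read_pandas.
assert read_back.column('nested').equals(original.column('nested'))
```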
Findings
- The issue does not appear with only 10,000 rows in the example above
- The issue does not appear without partitioning (see the sketch after this list)
- The issue does not appear with pyarrow 10.0.1
- The issue also occurs with pyarrow 11.0.0, 15.0.1, 16.1.0, 17.0.0, and 18.1.0
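Given the second finding, writing without `partition_cols` looks like a viable workaround. A minimal sketch, reusing `table`, `r`, and `n_nested_size` from the repro (`repro_unpartitioned.parquet` is an illustrative name, not from the report):

```python
# Workaround sketch (an assumption based on the finding above, not part of
# the original report): the same table round-trips correctly when written
# as a single unpartitioned file.
pq.write_table(table, 'repro_unpartitioned.parquet')
roundtrip = pq.read_pandas('repro_unpartitioned.parquet')
for i in r:
    assert len(set(roundtrip['nested'][i])) == n_nested_size  # expected to pass
```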
Component(s)
Python