[Python][Parquet] Data corruption when writing partitioned parquet dataset with nested lists depending on number of rows #45765

martinstuder · 2025-03-13T14:19:12Z

Describe the bug, including details regarding any error messages, version, and platform.

Environment

Windows 11 (build 22631.4890)
pyarrow 19.0.1
Python 3.11.10

Issue Description
Data in nested list columns can come back corrupted after writing a partitioned Parquet dataset. Whether the corruption occurs appears to depend on the number of rows in the table.

Reproducible example

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

filename = 'repro.parquet'

n_rows = 11_000  # works with 10_000 rows
n_partitions = 7
n_nested_size = 100_000

lst = list(range(n_nested_size))
r = range(n_rows)

table = pa.table({
    'id': list(r),
    'partition': [i % n_partitions for i in r],
    'nested': [np.random.permutation(lst) for _ in r],
})

pq.write_to_dataset(table, filename, partition_cols=['partition'])  # works without partitioning

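df = pq.read_pandas(filename)  # note: returns a pyarrow.Table, not a pandas DataFrame
for i in r:
    # every row should still hold n_nested_size distinct values
    assert len(set(df['nested'][i])) == n_nested_size  # assertion error is triggered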
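
The assertion above stops at the first bad row. To see how widespread the corruption is, a minimal sketch like the following collects the index of every affected row instead. Note that find_corrupted_rows is a hypothetical helper, not part of pyarrow, and it assumes each row should be a permutation of range(expected_size), as in the reproducer.

```python
from typing import Iterable, List, Optional, Sequence


def find_corrupted_rows(
    rows: Iterable[Optional[Sequence[int]]], expected_size: int
) -> List[int]:
    """Return indices of rows that are not a permutation of range(expected_size).

    Hypothetical helper for illustration; not part of pyarrow.
    """
    expected = set(range(expected_size))
    bad = []
    for index, row in enumerate(rows):
        # A missing (null) row counts as corrupted.
        if row is None:
            bad.append(index)
            continue
        values = list(row)
        # A valid row has exactly expected_size entries covering every value once.
        if len(values) != expected_size or set(values) != expected:
            bad.append(index)
    return bad
```

With the reproducer's variables this could be called as find_corrupted_rows((df['nested'][i].as_py() for i in r), n_nested_size). The returned indices are positions in the table as read back, which need not match the original row order once the data has been partitioned.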

Findings

The issue does not appear with only 10,000 rows in the example above.
The issue does not appear without partitioning.
The issue does not appear with pyarrow 10.0.1.
The issue also occurs with pyarrow 11.0.0, 15.0.1, 16.1.0, 17.0.0, and 18.1.0.
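
The findings bracket the failure between 10,000 rows (passes) and 11,000 rows (fails). A rough sketch for narrowing that threshold by bisection follows. It assumes the failure is monotonic in the row count, which this report does not establish, and reproduces is a hypothetical callback that would wrap the reproducer above and return True when the assertion fails for a given n_rows.

```python
from typing import Callable


def smallest_failing_row_count(
    reproduces: Callable[[int], bool], passing: int, failing: int
) -> int:
    """Bisect for the smallest row count at which `reproduces` returns True.

    Assumes `passing` does not reproduce, `failing` does, and that every
    count above the threshold also fails (an unverified assumption here).
    """
    if passing >= failing:
        raise ValueError("passing must be smaller than failing")
    while failing - passing > 1:
        mid = (passing + failing) // 2
        if reproduces(mid):
            failing = mid  # threshold is at or below mid
        else:
            passing = mid  # threshold is above mid
    return failing
```

Starting from smallest_failing_row_count(reproduces, 10_000, 11_000), this takes roughly ten runs of the reproducer. If the failure turns out not to be monotonic, the result is only one boundary among possibly several.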

Component(s)

Python

martinstuder · 2025-03-13T17:02:37Z

Also reproduced the issue in the following environment:

Linux 6.13.5-2-MANJARO
Python 3.12.2
pyarrow 19.0.1

martinstuder added the Type: bug label Mar 13, 2025

github-actions bot added the Component: Python label Mar 13, 2025
