Disable or configure date and/or time inference when reading CSV files #45814

rudolfbyker · 2025-03-17T10:46:15Z

Status quo

ConvertOptions has column_types and timestamp_parsers.
column_types allows pinning a column to one exact type, but it does not allow limiting the inference to a subset of all available types (e.g., saying that column "A" may be string, int, or bool, but not date or time).
timestamp_parsers does not seem to allow disabling date detection (e.g. for a string like "2025-03-17"), and it's not configurable per column.

In the software I'm writing, we apply the following rules when reading CSV files:

The user may specify one or more data types that are allowed for a column. The type is inferred, but restricted to one of those types.
If the user does not specify the allowed data types for a column, the allowed types are string and float, since numeric columns are very common, and much easier to detect reliably than dates and/or times.
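A rough sketch of these two rules in plain Python, to make the intended behaviour concrete. Everything here is hypothetical: the type names are plain strings, and infer_restricted, CHECKS and DEFAULT_ALLOWED are names invented for this example, not part of pyarrow. The order of the allowed list matters: the first candidate that fits every value wins.

```python
from typing import Callable, Dict, Iterable, Sequence


def _parses_with(parser: Callable[[str], object]) -> Callable[[str], bool]:
    """Turn a parsing function into a yes/no check for a single value."""
    def check(text: str) -> bool:
        try:
            parser(text)
        except ValueError:
            return False
        return True
    return check


# Hypothetical candidate names mapped to per-value checks.
# Note: Python's int() and float() are more lenient than a real CSV
# parser would be (e.g. "1_000", "nan"); this is only an illustration.
CHECKS: Dict[str, Callable[[str], bool]] = {
    "int": _parses_with(int),
    "float": _parses_with(float),
    "bool": lambda text: text.lower() in {"true", "false"},
    "string": lambda text: True,
}

# Default when the user specifies nothing: float first, then string.
DEFAULT_ALLOWED = ("float", "string")


def infer_restricted(
    values: Iterable[str],
    allowed: Sequence[str] = DEFAULT_ALLOWED,
) -> str:
    """Return the first allowed type that every value satisfies."""
    values = list(values)
    for name in allowed:
        if all(CHECKS[name](value) for value in values):
            return name
    raise ValueError(f"no allowed type fits the values: {allowed!r}")


# Date-like strings stay strings because no date type is a candidate.
print(infer_restricted(["2025-03-17", "2025-03-18"]))
print(infer_restricted(["1", "2"], allowed=("int", "float", "string")))
```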

We often encounter data containing strings which look like dates, but aren't.

Some of these suggestions are complementary, and I would understand if you want to split them into separate issues:

Allow specifying a list of allowed data types per column, and a default list of allowed data types.
- This could be done using a defaultdict in Python, OR two separate options in ConvertOptions where one is Mapping[str, Sequence[DataType]] (keyed by column name) and the other is Sequence[DataType].
- DuckDB has an option like this, called auto_type_candidates, but unfortunately, theirs is also not configurable per column.
- What would be even more amazing is to map regular expressions (to be executed on the column names) to lists of allowed data types. (Example use case: All columns named foo_\d+ should be limited to float|int.)
Make timestamp_parsers configurable per column, but keep a default global one that works for columns that are not specified explicitly.
Make something similar to timestamp_parsers that also works for dates.
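To make the first suggestion more concrete, here is a sketch of how the allowed types for a column could be looked up from an explicit per-column mapping, a regex mapping, and a default list. None of this exists in pyarrow: resolve_allowed_types and its parameters are hypothetical names, and type names are plain strings to keep the example self-contained. I assume an exact column name takes precedence over a pattern, and a pattern over the default.

```python
import re
from typing import Mapping, Sequence


def resolve_allowed_types(
    column: str,
    per_column: Mapping[str, Sequence[str]],
    patterns: Mapping[str, Sequence[str]],
    default: Sequence[str],
) -> Sequence[str]:
    """Pick the list of allowed types for one column name."""
    # 1. An explicit entry for this exact column name wins.
    if column in per_column:
        return per_column[column]
    # 2. Otherwise use the first pattern that matches the whole name.
    for pattern, types in patterns.items():
        if re.fullmatch(pattern, column):
            return types
    # 3. Fall back to the default list.
    return default


per_column = {"A": ["string", "int", "bool"]}
patterns = {r"foo_\d+": ["float", "int"]}
default = ["string", "float"]

for name in ["A", "foo_12", "created"]:
    print(name, resolve_allowed_types(name, per_column, patterns, default))
```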

Python


rudolfbyker added the Type: enhancement label Mar 17, 2025

github-actions bot added the Component: Python label Mar 17, 2025