Describe the enhancement requested
Status quo
ConvertOptions has column_types and timestamp_parsers.
column_types allows pinning a column to one exact type, but it does not allow limiting the inference to a subset of the available types (e.g., saying that column "A" may be string, int, or bool, but not date or time).
timestamp_parsers does not seem to allow disabling date detection (e.g., for a string like "2025-03-17"), and it is not configurable per column.
Use case / application
In the software I'm writing, we apply the following rules when reading CSV files:
The user may specify one or more data types that are allowed for a column. The type is still inferred, but restricted to one of those types.
If the user does not specify the allowed data types for a column, the allowed types are string and float, since numeric columns are very common and are much easier to detect reliably than dates and/or times.
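To make these rules concrete, here is a minimal pure-Python sketch of restricted inference. It is not how any CSV reader implements inference; the type names, the `CHECKERS` table, and `infer_type` are hypothetical names made up for illustration, and the candidates are simply tried in the order given.

```python
from typing import Callable, Sequence


def _parses_as(convert: Callable[[str], object]) -> Callable[[str], bool]:
    """Build a checker that reports whether `convert` accepts a string."""
    def check(value: str) -> bool:
        try:
            convert(value)
            return True
        except ValueError:
            return False
    return check


# Hypothetical type names mapped to per-value checkers.
CHECKERS: dict[str, Callable[[str], bool]] = {
    "int": _parses_as(int),
    "float": _parses_as(float),
    "bool": lambda value: value.strip().lower() in {"true", "false"},
    "string": lambda value: True,  # every value is a valid string
}

# Default from the rules above: only float and string are candidates.
DEFAULT_ALLOWED: Sequence[str] = ("float", "string")


def infer_type(values: Sequence[str],
               allowed: Sequence[str] = DEFAULT_ALLOWED) -> str:
    """Return the first allowed type that every value in the column satisfies."""
    for name in allowed:
        if all(CHECKERS[name](value) for value in values):
            return name
    raise ValueError(f"no allowed type fits the column: {list(allowed)}")
```

With the default candidates, a column such as `["2025-03-17", "2025-03-18"]` is never considered a date because date is simply not in the candidate list; it falls through to string.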
We often encounter data containing strings which look like dates, but aren't.
Suggestions
Some of these suggestions are complementary, and I would understand if you want to split them into separate issues:
Allow specifying a list of allowed data types per column, and a default list of allowed data types.
This could be done using a defaultdict in Python, or two separate options in ConvertOptions where one is Mapping[str, Sequence[DataType]] (keyed by column name) and the other is Sequence[DataType].
DuckDB has an option like this, called auto_type_candidates, but unfortunately, theirs is also not configurable per column.
What would be even more amazing is to map regular expressions (to be executed on the column names) to lists of allowed data types. (Example use case: All columns named foo_\d+ should be limited to float|int.)
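The lookup order this suggestion implies (exact column name, then regular expression on the column name, then the default list) could be sketched as follows. This is only an illustration of the proposal, not an existing option; `allowed_types_for` and its parameters are hypothetical, and plain strings stand in for DataType objects.

```python
import re
from typing import Mapping, Sequence


def allowed_types_for(
    column: str,
    by_name: Mapping[str, Sequence[str]],
    by_pattern: Mapping[str, Sequence[str]],
    default: Sequence[str],
) -> Sequence[str]:
    """Resolve the candidate types for one column.

    Precedence: exact name match, then the first regex that matches the
    whole column name, then the default list.
    """
    if column in by_name:
        return by_name[column]
    for pattern, types in by_pattern.items():
        # fullmatch so that "foo_\d+" does not also match "foo_1_extra"
        if re.fullmatch(pattern, column):
            return types
    return default


# Example configuration mirroring the cases described above.
by_name = {"A": ("string", "int", "bool")}
by_pattern = {r"foo_\d+": ("float", "int")}
default = ("float", "string")
```

Keeping the exact-name mapping separate from the pattern mapping avoids ambiguity when a literal column name happens to contain regex metacharacters.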
Make timestamp_parsers configurable per column, but keep a default global one that works for columns that are not specified explicitly.
Make something similar to timestamp_parsers that also works for dates.
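The per-column parser idea with a global fallback might behave roughly like the sketch below. It uses the standard library's `datetime.strptime` rather than any real CSV reader, and `parse_timestamp`, `per_column`, and `default_formats` are hypothetical names. An empty format list for a column is treated here as "detection disabled", which is one possible way to cover the date-like-string case.

```python
from datetime import datetime
from typing import Mapping, Optional, Sequence


def parse_timestamp(
    value: str,
    column: str,
    per_column: Mapping[str, Sequence[str]],
    default_formats: Sequence[str],
) -> Optional[datetime]:
    """Try the column's own formats, or the global defaults if it has none.

    Returns None when no format matches, meaning the value stays a string.
    """
    formats = per_column.get(column, default_formats)
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next candidate format
    return None


# Example configuration (assumed for illustration only).
per_column = {
    "part_no": [],            # looks like a date but is not: never parse
    "eu_date": ["%d/%m/%Y"],  # column-specific format
}
default_formats = ["%Y-%m-%d"]
```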
Component(s)
Python