Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

How does the column data type inference work? I've run into that challenge myself in the past.


The CSV auto-detector was implemented by Till Döhmen, who did his master thesis on the subject [1] and has actually written a paper about it [2].

Essentially we have a list of candidate types for each column (starting with all types). We then sample a number of tuples from various parts of the file, and progressively reduce the number of candidate types as we detect conflicts. We then take the most restrictive type from the remaining set of types, with `STRING` as a last resort in case we cannot convert to any other type. After we have figured out the types, we start the actual parsing.

Note that it is possible we can end up with incorrect types in certain edge cases, e.g. if you have a column that has only numbers besides one row that is a string. If that row is not present in the sampling an error will be thrown and the user will need to override the type inference manually. This is generally rather rare, however.

You could also use DuckDB to do your type-inference for you!

  duckdb -c "DESCRIBE SELECT * FROM taxi.csv"
And if you want to change the sample size:

  duckdb -c "DESCRIBE SELECT * FROM read_csv_auto('taxi.csv', sample_size=9999999999999)"
[1] https://homepages.cwi.nl/~boncz/msc/2016-Doehmen.pdf

[2] https://ir.cwi.nl/pub/26416/SSDM1111.pdf


This is fantastic, thanks.

My solution is a lot less smart - I loop through every record and keep track of which potential types I've seen for each column: https://sqlite-utils.datasette.io/en/latest/python-api.html#...

Implementation here: https://github.com/simonw/sqlite-utils/blob/3fbe8a784cc2f3fa...


That works and is similar to what DuckDB does for small CSV files. We have the sampling step primarily for larger CSV files, when you might not want to do two passes over the file. This way we can keep the benefits of streamed file reading while offering type inference that "just works" most of the time without a major performance penalty.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: