I Can't Believe It's Not Blogging

The Message is Medium Rare


Apr 22 2010

Riffing off one of my favorite file format-related rants, I'd like to take a moment to speak to you about CSV. CSV is not a good format. CSV is not even a bad format. In fact, CSV is arguably not a file format as such. It's a loose collection of vaguely similar practices that have coalesced around the import-export features of a few popular programs. As Wikipedia puts it: "No general standard specification for CSV exists. Variations between CSV implementations in different programs are quite common and can lead to interoperation difficulties."

It all seems so simple: rows/records are separated by an end-of-line. Columns/fields are separated by a comma. In most programming languages these days, you've got some sort of facility for reading in a line of a file, and an easy mechanism for splitting that line by commas into an array. So this should be simple... right?
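Here's that naive approach as a quick Python sketch (data.csv is just a placeholder filename):

with open('data.csv') as f:
    # One record per line, one field per comma -- the dream.
    rows = [line.rstrip('\n').split(',') for line in f]

And for truly simple data, this actually works.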

The question that quickly comes up (and which I believe most flavors of CSV get wrong) is "What do we do when our record and field separators are part of the data -- when a column needs to contain a comma or an end-of-line?" And the answer seems to be, "Well, we bring in another token (the double quote) which can optionally be used to enclose field data."
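For example, a field that contains a comma gets wrapped in double quotes:

Row 1 Field 1,"Row 1, Field 2",Row 1 Field 3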

And with that decision, the door to dead-simple parsing closes. You can't simply split fields by a character. You could probably employ varying regular expression strategies to split fields out of a line... if it weren't for the fact that there's no easy way to grab a line -- because a carriage return could conceivably be contained in a field.
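A quick Python illustration of the line-grabbing problem, using a quoted field that contains a line break:

data = 'Row 1 Field 1,"Row 1\nField 2",Row 1 Field 3\n'
print(data.split('\n'))
# ['Row 1 Field 1,"Row 1', 'Field 2",Row 1 Field 3', '']
# The quoted field gets torn across two "lines."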

And this is before we even get to questions like "What do we do with double quotes as part of field data?" (the quasi-official and visually painful answer: double 'em up), "Do we let people use single quotes?", "Do we let people use other separators like semicolons, tabs, vertical bars?", and "How do we produce CSV files that all versions of Excel will recognize and parse correctly by default?"
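In practice, this is the point where you give up and reach for a library. Python's csv module, for one, knows the double-'em-up rule:

import csv, io

sample = 'Ada,"She said ""hi"", then left"\n'
for row in csv.reader(io.StringIO(sample)):
    print(row)
# ['Ada', 'She said "hi", then left']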

How could this have been done differently? The funny thing about this problem is that there's a scheme for solving it that's at least as old as strings in the C programming language. What do you do when a token that would signify the end of a piece of data needs to be represented as part of the data? You designate a single token as an "escape" character, one that signifies that the character following it is literal data.

In C:

"A backslash signifies an \"escape\" sequence."

In an alternate universe where CSV was done this way, here's how commas as field data would be handled:

Row 1\, Field 1,Row 1\, Field 2,Row 1\, Field 3

How do we handle an end-of-line as data?

Row 2\, Field 1,Row 2\
Field 2,Row 2\, Field 3

Quotes aren't even an issue -- they're just data. But what about backslashes? Well, since backslashes signify a literal, a backslash following a backslash would be a literal backslash:

C:\\Windows,D:\\Windows\\System,E:\\Network\\MarbledHall

This system isn't perfect -- you still can't do conventional line-reads (because a line feed can still be literal), so you have to either read character-by-character (I/O-intensive and slow), or pull in a buffer of predetermined size and walk through it one character at a time. But the parsing moves back towards dead simple. A character in the stream is either a separator token, an escape signifying a literal, or a literal, and there's no need to keep track of state. You can actually use regular expressions easily again... split rows on linefeeds not preceded by a backslash, split columns on commas not preceded by a backslash.
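A sketch of that regex version in Python, with one caveat flagged in the comments:

import re

data = 'Row 1\\, Field 1,Row 1 Field 2\nRow 2 Field 1,Row 2 Field 2'
rows = re.split(r'(?<!\\)\n', data)                   # linefeeds not preceded by a backslash
table = [re.split(r'(?<!\\),', row) for row in rows]  # commas not preceded by a backslash
print(table)
# [['Row 1\\, Field 1', 'Row 1 Field 2'], ['Row 2 Field 1', 'Row 2 Field 2']]
# Fields still carry their escape sequences; decoding them is the last step.
# Caveat: the lookbehind can't tell an escaping backslash from an escaped
# one, so a field ending in a literal backslash (a '\\' pair right before
# a separator) would fool it. A full parser should consume escape pairs
# one at a time, as in the sketch further down.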

And you can fix the line-reading problem with a small tweak: represent line feed data with good ol' "\n." To revisit our example above:

Row 2\, Field 1,Row 2\nField 2,Row 2\, Field 3
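With that tweak, writing this dialect is just three substitutions. A Python sketch (order matters -- escape the backslash first, or you'll re-escape your own escapes):

def escape(field):
    field = field.replace('\\', '\\\\')   # literal backslashes first
    field = field.replace(',', '\\,')     # then the field separator
    field = field.replace('\n', '\\n')    # then line feeds in the data
    return field

print(','.join(escape(f) for f in ['Row 2, Field 1', 'Row 2\nField 2']))
# Row 2\, Field 1,Row 2\nField 2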

This scheme makes parsing CSV really as simple as it seems it should be. Every end-of-line is a record separator. Every comma not preceded by a backslash is a field separator. Read each line. Split on commas not preceded by a backslash. Process each field for escape sequences. You're done.
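Or, sketched as one complete minimal parser in Python -- a single pass, no lookbehind, and it handles the escaped-backslash edge case the regex above trips on:

def parse(data):
    records, record, field = [], [], []
    chars = iter(data)
    for ch in chars:
        if ch == '\\':
            nxt = next(chars, '')          # escaped char is literal data
            field.append('\n' if nxt == 'n' else nxt)
        elif ch == ',':                    # field separator
            record.append(''.join(field))
            field = []
        elif ch == '\n':                   # record separator
            record.append(''.join(field))
            field = []
            records.append(record)
            record = []
        else:
            field.append(ch)
    if field or record:                    # flush a final, unterminated record
        record.append(''.join(field))
        records.append(record)
    return records

print(parse('Row 2\\, Field 1,Row 2\\nField 2,Row 2\\, Field 3'))
# [['Row 2, Field 1', 'Row 2\nField 2', 'Row 2, Field 3']]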

Anyone think we can get the whole software industry on board?

(I know: as nice as this might be, it's unlikely that major software manufacturers will ever re-implement this aspect of their import/export features. Even where they have revisited it, XML-based solutions seem to be the flavor of the last decade, and there are libraries for dealing with this... except when you're dealing with badly formed CSV, which tends to happen when what you have is a pseudo-standard.

But I think there's a larger lesson here. If you're ever devising a data encoding format, and you have to deal with the problem of encoding syntactically significant tokens as data, and you have a choice between using an escape sequence and a balanced quoting sequence that signifies a regional change in semantics... use the escape sequence.)
