Wed, 28 October 2009

How to parse CSV files with regex, allowing for commas in quotes

This is a common enough task; here is one way of doing it using .NET regular expressions.

Take the following test data:
Hi,"How, are you","I'm good thanks, you?"
Hi again, How, are you now,"I'm still good thanks"

You can see that there is text both inside and outside double quotes, and that there are commas inside quotes. All of these situations need to be handled.

Here's the regex:
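("(?<target>[^"]*)"|(?<target>[^",\r\n]+))(,\s*|(?<line>\r?\n|$))

Broken down:

"(?<target>[^"]*)" matches any quoted item and puts the text between the quotes in the named group 'target'
(?<target>[^",\r\n]+) matches unquoted items, also into 'target'; excluding \r and \n stops an unquoted field at the end of a line from running on into the next line
,\s* matches the comma between fields, plus any whitespace after it
(?<line>\r?\n|$) matches the end of a line or the end of the input, and records it in the named group 'line'

Note that empty unquoted fields (two commas in a row) and doubled quotes ("") inside quoted fields are not handled by this pattern.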

Using the regex tester you can see that each field is captured in the 'target' group, and that the end of a line or of the file is indicated by a successful match of the 'line' group.
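The same check can be made in code instead of a regex tester. This is only a minimal sketch: the class name CsvRegexDemo and the method Describe are made up for illustration, and the pattern is the one from this post with \r and \n excluded from unquoted fields.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvRegexDemo
{
    // The pattern from this post; unquoted fields exclude \r and \n
    // (an assumption of this sketch) so a field cannot span two lines.
    private static readonly Regex CsvRegex = new Regex(
        @"(""(?<target>[^""]*)""|(?<target>[^"",\r\n]+))(,\s*|(?<line>\r?\n|$))");

    // Returns one entry per match: the field text, plus a marker when
    // the 'line' group matched, i.e. when the field closed a row.
    public static List<string> Describe(string csvText)
    {
        var result = new List<string>();
        foreach (Match match in CsvRegex.Matches(csvText))
        {
            var endsRow = match.Groups["line"].Success;
            result.Add(match.Groups["target"].Value + (endsRow ? " <end of row>" : ""));
        }
        return result;
    }
}
```

For the two lines of test data above, Describe returns seven entries. Only "I'm good thanks, you?" and "I'm still good thanks" carry the end-of-row marker, and the spaces after the commas in the second line are swallowed by ,\s* rather than ending up in the fields.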

Now to wrap this up in a string extension method:
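using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Extension methods must live in a static class; the class name is arbitrary.
public static class CsvExtensions {
    public static string[][] ParseCsv(this string csvText) {
        // Unquoted fields exclude \r and \n so they cannot run past the end of a line
        var csvRegex = new Regex(
            @"(""(?<target>[^""]*)""|(?<target>[^"",\r\n]+))(,\s*|(?<line>\r?\n|$))");
        var lines = new List<string[]>();
        var line = new List<string>();

        foreach (var match in csvRegex.Matches(csvText).Cast<Match>()) {
            line.Add(match.Groups["target"].Value);

            // no 'line' group: a comma followed, so more fields are to come
            if (!match.Groups["line"].Success) continue;

            // end of line or file found
            lines.Add(line.ToArray());
            if (match.Groups["line"].Length > 0) {
                // end of line: start collecting the next one
                line = new List<string>();
            }
        }

        return lines.ToArray();
    }
}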

Something to note: I suspect this will not be very efficient for large amounts of data, as it takes the whole text as a single string. For a large file I'd use a stream as the input, and therefore a different strategy from regular expressions.
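As a rough idea of what a stream-based strategy might look like, here is a sketch of a character-by-character reader over a TextReader. It is not the approach used above and has not been tuned for performance; the names CsvStreamSketch and ReadCsv are hypothetical. It assumes \n or \r\n line endings, treats "" inside a quoted field as an escaped quote, and skips spaces and tabs at the start of an unquoted field to mimic the ,\s* part of the regex.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CsvStreamSketch
{
    // Hypothetical helper: yields rows lazily from any TextReader, so the
    // whole file never has to be held in memory as one string.
    public static IEnumerable<string[]> ReadCsv(TextReader reader)
    {
        var row = new List<string>();
        var field = new StringBuilder();
        var inQuotes = false;
        var sawAnything = false; // true once the current row has any content
        int next;

        while ((next = reader.Read()) != -1)
        {
            var c = (char)next;
            if (inQuotes)
            {
                if (c != '"') field.Append(c);
                else if (reader.Peek() == '"') { reader.Read(); field.Append('"'); } // "" is an escaped quote
                else inQuotes = false; // closing quote
            }
            else if (c == '"')
            {
                inQuotes = true;
                sawAnything = true;
            }
            else if (c == ',')
            {
                // a comma outside quotes ends the current field
                row.Add(field.ToString());
                field.Clear();
                sawAnything = true;
            }
            else if (c == '\n')
            {
                // end of line: emit the row unless the line was blank
                if (sawAnything)
                {
                    row.Add(field.ToString());
                    yield return row.ToArray();
                }
                row.Clear();
                field.Clear();
                sawAnything = false;
            }
            else if ((c == ' ' || c == '\t') && field.Length == 0)
            {
                // skip leading whitespace, roughly what ,\s* does in the regex
            }
            else if (c != '\r')
            {
                field.Append(c);
                sawAnything = true;
            }
        }

        // the last row may not be terminated by a newline
        if (sawAnything)
        {
            row.Add(field.ToString());
            yield return row.ToArray();
        }
    }
}
```

It could be called with something like CsvStreamSketch.ReadCsv(File.OpenText(path)), where path is whatever file you want to read, consuming the rows in a foreach loop. Unlike the regex version, this sketch keeps empty fields between consecutive commas.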


Development, Regex, Parsing
