Wed, 28 October 2009
Parsing CSV text is a common enough task; this is one way of doing it using .NET regular expressions.
Hi,"How, are you","I'm good thanks, you?"
Hi again, How, are you now,"I'm still good thanks"
You can see that there is text both inside and outside double quotes, and that there are commas inside quotes. All of these situations need to be handled.
("(?<target>[^"]*)"|(?<target>[^",]+))(,\s*|(?<line>\r?\n|$))
"(?<target>[^"]*)" matches any quoted items and puts the result in the named group 'target'
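(?<target>[^",]+) matches non-quoted items. Note that this character class also matches \r and \n, so an unquoted item at the end of a line can swallow the line break and run on into the next line; excluding them as well ([^",\r\n]+) avoids that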
,\s* matches commas
\r?\n|$ matches end of lines and end of files
Using a regex tester, you can see the required results in the 'target' group. The end of a line or the end of the file is indicated by a match in the 'line' group (an empty match in the case of the end of the file).
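If no regex tester is to hand, the same check can be sketched in a few lines of C#. This is only an illustration: the class name `CsvRegexDemo` and the helper `Describe` are made up for this example and are not part of the original code. It runs the pattern exactly as given above and reports what each match captured.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvRegexDemo
{
    // The pattern from the post as a C# verbatim string ("" is an escaped quote).
    private static readonly Regex CsvRegex = new Regex(
        @"(""(?<target>[^""]*)""|(?<target>[^"",]+))(,\s*|(?<line>\r?\n|$))");

    // Hypothetical helper: returns one description per match, showing the
    // 'target' capture and whether the 'line' group took part in the match.
    public static List<string> Describe(string csvText)
    {
        var result = new List<string>();
        foreach (Match match in CsvRegex.Matches(csvText))
        {
            var line = match.Groups["line"];
            var marker = !line.Success ? ""          // followed by a comma
                       : line.Length > 0 ? " <end of line>"
                       : " <end of file>";           // empty match from $
            result.Add("[" + match.Groups["target"].Value + "]" + marker);
        }
        return result;
    }
}
```

Calling `CsvRegexDemo.Describe` on the two sample lines joined by "\n" should give seven entries, with `[I'm good thanks, you?] <end of line>` as the third and `[I'm still good thanks] <end of file>` as the last.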
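using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Extension methods must live in a static class; the class name is arbitrary.
public static class CsvExtensions {
    public static string[][] ParseCsv(this string csvText) {
        // Unlike the pattern shown above, the unquoted alternative also excludes
        // \r and \n, so an unquoted field cannot swallow a line break.
        var csvRegex = new Regex(
            @"(""(?<target>[^""]*)""|(?<target>[^"",\r\n]+))(,\s*|(?<line>\r?\n|$))");
        var lines = new List<string[]>();
        var line = new List<string>();
        foreach (var match in csvRegex.Matches(csvText).Cast<Match>()) {
            line.Add(match.Groups["target"].Value);
            if (!match.Groups["line"].Success) continue;
            // end of line or file found
            lines.Add(line.ToArray());
            if (match.Groups["line"].Length > 0) {
                // end of line: start collecting the next row
                line = new List<string>();
            }
        }
        return lines.ToArray();
    }
}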
Something to note: I suspect this will not be very efficient for large amounts of data, as it takes a string as its input. For a large file I'd use a stream as the input, and therefore a different strategy from regular expressions.
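As a rough idea of what such a stream-based strategy might look like, here is a minimal sketch of a character-by-character reader over a `TextReader`. It is not the original author's code: the names `StreamingCsv` and `ReadRows` are hypothetical, and it makes a few assumptions of its own. It keeps empty fields, treats a doubled quote inside a quoted field as an escaped quote, and skips whitespace at the start of an unquoted field to mimic the `,\s*` part of the regex.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class StreamingCsv
{
    // Hypothetical helper: lazily yields one row at a time, so the whole
    // file never has to be held in memory as a single string.
    public static IEnumerable<string[]> ReadRows(TextReader reader)
    {
        var field = new StringBuilder();
        var row = new List<string>();
        var inQuotes = false;
        var rowHasContent = false;
        int c;
        while ((c = reader.Read()) != -1)
        {
            var ch = (char)c;
            if (inQuotes)
            {
                if (ch != '"') { field.Append(ch); continue; }
                // Assumption: "" inside a quoted field is an escaped quote.
                if (reader.Peek() == '"') { reader.Read(); field.Append('"'); }
                else inQuotes = false;
                continue;
            }
            switch (ch)
            {
                case '"':
                    inQuotes = true; rowHasContent = true; break;
                case ',':
                    row.Add(field.ToString()); field.Clear();
                    rowHasContent = true; break;
                case '\r':
                    break; // ignored; '\n' is what ends the row
                case '\n':
                    if (rowHasContent) { row.Add(field.ToString()); yield return row.ToArray(); }
                    field.Clear(); row.Clear(); rowHasContent = false;
                    break;
                default:
                    // Skip whitespace at the start of an unquoted field.
                    if (field.Length == 0 && char.IsWhiteSpace(ch)) break;
                    field.Append(ch); rowHasContent = true; break;
            }
        }
        // Last row when the input does not end with a newline.
        if (rowHasContent) { row.Add(field.ToString()); yield return row.ToArray(); }
    }
}
```

With a file, this could be driven by something like `foreach (var row in StreamingCsv.ReadRows(File.OpenText(path)))`, where `path` is whatever file you are reading. Because rows are yielded lazily, memory use depends on the longest row rather than on the size of the file.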