Regex splitting on newline outside of quotes
I want to split a stream of data on new lines that are NOT within double quotes. The stream contains rows of data, where each row is separated by a newline. However, the rows of data can potentially contain newlines within double quotes. These newlines do not signify that the next row of data has started, so I want to ignore them.
So the data might look something like this:
Row 1: bla bla, 12345, ...
Row 2: "bla
bla", 12345, ...
Row 3: bla bla, 12345, ...
I tried using a regex from a similar post about splitting on commas that are not within double quotes (Splitting on comma outside quotes), replacing the comma with the newline character:
\n(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
This regex doesn't match where I'd expect it to though. Am I missing something?
Solution 1:[1]
Here are two ways of doing that.
#1
Rather than splitting, you can match each row with the regular expression
[^"\r\n]+(?:"[^"]*"[^"\r\n]+)*
The expression can be broken down as follows.
[^"\r\n]+ # match one or more characters other than those in the
# character class
(?: # begin non-capture group
"[^"]*" # match double-quote followed by zero or more characters
# other than a double-quote, followed by a double-quote
[^"\r\n]+ # match one or more characters other than those in the
# character class
)* # end non-capture group and execute it zero or more times
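As a minimal sketch of approach #1 in JavaScript, the expression can be applied with the global flag so that every row is returned. The sample string and the names `sample`, `rowPattern` and `rows` are assumptions made up for illustration, modeled on the rows in the question.

```javascript
// Hypothetical sample data: row 2 contains a newline inside double quotes.
const sample =
  'Row 1: bla bla, 12345\nRow 2: "bla\nbla", 12345\nRow 3: bla bla, 12345';

// Same expression as above; the g flag makes match() return all rows.
const rowPattern = /[^"\r\n]+(?:"[^"]*"[^"\r\n]+)*/g;

// match() returns null when nothing matches, so fall back to an empty array.
const rows = sample.match(rowPattern) ?? [];

console.log(rows.length); // 3
console.log(rows[1]);     // the whole of row 2, embedded newline included
```

Note that, as written, the group requires at least one character after each closing quote, so a row that ends with a quoted field (for example `a,"b"`) would not be matched as a single row.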
#2
Matching line terminators that are not between double-quotes is equivalent to matching line terminators that are preceded, from the beginning of the string, by an even number of double quotes. You can match such line terminators with the following regular expression (with the multi-line flag not set, so that ^ matches the beginning of the string, not the beginning of a line).
/(?<=^[^"]*(?:"[^"]*"[^"]*)*)\r?\n/
JavaScript's regex engine (which impressively supports variable-length lookbehinds) performs the following operations.
(?<= : begin positive lookbehind
^ : match beginning of string (not line)
[^"]* : match 0+ chars other than '"'
(?: : begin non-capture group
"[^"]*" : match '"', 0+ chars other than '"', '"'
[^"]* : match 0+ chars other than '"'
)* : end non-capture group and execute 0+ times
) : end positive lookbehind
\r?\n : match line terminator
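The lookbehind expression can be passed directly to `split`. The following is a sketch under the assumption that the whole input is available as one string in an engine with lookbehind support (such as a recent Node.js); the sample data and the names `input`, `terminatorOutsideQuotes` and `lines` are hypothetical.

```javascript
// Hypothetical sample data mixing \r\n and \n terminators; the terminator
// inside the quoted field of row 2 must not split the row.
const input =
  'Row 1: bla bla, 12345\r\nRow 2: "bla\r\nbla", 12345\nRow 3: bla bla, 12345';

// No m flag, so ^ anchors at the start of the string. The lookbehind only
// succeeds when an even number of double quotes precedes the terminator.
const terminatorOutsideQuotes = /(?<=^[^"]*(?:"[^"]*"[^"]*)*)\r?\n/;

const lines = input.split(terminatorOutsideQuotes);

console.log(lines.length); // 3
console.log(lines[1]);     // row 2, still containing its embedded \r\n
```

Because the lookbehind re-examines the text from the start of the string at every candidate terminator, this approach may become slow on very large inputs.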
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow