Clean up a comma-separated list by regex
I want to clean up a comma-separated tag list by removing empty tags and extra spaces. I came up with
$str='first , second ,, third, ,fourth suffix';
echo preg_replace('#[,]{2,}#',',',preg_replace('#\s*,+\s*#',',',preg_replace('#\s+#s',' ',$str)));
which works well so far, but is it possible to do it in one replacement?
Solution 1:[1]
You can use
preg_replace('~\s*(?:(,)\s*)+|(\s)+~', '$1$2', $str)
Merging the two alternatives into one results in
preg_replace('~\s*(?:([,\s])\s*)+~', '$1', $str)
See the regex demo and the PHP demo. Details:
\s*(?:(,)\s*)+ - zero or more whitespaces, then one or more occurrences of a comma (captured into Group 1, $1) followed by zero or more whitespaces
| - or
(\s)+ - one or more whitespaces, capturing the last one into Group 2 ($2).
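In the second regex, ([,\s]) captures a single comma or a whitespace character.
The second regex matches:
\s* - zero or more whitespaces
(?:([,\s])\s*)+ - one or more occurrences of
    ([,\s]) - Group 1 ($1): a comma or a whitespace
    \s* - zero or more whitespaces
Because each \s* greedily consumes the whitespace around it, Group 1 ends up holding a comma whenever the matched run contains one, and a single whitespace character otherwise.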
See the PHP demo:
<?php
$str='first , second ,, third, ,fourth suffix';
echo preg_replace('~\s*(?:(,)\s*)+|(\s)+~', '$1$2', $str) . PHP_EOL;
echo preg_replace('~\s*(?:([,\s])\s*)+~', '$1', $str);
// => first,second,third,fourth suffix
// first,second,third,fourth suffix
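To see how the merged pattern behaves beyond the sample string, here is a minimal sketch. The helper name cleanTags and the extra inputs are illustrative assumptions, not part of the answer. The pattern collapses a leading or trailing run of separators to a single comma rather than removing it, so the sketch assumes a final trim() is wanted for those edge cases.

```php
<?php
// Hypothetical helper (the name is illustrative): collapse runs of commas and
// whitespace with the merged pattern, then strip separators left at the edges.
function cleanTags(string $str): string
{
    $collapsed = (string) preg_replace('~\s*(?:([,\s])\s*)+~', '$1', $str);
    // trim() with a character list removes leading/trailing commas and whitespace.
    return trim($collapsed, ", \t\n\r");
}

$samples = [
    'first , second ,, third, ,fourth suffix',
    " ,first,, second\t,\n third , ",   // made-up input with edge separators
    'one   two',                         // made-up input without commas
];

foreach ($samples as $sample) {
    echo cleanTags($sample), PHP_EOL;
}
// first,second,third,fourth suffix
// first,second,third
// one two
```

Without the trim() step, the second sample would come out as ,first,second,third, with one comma kept at each end.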
BONUS
This solution is portable to all NFA regex flavors; here is a JavaScript demo:
const str = 'first , second ,, third, ,fourth suffix';
console.log(str.replace(/\s*(?:(,)\s*)+|(\s)+/g, '$1$2'));
console.log(str.replace(/\s*(?:([,\s])\s*)+/g, '$1'));
It can even be adjusted for use in POSIX tools like sed:
sed -E 's/[[:space:]]*(([,[:space:]])[[:space:]]*)+/\2/g' file > outputfile
See the online demo.
Solution 2:[2]
You can use:
\h*([,\h])[,\h]*
See an online demo. Or alternatively:
\h*([,\h])(?1)*
See an online demo.
\h* - 0+ (greedy) horizontal-whitespace chars;
([,\h]) - a 1st capture group to match a comma or a horizontal-whitespace char;
[,\h]* - Option 1: 0+ (greedy) commas or horizontal-whitespace chars;
(?1)* - Option 2: recurse the 1st subpattern 0+ (greedy) times.
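The distinction in Option 2 is easy to miss: (?1) re-runs the group's pattern, whereas a backreference such as \1 would only repeat the exact character that was captured. A minimal sketch on a made-up input; the \1 variant is shown purely for contrast and is not part of this answer:

```php
<?php
// Illustrative input (an assumption, not from the question).
$str = 'a , b';

// (?1) reuses the pattern [,\h], so the space after the comma is consumed too.
$recursed = preg_replace('~\h*([,\h])(?1)*~', '$1', $str);

// \1 repeats only the captured text (the comma), so the trailing space is
// left for a separate match and survives as a single space.
$backref = preg_replace('~\h*([,\h])\1*~', '$1', $str);

echo $recursed, PHP_EOL; // a,b
echo $backref, PHP_EOL;  // a, b
```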
Replace with the 1st capture group:
$str='first , second ,, third, ,fourth suffix';
echo preg_replace('~\h*([,\h])[,\h]*~', '$1', $str) . PHP_EOL;
echo preg_replace('~\h*([,\h])(?1)*~', '$1', $str);
Both print:
first,second,third,fourth suffix
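One practical difference from the \s-based patterns in Solution 1: \h matches only horizontal whitespace (such as spaces and tabs), so line breaks are left untouched. A minimal sketch with a made-up multi-line input; json_encode is used here only to make the newline visible in the output:

```php
<?php
// Illustrative input (an assumption, not from the question): a line break follows the comma.
$str = "first ,\n second";

// \s also matches the newline, so the whole run collapses to a single comma.
$withS = preg_replace('~\s*(?:([,\s])\s*)+~', '$1', $str);

// \h stops at the newline, so the line break and one following space survive.
$withH = preg_replace('~\h*([,\h])[,\h]*~', '$1', $str);

echo json_encode($withS), PHP_EOL; // "first,second"
echo json_encode($withH), PHP_EOL; // "first,\n second"
```

Which behavior is preferable depends on whether the tag list can span several lines.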
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 |