Perl: splitting on multiple occurrences of the same pattern
I wrote the following Perl script to split on multiple occurrences of the same pattern.
The pattern is any text enclosed in parentheses: (some text)
This is what I've tried:
    foreach my $line (@input) {
        if ($line =~ /(\(.*\))+/g) {
            my @splitted = split(/(\(.*\))/, $line);
            foreach my $data (@splitted) {
                print $data, "\n";
            }
        }
    }
For the given input text:
Non-rapid eye movement sleep (NREMS).
Cytokines such as interleukin-1 (IL-1), tumor necrosis factor, acidic fibroblast growth factor (FGF), and interferon-alpha (IFN-alpha).
I'm getting the following output:
Non-rapid eye movement sleep
(NREMS).
Cytokines such as interleukin-1
(IL-1), tumor necrosis factor, acidic fibroblast growth factor (FGF), and interferon-alpha (IFN-alpha).
The code doesn't split the text on the second and third occurrences of the pattern in line 2 of the text. I can't figure out what I'm doing wrong.
Solution 1:[1]
Split on this instead:
(\([^(]*\))
Your regex is greedy. Alternatively, make it non-greedy: (\(.*?\))
See demo: https://regex101.com/r/dU7oN5/14
The problem with your regex can be seen here: https://regex101.com/r/dU7oN5/15
Your regex matches ( and then greedily looks for the last ) on the line rather than the first ) it encounters. So on the second input line, everything from the first ( to the last ) is captured as a single match.
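To make the difference concrete, here is a minimal sketch that runs the three patterns discussed above against the second input line and compares how many fields split returns. The variable names are chosen for illustration only.

```perl
use strict;
use warnings;

my $line = 'Cytokines such as interleukin-1 (IL-1), tumor necrosis factor, '
         . 'acidic fibroblast growth factor (FGF), and interferon-alpha (IFN-alpha).';

# Greedy: .* runs to the end of the line, then backtracks to the LAST ")".
my @greedy = split /(\(.*\))/, $line;

# Non-greedy: .*? stops at the FIRST ")" after each "(".
my @lazy = split /(\(.*?\))/, $line;

# Negated character class: a match can never extend past the next "(".
my @negated = split /(\([^(]*\))/, $line;

print scalar(@greedy),  " fields (greedy)\n";            # 3 fields (greedy)
print scalar(@lazy),    " fields (non-greedy)\n";        # 7 fields (non-greedy)
print scalar(@negated), " fields (negated class)\n";     # 7 fields (negated class)
print "[$_]\n" for @lazy;
```

The brackets in the last print make the trailing spaces in the text fields visible. For this input the non-greedy pattern and the negated character class produce the same fields.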
Solution 2:[2]
You haven't described your purpose, but I suggest that you use a regular expression match instead of split. Note, though, that you appear to be processing free-form text, which will never work properly in the general case.
This program finds all of the text, together with the bracketed abbreviations, in the input data.
    use strict;
    use warnings;

    while (<DATA>) {
        while ( / ( [^()]* ) \( ( [^()]* ) \) /xg ) {
            my ($defn, $abbr) = ($1, $2);
            print "$defn\n";
            print "-- $abbr\n\n";
        }
    }

    __DATA__
    Non-rapid eye movement sleep (NREMS).
    Cytokines such as interleukin-1 (IL-1), tumor necrosis factor, acidic fibroblast growth factor (FGF), and interferon-alpha (IFN-alpha).
Output:
    Non-rapid eye movement sleep
    -- NREMS

    Cytokines such as interleukin-1
    -- IL-1

    , tumor necrosis factor, acidic fibroblast growth factor
    -- FGF

    , and interferon-alpha
    -- IFN-alpha
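The text fields in that output still carry the leading ", " and trailing space from the original sentence. As a rough sketch, and not part of the answer's program, the same match loop could trim each field and collect the pairs into an array; the @pairs name and the trimming rules are assumptions made for illustration.

```perl
use strict;
use warnings;

my $line = 'Cytokines such as interleukin-1 (IL-1), tumor necrosis factor, '
         . 'acidic fibroblast growth factor (FGF), and interferon-alpha (IFN-alpha).';

my @pairs;    # hypothetical container: [ text, abbreviation ] per match
while ( $line =~ / ( [^()]* ) \( ( [^()]* ) \) /xg ) {
    my ($defn, $abbr) = ($1, $2);
    $defn =~ s/^[\s,]+//;    # drop leading commas and whitespace
    $defn =~ s/\s+$//;       # drop trailing whitespace
    push @pairs, [ $defn, $abbr ];
}

print "$_->[0] => $_->[1]\n" for @pairs;
# Cytokines such as interleukin-1 => IL-1
# tumor necrosis factor, acidic fibroblast growth factor => FGF
# and interferon-alpha => IFN-alpha
```

Note that the second and third text fields still contain more than the term being abbreviated ("tumor necrosis factor, ..." and "and ..."), which illustrates the caveat above about free-form text.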
Solution 3:[3]
Have a try with:
    foreach my $line (@input) {
        if ($line =~ /\(.*?\)/) {    # modifier g can be removed here
            my @splitted = split(/(\(.+?\))/, $line);    # make the match non greedy
            foreach my $data (@splitted) {
                print $data, "\n";
            }
        }
    }
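Two details of this version may be worth seeing in isolation: the capturing parentheses in the split pattern are what keep the "(...)" delimiters in the result, and a line without any parentheses comes back from split as a single field, so the if guard only decides whether such lines are printed at all. The sketch below uses illustrative variable names and a made-up third test string.

```perl
use strict;
use warnings;

my $line = 'Non-rapid eye movement sleep (NREMS).';

# With a capturing group, the matched delimiters are returned as fields too.
my @with = split /(\(.+?\))/, $line;

# Without the group, the delimiters are discarded.
my @without = split /\(.+?\)/, $line;

# No match at all: split returns the whole string as one field.
my @plain = split /(\(.+?\))/, 'No abbreviations here.';

print scalar(@with),    "\n";    # 3
print scalar(@without), "\n";    # 2
print scalar(@plain),   "\n";    # 1
```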
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | U. Windl |
Solution 2 | Borodin |
Solution 3 | U. Windl |