How do I remove the special characters within multiple lines in Regex?

I'm trying to solve a problem that asks me to display the text from a file with the special characters omitted and the multi-line input collapsed into a single line of output, using only Perl/Regex (no other languages like XML, etc.). Here's the given text in my flight.txt file:

<start> 
<flight number="12345">
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>
</flight>
</start>

The required output is:

Holland, T. "Aeronautics Engineer" 200 06/09/1969 Flight from DC to VA.

As you can see, I need the output on a single line. The first name should be reduced to its first initial, the major should be wrapped in double quotes, and the date separator should change from - to /.

This is what I have in my code so far:

#!/bin/perl
use strict;
use warnings;
my $filename = "flights.txt";
open(my $input, '<:encoding(UTF-8)', $filename)
        or die "Could not open file '$filename' $!";
while (my $row = <$input>){
my $text = <>;
$text =~ s/<[^>]*>//g;
print $text;
}
close $input

Please suggest what I should do next and how to format the output of the given file. I'm new to Regex and Perl, so I need help.



Solution 1:[1]

Here's your homework problem, as you noted in a comment to ikegami's answer:

Create the Perl script “code.pl” that print lines that contain an opening and closing XML heading tag from “flights.txt”. Valid tags are pilot, major, company, price, date, and details regardless of the case. Tags may also have any arbitrary content inside of them. You may assume that a '<' or a '>' character will not appear inside of the attribute's value portion

Let's forget that your input is XML, for all the reasons that ikegami has already explained. The entire thing is a contrived example to get you to practice some particular regex feature. I'll go through a process of solving this problem, but also reveal later what I think the instructor expects.

First, you only need to think about one line at a time, so you don't care about nodes where the opening and closing tags are on separate lines, such as <start> and </start>, or <flight> and </flight>. You want to find lines such as:

<node>...</node>

The pattern is that there is some string you match near the start of the line, and that match has to show up later in the line. I think your intended task is to practice backreferences. Writing good exercises is tough, and people fall back on things, such as XML, that are familiar. My Learning Perl Exercises is more thoughtful about this.

Your basic program needs to look something like this first attempt. Read lines of input, skip the ones that don't match your pattern, and output the rest. Whenever you see ... in this answer, that's just something I need to fill in and is not Perl syntax (ignoring the yada operator, which cannot appear in a regex):

use strict;
use warnings;
while( <> ) {
    next unless m/ ... /;
    print;
    }

I'll mostly ignore that program structure and focus on the match operator, m//. I'll update the pattern as I step through this.

The trick, then, is what goes in the pattern. You know you have to match something that looks like an XML open tag (again, ignoring that this is XML because it's not a good example for input). That starts with < and ends with > with some stuff in the middle. This pattern uses the /x flag to make whitespace insignificant, so I can spread out the pattern and grok it more easily:

m/ < ... > /x;

So what can go inside the angle brackets? In the input, which I'm pretending isn't XML, the stuff inside the angles follows these rules, which you could read about in the XML standard if this were XML:

  • case-sensitive
  • starts with a letter or underscore
  • can contain letters, digits, hyphens, underscores, and periods
  • cannot start with xml in any case

Let's ignore that last one for a moment because I don't think it's part of the simple exercise you need to do. And the rules are actually slightly more complicated.

Case sensitive is easy. We aren't going to use the /i flag on the match operator, so we get that for free.

Starts with a letter or underscore. That's pretty easy. Since I'm pretending this is not XML, I'm not going to support all the Unicode scripts that current XML will allow. I'll restrict that to ASCII, and use a character class to represent all the characters that I'll allow right after the <:

m/ < [a-zA-Z_] ... > /x;

After that, I can have letters and underscores, but now also have hyphens, digits, and periods. As an aside, many such things have a set of characters for the start of an "identifier" (ID_Start) and a wider set for the rest (ID_Continue). Perl has similar rules for its variable names.

I use a second character class for the continuation. There's a slight gotcha here because you want a literal hyphen, but that also forms a range in the character class. That is, it forms a range unless it's at the end. The . in a character class is literal .:

m/ < [a-zA-Z_] [a-zA-Z_0-9.-]+ > /x;

With this pattern, you get much more than you wanted. The output is every line that has a start tag. Note that it does not match <flight number="12345"> because this pattern doesn't handle attributes, which is fine because I'm pretending this isn't XML:

<start>
<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>

The end tag has the same name as the start tag. In our input, there's one start tag and one end tag per line, and since I look at one line at a time, I can ignore many things that an XML parser has to care about. Now I spread my pattern over several lines because /x allows me to do that, and /x also allows me to add comments so I remember what each part of the pattern does. The / in the end tag is also the match operator delimiter, so I escape that as \/:

m/ 
    < [a-zA-Z_] [a-zA-Z_0-9.-]+ >  # start tag
    ...                            #   the interesting text
    < \/ ... >                     # end tag
/x;

I need to fill in the ... parts. The "interesting text" part is easy. I'll match anything. The .* greedily matches zero or more non-newline characters:

m/ 
    < [a-zA-Z_] [a-zA-Z_0-9.-]+ >  # start tag
    .*                             #   the interesting text, greedily
    < \/ ... >                     # end tag
/x;

But, I don't really want the .* to be greedy. I don't want it to match the end tag, so I can add the non-greedy modifier ? to the .*:

m/ 
    < [a-zA-Z_] [a-zA-Z_0-9.-]+ >  # start tag
    .*?                            #   the interesting text, non-greedily
    < \/ ... >                     # end tag
/x;
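The difference between the two quantifiers only shows up when a line holds more than one end tag. As a small sketch, using a made-up line with two <b> elements (it is not part of flights.txt), this compares what each one captures:

```perl
use strict;
use warnings;

# Hypothetical input: two elements on one line (not part of flights.txt).
my $line = '<b>one</b> and <b>two</b>';

# Greedy: .* runs to the end of the line, then backs off to the LAST </b>.
my ($greedy) = $line =~ m/ <b> (.*)  < \/ b> /x;

# Non-greedy: .*? stops at the FIRST </b> that lets the match succeed.
my ($lazy)   = $line =~ m/ <b> (.*?) < \/ b> /x;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy version swallows the first end tag and the second start tag, while the non-greedy version stops at the first end tag. With one element per line, as in the sample data, both give the same result.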

Now I need to fill in the name portion of the end tag. It has to be the same as the start name. By surrounding the start name in (...), I capture that part of the string that matched. That goes into the capture buffer $1. I can then re-use that exact match within the pattern with a "back reference" (the point of your problem, I'm guessing). A backreference starts with a \ and uses the number of the capture buffer you want to use. So, \1 uses the exact text matched in $1; not the same pattern but the actual text matched:

m/ 
    <                              # start tag
      ([a-zA-Z_] [a-zA-Z_0-9.-]+)  #  $1
    >  
    .*?                            #   the interesting text, non-greedily
    < \/ \1 >                      # end tag
/x;

Now the output excludes <start> because it doesn't have an end tag:

<pilot> Holland, Tom</pilot>
<major>Aeronautics Engineer</major>
<company>Boeing</company>
<price>200</price>
<date>06-09-1969</date>
<details>Flight from DC to VA.</details>

If you modified your data to change </date> to </data>, that line wouldn't match because the start and end tags are different.
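You can check that claim without editing the data file. This is a minimal sketch with two hand-written test lines (the second one has the deliberately wrong end tag); the variable names are only for illustration:

```perl
use strict;
use warnings;

# Same idea as the pattern above: \1 must repeat the exact text of capture 1.
my $pattern = qr/
    <
      ([a-zA-Z_] [a-zA-Z_0-9.-]+)  # capture 1: the tag name
    >
    .*?
    < \/ \1 >
/x;

# Hypothetical test lines; the second has a mismatched end tag.
my @lines = (
    '<date>06-09-1969</date>',
    '<date>06-09-1969</data>',
);

my @results = map { $_ =~ $pattern ? 'match' : 'no match' } @lines;

print "$lines[$_]: $results[$_]\n" for 0 .. $#lines;
```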

But, what you really want is the text in the middle, so you need to capture that too. You can add another capture buffer. As the second set of parens, this is the buffer $2, and doesn't disturb $1 or \1:

m/ 
    <                              # start tag
      ([a-zA-Z_] [a-zA-Z_0-9.-]+)  #  $1
    >  
    ( .*? )                        #   $2, the interesting text, non-greedily
    < \/ \1 >                      # end tag
/x;

But now you want to print the interesting text, not the entire line, so I'll print the $2 capture buffer instead of the entire line. Remember, these buffers are only valid after a successful match, but I've skipped the lines where it doesn't match, so I'm fine:

use strict;
use warnings;

while( <DATA> ) {
    next unless m/
        <                              # start tag
          ([a-zA-Z_] [a-zA-Z_0-9.-]+)  #  $1
        >
        (.*?)                          #  $2, the interesting text, non-greedily
        < \/ \1 >                      # end tag
    /x;

    print $2;
    }

print "\n";  # end all the output!

This gets me close. I'm missing some whitespace between elements (and note there is a leading space before Holland):

 Holland, TomAeronautics EngineerBoeing20006-09-1969Flight from DC to VA.

I can add a space at the end of each print:

    print $2, ' ';

Now you have your output:

 Holland, Tom Aeronautics Engineer Boeing 200 06-09-1969 Flight from DC to VA.

What the answer probably is

I'm guessing that the answer you'll see is much simpler. If you ignore all the rules about names and only handle exactly the input from the problem, you can probably get away with this:

m/ <(.*?)> (.*?) < \/ \1 > /x

As an exercise simply to practice back references, that's fine. But, you'll eventually create problems handling real XML like that. Note that $1 could capture all of flight number="12345" because this doesn't exclude whitespace or the other disallowed characters.

Let's go a bit deeper

The pattern I showed was pretty complicated, especially if you are just learning things. I can precompile the pattern and save it in a scalar, then use that scalar inside the match operator:

use strict;
use warnings;

my $pattern = qr/
        <                              # start tag
          ([a-zA-Z_] [a-zA-Z_0-9.-]+)  #  $1
        >
        ( .*? )                        #   the interesting text, non-greedily
        < \/ \1 >                      # end tag
    /x;

while( <DATA> ) {
    next unless m/$pattern/;
    print $2, ' ';
    }

This way, the mechanics of the while loop are distinct from the particulars. The complexity of the pattern doesn't affect my ability to understand the loop.

Now, having done that, I'll get more complicated. So far I used numbered captures and backreferences, but I might mess that up if I add more captures. If there's another capture before the start tag, the start tag capture is no longer $1, which means \1 now refers to the wrong thing. Instead of numbers, I can give them my own labels with the (?<LABEL>...) feature that Perl stole from Python. The back reference to that label is \k<LABEL>:

my $pattern = qr/
        <                              # start tag
          (?<tag>                      # labeled capture
            [a-zA-Z_] [a-zA-Z_0-9.-]+
          )
        >
        ( .*? )                        #   the interesting text, non-greedily
        < \/ \k<tag> >                 # end tag
    /x;

I can even label the "interesting text" portion:

my $pattern = qr/
        <                              # start tag
          (?<tag>
            [a-zA-Z_] [a-zA-Z_0-9.-]+
          )
        >
        (?<text> .*? )                 #   the interesting text, non-greedily
        < \/ \k<tag> >                 # end tag
    /x;

The rest of the program still works because these labels are aliases to the numbered capture variables. However, I don't want to rely on that (hence, the label). The hash %+ has the values in the labeled captures, and the label is the key. The interesting text is in $+{text}:

while( <DATA> ) {
    next unless m/$pattern/;
    print $+{'text'}, ' ';
    }
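Having the tag name in $+{tag} also makes it possible to treat each kind of line differently, which is what the original question asks for. One possible sketch, assuming a hypothetical %format dispatch table keyed on the tag name and the sample lines held in an array rather than read from DATA (it uses the /r modifier, so it needs Perl 5.14 or later):

```perl
use strict;
use warnings;

my $pattern = qr/
    < (?<tag> [a-zA-Z_] [a-zA-Z_0-9.-]+ ) >
    (?<text> .*? )
    < \/ \k<tag> >
/x;

# Hypothetical dispatch table: tag name => code that reformats the text.
# Tags without an entry (such as company) are left out of the output.
my %format = (
    pilot   => sub { $_[0] =~ s/^\s+//r =~ s/,\s+(\w)\w*/, $1./r },  # Holland, T.
    major   => sub { qq("$_[0]") },                                  # add quotes
    price   => sub { $_[0] },
    date    => sub { $_[0] =~ tr{-}{/}r },                           # - becomes a slash
    details => sub { $_[0] },
);

my @lines = (
    '<pilot> Holland, Tom</pilot>',
    '<major>Aeronautics Engineer</major>',
    '<company>Boeing</company>',
    '<price>200</price>',
    '<date>06-09-1969</date>',
    '<details>Flight from DC to VA.</details>',
);

my @fields;
for my $line (@lines) {
    next unless $line =~ $pattern;
    # Copy the captures before calling code that runs its own matches.
    my ( $tag, $text ) = ( $+{tag}, $+{text} );
    my $handler = $format{$tag} or next;
    push @fields, $handler->($text);
}

my $output = join ' ', @fields;
print "$output\n";
```

Joining the fields at the end, instead of printing a space after each one, avoids the stray trailing space.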

The rule I ignored

Now, there was the rule that I ignored. A tag name cannot start with xml in any case. That's tied to an XML feature I'll ignore here. I'll change my input to include an xmlmeal node:

<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<xmlmeal> chicken</xmlmeal>
</flight>
</start>

I match that xmlmeal node because I haven't done anything to follow the rule. I can add a negative lookahead assertion, (?!...) to exclude that. As an assertion (\b and \A are other assertions), the lookahead does not consume text; it merely matches a condition. I use (?!xml) to mean "wherever I am right now, xml cannot be next":

my $pattern = qr/
        <                              # start tag
          (?<tag>
            (?!xml)
            [a-zA-Z_] [a-zA-Z_0-9.-]+
          )
        >
        (?<text> .*? )                 #   the interesting text, non-greedily
        < \/ \k<tag> >                 # end tag
    /x;

That's fine and it won't show " chicken" in the output. But, what if the input tag name was XMLmeal? I've only excluded the lowercase version. I need to exclude much more:

<start>
<flight number="12345">
<pilot> Holland, Tom</pilot>
<XMLmeal>chicken</XMLmeal>
<xmldrink>diet coke</xmldrink>
<Xmlsnack>almonds</Xmlsnack>
</flight>
</start>

I can get fancier. I'm not using the /i flag for case insensitivity because the start and end tags need to match exactly. I can, however, turn on case insensitivity for part of a pattern with (?i), and everything past that will ignore case:

my $pattern = qr/
        <                              # start tag
          (?<tag>
            (?i)                       # ignore case starting here
            (?!xml)
            [a-zA-Z_] [a-zA-Z_0-9.-]+
          )
        >
        (?<text> .*? )                 #   the interesting text, non-greedily
        < \/ \k<tag> >                 # end tag
    /x;

But, inside grouping parentheses, the (?i) is in effect only until the end of that group. I can limit which part of my pattern ignores case. The (?: ... ) groups without capturing (so doesn't disturb what $1 or $2 capture):

(?: (?i) (?!xml) )

Now my pattern excludes those three tags I added:

my $pattern = qr/
        <                              # start tag
          (?<tag>
            (?: (?i) (?!xml) )         # not XmL in any case
            [a-zA-Z_] [a-zA-Z_0-9.-]+
          )
        >
        (?<text> .*? )                 #   the interesting text, non-greedily
        < \/ \k<tag> >                 # end tag
    /x;
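As a quick check, here is a sketch that runs that pattern over a few of the sample lines, held in an array instead of a file, and keeps the text of the ones that match:

```perl
use strict;
use warnings;

my $pattern = qr/
    <
      (?<tag>
        (?: (?i) (?!xml) )         # name must not begin with xml, in any case
        [a-zA-Z_] [a-zA-Z_0-9.-]+
      )
    >
    (?<text> .*? )
    < \/ \k<tag> >
/x;

my @lines = (
    '<pilot> Holland, Tom</pilot>',
    '<XMLmeal>chicken</XMLmeal>',
    '<xmldrink>diet coke</xmldrink>',
    '<Xmlsnack>almonds</Xmlsnack>',
);

# Collect the text right after each successful match, while %+ is still valid.
my @kept;
for my $line (@lines) {
    push @kept, $+{text} if $line =~ $pattern;
}

print "kept:@kept\n";
```

Only the pilot line survives; the three xml-prefixed names are rejected by the lookahead regardless of case.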

Some Mojo

So far, none of what I've presented handles attributes in the tags, which you want to ignore anyway. You should be able to add those to the regex yourself. But, I'll shift gears into other ways to handle XML-like things.
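Before moving on, here is one possible sketch of the attribute part. It leans on the homework's assumption that < and > never appear inside an attribute value: after the name, it allows whitespace followed by any run of characters that are not angle brackets, and keeps that outside the capture so \k<tag> still holds only the name. The id attribute on the pilot line is made up for the test:

```perl
use strict;
use warnings;

my $pattern = qr/
    <
      (?<tag> [a-zA-Z_] [a-zA-Z_0-9.-]+ )  # the name only
      (?: \s [^<>]* )?                      # optional attributes, not captured
    >
    (?<text> .*? )
    < \/ \k<tag> \s* >                      # end tags carry no attributes
/x;

my @lines = (
    '<pilot id="7"> Holland, Tom</pilot>',   # made-up attribute
    '<flight number="12345">',               # no end tag on this line
);

my @found;
for my $line (@lines) {
    next unless $line =~ $pattern;
    push @found, join ':', $+{tag}, $+{text};
}

print "$_\n" for @found;
```

This is still not an XML parser. It only skips over attributes; it does not validate them.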

Here's a Mojolicious program that understands XML and can extract things. Since it's a real Document Object Model (DOM) parser, it doesn't care about lines.

#!perl

use Mojo::DOM;

my $not_xml = <<~'HERE';
    <start>
    <flight number="12345">
    <pilot> Holland, Tom</pilot>
    <major>Aeronautics Engineer</major>
    <company>Boeing</company>
    <price>200</price>
    <date>06-09-1969</date>
    <details>Flight from DC to VA.</details>
    </flight>
    </start>
    HERE

Mojo::DOM->new->xml(1)->parse( $not_xml )
    ->find( 'flight *' )
    ->map( 'text' )
    ->each( sub { print "$_ " } );

print "\n";

The find uses a CSS Selector to decide what it wants to process. The selector flight * matches every element nested anywhere inside flight (so, any descendant tag no matter its name; flight > * would restrict that to direct children). The map calls the text method on each element that find produces, and each outputs each result. It's very simple because someone has already done all of the hard work.

But, Mojo::DOM is not appropriate for every situation. It wants to know the entire tree at once, and for very large documents that's a burden on memory. There are "streaming" parsers that can handle that.

Twiggy

The problem you present in the original question is different from the homework that you posted in the comments: you want to transform text based on which tag it comes from. That is a different sort of problem altogether.

XML::Twig is useful for processing different node types differently. It has the added advantage that it doesn't need the entire XML tree in memory at one time.

Here's an example that uses two different handlers for the pilot and major portion. When Twig runs into those nodes, it calls the appropriate subroutine that you referenced in twig_handlers. I won't explain the particular Perl features here:

use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        pilot => \&pilot,
        major => \&major,
        },
    );

sub pilot {
    my( $twig, $e ) = @_;
    my $text = $e->text;
    $text =~ s/,\s.\K.*/./;
    print $text, ' ';
    $twig->purge;
    }

sub major {
    my( $twig, $e ) = @_;
    print '"' . $e->text . '"' . ' ';
    $twig->purge;
    }

my $xml = <<~'HERE';
    <start>
    <flight number="12345">
    <pilot> Holland, Tom</pilot>
    <major>Aeronautics Engineer</major>
    <company>Boeing</company>
    <price>200</price>
    <date>06-09-1969</date>
    <details>Flight from DC to VA.</details>
    </flight>
    </start>
    HERE

$twig->parse($xml);

This outputs:

 Holland, T. "Aeronautics Engineer"

Now you'd complete that with subroutines for all of the other things that you want to process.
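The remaining handlers follow the same shape; only the text transformation changes. As a sketch, here are two of those transformations written as plain subs so they run without XML::Twig. The names format_date and trim are hypothetical, not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper: 06-09-1969 becomes 06/09/1969.
sub format_date {
    my ($text) = @_;
    $text =~ tr{-}{/};
    return $text;
}

# Hypothetical helper: remove leading and trailing whitespace.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

# Inside a handler these would be used like the pilot and major subs above:
#   sub date {
#       my( $twig, $e ) = @_;
#       print format_date( $e->text ), ' ';
#       $twig->purge;
#   }

print format_date('06-09-1969'), "\n";
print trim('  Flight from DC to VA. '), "\n";
```

Each new handler also needs its own entry in twig_handlers, for example date => \&date, next to the pilot and major entries.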

Solution 2:[2]

Forenote

Based on comments made after this answer was posted, this is an assignment where the teacher is encouraging the OP to make numerous bad assumptions about XML. They are teaching them to do exactly what one should never do. If the teacher were to define a format, that would be fine; it wouldn't be XML but merely something inspired by XML. But they didn't do that. They explicitly stated it was XML. I can't help the OP any further because

  • I won't teach how to do this incorrectly,
  • doing it correctly without using an existing module would require a time expenditure that's too large,
  • doing it correctly without using an existing module would fall outside the scope of the site, and
  • I don't even know what the teacher wants (having been provided the exact wording of the assignment).

What follows is an answer to the question asked (as opposed to a solution to the OP's homework).


Answer

You are trying to parse XML. There are existing XML parsers you can use instead of spending considerable effort writing your own. I personally use XML::LibXML.

use strict;
use warnings;
use feature qw( say );

use XML::LibXML qw( );

my $doc = XML::LibXML->new->parse_file("flight.txt");

for my $flight_node ($doc->findnodes("/start/flight")) {
   my $pilot   = $flight_node->findvalue("pilot");
   my $major   = $flight_node->findvalue("major");
   my $price   = $flight_node->findvalue("price");
   my $date    = $flight_node->findvalue("date");
   my $details = $flight_node->findvalue("details");

   say "$pilot \"$major\" $price $date $details";
}

Solution 3:[3]

Just to give you some hints:

Your code is "ok" but

my $text = <>;

in your while loop is wrong. You already have the line in $row, so just use $row instead.

Your row also contains a linefeed at the end, so you might want to remove it before printing:

chomp($row);

So wrapping it up:

chomp($row);
$row =~ s/<[^>]*>//g;
print $row . " ";

might be the code in your while-loop you are looking for. And for extra grades, start thinking how to remove unnecessary white space at the beginning/end.
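For the whitespace, one possible approach is to trim each row once the tags are gone and to skip rows that end up empty. This is a minimal sketch; the sub name clean_row is made up, and a few sample rows sit in an array instead of being read from the file:

```perl
use strict;
use warnings;

# Hypothetical helper: strip tags, then leading and trailing whitespace.
sub clean_row {
    my ($row) = @_;
    chomp $row;
    $row =~ s/<[^>]*>//g;      # remove anything that looks like a tag
    $row =~ s/^\s+|\s+$//g;    # trim both ends
    return $row;
}

my @rows = (
    "<start> \n",
    "<pilot> Holland, Tom</pilot>\n",
    "<price>200</price>\n",
);

# Keep only rows that still have text after cleaning.
my @fields = grep { length } map { clean_row($_) } @rows;

print join( ' ', @fields ), "\n";
```

In the real loop, the same sub would be called on $row, and the results joined with single spaces at the end.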

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 HoldOffHunger
Solution 2
Solution 3 Georg Mavridis