Parsing HTML on the command line; How to capture text in <strong></strong>?

I'm trying to grab data from HTML output that looks like this:

<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....

I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:

grep "/strong" output.html | awk '{print $1}'

Grep on "/strong" selects the lines that contain the targets; that works fine.

Piping to awk '{print $1}' works in case #1, where the target has no spaces, but fails in case #2, where the target contains spaces: only the first word is preserved, as shown below:

<strong>Target1NoSpaces</strong><span
<strong>Target2

Do you have any tips on hitting the target properly, either with my awk or with a different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.



Solution 1:[1]

Use the look-behind and look-ahead features of Perl-compatible regexes in grep (the -P option of GNU grep; -o prints only the matched part of each line). This should be simpler than using awk.

grep -oP "(?<=<strong>).*?(?=</strong>)" file

Output:

Target1NoSpaces
Target2 With Spaces

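Addendum:

grep works line by line, so the command above misses text that spans several lines. This Ruby one-liner applies the same look-around regex with the /m flag (which lets . match newlines), so it can match values that run over multiple lines: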

ruby -e 'File.read(ARGV.shift).scan(/(?<=<strong>).*?(?=<\/strong>)/m).each{|e| puts "----------"; puts e;}' file

Input:

<strong>Target
A
B
C
</strong><strong>Target D</strong><strong>Target E</strong>

Output:

----------
Target
A
B
C
----------
Target D
----------
Target E
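
For comparison, here is a minimal Python sketch of the same look-around pattern. It is not part of the original answer: the function name find_strong is made up for illustration, and re.DOTALL stands in for Ruby's /m so that . also matches newlines. Like the commands above, it only recognises a bare <strong> tag with no attributes.

```python
import re

# Same look-behind / look-ahead pattern as the grep and Ruby commands.
# re.DOTALL lets "." match newlines, like the /m flag in Ruby.
STRONG_RE = re.compile(r"(?<=<strong>).*?(?=</strong>)", re.DOTALL)


def find_strong(html):
    """Return the text between each <strong>...</strong> pair, in order.

    find_strong is a hypothetical helper name, not an existing API.
    """
    return STRONG_RE.findall(html)


if __name__ == "__main__":
    sample = (
        "<strong>Target\nA\nB\nC\n</strong>"
        "<strong>Target D</strong><strong>Target E</strong>"
    )
    for text in find_strong(sample):
        print("----------")
        print(text)
```

Because the pattern is lazy (.*?), each match stops at the nearest closing tag, so several tags on one line come back as separate items.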

Solution 2:[2]

Try pup, a command line tool for processing HTML. For example:

$ pup 'strong text{}' < file.html 
Target1NoSpaces
Target2 With Spaces

To search via XPath, try xpup.

Alternatively, for a well-formed HTML/XML document, try html-xml-utils.

Solution 3:[3]

One way, using Mojolicious and its DOM parser:

perl -Mojo -E '
    g("http://your.web")
    ->dom
    ->find("strong")
    ->each( sub { if ( $t = shift->text ) { say $t } } )'

Solution 4:[4]

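Here's a solution using xmlstarlet (on some systems the binary is installed as xmlstarlet rather than xml; the input must be well-formed XML):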

xml sel -t -v //strong input.html

Solution 5:[5]

Trying to parse HTML without a real HTML parser is a bad idea. Having said that, here is a very quick and dirty solution to the specific example you provided. It will not work when there is more than one <strong> tag on a line, when the tag runs over more than one line, etc.

awk -F '<strong>|</strong>' '/<strong>/ {print $2}' filename
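
As a rough illustration of the "real parser" route, the sketch below uses Python's standard-library html.parser to collect the text of every <strong> element. The names StrongTextParser and extract_strong are hypothetical, and this is only one possible approach, assuming Python 3 is available; it is not something the answer itself proposed. Unlike the awk one-liner, it should cope with several tags on a line, tags that span lines, and tags that carry attributes.

```python
from html.parser import HTMLParser


class StrongTextParser(HTMLParser):
    """Collect the text inside every <strong> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # > 0 while we are inside a <strong> element
        self.parts = []     # text chunks of the element being read
        self.results = []   # finished texts, in document order

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attributes are ignored.
        if tag == "strong":
            if self.depth == 0:
                self.parts = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "strong" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts))

    def handle_data(self, data):
        # Keep text only while inside <strong>, including nested inline tags.
        if self.depth > 0:
            self.parts.append(data)


def extract_strong(html):
    """Return the text of each <strong> element found in the given HTML string."""
    parser = StrongTextParser()
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    sample = '<strong>One</strong> and <strong class="x">Two\nlines</strong>'
    print(extract_strong(sample))
```

To use it on a file, read the file into a string and pass it to extract_strong, then print each item.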

Solution 6:[6]

You never need grep with awk and the field separator doesn't have to be whitespace:

$ awk -F'<|>'  '/strong/{print $3}' file
Target1NoSpaces
Target2 With Spaces
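
To see why the text lands in $3, here is a small Python sketch that mimics the same split on < or >. The helper name third_field is invented for illustration; note that awk fields are 1-based while Python list indexes are 0-based.

```python
import re


def third_field(line):
    """Mimic awk -F'<|>' '{print $3}': split on < or > and return field 3."""
    fields = re.split(r"<|>", line)
    # awk's $3 is index 2 in a 0-based Python list.
    return fields[2] if len(fields) >= 3 else ""


if __name__ == "__main__":
    line = '<strong>Target2 With Spaces</strong><span class="creator"> ....'
    # The empty first field comes from the text before the leading "<".
    print(re.split(r"<|>", line)[:4])  # ['', 'strong', 'Target2 With Spaces', '/strong']
    print(third_field(line))           # Target2 With Spaces
```

The same reasoning shows the limitation: if anything (even another tag) precedes <strong> on the line, the target is no longer the third field.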

You should really use a proper parser for this, however.

Solution 7:[7]

Since you tagged this perl:

perl -ne 'print "$1\n" while /<strong>(.*?)<\/strong>/g' input.html

Solution 8:[8]

I am surprised no one has mentioned the W3C HTML-XML-utils:

curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
  hxnormalize -x |
  hxselect -s '\n' strong

Output:

<strong class="fc-black-750 mb6">Stack Overflow
                    for Teams</strong>
<strong>Teams</strong>

To capture only content:

curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
  hxnormalize -x |
  hxselect -s '\n' -c strong

Output:

Stack Overflow
                    for Teams
Teams

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 Birei
Solution 4 Slaven Rezic
Solution 5 Community
Solution 6 Chris Seymour
Solution 7 Jean
Solution 8 Weihang Jian