Parsing HTML on the command line: how to capture text in <strong></strong>?

I'm trying to grab data from HTML output that looks like this:

<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....

I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:

grep "/strong" output.html | awk '{print $1}'

Grepping for "/strong" selects the lines that contain the targets; that works fine.

Piping to awk '{print $1}' works in case #1, where the target has no spaces, but fails in case #2, where it does: awk splits on whitespace by default, so only the first word is preserved (<strong>Target2).


Do you have any tips on hitting the target properly, either in my awk or in a different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.

Solution 1:[1]

Use grep's Perl-compatible regex mode (-P, available in GNU grep) with look-behind and look-ahead, plus -o to print only the matched text. It should be simpler than using awk.

grep -oP "(?<=<strong>).*?(?=</strong>)" file


Target2 With Spaces
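
As a minimal sketch of how the look-around pattern behaves, the snippet below wraps the same grep call in a function and feeds it the two sample lines from the question. The function name extract_strong is only an illustration, and the sketch assumes GNU grep built with PCRE support.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the text inside each <strong>...</strong> pair.
# Assumes GNU grep with -P support; the opening and closing tag must be on the same line.
extract_strong() {
  # -o prints only the matched part; the look-arounds keep the tags out of the match.
  grep -oP '(?<=<strong>).*?(?=</strong>)' "$@"
}

# Sample input modelled on the question, passed on stdin.
printf '%s\n' \
  '<strong>Target1NoSpaces</strong><span class="creator"> ....' \
  '<strong>Target2 With Spaces</strong><span class="creator"> ....' |
  extract_strong
```

Because the match is lazy (.*?), several <strong> elements on one line are reported separately, one per output line.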


The same look-around pattern works in Ruby. With the /m modifier, "." also matches newlines, so values that span multiple lines are captured as well:

ruby -e 'File.read(ARGV.shift).scan(/(?<=<strong>).*?(?=<\/strong>)/m).each{|e| puts "----------"; puts e;}' file


</strong><strong>Target D</strong><strong>Target E</strong>


Target D
Target E

Solution 2:[2]

Try pup, a command line tool for processing HTML. For example:

$ pup 'strong text{}' < file.html 
Target2 With Spaces

To search via XPath, try xpup.

Alternatively, for a well-formed HTML/XML document, try html-xml-utils.

Solution 3:[3]

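One way using Mojolicious and its DOM parser (shown here reading a local file; the name input.html is an assumption):

perl -Mojo -E '
    x(f("input.html")->slurp)->find("strong")
    ->each( sub { if ( my $t = shift->text ) { say $t } } )'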

Solution 4:[4]

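Here's a solution using xmlstarlet (on some systems the command is installed as xmlstarlet rather than xml). The input has to be well-formed XML: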

xml sel -t -v //strong input.html

Solution 5:[5]

Trying to parse HTML without a real HTML parser is a bad idea. Having said that, here is a very quick and dirty solution to the specific example you provided. It will not work when there is more than one <strong> tag on a line, when the tag runs over more than one line, etc.

awk -F '<strong>|</strong>' '/<strong>/ {print $2}' filename
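
The "more than one <strong> tag on a line" limitation can be softened a little by looping over the fields instead of printing only $2. The following is a rough sketch under the same quick-and-dirty assumptions (tags still must not span lines); extract_strong_awk is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: with the tags as field separators, the text between an
# opening and a closing tag always lands in an even-numbered field.
extract_strong_awk() {
  awk -F '<strong>|</strong>' '{ for (i = 2; i <= NF; i += 2) print $i }' "$@"
}

# Two elements on one line, followed by a line without any tags.
printf '<strong>Target D</strong> and <strong>Target E</strong> end\nno tags here\n' |
  extract_strong_awk
```

Lines without a tag have a single field, so the loop body never runs for them and no /<strong>/ filter is needed.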

Solution 6:[6]

You never need grep with awk and the field separator doesn't have to be whitespace:

$ awk -F'<|>'  '/strong/{print $3}' file
Target2 With Spaces

You should really use a proper parser for this, however.
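
To see why the target ends up in $3, it can help to print every field that awk produces with this separator. The helper below is only a debugging sketch, and show_fields is a made-up name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list each field awk sees when "<" and ">" are the separators.
show_fields() {
  awk -F'<|>' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }' "$@"
}

# $1 is the empty text before the first "<", $2 is the tag name "strong",
# and $3 is the text between ">" and the next "<".
printf '<strong>Target2 With Spaces</strong><span class="creator"> ....\n' |
  show_fields
```

Any leading markup or text before the <strong> tag shifts the field numbers, which is one reason this approach is fragile on real pages.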

Solution 7:[7]

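Since you tagged this perl, here is a one-liner. The match is non-greedy so that it stops at the first closing tag on the line:

perl -ne 'if(/(?:<strong>)(.*?)(?:<\/strong>)/){print $1."\n";}' input.html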

Solution 8:[8]

I am surprised no one mentions W3C HTML-XML-utils:

curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
  hxnormalize -x |
  hxselect -s '\n' strong


<strong class="fc-black-750 mb6">Stack Overflow
                    for Teams</strong>

To capture only content:

curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
  hxnormalize -x |
  hxselect -s '\n' -c strong
Stack Overflow
                    for Teams


This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 Birei
Solution 4 Slaven Rezic
Solution 5 Community
Solution 6 Chris Seymour
Solution 7 Jean
Solution 8 Weihang Jian