Grep: count occurrences of a match in a curl body, excluding matches inside comments

I am very new to Linux and Bash scripting. I'm trying to read an XML file with curl and count the occurrences of </entity> in it.

curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" | grep '</entity>' -oP | wc -l

This works, but the XML file contains comments like the one below, which results in a wrong count.

Sample XML file

.........
........
 <entity>
.......
.......
</entity>
........
........
<!--
.......
<entity>
........
</entity>
.......
.......
-->
<entity>
.......
........
</entity>

The expected output is 2, since one of the matches is inside the comment block.
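
To make the miscount easy to reproduce without the server, here is a minimal sketch that runs the same grep against a local file. The temporary file and its contents are assumptions standing in for the curl output (the placeholder text a, b, c replaces the dotted lines of the sample).

```shell
# Hypothetical local stand-in for the XML returned by curl.
sample=$(mktemp)
cat > "$sample" <<'EOF'
 <entity>
a
</entity>
<!--
<entity>
b
</entity>
-->
<entity>
c
</entity>
EOF

# Naive count: every closing tag is counted, including the commented-out one.
# tr strips the padding that some wc implementations add.
naive=$(grep -o '</entity>' "$sample" | wc -l | tr -d ' ')
echo "$naive"
```

This prints 3 rather than the expected 2, because grep has no notion of XML comments.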

Solution 1:^[1]

Since you're using GNU grep, here is a PCRE regex solution for your problem:

curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" |
grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
tr '\0' '\n' |
wc -l

2

RegEx Details:

(?s): Enable DOTALL mode so that dot matches line breaks also
<!--.*?-->: Match a commented block
(*SKIP)(*F): skips and fails this commented block
|: OR
</entity>: Match </entity> outside commented block
tr '\0' '\n': Converts NUL bytes to line break
wc -l: Counts number of lines
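
The same pipeline can be tried on a local file, which may help when checking the regex without access to the server. This is a sketch under assumptions: the temporary file is a hypothetical stand-in for the curl output, and grep must be GNU grep built with PCRE support (-P). The -Z flag is left out here because it only affects how file names are printed.

```shell
# Hypothetical local stand-in for the XML returned by curl.
sample=$(mktemp)
printf '%s\n' \
  '<entity>' 'a' '</entity>' \
  '<!--' '<entity>' 'b' '</entity>' '-->' \
  '<entity>' 'c' '</entity>' > "$sample"

# -z treats the whole file as a single record, so (?s). can cross line breaks.
# Each match is printed NUL-terminated; tr turns those into newlines for wc.
outside=$(grep -zoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' "$sample" |
  tr '\0' '\n' | wc -l | tr -d ' ')
echo "$outside"
```

With this sample the result is 2: the closing tag inside the comment is consumed by the first alternative and discarded.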

Solution 2:^[2]

As usual when dealing with XML, regular expressions are the wrong tool for the job. Use something aware of the format. For example, using xmllint and some XPath:

curl ... | xmllint --xpath 'count(//entity)' -

(Note the trailing -; unlike many programs, xmllint won't automatically read from standard input if not given a filename on the command line)
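
A minimal sketch of the xmllint approach on a local file, assuming xmllint (from libxml2) is installed. The file and the wrapping <root> element are assumptions: xmllint needs a well-formed document with a single root element, which the fragment in the question does not show.

```shell
# Hypothetical well-formed stand-in for the XML returned by curl.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<root>
<entity>a</entity>
<!--
<entity>b</entity>
-->
<entity>c</entity>
</root>
EOF

# Comments are not element nodes, so the commented-out entity is not counted.
count=$(xmllint --xpath 'count(//entity)' "$sample")
echo "$count"
```

Because the parser handles the comment itself, no comment-skipping regex is needed.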

Solution 3:^[3]

With your shown samples, please try following awk code. Written and tested in GNU awk.

your_curl_command | 
awk -v RS="" '
match($0,/(^|\n)<!--[^-]*-->/){
  val=substr($0,RSTART,RLENGTH)
  gsub(val,"")
}
END{
  while(match($0,/(\n|^)[[:space:]]*<entity>[^<]*<\/entity>/)){
    count++
    $0=substr($0,RSTART+RLENGTH)
  }
  print count
}
'

Explanation: Adding detailed explanation for above code.

your_curl_command |                ##Running curl command and sending its output to awk command.
awk -v RS="" '                     ##Setting RS to the empty string (paragraph mode: records are separated by blank lines).
match($0,/(^|\n)<!--[^-]*-->/){    ##Using match function of awk where using regex (^|\n)<!--[^-]*-->(explained below)
  val=substr($0,RSTART,RLENGTH)    ##if match of regex is found then assigning sub string value of matched value to val here.
  gsub(val,"")                     ##Using gsub (global substitution) to remove every occurrence of val from the current record.
}
END{                               ##Starting END block of this awk program from here.
  while(match($0,/(\n|^)[[:space:]]*<entity>[^<]*<\/entity>/)){  ##Using while loop to match regex (\n|^)[[:space:]]*<entity>[^<]*<\/entity> in match function to get all the matches to get count.
    count++                        ##Adding 1 to count variable here.
    $0=substr($0,RSTART+RLENGTH)   ##Assigning rest of line value to current line to avoid previous match.
  }
  print count                      ##Printing count value here.
}
'

Explanation of 1st regex ((^|\n)<!--[^-]*-->):

(^|\n)    ##Matching either starting of value OR new line here.
<!--[^-]* ##Followed by <!-- and any characters up to the next - here.
-->       ##Followed by --> here.

Explanation of 2nd regex((\n|^)[[:space:]]*<entity>[^<]*<\/entity>):

(\n|^)                ##Matching new line OR starting of value.
[[:space:]]*<entity>  ##Followed by spaces(0 or more occurrence) followed by <entity>
[^<]*                 ##Followed by any characters up to the next <
<\/entity>            ##Followed by </entity> here.

Solution 4:^[4]

gawk/mawk/mawk2/nawk '
BEGIN {
  FS = RS = "^$"
  _____ = "[<][\\/]entity[>]"
  ____ = "\23\4"
  ___ =   "\32"
  __ = ("[\\n][<][!]")(_="[-][-][\\n]")
  sub("......","[\\n]&[>]",_)
}

# Rule(s)

($!-_=gsub(_____,"&",
     $((  gsub(__,____)*gsub(_, ___)*\
          gsub(____"[^"(___)"]*"___,""))~"")))_'

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	anubhava
Solution 2	Shawn
Solution 3
Solution 4
