Grep: exclude matches inside <!-- --> comments when counting occurrences in a curl body
I am very new to Linux and bash scripting. I'm trying to read an XML file with curl and count the number of occurrences of the string </entity> in it.
curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" | grep '</entity>' -oP | wc -l
This works, but the XML file contains comments like the one below, and matches inside them are counted too, which gives the wrong total.
Sample XML file
.........
........
<entity>
.......
.......
</entity>
........
........
<!--
.......
<entity>
........
</entity>
.......
.......
-->
<entity>
.......
........
</entity>
The expected output is 2, since one of the three matches is inside the comment block.
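To reproduce the miscount without access to the server, the sample can be fed to the same grep pipeline from a shell variable. This is only a sketch: the variable names sample_xml and naive_count and the element contents are made up for illustration, and grep -o is assumed to be available (it is in GNU grep).

```shell
# Hypothetical stand-in for the curl output, shaped like the sample above.
sample_xml=$(cat <<'EOF'
<entity>
  one
</entity>
<!--
<entity>
  commented out
</entity>
-->
<entity>
  two
</entity>
EOF
)

# Same counting pipeline as in the question, minus curl.
naive_count=$(printf '%s\n' "$sample_xml" | grep -o '</entity>' | wc -l)
echo "$naive_count"   # counts 3: the commented-out </entity> is included
```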
Solution 1:[1]
Since you're using GNU grep, here is a PCRE regex solution for your problem:
curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" |
grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
tr '\0' '\n' |
wc -l
2
RegEx Details:
(?s): Enable DOTALL mode so that dot also matches line breaks
<!--.*?-->: Match a commented block
(*SKIP)(*F): Skip and fail this commented block
|: OR
</entity>: Match </entity> outside a commented block
tr '\0' '\n': Convert NUL bytes to line breaks
wc -l: Count the number of lines
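The pipeline can be tried offline by wrapping it in a small function and feeding it an inline sample. This is a sketch, assuming GNU grep built with PCRE support; the function name count_entities_pcre and the sample text are hypothetical. -Z is left out here because it only changes how file names are printed, which does not matter when reading standard input.

```shell
# Count </entity> outside <!-- ... --> comments; reads XML on stdin.
# Assumes GNU grep with PCRE support (-P).
count_entities_pcre() {
  grep -zoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' | tr '\0' '\n' | wc -l
}

# Hypothetical inline sample standing in for the curl output.
printf '%s\n' '<entity>' 'a' '</entity>' \
  '<!--' '<entity>' 'b' '</entity>' '-->' \
  '<entity>' 'c' '</entity>' | count_entities_pcre   # prints 2
```

With the real data, the curl command simply replaces the printf: curl -s "..." | count_entities_pcre.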
Solution 2:[2]
As usual when dealing with XML, regular expressions are the wrong tool for the job. Use something aware of the format. For example, using xmllint and some XPath:
curl ... | xmllint --xpath 'count(//entity)' -
(Note the trailing -; unlike many programs, xmllint won't automatically read from standard input if not given a filename on the command line.)
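Here is a small self-contained way to try the XPath approach on a local sample. It is a sketch that assumes xmllint (from libxml2) is installed and that the document is well-formed with a single root element; the function name count_entities_xpath, the <root> wrapper and the sample contents are made up for illustration.

```shell
# Count <entity> elements with XPath; comments are not part of the element
# tree, so anything inside <!-- ... --> is ignored. Reads XML on stdin.
count_entities_xpath() {
  xmllint --xpath 'count(//entity)' -
}

# Hypothetical well-formed sample standing in for the curl output.
sample_xml='<root><entity>a</entity><!-- <entity>b</entity> --><entity>c</entity></root>'

printf '%s\n' "$sample_xml" | count_entities_xpath   # prints 2
```

If the real file declares a default namespace, //entity would likely match nothing; an expression such as count(//*[local-name()="entity"]) is one way around that.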
Solution 3:[3]
With your shown samples, please try the following awk code. Written and tested in GNU awk.
your_curl_command |
awk -v RS="" '
match($0,/(^|\n)<!--[^-]*-->/){
  val=substr($0,RSTART,RLENGTH)
  gsub(val,"")
}
END{
  while(match($0,/(\n|^)[[:space:]]*<entity>[^<]*<\/entity>/)){
    count++
    $0=substr($0,RSTART+RLENGTH)
  }
  print count
}
'
Explanation: a line-by-line explanation of the code above.
your_curl_command | ##Run the curl command and send its output to awk.
awk -v RS="" ' ##Set RS to the empty string (paragraph mode) for this awk program.
match($0,/(^|\n)<!--[^-]*-->/){ ##Use awk's match function with the regex (^|\n)<!--[^-]*--> (explained below).
  val=substr($0,RSTART,RLENGTH) ##If the regex matches, assign the matched substring to val.
  gsub(val,"") ##Use gsub (global substitution) to remove every occurrence of val from the current record.
}
END{ ##Start of the END block of this awk program.
  while(match($0,/(\n|^)[[:space:]]*<entity>[^<]*<\/entity>/)){ ##Loop while the regex (\n|^)[[:space:]]*<entity>[^<]*<\/entity> still matches, so that every match is counted.
    count++ ##Add 1 to the count variable.
    $0=substr($0,RSTART+RLENGTH) ##Keep only the text after the match so the same match is not counted again.
  }
  print count ##Print the value of count.
}
'
Explanation of the 1st regex ((^|\n)<!--[^-]*-->):
(^|\n) ##Match either the start of the value OR a new line.
<!--[^-]* ##Followed by <!-- and then any characters other than -.
--> ##Followed by -->.
Explanation of the 2nd regex ((\n|^)[[:space:]]*<entity>[^<]*<\/entity>):
(\n|^) ##Match a new line OR the start of the value.
[[:space:]]*<entity> ##Followed by whitespace (0 or more occurrences), then <entity>.
[^<]* ##Followed by any characters up to, but not including, the next <.
<\/entity> ##Followed by </entity>.
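Note that [^-]* in the first regex stops at the first hyphen, so a comment whose text contains a - would not be removed. Below is a sketch of the same strip-the-comments-then-count idea that uses plain substring search instead. It is not the answer's code: the function name count_entities_awk and the awk variable names are made up, and dropping everything after an unterminated <!-- is just one possible choice.

```shell
# Count </entity> outside <!-- ... --> comments; reads XML on stdin.
count_entities_awk() {
  awk '
    { doc = doc $0 "\n" }                 # collect the whole input in one string
    END {
      # Cut out each <!-- ... --> block using plain substring search.
      while ((start = index(doc, "<!--")) > 0) {
        rest = substr(doc, start + 4)
        stop = index(rest, "-->")
        if (stop == 0) {                  # unterminated comment: drop the tail
          doc = substr(doc, 1, start - 1)
          break
        }
        doc = substr(doc, 1, start - 1) substr(rest, stop + 3)
      }
      # gsub returns how many replacements it made; "&" keeps the text unchanged.
      n = gsub(/<\/entity>/, "&", doc)
      print n
    }'
}

# Hypothetical sample whose comment contains a hyphen.
printf '%s\n' '<entity>a</entity>' \
  '<!-- old-style' '<entity>b</entity>' '-->' \
  '<entity>c</entity>' | count_entities_awk   # prints 2
```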
Solution 4:[4]
gawk/mawk/mawk2/nawk '
BEGIN {
  FS = RS = "^$"
  _____ = "[<][\\/]entity[>]"
  ____ = "\23\4"
  ___ = "\32"
  __ = ("[\\n][<][!]")(_="[-][-][\\n]")
  sub("......","[\\n]&[>]",_)
}
# Rule(s)
($!-_=gsub(_____,"&",
$(( gsub(__,____)*gsub(_, ___)*\
gsub(____"[^"(___)"]*"___,""))~"")))_'
2
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | anubhava |
Solution 2 | Shawn |
Solution 3 | |
Solution 4 | |