'summarise NA data from datafile on command line in linux / bash
I have a data file with missing values
I want to summarise this data and find out how many NA there are in the whole data file and in certain specific cols
I have tried
grep -i "NA" datafile.gz | wc -l
and
awk 'NA' datafile.gz | wc -l
But they both print '1' which is not right
Any ideas?
EDIT
@Roadowl provided
zgrep -i "na" datafile.gz
Which gives the total number of NA, how do I specify by column?
grep -i "na" datafile.gz | awk '$3=="na"'
Does not work
Solution 1:[1]
The fundamental problem here seems to be that you haven't noticed the difference between compressed data and plain text. Most Unix tools only work on the latter.
But first: Your attempts will look for "na" as a substring of any value in any column. So "banana" will match because ... well, I imagine you can see why.
Without seeing your actual data, it's hard to know what exactly you need, but for looking for entries which are exactly "NA" case-insensitively in column 3 of a tab-delimited file, try
awk -F '\t' 'tolower($3) == "na"' file
The tolower()
call converts to lower case so that you can then compare to a single string; this will cover all case variations of "NA"
, "Na"
, "nA"
, or "na"
but only look for an exact match after case conversion, instead of looking for a substring. (This assumes you don't have e.g. whitespace around the values in the column, too; it really is an exact comparison.)
To get the count, pipe to wc -l
(or refactor the Awk script:
awk -F '\t' 'tolower($3) == "na" { sum += 1 } END { print sum }' file
Probably read a 15-minute introduction to Awk if this isn't obvious.)
Perhaps more usefully, include file name and line number
awk -F '\t' 'tolower($3) == "na" { print FILENAME ":" FNR ":" $0 }' file
(You can pass multiple file names, and then the FILENAME
makes more sense.)
To examine a different column, replace $3
with e.g. $42
to examine column 42. To use a different delimiter, put the delimiter instead of '\t'
after -F
. This still doesn't correctly deal with e.g. quoting in CSV files (maybe switch to a language which knows how to parse CSV in all its variations correctly, like Python, or use a dedicated CSV tool -- I hear there's one called csvtool
).
If your file is compressed, you can't grep
(or Awk or sed
or etc) the compressed data directly; you have to uncompress it first:
gzcat datafile.gz |
awk -F '\t' -v gzfile="datafile.gz" 'tolower($3) == "na" { print gzfile ":" FNR ":" $0 }'
Notice how you don't put a file name after the Awk script when the input comes from a pipe (or not just Awk actually; any tool which is capable of reading a file or standard input, as is the norm among Unix text-processing tools).
The above also shows how to pass a string into Awk as a variable (-v variable="value"
). This might be more useful if you have multiple files you want to loop over:
for datafile in datafile.gz datafile0.gz datafile1.gz; do
gzcat "$datafile" |
awk -F '\t' -v gzfile="$datafile" 'tolower($3) == "na" { print gzfile ":" FNR ":" $0 }'
done
Slightly bewilderingly, the $datafile
variable is a shell variable (Bash or Zsh or what have you) which is entirely distinct from Awk's internal variables. We use the -v
trick from above to make it available to Awk in the variable gzfile
.
You could use grep
(which reads plain text) or gzgrep
(which can read gzip
-compressed data directly) as well, but then you want to pass in a regular expression which targets the specific column. Just to show how it's done, here is a regex which says "something with no tabs, followed by a tab, followed by something with no tabs, followed by a tab, followed by na
, followed by a tab" which (once you wrap your head around it) targets the third column.
gzgrep -Ei -c $'^[^\t]*\t[^\t]*\tna\t' filename.gz
The $'...'
notation is Bash-specific and allows us to use the symbol \t
instead of a literal tab in the regular expression. The -c
option to grep
asks it to report how many matching lines it found, so you don't even need wc -l
here. The -E
option selects a slightly less arcane and more modern regular expression dialect than the default from the early 1970s. (It's still not properly modern by any standard; the "extended" dialect is from the mid-1970s, with some later brushing up by the POSIX standardization in the 1990s. Newer tools support a plethora of extensions which are not standard in grep
.)
To look for na
alone with tabs on both sides in any column, try
gzgrep -Eic -e $'^na\t|\tna\t|\tna$' filename.gz
where |
in the regular expression stands for "or", and ^na\t
is the regular expression for na
bracketed by beginning of line on one side and a tab on the other, and \tna$
is the same for end of line.
If you don't use Bash, you can remove the $
before '
and type a literal tab, usually with ctrl V tab
Files whose name ends with .gz
are compressed with gzip
; there are other compression tools which conventionally use file extensions like .Z
(plain old compress
, from the dark ages), .bz2
(bzip2
), .xz
(xz
) etc; most of these ship with a bzcat
or similar tool which performs the same job as gzcat
for gzip
files, and some will also have something like bzgrep
to parallel gzgrep
. (To add to the alphabet soup, gzcat
could also mean GNU zcat
on some systems! Then it handles .Z
files, not .gz
files).
Unix-based systems generally don't rely on the file extension to identify the type of a file, but the human convention is to add an extension when you compress something, and many of the tools will do that automatically (so for example gzip file
creates file.gz
).
A quick examination of a compressed file, perhaps with a hex dump tool, should reveal why grep
doesn't work directly on the file. The compressed data is a tightly packed oblique binary which simply does not contain the uncompressed data in any easily discoverable form (unless you learn a lot about compression!)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |