Standard Deviation from multiple files in bash
I wish to calculate the standard deviation for each of a set of files named "res_NUMBER.cs", which are formatted as CSV. Example data:
1,M,CA,54.9130
1,M,CA,54.9531
1,M,CA,54.8845
1,M,CA,54.7517
1,M,CA,54.8425
1,M,CA,55.2648
1,M,CA,55.0876
I have calculated the mean using
#!/bin/bash
for f in res*.cs; do
    echo "$f"
    echo " "
    #Count number of lines N
    lines=$(wc -l < "$f")
    #Sum Total
    sum=$(awk -F "," '{print $4}' "$f" | paste -sd+ | bc)
    #Mean
    mean=$(echo "scale=5 ; $sum / $lines" | bc)
    echo "$mean"
    echo " "
done
I would like to calculate the standard deviation across each file. I understand that the standard deviation formula is
S.D=sqrt((1/N)*(sum of (value - mean)^2))
But I am unsure how I would implement this into my script.
Solution 1:[1]
awk is powerful enough to calculate the mean of one file easily:
$ awk -F, '{sum+=$4} END{print sum/NR}' file
To add the standard deviation (note that your formula is for the population, not for a sample; that is what I replicate here):
$ awk -F, '{sum+=$4; ss+=$4^2} END{print m=sum/NR,sqrt(ss/NR-m^2)}' file
54.9567 0.15778
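This uses the fact that stddev = sqrt(Var(x)) = sqrt( E(x^2) - E(x)^2 ). It has worse numerical accuracy than summing the squared differences from the mean, because it squares the raw values and then subtracts two large, nearly equal numbers, but it works fine if your values are small in magnitude.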
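If accuracy matters more than doing it in a single pass, the formula from the question can be applied directly by reading each file twice: once for the mean, once for the squared differences. A minimal sketch, assuming the same four-column layout; the function name two_pass_stddev and its optional second argument (0 for the population divisor N, 1 for the sample divisor N-1) are made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: two-pass standard deviation of column 4 of a CSV file.
# Usage: two_pass_stddev FILE [DDOF]   (DDOF=0 population, DDOF=1 sample)
two_pass_stddev() {
    local file=$1 ddof=${2:-0}
    # The file is listed twice so awk reads it twice; NR == FNR only holds
    # while the first copy is being read.
    awk -F, -v ddof="$ddof" '
        NR == FNR { sum += $4; n++; next }        # pass 1: sum and count
        FNR == 1  { mean = sum / n }              # pass 2 begins: mean is known
                  { d = $4 - mean; ss += d * d }  # pass 2: squared deviations
        END {
            if (n - ddof <= 0) exit 1             # too few rows for this divisor
            printf "%.4f %.5f\n", mean, sqrt(ss / (n - ddof))
        }
    ' "$file" "$file"
}
```

Calling two_pass_stddev file should give the same mean and population standard deviation as the one-pass command above on the sample data; two_pass_stddev file 1 switches to the sample formula.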
The simplest approach is then to run this in a for loop over the files:
for f in res*.cs
do
    awk -F, '{sum+=$4; ss+=$4^2}
         END {print FILENAME;
              print "mean:", m=sum/NR, "stddev:", sqrt(ss/NR-m^2)}' "$f"
done
To process res1.cs .. res37.cs in that order, the easiest way is to change the for loop:
for f in res{1..37}.cs
# the rest of the code not changed.
which will expand in the numerical order specified.
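One caveat: brace expansion generates every name from 1 to 37 whether or not the file exists, so awk will complain about any gaps. A sketch of a loop that keeps the numeric order but skips missing files; the function name stats_in_order and its argument (the highest file number) are assumptions for illustration, and the clamp on the variance is an extra guard against rounding producing a tiny negative value:

```shell
#!/bin/bash
# Hypothetical helper: mean and population stddev of column 4 for
# res1.cs .. resN.cs in numeric order, skipping numbers with no file.
stats_in_order() {
    local last=$1 i f
    for ((i = 1; i <= last; i++)); do
        f="res${i}.cs"
        [ -f "$f" ] || continue          # no such file: move on quietly
        awk -F, '{sum+=$4; ss+=$4^2}
             END {if (NR == 0) exit      # empty file: nothing to report
                  m = sum/NR; v = ss/NR - m^2
                  if (v < 0) v = 0       # clamp rounding noise before sqrt
                  print FILENAME
                  print "mean:", m, "stddev:", sqrt(v)}' "$f"
    done
}
```

Run it from the directory holding the files, for example stats_in_order 37.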
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow