'Standard Deviation from multiple files in bash

I wish to calculate the standard deviation from a range of files titled "res_NUMBER.cs" which are formatted as a CSV. Example data includes

1,M,CA,54.9130  
1,M,CA,54.9531  
1,M,CA,54.8845  
1,M,CA,54.7517  
1,M,CA,54.8425  
1,M,CA,55.2648  
1,M,CA,55.0876 

I have calculated the mean using

#!/bin/bash


files=`ls res*.cs`  
for f in $files; do 
        echo "$f" 
        echo " " 
        #Count number of lines N 
        lines=`cat $f | wc -l` 
        #Sum Total 
        sum=`cat $f | awk -F "," '{print $4}' | paste -sd+ | bc` 
        #Mean 
        mean=`echo "scale=5 ; $sum / $lines" | bc` 
        echo "$mean" 
        echo " "

I would like to calculate the standard deviation across each file. I understand that the standard deviation formula is

S.D=sqrt((1/N)*(sum of (value - mean)^2))

But I am unsure how I would implement this into my script.



Solution 1:[1]

awk is powerful enough to calculate the mean of one file easily

$ awk -F, '{sum+=$4} END{print sum/NR}' file

to add standard deviation (not that your formula is for population, not for sample, that's what I replicate here)

 $ awk -F, '{sum+=$4; ss+=$4^2} END{print m=sum/NR,sqrt(ss/NR-m^2)}' file
 54.9567 0.15778

this uses the fact that stddev = sqrt(Var(x)) = sqrt( E(x^2) - E(x)^2 ) which has worse numerical accuracy (since squaring the values instead of diff) but works fine if your values have low bounds.

The simplest is then using this in a for loop for the files

for f in res*.cs
do 
    awk -F, '{sum+=$4; ss+=$4^2} 
         END {print FILENAME; 
              print "mean:", m=sum/NR, "stddev:", sqrt(ss/NR-m^2)}' "$f"
end

to run res1.cs .. res37.cs in that order, easiest is change the for loop

for f in res{1..37}.cs
# the rest of the code not changed.

which will expand in the numerical order specified.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1