How to sum the values in one data column of duplicate rows with awk and remove the duplicate rows?

I have a CSV file with a heading row and multiple data rows each with 11 data columns like this:

Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205588935534900,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,[email protected],205588935534900,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address

I want to remove the duplicates in that file and sum the values in the Quantity data column. I want the result to be like this:

Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address

I want to sum only the values in the fifth data column, Quantity, while leaving the rest as they are. I have tried the solution in Sum duplicate row values with awk, but the answer there works only if the file has exactly two data columns. My CSV file has 11 data columns, so it doesn't work.

How can I do this with awk?


Solution 1:[1]

This is a direct adaptation of Karafka's solution, with a little extra code added to print the lines in their original order (the order in which they appear in Input_file), as the OP requested.

awk -F, '
FNR==1{
  print;
  next}
{
  val=$5;
  $5="~";
  a[$0]+=val
}
!b[$0]++{
  c[++count]=$0}
END{
  for(i=1;i<=count;i++){
     sub("~",a[c[i]],c[i]);
     print c[i]}
}' OFS=, Input_file

Explanation: here is the same code with inline comments.

awk -F, '                         ##Set the field separator to comma.
FNR==1{                           ##If this is line 1 (the header), do the following.
  print;                          ##Print the current line.
  next}                           ##Skip all further statements for this line.
{
  val=$5;                         ##Save the 5th field of the current line in variable val.
  $5="~";                         ##Replace the 5th field with ~ so duplicate lines become identical (this forms the index for array a).
  a[$0]+=val                      ##Sum val into array a, indexed by the modified current line.
}
!b[$0]++{                         ##If array b has no entry yet for the current line (first occurrence), do the following.
  c[++count]=$0}                  ##Store the line in array c at the next value of count, preserving input order.
END{                              ##Start the END block of the awk code.
  for(i=1;i<=count;i++){          ##Loop from 1 to the value of count.
     sub("~",a[c[i]],c[i]);       ##Replace ~ in the stored line with the summed value of $5.
     print c[i]}                  ##Print the line with $5 restored to its summed value.
}' OFS=, Input_file               ##Set OFS to comma and name the Input_file.
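For reference, here is a quick end-to-end run of the above on the sample data from the question (the file name Input_file is arbitrary; the en dash in the timestamps is kept as in the question):

```shell
# Recreate the sample CSV from the question, then run Solution 1 on it.
cat > Input_file <<'EOF'
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205588935534900,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,[email protected],205588935534900,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
EOF

awk -F, '
FNR==1{ print; next }
{ val=$5; $5="~"; a[$0]+=val }      # key: the line with Quantity blanked out
!b[$0]++{ c[++count]=$0 }           # remember first-seen order of each key
END{
  for(i=1;i<=count;i++){ sub("~",a[c[i]],c[i]); print c[i] }
}' OFS=, Input_file
```

This prints the header followed by the three merged rows, with Quantity summed to 3, 4 and 3 respectively.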

Solution 2:[2]

awk to the rescue!

$ awk 'BEGIN{FS=OFS=","} 
       NR==1{print; next} 
            {q=$5; $5="~"; a[$0]+=q} 
       END  {for(k in a) {sub("~",a[k],k); print k}}' file

Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205588935534900,,3,Gold,05 Jun 2018 - 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,[email protected],205486016644400,,4,Gold,05 Jun 2018 - 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205583995140400,,3,Gold,05 Jun 2018 - 10:01,In Process,Rp3.000.000,Done,Syahrul Address

note that the order of records is not guaranteed, but the input also doesn't need to be sorted beforehand. To preserve the order there are multiple solutions...

Also, I use ~ as a placeholder. If your data includes this character, replace it with one that doesn't appear in the data.
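If no safe placeholder character exists at all, a placeholder-free variant is possible. The sketch below is my own addition, not part of the original answer: it keys each row on everything except the Quantity column and keeps the two halves of the row to reassemble later (the file name sample.csv is hypothetical, and a reduced sample is used for brevity):

```shell
# Build a small sample with the question's 11-column layout.
cat > sample.csv <<'EOF'
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal  Address
EOF

# Placeholder-free sketch: key on the row minus Quantity, sum $5 per key,
# and print head, summed quantity, tail in first-seen order.
awk 'BEGIN{FS=OFS=","}
NR==1{ print; next }
{
  pre=$1 FS $2 FS $3 FS $4                      # fields before Quantity ($1-$4)
  post=$6; for(i=7;i<=NF;i++) post=post FS $i   # fields after Quantity ($6-$NF)
  key=pre FS post                               # row identity without Quantity
  sum[key]+=$5
  if(!(key in head)){ order[++n]=key; head[key]=pre; tail[key]=post }
}
END{
  for(i=1;i<=n;i++){ k=order[i]; print head[k], sum[k], tail[k] }
}' sample.csv
```

Since print joins its arguments with OFS, the row comes back out as pre,sum,post with the commas intact; no character in the data is ever overwritten.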

UPDATE

To preserve the order (based on first appearance of a row)

$ awk 'BEGIN{FS=OFS=","} 
       NR==1{print; next} 
            {q=$5;$5="~"; if(!($0 in a)) b[++c]=$0; a[$0]+=q} 
       END  {for(k=1;k<=c;k++) {sub("~",a[b[k]],b[k]); print b[k]}}' file

Keep a separate structure that records the order of the rows, and iterate over that structure in the END block...

Solution 3:[3]

This solution uses two arrays to guarantee the output is deduplicated and printed in original input order:

  • the 1st array tracks input row order,
  • the 2nd one both de-dupes and sums up $5.

 % cat testfile.txt
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205588935534900,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,[email protected],205588935534900,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address

 % < testfile.txt mawk 'BEGIN { 
        
       print   $( (FS=OFS=",")*(getline))\
              ($(!(__=(_+=(_=_~_)+_)+--_-(_="")))=_) 
     
    } (______=($+__)($__=_=""))==(___[_=$_]+=______) { 
                 ____[++_____]=_ 
    } END { 
         for(_^=!__;_<=_____;_++) { 
            
             print $(($__=___[$!_=____[_]])<-_) } }'

Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3: RARE Kpop Manifesto