How to sum the values in one data column of duplicate rows with awk and remove the duplicate rows?
I have a CSV file with a heading row and multiple data rows each with 11 data columns like this:
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,[email protected],205588935534900,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,[email protected],205588935534900,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
I want to remove the duplicates in that file and sum the values in the Quantity
data column. I want the result to be like this:
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,[email protected],205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
I want to sum only the values in the fifth data column, Quantity, while leaving the rest as they are. I have tried the solution in Sum duplicate row values with awk, but the answer there works only if the file has just two data columns; my CSV file has 11 data columns, so it doesn't work.
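For reference, that two-column answer boils down to a one-liner along these lines (paraphrased, not the exact code there):
$ awk -F, '{sum[$1]+=$2} END{for(k in sum) print k FS sum[k]}' file
It keys on the first column alone and rebuilds each output line from just the key and the sum, so it cannot carry the other nine columns through unchanged.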
How can I do this with awk?
Solution 1:[1]
This is a direct adaptation of Karafka's solution (Solution 2 below), with a bit of extra code added to print the lines in the order in which they appear in Input_file, as per the OP's request.
awk -F, '
FNR==1{
print;
next}
{
val=$5;
$5="~";
a[$0]+=val
}
!b[$0]++{
c[++count]=$0}
END{
for(i=1;i<=count;i++){
sub("~",a[c[i]],c[i]);
print c[i]}
}' OFS=, Input_file
Explanation: the same code, with each line commented.
awk -F, ' ##Setting field separator as comma here.
FNR==1{ ##If the line number is 1 (the header line), do the following.
print; ##Print the current line.
next} ##Skip all further statements for this line.
{
val=$5; ##Save the 5th field (Quantity) of the current line in variable val.
$5="~"; ##Replace the 5th field with ~ so duplicate rows become identical (this forms the index for array a).
a[$0]+=val ##Add val into array a, indexed by the current (placeholder) line, summing the quantities.
}
!b[$0]++{ ##If this line has not been seen before (b[$0] is still zero), do the following.
c[++count]=$0} ##Store the line in array c under the incrementing counter count, to remember input order.
END{ ##Starting the END block of the awk code here.
for(i=1;i<=count;i++){ ##Loop from 1 to count over the stored lines.
sub("~",a[c[i]],c[i]); ##Substitute the ~ placeholder in the stored line with the summed $5 value.
print c[i]} ##Print the stored line, with $5 now replaced by its summed value.
}' OFS=, Input_file ##Setting OFS as comma here and naming the Input_file.
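For reference, run against the sample data saved as Input_file, this should print exactly the desired output, with the rows in their original order:
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,[email protected],205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address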
Solution 2:[2]
awk to the rescue!
$ awk 'BEGIN{FS=OFS=","}
NR==1{print; next}
{q=$5; $5="~"; a[$0]+=q}
END {for(k in a) {sub("~",a[k],k); print k}}' file
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205588935534900,,3,Gold,05 Jun 2018 - 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,[email protected],205486016644400,,4,Gold,05 Jun 2018 - 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,[email protected],205583995140400,,3,Gold,05 Jun 2018 - 10:01,In Process,Rp3.000.000,Done,Syahrul Address
Note that the order of the output records is not guaranteed, but this approach also doesn't require the input to be sorted. There are multiple ways to preserve the order...
Also, I use ~ as a placeholder. If your data includes this character, replace it with an unused one.
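For example, a minimal variation (a sketch, assuming the ASCII control character \034 never occurs in your data) swaps in that character as the placeholder:
$ awk 'BEGIN{FS=OFS=","; P="\034"}
  NR==1{print; next}
  {q=$5; $5=P; a[$0]+=q}
  END {for(k in a) {sub(P,a[k],k); print k}}' file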
UPDATE
To preserve the order (based on first appearance of a row)
$ awk 'BEGIN{FS=OFS=","}
NR==1{print; next}
{q=$5;$5="~"; if(!($0 in a)) b[++c]=$0; a[$0]+=q}
END {for(k=1;k<=c;k++) {sub("~",a[b[k]],b[k]); print b[k]}}' file
The trick is to keep a separate structure that marks the order in which the rows first appear, then iterate over that structure in the END block...
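The same idiom generalizes to any group-and-sum task; a minimal sketch with hypothetical field positions (group on $1, sum $3) looks like this:
$ awk -F, '!seen[$1]++ { order[++n]=$1 }   # record each key on first appearance
  { sum[$1]+=$3 }                          # accumulate the value column
  END { for (i=1;i<=n;i++) print order[i], sum[order[i]] }' file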
Solution 3:[3]
This solution uses an extra array to guarantee that the output is deduped and kept in original input order:
- the 1st array tracks input row order,
- the 2nd one both de-dupes and sums up $5.
% cat testfile.txt
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,[email protected],205588935534900,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,[email protected],205588935534900,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
% < testfile.txt mawk 'BEGIN {
print $( (FS=OFS=",")*(getline))\
($(!(__=(_+=(_=_~_)+_)+--_-(_="")))=_)
} (______=($+__)($__=_=""))==(___[_=$_]+=______) {
____[++_____]=_
} END {
for(_^=!__;_<=_____;_++) {
print $(($__=___[$!_=____[_]])<-_) } }'
Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal Address
05 Jun 2018,[email protected],205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
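If the golfed version is hard to follow, it appears to implement the same blank-out-and-sum pattern as the earlier solutions; here is one readable sketch of that reading (an interpretation, not the author's code):
mawk 'BEGIN { FS=OFS="," }
NR==1 { print; next }                # pass the header through untouched
{ qty=$5; $5=""; sum[$0]+=qty }      # blank Quantity so duplicates collapse to one key
!seen[$0]++ { order[++n]=$0 }        # remember each key the first time it appears
END { for (i=1;i<=n;i++) { $0=order[i]; $5=sum[$0]; print } }' testfile.txt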
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | RARE Kpop Manifesto |