Pipe awk and grep to save a particular field of a file
What I want to achieve:
- grep: extract lines with the contig number and length
- awk: remove "length:" from column 2
- sort: sort by length (in descending order)
Current code
grep "length:" test_reads.fa.contigs.vcake_output | awk -F:'{print $2}' |sort -g -r > contig.txt
Example content of test_reads.fa.contigs.vcake_output:
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
Expected output
>Contig_0 99995
>Contig_11 42
Solution 1:[1]
With your shown samples, please try the following awk + sort solution.
awk -F'[: ]' '/^>/{print $1,$3}' Input_file | sort -nrk2
Explanation: awk reads Input_file with the field separator set to either a colon or a space. For every line that starts with >, it prints the 1st and 3rd fields (the contig name and the length). That output is piped to sort as standard input, which orders it numerically (-n) in reverse (-r) starting from the 2nd field (-k2) to give the required output.
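To make the field numbering concrete, here is a minimal sketch that runs the same pipeline on the two-record sample from the question. The function name `contig_lengths` and the here-document are illustrative assumptions, not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper: read FASTA-like text on stdin and print
# "<name> <length>" lines, largest length first.
contig_lengths() {
    # -F'[: ]' splits on a colon or a space, so for ">Contig_11 length:42"
    # $1 is ">Contig_11", $2 is "length" and $3 is "42".
    awk -F'[: ]' '/^>/ {print $1, $3}' | sort -nrk2
}

contig_lengths <<'EOF'
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
EOF
```

Because the helper reads standard input, it can also be fed a file with `contig_lengths < test_reads.fa.contigs.vcake_output > contig.txt`.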
Solution 2:[2]
Here is a gnu-awk solution that does it all in a single command without invoking sort:
awk -F '[:[:blank:]]' '
$2 == "length" {arr[$1] = $3}
END {
PROCINFO["sorted_in"] = "@val_num_desc"
for (i in arr)
print i, arr[i]
}' file
>Contig_0 99995
>Contig_11 42
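`PROCINFO["sorted_in"]` is a GNU awk feature, so the command above needs gawk. The following is a minimal sketch, assuming gawk may or may not be installed: it orders the array by value, descending, with `@val_num_desc`, and otherwise falls back to plain awk plus sort. The function name `sorted_contigs` is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: read the contig file on stdin and print
# "<name> <length>" sorted by length, descending.
sorted_contigs() {
    if command -v gawk >/dev/null 2>&1; then
        # gawk: make the for-in loop walk the array by numeric value, descending.
        gawk -F '[:[:blank:]]' '
            $2 == "length" { arr[$1] = $3 }
            END {
                PROCINFO["sorted_in"] = "@val_num_desc"
                for (i in arr) print i, arr[i]
            }'
    else
        # Other awks: same filter, but the ordering is delegated to sort.
        awk -F '[: ]' '$2 == "length" { print $1, $3 }' | sort -k2,2nr
    fi
}

sorted_contigs <<'EOF'
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
EOF
```

Note that the gawk branch keys the array on the contig name, so a name that appears twice keeps only its last length.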
Solution 3:[3]
Perhaps this, combining grep and awk:
awk -F '[ :]' '$2 == "length" {print $1, $3}' file | sort ...
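The `sort ...` arguments are left open in this answer. One possible completion, as a sketch: `sort -k2,2nr` is an assumption chosen to match the descending-length requirement in the question. The `ignore-this-length` record is borrowed from the extended sample in Solution 4 and shows that the `$2 == "length"` test is stricter than matching on a leading `>` alone.

```shell
#!/bin/sh
# Sample input; the second header has a key that merely ends in "length".
sample='>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_837 ignore-this-length:1000000
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA'

# Exact comparison on field 2 keeps only true "length:" headers.
strict=$(printf '%s\n' "$sample" |
    awk -F '[ :]' '$2 == "length" {print $1, $3}' | sort -k2,2nr)

# Matching on the leading ">" alone lets the ignore-this-length record through.
loose=$(printf '%s\n' "$sample" |
    awk -F '[ :]' '/^>/ {print $1, $3}' | sort -k2,2nr)

printf 'strict:\n%s\n\nloose:\n%s\n' "$strict" "$loose"
```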
Solution 4:[4]
Assumptions:
- if more than one row has the same length then additionally sort the 1st column using 'version' sort
Adding some additional lines to the sample input:
$ cat test_reads.fa.contigs.vcake_output
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_17 length:93
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_837 ignore-this-length:1000000
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_8 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
One sed/sort idea:
$ sed -rn 's/(>[^ ]+) length:(.*)$/\1 \2/p' test_reads.fa.contigs.vcake_output | sort -k2,2nr -k1,1V
Where:
- -En - enable extended regex support and suppress normal printing of input data (in GNU sed, -r as used in the command above is a synonym for -E)
- (>[^ ]+) - (1st capture group) - > followed by 1 or more non-space characters
- " length:" - space followed by length:
- (.*) - (2nd capture group) - 0 or more characters (following the colon)
- $ - end of line
- \1 \2/p - print 1st capture group + <space> + 2nd capture group
- -k2,2nr - sort by 2nd (space-delimited) field in reverse numeric order
- -k1,1V - sort by 1st (space-delimited) field in Version order
This generates:
>Contig_0 99995
>Contig_17 93
>Contig_8 42
>Contig_11 42
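To see what the second sort key contributes, this sketch sorts the two tied records with and without the `V` modifier. It assumes GNU sort, since version sort is a GNU extension; `LC_ALL=C` is set only to make the plain comparison byte-wise and predictable.

```shell
#!/bin/sh
# Two records with the same length, as in the extended sample.
tied='>Contig_11 42
>Contig_8 42'

# Plain (byte-wise) tie-break: "1" sorts before "8", so Contig_11 comes first.
plain=$(printf '%s\n' "$tied" | LC_ALL=C sort -k2,2nr -k1,1)

# Version tie-break: digit runs are compared as numbers, so 8 precedes 11.
version=$(printf '%s\n' "$tied" | LC_ALL=C sort -k2,2nr -k1,1V)

printf 'plain:\n%s\n\nversion:\n%s\n' "$plain" "$version"
```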
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | anubhava |
Solution 3 | glenn jackman |
Solution 4 | |