Pipe awk and grep to save a particular field of a file

What I want to achieve:

  • grep: extract lines with the contig number and length
  • awk: remove "length:" from column 2
  • sort: sort by length (in descending order)

Current code

grep "length:" test_reads.fa.contigs.vcake_output | awk -F:'{print $2}' |sort -g -r > contig.txt

Example content of test_reads.fa.contigs.vcake_output:

>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA

Expected output

>Contig_0 99995
>Contig_11 42


Solution 1:[1]

With your shown samples, please try the following awk + sort solution.

awk -F'[: ]' '/^>/{print $1,$3}' Input_file | sort -nrk2

Explanation: awk reads Input_file with the field separator set to either a colon or a space. For every line that starts with >, it prints the 1st and 3rd fields (the contig name and the length). That output is piped to sort as standard input, and sort orders the lines by the 2nd field, numerically and in descending order (-nrk2).
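A minimal end-to-end sketch of this pipeline, run against the two sample headers from the question. The temporary file and the variable names input and result are assumptions made for illustration, not part of the original answer.

```shell
# Create a throwaway copy of the sample input from the question.
input=$(mktemp)
cat > "$input" <<'EOF'
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
EOF

# Split on ':' or ' ', keep only header lines, and print the contig
# name ($1) and the length ($3). Then sort numerically (-n), in
# reverse (-r), starting at the 2nd column (-k2).
result=$(awk -F'[: ]' '/^>/{print $1,$3}' "$input" | sort -nrk2)
printf '%s\n' "$result"

# Clean up the temporary file.
rm -f "$input"
```

With this sample input, the printed lines should match the expected output shown in the question.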

Solution 2:[2]

Here is a GNU awk (gawk) solution that does it all in a single command without invoking sort:

awk -F '[:[:blank:]]' '
$2 == "length" {arr[$1] = $3}
END {
   PROCINFO["sorted_in"] = "@val_num_desc"   # traverse arr by value (length), numeric, descending
   for (i in arr)
      print i, arr[i]
}' file

>Contig_0 99995
>Contig_11 42

Solution 3:[3]

Perhaps this, combining grep and awk:

awk -F '[ :]' '$2 == "length" {print $1, $3}' file | sort ...
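A small sketch of what the $2 == "length" test buys over a plain /^>/ match: a header whose key merely ends in "length" is skipped. The sample headers and the variable names headers and filtered are assumptions for illustration, and the sort stage is left out because the answer elides its options.

```shell
# Sample headers; the third one uses a different key and should be skipped.
headers='>Contig_11 length:42
>Contig_0 length:99995
>Contig_837 ignore-this-length:1000000'

# With space or ':' as separators, $2 is exactly the key name, so only
# headers whose key is literally "length" are printed.
filtered=$(printf '%s\n' "$headers" | awk -F '[ :]' '$2 == "length" {print $1, $3}')
printf '%s\n' "$filtered"
```

Only the two "length" headers survive the filter, still in input order, ready to be piped to sort.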

Solution 4:[4]

Assumptions:

  • if more than one row has the same length then additionally sort the 1st column using 'version' sort

Adding some additional lines to the sample input:

$ cat test_reads.fa.contigs.vcake_output
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_17 length:93
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_837 ignore-this-length:1000000
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_8 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT

One sed/sort idea:

$ sed -rn 's/(>[^ ]+) length:(.*)$/\1 \2/p' test_reads.fa.contigs.vcake_output | sort -k2,2nr -k1,1V

Where:

  • -rn - enable extended regex support (-r in GNU sed; -E is the portable spelling) and suppress normal printing of input data
  • (>[^ ]+) - (1st capture group) - > followed by 1 or more non-space characters
  • <space>length: - a literal space followed by length:
  • (.*) - (2nd capture group) - 0 or more characters (following the colon)
  • $ - end of line
  • \1 \2/p - print 1st capture group + <space> + 2nd capture group
  • -k2,2nr - sort by 2nd (space-delimited) field in reverse numeric order
  • -k1,1V - sort by 1st (space-delimited) field in Version order

This generates:

>Contig_0 99995
>Contig_17 93
>Contig_8 42
>Contig_11 42

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 anubhava
Solution 3 glenn jackman
Solution 4