awk split on first occurrence of character

I am trying to use awk to split each line. If a line contains more than one p or q, the split on ( does not work correctly (line 2 is an example). I have not been able to ignore the extra p or q when there is more than one occurrence. I tried ^pq, but that did not produce the desired output. Thank you :).

file

1p11.2(120785011_120793480)x3   
1q12q21.1(143192432_143450240)x1~2

awk

awk '{split($0,a,"[pq(_]"); print "id"a[1],a[3]}' file

current

id1 120785011
id1 21.1

desired

id1 120785011
id1 143192432

Solution 1:[1]

another awk

$ awk -F'[(_]' '{split($0,a,"[pq]"); print "id"a[1],$2}' file

id1 120785011
id1 143192432

Since you don't control the number of p and q characters in the line, use two different splits: the field delimiter to find the value, and split() to get the id.
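To see why the two splits do not interfere with each other, here is a small sketch that prints the intermediate values next to the result. The function name extract_start is made up for illustration; the awk code itself is the command above with one extra printf.

```shell
# Hypothetical wrapper (the name is an assumption, not part of the answer).
extract_start() {
  awk -F'[(_]' '{
      split($0, a, "[pq]")                  # a[1] = text before the first p or q
      printf "$2=%s a[1]=%s -> ", $2, a[1]  # show the intermediate values
      print "id" a[1], $2                   # $2 = text between "(" and "_"
  }'
}

# Sample line with two q characters, taken from the question.
printf '%s\n' '1q12q21.1(143192432_143450240)x1~2' | extract_start
```

Because -F only looks at ( and _, the number of p and q characters before the parenthesis has no effect on $2.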

Solution 2:[2]

The split function returns the number of fields, so we can take advantage of that and count back from the end:

{
    n = split($0, a, /[pq(_]/)
    printf "id%s %s\n", a[1], a[n-1]
}

outputs

id1 120785011
id1 143192432
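As a rough illustration of the counting, the sketch below prints n and a[n-1] for both sample lines. The function name show_split is hypothetical. Note that this approach assumes the text after the _ contains no further p, q, ( or _ characters, which holds for the sample data.

```shell
# Hypothetical helper (name is an assumption): shows how many pieces split()
# produced and which piece a[n-1] selects.
show_split() {
  awk '{
      n = split($0, a, /[pq(_]/)          # n = number of pieces on this line
      printf "n=%d a[n-1]=%s\n", n, a[n-1]
  }'
}

printf '%s\n' '1p11.2(120785011_120793480)x3' \
              '1q12q21.1(143192432_143450240)x1~2' | show_split
```

The first line splits into 4 pieces and the second into 5, but in both cases the start coordinate is the second-to-last piece.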

Solution 3:[3]

Here is something you can do using FS regex itself and keeping awk simple:

awk -F '[(_]|[pq]([^pq]*[pq])*' '{print "id" $1, $3}' file

id1 120785011
id1 143192432

FS regex details

  • [(_]: Match ( or _
  • |: OR
  • [pq]([^pq]*[pq])*: Match p or q, then zero or more repetitions of a run of non-pq characters followed by p or q, so a single separator extends from the first p or q through the last one
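One way to check what the FS regex consumes is to replace every separator match with a visible marker. This is only a debugging sketch; the function name mark_separators is made up, and gsub() is used here purely to display the matches.

```shell
# Hypothetical helper (name is an assumption): replaces each match of the
# FS regex with "|" so the resulting fields are easy to see.
mark_separators() {
  awk '{ gsub(/[(_]|[pq]([^pq]*[pq])*/, "|"); print }'
}

printf '%s\n' '1p11.2(120785011_120793480)x3' \
              '1q12q21.1(143192432_143450240)x1~2' | mark_separators
```

On the second line the whole run q12q collapses into one separator, which is why $3 is the start coordinate on both lines.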

Solution 4:[4]

I'd use sed for this since it's simple substitutions on a single line which is what sed is best for:

$ sed 's/\([^pq]*\)[^(]*(\([^_]*\).*/id\1 \2/' file
id1 120785011
id1 143192432
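For readers who find the backslash-heavy basic regex hard to follow, here is a sketch of the same substitution in extended regex syntax. It assumes a sed that supports the -E option (GNU and BSD sed do); the function name to_id_start is hypothetical.

```shell
# Hypothetical wrapper (name is an assumption). Same substitution as above,
# written with extended regular expressions:
#   ([^pq]*)  group 1: everything before the first p or q (the id)
#   [^(]*\(   skip up to and including the literal "("
#   ([^_]*)   group 2: everything up to the "_" (the start coordinate)
#   .*        discard the rest of the line
to_id_start() {
  sed -E 's/([^pq]*)[^(]*\(([^_]*).*/id\1 \2/'
}

printf '%s\n' '1p11.2(120785011_120793480)x3' \
              '1q12q21.1(143192432_143450240)x1~2' | to_id_start
```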

Solution 5:[5]

UPDATE 1: I realized I could make it even more succinct:

mawk 'sub("^","id")<--NF' FS='[pq][^(]+[(]|[_].+$'

It works even when the input contains empty rows: because sub() runs first, an empty row still becomes one field, so NF is never decremented below zero, which would trigger an error message.

=============================================================

An awk-based solution without requiring:

  • further, and redundant, array-splitting, or

  • a back-reference-capable regex engine:

 input :

1p11.2(120785011_120793480)x3   
1q12q21.1(143192432_143450240)x1~2

 command :

  mawk 'sub("^","id",$!(NF*=2<NF))' FS='[pq][^(]+[(]|[_].+$' 

 output :

id1 120785011 
id1 143192432 
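For comparison, here is a more verbose sketch that uses the same FS but spells out the field selection instead of manipulating NF. It is not the author's command, only a readable variant under the same assumptions about the input; the function name first_two_fields is made up. With this FS, everything from the first p or q through the ( is one separator, and everything from the _ to the end of the line is another, so $1 is the id and $2 is the start coordinate.

```shell
# Hypothetical wrapper (name is an assumption). Same FS as above, with the
# output fields named explicitly; the NF pattern skips empty rows.
first_two_fields() {
  awk -F'[pq][^(]+[(]|[_].+$' 'NF { print "id" $1, $2 }'
}

printf '%s\n' '1p11.2(120785011_120793480)x3   ' \
              '' \
              '1q12q21.1(143192432_143450240)x1~2' | first_two_fields
```

Unlike the NF*= variant, this prints no trailing space, because only two fields are written.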

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 karakfa
Solution 2 glenn jackman
Solution 3 anubhava
Solution 4 Ed Morton
Solution 5