Remove part of string after "."
I am working with NCBI Reference Sequence accession numbers like those in variable a:
a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2")
To get information from the biomaRt package, I need to remove the .1, .2, etc. after the accession numbers. I normally do this with this code:
b <- sub("..*", "", a)
# [1] "" "" "" "" "" ""
But as you can see, this isn't the correct way for this variable. Can anyone help me with this?
Solution 1:[1]
You just need to escape the period:
a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2")
gsub("\\..*","",a)
[1] "NM_020506" "NM_020519" "NM_001030297" "NM_010281" "NM_011419" "NM_053155"
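To make the reason for the escape concrete, here is a minimal sketch comparing the unescaped pattern from the question with two ways of matching a literal period. The shortened vector and the variable names unescaped, escaped and bracketed are only for illustration.

```r
a <- c("NM_020506.1", "NM_001030297.2")

# Unescaped: "." matches any character, so "..*" matches the whole string
# and everything is replaced with "".
unescaped <- sub("..*", "", a)

# Escaped with a backslash: "\\." matches only a literal period.
escaped <- sub("\\..*", "", a)

# Equivalent alternative: a period inside a character class is also literal.
bracketed <- sub("[.].*", "", a)

print(unescaped)
print(escaped)
print(bracketed)
```

Since each string needs only one replacement, sub and gsub give the same result here.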
Solution 2:[2]
We can pretend they are filenames and remove extensions:
tools::file_path_sans_ext(a)
# [1] "NM_020506" "NM_020519" "NM_001030297" "NM_010281" "NM_011419" "NM_053155"
Solution 3:[3]
You could do:
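sub("\\.[0-9]+$", "", a)  # removes a trailing period followed by one or more digits
or
library(stringr)
str_sub(a, start=1, end=-3)  # drops the last two characters; assumes a one-digit version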
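Fixed-width trimming and a pattern that matches a single digit both assume the version number is one digit long. The sketch below shows what happens otherwise, using a hypothetical two-digit version that does not appear in the question. The substr call is a base R stand-in for the str_sub call above.

```r
v <- c("NM_020506.1", "NM_020506.12")  # second ID is hypothetical

# Matches the period and exactly one digit; a second digit is left behind.
one_digit <- sub("\\.[0-9]", "", v)

# Drops the last two characters regardless of what they are.
fixed_width <- substr(v, 1, nchar(v) - 2)

# Matches the period and all trailing digits up to the end of the string.
anchored <- sub("\\.[0-9]+$", "", v)

print(one_digit)
print(fixed_width)
print(anchored)
```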
Solution 4:[4]
If the strings were all of a fixed length, substr from base R could be used directly. Since they are not, we can get the position of the . with regexpr and use that in substr:
substr(a, 1, regexpr("\\.", a)-1)
#[1] "NM_020506" "NM_020519" "NM_001030297" "NM_010281" "NM_011419" "NM_053155"
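One thing to watch with this approach: regexpr returns -1 when there is no match, so an accession without a version suffix would come back as an empty string. A minimal sketch of a guard follows, assuming such IDs can occur in the input. The second value and the names pos, naive and safe are hypothetical.

```r
x <- c("NM_020506.1", "NM_020506")  # second ID (no version) is hypothetical

# Position of the first literal period; -1 where there is none.
# as.integer() drops the match attributes so they do not carry over.
pos <- as.integer(regexpr(".", x, fixed = TRUE))

# Without a guard, the unversioned ID becomes "".
naive <- substr(x, 1, pos - 1)

# With a guard, strings that contain no period are returned unchanged.
safe <- ifelse(pos > 0, substr(x, 1, pos - 1), x)

print(naive)
print(safe)
```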
Solution 5:[5]
We can use a lookahead regex to extract the part of each string before the .
library(stringr)
str_extract(a, ".*(?=\\.)")
[1] "NM_020506" "NM_020519" "NM_001030297" "NM_010281"
[5] "NM_011419" "NM_053155"
Solution 6:[6]
Another option is to use str_split from stringr:
library(stringr)
str_split(a, "\\.", simplify=TRUE)[,1]
[1] "NM_020506" "NM_020519" "NM_001030297" "NM_010281" "NM_011419" "NM_053155"
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Hansi |
Solution 2 | zx8754 |
Solution 3 | johannes |
Solution 4 | akrun |
Solution 5 | benson23 |
Solution 6 | user438383 |