Remove parentheses and text within from strings in R
In R, I have a list of companies such as:
companies <- data.frame(Name=c("Company A Inc (COMPA)","Company B (BEELINE)", "Company C Inc. (Coco)", "Company D Inc.", "Company E"))
I want to remove the parentheses and the text inside them, ending up with the following list:
Name
1 Company A Inc
2 Company B
3 Company C Inc.
4 Company D Inc.
5 Company E
One approach I tried was to split the string and then use ldply:
companies$Name <- as.character(companies$Name)
c<-strsplit(companies$Name, "\\(")
ldply(c)
But because not all company names have parentheses portions, it fails:
Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) :
Results do not have equal lengths
I'm not married to the strsplit solution. Whatever removes that text and the parentheses would be fine.
Solution 1:[1]
A gsub should work here:
gsub("\\s*\\([^\\)]+\\)","",as.character(companies$Name))
# or using "raw" strings as of R 4.0
gsub(r"{\s*\([^\)]+\)}","",as.character(companies$Name))
# [1] "Company A Inc" "Company B" "Company C Inc."
# [4] "Company D Inc." "Company E"
Here we just replace each occurrence of "(...)" with nothing (also removing any leading whitespace). R makes it look worse than it is because of all the escaping: parentheses are special characters in regular expressions, and backslashes must themselves be doubled inside ordinary R strings.
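As a minimal sketch, the same idea can be wrapped in a small helper that writes the cleaned names back into the data frame. The name strip_parens is hypothetical (it is not a base R function), and the pattern uses [^)]* instead of [^\\)]+ so that an empty "()" is removed as well:

```r
# Hypothetical helper: drop every "(...)" group plus the whitespace before it.
strip_parens <- function(x) {
  # as.character() guards against factor columns (the default before R 4.0)
  out <- gsub("\\s*\\([^)]*\\)", "", as.character(x))
  trimws(out)  # remove stray leading/trailing whitespace
}

companies <- data.frame(
  Name = c("Company A Inc (COMPA)", "Company B (BEELINE)",
           "Company C Inc. (Coco)", "Company D Inc.", "Company E"),
  stringsAsFactors = FALSE
)
companies$Name <- strip_parens(companies$Name)
companies$Name
```

The trimws() call only matters when a string starts with a parenthesized group, where removing the group would otherwise leave a leading space.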
Solution 2:[2]
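You could use stringr::str_replace. It's nice because it accepts factor variables. Note that str_replace replaces only the first match in each string; str_replace_all replaces every match.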
companies <- data.frame(Name=c("Company A Inc (COMPA)","Company B (BEELINE)",
"Company C Inc. (Coco)", "Company D Inc.",
"Company E"))
library(stringr)
str_replace(companies$Name, " \\s*\\([^\\)]+\\)", "")
# [1] "Company A Inc" "Company B" "Company C Inc."
# [4] "Company D Inc." "Company E"
And if you still want to use strsplit, you could do:
companies$Name <- as.character(companies$Name)
unlist(strsplit(companies$Name, " \\(.*\\)"))
# [1] "Company A Inc" "Company B" "Company C Inc."
# [4] "Company D Inc." "Company E"
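One caveat: unlist() only lines up with the rows when every name splits into exactly one piece, which stops being true when text follows the parentheses (for example "A (x) B"). A more defensive sketch, assuming you only want the text before the first " (", keeps the first piece of each split. The third input below is made up for illustration:

```r
names_chr <- c("Company A Inc (COMPA)", "Company D Inc.",
               "Company F (OLD) Holdings (NEW)")  # last entry is made up

pieces <- strsplit(names_chr, " \\(")
# vapply() returns exactly one string per input, whatever the number of pieces
first_piece <- vapply(pieces, function(p) p[1], character(1))
first_piece
```

Because vapply() is told to expect character(1), it fails loudly rather than silently returning a vector of the wrong length.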
Solution 3:[3]
You could also use:
library(qdap)
companies$Name <- genX(companies$Name, " (", ")")
companies
Name
1 Company A Inc
2 CompanyB
3 Company C Inc.
4 Company D Inc.
5 CompanyE
Solution 4:[4]
If the parentheses are paired and balanced, you can use
gsub("\\s*(\\([^()]*(?:(?1)[^()]*)*\\))", "", x, perl=TRUE)
Applied to the example data:
companies <- data.frame(Name=c("Company A Inc (COMPA)","Company B (BEELINE)", "Company C Inc. (Coco)", "Company D Inc.", "Company E"))
gsub("\\s*(\\([^()]*(?:(?1)[^()]*)*\\))", "", companies$Name, perl=TRUE)
Output:
[1] "Company A Inc" "Company B" "Company C Inc." "Company D Inc."
[5] "Company E"
Regex details
- \s* - zero or more whitespaces
- (\([^()]*(?:(?1)[^()]*)*\)) - Capturing group 1 (required to recurse the pattern part between parentheses):
  - \( - a ( char
  - [^()]* - zero or more chars other than ( and )
  - (?:(?1)[^()]*)* - zero or more occurrences of the whole Group 1 pattern ((?1) is a regex subroutine recursing Group 1 pattern) and then zero or more chars other than ( and )
  - \) - a ) char.
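The example data has no nested parentheses, so the recursion never actually fires there. A small sketch with made-up inputs shows the pattern handling a nested group and two separate groups:

```r
pattern <- "\\s*(\\([^()]*(?:(?1)[^()]*)*\\))"

x <- c("Company F (Old (Ltd) Name) Inc",  # nested parentheses (made-up input)
       "Company G (A) and (B)")           # two separate groups (made-up input)

# perl = TRUE is required: (?1) is PCRE syntax, not supported by the default engine
result <- gsub(pattern, "", x, perl = TRUE)
result
```

The first string becomes "Company F Inc" and the second "Company G and", since each balanced group is removed together with the whitespace in front of it.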
Solution 5:[5]
In your case, you get the desired result when you remove everything starting from the first " (".
sub(" \\(.*", "", companies$Name)
#[1] "Company A Inc" "Company B" "Company C Inc." "Company D Inc." "Company E"
To remove parentheses and the text within them from a string, you can use:
sub("\\(.*)", "", c("ab (cd) ef", "(ij) kl"))
#[1] "ab  ef" " kl"
If there is more than one pair of parentheses:
gsub("\\(.*?)", "", c("ab (cd) ef (gh)", "(ij) kl"))
#[1] "ab  ef " " kl"
( needs to be escaped as \\(, . means any character, * means repeated 0 to n times, and ? makes the match non-greedy, so that it does not remove everything from the first ( to the last ).
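The difference between the greedy and the non-greedy quantifier is easiest to see on a string with two groups. This is a small sketch with a made-up input; the closing parenthesis is written as \\) here, which is equivalent in the default engine:

```r
x <- "ab (cd) ef (gh) ij"

greedy <- sub("\\(.*\\)", "", x)    # .* runs on to the LAST ")"
lazy   <- gsub("\\(.*?\\)", "", x)  # .*? stops at the NEXT ")"

greedy
lazy
```

The greedy version also swallows " ef " because it matches from the first ( to the last ), while the lazy version removes the two groups separately and keeps the text between them.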
As an alternative, you can use [^)], which matches any character except ).
sub("\\([^)]*)", "", c("ab (cd) ef", "(ij) kl"))
#[1] "ab  ef" " kl"
gsub("\\([^)]*)", "", c("ab (cd) ef (gh)", "(ij) kl"))
#[1] "ab  ef " " kl"
If there are nested parentheses:
gsub("\\(([^()]|(?R))*\\)", "", c("ab ((cd) ef) gh (ij)", "(ij) kl"), perl=TRUE)
#[1] "ab  gh " " kl"
Here (?R) recurses into the whole pattern. For example, a(?R)?z matches one or more letters a followed by exactly the same number of letters z.
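A short sketch of that balanced-count behaviour, using made-up test strings. It uses (?1) rather than (?R): the mechanism is the same, but (?1) recurses only into group 1, so the ^ and $ anchors stay outside the recursion:

```r
x <- c("az", "aazz", "aaazzz", "aaz", "azz")

# group 1 is a(?1)?z: an "a", optionally the whole group again, then a "z"
balanced <- grepl("^(a(?1)?z)$", x, perl = TRUE)
balanced
```

Only the strings with equal numbers of a and z match, which is exactly what lets the nested-parentheses pattern pair each ( with its own ).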
Solution 6:[6]
library(qdap)
bracketX(companies$Name) -> companies$Name
Solution 7:[7]
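Another gsub solution: replace the term in the parentheses, preceded by optional whitespace, with "" (i.e. the empty string). Note that \\w+ only matches letters, digits, and underscores, so parenthesized text containing spaces or punctuation is left untouched.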
gsub("(\\s*\\(\\w+\\))", "", companies$Name)
[1] "Company A Inc" "Company B" "Company C Inc." "Company D Inc."
[5] "Company E"
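Whether \\w+ is the right choice depends on what can appear inside the parentheses. A quick sketch with made-up names compares it with the [^)]+ class used in the earlier solutions:

```r
x <- c("Company B (BEELINE)", "Company F (New York)", "Company G (A-1)")

word_only <- gsub("\\s*\\(\\w+\\)", "", x)    # letters, digits, underscore only
any_char  <- gsub("\\s*\\([^)]+\\)", "", x)   # anything except ")"

word_only
any_char
```

With \\w+ only the first name is cleaned, because the other two contain a space or a hyphen inside the parentheses; [^)]+ cleans all three.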
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Gregor Thomas |
Solution 3 | |
Solution 4 | Wiktor Stribiżew |
Solution 5 | |
Solution 6 | Thushara Dulam |
Solution 7 | Eyayaw |