Splitting strings by list of separators irrespective of order
I have a string text and a list names. I want to split text every time an element of names occurs.
text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
desired output:
output = [['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]
FAQ
- text does not always start with a names element. Thanks to VictorLee for pointing that out. I don't care about that leading part, but others may, so thanks to the people answering "both cases".
- The order of the separators within names is independent of their occurrence in text.
- Separators within names are unique but can occur multiple times throughout text. Therefore the output will have more lists than names has strings.
- text will never have the same unique names element occurring twice consecutively.
- Ultimately I want the output to be a list of lists where each split text slice corresponds to the separator it was split by. The order of the lists doesn't matter.
re.split() won't let me use a list as a separator argument. Can I re.compile() my separator list?
Update: Thomas's code works best for my case, but I noticed one caveat I hadn't realized before: some of the elements of names are preceded by 'Mrs.' or 'Mr.', while only some of the corresponding matches in text are preceded by 'Mrs.' or 'Mr.'
so far:
import re

names = ['Mr. Mike, ADS', 'Monika, TFO', 'Peter, WQR']
text1 = ['Mrs. Monika, TFO goes shopping. Then she rides bike. Mike, ADS likes Pizza. Monika, TFO hates me.']
text = str(text1)[1:-1]

def create_regex_string(name: str) -> str:
    name_components = name.split()
    if len(name_components) == 1:
        return re.escape(name)
    salutation, *name = name_components
    return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"

regex_string = "|".join(create_regex_string(name) for name in names)
group_count = regex_string.count("(") + 1
fragments = re.split(f"({regex_string})", text)

if fragments:
    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names:
        fragments = fragments[1:]
    result = [[name, clist.rstrip()] for name, clist in zip(
        fragments[::group_count+1],
        fragments[group_count::group_count+1]
        ) if clist is not None
    ]
    print(result)
[['Monika, TFO', ' goes shopping. Then she rides bike.'], ['Mike, ADS', ' likes Pizza.'], ['Monika, TFO', " hates me.'"]]
error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [86], in <module>
111 salutation, *name = name_components
112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
115 group_count = regex_string.count("(") + 1
116 fragments = re.split(f"({regex_string})", clist)
Input In [86], in <genexpr>(.0)
111 salutation, *name = name_components
112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
115 group_count = regex_string.count("(") + 1
116 fragments = re.split(f"({regex_string})", clist)
Input In [86], in create_regex_string(name)
109 if len(name_components) == 1:
110 return re.escape(name)
--> 111 salutation, *name = name_components
112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
ValueError: not enough values to unpack (expected at least 1, got 0)
Solution 1:[1]
If you are looking for a way to use regular expressions, then:
import re

def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)
    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))
    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s.rstrip()))) for s in strings]

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))
Prints:
[['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]
Explanation
First we dynamically create a regex, regex1, from the passed names argument:
(?=Mike|Monika)
Because any of the passed names may appear at the beginning or end of the input, splitting on this regex can produce empty strings in the result, so we filter those out and get:
['Monika goes shopping. Then she rides bike. ', 'Mike likes Pizza. ', 'Monika hates me.']
Then we split each of these strings on:
(Mike|Monika)
And again we filter out any possible empty strings to get our final result.
The key to all of this is that when our regex on which we split contains a capture group, the text of that capture group is also returned as part of the resulting list.
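To make the two splitting steps concrete, here is a minimal sketch with a shortened version of the example text. It assumes Python 3.7 or newer, where re.split can split on zero-width matches such as a lookahead; the variable names chunks and pairs are illustrative only.

```python
import re

names = ['Mike', 'Monika']
text = 'Monika goes shopping. Mike likes Pizza.'
joined_names = '|'.join(re.escape(name) for name in names)

# Step 1: the lookahead matches the empty position *before* each name,
# so every name stays attached to the text that follows it.
chunks = re.split('(?=' + joined_names + ')', text)
print(chunks)
# ['', 'Monika goes shopping. ', 'Mike likes Pizza.']

# Step 2: a capturing group makes re.split return the separator as well.
pairs = [re.split('(' + joined_names + ')', chunk) for chunk in chunks if chunk]
print(pairs)
# [['', 'Monika', ' goes shopping. '], ['', 'Mike', ' likes Pizza.']]
```

The empty strings visible in both results are the ones the filter calls in do_split remove.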
Update
You did not specify what should occur if the input text does not begin with one of the names. On the assumption that you might want to ignore all of the string until you find one of the names, check out the following version. Likewise, if the text does not contain any of the names, the updated code will just return an empty list:
import re

def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)
    # raw string so that \s and \S are not treated as string escapes
    regex0 = re.compile('(' + joined_names + r')[\s\S]*')
    m = regex0.search(text)
    if not m:
        return []
    # keep only the part of the text starting at the first name
    text = m.group(0)
    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))
    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s.rstrip()))) for s in strings]

text = 'I think Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))
Prints:
[['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]
Solution 2:[2]
As an alternative to regular expressions, you could also rewrite text into a format that gives the expected result with the split method, plus some string post-processing.
# works on Python 2 or Python 3; the nested loops cost O(n*m) for n words and m names
def do_split(text, names):
    my_sprt = '|'
    tmp_text_arr = text.split()
    # wrap every word that equals a name in the marker
    for i in range(len(tmp_text_arr)):
        for sprt in names:
            if sprt == tmp_text_arr[i]:
                tmp_text_arr[i] = my_sprt + sprt + my_sprt
    tmp_text = ' '.join(tmp_text_arr)
    if tmp_text.startswith(my_sprt):
        tmp_text = tmp_text[1:]
    tmp_text_arr = tmp_text.split(my_sprt)
    # drop leading text that is not a name
    if tmp_text_arr[0] not in names:
        tmp_text_arr.pop(0)
    out_arr = []
    for i in range(0, len(tmp_text_arr) - 1, 2):
        out_arr.append([tmp_text_arr[i], tmp_text_arr[i + 1].rstrip()])
    return out_arr

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
# overrides the line above: same text with a leading word that is not a name
text = 'today Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))
This code also handles text that does not start with an element of names.
Key point: reformat the text value to |Monika| goes shopping. Then she rides bike. |Mike| likes Pizza. |Monika| hates me.
using a self-defined separator such as |, which must not occur in the original text.
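Because the approach relies on the marker never appearing in the input, that assumption can be made explicit with a small guard. The following is only a sketch: the helper name pick_marker and its candidate list are hypothetical and not part of the answer above.

```python
def pick_marker(text, candidates=('|', '\x00', '\x01')):
    """Return the first candidate marker that does not occur in text."""
    for marker in candidates:
        if marker not in text:
            return marker
    raise ValueError('no unused marker available for this text')


text = 'Monika goes shopping | Mike likes Pizza.'
marker = pick_marker(text)  # '|' already occurs here, so '\x00' is chosen
marked = text.replace('Mike', marker + 'Mike' + marker)
parts = marked.split(marker)
print(parts)
# ['Monika goes shopping | ', 'Mike', ' likes Pizza.']
```

With such a guard, a literal | in the text is kept as ordinary content instead of being mistaken for a separator boundary.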
Solution 3:[3]
I took one of your given solutions and slightly refactored it.
def split(txt, seps, actual_sep='\1'):
    order = [item for item in txt.split() if item in seps ]
    for sep in seps:
        txt = txt.replace(sep, actual_sep)
    return list( zip( order, [i.strip() for i in txt.split(actual_sep) if bool(i.strip())] ) )

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print( split(text, names) )
EDITED
Another solution that accounts for some of the edge cases mentioned here.
def split(txt, seps, sep_pack='\1'):
    for sep in seps:
        txt = txt.replace(sep, f"{sep_pack}{sep}{sep_pack}")
    lst = txt.split(sep_pack)
    temp = []
    idx = 0
    for _ in range(len(lst)):
        if idx < len(lst):
            if lst[idx] in seps:
                temp.append( [lst[idx], lst[idx+1]] )
                idx+=2
            else:
                temp.append( ['', lst[idx]] )
                idx+=1
    return temp
Kinda ugly though, looking to improve.
Solution 4:[4]
This is in a similar vein to some answers here, but simpler.
There are three steps:
1. Find all occurrences of a separator
2. Split apart the remaining text
3. Combine the results from (1) and (2) into a list of lists, as desired
We can combine (1) and (2) but it makes creating the list of lists more complicated.
import re

def split_on_names(names: list[str], text: str) -> list[list[str]]:
    pattern = re.compile("|".join(map(re.escape, names)))

    # step 1: find the separators (in order)
    separator = pattern.findall(text)

    # step 2: split out the text between separators
    remainder = list(filter(None, pattern.split(text)))

    # at this point, if `remainder` is longer, it's because `text`
    # didn't start with a separator. So, we add a blank separator
    # to account for the prefix.
    if len(remainder) > len(separator):
        separator = ["", *separator]

    # step 3: reshape the results into a list of lists
    return list(map(list, zip(separator, remainder)))

names = ["Mike", "Monika"]
text = "Hi Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me."
split_on_names(names, text)

# output:
#
# [
#     ['', 'Hi '],
#     ['Monika', ' goes shopping. Then she rides bike. '],
#     ['Mike', ' likes Pizza. '],
#     ['Monika', ' hates me.']
# ]
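For comparison, combining steps (1) and (2) into a single scan might look like the sketch below. It is an illustration rather than part of this answer: the function name split_on_names_single_pass is made up, and it uses re.finditer so that each match's start and end positions are available for slicing.

```python
import re


def split_on_names_single_pass(names, text):
    """Pair each separator with the text that follows it, in one scan."""
    pattern = re.compile("|".join(map(re.escape, names)))
    matches = list(pattern.finditer(text))
    result = []
    # text before the first separator is kept under a blank separator
    first_start = matches[0].start() if matches else len(text)
    if text[:first_start]:
        result.append(["", text[:first_start]])
    # each slice runs from the end of one match to the start of the next
    for current, following in zip(matches, matches[1:] + [None]):
        stop = following.start() if following else len(text)
        result.append([current.group(), text[current.end():stop]])
    return result


names = ["Mike", "Monika"]
text = "Hi Monika goes shopping. Mike likes Pizza."
pairs = split_on_names_single_pass(names, text)
print(pairs)
# [['', 'Hi '], ['Monika', ' goes shopping. '], ['Mike', ' likes Pizza.']]
```

The index bookkeeping is the extra complexity referred to above. One behavioural difference: this sketch does not filter out empty slices, so a text that ends with a name yields a pair with an empty string.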
Solution 5:[5]
You could use re.split along with zip:
import re
from pprint import pprint

text = "Monika goes shopping. Then she rides bike. Mike likes Pizza." \
       "Monika hates me."
names = ["Henry", "Mike", "Monika"]

regex_string = "|".join(re.escape(name) for name in names)
fragments = re.split(f"({regex_string})", text)

if fragments:
    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names:
        fragments = fragments[1:]

    result = [
        [name, text.rstrip()]
        for name, text in zip(fragments[::2], fragments[1::2])
    ]
    pprint(result)
Output:
[['Monika', ' goes shopping. Then she rides bike.'],
['Mike', ' likes Pizza.'],
['Monika', ' hates me.']]
Notes:
- This is an answer for question revision 9.
- There is an update at the very end of this answer considering the changes in question revision 11.
- You don't specify if "text" before the first occurrence of a name should be considered or not.
  - The script above ignores "text" before the first occurrence.
- You also don't specify what happens if the text ends with a name.
  - The script above will include the occurrence by adding an empty string. However, this can easily be solved by removing the last element if the "text" is an empty string.
- zip works because there is always an even number of elements in fragments. We remove the first element if it does not match a name (either text or empty string), and the last element is always an empty string if the text ends with a name.
- According to the re.split documentation:
  If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string [...]
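The quoted behaviour is easy to check directly. The short sketch below uses two made-up sample strings (not taken from the question) and shows the empty string that re.split adds at the start or end when the captured separator sits at the edge of the text.

```python
import re

pattern = "(Mike|Monika)"

# separator at the very start: the result begins with an empty string
at_start = re.split(pattern, "Monika hates me.")
print(at_start)
# ['', 'Monika', ' hates me.']

# separator at the very end: the result ends with an empty string
at_end = re.split(pattern, "I like Mike")
print(at_end)
# ['I like ', 'Mike', '']
```

With a single capturing group the result always has an odd number of elements (text, name, text, ...), so dropping the first element leaves complete name and text pairs for zip.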
Here is the same example but not ignoring "text" before the first occurrence:
import re

text = "Hi. Monika goes shopping. Then she rides bike. Mike likes Pizza." \
       "Monika hates me."
names = ["Henry", "Mike", "Monika"]

regex_string = "|".join(re.escape(name) for name in names)
fragments = re.split(f"({regex_string})", text)

if fragments:
    # not ignoring text before first occurrence; use empty string as name
    if fragments[0].strip() == "":
        fragments = fragments[1:]
    elif not fragments[0] in names:
        fragments = [""] + fragments

    result = [
        [name, text.rstrip()]
        for name, text in zip(fragments[::2], fragments[1::2])
    ]

    # # remove empty text
    # if result and not result[-1][1]:
    #     result = result[:-1]

    print(result)  # [['', 'Hi.'], ['Monika', ...] ..., ['Monika', ' hates me.']]
Update for Question Revision 11
The following is an attempt to include id345678's additional requirement:
import re
from pprint import pprint

def create_regex_string(name: str) -> str:
    name_components = name.split()
    if len(name_components) == 1:
        return re.escape(name)
    # expects exactly two components: a salutation and a name
    salutation, name_part = name_components
    return f"({re.escape(salutation)} )?{re.escape(name_part)}"

text = "Monika goes shopping. Then she rides bike. Dr. Mike likes Pizza. " \
       "Mrs. Monika hates me. Henry needs a break."
names = ["Henry", "Dr. Mike", "Mrs. Monika"]

regex_string = "|".join(create_regex_string(name) for name in names)
# one group per optional salutation, plus the outer group added below
group_count = regex_string.count("(") + 1
fragments = re.split(f"({regex_string})", text)

if fragments:
    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names:
        fragments = fragments[1:]

    result = [
        [name, text.rstrip()]
        for name, text in zip(
            fragments[::group_count+1],
            fragments[group_count::group_count+1]
        )
    ]
    pprint(result)
Output:
[['Monika', ' goes shopping. Then she rides bike.'],
['Dr. Mike', ' likes Pizza.'],
['Mrs. Monika', ' hates me.'],
['Henry', ' needs a break.']]
Notes:
- the final regex string is then (Henry|(Dr\. )?Mike|(Mrs\. )?Monika)
  - e.g. create_regex_string("Mrs. Monika") creates (Mrs\. )?Monika
  - it will also work for other salutations (as long as there is one space separating the salutation from the name)
- because we introduced an additional grouping in the regex, fragments has more values
  - therefore, we needed to change the line with zip so that the step is computed dynamically
- and if you don't want the salutation in the result, you can use name.split()[-1] when creating result:
result = [
    [name.split()[-1], text.rstrip()]
    for name, text in zip(
        fragments[::group_count+1],
        fragments[group_count::group_count+1]
    )
]
# [['Monika', ' goes shopping. Then she rides bike.'],
#  ['Mike', ' likes Pizza.'],
#  ['Monika', ' hates me.'],
#  ['Henry', ' needs a break.']]
Please note: I have not tested all use cases as I updated the script on my break time. Let me know if there are issues and then I will look into it when I am off work.
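One possible variation, sketched here under stated assumptions rather than as a tested replacement: writing the optional salutation as a non-capturing group (?:...) keeps re.split from emitting extra fragments, so the plain step of 2 still works. The SALUTATIONS tuple and the function name split_on_names are hypothetical. The sketch also skips empty entries in names, because name.split() returns an empty list for an empty or whitespace-only string, which is the situation that produces the "not enough values to unpack" ValueError shown in the question.

```python
import re

SALUTATIONS = ("Mr.", "Mrs.", "Dr.")  # assumed set; extend as needed


def create_regex_string(name):
    """Return a pattern for one name; a leading salutation becomes optional."""
    first, _, rest = name.partition(" ")
    if first in SALUTATIONS and rest:
        # (?:...) does not capture, so re.split adds no extra fragments
        return f"(?:{re.escape(first)} )?{re.escape(rest)}"
    return re.escape(name)


def split_on_names(names, text):
    # skip empty or whitespace-only entries instead of failing on them
    patterns = [create_regex_string(n) for n in names if n.strip()]
    if not patterns:
        return []
    fragments = re.split("(" + "|".join(patterns) + ")", text)
    # fragments[0] is the text before the first match and is ignored here
    return [
        [name, chunk.rstrip()]
        for name, chunk in zip(fragments[1::2], fragments[2::2])
    ]


text = ("Monika goes shopping. Then she rides bike. Dr. Mike likes Pizza. "
        "Mrs. Monika hates me. Henry needs a break.")
names = ["Henry", "Dr. Mike", "Mrs. Monika", ""]
result = split_on_names(names, text)
print(result)
```

Because only known salutations are made optional, names with several words such as 'Monika, TFO' are matched as a whole.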
Solution 6:[6]
Your example doesn't fully match your desired output. Also, it's not clear if the example input will always have this structure, e.g. with a period at the end of each sentence.
Having said that, you might want to try this dirty approach:
import re

text = 'Monika will go shopping. Mike likes Pizza. Monika hates me.'
names = ['Ruth', 'Mike', 'Monika']

rsplit = re.compile("|".join(sorted(names))).split
output = []
sentences = text.split(".")
for name in names:
    for sentence in sentences:
        if name in sentence:
            output.append([name, f"{rsplit(sentence)[-1]}."])
print(output)
This outputs:
[['Mike', ' likes Pizza.'], ['Monika', ' will go shopping.'], ['Monika', ' hates me.']]
Solution 7:[7]
This is without re, in case you don't explicitly need to use it. It works for the given test case.
text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']

def sep(text, names):
    foo = []
    new_text = text.split(' ')
    for i in new_text:
        if i in names:
            foo.append(new_text[:new_text.index(i)])
            new_text = new_text[new_text.index(i):]
    foo.append(new_text)
    foo = foo[1:]
    new_foo = []
    for i in foo:
        first, rest = i[0], i[1:]
        rest = " ".join(rest)
        i = [first, rest]
        new_foo.append(i)
    print(new_foo)

sep(text, names)
Gives the output:
[['Monika', 'goes shopping. Then she rides bike.'], ['Mike', 'likes Pizza.'], ['Monika', 'hates me.']]
It should work for other cases too.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | |
Solution 4 | onepan |
Solution 5 | |
Solution 6 | baduker |
Solution 7 | |