Splitting strings by list of separators irrespective of order

I have a string text and a list names

  • I want to split text every time an element of names occurs.

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'

names = ['Mike', 'Monika']

desired output:

output = [['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

FAQ

  • text does not always start with a names element. Thanks to VictorLee for pointing that out. I don't care about that leading part, but others might, so thanks to the people whose answers cover both cases.
  • The order of the separators within names is independent of their occurrence in text.
  • Separators within names are unique but can occur multiple times throughout text. Therefore the output can have more lists than names has strings.
  • text will never have the same names element occurring twice consecutively.
  • Ultimately I want the output to be a list of lists where each slice of text is paired with the separator it was split by. The order of the lists doesn't matter.

re.split() won't let me use a list as a separator argument. Can I re.compile() my separator list?


Update: Thomas's code works best for my case, but I noticed one caveat I hadn't realized before:

Some of the elements of names are preceded by 'Mrs.' or 'Mr.', while only some of the corresponding matches in text are preceded by 'Mrs.' or 'Mr.'


so far:

names = ['Mr. Mike, ADS', 'Monika, TFO', 'Peter, WQR']
text1 = ['Mrs. Monika, TFO goes shopping. Then she rides bike. Mike, ADS likes Pizza. Monika, TFO hates me.']
text = str(text1)[1:-1]

def create_regex_string(name: List[str]) -> str:
    name_components = name.split()
    if len(name_components) == 1:
        return re.escape(name)
    salutation, *name = name_components
    return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
    
regex_string = "|".join(create_regex_string(name) for name in names)
group_count = regex_string.count("(") + 1
fragments = re.split(f"({regex_string})", text)
if fragments:
    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names: 
        fragments = fragments[1:]
        result = [[name, clist.rstrip()] for name, clist in zip(
            fragments[::group_count+1],
            fragments[group_count::group_count+1]
        ) if clist is not None
    ]

print(result)
[['Monika, TFO', ' goes shopping. Then she rides bike.'], ['Mike, ADS', ' likes Pizza.'], ['Monika, TFO', " hates me.'"]]

error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [86], in <module>
    111     salutation, *name = name_components
    112     return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
    115 group_count = regex_string.count("(") + 1
    116 fragments = re.split(f"({regex_string})", clist)

Input In [86], in <genexpr>(.0)
    111     salutation, *name = name_components
    112     return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
    115 group_count = regex_string.count("(") + 1
    116 fragments = re.split(f"({regex_string})", clist)

Input In [86], in create_regex_string(name)
    109 if len(name_components) == 1:
    110     return re.escape(name)
--> 111 salutation, *name = name_components
    112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"

ValueError: not enough values to unpack (expected at least 1, got 0)


Solution 1:[1]

If you are looking for a way to use regular expressions, then:

import re

def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)

    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))

    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s.rstrip()))) for s in strings]

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))

Prints:

[['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

Explanation

First we dynamically create a regex regex1 from the passed names argument to be:

(?=Mike|Monika)

When you split the input on this regex, you could end up with empty strings in the result, because any of the passed names may appear at the beginning or end of the input. We filter those out and get:

['Monika goes shopping. Then she rides bike. ', 'Mike likes Pizza. ', 'Monika hates me.']

Then we split each of these strings on:

(Mike|Monika)

And again we filter out any possible empty strings to get our final result.

The key to all of this is that when our regex on which we split contains a capture group, the text of that capture group is also returned as part of the resulting list.
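The effect of the capture group can be seen in isolation. The following is a minimal sketch with a shortened input; the variable names are chosen for illustration only and are not part of the answer's code.

```python
import re

text = "Monika goes shopping. Mike likes Pizza."
names = ["Mike", "Monika"]
alternation = "|".join(re.escape(name) for name in names)

# Without a capture group the matched separators are discarded.
without_group = re.split(alternation, text)

# With a capture group the matched separators are kept in the result.
with_group = re.split(f"({alternation})", text)

print(without_group)  # ['', ' goes shopping. ', ' likes Pizza.']
print(with_group)     # ['', 'Monika', ' goes shopping. ', 'Mike', ' likes Pizza.']
```

The leading empty string appears in both cases because the text starts with a separator.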

Update

You did not specify what should occur if the input text does not begin with one of the names. On the assumption that you might want to ignore everything in the string until the first name is found, check out the following version. Likewise, if the text does not contain any of the names, the updated code just returns an empty list:

import re

def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)

    regex0 = re.compile('(' + joined_names + r')[\s\S]*')
    m = regex0.search(text)
    if not m:
        return []
    text = m.group(0)

    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))

    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s.rstrip()))) for s in strings]

text = 'I think Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))

Prints:

[['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

Solution 2:[2]

As an alternative to regular expressions, you could also rewrite text into a format that the plain split method can handle, and then post-process the resulting pieces.

# works on Python 2 and Python 3; time complexity is O(n*m) for n words in text and m names
def do_split(text, names):
    my_sprt = '|'
    tmp_text_arr = text.split()
    for i in range(len(tmp_text_arr)):
        for sprt in names:
            if sprt == tmp_text_arr[i]:
                tmp_text_arr[i] = my_sprt + sprt + my_sprt

    tmp_text = ' '.join(tmp_text_arr)
    if tmp_text.startswith(my_sprt):
        tmp_text = tmp_text[1:]

    tmp_text_arr = tmp_text.split(my_sprt)
    if tmp_text_arr[0] not in names:
        tmp_text_arr.pop(0)

    out_arr = []
    for i in range(0, len(tmp_text_arr) - 1, 2):
        out_arr.append([tmp_text_arr[i], tmp_text_arr[i + 1].rstrip()])
    return out_arr

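# case 1: text starts with a name
text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
# case 2: text has a leading part before the first name (this assignment overrides case 1)
text = 'today Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'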
names = ['Mike', 'Monika']
print(do_split(text, names))

This code also handles a text that does not start with an element of names.

Key point: rewrite the text value to |Monika| goes shopping. Then she rides bike. |Mike| likes Pizza. |Monika| hates me. using a self-defined separator such as |, which must not occur in the original text.
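The marking step can be sketched on its own. This is a minimal sketch rather than the answer's code: the helper name mark_names, the set lookup, and the guard that rejects a text already containing the marker are assumptions added for illustration.

```python
def mark_names(text, names, marker="|"):
    """Wrap every whole word of `text` that equals a name in `marker`."""
    if marker in text:
        # Assumption: fail loudly instead of producing ambiguous splits.
        raise ValueError(f"marker {marker!r} already occurs in text")
    wanted = set(names)  # set lookup avoids the inner loop over names
    words = [marker + w + marker if w in wanted else w for w in text.split()]
    return " ".join(words)


text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
marked = mark_names(text, ['Mike', 'Monika'])
print(marked)
# |Monika| goes shopping. Then she rides bike. |Mike| likes Pizza. |Monika| hates me.
```

Like the answer's code, this only recognizes names that stand alone as whitespace-separated words, so a name directly followed by punctuation would not be marked.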

Solution 3:[3]

I took one of your given solutions and slightly refactored it.

def split(txt, seps, actual_sep='\1'):
    order = [item for item in txt.split() if item in seps ]
    for sep in seps:
        txt = txt.replace(sep, actual_sep)
    return list( zip( order, [i.strip() for i in txt.split(actual_sep) if bool(i.strip())] ) )

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']

print( split(text, names) )

EDITED

Another solution to account for some edge case mentioned here.

def split(txt, seps, sep_pack='\1'):
    for sep in seps:
        txt = txt.replace(sep, f"{sep_pack}{sep}{sep_pack}")
    
    lst = txt.split(sep_pack)
    temp = []
    idx = 0
    for _ in range(len(lst)):
        if idx < len(lst):
            if lst[idx] in seps:
                temp.append( [lst[idx], lst[idx+1]] )
                idx+=2
            else:
                temp.append( ['', lst[idx]] )
                idx+=1

    return temp

Kinda ugly though, looking to improve.

Solution 4:[4]

This is in a similar vein to some answers here, but simpler.

There are three steps:

  1. Find all occurrences of a separator
  2. Split apart the remaining text
  3. Combine the results from (1) and (2) into a list of lists, as desired

We can combine (1) and (2) but it makes creating the list of lists more complicated.

import re

def split_on_names(names: list[str], text: str) -> list[list[str]]:
    pattern = re.compile("|".join(map(re.escape, names)))
    # step 1: find the separators (in order)
    separator = pattern.findall(text)
    # step 2: split out the text between separators
    remainder = list(filter(None, pattern.split(text)))

    # at this point, if `remainder` is longer, it's because `text` 
    # didn't start with a separator. So, we add a blank separator
    # to account for the prefix.
    if len(remainder) > len(separator):
        separator = ["", *separator]

    # step 3: reshape the results into a list of lists
    return list(map(list, zip(separator, remainder)))

names = ["Mike", "Monika"]
text = "Hi Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me."

split_on_names(names, text)

# output:
#
# [
#    ['', 'Hi '],
#    ['Monika', ' goes shopping. Then she rides bike. '],
#    ['Mike', ' likes Pizza. '],
#    ['Monika', ' hates me.']
# ]
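A different way to keep separators and text aligned is to slice the text between match positions. The sketch below swaps in re.finditer in place of findall plus split; the function name split_with_finditer is hypothetical, and unlike the code above it drops any text before the first name.

```python
import re


def split_with_finditer(names, text):
    pattern = re.compile("|".join(map(re.escape, names)))
    matches = list(pattern.finditer(text))
    result = []
    # Pair each match with the one after it; the last match is paired with None.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        # The slice runs from the end of this name to the start of the next.
        result.append([current.group(), text[current.end():end]])
    return result


names = ["Mike", "Monika"]
text = "Hi Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me."
result = split_with_finditer(names, text)
print(result)
```

Because each slice is taken from the match offsets, an empty slice between two adjacent names stays attached to the right name instead of being filtered out.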

Solution 5:[5]

You could use re.split along with zip:

import re
from pprint import pprint

text = "Monika goes shopping. Then she rides bike. Mike likes Pizza. " \
       "Monika hates me."

names = ["Henry", "Mike", "Monika"]

regex_string = "|".join(re.escape(name) for name in names)

fragments = re.split(f"({regex_string})", text)

if fragments:

    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names: 
        fragments = fragments[1:]

    result = [
        [name, text.rstrip()] 
        for name, text in zip(fragments[::2], fragments[1::2])
    ]

    pprint(result)

Output:

[['Monika', ' goes shopping. Then she rides bike.'],
 ['Mike', ' likes Pizza.'],
 ['Monika', ' hates me.']]

Notes:

  • This is an answer for question revision 9.

    • There is an update at the very end of this answer considering the changes in question revision 11.
  • You don't specify if "text" before the first occurrence of a name should be considered or not.

    • Script above ignores "text" before the first occurrence.
  • You also don't specify what happens if the text ends with a name.

    • Script above will include the occurrence by adding an empty string. However, this can easily be solved by removing the last element if the "text" is an empty string.
  • zip works because there is always an even number of elements in fragments. We remove the first element if it does not match a name (either text or empty string), and the last element is always an empty string if the text ends with a name.

According to re.split:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string [...]
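A short check of that behaviour, as a minimal sketch with a made-up input that both starts and ends with a name:

```python
import re

fragments = re.split("(Mike|Monika)", "Monika likes Mike")
print(fragments)  # ['', 'Monika', ' likes ', 'Mike', '']

# k matches always yield 2*k + 1 fragments, so skipping the first one
# leaves an even number that pairs up cleanly as (name, text).
pairs = list(zip(fragments[1::2], fragments[2::2]))
print(pairs)  # [('Monika', ' likes '), ('Mike', '')]
```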


Here is the same example but not ignoring "text" before the first occurrence:

import re

text = "Hi. Monika goes shopping. Then she rides bike. Mike likes Pizza. " \
       "Monika hates me."

names = ["Henry", "Mike", "Monika"]

regex_string = "|".join(re.escape(name) for name in names)

fragments = re.split(f"({regex_string})", text)

if fragments:

    # not ignoring text before first occurrence; use empty string as name
    if fragments[0].strip() == "":
        fragments = fragments[1:]
    elif not fragments[0] in names:
        fragments = [""] + fragments

    result = [
        [name, text.rstrip()]
        for name, text in zip(fragments[::2], fragments[1::2])
    ]

    # # remove empty text
    # if result and not result[-1][1]:
    #     result = result[:-1]

    print(result)  # [['', 'Hi.'], ['Monika', ...] ..., ['Monika', ' hates me.']]


Update for Question Revision 11

The following is an attempt to include id345678's additional requirement:

import re
from pprint import pprint


def create_regex_string(name: str) -> str:

    name_components = name.split()

    if len(name_components) == 1:
        return re.escape(name)

    salutation, name_part = name_components

    return f"({re.escape(salutation)} )?{re.escape(name_part)}"

text = "Monika goes shopping. Then she rides bike. Dr. Mike likes Pizza. " \
       "Mrs. Monika hates me. Henry needs a break."

names = ["Henry", "Dr. Mike", "Mrs. Monika"]

regex_string = "|".join(create_regex_string(name) for name in names)

group_count = regex_string.count("(") + 1

fragments = re.split(f"({regex_string})", text)

if fragments:

    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names: 
        fragments = fragments[1:]

    result = [
        [name, text.rstrip()] 
        for name, text in zip(
            fragments[::group_count+1],
            fragments[group_count::group_count+1]
        )
    ]

    pprint(result)

Output:

[['Monika', ' goes shopping. Then she rides bike.'],
 ['Dr. Mike', ' likes Pizza.'],
 ['Mrs. Monika', ' hates me.'],
 ['Henry', ' needs a break.']]

Notes:

  • final regex string is then (Henry|(Dr\. )?Mike|(Mrs\. )?Monika)

    • e.g. create_regex_string("Mrs. Monika") creates (Mrs\. )?Monika
    • it will also work for other salutations (as long as there is one space separating the salutation from the name)
  • because we introduced an additional grouping in the regex, fragments has more values

    • therefore, we needed to change the line with zip so that the slice step is computed dynamically from group_count
  • and if you don't want the salutation in the result, you can use name.split()[-1] when creating result:

result = [
    [name.split()[-1], text.rstrip()] 
    for name, text in zip(
        fragments[::group_count+1],
        fragments[group_count::group_count+1]
    )
]

# [['Monika', ' goes shopping. Then she rides bike.'],
#  ['Mike', ' likes Pizza.'],
#  ['Monika', ' hates me.'],
#  ['Henry', ' needs a break.']]

Please note: I have not tested all use cases as I updated the script on my break time. Let me know if there are issues and then I will look into it when I am off work.
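The group_count bookkeeping can be avoided by making the salutation a non-capturing group, which also leaves room for names with more than two components such as 'Mr. Mike, ADS'. The following is a minimal sketch, not the script above: the function names are hypothetical, blank names are skipped by assumption, and a salutation is assumed to be a first component ending in a period.

```python
import re


def name_pattern(name):
    """Return a regex for one name, or None for a blank entry."""
    parts = name.split()
    if not parts:
        # Assumption: blank names are skipped instead of raising an error.
        return None
    # Assumption: a salutation is a first component that ends with a period.
    if len(parts) > 1 and parts[0].endswith("."):
        rest = " ".join(parts[1:])
        # (?:...)? makes the salutation optional without adding a capture group.
        return f"(?:{re.escape(parts[0])} )?{re.escape(rest)}"
    return re.escape(" ".join(parts))


def split_optional_salutation(text, names):
    patterns = [p for p in map(name_pattern, names) if p]
    if not patterns:
        return []
    # The only capture group is the outer one, so fragments alternate
    # strictly between text and matched name.
    fragments = re.split("(" + "|".join(patterns) + ")", text)
    fragments = fragments[1:]  # drop the text before the first match
    return [[name, chunk.rstrip()]
            for name, chunk in zip(fragments[::2], fragments[1::2])]


names = ['Mr. Mike, ADS', 'Monika, TFO', '', 'Peter, WQR']
text = ('Monika, TFO goes shopping. Then she rides bike. '
        'Mike, ADS likes Pizza. Mr. Mike, ADS naps. Monika, TFO hates me.')
result = split_optional_salutation(text, names)
print(result)
```

A salutation that appears in the text but not in names (for example 'Mrs. ' before 'Monika, TFO') is not consumed by this sketch and stays at the end of the preceding fragment.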

Solution 6:[6]

Your example doesn't fully match your desired output. Also, it's not clear whether the example input will always have this structure, e.g. with a period at the end of each sentence.

Having said that, you might want to try this dirty approach:

import re

text = 'Monika will go shopping. Mike likes Pizza. Monika hates me.'

names = ['Ruth', 'Mike', 'Monika']
rsplit = re.compile("|".join(sorted(names))).split

output = []
sentences = text.split(".")
for name in names:
    for sentence in sentences:
        if name in sentence:
            output.append([name, f"{rsplit(sentence)[-1]}."])

print(output)

This outputs:

[['Mike', ' likes Pizza.'], ['Monika', ' will go shopping.'], ['Monika', ' hates me.']]

Solution 7:[7]

This is without re, unless you explicitly need to use it. It works for the given test case.

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'

names = ['Mike', 'Monika']

def sep(text, names):
    foo = []
    new_text = text.split(' ')
    for i in new_text:
        if i in names:
            foo.append(new_text[:new_text.index(i)])
            new_text = new_text[new_text.index(i):]
    foo.append(new_text)
    foo = foo[1:]

    new_foo = []
    for i in foo:
        first, rest = i[0], i[1:]
        rest = " ".join(rest)
        i = [first, rest]
        new_foo.append(i)
    print(new_foo)

sep(text, names)

Gives the output:

[['Monika', 'goes shopping. Then she rides bike.'], ['Mike', 'likes Pizza.'], ['Monika', 'hates me.']]

Should work for other cases too.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3
Solution 4 onepan
Solution 5
Solution 6 baduker
Solution 7