Python: find the start index of a specific word number in a string

I have this string:

myString = "Tomorrow will be very very rainy"

I would like to get the start index of word number 5 (the second "very").

Currently, I split myString into words:

import re

words = re.findall(r'\w+|[^\s\w]+', myString)

But I am not sure how to get the start index of word number 5, which is words[4] with zero-based indexing.

Using index() does not work, because it finds the first occurrence of "very" (index 17) rather than the second:

start_index = myString.index(words[4])


Solution 1:[1]

Not very elegant, but you can loop over the split words and accumulate the start index from each word's length plus the separator (in this case a single space). This example targets the fifth word in the sentence.

myString = "Tomorrow will be very very rainy"

target_word = 5

split_string = myString.split()

idx_start = 0

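# Walk over the words that come before the target word
for i in range(target_word-1):
    idx_start += len(split_string[i])
    # Step over the single space that follows the word
    if myString[idx_start] == " ":
        idx_start += 1

# Exclusive end index of the target word (no trailing space)
idx_end = idx_start + len(split_string[target_word-1])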

print(idx_start, idx_end, myString[idx_start:idx_end])
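The same running-offset idea can be written without assuming a particular separator. The following is only a sketch: the helper name word_span is made up for illustration, and it searches for each word with str.find, starting from where the previous word ended.

```python
def word_span(text, word_number):
    """Return (start, end) of the word_number-th (1-based) whitespace-separated word."""
    words = text.split()
    if not 1 <= word_number <= len(words):
        raise IndexError("word_number out of range")
    pos = 0
    for word in words[:word_number]:
        # Search only past the end of the previous word, so repeated
        # words such as "very very" are not confused with each other.
        start = text.find(word, pos)
        pos = start + len(word)
    return start, pos


myString = "Tomorrow will be very very rainy"
start, end = word_span(myString, 5)
print(start, end, myString[start:end])  # 22 26 very
```

Because only whitespace can sit between the end of one word and the start of the next, this should also cope with tabs or runs of several spaces.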

Solution 2:[2]

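import re

wordnum = 5
# End position of every run of spaces, i.e. where the following word starts
l = [x.span()[1] for x in re.finditer(" +", myString)]
# Word number wordnum follows separator number wordnum-1 (assumes wordnum >= 2)
pos = l[wordnum-2]
print(pos)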

Output

22
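A possible variation, sketched here under assumptions, is to iterate over the words themselves instead of the gaps between them, so that each match object reports its own start and the first word needs no special case. The function name token_start is hypothetical; the default pattern is the one from the question.

```python
import re


def token_start(text, word_number, pattern=r"\w+|[^\s\w]+"):
    """Start index of the word_number-th (1-based) token, or None if there are too few."""
    for count, match in enumerate(re.finditer(pattern, text), start=1):
        if count == word_number:
            return match.start()
    return None


print(token_start("Tomorrow will be very very rainy", 5))  # 22
```

With this pattern, runs of punctuation count as separate tokens, which matches how the question splits the string.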

Solution 3:[3]

If the words are separated by single spaces only:

  • Sum the lengths of all words before the wanted word
  • Add the number of spaces, which is one per preceding word

word_idx = 4  # zero based index
words = myString.split()
start_index = sum(len(word) for word in words[:word_idx]) + word_idx

Result:

22

Solution 4:[4]

If the string starts with at least 5 words, you can match the first 4 words and capture the fifth one.

Then you can call the start method of the Match object and pass 1 to it to get the start index of the first capture group.

^(?:\w+\s+){4}(\w+)

Explanation

  • ^ Start of string
  • (?:\w+\s+){4} Repeat 4 times matching 1+ word characters and 1+ whitespace chars
  • (\w+) Capture group 1, match 1+ word characters

Example

import re

myString = "Tomorrow will be very very rainy"
pattern = r"^(?:\w+\s+){4}(\w+)"
m = re.match(pattern, myString)
if m:
    print(m.start(1))

Output

22

For a broader match, you can use \S+ to match one or more non-whitespace characters, so that words containing punctuation are matched as well.

pattern = r"^(?:\S+\s+){4}(\S+)"
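To target an arbitrary word number rather than hard-coding {4}, the pattern can be built from a parameter. This is a minimal sketch: the function name nth_word_start and the optional leading \s* (to tolerate leading whitespace) are additions for illustration and not part of the original answer.

```python
import re


def nth_word_start(text, n, word=r"\S+"):
    """Start index of the n-th (1-based) word, or None if text has fewer than n words."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Skip n-1 words with their trailing whitespace, then capture the n-th word.
    pattern = r"^\s*(?:%s\s+){%d}(%s)" % (word, n - 1, word)
    m = re.match(pattern, text)
    return m.start(1) if m else None


myString = "Tomorrow will be very very rainy"
print(nth_word_start(myString, 5))  # 22
```

Passing word=r"\w+" gives the stricter behaviour of the first pattern.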

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Milos Cuculovic
Solution 2
Solution 3 ivvija
Solution 4 The fourth bird