Python: find the start index of a specific word number in a string
I have this string:
myString = "Tomorrow will be very very rainy"
I would like to get the start index of word number 5 (the second "very").
Currently, I split myString into words:
words = re.findall( r'\w+|[^\s\w]+', myString)
But I am not sure how to get the start index of word number 5, which is words[4] with zero-based indexing.
Using index() does not work because it finds the first occurrence:
start_index = myString.index(words[4])
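One way to sidestep the first-occurrence problem, sketched here on the assumption that the same tokenizing pattern is kept: re.finditer yields match objects that carry their own positions, so the offset of the n-th token can be read off directly. The helper name token_start is hypothetical and does not come from any of the answers.

```python
import re

def token_start(text, n, pattern=r"\w+|[^\s\w]+"):
    """Return the start index of the n-th token (1-based), or -1 if absent."""
    for count, match in enumerate(re.finditer(pattern, text), start=1):
        if count == n:
            return match.start()
    return -1

myString = "Tomorrow will be very very rainy"
print(token_start(myString, 5))   # 22, the second "very"
print(myString.index("very"))     # 17, the first occurrence
```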
Solution 1:[1]
Not very elegant, but you can loop through the list of split words and compute the index from the word lengths and the split character (in this case a single space). This example targets the fifth word in the sentence.
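myString = "Tomorrow will be very very rainy"
target_word = 5  # 1-based word number
split_string = myString.split()
idx_start = 0
for i in range(target_word - 1):
    # Skip past each preceding word ...
    idx_start += len(split_string[i])
    # ... and the single space that follows it.
    if myString[idx_start] == " ":
        idx_start += 1
# Exclusive end index, so the slice holds the word without the trailing space.
idx_end = idx_start + len(split_string[target_word - 1])
print(idx_start, idx_end, myString[idx_start:idx_end])  # 22 26 very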
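The loop above assumes exactly one space between words. A possible variant, offered only as a sketch, uses str.find with a moving start offset instead, so runs of spaces or tabs are tolerated. The function name word_span is hypothetical.

```python
def word_span(text, target_word):
    """Return (start, end) of the target_word-th word (1-based); end is exclusive."""
    words = text.split()
    if not 1 <= target_word <= len(words):
        raise IndexError("word number out of range")
    pos = 0
    for word in words[:target_word]:
        # Search only from the end of the previous word onward.
        start = text.find(word, pos)
        pos = start + len(word)
    return start, pos

print(word_span("Tomorrow will be very very rainy", 5))      # (22, 26)
print(word_span("Tomorrow   will\tbe very very rainy", 5))   # (24, 28)
```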
Solution 2:[2]
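import re
myString = "Tomorrow will be very very rainy"
wordnum = 5
# End position of every run of spaces, i.e. the start of words 2, 3, ...
l = [x.span()[1] for x in re.finditer(" +", myString)]
pos = l[wordnum-2]
print(pos)
Output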
22
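Note that the list of space-run end positions has no entry for the first word, so wordnum = 1 would index l[-1] and return the start of the last word instead. A hedged variant that handles this case, assuming the string has no leading whitespace (word_start_by_gaps is a hypothetical name):

```python
import re

def word_start_by_gaps(text, wordnum):
    """Start index of the wordnum-th word (1-based), using ends of space runs."""
    if wordnum == 1:
        return 0  # assumption: no leading whitespace
    gap_ends = [m.end() for m in re.finditer(" +", text)]
    return gap_ends[wordnum - 2]

print(word_start_by_gaps("Tomorrow will be very very rainy", 5))   # 22
print(word_start_by_gaps("Tomorrow  will be", 2))                  # 10
```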
Solution 3:[3]
If only single spaces between words:
- Sum all word lengths before the wanted word
- Add amount of spaces
word_idx = 4  # zero-based index
words = myString.split()
start_index = sum(len(word) for word in words[:word_idx]) + word_idx
print(start_index)
Result:
22
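To make the arithmetic concrete: the four words before the target have lengths 8, 4, 2 and 4, which sum to 18, and adding the 4 separating spaces gives 22. The sketch below wraps the same formula in a function (start_by_lengths is a hypothetical name) and shows why the single-space condition matters.

```python
def start_by_lengths(text, word_idx):
    """Start index of words[word_idx] (zero-based), assuming single spaces."""
    words = text.split()
    return sum(len(word) for word in words[:word_idx]) + word_idx

single = "Tomorrow will be very very rainy"
double = "Tomorrow  will be very very rainy"   # two spaces after the first word

print(start_by_lengths(single, 4))   # 22
print(start_by_lengths(double, 4))   # still 22, but the word now starts at 23
```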
Solution 4:[4]
If the string starts with at least 5 words, you can match the first 4 words and capture the fifth one.
Then you can call the start method of the Match object and pass 1 to it for the first capture group.
^(?:\w+\s+){4}(\w+)
Explanation
- ^ Start of string
- (?:\w+\s+){4} Repeat 4 times matching 1+ word characters and 1+ whitespace chars
- (\w+) Capture group 1, match 1+ word characters
Example
import re
myString = "Tomorrow will be very very rainy"
pattern = r"^(?:\w+\s+){4}(\w+)"
m = re.match(pattern, myString)
if m:
    print(m.start(1))
Output
22
For a broader match you can use \S+ to match one or more non-whitespace characters.
pattern = r"^(?:\S+\s+){4}(\S+)"
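As a rough illustration of the broader pattern, the repeat count can be built from the word number so the same idea works for any position. This is a sketch only, and nth_word_start is a hypothetical helper name. The second call uses a sentence with apostrophes and punctuation, which the \w+ version would not match.

```python
import re

def nth_word_start(text, n):
    """Start index of the n-th whitespace-delimited word (1-based), or None."""
    # Skip n-1 "word + whitespace" pairs, then capture the n-th word.
    pattern = r"^(?:\S+\s+){%d}(\S+)" % (n - 1)
    m = re.match(pattern, text)
    return m.start(1) if m else None

print(nth_word_start("Tomorrow will be very very rainy", 5))   # 22
print(nth_word_start("It's raining, isn't it today?", 5))      # 23
```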
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Milos Cuculovic |
Solution 2 | |
Solution 3 | ivvija |
Solution 4 | The fourth bird |