Count words in a string without using split
I am working on a problem where I need to count the number of words in a string without using the split() function in Python.
I thought of an approach where I take a variable word=0 and increment it every time there is a space in the string, but it doesn't seem to work: it always gives a count lower than the actual count.
s="the sky is blue"
def countW(s):
    print(s)
    word=0
    for i in s:
        if i==" ":
            word=word+1
    print(word)
countW(s)
I know it's a simple question, but I am struggling to understand what else I need to take into account to get the right count. The second approach I was thinking of involves too many for loops, creating an array, and then converting back to a string. Can anyone point me to a simpler approach that doesn't increase the time complexity?
Solution 1:[1]
Counting the number of spaces is a good approach and works most of the time. Of course, you have to add 1 to get the correct number of words.
However, since you seem to be concerned about poorly formatted strings, you have to consider runs of multiple whitespace characters, whitespace at the beginning and the end, as well as punctuation.
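To make the failure modes concrete, here is a minimal sketch of the "spaces plus one" estimate. The helper name count_by_spaces is hypothetical and only used for illustration; it is not part of the question's code.

```python
def count_by_spaces(s):
    # Naive estimate: number of space characters plus one.
    return s.count(" ") + 1

print(count_by_spaces("the sky is blue"))    # 4, correct for well-formed input
print(count_by_spaces("the  sky is blue"))   # 5, the doubled space adds a phantom word
print(count_by_spaces(" the sky is blue "))  # 6, leading and trailing spaces each add one
print(count_by_spaces(""))                   # 1, although there are no words at all
```

Every space that does not separate exactly two words inflates the result, which is why the transitions between word and non-word characters are a more reliable thing to count.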
If you do not want to use regular expressions (as in Ezsrac's answer), here is an alternative that treats any run of letters, digits and underscores as a word, just like \w does for ASCII input. It simply counts all transitions from a word character to a non-word character. The end of the string requires special attention, because the string may or may not end with a non-word character (for example "a a " vs. "a a").
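def is_word_character(c):
    # ASCII letters, digits and the underscore count as word characters
    return 'a' <= c <= 'z' or 'A' <= c <= 'Z' or '0' <= c <= '9' or c == '_'

def word_count(text):
    c = 0
    # count every transition from a word character to a non-word character
    for i in range(1, len(text)):
        if not is_word_character(text[i]) and is_word_character(text[i-1]):
            c += 1
    # a word that runs to the end of the string has no closing transition;
    # checking "text" first avoids an IndexError on the empty string
    if text and is_word_character(text[-1]):
        c += 1
    return c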
Here are some test cases:
>>> word_count("the sky is blue")
4
>>> word_count("the sky is blue.The")
5
>>> word_count(" the sky is blue ")
4
>>> word_count(" the sky is blue\nand not green ")
7
If you also want to include other characters, you can simply extend the is_word_character function, but be aware that it is not possible to handle all corner cases without very advanced techniques. For example, consider "You are good-looking" vs. "This is good-looking into the sky". Such a simple program cannot recognize that the first contains a compound adjective while the second consists of two poorly linked sentences.
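As a rough sketch of such an extension, the variant below accepts a set of additional word characters. The extra parameter and its default of hyphen and apostrophe are assumptions made for illustration, not part of the original answer.

```python
def is_word_character(c, extra="-'"):
    # ASCII letters, digits and underscore, plus the (hypothetical) extra characters.
    return ('a' <= c <= 'z' or 'A' <= c <= 'Z' or '0' <= c <= '9'
            or c == '_' or c in extra)

def word_count(text, extra="-'"):
    count = 0
    # Count transitions from a word character to a non-word character.
    for i in range(1, len(text)):
        if is_word_character(text[i - 1], extra) and not is_word_character(text[i], extra):
            count += 1
    # A word that runs to the end of the string has no closing transition.
    if text and is_word_character(text[-1], extra):
        count += 1
    return count

print(word_count("You are good-looking"))            # 3
print(word_count("You are good-looking", extra=""))  # 4
```

The trade-off is that a free-standing hyphen, as in "a - b", now counts as a word of its own, which illustrates why no fixed character set handles every case.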
Solution 2:[2]
You could also use itertools.groupby, grouping by whether the characters are alphanumeric or not, and summing all the keys (True equaling 1).
>>> import itertools
>>> s = "the sky is blue"
>>> sum(k for (k, g) in itertools.groupby(s, key=str.isalnum))
4
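To see why the sum works, it may help to look at the groups themselves. The following sketch uses a slightly messier input string (chosen here for illustration) and prints each run together with its key.

```python
import itertools

s = "the sky  is blue."

# Each run of consecutive characters with the same str.isalnum result forms one group.
runs = [(key, "".join(group)) for key, group in itertools.groupby(s, key=str.isalnum)]
print(runs)

# True counts as 1 and False as 0, so summing the keys counts the alphanumeric runs.
count = sum(key for key, _ in itertools.groupby(s, key=str.isalnum))
print(count)  # 4
```

Note that the underscore is not alphanumeric, so "snake_case" counts as two words here, unlike with \w.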
Solution 3:[3]
The simplest approach is a finite automaton with two states: inside a word or outside one. Pseudocode:
InsideWord = false
Count = 0
for c in s
    if c is not letter
        InsideWord = false
    else
        if not InsideWord
            Count++
            InsideWord = true
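One possible Python rendering of the pseudocode above is sketched below. It assumes that "letter" means whatever str.isalpha accepts; the function name count_words is likewise only illustrative.

```python
def count_words(s):
    inside_word = False
    count = 0
    for c in s:
        if not c.isalpha():
            # Any non-letter ends the current word, if there is one.
            inside_word = False
        elif not inside_word:
            # First letter after a non-letter (or at the start) begins a new word.
            count += 1
            inside_word = True
    return count

print(count_words("the sky is blue"))        # 4
print(count_words("  the sky  is blue\n"))   # 4
```

This makes a single pass over the string and keeps only two variables, so it does not add to the time complexity. Under the isalpha assumption, digits separate words, so "abc123def" counts as two.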
Solution 4:[4]
If you really don't want to use split, you could try a regex:
import re
s = "the sky is blue"
count = len(re.findall(r'\w+', s))
print(count)
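A small variation, sketched here under the assumption that only the count is needed, uses re.finditer so that no list of matches is built. The function name count_words_regex and the ascii_only parameter are hypothetical. In Python 3, \w is Unicode-aware for str patterns unless re.ASCII is passed, which changes the result for accented text.

```python
import re

def count_words_regex(s, ascii_only=False):
    # re.ASCII restricts \w to [a-zA-Z0-9_]; without it, \w also matches Unicode letters.
    flags = re.ASCII if ascii_only else 0
    return sum(1 for _ in re.finditer(r"\w+", s, flags))

print(count_words_regex("  the sky  is blue.The "))                 # 5
print(count_words_regex("na\u00efve caf\u00e9"))                    # 2
print(count_words_regex("na\u00efve caf\u00e9", ascii_only=True))   # 3
```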
Solution 5:[5]
Simply initialize word to 1 instead of 0. This assumes a non-empty string whose words are separated by exactly one space:
print("count words")
s = "the sky is dark and lit with stars"
def countW(s):
    print(s)
    word=1
    for i in s:
        if i == " ":
            word=word+1
    print(word)
countW(s)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | tobias_k |
Solution 3 | MBo |
Solution 4 | Ezsrac |
Solution 5 | Tyler2P |