'Word Break algorithm

I'm trying to implement the "Word Break" algorithm.

Problem: Given a non-empty string s and a dictionary wordDict containing a list of non-empty words, determine if s can be segmented into a space-separated sequence of one or more dictionary words.

Note:

The same word in the dictionary may be reused multiple times in the segmentation. You may assume the dictionary does not contain duplicate words.

Example:

Input: s = "leetcode", wordDict = ["leet", "code"]
Output: true
Explanation: Return true because "leetcode" can be segmented as "leet code".

My solution:

var wordBreak = function(s, wordDict) {
    if(!wordDict || wordDict.length === 0)
        return false;
    
    while(wordDict.length > 0 || s.length > 0) {
        const word = wordDict.shift();
        const index = s.indexOf(word);
        if(index === -1) {
            return false;
        }
        s = s.substring(0, index) + s.substring(index+word.length, s.length);
    }
    
    return s.length === 0 && wordDict.length === 0 ? true : false;
};

It works for the example (input) above. However it fails for the input below.

Input: s = "applepenapple", wordDict = ["apple", "pen"]
Output: true
Explanation: Return true because "applepenapple" can be segmented as "apple pen apple".
             Note that you are allowed to reuse a dictionary word.

How can I keep track of words that I already eliminate and check it at the end. This input above, the remaining s string contains "apple" which is in the word dictionary, so the output should be true.

Thanks



Solution 1:[1]

This is an interesting problem I met two years ago in a different context, i.e., query tokenization. In my case, the number of words in the dictionary was in the order of several million, therefore a recursive approach looking each time for a different word of the dictionary was not practicable. Furthermore, I needed to apply dynamic programming to solve the task for strict efficiency reasons.

First of all, I suggest you to use the AhoCorasick algorithm to find the words within your search string. The algorithm looks for an arbitrary number of patterns in a string in linear time in the length of the string regardless of the number of patterns to find (no more number of words times length of the string operation, indeed each find of a word in a string needs to scan the entire string..). Luckily, I found a javascript implementation of the algorithm here.

Using the code linked above and dynamic programming to track the words appearing in your string, I wrote the following javascript solution:

function wordBreak(s, wordDict) {
    const len = s.length;
    const memoization_array_words = new Array(len).fill(null);
    const memoization_array_scores = new Array(len).fill(0);

    const wordScores = {};
    wordDict.forEach(function(word) {
        wordScores[word] = 1
    });

    automata = new AhoCorasick(wordDict);
    results = automata.search(s);

    results.forEach(function(result) {
        // result[0] contains the end position
        // result[1] contains the list of words ending in that position
        const end_pos = result[0];
        result[1].forEach(function(word) {
            const prev_end_pos = end_pos - word.length;
            const prev_score = (prev_end_pos == -1) ? 0 : memoization_array_scores[prev_end_pos];
            const score = prev_score + wordScores[word];
            if (score > memoization_array_scores[end_pos]) {
                memoization_array_words[end_pos] = word;
                memoization_array_scores[end_pos] = score;
            }
        });
    });

    if (memoization_array_words[len-1] == null) {
        return false;
    }
    solution = []
    var pos_to_keep = len - 1;
    while (pos_to_keep >= 0) {
        const word = memoization_array_words[pos_to_keep];
        solution.push(word);
        pos_to_keep -= word.length;
    }
    return solution.reverse()
}

where memoization_array_words and memoization_array_scores are filled left to right when we meet a word occurring after a previous one or at the beginning of the string s. The code should be autoesplicative, but if you need any explanation write me a comment, please. As a plus, I associated a score to each word (here is 1 for simplicity) that allows you to distinguish between the different solutions. For instance, if you associate to each word an importance score, you will end up with the tokenization with the greatest score. In the code above, the tokenization with the highest number of words.

Solution 2:[2]

Extended version: I testing over the wordDict with some if there is one of the worde that beginns at the test-string (indexOf==0). If so I shorten the string about the length of the word and call the function recursivly with the shortened string. Otherwise the string is not splitable and I return false. I go this way on till an error occurs or the length of the string is 0 and I win because everything goes allright.

Remark: The error when the WordBreak is not clearly like with s= "cars" wordDict = ["car","ca","rs"] is now fixed. For this I calling in the some-methode the algorithm recursivly. So if one way stops before ending I go backwards and search for alternatives till I found one or there is no possibility left.

Remarks to; array.some
In an array.forEach there can't used a break without using some ugly tricks (like try...catch and throwing an error), so I could use the classic variant of the for-loop. But there exists the array.some method this loops like a forEach-loop but there had only one of the elements to be return true so the result is true.

Example:

const array = [1, 2, 3, 4, 5];

// checks whether an element is even
const even = (element) => element % 2 === 0;

console.log(array.some(even));

Here is the code of the working algorithm.

var wordBreak = function(s, wordDict) {  
    if (!wordDict || wordDict.length === 0) return false;
    while (s.length > 0) {
        let test = wordDict.some( (word,index) => {
            if (s.indexOf(word)===0) {
                s_new = s.substr(word.length);
                return wordBreak(s_new, wordDict);
            }
        });
        if (!test ) return false;
        s=s_new;
    }
    if (s.length === 0) return true;    
}

s = "leetcode"; wordDict = ["leet", "code"];
console.log(wordBreak(s, wordDict));

s = "applepenapple"; wordDict = ["apple", "pen"];
console.log(wordBreak(s, wordDict));

 s= "cars"; wordDict = ["car","ca","rs"];
 console.log(wordBreak(s, wordDict));

Solution 3:[3]

function wordBreak(dict, str){
  if (!str){
    return true;
  }

  for (const word of dict){
    if (str.startsWith(word)){
      return wordBreak(dict, str.substring(word.length, str.length))
    }
  }

  return false;
}

You could also probably optimize the loop over dict by pre-sorting the array and using binary search, but hopefully this gets the point across.

Solution 4:[4]

A simple Javascript solution.

This loops through the wordDict array and checks if each word exist in the str. If it doesn't that is when the indexOf the word return -1, the function returns false. However, if the words in the wordDict array are in the string, it returns true at the end of the for loop.

const wordBreak =(str, wordDict)=>{
    if (!wordDict || wordDict.length === 0) return false
    for(let i=0; I<wordDict.length; i++){
        const dictIndex = str.indexOf(wordDict[i])
        if(dictIndex === -1){
            return false
        }
    }
    return true
}

Solution 5:[5]

If you'd be looking for a Dynamic Programming solution, we'd use an array for recording, and then we'd loop through and keep track of the word.

This'll pass through in JavaScript:

const wordBreak = function(s, wordDict) {
    const len = s.length
    const dp = new Array(len + 1).fill(false)
    dp[0] = true
    for (let i = 1; i < len + 1; i++) {
        for (let j = 0; j < i; j++) {
            if (dp[j] === true && wordDict.includes(s.slice(j, i))) {
                dp[i] = true
                break
            }
        }
    }

    return dp[s.length]
}

In Python, we would have used a list (which is similar to an array of JavaScript) with the same size as our string:

class Solution:
    def wordBreak(self, s, words):
        dp = [False] * len(s)
        for i in range(len(s)):
            for word in words:
                k = i - len(word)
                if word == s[k + 1:i + 1] and (dp[k] or k == -1):
                    dp[i] = True
        return dp[-1]

Similarly in Java, we'd have used a boolean[]:


public final class Solution {
    public static final boolean wordBreak(
        String s,
        List<String> words
    ) {
        if (s == null || s.length() == 0) {
            return false;
        }

        final int len = s.length();
        boolean[] dp = new boolean[len];

        for (int i = 0; i < len; i++) {
            for (int j = 0; j <= i; j++) {
                final String sub = s.substring(j, i + 1);

                if (words.contains(sub) && (j == 0 || dp[j - 1])) {
                    dp[i] = true;
                    break;
                }
            }
        }

        return dp[len - 1];
    }
}

Here is LeetCode's DP solution:

enter image description here

public class Solution {
    public boolean wordBreak(String s, List<String> wordDict) {
        Set<String> wordDictSet=new HashSet(wordDict);
        boolean[] dp = new boolean[s.length() + 1];
        dp[0] = true;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 0; j < i; j++) {
                if (dp[j] && wordDictSet.contains(s.substring(j, i))) {
                    dp[i] = true;
                    break;
                }
            }
        }
        return dp[s.length()];
    }
}

References

  • For additional details, please see the Discussion Board which you can find plenty of well-explained accepted solutions in there, with a variety of languages including efficient algorithms and asymptotic time/space complexity analysis1, 2.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Roberto Trani
Solution 2
Solution 3 Vyas Alwar
Solution 4 Jayne Uchechukwu
Solution 5