'Most frequent next word for every word in the input

You are given a collection of sentences (words separated by whitespaces), for example:

{{My name is Mat},
 {This is easy},
 {Do you know where is Mat?}}

You have to feed this input into an API function BuildModel() and train system to output the next most frequent word for any given input word. Which means that after BuildModel() has completed to run, you may call another API function GetNextWord('is') and get 'Mat' instead of 'easy'. As you may see 'Mat' is appearing 2 times after 'is'.

BuildModel(vector<string> & v);
GetNextWord(string w);

Can we do it in linear time?

EDIT: I see it as scanning the words and building a map where each word is associated with a max priority queue. Priority queue is filled with a pair {word, frequency} so that the word with most frequency is on top. And we can simply call top() in GetNextWord. One problem with this is that priority queue cannot be easily updated once we need to add frequency. So first construct map for each word and then transform everything into queue. If we keep the map, the output is defined by key

map<string, int> MAP{ {"Mat",2}, {"easy",1} };


Solution 1:[1]

My standard answer would be: Use some kind of Trie. I think it was made for this purpose. But a trie has a huge memory consumption.

And, if the requirement is just to look at the next word, then this can be done with a standard counting approach.

First we make everything lower case with a simple std::transform. Then, I use any algorithm to split a text into sentences. Here, I use for example the method with a std::sregex_token_iterator. And next, I extract all words from a sentence with a simple std::istream_iterator. Good, now we have all words of a sentence in a std::vector. Using its index operator [] we can easily access a word and its following word, which has an "index + 1".

As the main data structure I use a

std::unordered_map<std::string, std::unordered_map<std::string, size_t>

In the first key, we will store the "front"-words. In the 2nd, inner std::unordered_map, we will store all "follow"-words for the given "front"-words and its count. With that we can use the standard counting approach:

for (size_t wordIndex{}; wordIndex < words.size() - 1; ++wordIndex)
    followCounter[words[wordIndex]]  [words[wordIndex+1]]  ++;

This is very simple.


Then, next is sorting.

I will use a Max Heap based on a std::vector. I do not need the std::priority_queue wrapper. So, in the first step, we copy data from the inner std::unordered_mapto the std::vector, using the std::vectors range constructor. That should be understandable.

Then we use std::make_heap to build a Max Heap by suppling a custom sorting functor.

And last but not least, we create a data structure

std::unordered_map<std::string, std::vector<std::pair<std::string, size_t>

The inner std::vector is the Max Heap. The outer key is the "front"-word.

We move the just created Max Heap into this structure, to get the final result.

And that's it. We need roundabout 13 statements for your "BuildModel" equivalent functionality.

Getting a "follow" word is also simple and fast. Please look at the debgug output.

If we want to add words later, unfortunately we need again to write 7 statements. We first get a reference to the new "front" word. If it is not existing, then the std::unordered_maps index operator [] will create a new entry for us and also return an index to it. It's associated Max Hep will be empty.

Then, we search in the (either already existing or just created) Max Heap the "follow"-word. If it is existing, then we will increment its counter. If, not, we use a std::push_back operation with a new pair of the "follow"-word and a count of 1. Last but not least, we heapify it again, and, that's it.

The data structures are a little bit hard to read. Therefore I added a lot of using-statements, to increase readablity.

I tend to think that this is a reasonable implementation, but only a benchmark will give the final answer.

As I said in the beginning, A Trie will outperfrom anything here.

Anyway. Please see one potential solution for your problem.

#include <iostream>
#include <sstream>
#include <vector>
#include <unordered_map>
#include <string>
#include <regex>
#include <algorithm>
#include <utility>
#include <iomanip>

// Copyright: https://linguapress.com/intermediate/silicon-valley.htm
std::string text{ R"(The story of Silicon Valley. 
 If old America was made in New York or Detroit, modern America is made in Silicon Valley. But what is "Silicon Valley", where is it? And why is it where it is?
San Jose. San Jose,in the heart of Silicon Valley. It is not made of silicon; and it is not a river valley; but forgetting that, Silicon Valley is probably the most famous valley in the world. 
Although it is not the place where the first computer was built (that was Manchester, England), Silicon Valley, near San Francisco,  
was the birthplace of the modern computer industry. For this, we can say thankyou to scientists at the universities in California, and to the Hippies of the 1960's.
It was in the nineteen-sixties that American "youth culture" really began. California, of course, already existed; but the Sixties Generation rediscovered it.
At the time there were really two different forms of youth culture; the "Beach Boy" culture on the one hand, and the anti-establishment hippies and radical students 
on the other hand; and they all dreamed of California. For the Beach Boys, that meant southern California, where they could sing about surfing and cars; 
for the Hippies and radicals, it meant San Francisco, "flower power" and revolutionary new ideas. The campuses at Berkeley and Stamford, near San Francisco, were hot-beds 
of new ideas, new technology, new culture, and new ways of living. When they finished university, many of the best students did not look for jobs with big 
companies like Ford or Exxon. Instead they wanted to be free and run their own operations and stay in California, not far from San Francisco. Silicon Valley 
is thus a group of small towns, including Palo Alto and San Jose, a few miles south of San Francisco. The high-technology industry was already present around 
San Francisco. Intel had been founded in 1968, and in the same year the first computer mouse was built at Stamford University. In 1970, Xerox opened a research 
center in Palo Alto. There were also other electronics companies, like Hewlett Packard, and Fairchild, the world's first "semiconductor" company.
Then, in 1976, an electronics student called Steve Jobs started a small computer company in his garage; he gave it the same name as the Beatles' record company: Apple.
Very soon, more companies, like Seagate and Google appeared. "Silicon Valley" had arrived. There was even a sort of primitive Internet connecting many addresses 
in Silicon Valley, called the Arpanet. Today, Silicon Valley is still the home of the computer industry; it is still full of high technology, but it is not 
the only center for high-tech in the USA. Today here are computer firms all over the USA and all over the world; but Silicon Valley still has the largest 
concentration of high-tech companies and research centers. Microsoft, the world's biggest high-tech company, is not based in Silicon Valley. 
It is further north, near Seattle in the state of Washington.)" };

// Create aliases. Save typing work and make code more readable -------------------------
using Pair = std::pair<std::string, size_t>;
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
using FollowCounter = std::unordered_map<Pair::first_type, Counter>;
using MaxHeap = std::vector<Pair>;
using FollowMaxHeap = std::unordered_map <Pair::first_type, MaxHeap>;
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return p1.second < p2.second; }};
using SVec = std::vector<Pair::first_type>;
const std::regex reSentence("[.;:]");

// --------------------------------------------------------------------------------------
int main() {
    // Convert into all lowercase
    std::transform(text.begin(), text.end(), text.begin(),[](char c) { return static_cast<char>(std::tolower(c)); });

    // Split text into sentences
    SVec sentences(std::sregex_token_iterator(text.begin(), text.end(), reSentence, -1), {});

    // Here we will store the count of words that always follow a specific other word
    FollowCounter followCounter{};

    // Go over all sentences
    for (const Pair::first_type& sentence : sentences) {

        // Extract all words from this sentence
        std::istringstream iss(sentence);
        SVec words(std::istream_iterator<std::string>(iss), {});

        // And do the counting for this word and its follower
        for (size_t wordIndex{}; wordIndex < words.size() - 1; ++wordIndex)
            followCounter[words[wordIndex]][words[wordIndex+1]]++;
    }
    // Here we will store the sorted result.
    FollowMaxHeap followMaxHeap{};

    // Check all words and counts
    for (const auto& [word, counter] : followCounter) {

        // Copy all counts and make create a Max Heap for them for each word
        MaxHeap maxHeap(counter.begin(), counter.end());
        std::make_heap(maxHeap.begin(), maxHeap.end(), Comp());
        // And store the result
        followMaxHeap[word] = std::move(maxHeap);
    }
    // Show debug output
    for (const auto& [word, maxHeap] : followMaxHeap)
        std::cout << std::right << std::setw(20) << word << " --> " << std::left << std::setw(20) << maxHeap.front().first << " --> " << maxHeap.front().second << '\n';

    // Now add a new "follow" word
    std::string startWord{ "it" }, followWord{ "is" };

    // Get a reference to the Max hewp of the front word
    MaxHeap& mh = followMaxHeap[startWord];

    // Look, if the word exists in the MaxHeap
    MaxHeap::iterator fw = std::find_if(mh.begin(), mh.end(), [&](const Pair& p) { return p.first == followWord; });

    // If yes, then increment its count, otherwise, add a new one
    if (fw != mh.end())  
        ++fw->second;
    else
        mh.push_back({ followWord , 1});
    // heapify again
    std::push_heap(mh.begin(), mh.end(), Comp());

    // Show debug output
    for (const auto& [word, maxHeap] : followMaxHeap)
        std::cout << std::right << std::setw(20) << word << " --> " << std::left << std::setw(20) << maxHeap.front().first << " --> " << maxHeap.front().second << '\n';

    return 0;
}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1