'Algorithm to find same substring from a list of strings

I'm a bit lost here, some help is welcomed. The idea is to find a matching substring from a list of strings. It doesn't has to be perfect. Let's explain this with an example :

"Country_Name" "Country_Id" "Rankcountry" "ThisWillNotMatch"

Would return "country"

It has to be something efficient, as a 'brute force' algo looks a bit scary.



Solution 1:[1]

Inspired by Jon Bentley's "Algorithm Alley" column in Dr. Dobb's.

Build an index of every suffix. Sorting the index brings common substrings together. Walk the sorted index comparing adjacent substrings, and you can easily find the longest one (or the most common one).

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    std::size_t LengthInCommon(const char *left, const char *right) {
      std::size_t length_of_match = 0;
      while (*left == *right && *left != '\0') {
        ++length_of_match;
        ++left;
        ++right;
      }
      return length_of_match;
    }

    std::string FindLongestMatchingSubstring(const std::vector<std::string> &strings) {
      // Build an index with a pointer to each possible suffix in the array.  O(n)
      std::vector<const char *> index;
      for (const auto &s : strings) {
        for (const auto &suffix : s) {
          index.push_back(&suffix);
        }
      }

      // Sort the index using the underlying substrings.  O(n log_2 n)
      std::sort(index.begin(), index.end(), [](const char *left, const char *right) {
        return std::strcmp(left, right) < 0;
      });

      // Common strings will now be adjacent to each other in the index.
      // Walk the index to find the longest matching substring.
      // O(n * m) where m is average matching length of two adjacent strings.
      std::size_t length_of_longest_match = 0;
      std::string match;
      for (std::size_t i = 1; i < index.size(); ++i) {
        const char *left = index[i - 1];
        const char *right = index[i];
        std::size_t length_of_match = LengthInCommon(left, right);
        if (length_of_longest_match < length_of_match) {
          length_of_longest_match = length_of_match;
          match.assign(index[i], index[i] + length_of_longest_match);
        }
      }

      return match;
    }

    int main () {
      std::vector<std::string> strings;
      strings.push_back("Country_Name");
      strings.push_back("Country_id");
      strings.push_back("RankCountry");
      strings.push_back("ThisWillNotMatch");
      std::cout << FindLongestMatchingSubstring(strings) << std::endl;
      return 0;
    }

Prints:

Country_

Solution 2:[2]

Not sure if it is "efficient" or considered brute force... I leave that up to others to judge.

  1. input = list of strings
  2. for each s in input do: computeAllStrides ( string -> string list) (see code below)
  3. create empty, mutable dictionary with key of type string, value of type int
  4. all strides = list of list of strings from step 2 -> update Dictionary with update Dictionary (stride) when stride exists in dictionary -> increase value of respective entry update Dictionary (stride) when stride does not exist in dictionary -> Add (stride,1) to Dictionary
  5. find Dictionary entry which yields the maximum value of stride.Length * frequency
  6. report found maximum value.

In case of case insensitive, perform a toLowercase operation on each input string first.

    open System.Collections.Generic

    let input = ["Country_Name"; "Country_id"; "RankCountry"; "ThisWillNotMatch"; ]

    let rec getAllStrides text =
      let length = String.length text
      match length with
        | 0 -> []
        | 1 -> [text]
        | _ -> [ for i = 1 to length do yield text.Substring(0, i ) ] @ getAllStrides (text.Substring(1))

                                                                                      
    type HashTable = System.Collections.Generic.Dictionary<string,int>

    let update (ht : HashTable) strides =
      List.iter (fun s ->
                 if ht.ContainsKey(s) then ht.[s] <- ht.[s] + 1 else ht.Add( s, 1 )
                 ) strides

    let computeStrideFrequencies input =
      let ht = new HashTable()
      input |> List.iter (fun i -> update ht (getAllStrides i) )
      ht


    let solve input =
      let theBest = input |> computeStrideFrequencies |> Seq.maxBy (fun (KeyValue(k,v)) -> k.Length * v)
      theBest.Key


   solve input;;
   val it : string = "Country"

Solution 3:[3]

I still don't understand why "c" cannot be an answer. I guess you prefer longer strings. Need to get your optimization function straight!

In any case, you can solve this with Tries. Create a Trie for each string. make count 1 for each node. And merge all tries by summing up the counts. This way you get all substrings and their counts. Now, use your optimization function to pick the optimal one.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 BitTickler
Solution 2
Solution 3 ElKamina