String comparison time complexity

Which comparison would take longer?

a = helloworldhelloworldhelloworld
b = https://www.somerandomurls.com/directory/anotherdirectory/helloworld.html
if a != b: doThis()

versus

a=one, b=two
if a != b: doThis()

I often need to run this kind of check against my database, which has thousands of rows. I am not looking for any specific programming language; I just want to know which comparison is faster. As you can see, the value of b is a longer string in the first example and a shorter one in the second, so I wonder whether that makes any difference to the comparison.



Solution 1:[1]

Time for string comparison is O(n), n being the length of the string.

However, depending on the test data, you can manually optimize the matching algorithm. A few such optimizations are described below.

Optimization 1:

Check the sizes of both strings first; if they are unequal, return false. This avoids the O(n) character-by-character comparison entirely and saves time. String data structures generally store the size in memory rather than recalculating it each time, which gives O(1) access to the string size.
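
As a rough sketch (not from the original answer), this is what Optimization 1 looks like in Python; note that the built-in string equality of many languages already performs this length check internally:

    # Minimal sketch of Optimization 1: reject on a length mismatch before
    # touching any characters.
    def equal_strings(a: str, b: str) -> bool:
        if len(a) != len(b):       # O(1): the size is stored with the string
            return False
        for ca, cb in zip(a, b):   # O(n) scan only when the lengths match
            if ca != cb:
                return False
        return True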

In practice this is a huge optimization. To see why, consider the amortized time complexity.

If your string data structure can hold a string of maximum size x, then there are (x + 1) possible string sizes (0, 1, 2, ..., x).

There are (x + 2) choose 2 = (x + 1) * (x + 2) / 2 ways of picking an unordered pair of string sizes (equal sizes allowed).

If you use Optimization 1, you need to compare characters only when the two strings have equal length. There are only x + 1 such cases, and the work done across them is at most 0 + 1 + 2 + ... + x = x * (x + 1) / 2.

The remaining (x + 1) choose 2 = x * (x + 1) / 2 cases (unequal lengths) are each handled in O(1) time.

Hence the total work is about x * (x + 1) / 2 + x * (x + 1) / 2 = x * (x + 1), which is O(x^2). Since we are doing (x + 1) * (x + 2) / 2 string comparisons, the amortized time per comparison is O(1).

Whereas without the optimization, a pair whose shorter string has length k costs up to k character comparisons, and there are (x - k + 1) such pairs, giving a total of

0 * (x + 1) + 1 * x + 2 * (x - 1) + ... + x * 1 = x * (x + 1) * (x + 2) / 6 character comparisons, which is O(x^3). Spread over the same (x + 1) * (x + 2) / 2 comparisons, the amortized time per comparison is O(n).
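
As a sanity check on this argument, here is a small illustrative Python script (not part of the original answer) that sums the worst-case character comparisons over every pair of string lengths up to x, with and without the up-front length check:

    # Worst case assumed: strings agree on their entire common prefix, so a
    # naive comparison scans min(i, j) characters for lengths i and j.
    def average_cost(x, check_length_first):
        total = pairs = 0
        for i in range(x + 1):
            for j in range(i, x + 1):
                pairs += 1
                if check_length_first and i != j:
                    total += 1          # mismatching lengths detected in O(1)
                else:
                    total += min(i, j)  # full scan of the shorter string
        return total / pairs

    for x in (100, 500, 2000):
        print(x, average_cost(x, True), average_cost(x, False))
    # The first column stays near a small constant (O(1) amortized);
    # the second grows roughly linearly with x (O(n) amortized).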

Optimization 2:

Since your database contains web links, many of them probably belong to the same website, so their first few characters will often be the same. Scanning that shared prefix on every comparison wastes CPU time. In this case it is better to compare from the end, since such links tend to differ near the end.
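
A possible sketch of Optimization 2 in Python (illustrative only; the function name is made up): keep the length check, but walk the strings from the end, where URLs from the same site are most likely to differ:

    def equal_urls(a: str, b: str) -> bool:
        if len(a) != len(b):                 # Optimization 1 still applies
            return False
        # Scan backwards so a difference near the end of the URL is found
        # immediately; the shared "https://example.com/..." prefix is only
        # re-scanned when everything after it already matched.
        for i in range(len(a) - 1, -1, -1):
            if a[i] != b[i]:
                return False
        return True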

Note: theoretically speaking, none of this changes the worst-case time complexity, which is still O(n); we are just optimizing the algorithm for the expected data.

Solution 2:[2]

String comparisons typically do a linear scan of the characters, returning false at the first index where characters do not match.

The time complexity is O(N) and the actual time taken depends on how many characters need to be scanned before differences statistically emerge. If every one of your strings starts with http://, there will be a constant overhead to scan those first 7 characters (without tailoring the comparison algorithm to your specialized data).
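
To make that constant overhead concrete, here is a small illustrative Python helper (not from the original answer) that counts how many characters a naive left-to-right scan matches before it hits the first difference:

    def shared_prefix_length(a: str, b: str) -> int:
        # Number of characters a left-to-right comparison matches before it
        # can return "not equal".
        i = 0
        while i < min(len(a), len(b)) and a[i] == b[i]:
            i += 1
        return i

    print(shared_prefix_length("http://example.com/a", "http://example.com/b"))  # 19
    print(shared_prefix_length("one", "two"))                                    # 0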

If you have long strings, a tendency for many strings to share the same starting characters, and extreme performance requirements, you can consider hashing the strings, comparing the hashes first, and only doing a linear comparison of the strings when the hashes match (to rule out the possibility of a hash collision). Since the hashes are much shorter than the long strings, doing the initial comparison on hashes may also let you reduce the I/O and RAM requirements of the system if you design your query strategy carefully.
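
A minimal sketch of the hash-first idea in Python, assuming you precompute and store a digest alongside each long string (the function and parameter names here are made up for illustration):

    import hashlib

    def digest(s: str) -> bytes:
        # Short, fixed-size value to store next to each long string in the DB.
        return hashlib.sha256(s.encode("utf-8")).digest()

    def same_string(a: str, a_digest: bytes, b: str, b_digest: bytes) -> bool:
        if a_digest != b_digest:   # cheap fixed-size compare rules out most mismatches
            return False
        return a == b              # full comparison only when the digests match,
                                   # to rule out a hash collision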

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Abhay Patil
Solution 2: Eric J.