'k-diff sequences in a float array

Looking for an algorithm to find longest sequences (pairs, triplets, up to quadruplets) that are separated by a constant, non-integer difference k in a sorted array arr of floats. Is there an O(n) or better solution?

find_sequences(arr=[1.20, 2.00, 2.20, 2.31, 3.09, 3.43, 4.20, 5.30], k=1.10, tol=0.01)
# with tolerance of 1% of k, or 0.011, first sequence includes 2.31 but not 3.43
# [[1.20, 2.31], [2.00, 3.09, 4.20, 5.30]]

find_sequences(arr=[1.20, 2.00, 2.20, 2.31, 3.00, 3.10, 3.43, 4.20, 5.30], k=1.10, tol=0.02)
# tolerance of 2% allows in 3.43
# [[1.20, 2.31, 3.43], [2.00, 3.10, 4.20, 5.30]]

# alternatively, return indices - as you can see they're overlapping:
# [[0, 3, 6], [1, 5, 7, 8]]

Tolerance seems to be easy to implement through __eq__ constructor with np.isclose(), not worried too much about that. Mainly wondering if there's a one-pass solution.

There is distant similarity to Leetcode's #532 (K-diff Pairs in an Array) https://leetcode.com/problems/k-diff-pairs-in-an-array/

Thus far I came up with this pretty slow pandas solution.

def find_series(s, delta, btol, utol):
    """Finds delta-diff sequences in a float array.
    
    Algorithm:
    1) find all matching pairs (M0, M1)
    2) recursively find longer sequences.
    """
    
    # step 1: find all matching pairs
    m01 = []
    for idx, val in s.items():
        lower, upper = val + delta - btol, val + delta + utol
        is_match = s[idx:].between(lower, upper)
        if sum(is_match) == 1:
            m01.append([idx, is_match.idxmax()])
        elif sum(is_match) > 1:  # starting series and tolerances are picked to not allow this to happen
            print(f'multiple matches for {idx}:{val}')
    
    m01 = np.array(m01)  # np.append / np.vstack are slower
    res = pd.DataFrame(data={
        'M0': s[m01[:,0]].values,
        'M1': s[m01[:,1]].values,
    })
    
    # check if M1 values are found in M0 column
    next_ = res['M0'].isin(res['M1'])
    n_matches = sum(next_)
    if n_matches == 0:
        return
    
    
    # step 2: recursion
    next_map = res[next_].set_index('M0')['M1'].to_dict()
    i = 2
    while True:
        next_col = res[f'M{i-1}'].map(next_map)
        n_matches = next_col.notna().sum()
        if n_matches > 0:
            res[f'M{i}'] = next_col
            i += 1
        else:
            break

    return res[~next_].to_numpy()

find_series(a, 1.1, 0.02, 0.02)

returns:

array([[1.2 , 2.31, 3.43,  nan],
       [2.  , 3.09, 4.2 , 5.3 ]])

Timing on a bigger dataset

|    n |   time(ms) |
|-----:|-----------:|
|  200 |         82 |
|  400 |        169 |
|  800 |        391 |
| 1600 |        917 |
| 3200 |       2500 |


Solution 1:[1]

Solution : Given an unsorted integer array, print all pairs with a given difference k in it.

A naive solution would be to consider every pair in a given array and return if the desired difference is found. The time complexity of this solution would be O(n^2), where n is the size of the input.

Naive Approach :

We can use a set to solve this problem in linear time. The idea is to insert each array element arr[i] into a set. We also check if element (arr[i] - diff) or (arr[i] + diff) already exists in the set or not. If the element is seen before, print the pair (arr[i], arr[i] - diff) or (arr[i] + diff, arr[I]).

So the code for it will

# This method does not handle duplicates in the list
def findPair(A, diff):
 
    # list is unsorted
 
    # take an empty set
    s = set()
 
    # do for every element in the list
    for i in A:
 
        # check if pair with the given difference `(i, i-diff)` exists
        if i - diff in s:
            print((i, i - diff))
 
        # check if pair with the given difference `(i + diff, i)` exists
        if i + diff in s:
            print((i + diff, i))
 
        # insert the current element into the set
        s.add(i)
 
 
if __name__ == '__main__':
 
    A = [1, 5, 2, 2, 2, 5, 5, 4]
    diff = 3
 
    findPair(A, diff)

As we have seen that we can't handle duplicates with above function then How can we handle it ?

So usually we can handle duplicate pairs by sorting them into array first and then skipping similar adjacent element.

# This method handles duplicates in the list
def findPair(A, diff):
 
    # sort list in ascending order
    A.sort()
 
    # take an empty set
    s = set()
 
    # do for every element in the list
    i = 0
    while i < len(A):
 
        # to avoid printing duplicates (skip adjacent duplicates)
        while i + 1 < len(A) and A[i] == A[i + 1]:
            i = i + 1
 
        # check if pair with the given difference `(A[i], A[i]-diff)` exists
        if A[i] - diff in s:
            print((A[i], A[i] - diff))
 
        # check if pair with the given difference `(A[i]+diff, A[i])` exists
        if A[i] + diff in s:
            print((A[i] + diff, A[i]))
 
        # insert the current element into the set
        s.add(A[i])
 
        i = i + 1
 
 
if __name__ == '__main__':
 
    A = [1, 5, 2, 2, 2, 5, 5, 4]
    diff = -3
 
    findPair(A, diff)

Time Complexity = O(n.log(n))

Space Complexity = O(n) where n is size of the input.

Solution 2:[2]

Yes, this can be done with sweep line technique in O(nlog(n)). Suppose that from a number x, I am able to go to a number y if x + a <= y <= x + b. This is a generalization of the problem you have stated.

The idea is this: Create events of types 1, 2 and 3, for each number x. Type 1 event for x occurs at position x, and indicates that we should process x with respect to the currently available numbers. Type 2 event for x occurs at position x + a, and it indicates we should now include x in the set of currently available numbers. As you suspect, type 3 event for x occurs at position x + b, and it indicates we should remove x from currently available numbers.

When processing x, the numbers that are currently available will all be smaller than x. So the point is that every number currently available could go from itself to x. When we process a number, we also give determine the maximum length chain that can lead to the number. So for every number before x, we know how many numbers led up to it optimally, which means that for everything in currently available set, we also know the answer. So we take max answer across everything in currently available set, add one to it, and set that as answer to x.

The log factor comes from the fact that we have to sort the events. The following code works on your sample.

  int main() {
    vector<double> arr{1.20, 2.00, 2.20, 2.31, 3.00, 3.10, 3.43, 4.20, 5.30};
    int n = arr.size();
    double tol = 0.01;
    double a = 1.1 * (1 - tol), b = 1.1 * (1 + tol);
    vector<pair<double, pair<double, int>>> events;
    for (int i = 0; i < n; ++i) {
        double x = arr[i];
        events.push_back({x, {1, i}});
        events.push_back({x + a, {2, i}});
        events.push_back({x + b, {3, i}});
    }
    sort(events.begin(), events.end());
    multiset<pair<int, int>> avail;  // set of pairs of answer, and index, for each currently available element
    vector<int> ans(n, 0), prev(n, -1);

    for (auto ev : events) {
        int type = ev.second.first, idx = ev.second.second;
        if (type == 1) { // process x
            if (avail.size()) {
                ans[idx] = 1 + avail.rbegin()->first;  // largest currently available answer
                prev[idx] = avail.rbegin()->second;
            } else ans[idx] = 1;
        } else if (type == 2) {  // add in x
            avail.insert({ans[idx], idx});
        } else if (type == 3) {  // remove x
            avail.erase(avail.lower_bound({ans[idx], idx}));
        }
    }
    int best = 0, pos = -1;
    for (int i = 0; i < n; ++i)
        if (ans[i] > best) {
            best = max(ans[i], best);
            pos = i;
        }
    vector<double> vals;
    while (pos != -1) {
        vals.push_back(arr[pos]);
        pos = prev[pos];
    }
    sort(vals.begin(), vals.end());
    for (auto x : vals) cout << x << ", ";
    cout << endl;
}

Note that this just finds the longest sequence that satisfies the constraints. I am a little confused when looking at your mention of pairs, triplets and quadruplets, because if you wanted to find all quadruplets, there could be O(n^4) of them with large enough tol and close enough values in arr.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Anushka Pandhere
Solution 2 Jerry Halisberry