How to compare 1,000 images using the available memory efficiently

This is a tough problem. I have around 1,000 images stored on my disk, and I want to find images that are similar to each other by comparing them in pairs. So I have to do around 1,000 * 999 / 2 = 499,500 comparisons (the property of "being similar" is not transitive). My problem is not how to compare the images, but how to manage the memory of my machine efficiently during the comparisons. I have already implemented the comparison function:

static bool AreSimilar(ImageInfo x, ImageInfo y)
{
    // Logic
}

...where ImageInfo is a class that holds the information for one image:

class ImageInfo : IDisposable
{
    public string Path { get; init; }
    public System.Drawing.Image Image { get; init; }
    public void Dispose() => Image.Dispose();
}

Ideally I would like to load all 1,000 images in memory, and then do a nested loop and invoke the AreSimilar method for each pair, but the memory needed for loading all of them at once exceeds by far the available memory of my machine. The image files are quite large, and their size varies considerably (most of them have sizes between 5 and 50 MB). The available RAM is 2 GB, so I can't have more than ~80 images loaded at the same time. Loading an image from the disk is quite slow. It is actually a lot slower to load two images from the disk than to compare them and find whether they are similar.
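As a rough sanity check on the ~80 figure: taking the midpoint of the 5-50 MB range as an average file size, the 2 GB budget divides out to about 80 concurrently loaded images. A trivial sketch of that arithmetic (the 25 MB average is an assumption, not a measured value):

```csharp
using System;

class CapacityEstimate
{
    static void Main()
    {
        long maxConcurrentSize = 2_000_000_000; // the 2 GB memory budget
        long averageImageSize = 25_000_000;     // assumed midpoint of the 5-50 MB range
        long capacity = maxConcurrentSize / averageImageSize;
        Console.WriteLine(capacity); // 80
    }
}
```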

My question is how can I implement a method that will have the responsibility of loading/unloading the images from the disk, and yielding them in pairs, while taking advantage of all the available memory, but without exceeding the memory limit. Here is the signature of the method that I am trying to implement:

static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
    IReadOnlyList<TSource> source,
    Func<TSource, long> sizeSelector,
    Func<TSource, TItem> itemLoader,
    long maxConcurrentSize) where TItem : IDisposable;

The TSource will be the path of the file, and the TItem will be an ImageInfo. I am intending to use it like this:

string[] paths = Directory.GetFiles(@"C:\Images", "*.jpg");
var pairs = GetPairs(paths,
    path => new FileInfo(path).Length,
    path => new ImageInfo() { Path = path, Image = Image.FromFile(path) },
    2_000_000_000);
foreach (var (x, y) in pairs)
{
    if (AreSimilar(x, y))
        Console.WriteLine($"{x.Path} and {y.Path} are similar!");
}

I am currently out of ideas about how to implement this method. It looks like a serious undertaking. All I have right now is the simple version below, which loads the images in pairs and ignores the sizeSelector and maxConcurrentSize parameters:

static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
    IReadOnlyList<TSource> source,
    Func<TSource, long> sizeSelector,
    Func<TSource, TItem> itemLoader,
    long maxConcurrentSize) where TItem : IDisposable
{
    for (int i = 0; i < source.Count; i++)
    {
        using var first = itemLoader(source[i]);
        for (int j = i + 1; j < source.Count; j++)
        {
            using var second = itemLoader(source[j]);
            yield return (first, second);
        }
    }
}

Obviously the performance is terrible, since each image is loaded ~500 times on average.
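To put a number on that: instrumenting the naive version with a counting loader (using a hypothetical DummyItem in place of ImageInfo) shows that for 1,000 items the loader runs 1,000 times in the outer loop plus 499,500 times in the inner loop, i.e. 500,500 times in total:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class NaiveLoadCount
{
    // Trivial IDisposable stand-in for ImageInfo (hypothetical).
    sealed class DummyItem : IDisposable { public void Dispose() { } }

    // The naive GetPairs from above, reproduced for the measurement.
    static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
        IReadOnlyList<TSource> source,
        Func<TSource, long> sizeSelector,
        Func<TSource, TItem> itemLoader,
        long maxConcurrentSize) where TItem : IDisposable
    {
        for (int i = 0; i < source.Count; i++)
        {
            using var first = itemLoader(source[i]);
            for (int j = i + 1; j < source.Count; j++)
            {
                using var second = itemLoader(source[j]);
                yield return (first, second);
            }
        }
    }

    static void Main()
    {
        int loads = 0;
        var source = Enumerable.Range(0, 1000).ToArray();
        int pairs = GetPairs(source, _ => 0L,
            _ => { loads++; return new DummyItem(); }, 0).Count();
        // 499,500 pairs but 500,500 loader invocations: each image is loaded
        // once per outer iteration and ~500 times on average overall.
        Console.WriteLine($"{pairs} pairs, {loads} loader invocations");
    }
}
```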



Solution 1:[1]

I managed to come up with a compact algorithm that solves this quite difficult problem. The algorithm maintains 5 pieces of information as state:

  1. A list with the items that are currently loaded.
  2. The total size of the currently loaded items.
  3. A queue with the items that are not currently loaded.
  4. An array containing the number of pairs that are remaining for each item.
  5. A BitArray storing whether any particular pair has been emitted.

The algorithm moves items between the two collections loaded and unloaded, loads and unloads items in accordance with the maxConcurrentSize policy, yields as many pairs from the loaded list as possible, and continues until both collections are empty. At that point all possible pairs are expected to have been emitted, provided that the algorithm is correct.

public static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
    IReadOnlyList<TSource> source,
    Func<TSource, long> sizeSelector,
    Func<TSource, TItem> itemLoader,
    long maxConcurrentSize) where TItem : IDisposable
{
    List<(int Index, TItem Item, long Size)> loaded = new();
    try
    {
        long loadedSumSize = 0L;
        Queue<int> unloaded = new(Enumerable.Range(0, source.Count));
        int[] pending = new int[source.Count];
        for (int i = 0; i < pending.Length; i++) pending[i] = source.Count - 1;
        BitArray emittedPairs = new(checked(source.Count * (source.Count - 1) >> 1));
        while (unloaded.Count > 0)
        {
            // Select the next item to load, in FIFO order.
            int indexToLoad = unloaded.Dequeue();
            long sizeToLoad = Math.Max(0L, sizeSelector(source[indexToLoad]));

            // Unload already loaded items until there is enough space for the
            // new item, in LIFO order.
            while (loaded.Count > 1 && loadedSumSize + sizeToLoad > maxConcurrentSize)
            {
                var toUnload = loaded[loaded.Count - 1];
                loaded.RemoveAt(loaded.Count - 1);
                toUnload.Item.Dispose();
                unloaded.Enqueue(toUnload.Index);
                loadedSumSize -= toUnload.Size;
            }

            // Load the new item.
            var itemToLoad = itemLoader(source[indexToLoad]);
            loaded.Add((indexToLoad, itemToLoad, sizeToLoad));
            checked { loadedSumSize += sizeToLoad; }
            var justLoaded = loaded[loaded.Count - 1];

            // Yield pairs constituted of the new item with all already loaded items.
            for (int i = 0; i < loaded.Count - 1; i++)
            {
                var existing = loaded[i];
                Debug.Assert(existing.Index != justLoaded.Index);

                // Switch the order in case the pair is not in the correct order.
                var (x, y) = existing.Index < justLoaded.Index ?
                    (existing, justLoaded) : (justLoaded, existing);

                // Emit the pair if it's not already emitted.
                int emittedIndex = (y.Index * (y.Index - 1) >> 1) + x.Index;
                if (emittedPairs[emittedIndex]) continue;
                yield return (x.Item, y.Item);
                emittedPairs[emittedIndex] = true;
                pending[x.Index]--;
                pending[y.Index]--;
            }

            // Unload all items that are completed.
            for (int i = loaded.Count - 1; i >= 0; i--)
            {
                var (index, item, size) = loaded[i];
                if (pending[index] > 0) continue;
                Debug.Assert(pending[index] == 0);
                loaded.RemoveAt(i);
                item.Dispose();
                loadedSumSize -= size;
            }
        }

        // If the algorithm is correct, these final assertions should never fail.
        Debug.Assert(loaded.Count == 0);
        Debug.Assert(loadedSumSize == 0L);
        Debug.Assert(pending.All(n => n == 0));
        Debug.Assert(emittedPairs.Cast<bool>().All(b => b));
    }
    finally
    {
        // Ensure that in case the enumerable is enumerated only partially,
        // all currently loaded items will be disposed.
        foreach (var entry in loaded) entry.Item.Dispose();
    }
}
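A note on the emittedIndex formula in the implementation above: it maps each pair (x, y) with x < y to a unique slot in the BitArray using the triangular-number layout, where pair (x, y) lands at y*(y-1)/2 + x. A small sketch verifying that the mapping is collision-free and exactly fills the n*(n-1)/2 slots for a small n:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class TriangularIndexDemo
{
    static void Main()
    {
        int n = 10;
        var seen = new HashSet<int>();
        for (int y = 1; y < n; y++)
            for (int x = 0; x < y; x++)
            {
                // Same formula as emittedIndex in the solution above.
                int index = (y * (y - 1) >> 1) + x;
                if (!seen.Add(index)) throw new Exception("collision");
            }
        // All n*(n-1)/2 indices are distinct and fill 0 .. n*(n-1)/2 - 1.
        Console.WriteLine(seen.Count == n * (n - 1) / 2); // True
        Console.WriteLine(seen.Max() == n * (n - 1) / 2 - 1); // True
    }
}
```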

I measured the efficiency of this implementation by writing a simulation of the original problem:

  1. 1,000 files, each having size between 5 and 50 MB.
  2. maxConcurrentSize equal to 2 GB.

var source = Enumerable.Range(1, 1000).ToArray();
var random = new Random(0);
var sizes = source.ToDictionary(x => x,
    _ => (long)random.Next(5_000_000, 50_000_000));
int loaderInvokedCount = 0;
var query = GetPairs(source, x => sizes[x], x =>
{
    loaderInvokedCount++;
    return new MemoryStream();
}, 2_000_000_000);
var emittedPairsCount = query.Count();
Console.WriteLine($"Emitted {emittedPairsCount:#,0} pairs");
Console.WriteLine($"Loader invoked {loaderInvokedCount:#,0} times");

Output:

Emitted 499,500 pairs
Loader invoked 7,682 times


The naive implementation presented as part of the question invokes the loader 500,500 times for the same input, which is 65 times more than this implementation. Achieving a 98.5% reduction is quite a satisfactory outcome.

Solution 2:[2]

Here is a solution that works for your problem, with an included unit test. Unfortunately it performs poorly when only a small number of images can be loaded at a time, resulting in at worst twice as many loads as your proposed solution. However, when a large number of images can be loaded at a time, this algorithm begins to outperform yours, tending towards 1 load per image as the allowable memory size increases.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace UnitTest;


[TestClass]
public class TestComparison
{
    [TestMethod]
    public void Test()
    {
        const int imageCount = 10;
        var (totalLoads, max, pairs) = RunTest(1, 5, 1000, imageCount);
        Assert.AreEqual(10, totalLoads);
        Assert.AreEqual(1, max);

        (_, max, var pairs2) = RunTest(5, 5, 12, imageCount);
        Assert.AreEqual(9, max);

        var expectedPairs = (imageCount - 1) * imageCount / 2;
        Assert.AreEqual(expectedPairs, pairs.Count);
        Assert.AreEqual(expectedPairs, pairs2.Count);
    }

    private (int totalLoads, int maxLoads, List<(ImageInfo, ImageInfo)> pairs) RunTest(int minImageSize, int maxImageSize, long memorySize, int imageCount)
    {
        var images = GenerateImages(imageCount, minImageSize, maxImageSize);
        var paths = images.Keys.ToList();
        var imageLoadCounts = new Dictionary<string, int>();
        var pairs = GetPairs(paths, p => images[p].Size, LoadImage, memorySize).ToList();

        var totalLoads = imageLoadCounts.Values.Sum();
        var maxLoad = imageLoadCounts.Values.Max();
        return new(totalLoads, maxLoad, pairs);

        ImageInfo LoadImage(string path)
        {
            if (!imageLoadCounts.TryGetValue(path, out int count))
            {
                count = 0;
            }

            count++;
            imageLoadCounts[path] = count;

            return images[path];
        }
    }

    private Dictionary<string, ImageInfo> GenerateImages(int imageCount, int minImageSize, int maxImageSize)
    {
        var images = new Dictionary<string,ImageInfo>();
        for (int i = 0; i < imageCount; i++)
        {
            images[RandomString()] = new() { Value = _random.NextSingle(), Size = _random.Next(minImageSize, maxImageSize) };
        }

        return images;
    }

    class ImageInfo : IDisposable
    {
        public int Size { get; set; }
        public float Value { get; set; }

        public void Dispose()
        {
        }
    }

    private static readonly Random _random = new();


    static string RandomString()
    {
        const int length = 10;
        var str = string.Empty;
        for (int i = 0; i < length; i++)
        {
            str += (char)_random.Next(65, 91); // 'A'..'Z' (the upper bound is exclusive)
        }

        return str;
    }



    static bool AreSimilar(ImageInfo x, ImageInfo y) => Math.Abs(y.Value - x.Value) < 0.1f;
    record Comparison<T>(T Path1, T Path2)
    {
        public bool Contains(T path) => Path1.Equals(path) || Path2.Equals(path);


        public T Other(T path)
        {
            if (Path1.Equals(path)) return Path2;
            if (Path2.Equals(path)) return Path1;
            throw new Exception("invalid path");
        }

        public bool IsEquivalentTo(Comparison<T> other) => (other.Path1.Equals(Path1) && other.Path2.Equals(Path2)) ||
                                                           (other.Path2.Equals(Path1) && other.Path1.Equals(Path2));
    }
    static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
        IReadOnlyList<TSource> source,
        Func<TSource, long> sizeSelector,
        Func<TSource, TItem> itemLoader,
        long maxConcurrentSize) where TItem : IDisposable
    {

        var itemCount = source.Count;
        var comparisons = new List<Comparison<TSource>>(itemCount * itemCount / 2);
        for (int i = 0; i < itemCount - 1; i++)
        {
            for (int j = i + 1; j < itemCount; j++)
            {
                comparisons.Add(new(source[i], source[j]));
            }
        }

        return GetPairs(comparisons,sizeSelector,itemLoader,maxConcurrentSize);
    }

    static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
        List<Comparison<TSource>> remainingComparisons,
        Func<TSource, long> sizeSelector,
        Func<TSource, TItem> itemLoader,
        long maxConcurrentSize) where TItem : IDisposable
    {
        if (!remainingComparisons.Any()) yield break;
        // Load as many images as possible from the remaining comparisons.
        var images = LoadImages(remainingComparisons, sizeSelector, itemLoader, maxConcurrentSize);
        var imageCount = images.Count;
        for (int i = 0; i < imageCount - 1; i++)
        {
            var (path1, image1) = images[i];
            for (int j = i + 1; j < imageCount; j++)
            {
                var (path2, image2) = images[j];
                yield return new(image1, image2);
                var comparison = new Comparison<TSource>(path1, path2);
                remainingComparisons.RemoveAll(c => c.IsEquivalentTo(comparison));
            }
        }

        //dispose
        foreach (var image in images.Select(i => i.Image))
        {
            image.Dispose();
        }

        images = null; // allow GC
        foreach (var pair in GetPairs(remainingComparisons,sizeSelector,itemLoader,maxConcurrentSize))
        {
            yield return pair;
        }
    }

    /// <summary>
    /// Loads as many images into memory as possible from the remaining comparison list
    /// </summary>
    static List<(TSource Path, TItem Image)> LoadImages<TSource, TItem>(
        List<Comparison<TSource>> remainingComparisons,
        Func<TSource, long> sizeSelector,
        Func<TSource, TItem> itemLoader,
        long maxConcurrentSize)
    {
        var availableMemory = maxConcurrentSize;
        remainingComparisons = remainingComparisons.ToList(); // copy the list so we can alter it in local scope
        var loadedImages = new List<(TSource Path, TItem Image)>();
        if (!TryGetSeed(out var seed)) throw new Exception("One of the images is too large to load into memory");
        while (remainingComparisons.Any())
        {
            if (!remainingComparisons.TryGetFirst(c => c.Contains(seed),out var comparison ))
            {
                if (!TryGetSeed(out seed)) break;
                continue;
            }

            var other = comparison.Other(seed);
            remainingComparisons.Remove(comparison);
            if (!TryLoad(other)) break;

        }

        return loadedImages;

        bool TryLoad(TSource path)
        {
            var fileSize = sizeSelector(path);
            if (fileSize > availableMemory) return false;
            loadedImages.Add(new(path, itemLoader(path)));
            availableMemory -= fileSize;
            return true;
        }

        bool TryGetSeed(out TSource seed)
        {
            //first, remove any comparisons that are relevant to the current loaded list
            var loadedImageCount = loadedImages.Count;
            for (int i = 0; i < loadedImageCount-1; i++)
            {
                for (int j = i + 1; j < loadedImageCount; j++)
                {
                    var localComparison = new Comparison<TSource>(loadedImages[i].Path, loadedImages[j].Path);
                    remainingComparisons.RemoveAll(c => c.IsEquivalentTo(localComparison));
                }
            }

            if (!remainingComparisons.TryGetFirst(out var firstRemaining))
            {
                seed = default;
                return false;
            }

            seed = firstRemaining.Path1;
            return TryLoad(seed);
        }

  


    }
}
public static class Extensions
{
    public static bool TryGetFirst<T>(this IEnumerable<T> items, out T value) =>
        items.TryGetFirst(t => true, out value);
    public static bool TryGetFirst<T>(this IEnumerable<T> items, Predicate<T> condition, out T value)
    {
        foreach (var item in items)
        {
            if (condition(item))
            {
                value = item;
                return true;
            }
        }
        value = default;
        return false;
    }
}

For the sake of answering your question, time complexity was ignored. The current implementation of TryGetSeed makes the time complexity O(N^3), but this can easily be improved. I suspect the same algorithm could be written in O(N^2) time, which is the best possible time complexity for this problem.
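One plausible direction towards O(N^2), not part of the original answer: the O(N) RemoveAll and TryGetFirst scans over the comparison list could be replaced with per-pair bookkeeping, for example a HashSet of completed pairs keyed by their two endpoints, so membership checks and removals become O(1) amortized. A minimal sketch of the key encoding (assuming integer item indices rather than paths):

```csharp
using System;
using System.Collections.Generic;

class PairBookkeeping
{
    // Encode an unordered pair of non-negative indices (i, j), i != j,
    // into a single long key, so {i, j} and {j, i} collide on purpose.
    static long PairKey(int i, int j)
    {
        int lo = Math.Min(i, j), hi = Math.Max(i, j);
        return ((long)hi << 32) | (uint)lo;
    }

    static void Main()
    {
        var done = new HashSet<long>();
        // Marking and re-checking a pair is O(1) amortized, versus
        // the O(N) list scans in the solution above.
        done.Add(PairKey(3, 7));
        Console.WriteLine(done.Contains(PairKey(7, 3))); // True: order-insensitive
        Console.WriteLine(done.Contains(PairKey(3, 8))); // False
    }
}
```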

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source

Solution 1: Theodor Zoulias
Solution 2: (unattributed)