Cache-friendliness: std::list vs std::vector

With CPU caches becoming better and better, std::vector usually outperforms std::list even in scenarios that play to std::list's strengths. For this reason, even in situations where I need to delete/insert in the middle of the container I usually pick std::vector, but I realized I had never actually tested whether this assumption holds. So I set up some test code:

#include <iostream>
#include <chrono>
#include <list>
#include <vector>
#include <random>

void TraversedDeletion()
{
    std::random_device dv;
    std::mt19937 mt{ dv() };
    std::uniform_int_distribution<> dis(0, 100000000);

    std::vector<int> vec;
    for (int i = 0; i < 100000; ++i)
    {
        vec.emplace_back(dis(mt));
    }

    std::list<int> lis;
    for (int i = 0; i < 100000; ++i)
    {
        lis.emplace_back(dis(mt));
    }

    {
        std::cout << "Traversed deletion...\n";
        std::cout << "Starting vector measurement...\n";

        auto now = std::chrono::system_clock::now();
        auto index = vec.size() / 2;
        auto itr = vec.begin() + index;
        for (int i = 0; i < 10000; ++i)
        {
            itr = vec.erase(itr);
        }

        std::cout << "Took " << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - now).count() << " μs\n";
    }

    {
        std::cout << "Starting list measurement...\n";

        auto now = std::chrono::system_clock::now();
        auto index = lis.size() / 2;
        auto itr = lis.begin();
        std::advance(itr, index);
        for (int i = 0; i < 10000; ++i)
        {
            auto it = itr;
            std::advance(itr, 1);
            lis.erase(it);
        }

        std::cout << "Took " << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - now).count() << " μs\n";
    }

}

void RandomAccessDeletion()
{
    std::random_device dv;
    std::mt19937 mt{ dv() };
    std::uniform_int_distribution<> dis(0, 100000000);

    std::vector<int> vec;
    for (int i = 0; i < 100000; ++i)
    {
        vec.emplace_back(dis(mt));
    }

    std::list<int> lis;
    for (int i = 0; i < 100000; ++i)
    {
        lis.emplace_back(dis(mt));
    }

    std::cout << "Random access deletion...\n";
    std::cout << "Starting vector measurement...\n";
    std::uniform_int_distribution<> vect_dist(0, vec.size() - 10000);

    auto now = std::chrono::system_clock::now();

    for (int i = 0; i < 10000; ++i)
    {
        auto rand_index = vect_dist(mt);
        auto itr = vec.begin();
        std::advance(itr, rand_index);
        vec.erase(itr);
    }

    std::cout << "Took " << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - now).count() << " μs\n";

    std::cout << "Starting list measurement...\n";

    now = std::chrono::system_clock::now();

    for (int i = 0; i < 10000; ++i)
    {
        auto rand_index = vect_dist(mt);
        auto itr = lis.begin();
        std::advance(itr, rand_index);
        lis.erase(itr);
    }

    std::cout << "Took " << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - now).count() << " μs\n";
}

int main()
{
    RandomAccessDeletion();
    TraversedDeletion();
    std::cin.get();
}

All results are compiled with /O2 (Maximize speed).

The first, RandomAccessDeletion(), generates a random index and erases at that index, 10,000 times. My assumptions were right: the vector is indeed a lot faster than the list:

Random access deletion...

Starting vector measurement...

Took 240299 μs

Starting list measurement...

Took 1368205 μs

The vector is about 5.6x faster than the list. We can most likely thank our cache overlords for this performance benefit: even though we need to shift the elements in the vector on every deletion, its impact is smaller than the lookup time of the list, as we can see in the benchmark.


So then I added another test, seen in TraversedDeletion(). It doesn't delete at randomized positions; instead it picks an index in the middle of the container, uses that as a base iterator, and then traverses the container, erasing 10,000 times.

My assumption was that the list would only slightly outperform the vector, or be about as fast.

The results for the same execution:

Traversed deletion...

Starting vector measurement...

Took 195477 μs

Starting list measurement...

Took 581 μs

Wow. The list is about 336x faster. This is really far off from my expectations. So a few cache misses in the list don't seem to matter at all here; cutting out the lookup time for the list weighs in far more.


So the list apparently still has a really strong position when it comes to performance for corner/unusual cases, or are my test cases flawed in some way?

Does this mean that the list is nowadays only a reasonable option for lots of insertions/deletions in the middle of a container while traversing it, or are there other cases?

Is there a way I could change the vector access & erasure in TraversedDeletion() to make it at least a bit more competitive with the list?


In response to @BoPersson's comment:

vec.erase(it, it+10000) would perform a lot better than doing 10000 separate deletes.

Changing:

for (int i = 0; i < 10000; ++i)
{
    itr = vec.erase(itr);
}

To:

vec.erase(itr, itr + 10000);

Gave me:

Starting vector measurement...

Took 19 μs

This is a major improvement already.



Solution 1:[1]

In TraversedDeletion you are essentially doing a pop_front, but instead of being at the front you are doing it in the middle. For a linked list this is not an issue: deleting the node is an O(1) operation. Unfortunately, when you do this in the vector it is an O(N) operation, where N is vec.end() - itr. This is because every element from the deletion point forward has to be moved back one position. That is why it is so much more expensive in the vector case.

On the other hand, in RandomAccessDeletion you are constantly changing the deletion point. For the list this means an O(N) operation to traverse to the node to delete and an O(1) operation to delete it, versus an O(1) traversal to find the element and an O(N) operation to shift the elements forward in the vector. The reason these don't come out the same is that traversing from node to node has a much higher constant factor than copying contiguous vector elements.
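The O(N) shift that erase performs on a vector can be made visible with a small counter type. This is a sketch of my own (the `Counted` type and `moves_for_middle_erase` helper are not from the question): it tallies the move-assignments `vector::erase` performs while closing the gap.

```cpp
#include <cassert>
#include <vector>

// Hypothetical counter type: tallies the move-assignments vector::erase
// performs while shifting the tail down over the erased slot.
struct Counted {
    int value;
    static inline int moves = 0;                // C++17 inline static
    explicit Counted(int v) : value(v) {}
    Counted(Counted&& o) noexcept : value(o.value) { ++moves; }
    Counted& operator=(Counted&& o) noexcept { value = o.value; ++moves; return *this; }
};

// Erasing index n/2 from a vector of size n move-assigns the n - n/2 - 1
// trailing elements, which is why each single erase is O(N).
int moves_for_middle_erase(int n) {
    std::vector<Counted> v;
    v.reserve(n);                               // avoid reallocation moves
    for (int i = 0; i < n; ++i) v.emplace_back(i);
    Counted::moves = 0;
    v.erase(v.begin() + n / 2);
    return Counted::moves;
}
```

With n = 100 and the erase at index 50, the 49 elements behind the hole each get move-assigned once.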

Solution 2:[2]

The long duration for the list in RandomAccessDeletion is due to the time it takes to advance from the beginning of the list to the randomly selected element, an O(N) operation.

TraversedDeletion just increments an iterator, an O(1) operation.
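The asymmetry is baked into the iterator categories: vector iterators are random-access, so std::advance is constant-time pointer arithmetic, while list iterators are only bidirectional, so std::advance walks node by node. A quick sketch (the `is_random_access` helper is mine, not from the answer):

```cpp
#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

// True when std::advance on It is O(1) pointer arithmetic rather than an
// O(N) node-by-node walk.
template <class It>
constexpr bool is_random_access =
    std::is_same_v<typename std::iterator_traits<It>::iterator_category,
                   std::random_access_iterator_tag>;

static_assert(is_random_access<std::vector<int>::iterator>);
static_assert(!is_random_access<std::list<int>::iterator>);
```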

Solution 3:[3]

The "fast" part about a vector is "reaching" (traversing to) the element which needs to be accessed. In TraversedDeletion you don't actually traverse much of the vector; you only access the starting element. (The advance-by-one barely registers in the measurement.)

The deletion then takes quite a lot of time (O(n) per erase, so O(n²) when deleting elements one by one) due to shifting the elements in memory. Because each deletion rewrites the memory after the deleted element, you also cannot benefit from prefetching, which is one of the things that normally makes the vector so fast.

I am not sure how much the deletion also invalidates the caches, since the memory beyond the iterator changes on every erase, but this can also have a very big impact on performance.

Solution 4:[4]

In the first test, the list had to traverse to the point of deletion, then delete the entry. The time the list took was in traversing for each deletion.

In the second test, the list traversed once, then repeatedly deleted. The time taken was still in the traversal; the deletion was cheap. Except now we don't repeatedly traverse.

For the vector, traversal is free. Deletion takes time. Randomly deleting an element takes less time than it took for the list to traverse to that random element, so vector wins in the first case.

In the second case, the vector does its hard work many, many more times than the list does its hard work.

But that isn't how you should traverse-and-delete from a vector, even though it is an acceptable way to do it for a list.

The way you'd write this for a vector is std::remove_if, followed by erase. Or just one erase:

  auto index = vec.size() / 2;
  auto itr = vec.begin() + index;
  vec.erase(itr, itr+10000);

Or, to emulate a more complex decision making process involving erasing elements:

  auto index = vec.size() / 2;
  auto itr = vec.begin() + index;
  int count = 10000;
  auto last = std::remove_if( itr, vec.end(),
    [&count](auto&&){
      if (count <= 0) return false;
      --count;
      return true;
    }
  );
  vec.erase(last, vec.end());

Almost the only case where list is way faster than vector is when you store an iterator into the list, and you periodically erase at or near that iterator while still traversing the list between such erase actions.
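A minimal sketch of that pattern (the `erase_every_other` helper is mine, not from the answer): keep walking the list and erase as you go, never re-seeking from begin(). Each erase is O(1), and list::erase hands back the next node so the traversal continues uninterrupted.

```cpp
#include <cassert>
#include <list>

// Erase every other element during a single traversal. list::erase is O(1)
// and returns the iterator after the erased node; no re-seek from begin().
int erase_every_other(std::list<int>& l) {
    int erased = 0;
    for (auto it = l.begin(); it != l.end(); ) {
        it = l.erase(it);            // O(1); other iterators stay valid
        ++erased;
        if (it != l.end()) ++it;     // skip over one surviving element
    }
    return erased;
}
```

A vector doing the same one-at-a-time deletions would shift its tail on every erase.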

Almost every other use case has a vector use-pattern that matches or exceeds list performance in my experience.

The code cannot always be translated line-for-line, as you have demonstrated.


Every time you erase an element from a vector, it moves the "tail" of the vector over by 1.

If you erase 10,000 elements with one range erase, it moves the tail of the vector over by 10,000 in a single step.

If you remove_if, it compacts the kept elements efficiently in one pass, leaves the "wasted" elements at the end, and you can then erase the waste from the vector in one step.
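For an arbitrary predicate, the same idea is the classic erase-remove idiom. A sketch under my own naming (`drop_odds` is illustrative, not from the answer): remove_if shifts each kept element at most once, so the whole cleanup is O(n) instead of O(n²).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Erase-remove idiom: remove_if compacts the kept elements to the front in
// one O(n) pass; a single erase then trims the leftover tail.
std::vector<int> drop_odds(std::vector<int> v) {
    v.erase(std::remove_if(v.begin(), v.end(),
                           [](int x) { return x % 2 != 0; }),
            v.end());
    return v;
}
```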

Solution 5:[5]

I want to point out something not yet mentioned in this question:

In std::vector, when you delete an element in the middle, the remaining elements are moved thanks to move semantics. That is one of the reasons the first test is this fast: you are not even copying the elements after the deleted iterator. You could reproduce the experiment with a vector and a list of a non-copyable type and see how the performance of the list, in comparison, improves.
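That claim can be checked with a copy-counting type. This is my own sketch (the `Tracked` type and `copies_for_middle_erase` helper are not from the answer): shifting the tail after an erase should use move-assignments only, so the copy counter stays at zero.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical tracker: counts deep copies so we can verify that
// vector::erase shifts the tail with moves, never copies.
struct Tracked {
    std::string payload;
    static inline int copies = 0;
    explicit Tracked(std::string p) : payload(std::move(p)) {}
    Tracked(const Tracked& o) : payload(o.payload) { ++copies; }
    Tracked(Tracked&&) noexcept = default;
    Tracked& operator=(const Tracked& o) { payload = o.payload; ++copies; return *this; }
    Tracked& operator=(Tracked&&) noexcept = default;
};

// Erase from the middle of 100 elements; the 49 trailing elements are
// shifted by move-assignment, so no copies happen at all.
int copies_for_middle_erase() {
    std::vector<Tracked> v;
    v.reserve(100);                              // avoid reallocation
    for (int i = 0; i < 100; ++i) v.emplace_back(std::to_string(i));
    Tracked::copies = 0;
    v.erase(v.begin() + 50);
    return Tracked::copies;
}
```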

Solution 6:[6]

I would suggest running the same tests with a more complex data type in the std::vector: instead of an int, use a structure.

Even better, use a static C array as the vector element, and then take measurements with different array sizes.

So, you could swap this line of your code:

std::vector<int> vec;

with something like:

const size_t size = 256;
struct TestType { int a[size]; };
std::vector<TestType> vec;

and test with different values of size. The behavior may depend on this parameter.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 NathanOliver
Solution 2 1201ProgramAlarm
Solution 3 Hayt
Solution 4 Yakk - Adam Nevraumont
Solution 5 LeDYoM
Solution 6 Pietro