Why is Merge sort better for large arrays and Quick sort for small ones?

The only reason I see for using merge sort over quick sort is if the list was already (or mostly) sorted.

Merge sort requires more space as it creates an extra array for storing, and no matter what it will compare every item.

Quick sort on the other hand does not require extra space, and doesn't swap or compare more than necessary.

It would seem unintuitive to say that because of large or small data sets one is better than the other.

For example, quoting Geeksforgeeks article on that:

Merge sort can work well on any type of data sets irrespective of its size (either large or small), whereas the quick sort cannot work well with large datasets.

And next it says:

Merge sort is not in place because it requires additional memory space to store the auxiliary arrays, whereas the quick sort is in place as it doesn't require any additional storage.

I understand that space complexity and time complexity are separate things. But the extra array is still an extra step, and of course copying everything into a new array should take more time with large data sets.

As for the pivoting problem, the bigger the data set, the lower the chance of picking the lowest or highest item (unless, again, it's an almost sorted list).

So why is it considered that merge sort is a better way of sorting large data sets instead of quick sort?



Solution 1:[1]

Why is Merge sort better for large arrays and Quick sort for small ones? It would seem unintuitive to say that because of large or small data sets one is better than the other.

Assuming the dataset fits in memory (not paged out), the issue is not the size of the dataset, but a worst-case pattern for a particular implementation of quick sort that results in O(n²) time complexity. Quick sort can use median of medians to guarantee a worst-case time complexity of O(n log(n)), but that ends up making it significantly slower than merge sort. An alternative is to switch to heap sort if the level of recursion becomes too deep; this hybrid is known as introsort and is used in some libraries.

https://en.wikipedia.org/wiki/Median_of_medians

https://en.wikipedia.org/wiki/Introsort
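For illustration, here is a minimal introsort sketch (the names `introsort`, `_isort`, `_partition` and the 2·log₂(n) depth budget are assumptions of this sketch, not taken from any particular library): quicksort runs normally, and any slice whose recursion exceeds the budget is finished with heapsort, capping the worst case at O(n log(n)).

```python
import heapq
import random

def introsort(a):
    """Quicksort with a depth budget of ~2*log2(n); slices that
    exceed it are finished with heapsort (worst case O(n log n))."""
    if a:
        _isort(a, 0, len(a) - 1, 2 * len(a).bit_length())

def _isort(a, lo, hi, depth):
    if hi <= lo:
        return
    if depth == 0:
        _heapsort_slice(a, lo, hi)   # depth budget exhausted
        return
    p = _partition(a, lo, hi)
    _isort(a, lo, p - 1, depth - 1)
    _isort(a, p + 1, hi, depth - 1)

def _partition(a, lo, hi):
    # Lomuto partition around a random pivot.
    r = random.randint(lo, hi)
    a[r], a[hi] = a[hi], a[r]
    pivot, i = a[hi], lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def _heapsort_slice(a, lo, hi):
    # A production introsort heapsorts in place; a temporary heap keeps the sketch short.
    h = a[lo:hi + 1]
    heapq.heapify(h)
    for i in range(lo, hi + 1):
        a[i] = heapq.heappop(h)

data = random.sample(range(10_000), 10_000)
introsort(data)
assert data == sorted(data)
```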

Merge sort requires more space as it creates an extra array for storing, and no matter what it will compare every item.

There are variations of merge sort that don't require any extra storage for data, but they tend to be about 50+% slower than standard merge sort.
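For reference, a minimal sketch of the standard (not in-place) variant, showing that the extra storage is one O(n) auxiliary array allocated once and reused by every merge (names are illustrative):

```python
def merge_sort(a):
    """Top-down merge sort; `aux` is the single O(n) auxiliary array."""
    aux = a[:]                     # the one extra allocation
    _msort(a, aux, 0, len(a))

def _msort(a, aux, lo, hi):
    if hi - lo <= 1:
        return
    mid = (lo + hi) // 2
    _msort(a, aux, lo, mid)
    _msort(a, aux, mid, hi)
    aux[lo:hi] = a[lo:hi]          # copy both sorted halves, merge back
    i, j = lo, mid
    for k in range(lo, hi):
        if i < mid and (j >= hi or aux[i] <= aux[j]):  # <= keeps the sort stable
            a[k] = aux[i]; i += 1
        else:
            a[k] = aux[j]; j += 1

nums = [5, 2, 9, 2, 7, 1]
merge_sort(nums)
assert nums == [1, 2, 2, 5, 7, 9]
```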

Quick sort on the other hand does not require extra space, and doesn't swap or compare more than necessary.

Every element of a sub-array will be compared to the pivot element. As the number of equal elements increases, the Lomuto partition scheme gets worse, while the Hoare partition scheme gets better. With a lot of equal elements, the Hoare partition scheme needlessly swaps equal elements, but the check to avoid those swaps generally costs more time than just swapping.
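A sketch of the two schemes makes the difference concrete (illustrative Python, not any library's code). On an all-equal input, Lomuto's `<` test never fires and each split degenerates to sizes 0 and n−1, while Hoare's converging scans stop at every element equal to the pivot and split near the middle, at the cost of those "needless" swaps:

```python
def lomuto_partition(a, lo, hi):
    # Pivot is a[hi]; equal elements all land on one side, so runs of
    # duplicates push quicksort toward O(n^2).
    pivot, i = a[hi], lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i                        # pivot's final index

def hoare_partition(a, lo, hi):
    # Scans converge from both ends, swapping equal elements across the
    # pivot; duplicate-heavy inputs get split roughly in half.
    pivot = a[(lo + hi) // 2]
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while a[i] < pivot:
            i += 1
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i >= j:
            return j                # recurse on a[lo..j] and a[j+1..hi]
        a[i], a[j] = a[j], a[i]
```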

sorting an array of pointers to objects

Merge sort does more moves but fewer compares than quick sort. If sorting an array of pointers to objects, only the pointers are moved, but comparing two objects requires dereferencing the pointers and doing whatever the object comparison itself needs. In this case, and in other cases where a compare takes more time than a move, merge sort is faster.
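A rough way to observe this is to count comparisons with a wrapper object whose `__lt__` tallies calls, using Python's built-in `sorted` (Timsort, a merge-sort variant) as the merge-sort stand-in; the class and the quicksort here are illustrative, not a benchmark:

```python
import random

class Obj:
    """Stands in for an object reached through a pointer, where the
    comparison is the expensive operation."""
    __slots__ = ("key",)
    compares = 0
    def __init__(self, key): self.key = key
    def __lt__(self, other):
        Obj.compares += 1
        return self.key < other.key

def quicksort(a, lo=0, hi=None):
    # Plain bidirectional-scan quicksort with a random pivot.
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    p = a[random.randint(lo, hi)]
    i, j = lo, hi
    while i <= j:
        while a[i] < p: i += 1
        while p < a[j]: j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1; j -= 1
    quicksort(a, lo, j)
    quicksort(a, i, hi)

data = [Obj(random.random()) for _ in range(100_000)]

Obj.compares = 0
sorted(data)                        # Timsort: merge-sort family
merge_compares = Obj.compares

work = data[:]
Obj.compares = 0
quicksort(work)
quick_compares = Obj.compares

print(merge_compares, quick_compares)  # the merge-based sort typically compares less
```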

large datasets that don't fit in memory

For datasets too large to fit in memory, a memory-based sort is used to sort "chunks" of the dataset that do fit in memory, which are then written to external storage. The "chunks" on external storage are then merged using a k-way merge to produce a sorted dataset.

https://en.wikipedia.org/wiki/External_sorting
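A compact sketch of that pipeline, assuming newline-terminated integer records and an arbitrary `chunk` size; `heapq.merge` performs the k-way merge lazily, holding only one record per run in memory:

```python
import heapq
import os
import tempfile
from itertools import islice

def external_sort(infile, outfile, chunk=100_000):
    """Sort a file of integers (one per line) that may not fit in memory:
    sort chunks in RAM, spill sorted runs to temp files, k-way merge them."""
    runs = []
    with open(infile) as f:
        while True:
            block = list(islice(f, chunk))      # next chunk of lines
            if not block:
                break
            block.sort(key=int)                 # in-memory sort of one run
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(block)
            runs.append(path)
    files = [open(p) for p in runs]
    try:
        with open(outfile, "w") as out:
            out.writelines(heapq.merge(*files, key=int))  # k-way merge
    finally:
        for fh in files:
            fh.close()
        for p in runs:
            os.remove(p)
```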

Solution 2:[2]

I was trying to figure out which sorting algorithm (merge/quick) has better time and memory efficiency as the input data size increases, so I wrote a program that generates a list of random numbers and sorts the list with both algorithms. The program generates 5 txt files that record random numbers of lengths 1M, 2M, 3M, 4M, and 5M (M stands for millions). I got the following results.

[Image: Execution Time in seconds]

[Image: Execution Time in seconds, graphical interpretation]

[Image: Memory Usage in KB]

[Image: Memory Usage in KB, graphical interpretation]

If you want the code, here is the GitHub repo: https://github.com/Nikodimos/Merge-and-Quick-sort-algorithm-using-php

In my scenario, merge sort became more efficient as the file size increased.

Solution 3:[3]

In addition to rcgldr's detailed response I would like to underscore some extra considerations:

  • large and small are quite relative: in many libraries, small arrays (with fewer than 30 to 60 elements) are usually sorted with insertion sort. This algorithm is simpler and optimal if the array is already sorted, albeit with quadratic complexity in the worst case (see the hybrid sketch after this list).

  • in addition to space and time complexities, stability is a feature that may be desirable, even necessary in some cases. Both Merge Sort and Insertion Sort are stable (elements that compare equal remain in the same relative order), whereas it is very difficult to achieve stability with Quick Sort.

  • As you mentioned, Quick Sort has a worst case time complexity of O(N²), and libraries do not implement median of medians to curb this downside. Many just use median of 3 or median of 9, and some recurse naively on both branches, paving the way for stack overflow in the worst case. This is a major problem, as datasets can be crafted to exhibit worst case behavior, leading to denial of service attacks slowing or even crashing servers. This problem was identified by Doug McIlroy in his famous 1999 paper A Killer Adversary for Quicksort. Implementations are available and attacks have been perpetrated using this technique (cf. this discussion).

  • Almost sorted arrays are quite common in practice, and neither Quick sort nor Merge sort treats them really efficiently. Libraries now use more advanced combinations of techniques such as Timsort to achieve better performance and stability.
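A sketch combining several of these points (illustrative, not any library's actual implementation): quicksort with a median-of-3 pivot and an insertion-sort cutoff for small sub-arrays, recursing only on the smaller partition so the stack stays O(log n) even on adversarial input:

```python
import random

CUTOFF = 32   # assumed small-array threshold; real libraries tune this

def insertion_sort(a, lo, hi):
    # O(n^2) worst case, but minimal overhead and O(n) on sorted input.
    for i in range(lo + 1, hi + 1):
        x, j = a[i], i - 1
        while j >= lo and x < a[j]:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def hybrid_quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        if hi - lo < CUTOFF:
            insertion_sort(a, lo, hi)       # small sub-array: hand off
            return
        # Median-of-3 pivot: order a[lo], a[mid], a[hi], pivot on the median.
        mid = (lo + hi) // 2
        if a[mid] < a[lo]: a[lo], a[mid] = a[mid], a[lo]
        if a[hi] < a[lo]: a[lo], a[hi] = a[hi], a[lo]
        if a[hi] < a[mid]: a[mid], a[hi] = a[hi], a[mid]
        p = a[mid]
        i, j = lo, hi
        while i <= j:
            while a[i] < p: i += 1
            while p < a[j]: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1; j -= 1
        # Recurse on the smaller side only; loop on the larger side so the
        # stack depth stays O(log n) even when the splits are lopsided.
        if j - lo < hi - i:
            hybrid_quicksort(a, lo, j)
            lo = i
        else:
            hybrid_quicksort(a, i, hi)
            hi = j

data = [random.random() for _ in range(50_000)]
hybrid_quicksort(data)
assert data == sorted(data)
```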

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1  rcgldr
Solution 2  Nikodimos
Solution 3  chqrlie