'Why is a parallel stream slower than a sequential stream?

I have two functions solving the same problem. The first uses a sequential stream, and the second uses a parallel stream.

public static int digitalRoot(int n) {
    int sum = String.valueOf(n).chars().map(i -> Integer.parseInt(String.valueOf((char) i))).sum();
    if (sum >= 10) {
        return digitalRoot(sum);
    } else {
        return sum;
    }
}

public static int digitalRootParallel(int n) {
    int sum = String.valueOf(n).chars().parallel().
            map(i -> Integer.parseInt(String.valueOf((char) i))).sum();
    if (sum >= 10) {
        return digitalRootParallel(sum);
    } else {
        return sum;
    }
}

I executed these functions once. The parallel function digitalRootParallel() was faster (5 ms) than the sequential (32 ms) digitalRoot().

But when I executed each of them in a loop of 1.000.000, the sequential (117 ms) was faster than the parallel (1124 ms).

    for (int i = 0; i < 1000000; i++) {
        sum = digitalRoot(n);
    }

Why is the loop with the parallel stream slower?



Solution 1:[1]

All I can share is my surprise that you ever measured anything but a slowdown for the parallel version. You give Fork/Join almost no work to do: just to parse and sum up to ten digits. F/J will probably decide it's not even worth splitting it and will execute the whole computation on a single thread, but the overhead involved in this will stifle performance.

If you want to see any benefit of parallelization, then make sure that you have at least half a second's worth of sequential computation and that it is easily splittable into subtasks.

Solution 2:[2]

First of all, your code is … a bit complicated. Instead of

String.valueOf(n).chars().map(i -> Integer.parseInt(String.valueOf((char) i))).sum();

you could write

String.valueOf(n).chars().map(i -> i-'0').sum();

However, even with your original code the operation is so simple that the HotSpot optimizer can transform it to something so short that it executes faster than any starting of a new thread or synchronizing will be.

This applies especially to your benchmark where you perform the operation multiple times without actually using its result. In a pure sequential context, an optimizer can recognize that the result is not used and elide the entire operation if it doesn’t have any side effects. In contrast, a parallel execution or, more abstract, code communicating with other threads has side-effects.

Therefore, the optimizer can’t do the same to operations spanning multiple threads. Nor can the Stream implementation recognize the simplicity of the operation. If you request a parallel execution, the Stream implementation will respect it, assuming you know what you are doing.


Additional remarks:

An interesting property of the current implementation of String.chars() is that it creates a PrimitiveIterator.OfInt that will be wrapped in a Spliterator.OfInt, rather than implementing Spliterator.OfInt directy. This implies that there is no direct splitting capability as PrimitiveIterator.OfInt doesn’t provide that. Instead, the splitting will be performed by iterating over one or more elements, copying them the into an array and provide this array for being processed by a different thread.

But iterating and copying into a temporary array is likely to take more time than the actual operation in a sequential execution.

These poor splitting capabilities are a known issue that will be addressed in Java 9. But still, your operation is unlikely to benefit from parallel execution.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Marko Topolnik
Solution 2