'How to remove Duplicated elements from a List based on Two properties using Java 8 streams?

I'm trying to figure out how to write a stream in Java 8 that removes duplicate records based on a property and a condition, for example:

+----+------------------+--------------+
| ID |       Name       |    Value     |
+----+------------------+--------------+
|  1 | Real Name        | Real Value   |
|  2 | Duplicate Name   | Real Value   |
|  3 | Duplicate Name   | NULL         |
|  4 | Duplicate Name   | NULL         |
|  5 | Real Name 2      | Real Value 2 |
|  6 | Duplicate Name 2 | Real Value   |
|  7 | Duplicate Name 2 | NULL         |
+----+------------------+--------------+

Example object:

Class ExampleObject {
  int id;
  String name;
  String value;
}

It should remove all the duplicate records with the same name and having value == null.

So in this case, it should remove records with ID 3, 4, and 7.

I can solve this by doing a little bit of streams and some for loops, but I don't really like the implementation, knowing it should be possible by having one stream.

How can I resolve this issue?



Solution 1:[1]

One possibility to do what we want is to

A stream-based implementation may look like this:

final TreeSet<ExampleObject> deduped = objects.stream()
    .collect(Collectors.toCollection(() -> new TreeSet<>(
        Comparator.comparing(ExampleObject::getName)
            .thenComparing(ExampleObject::getValue))));

Ideone demo

If we do not like the stream-based approach, we can also solve this with a "traditional", imperative approach:

final TreeSet<ExampleObject> deduped = new TreeSet<>(
        Comparator.comparing(ExampleObject::getName)
            .thenComparing(ExampleObject::getValue));
deduped.addAll(objects);

Ideone demo


A word on performance:

The deduplication does not come "for free". In the solution provided, we pay for it with execution time. TreeSet is an ordered data structure, thus each insert has time complexity O(log(n)). It then follows that constructing a set of size n has time complexity O(n log(n)).

Solution 2:[2]

remove all the duplicate records with the same name and having value == null. So in this case, it should remove records with ID 3, 4, and 7

If I understood your goal correctly, you want to remove only elements having a value of null if there's at least one more element that has the same name.

Meanwhile, all elements with non-null value regarding of their name should not be discarded (therefore elements with id of 2 and 6 in your example should be preserved; and if this assumption is not correct, this behavior of the implementation below can be easily changed).

Also, if there's only one element in a list with a particular name and a value of null. This assumption is based on pure logic: such element can not be considered as a duplicate because it's name attribute is unique.

Collector description

In order to achieve this, I've written a custom collector that can be vaguely compared with Collectors.groupingBy.

Its purpose is to create a map in which keys will be generated by the given keyExtractor function, and values will be represented with the Deque of elements (this data type was chosen in order to facilitate the convenient access to the most recently added element).

The process of populating each deque (which basically is used as a Stack data structure) will be governed with the provided predicate.

So in short, the overall idea is that all elements having a particular name should be placed into a separate stack and control the values of elements that are being added onto the stack.

In order to make the method responsible for creation of the collector to be uniform and reusable, it utilizes generics and expects two parameters mentioned above: a function and a predicate.

An auxiliary map generated by this collector will what the following structure Map<String,Deque<ExampleObject>>. After the intermediate map was created, to obtain the final result its values will be combined together and stored in a list.

How to create a Collector

To create a custom collector, you can utilize the static method Collector.of() which expects the following arguments:

  • Supplier Supplier<A> is meant to provide a mutable container which store elements of the stream. In this case, HashMap will be used as a container.
  • Accumulator BiConsumer<A,T> defines how to add elements into the mutable container provided by the supplier. This functionality was extracted into a separate method tryAdd() which will create a new ArrayDeque if the given key isn't present in the map. And depending on the state of the stack it might discard the provided element or (and) remove the element from the top of the stack.
  • Combiner BinaryOperator<A> combiner() establishes a rule on how to merge two containers obtained while executing stream in parallel. Here, combiner rely on the same logic that was described for accumulator.
  • Finisher Function<A,R> is meant to produce the final result by transforming the mutable container. The finisher of this collector dumps all the values of the intermediate map into a stream and creates a resulting list.
  • Characteristics allow fine-tuning the collector by providing additional information on how it should function. Here a characteristic Collector.Characteristics.UNORDERED is being applied. Which indicates that the order in which partial results of the reduction produced in parallel is not significant, that can improve performance of this collector with parallel streams.

Implementation

The code might look like this:

public static <T, U> Collector<T, ?, List<T>> toFilteredList(Function<T, U> keyExtractor,
                                                             Predicate<T> condition) {
    return Collector.of(
        HashMap::new,
        (Map<U, Deque<T>> map, T next) -> tryAdd(map, next, keyExtractor, condition),
        (left, right) -> merge(left, right, keyExtractor, condition),
        map -> map.values().stream().flatMap(Deque::stream).toList(),
        Collector.Characteristics.UNORDERED);
}

public static <T, U> void tryAdd(Map<U, Deque<T>> map, T next,
                                 Function<T, U> keyExtractor,
                                 Predicate<T> condition) {
    
    Deque<T> stack = map.computeIfAbsent(keyExtractor.apply(next), k -> new ArrayDeque<>());
    
    if      (!stack.isEmpty() && condition.test(stack.peek()))  stack.pop();      // stack is not empty and element on the top has a value of null
    else if (stack.isEmpty() || condition.negate().test(next))  stack.push(next); // stack is empty - then any element could be added, or new element has a non-null value
}

public static <T, U> Map<U, Deque<T>> merge(Map<U, Deque<T>> left, Map<U, Deque<T>> right,
                                            Function<T, U> keyExtractor,
                                            Predicate<T> condition) {
    
    right.forEach((k, v) -> v.forEach(next -> tryAdd(left, next, keyExtractor, condition)));
    
    return left;
}

main() - demo (the code of dummy ExampleObject class is not shown)

public static void main(String[] args) {
    List<ExampleObject> list =
        List.of(new ExampleObject(1, "Real Name", "Real Value"),
            new ExampleObject(2, "Duplicate Name", "Real Value"),
            new ExampleObject(3, "Duplicate Name", null),
            new ExampleObject(4, "Duplicate Name", null),
            new ExampleObject(5, "Real Name 2", "Real Value 2"),
            new ExampleObject(6, "Duplicate Name 2", "Real Value"),
            new ExampleObject(7, "Duplicate Name 2", null));

    List<ExampleObject> result = list.stream()
        .collect(toFilteredList(ExampleObject::getName, // the key of the auxiliary map
                                exampleObject -> exampleObject.getValue() == null)); // condition to determine an element to discard

    result.forEach(System.out::println);
}

Output (elements with ID 3, 4, and 7 were discarded as required)

ExampleObject{id=6, name='Duplicate Name 2', value='Real Value'}
ExampleObject{id=2, name='Duplicate Name', value='Real Value'}
ExampleObject{id=5, name='Real Name 2', value='Real Value 2'}
ExampleObject{id=1, name='Real Name', value='Real Value'}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2