Set Operations on Java Sets and Lists


Do an Internet search on "set operations java" and at the top of the results you're sure to find at least one page telling you to use Collection retainAll( ) and addAll( ) for intersection and union, respectively. Some online sources might note that sets and lists behave differently, but generally they don't explore how the two behave differently, or what it means to do a traditional set operation on a list. And they certainly don't warn you of the unexpected behavior you'll get if you try to perform a retainAll( ) on a list—what you won't get is true set-operation behavior.

On this page I explore these issues in detail. First I give a brief summary to serve as a quick reference for those who don't have time for, or interest in, the nitty-gritty details.

Summary

If you are working with true sets—in which every element is unique—it's slightly advantageous to use Guava's Sets methods.

Listing 1: Snapshot of Guava's Sets.intersect( ) and Sets.union( )

Set<String> set1 = Sets.newHashSet("a", "b", "c");
Set<String> set2 = Sets.newHashSet("c", "d", "e");
    
// Intersection
Set<String> setIntersection = Sets.intersection(set1, set2);
    
// Union
Set<String> setUnion = Sets.union(set1, set2);

If you are working with what are called multisets, or lists where each element may appear more than once, you should use Guava's Multiset methods. Make sure you are not using an old jar file for Guava. (The version in guava-15.0.jar is sufficient.)

Listing 2: Snapshot of Guava's Multisets.intersect( ) and Multisets.union( )

List<String> list1 = Lists.newArrayList("a", "b", "c", "c");
Multiset<String> multiset1 = HashMultiset.create(list1);
List<String> list2 = Lists.newArrayList("c", "d", "e");
Multiset<String> multiset2 = HashMultiset.create(list2);
    
// Union
Multiset<String> multisetUnion = Multisets.union(multiset1, multiset2);

// Intersection
Multiset<String> multisetIntersection = Multisets.intersection(multiset1, multiset2);

Set Operations on True Sets

Now we delve into the details. We start off with the easy case, doing set operations on true sets.

Set operations using Set.retainAll( ) and Set.addAll( )

For true sets, it is possible to use Java's Collection retainAll( ) for intersection and addAll( ) for union. The example below shows two unit tests, the first verifying intersection behavior, the second verifying union.

Listing 3: Using retainAll( ) and addAll( ) for set operations

@Test
public void testTrueSetRetainAll() {
    Set<String> set1 = Sets.newHashSet("a", "b", "c");
    Set<String> set2 = Sets.newHashSet("c", "d", "e");

    // Intersection
    Set<String> setIntersection = Sets.newHashSet(set1);
    setIntersection.retainAll(set2);   
        
    assertEquals(Sets.newHashSet("c"), setIntersection);
}

@Test
public void testTrueSetAddAll() {
    Set<String> set1 = Sets.newHashSet("a", "b", "c");
    Set<String> set2 = Sets.newHashSet("c", "d", "e");
    
    // Union
    Set<String> setUnion = Sets.newHashSet(set1);
    setUnion.addAll(set2);
    
    // Test the results.
    assertEquals(Sets.newHashSet("a", "b", "c", "d", "e"), setUnion);
}

In both cases, first we create the two sets using Guava's Sets.newHashSet( ). If you're not familiar with this, see "Intro to Guava" in the Internal Links at the side. After these sets are created, we instantiate what will be the intersection or union of the two sets by initializing each with all the elements of set1. (This second step isn't necessary if you don't need to keep a copy of the original set1.) Then we perform the intersection or union with the appropriate method, and finally we verify that the new set has the correct elements.

Set Operations with Guava

Guava makes set operations a little easier, and a little more intuitive, because it offers true intersect( ) and union( ) methods.

Listing 4: Using Guava's intersect( ) and union( ) for set operations

@Test
public void testTrueSetGuavaIntersection() {
    Set<String> set1 = Sets.newHashSet("a", "b", "c");
    Set<String> set2 = Sets.newHashSet("c", "d", "e");
    
    // Intersection
    Set<String> setIntersection = Sets.intersection(set1, set2);
    
    assertEquals(Sets.newHashSet("c"), setIntersection);
}

@Test
public void testTrueSetGuavaUnion() {
    Set<String> set1 = Sets.newHashSet("a", "b", "c");
    Set<String> set2 = Sets.newHashSet("c", "d", "e");
    
    // Union
    Set<String> setUnion = Sets.union(set1, set2);
    
    // Test the results.
    assertEquals(Sets.newHashSet("a", "b", "c", "d", "e"), setUnion);
}

Compare this to the native Java version in Listing 3. You see that we don't have to initialize the new intersection or union set with set1's values.

Set Operations on Lists (Multisets)

If I came across someone trying to perform set operations on lists, I would try to account for this odd behavior in one of three possible ways: (1) the person doesn't know what they're doing; (2) the person knows what they're doing, they know that their lists are really sets, and they're sure that as the project develops those lists will never one day have duplicate values, which would invalidate them as sets; or (3) the person is actually, and knowingly, working with what are called multisets.

Briefly, a multiset is a set of objects where order does not matter and it is possible to have more than one of each type of element. For example, {a, b, b, c, d} is a multiset, and it is equivalent to {a, c, d, b, b}. Note that the traditional set {a, b, c} is also a multiset, albeit one which has only one element of each type. Another name for multiset is a bag, as in "a bag of words model of a document."

While multisets are not true sets, it nevertheless makes perfect sense to want to perform set-like operations such as intersect and union on them.

As we saw above, there are two perfectly acceptable ways to perform set operations on true sets—although the Guava approach is slightly preferable to the use of native Java Set methods. In this section we'll see that when it comes to multisets, there really is only one option.

Set operations using List.retainAll( ) and List.addAll( )

This is a long subsection whose upshot will be: Do not use retainAll( ) and addAll( ) for multisets!

Let's start the ball rolling by performing an intersection on a multiset, as implemented by a List.

Listing 5: Using retainAll( ) for a List implementation of multisets

@Test
public void testListRetainAll() {
    List<String> list1 = Lists.newArrayList("a", "b", "c");
    List<String> list2 = Lists.newArrayList("c", "c", "d", "e");
    
    // Intersection
    List<String> listIntersection = Lists.newArrayList(list1);
    listIntersection.retainAll(list2);
    
    // Test the results.
    assertEquals(Lists.newArrayList("c"), listIntersection);
}

So far, so good. We see that the two lists have a single element in common, and the unit test verifies that retainAll( ) did, indeed, retain that single element "c".

However, we expect set operations to be commutative—that is, we expect A ∩ B = B ∩ A. To verify this, let's add a couple of lines to the same unit test.

Listing 6: Testing commutativity of retainAll( ) for a List implementation of multisets

// Verify that the operation is commutative.
List<String> listIntersection2 = Lists.newArrayList(list2);
listIntersection2.retainAll(list1);
assertEquals(Lists.newArrayList("c"), listIntersection2);
    
// Test fails with the following output:
java.lang.AssertionError: expected:<[c]> but was:<[c, c]>

The first test, which passed, was performed on a list initialized with list1, and then the test called retainAll(list2). The second test initialized with list2, and then called retainAll(list1). And this test failed because it had one too many "c" elements, thus showing that this way of trying to implement the intersection operation is not commutative.

Which is the correct intersection of the two sets—{c} or {c, c}? Here's a formal definition of the multiset intersection operation from PlanetMath.org:

The intersection of f and g, denoted by f ∩ g, is the multiset whose domain is A ∩ B, such that (f ∩ g) (x) := min(f(x), g(x))1

The same page gives a good example of how this works on two multisets. (The Multiset page in Wikipedia offers much the same definition, but provides no good example.)

So that covers intersections with retainAll( ), showing that this is not a good way to implement multiset intersection. Now let's deal with union and addAll( ). First we'll go back to PlanetMath.org for a formal definition of what it means for the union operation to be applied to two multisets.

The union of f and g, denoted by f ∪ g, is the multiset whose domain is A ∪ B, such that (f ∪ g) (x) := max(f(x), g(x))1

So the union of the two multisets {c} and {c, c} is {c, c}, since the maximal multiplicity of the element c is 2. If you think the union should be {c, c, c}, what you are thinking of is defined as the multiset sum operation.

The next listing demonstrates that addAll( ) does not give the correct multiset behavior for the union operation.

Listing 7: Testing behavior addAll( ) for a List implementation of multisets

@Test
public void testListAddAll() {
    List<String> list1 = Lists.newArrayList("a", "b", "c");
    List<String> list2 = Lists.newArrayList("c", "c", "d", "e");
    
    // Union
    List<String> listUnion = Lists.newArrayList(list1);
    listUnion.addAll(list2);
    
    // Sort the expected and actual lists so that they can be compared by assertEquals( ).
    List<String> expectedList = Lists.newArrayList("a", "b", "c", "c", "d", "e");
    Collections.sort(expectedList);
    Collections.sort(listUnion);
    
    assertEquals(expectedList, listUnion);
}

// Test fails with the following output:
java.lang.AssertionError: expected:<[a, b, c, c, d, e]> but was:<[a, b, c, c, c, d, e]>

In the listing, first we create our two multisets as lists. Then we use addAll( ) to get the "union." Then a couple steps are necessary before we can do the assertEquals( ) of the unit tests. We have to take these steps because a multiset is, by definition, unordered, whereas this particular implementation of a multiset is using a List, which is ordered. We create our expected list, which is what the union of the two multisets should be. Then we sort both the expected list and the union list. Now order will no longer be the difference between the two lists, and we are testing only their contents. And the contents are found to be incorrect.

Guava's Multisets operations for intersection and union

We've seen the problems with the Java List methods; let's see what Guava's Multisets has to offer.

The first listing below shows how to properly do multiset intersection.

Listing 8: Using Multisets.intersection( )

@Test
public void testMultisetIntersection() {
    List<String> list1 = Lists.newArrayList("a", "b", "c", "c");
    Multiset<String> multiset1 = HashMultiset.create(list1);
    List<String> list2 = Lists.newArrayList("c", "d", "e");
    Multiset<String> multiset2 = HashMultiset.create(list2);
    
    // Intersection
    Multiset<String> multisetIntersection = Multisets.intersection(multiset1, multiset2);
    
    // Test the results.
    Multiset<String> expectedMultiset = HashMultiset.create(Lists.newArrayList("c"));
    assertEquals(expectedMultiset, multisetIntersection);

    // Verify that the operation is commutative.
    Multiset<String> multisetIntersection2 = Multisets.intersection(multiset2, multiset1);
    assertEquals(expectedMultiset, multisetIntersection2);
}

In Listing 8, first we create two multisets from exactly the same lists we've been working with in this section. Then we create the intersection of the two multisets with the intersection( ) method. Then we create the expected multiset and perform the unit test. Finally, in order to verify commutativity, we create a new intersection of the two multisets, but in reverse order. Note that in this case we get the correct multiset behavior for intersection.

The listing below shows how to perform multiset union. (Note: You'll get a compiler error on Multisets.union( ) if you are using an older version of Guava. Make sure you are using at least the version contained in guava-15.0.jar.)

Listing 9: Using Multisets.union()

@Test
public void testMultisetUnion() {
    List<String> list1 = Lists.newArrayList("a", "b", "c", "c");
    Multiset<String> multiset1 = HashMultiset.create(list1);
    List<String> list2 = Lists.newArrayList("c", "d", "e");
    Multiset<String> multiset2 = HashMultiset.create(list2);
    
    // Union
    Multiset<String> multisetUnion = Multisets.union(multiset1, multiset2);
    
    // Test the results.
    Multiset<String> expectedMultiset = HashMultiset.create(Lists.newArrayList("a", "b", "c", "c", "d", "e"));
    assertEquals(expectedMultiset, multisetUnion);

    // Verify that the operation is commutative.
    Multiset<String> multisetUnion2 = Multisets.union(multiset2, multiset1);
    assertEquals(expectedMultiset, multisetUnion2);
}

The steps involved here are identical to those in the previous listing, with the exception of the call to union( ) instead of intersection( ). Recall that, in order to test the List implementation of the multiset union, we first had to sort both the union and the expected result. We don't have to do that here because the assertEquals( ) override for multisets understands that order is not relevant.

1PlanetMath.org on Multisets