Guava Multisets


Guava's Multiset interface and its corresponding implementations offer programmers some very convenient tools. First and foremost, we now have a handy way to count things. Secondly, we can use Multiset to perform true set-like operations on multisets.

Summary

Use a Multiset to count different types of things, like the number of occurrences of a particular string, or the number of different types of named entities (Persons, Organizations, Dates, etc.) in a given file.

Listing 1: Counting Types of Objects with Guava's Multiset

@Test
public void simpleCase() {
    List<String> input = Lists.newArrayList("aaa", "bbb", "aaa");
    Multiset<String> multiset = HashMultiset.create(input);
    multiset.add("ddd");
    
    assertEquals(2, multiset.count("aaa"));
    assertEquals(1, multiset.count("bbb"));
    assertEquals(0, multiset.count("ccc"));
    assertEquals(1, multiset.count("ddd"));
    
    assertEquals(4, multiset.size());
}

Listing 1 is only a snapshot of how to use Multiset for counting. See below for a description of what is going on in this code snippet, as well as for other uses of Multiset.

Counting with a Multiset

Usually in Java we count things with a Map<X, Integer>. This means that, every time we find an object, we first have to check whether it already exists in the map. As the Guava documentation says: "This is awkward, prone to mistakes, and doesn't support collecting a variety of useful statistics, like the total number of words. We can do better."

And with Guava's Multiset, the folks at Google have done better.

Listing 1, above, illustrates the most useful functionality this interface provides. First we create a list of strings, some of which are duplicates. Then we initialize the HashMultiset implementation of Multiset with the create( ) method. Alternatively, you can add items to your multiset one at a time, as shown by the line multiset.add("ddd"). The next few lines in the code snippet above use unit test assertions to demonstrate the counting functionality, provided by count( ), which returns the count of each type, and size( ), which gives you the total number of elements contained in the multimap.

The real payoff when it comes to Multiset is demonstrated by the line reading assertEquals(0, multiset.count("ccc"));. In the Map<X, Integer> idiom, the call "map.count("ccc")" would produce a null pointer exception, stopping your application dead in its tracks.

What is a Multiset?

Briefly, a multiset (lowercase—not the Guava interface Multiset) is a set of objects in which order does not matter and for which it is possible to have more than one of each type of element. For example, {a, b, b, c, d} is a multiset, and it is equivalent to {a, c, d, b, b}. Note that the traditional set {a, b, c} is also a multiset, albeit one which has only one element of each type. Another name for multiset is a bag, as in "a bag of words model of a document."

Guava's Multiset implementations give us not just a way to count types of objects, but they also provide additional functionality that we may want to use on multisets.

Additional Guava Multiset Functionality

The next listing illustrates some of the non-counting behavior which Multiset provides.

Listing 2: Additional Multiset functionality

@Test
public void additionalFunctionality() {
    List<String> input = Lists.newArrayList("aaa", "bbb", "aaa");
    Multiset<String> multiset = HashMultiset.create(input);
    multiset.addAll(input);
    
    // Quick count check.
    assertEquals(4, multiset.count("aaa"));
    assertEquals(2, multiset.count("bbb"));
    assertEquals(0, multiset.count("ccc"));

    // Verify elementSet( ).
    Set<String> expectedEntrySet = Sets.newHashSet("aaa", "bbb");
    assertEquals(expectedEntrySet, multiset.elementSet());
    
    // Verify remove(Object, int).
    multiset.remove("aaa", 3);
    assertEquals(1, multiset.count("aaa"));
    
    // Verify remove(Object).
    multiset.remove("bbb");
    assertEquals(1, multiset.count("bbb"));
    
    // Verify setCount().
    multiset.setCount("ccc", 10);
    assertEquals(10, multiset.count("ccc"));
}

In Listing 2, first we initialize our multiset in the same way as in Listing 1. Then we double the number of each type of element by using the addAll( ) method—which we verify with a quick sanity check. At this point we begin verifying the non-count behavior, the first being elementSet( ), which provides a set of all the types which have been inserted into the multiset. In this case, our two types are simply the "aaa" string and the "bbb" string.

Next we demonstrate the remove( ) methods. The first of these methods takes an object type and an integer n, and removes from the multiset n occurrences of the object type. Here we remove three "aaa" objects from the multiset, giving us a final count of a single "aaa". The second remove( ) method takes a single type parameter, and is equivalent to calling the first version with "1" as the second parameter. In our case, we reduce the number of "bbb" objects from two to one.

Finally, our code snippet demonstrates the setCount( ) method by adding ten "ccc" objects to the multiset.

Two more methods are worth mentioning: the override of toString( ) and entrySet( ). Listing 3 illustrates them both.

Listing 3: Multiset.toString( ) and Multiset.entrySet( )

@Test
public void entrySetAndToString() {
    List<String> input = Lists.newArrayList("aaa", "bbb", "aaa", "ccc");
    Multiset<String> multiset = HashMultiset.create(input);
    multiset.addAll(input);
    
    assertEquals(8, multiset.size());
    
    System.out.println(multiset);
    
    for(Multiset.Entry<String> entrySet : multiset.entrySet())  {
        String key = entrySet.getElement();
        int count = entrySet.getCount();
        System.out.println(key + ": " + count);
    }
}

// Output:
[aaa x 4, ccc x 2, bbb x 2]
aaa: 4
ccc: 2
bbb: 2

Listing 3 is a unit test that we've hijacked so that we have a place for some println( ) statements. The first line of output comes from the System.out.println(multiset) call. It demonstrates the very nice override provided by Guava, which—instead of displaying every element individually—shows each type and the number of times it appears in the collection. The next three lines of output are the result of the for loop on the entrySet( ) method. Here we iterate through each entry, getting the element type (e.g. "aaa," "bbb," etc.) as well as the count of each. When simply written out, as it is here, it's not significantly different from toString( ), but the method could be useful in any number of other ways.

Set Operations with Multisets

Because multisets are very similar to sets, it makes sense to want to perform set operations, like union and intersection, on them. For example, in my own work I've wanted to compare the "topics" of several conversations, the "topic" of each conversation being represented by a multiset of the important words in the dialog. I could say that two conversations were more similar than a third conversation if the intersection of their topic multisets had more elements than the intersection of either with the third conversation.

The listing below shows you how to perform set operations on multisets with Guava. To get this to compile and run successfully, make sure you are not using an old jar file for Guava. (The version in guava-15.0.jar will do the trick.)

Listing 4: Guava's Multisets.intersect( ) and Multisets.union( )

    List<String> list1 = Lists.newArrayList("a", "b", "c", "c");
    Multiset<String> multiset1 = HashMultiset.create(list1);
    List<String> list2 = Lists.newArrayList("c", "d", "e");
    Multiset<String> multiset2 = HashMultiset.create(list2);
    
    // Union
    Multiset<String> multisetUnion = Multisets.union(multiset1, multiset2);

    // Intersection
    Multiset<String> multisetIntersection = Multisets.intersection(multiset1, multiset2);

Elsewhere on this website I discuss why you should NOT use the JDK List methods retainAll( ) and addAll( ) on multisets. See the internal link at right on set operations.