Top K or K-most frequent words in a document

Given a document (or stream) of words, find the top k most frequent words in the document (or stream).

For example, if stream = “aa bb cc bb bb cc dd dd ee ff ee dd aa ee”, then the word frequencies are {“aa”=2, “bb”=3, “cc”=2, “dd”=3, “ee”=3, “ff”=1}. So the top 3 most frequent words are {“dd”, “ee”, “bb”}.

One quick solution would be to create a pair object of each word and its frequency, sort the array of pairs by frequency, and take the first k pairs from the sorted array. This is an O(n lg n) solution.
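A minimal sketch of this sort-based approach (class and method names here are illustrative, not from the post):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopKSort {
    // O(n lg n): count frequencies, sort all distinct words by frequency
    // (descending), and take the first k.
    static String[] topKWordsSort(final String stream, final int k) {
        final Map<String, Integer> freq = new HashMap<>();
        for (final String w : stream.toLowerCase().trim().split("\\s+")) {
            freq.merge(w, 1, Integer::sum);
        }

        final List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue()); // descending by frequency

        final String[] topK = new String[k];
        for (int i = 0; i < k; i++) {
            topK[i] = entries.get(i).getKey();
        }
        return topK;
    }

    public static void main(final String[] args) {
        // bb, dd, and ee are tied at frequency 3, so any order among them is valid
        System.out.println(Arrays.toString(
                topKWordsSort("aa bb cc bb bb cc dd dd ee ff ee dd aa ee", 3)));
    }
}
```

Note that ties (here bb, dd, and ee all have frequency 3) come out in an unspecified order, since the sort only compares frequencies.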

An O(n lg k) time and O(n) space solution
But we can improve this solution. Note that we are only concerned about the top k elements, so sorting all n elements is unnecessary. Any idea popping up? Yeah, I am sure you have the same feeling: we can use a min heap of size k to keep the top k most frequent words. That’s right. We also need a hashmap to keep the frequency of each word.

  1. Calculate the frequency of all the words in a hashmap from each word to its frequency.
  2. Start adding pair objects of word and frequency into a min heap, using the frequency as the key for the min heap.
  3. If the heap is full, then remove the minimum element (the top) from the heap and add the new word-frequency pair, but only if this word’s frequency is greater than the frequency of the top word in the heap.
  4. Once we have scanned all the words in the map and the heap is properly updated, the elements contained in the min heap are the top k most frequent words.

Below is a simple implementation of the above idea using the Java PriorityQueue (or we can use a generic min heap I have implemented in a previous post here). This solution is O(n lg k) time and O(n) space.

public String[] topKWords(final String stream, final int k) {
    final class WordFreq implements Comparable<WordFreq> {
        String word;
        int freq;

        public WordFreq(final String w, final int c) {
            word = w;
            freq = c;
        }

        @Override
        public int compareTo(final WordFreq other) {
            return Integer.compare(this.freq, other.freq);
        }
    }
    final Map<String, Integer> frequencyMap = new HashMap<String, Integer>();
    final PriorityQueue<WordFreq> topKHeap = new PriorityQueue<WordFreq>(k);

    final String[] words = stream.toLowerCase().trim().split("\\s+");
    for (final String word : words) {
        int freq = 1;
        if (frequencyMap.containsKey(word)) {
            freq = frequencyMap.get(word) + 1;
        }

        // update the frequency map
        frequencyMap.put(word, freq);
    }

    // Build the topK heap
    for (final java.util.Map.Entry<String, Integer> entry : frequencyMap.entrySet()) {
        if (topKHeap.size() < k) {
            topKHeap.add(new WordFreq(entry.getKey(), entry.getValue()));
        } else if (entry.getValue() > topKHeap.peek().freq) {
            topKHeap.remove();
            topKHeap.add(new WordFreq(entry.getKey(), entry.getValue()));
        }
    }

    // extract the top K
    final String[] topK = new String[k];
    int i = 0;
    while (topKHeap.size() > 0) {
        topK[i++] = topKHeap.remove().word;
    }

    return topK;
}


A more efficient O(n) time and O(n) space solution
We can solve this problem in expected linear time by using QuickSelect, the algorithm for selecting the kth smallest/largest number in an array. I discussed the idea of selecting the kth smallest number of an array in my previous post.

The idea is that if we know the kth largest frequency (same as the (n-k)th smallest) among all the frequencies of the words, then we can select k elements whose frequency is greater than or equal to that kth frequency. This is only an O(n) scan. So, the algorithm runs in O(n) expected time and O(n) space. Below is the implementation of the above idea using the kth smallest implementation I posted earlier.

public String[] topKWordsSelect(final String stream, final int k) {
    final Map<String, Integer> frequencyMap = new HashMap<String, Integer>();

    final String[] words = stream.toLowerCase().trim().split("\\s+");
    for (final String word : words) {
        int freq = 1;
        if (frequencyMap.containsKey(word)) {
            freq = frequencyMap.get(word) + 1;
        }

        // update the frequency map
        frequencyMap.put(word, freq);
    }

    // Find kth largest frequency which is same as (n-k)th smallest frequency
    final int[] frequencies = new int[frequencyMap.size()];
    int i = 0;
    for (final int value : frequencyMap.values()) {
        frequencies[i++] = value;
    }
    final int kthLargestFreq = kthSmallest(frequencies, 0, i - 1, i - k);

    // extract the top K
    final String[] topK = new String[k];
    i = 0;
    for (final java.util.Map.Entry<String, Integer> entry : frequencyMap.entrySet()) {
        if (entry.getValue().intValue() >= kthLargestFreq) {
            topK[i++] = entry.getKey();
            if (i == k) {
                break;
            }
        }
    }

    return topK;
}
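The kthSmallest helper called above comes from the earlier post and is not shown here. For completeness, a minimal quickselect sketch with a matching signature (0-indexed rank k within a[low..high]) might look like the following; the random pivot gives O(n) expected time, with an O(n^2) worst case unless a median-of-medians pivot is used:

```java
import java.util.Random;

public class QuickSelect {
    private static final Random RAND = new Random();

    // Returns the element with 0-based rank k (the (k+1)th smallest) in a[low..high].
    // Note: the array is partially reordered in place.
    static int kthSmallest(final int[] a, int low, int high, final int k) {
        while (low < high) {
            final int p = partition(a, low, high);
            if (p == k) {
                return a[p];
            } else if (p < k) {
                low = p + 1;  // rank k lies in the right part
            } else {
                high = p - 1; // rank k lies in the left part
            }
        }
        return a[low];
    }

    // Lomuto partition around a randomly chosen pivot; returns the pivot's final index.
    private static int partition(final int[] a, final int low, final int high) {
        swap(a, low + RAND.nextInt(high - low + 1), high);
        final int pivot = a[high];
        int i = low;
        for (int j = low; j < high; j++) {
            if (a[j] < pivot) {
                swap(a, i++, j);
            }
        }
        swap(a, i, high);
        return i;
    }

    private static void swap(final int[] a, final int i, final int j) {
        final int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(final String[] args) {
        final int[] freqs = {2, 3, 2, 3, 3, 1}; // aa, bb, cc, dd, ee, ff
        // 3rd largest = (n - 3)th smallest (0-indexed rank 3)
        System.out.println(kthSmallest(freqs, 0, freqs.length - 1, freqs.length - 3)); // prints 3
    }
}
```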

9 thoughts on “Top K or K-most frequent words in a document”

  1. topKWords(“a b b c c c e e e e e d d d d g g g g g g g f f f f f f”, 3) produces [f, g, b],
    which seems wrong. If not, then please explain how it is correct.

    • I made changes to your max/min heap by correcting the extract method, and now I get the correct result: [e, f, g].
      Thanks

    • Hey Ashutos, thanks for the comment.

      My code actually returns the correct answer [e, f, g] for the topKWords(“a b b c c c e e e e e d d d d g g g g g g g f f f f f f”, 3) call.

      Are you sure you are running the code correctly?

      • for (final java.util.Map.Entry<String, Integer> entry : frequencyMap.entrySet()) {
            if (entry.getValue().intValue() >= kthLargestFreq) {
                topK[i++] = entry.getKey();
                if (i == k) {
                    break;
                }
            }
        }
        The above part seems incorrect, since it does not check values after the topK array has been filled to capacity.

        • Can you construct a counterexample to show what you are saying?

          The way kthLargestFreq is computed guarantees that, irrespective of the order of the elements, any frequency greater than or equal to this number belongs to the top k.

  2. Hi Zahid,
    Could you please explain why the selection based second approach is using only O(K) space?

    To me the first part does the counting, so at worst it will take O(N) space, because when there is only one instance for each word, the hash table will be as large as O(N).

    From my understanding, this leads to a O(N) space complexity, which contradicts with your claim that it uses O(K) space.

    • Yes, you are right. It’s O(n) space, as the stream could contain all unique words and the map takes O(n) space. Thanks for the comment. I have updated the writeup.

  3. Can you explain how this works in O(n) time complexity? Quickselect works in O(n log k), doesn’t it?
    Can you explain how finding the kth largest works in O(n)?
