Top K or K-most frequent words in a document

Given a document (or stream) of words, find the top k most frequent words in the document (or stream).

For example, if stream = “aa bb cc bb bb cc dd dd ee ff ee dd aa ee”, then the word frequencies are {“aa”=2, “bb”=3, “cc”=2, “dd”=3, “ee”=3, “ff”=1}. So the top 3 most frequent words are {“dd”, “ee”, “bb”}.

One quick solution would be to create a pair object of each word and its frequency, sort the array of pairs by frequency, and take the first k pairs from the sorted array. This is an O(n lg n) solution.
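A minimal sketch of this sort-based approach (class and method names here are illustrative, not from the post):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopKSort {
    // O(n lg n): count frequencies, sort all distinct words by frequency
    // (descending), and take the first k.
    static String[] topKWordsSort(final String stream, final int k) {
        final Map<String, Integer> freq = new HashMap<>();
        for (final String w : stream.toLowerCase().trim().split("\\s+")) {
            freq.merge(w, 1, Integer::sum);
        }

        final List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue()); // descending by frequency

        final String[] topK = new String[k];
        for (int i = 0; i < k; i++) {
            topK[i] = entries.get(i).getKey();
        }
        return topK;
    }

    public static void main(final String[] args) {
        // bb, dd, and ee are tied at frequency 3, so any order among them is valid
        System.out.println(Arrays.toString(
                topKWordsSort("aa bb cc bb bb cc dd dd ee ff ee dd aa ee", 3)));
    }
}
```

Note that ties (here bb, dd, and ee all have frequency 3) come out in an unspecified order, since the sort only compares frequencies.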

An O(n lg k) time and O(n) space solution
But we can improve this solution. Note that we are only concerned about the top k elements, so sorting all n elements is unnecessary. Any idea popping up? Yeah, I am sure you have the same feeling: we can use a min heap of size k to keep the top k most frequent words. That’s right. We also need a hashmap to keep the frequency of each word.

  1. Calculate the frequency of all the words in a hashmap from each word to its frequency.
  2. Start adding pair objects of word and frequency into a min heap, using the frequency as the key for the min heap.
  3. If the heap is full, then remove the minimum element (the top) from the heap and add the new word-frequency pair, but only if this word’s frequency is greater than the frequency of the top word in the heap.
  4. Once we have scanned all the words in the map and the heap is properly updated, the elements contained in the min heap are the top k most frequent words.

Below is a simple implementation of the above idea using the Java PriorityQueue (or we can use a generic min heap I have implemented in a previous post here). This solution is O(n lg k) time and O(n) space.

public String[] topKWords(final String stream, final int k) {
    final class WordFreq implements Comparable<WordFreq> {
        String word;
        int freq;

        public WordFreq(final String w, final int c) {
            word = w;
            freq = c;
        }

        @Override
        public int compareTo(final WordFreq other) {
            return Integer.compare(this.freq, other.freq);
        }
    }
    final Map<String, Integer> frequencyMap = new HashMap<String, Integer>();
    final PriorityQueue<WordFreq> topKHeap = new PriorityQueue<WordFreq>(k);

    final String[] words = stream.toLowerCase().trim().split("\\s+");
    for (final String word : words) {
        int freq = 1;
        if (frequencyMap.containsKey(word)) {
            freq = frequencyMap.get(word) + 1;
        }

        // update the frequency map
        frequencyMap.put(word, freq);
    }

    // Build the topK heap
    for (final java.util.Map.Entry<String, Integer> entry : frequencyMap.entrySet()) {
        if (topKHeap.size() < k) {
            topKHeap.add(new WordFreq(entry.getKey(), entry.getValue()));
        } else if (entry.getValue() > topKHeap.peek().freq) {
            topKHeap.remove();
            topKHeap.add(new WordFreq(entry.getKey(), entry.getValue()));
        }
    }

    // extract the top K
    final String[] topK = new String[k];
    int i = 0;
    while (topKHeap.size() > 0) {
        topK[i++] = topKHeap.remove().word;
    }

    return topK;
}


A more efficient O(n) time and O(n) space solution
We can solve this problem in expected linear time by using QuickSelect, the algorithm for selecting the kth smallest/largest number in an array. I discussed the idea of selecting the kth smallest number of an array in my previous post.

The idea is that if we know the kth largest frequency (same as the (n-k)th smallest) among all the frequencies of the words, then we can select k elements whose frequency is greater than or equal to that kth frequency. This is only an O(n) scan. So, the algorithm runs in O(n) expected time and O(n) space. Below is the implementation of the above idea using the kth smallest implementation I posted earlier.

public String[] topKWordsSelect(final String stream, final int k) {
    final Map<String, Integer> frequencyMap = new HashMap<String, Integer>();

    final String[] words = stream.toLowerCase().trim().split("\\s+");
    for (final String word : words) {
        int freq = 1;
        if (frequencyMap.containsKey(word)) {
            freq = frequencyMap.get(word) + 1;
        }

        // update the frequency map
        frequencyMap.put(word, freq);
    }

    // Find kth largest frequency which is same as (n-k)th smallest frequency
    final int[] frequencies = new int[frequencyMap.size()];
    int i = 0;
    for (final int value : frequencyMap.values()) {
        frequencies[i++] = value;
    }
    final int kthLargestFreq = kthSmallest(frequencies, 0, i - 1, i - k);

    // extract the top K
    final String[] topK = new String[k];
    i = 0;
    for (final java.util.Map.Entry<String, Integer> entry : frequencyMap.entrySet()) {
        if (entry.getValue().intValue() >= kthLargestFreq) {
            topK[i++] = entry.getKey();
            if (i == k) {
                break;
            }
        }
    }

    return topK;
}
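The kthSmallest helper called above comes from the earlier post and is not shown here. For completeness, a minimal quickselect sketch with a matching signature (0-indexed rank k within a[low..high]) might look like the following; the random pivot gives O(n) expected time, with an O(n^2) worst case unless a median-of-medians pivot is used:

```java
import java.util.Random;

public class QuickSelect {
    private static final Random RAND = new Random();

    // Returns the element with 0-based rank k (the (k+1)th smallest) in a[low..high].
    // Note: the array is partially reordered in place.
    static int kthSmallest(final int[] a, int low, int high, final int k) {
        while (low < high) {
            final int p = partition(a, low, high);
            if (p == k) {
                return a[p];
            } else if (p < k) {
                low = p + 1;  // rank k lies in the right part
            } else {
                high = p - 1; // rank k lies in the left part
            }
        }
        return a[low];
    }

    // Lomuto partition around a randomly chosen pivot; returns the pivot's final index.
    private static int partition(final int[] a, final int low, final int high) {
        swap(a, low + RAND.nextInt(high - low + 1), high);
        final int pivot = a[high];
        int i = low;
        for (int j = low; j < high; j++) {
            if (a[j] < pivot) {
                swap(a, i++, j);
            }
        }
        swap(a, i, high);
        return i;
    }

    private static void swap(final int[] a, final int i, final int j) {
        final int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(final String[] args) {
        final int[] freqs = {2, 3, 2, 3, 3, 1}; // aa, bb, cc, dd, ee, ff
        // 3rd largest = (n - 3)th smallest (0-indexed rank 3)
        System.out.println(kthSmallest(freqs, 0, freqs.length - 1, freqs.length - 3)); // prints 3
    }
}
```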

9 thoughts on “Top K or K-most frequent words in a document”

  1. topKWords(“a b b c c c e e e e e d d d d g g g g g g g f f f f f f”, 3) produces [f, g, b],
    which seems wrong. If not, then please explain how it is correct.

    • I made changes to your max/min heap by correcting the extract method, and now I get the correct result: [e, f, g].
      Thanks

    • Hey Ashutos, thanks for the comment.

      My code actually returns the correct answer [e, f, g] for the topKWords(“a b b c c c e e e e e d d d d g g g g g g g f f f f f f”, 3) call.

      Are you sure you are running the code correctly?

      • for (final java.util.Map.Entry<String, Integer> entry : frequencyMap.entrySet()) {
            if (entry.getValue().intValue() >= kthLargestFreq) {
                topK[i++] = entry.getKey();
                if (i == k) {
                    break;
                }
            }
        }
        The above part seems incorrect, since it does not check values after the topK array has been filled to capacity.

        • Can you construct a counterexample to show what you are saying?

          The way kthLargestFreq is computed guarantees that, irrespective of the order of the elements, any frequency greater than or equal to this number belongs to the top k.

  2. Hi Zahid,
    Could you please explain why the selection based second approach is using only O(K) space?

    To me the first part does the counting, so at worst it will take O(N) space, because when there is only one instance for each word, the hash table will be as large as O(N).

    From my understanding, this leads to a O(N) space complexity, which contradicts with your claim that it uses O(K) space.

    • Yes, you are right. It’s O(n) space, as the stream could contain all unique words and the map takes O(n) space. Thanks for the comment. I have updated the writeup.

  3. Can you explain how this works in O(n) time complexity? Quickselect works in O(n log k), doesn’t it?
    Can you explain how finding the kth largest works in O(n)?
