Lossy Countingの変種を実装してみた

前に以下の記事でLossy Countingを実装しましたが、より省メモリな変種があるらしいのでそちらも書いてみました

Lossy Countingを実装してみた - 省メモリな頻度計測 - 唯物是真 @Scaled_Wurm

以下の論文のAlgorithm 2の擬似コードそのままです。

"Methods for Finding Frequent Items in Data Streams", Graham Cormode and Marios Hadjieleftheriou

元々のLossy Countingではアイテムごとに追加された時のバケツ番号(≒誤差の大きさ)を持っていたのですが、この変種では明示的に持たずに頻度に足しあわせてしまいます(頻度のupper boundを持つ)
アイテムの頻度をカウントするための辞書に加えて、処理したデータ数、許容する誤差の大きさ、現在のバケツ番号の3つの変数しか使いません。
アイテムごとの誤差の大きさはわからなくなってしまいましたが、その分省メモリになっています

アルゴリズムの簡単な説明

データをバケツに分けます
順番にバケツ内のアイテムの頻度を数えていきます
1. ただし、アイテムを辞書に初めて追加するときには、頻度(のupper bound)を1 + バケツ番号として辞書に追加する。
2. 辞書内にあるアイテムに+1する場合には、頻度(のupper bound)をそのまま+1する
バケツの境界に来たらバケツ番号を+1して、頻度(のupper bound)が現在のバケツ番号より小さいアイテムを辞書から削除します

実験

以前の記事と同様の設定で行います
ニコニコ動画のコメントデータ1000万件に対して実行した結果をいかに示します
MeCabで形態素解析した結果を使って、単語のbigram(連続した2つの単語のペア)の頻度を計測します
誤差の大きさの割合\(\epsilon=[10^{-6}, 10^{-7}]\)とします

形態素解析後の全アイテム数は\(100000000=10^8\)強程度あったので\(\epsilon=10^{-6}\)だと、最大でおよそ\(100\)強程度の誤差、\(\epsilon=10^{-7}\)だと、最大でおよそ\(10\)強程度の誤差がそれぞれの単語の頻度についてあることがわかります

以下のグラフ(縦軸が辞書内のアイテム数、横軸が処理したアイテム数)をみると\(\epsilon=10^{-7}\)でも、すべてを数えるナイーブな方法と比べておよそ半分以下のメモリ消費で済みそうだということがわかります
f:id:sucrose:20130728142642p:plain

ソースコード

github

class LossyCountingVariant(object):
    def __init__(self, epsilon):
        self.N = 0
        self.count = {}
        self.epsilon = epsilon
        self.b_current = 0
    
    def getCount(self, item):
        return self.count[item]
    
    def trim(self):
        for item in self.count.keys():
            if self.count[item] < self.b_current:
                del self.count[item]
        
    def addCount(self, item):
        self.N += 1
        if item in self.count:
            self.count[item] += 1
        else:
            self.count[item] = 1 + self.b_current
        
        if self.N % int(1 / self.epsilon) == 0:
            self.b_current += 1
            self.trim()
    
    def iterateOverThresholdCount(self, threshold_count):
        assert threshold_count > self.epsilon * self.N, "too small threshold"
        
        for item in self.count:
            if self.count[item] >= threshold_count:
                yield (item, self.count[item])
    
    def iterateOverThresholdRate(self, threshold_rate):
        return self.iterateOverThresholdCount(threshold_rate * self.N)
    

if __name__ == '__main__':
    import random

    counter = LossyCountingVariant(5e-3)

    stream = ''
    for i, c in enumerate('abcdefghi', 1):
        stream += c * 2 ** i
    stream = list(stream)
    random.shuffle(stream)

    for c in stream:
        counter.addCount(c)
    
    for item, count in sorted(counter.iterateOverThresholdCount(10), key=lambda x:x[1]):
        print item, count