How to use gensim, a topic modeling library for Python (mainly loading Japanese text)

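gensim is a Python library that provides topic models and related features. I used it before in the following post.

Analyzing the Shōsetsuka ni Narō (小説家になろう) rankings with a topic model (gensim) - 唯物是真 @Scaled_Wurm

gensim was also used in the following paper, which I introduced earlier.

Paper introduction: “Representing Topics Using Images” (NAACL 2013) - 唯物是真 @Scaled_Wurm

It is an interesting library that has also picked up word2vec, which got a lot of attention in the deep learning context.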

Radim Řehůřek : Deep learning with word2vec and gensim

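I found the way to build the input a little hard to follow, so I'm writing it down here.

Building a corpus

I'll explain using the official example below.
In this example, each element of the list is one document.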

>>> from gensim import corpora, models, similarities
>>>
>>> documents = ["Human machine interface for lab abc computer applications",
>>>              "A survey of user opinion of computer system response time",
>>>              "The EPS user interface management system",
>>>              "System and human system engineering testing of EPS",
>>>              "Relation of user perceived response time to error measurement",
>>>              "The generation of random binary unordered trees",
>>>              "The intersection graph of paths in trees",
>>>              "Graph minors IV Widths of trees and well quasi ordering",
>>>              "Graph minors A survey"]

http://radimrehurek.com/gensim/tut1.html

For English, you first split the text into words with split() or similar, then preprocess it, for example by lowercasing.

>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
>>>          for document in documents]
>>>
>>> # remove words that appear only once
>>> all_tokens = sum(texts, [])
>>> tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
>>> texts = [[word for word in text if word not in tokens_once]
>>>          for text in texts]
>>>
>>> print texts
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

http://radimrehurek.com/gensim/tut1.html
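The `all_tokens.count(word)` call in the quoted example rescans the whole token list for every distinct word, which gets slow as the corpus grows. Below is a minimal sketch of the same filtering with `collections.Counter`, written for Python 3 and run on only the first three documents. The variable name `frequency` is my own and does not come from the tutorial.

```python
from collections import Counter

# First three documents from the example above (shortened for illustration).
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
]
stoplist = set("for a of the and to in".split())

# Lowercase, split on whitespace, and drop stop words.
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Count every token in one pass, then keep only words seen more than once.
frequency = Counter(word for text in texts for word in text)
texts = [[word for word in text if frequency[word] > 1] for text in texts]

print(texts)
```

Because only three documents are used here, the surviving words differ from the nine-document result shown above.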

Next, create the dictionary.

>>> dictionary = corpora.Dictionary(texts)
>>> dictionary.save('/tmp/deerwester.dict') # store the dictionary, for future reference
>>> print dictionary
Dictionary(12 unique tokens)

http://radimrehurek.com/gensim/tut1.html

The dictionary holds the mapping between words and word IDs.

>>> print dictionary.token2id
{'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0,
'survey': 4, 'user': 7, 'human': 1, 'time': 6, 'interface': 2, 'response': 3}

http://radimrehurek.com/gensim/tut1.html

With it, a document can be converted into a list of (word ID, frequency) tuples.

>>> new_doc = "Human computer interaction"
>>> new_vec = dictionary.doc2bow(new_doc.lower().split())
>>> print new_vec # the word "interaction" does not appear in the dictionary and is ignored
[(0, 1), (1, 1)]

http://radimrehurek.com/gensim/tut1.html
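To make the conversion concrete, here is a rough pure-Python sketch of the idea behind it: look up each token's ID, count occurrences, and skip words that are not in the mapping. The helper `to_bow` is hypothetical and is not gensim's implementation. The small `token2id` mapping is copied by hand from part of the output above.

```python
from collections import Counter

# A hand-copied subset of the token2id mapping printed above.
token2id = {"computer": 0, "human": 1, "interface": 2}

def to_bow(tokens, token2id):
    """Return sorted (word ID, count) pairs, ignoring unknown words."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

# "interaction" is not in the mapping, so it is dropped.
new_vec = to_bow("Human computer interaction".lower().split(), token2id)
print(new_vec)  # [(0, 1), (1, 1)]
```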

Applying this conversion to every document gives the corpus that gensim takes as input.

>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus) # store to disk, for later use
>>> print corpus
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

http://radimrehurek.com/gensim/tut1.html

Japanese needs morphological analysis

Japanese words cannot be split simply on whitespace, so morphological analysis is required.
Using a tool such as MeCab is the easy route.

If you just want to try it out, you can take a file with one document per line as input and segment it like this:

mecab -Owakati input_file > output_file

Suppose the command above is run on the following file:

1行に1文書のファイルを分割してみた
2行目
3行目

The output looks like this, so splitting it on whitespace and newlines gives a list of word lists, one per document.

1 行 に 1 文書 の ファイル を 分割 し て み た 
2 行 目 
3 行 目
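As a small Python 3 sketch of that last step, the segmented output can be turned into a list of word lists as follows. The output is pasted in as a string here for illustration. In practice it would be read from the output file.

```python
# Output of `mecab -Owakati`, pasted in as a string for illustration.
wakati_output = (
    "1 行 に 1 文書 の ファイル を 分割 し て み た \n"
    "2 行 目 \n"
    "3 行 目\n"
)

# One line per document. split() with no argument also discards the
# trailing space that MeCab leaves at the end of each line.
texts = [line.split() for line in wakati_output.splitlines() if line.strip()]

print(texts)
```

The resulting `texts` has the same shape as the `texts` variable in the English example, so it can be passed to `corpora.Dictionary` in the same way.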

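If you want to do something more involved, such as filtering by part of speech, it is better to call MeCab from inside your script through the Python binding.
That said, installing MeCab's Python binding on Windows can be a pain, so I sometimes cut corners and call MeCab through subprocess, as below.
It is slower than the binding, though, and you have to parse the part-of-speech information yourself, so it is a hassle in its own way.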

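import subprocess
# input: path of the file to analyze
# coding: encoding used by the shell and MeCab (e.g. 'shift-jis' on Windows)
result = subprocess.check_output(u'mecab {0}'.format(input).encode(coding), shell=True)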
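Parsing the part-of-speech information by hand might look like the Python 3 sketch below. It assumes MeCab's default output format with the IPA dictionary (surface form, a tab, then comma-separated features with the part of speech first and the base form seventh). The sample string is hand-written for illustration, not captured from a real run, and `extract_nouns` is a hypothetical helper.

```python
# Hand-written sample in MeCab's default output format (IPADIC assumed):
# surface<TAB>POS,sub1,sub2,sub3,conj. type,conj. form,base form,reading,pronunciation
mecab_output = (
    "文書\t名詞,一般,*,*,*,*,文書,ブンショ,ブンショ\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "分割\t名詞,サ変接続,*,*,*,*,分割,ブンカツ,ブンカツ\n"
    "し\t動詞,自立,*,*,サ変・スル,連用形,する,シ,シ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
)

def extract_nouns(output):
    """Return one list of noun base forms per EOS-terminated sentence."""
    docs, current = [], []
    for line in output.splitlines():
        if line == "EOS":
            # End of one input sentence: store it if it had any nouns.
            if current:
                docs.append(current)
            current = []
            continue
        if "\t" not in line:
            continue
        surface, feature = line.split("\t", 1)
        fields = feature.split(",")
        if fields[0] == "名詞":
            # Fall back to the surface form when no base form is given.
            has_base = len(fields) > 6 and fields[6] != "*"
            current.append(fields[6] if has_base else surface)
    if current:
        docs.append(current)
    return docs

print(extract_nouns(mecab_output))
```

This sketch checks only the first feature field, so it keeps every noun subclass. Excluding suffixes, pronouns, numerals and so on would need extra checks on the subclass fields.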

Building the dictionary and the corpus in one step

In the example above, the dictionary and the corpus were built separately.
gensim provides several convenience classes that build both at once.
I'll introduce two that look easy to use.

If your segmented words are in the following format, you can load them with gensim.corpora.lowcorpus.LowCorpus.

gensim: corpora.lowcorpus – Corpus in List-of-Words format

number of sentences
sentence1_word1 sentence1_word2 sentence1_word3
sentence2_word1 sentence2_word2
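If the segmented words are already in memory as a list of word lists, a file in this format can be produced with a few lines of Python 3. This is a minimal sketch. `write_low_format` is a hypothetical helper of my own, and it writes to an in-memory buffer here rather than to a real file.

```python
import io

def write_low_format(texts, stream):
    """Write a count line, then one space-separated line per document."""
    stream.write("{}\n".format(len(texts)))
    for tokens in texts:
        stream.write(" ".join(tokens) + "\n")

texts = [["1", "行", "目"], ["2", "行", "目"]]

# An in-memory buffer stands in for a file opened for writing.
buffer = io.StringIO()
write_low_format(texts, buffer)
print(buffer.getvalue())
```

To write an actual file, pass a file object opened with an explicit encoding in place of the buffer.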

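For more complex processing, there is gensim.corpora.textcorpus.TextCorpus.
Override TextCorpus.get_texts so that it yields a list of words for each document in the input passed as the argument, and the conversion to (word ID, frequency) tuples is done for you internally.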

The dictionary that maps word IDs to words is available as TextCorpus.dictionary.

Examples

Using gensim.corpora.lowcorpus.LowCorpus

This reads a corpus.txt in the format described above, runs LDA on it, and prints the top words for each topic and the topic distribution for each document.

Commands that segment input.txt (one document per line) with MeCab and write a file in that format:

wc -l input.txt | cut -d ' ' -f 1 > corpus.txt
mecab -Owakati input.txt >> corpus.txt

Source code

import gensim

if __name__ == '__main__':
    corpus = gensim.corpora.lowcorpus.LowCorpus('corpus.txt')
    
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=20, id2word=corpus.id2word)
    for topic in lda.show_topics(-1):
        print topic
    for topics_per_document in lda[corpus]:
        print topics_per_document

Sample output

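Estimated from 10,000 comments in the Nico Nico Douga dataset, treating each comment as one document.
Particles and similar words can hurt the results, so it may be better to remove them beforehand.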

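The upper part of the output shows the highest-probability words for each topic, printed with show_topics().
By default it returns the results packed into strings, but if you pass formatted=False it returns tuples of words and their probabilities instead.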

The second line from the top looks like it has picked up a Mojipittan (もじぴったん) topic (just a rough guess).

The last line of the output, just before the note that the rest is omitted, gives the probability of each topic for one document.

0.142*た + 0.070*て + 0.052*も + 0.049*に + 0.033*よい + 0.029*の + 0.029*し + 0.028*てる + 0.028*が + 0.026*で
0.784*　 + 0.039*ぴったん + 0.029*死ね + 0.019*萌え + 0.017*♥ + 0.016*たん + 0.013*もじ + 0.010*職人 + 0.008*. + 0.004*じ
0.063*日 + 0.054*誕生 + 0.043*すぎ + 0.027*アニメ + 0.024*ｗｗｗｗｗ + 0.021*年 + 0.021*ぃ + 0.019*可愛い + 0.018*つう + 0.015*お前
0.235*ああ + 0.158*■ + 0.128*～ + 0.082*) + 0.070*( + 0.038*あ + 0.026*･ + 0.019*(･ + 0.018*ﾟレ + 0.011*る
0.241*うう + 0.125*//\\ + 0.085*ぇい + 0.066*w + 0.063*う + 0.053*♥♥ + 0.032*なか + 0.030*\\ + 0.026*// + 0.016*鈍痛
0.144*の + 0.098*やよい + 0.063*が + 0.047*な + 0.041*。 + 0.033*で + 0.030*だ + 0.023*か + 0.023*よ + 0.022*ここ
0.909*♪ + 0.008*かわいい + 0.005*益 + 0.005*おめでとう + 0.003*ｗｗｗｗｗｗ + 0.003*すぎる + 0.002*深夜 + 0.002*ぴたたん + 0.002*くる + 0.002*日本
0.474*億 + 0.474*千 + 0.031*え + 0.002*ええ + 0.002*数 + 0.002*/ + 0.001*みな + 0.001*ねえ + 0.001*萌 + 0.001*おまえ
0.270*ｗ + 0.079*ฺ + 0.048*ｗｗｗ + 0.033*ちょっと + 0.032*ま + 0.029*✿ + 0.027*音痴 + 0.023*せ + 0.020*ぇ + 0.018*♬
0.118*g + 0.096*お + 0.079*わん + 0.057*^) + 0.045*おお + 0.036*＞ + 0.022*はーい + 0.015*＜ + 0.014*9 + 0.014*ｲｴｲヽ
0.383*！ + 0.173*わん + 0.154*ぅ + 0.039*a + 0.026*消えろ + 0.022*はぁ + 0.018*レ + 0.017*ふ + 0.012*アイマス + 0.010*ふわふわ
0.250*・ + 0.135*ハルヒ + 0.097*いい + 0.073*い + 0.049*いえ + 0.024*俺 + 0.017*次元 + 0.014*ぇいっ + 0.012*たま + 0.011*嫁
0.748*万 + 0.046*厨 + 0.035*アナル + 0.035*式 + 0.035*零 + 0.011*＼ + 0.010*／ + 0.006*め + 0.006*再生 + 0.005*すげ
0.213*☆ + 0.113*ﾟ + 0.106*★ + 0.081*ー + 0.038*♡ + 0.032*∀ + 0.027*゜ + 0.026*･｡.☆･ + 0.019*らん + 0.018*べり
0.245*つ + 0.039*Ｌ + 0.027*１ + 0.025*０ + 0.023*２ + 0.023*もう + 0.022*３ + 0.021*コメント + 0.020*ｘ + 0.020*みたい
0.102*と + 0.055*─ + 0.038*o + 0.027*で + 0.027*だい + 0.024*ぜ + 0.024*今日 + 0.021*から + 0.021*ない + 0.020*ぴ
0.327*は + 0.032*りん + 0.030*俺 + 0.019*神 + 0.018*ずっと + 0.016*ほっと + 0.015*ぺったん + 0.013*ひと + 0.013*胸 + 0.012*やすみ
0.063*っ + 0.055*、 + 0.036*ふたり + 0.034*は + 0.033*です + 0.025*ぴたり + 0.024*ひとり + 0.024*… + 0.023*に + 0.023*来
0.131*ん + 0.123*き + 0.069*ぴたた + 0.056*まき + 0.038*まし + 0.038*た + 0.025*動画 + 0.025*二 + 0.022*に + 0.018*て
0.116*ら + 0.109*や + 0.078*？ + 0.071*よ + 0.067*だ + 0.048*（ + 0.046*） + 0.030*ん + 0.027*か + 0.025*な
[(0, 0.012500000000000004), (1, 0.012500000000000004), (2, 0.012500000000000004), (3, 0.012500000000000004), (4, 0.012500000000000004), (5, 0.012500000000000004), (6, 0.012500000000000004), (7, 0.012500000000000004), (8, 0.012500000000000004), (9, 0.012500000000000004), (10, 0.012500000000000004), (11, 0.012500000000000004), (12, 0.012500000000000074), (13, 0.012500000000000004), (14, 0.51250000000732499), (15, 0.012500000000000004), (16, 0.26249999999267509), (17, 0.012500000000000004), (18, 0.012500000000000004), (19, 0.012500000000000004)]
(rest omitted)
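One way to read a per-document line like the one above is to sort the (topic ID, probability) pairs and look at the largest ones. The Python 3 sketch below uses a shortened copy of that line with the probabilities rounded by hand. The variable names are my own.

```python
# (topic ID, probability) pairs for one document, shortened and rounded
# from the output above.
topics_per_document = [(0, 0.0125), (14, 0.5125), (16, 0.2625), (19, 0.0125)]

# Sort by probability, highest first.
ranked = sorted(topics_per_document, key=lambda pair: pair[1], reverse=True)
top_topic, top_prob = ranked[0]

print(top_topic, top_prob)
```

For this document, topic 14 dominates and topic 16 comes second. The remaining topics all sit at the same small value.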

Overriding gensim.corpora.textcorpus.TextCorpus

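Here is some source code I wrote a while back.
I wrote it quite a long time ago, so it is messy in various ways.
I was using it on Windows, which is why Shift_JIS appears in it. If you ever do use the code below, change that as appropriate.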

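The JapaneseTextCorpus class overrides gensim.corpora.textcorpus.TextCorpus and is given a morphological analysis function, which it uses to segment the text.
It calls MeCab through subprocess, restricts the words to nouns, and contains other extra code, so it could really be written much more concisely.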

import gensim
import codecs
import subprocess

class JapaneseTextCorpus(gensim.corpora.TextCorpus):
    def __init__(self, input, coding, segmenter):
        self.segmenter = segmenter
        self.coding = coding
        gensim.corpora.TextCorpus.__init__(self, input)
    def get_texts(self):
        segment = self.segmenter(self.input)
        for s in segment:
            yield s

class JapaneseSegmenter:
    @staticmethod
    def mecab(coding):
        def segmentWithMeCab(input):
            ret = []
            result = subprocess.check_output(u'mecab {0}'.format(input).encode(coding), shell = True)
            result = unicode(result, coding)
            for doc in result.split('EOS'):
                docret = []
                for line in doc.split('\n'):
                    if u'名詞' in line and u'接尾' not in line and u'接頭' not in line and u'非自立' not in line and u'代名詞' not in line and u'数' not in line:
                        docret.append(line.split(',')[6])
                if docret != []:
                    ret.append(docret)
            return ret
        return segmentWithMeCab

if __name__ == '__main__':
    corpus = JapaneseTextCorpus('input.txt', 'shift-jis', JapaneseSegmenter.mecab('shift-jis'))
    
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=10, id2word=corpus.dictionary)
    for topic in lda.show_topics(-1):
        print topic
    for topics_per_document in lda[corpus]:
        print topics_per_document

References

The official documentation is easy to follow.