How much do Advent Calendar participants turn over each year? (2015)

Last year I looked at how much Advent Calendar participants turn over in one year (sucrose.hatenablog.com).
Since I already have the setup, I'm running the same survey again this year.

The survey covers the Advent Calendars published on Qiita for 2015 and 2014.

That gave me 360 calendars for 2015 and 214 calendars for 2014.

Matching calendars by name was a hassle, so from the lists above I only keep calendars whose 2015 and 2014 URLs are identical except for the year.
That leaves 101 calendars.
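
As a rough illustration of the URL matching described above, here is a minimal Python 3 sketch. The helper names `calendar_slug` and `matched_calendars` and the sample URLs are made up for this example, and it assumes calendar links have the form `/advent-calendar/<year>/<name>`.

```python
def calendar_slug(url, year):
    """Return the part of a calendar URL after '/advent-calendar/<year>/'.

    Returns None when the URL does not contain that prefix.
    """
    prefix = '/advent-calendar/{}/'.format(year)
    index = url.find(prefix)
    if index < 0:
        return None
    return url[index + len(prefix):]


def matched_calendars(urls_a, year_a, urls_b, year_b):
    """Return the calendar names that appear in both years."""
    slugs_a = {calendar_slug(u, year_a) for u in urls_a}
    slugs_b = {calendar_slug(u, year_b) for u in urls_b}
    # Drop None, which stands for URLs that did not match the expected form.
    return (slugs_a & slugs_b) - {None}


# Hypothetical sample data, not the real lists.
urls_2015 = ['/advent-calendar/2015/vim',
             '/advent-calendar/2015/reactjs',
             '/advent-calendar/2015/new-thing']
urls_2014 = ['/advent-calendar/2014/vim',
             '/advent-calendar/2014/reactjs',
             '/advent-calendar/2014/old-thing']

print(sorted(matched_calendars(urls_2015, 2015, urls_2014, 2014)))
# ['reactjs', 'vim']
```

A calendar that was renamed between the two years drops out with this approach, which is the price of skipping proper name matching.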

For each calendar I then check what fraction of last year's authors are writing again this year (called the "survival rate" below).
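
To make the definition concrete, here is a small Python 3 sketch of the survival rate as a set operation. The function name `survival_rate` and the author names are invented for this example. Only the counts (9 shared authors out of 11) mirror the netbsd row in the results table.

```python
def survival_rate(authors_last_year, authors_this_year):
    """Fraction of last year's authors who also appear this year.

    Returns None when last year's author list is empty, since the
    rate is undefined in that case.
    """
    last = set(authors_last_year)
    if not last:
        return None
    return len(last & set(authors_this_year)) / len(last)


# Hypothetical author names: 11 authors last year, 9 of whom return.
last_year = ['a{}'.format(i) for i in range(1, 12)]
this_year = ['a{}'.format(i) for i in range(1, 10)] + ['b1', 'b2']

print('{:.3f}'.format(survival_rate(last_year, this_year)))
# 0.818
```

Note that the denominator is last year's author count, so new authors joining this year do not change the rate.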

Results

Calendars with 10 or fewer participants this year were excluded.
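
The cut-off can be sketched like this in Python 3. The names `MIN_PARTICIPANTS` and `keep_calendar` and the `tiny-example` row are hypothetical. The netbsd and reactjs counts are taken from the results table.

```python
# Calendars with this many participants or fewer are dropped.
MIN_PARTICIPANTS = 10


def keep_calendar(num_this_year, threshold=MIN_PARTICIPANTS):
    """Return True when this year's participant count is above the cut-off."""
    return num_this_year > threshold


# (name, 2015 participants, 2014 participants)
samples = [
    ('netbsd', 11, 11),
    ('reactjs', 17, 1),
    ('tiny-example', 10, 25),  # hypothetical calendar, exactly at the cut-off
]

kept = [name for name, n2015, n2014 in samples if keep_calendar(n2015)]
print(kept)
# ['netbsd', 'reactjs']
```

The filter only looks at this year's count, which is why reactjs stays in the table even though it had a single participant in 2014.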

As with last year, most calendars seem to have a survival rate of around 20% to 30%.
The distribution of survival rates looks like the figure below.
f:id:sucrose:20151206225351p:plain
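
If you want to see roughly how such a distribution is counted without plotting anything, here is a pure-Python 3 sketch that sorts rates into ten bins of width 0.1. It is a stand-in for illustration, not the plotting code used for the figure. The function name `bin_counts` is made up, and the sample rates are a handful of values picked from the results table. Values that sit exactly on a bin edge may land in a different bin than a plotting library would choose, because of floating-point rounding.

```python
def bin_counts(rates, num_bins=10):
    """Count how many rates fall into each equal-width bin over [0, 1]."""
    counts = [0] * num_bins
    for rate in rates:
        # A rate of exactly 1.0 goes into the last bin instead of overflowing.
        index = min(int(rate * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


# A few survival rates taken from the results table.
sample_rates = [0.818, 0.750, 0.360, 0.320, 0.286, 0.250, 0.080, 0.000]

# Print a simple text histogram, one line per bin.
for i, count in enumerate(bin_counts(sample_rates)):
    print('{:.1f}-{:.1f}: {}'.format(i / 10, (i + 1) / 10, '#' * count))
```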

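Here is the table of results.
Calendars are sorted by participant survival rate, highest first.
Calendars near the top tend to be highly specialized topics or calendars run by a single organization, while those near the bottom tend to be topics with a large user base or ones that are currently trending.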

Rank	Calendar	Survival rate	Shared participants	2015 participants	2014 participants
1	netbsd	0.818	9	11	11
2	ipu	0.750	9	16	12
3	delphi	0.714	10	14	14
4	ue4	0.667	12	25	18
5	lig	0.667	4	16	6
6	linux	0.636	7	18	11
7	symfony	0.619	13	19	21
8	erlang	0.615	8	17	13
9	xamarin	0.600	12	21	20
10	javaee	0.600	15	25	25
11	groonga	0.583	7	11	12
12	pixiv	0.579	11	23	19
13	oculus-rift	0.560	14	25	25
14	cybird	0.542	13	24	24
15	azure	0.542	13	25	24
16	julialang	0.533	8	14	15
17	http2	0.533	8	12	15
18	emacs	0.520	13	25	25
19	haskell	0.500	12	25	24
20	clojure	0.500	12	25	24
21	lisp	0.471	8	16	17
22	yahoojapan-tech	0.462	6	14	13
23	postgresql	0.429	9	25	21
24	softlayer	0.417	10	22	24
25	td	0.412	7	16	17
26	aspnet	0.381	8	23	21
27	r-rstudio	0.375	6	24	16
28	qt	0.368	7	12	19
29	vim	0.360	9	25	25
30	unity	0.360	9	25	25
31	machinelearning	0.348	8	24	23
32	perl-entrance	0.333	6	14	18
33	java	0.320	8	25	25
34	ios	0.320	8	25	25
35	go	0.320	8	25	25
36	mlkcca	0.308	4	17	13
37	csharp	0.304	7	24	23
38	vs	0.300	6	17	20
39	pepabo	0.292	7	19	24
40	gcp	0.292	7	23	24
41	vue	0.286	2	14	7
42	mysql-casual	0.286	6	15	21
43	git	0.286	6	25	21
44	webgl	0.280	7	23	25
45	swift	0.280	7	25	25
46	php	0.280	7	25	25
47	html5	0.278	5	24	18
48	python	0.273	6	25	22
49	selenium	0.267	4	11	15
50	rails	0.250	6	21	24
51	nodejs	0.250	6	25	24
52	aws	0.250	6	25	24
53	scala	0.240	6	14	25
54	ruby	0.240	6	25	25
55	heroku	0.222	4	14	18
56	ouch-hack	0.190	4	20	21
57	javascript	0.160	4	25	25
58	cocos2d-x	0.158	3	11	19
59	ansible	0.130	3	15	23
60	elasticsearch	0.125	3	23	24
61	pronama-chan	0.095	2	14	21
62	lambda	0.080	2	24	25
63	docker	0.080	2	24	25
64	reactjs	0.000	0	17	1

Source code

The script uses pyquery, so if you don't have it, install it with something like pip install pyquery (numpy and matplotlib are used as well).

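It prints the number of calendars and the results above as a table in Hatena notation.
The page and HTML structure had changed a bit since last year, which made the scraping more of a hassle.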

# -*- coding: utf-8 -*-

import pyquery
import time

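def getCalendarList(year, page):
    # Fetch one page of the calendar index for the given year.
    calendar_list = pyquery.PyQuery(url='http://qiita.com/advent-calendar/{}/calendars?page={}'.format(year, page))
    title = set()
    for elm in calendar_list.find('.adventCalendarList_calendarTitle > a'):
        a = pyquery.PyQuery(elm)
        href = a.attr('href')
        # Drop the first 22 characters (the length of '/advent-calendar/YYYY/')
        # so that only the calendar's URL name remains.
        title.add(href[22:])
    return title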

def getAuthors(year, name):
    # Collect the set of author names listed on one calendar's page.
    calendar = pyquery.PyQuery(url='http://qiita.com/advent-calendar/{}/{}'.format(year, name))
    author = set()
    for elm in calendar.find('.adventCalendarCalendar_author a'):
        a = pyquery.PyQuery(elm)
        text = a.text()
        author.add(text)
    return author

if __name__ == '__main__':
    title2015 = set()
    for i in range(1, 19):
        title2015 |= getCalendarList(2015, i)
        time.sleep(1)
    title2014 = getCalendarList(2014, 1)

    print '2015: {}, 2014: {}, Intersection: {}'.format(len(title2015), len(title2014), len(title2015 & title2014))

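    result = []
    for name in title2015 & title2014:
        author2015 = getAuthors(2015, name)
        time.sleep(1)
        # Skip calendars with 10 or fewer participants this year.
        if len(author2015) <= 10:
            continue
        author2014 = getAuthors(2014, name)
        time.sleep(1)
        # Avoid dividing by zero when no 2014 authors were found.
        if not author2014:
            continue
        common = author2015 & author2014
        # Survival rate = shared authors / 2014 authors.
        result.append((float(len(common)) / len(author2014), name, len(author2015), len(author2014), len(common)))
    # Highest survival rate first.
    result.sort(reverse=True)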
    print u'|*順位|*カレンダー名|*生存率|*共通の参加者数|*2015年の参加者数|*2014年の参加者数|'.encode('utf-8')
    for i, (score, name, num2015, num2014, num_intersect) in enumerate(result, 1):
        print '|{0}|<a href="http://qiita.com/advent-calendar/2015/{1}">{1}</a>|{2:.3f}|{3}|{4}|{5}|'.format(i, name, score, num_intersect, num2015, num2014)


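# Plot the distribution of survival rates (uses `result` built above).
from pylab import *
import numpy as np
score = np.array(map(lambda x:x[0], result))
name = np.array(map(lambda x:x[1], result))

# Bins of width 0.1 from 0 to 1.
hist(score, bins=np.arange(0, 1.1, 0.1), histtype='stepfilled')
# Font that can render the Japanese axis labels (Windows path).
prop = matplotlib.font_manager.FontProperties(fname=r'C:\Windows\Fonts\meiryo.ttc', size=12)
xlabel(u'生存率', fontproperties=prop)  # "survival rate"
ylabel(u'カレンダー数', fontproperties=prop)  # "number of calendars"
show()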
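
For reference, the Hatena-notation table output mentioned above (header cells prefixed with `*`, cells separated by `|`) can be sketched in Python 3 as follows. The function name `hatena_table` and the English header labels are made up for this example. The two rows reuse values from the results table.

```python
def hatena_table(headers, rows):
    """Build a Hatena-notation table: '|*h1|*h2|' then '|c1|c2|' per row."""
    lines = ['|' + '|'.join('*' + header for header in headers) + '|']
    for row in rows:
        lines.append('|' + '|'.join(str(cell) for cell in row) + '|')
    return '\n'.join(lines)


headers = ['Rank', 'Calendar', 'Survival rate']
rows = [
    (1, 'netbsd', '{:.3f}'.format(9 / 11)),
    (2, 'ipu', '{:.3f}'.format(9 / 12)),
]

print(hatena_table(headers, rows))
# |*Rank|*Calendar|*Survival rate|
# |1|netbsd|0.818|
# |2|ipu|0.750|
```

Keeping the formatting in one small function makes it easy to swap in a different output format later without touching the scraping code.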