Looking into the distribution of AtCoder ratings

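AtCoder, a competitive programming site, has introduced a new rating system.
I was curious how strongly its ratings correlate with those of Codeforces, another competitive programming site, and how many contests it takes before the participation count stops making a difference to the rating, so I looked into it.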

Incidentally, I take part in AtCoder contests from time to time, but lately I can hardly solve anything, so my rating has stopped moving (I'm weak).
f:id:sucrose:20161002174028j:plain

Collecting the data

For AtCoder, I scrape every entry shown on the AtCoder rating ranking pages.
Codeforces has an API, so I use that.

Codeforces API - Codeforces

http://codeforces.com/api/user.ratedList returns information on every user who has participated at least once (the response is heavy).

Users with the same handle on AtCoder and Codeforces are treated as the same person.
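
The handle matching described above can be sketched roughly as follows. This is a minimal Python 3 illustration, not the script used for this post: the users, ratings, and the helper name `join_by_handle` are all made up, and it only assumes that the AtCoder rows and a handle-to-rating dictionary for Codeforces have already been collected.

```python
# Hypothetical toy data. The real data would come from the AtCoder ranking
# pages and the Codeforces user.ratedList API.
atcoder_users = [
    {"handle": "alice", "rating": 2100, "count": 5},
    {"handle": "bob", "rating": 800, "count": 1},
    {"handle": "carol", "rating": 1500, "count": 3},
]
codeforces_rating = {"alice": 2250, "carol": 1700, "dave": 1900}


def join_by_handle(atcoder_rows, cf_rating):
    """Keep AtCoder users whose handle also appears on Codeforces (exact match)."""
    joined = []
    for row in atcoder_rows:
        cf = cf_rating.get(row["handle"])
        if cf is None:
            continue  # no Codeforces user with this exact handle
        joined.append({**row, "rating_codeforces": cf})
    return joined


matched = join_by_handle(atcoder_users, codeforces_rating)
for row in matched:
    print(row["handle"], row["rating"], row["rating_codeforces"], row["count"])
```

The comparison in this sketch is an exact, case-sensitive string match. An identical handle on the two sites is not guaranteed to belong to the same person, so some of the matched pairs may be wrong.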

Distribution of AtCoder ratings

The rating color bands are explained here:

https://atcoder.jp/post/14

f:id:sucrose:20161003010813p:plain

Rating (lower bound of the color band)	Share of users at or above this rating
2800 (red)	2%
2400 (orange)	6%
2000 (yellow)	13%
1600 (blue)	26%
1200 (cyan)	46%
800 (green)	65%
400 (brown)	81%
0 (gray)	100%
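
For reference, the "share of users at or above" column can be computed along these lines. This is a simplified Python 3 sketch with made-up ratings. The script at the end of this post uses `scipy.stats.percentileofscore` instead, which treats ties slightly differently, so the numbers would not necessarily match exactly.

```python
def top_percent(ratings, threshold):
    """Percentage of users whose rating is at or above the threshold."""
    at_or_above = sum(1 for r in ratings if r >= threshold)
    return 100.0 * at_or_above / len(ratings)


# Hypothetical ratings, not the real AtCoder data.
sample_ratings = [150, 420, 650, 900, 1250, 1300, 1700, 2100, 2500, 3000]

for threshold in [2800, 2400, 2000, 1600, 1200, 800, 400, 0]:
    print("{}: top {:.0f}%".format(threshold, top_percent(sample_ratings, threshold)))
```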

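Relationship between AtCoder participation count and rating

Plotting AtCoder ratings against Codeforces ratings gives the scatter plot below.
Points are colored by AtCoder participation count; the red points, users who have participated only once, tend to have low AtCoder ratings.
Among the colors for higher participation counts, when one rating goes up the other generally goes up too.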
f:id:sucrose:20161003010948p:plain
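On AtCoder the rating cannot get large until a user has participated a certain number of times, so I computed the correlation between participation count and rating (Spearman's rank correlation coefficient).

Spearman's rank correlation coefficient - Wikipedia

Using all the data, the correlation coefficient is about 0.7, but when I excluded users with fewer than 4 participations it dropped to nearly 0.
Restricting to users above a participation threshold shrinks the sample, which makes the estimate somewhat less reliable, but once users have participated a certain number of times there does not seem to be a strong positive correlation between participation count and rating.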

f:id:sucrose:20161003011017p:plain
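
To make the calculation concrete, here is a small Python 3 sketch of Spearman's rank correlation (rank both variables, then take the Pearson correlation of the ranks) and of recomputing it after dropping users below a participation threshold. The data and the helper names are invented for illustration; the actual script at the end of this post relies on `DataFrame.corr('spearman')` from pandas.

```python
def average_ranks(values):
    """Rank values starting from 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def spearman_with_min_count(rows, min_count):
    """Correlation of (count, rating), using only users with count >= min_count."""
    kept = [(c, r) for c, r in rows if c >= min_count]
    return spearman([c for c, _ in kept], [r for _, r in kept])


# Hypothetical (participation count, rating) pairs, not the real data.
users = [(1, 300), (1, 450), (2, 700), (3, 1000), (4, 1900),
         (5, 1500), (6, 2200), (7, 1600), (8, 1800)]

for min_count in [1, 4]:
    print(min_count, round(spearman_with_min_count(users, min_count), 3))
```

In this toy data the low ratings of the users with few participations drive most of the correlation, so dropping them pulls the coefficient toward 0, which is the same kind of effect as in the plot above.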

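Relationship between AtCoder and Codeforces ratings

As seen above, participation count appears to be related to rating, so I computed the correlation between AtCoder and Codeforces ratings separately for each AtCoder participation count.
It is 0.7 even for users who have participated only once and 0.8 or higher for two or more participations, so the correlation looks strong.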
f:id:sucrose:20161003011031p:plain

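Trying a regression

Since Codeforces and AtCoder ratings appear to be strongly related, let's fit a regression equation.
Participation count seems to affect the rating, so I predict the AtCoder rating from two inputs: the participation count and the Codeforces rating.

I used ridge regression from scikit-learn.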

1.1. Generalized Linear Models — scikit-learn 0.19.1 documentation

The effect of participation count does not look linear, so I converted it to a 1-of-K (one-hot) representation with OneHotEncoder before training.

4.3. Preprocessing data — scikit-learn 0.19.1 documentation
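
As a rough illustration of why the 1-of-K representation helps: with one indicator column per participation count and no intercept, a linear model ends up with one slope for the Codeforces rating plus a separate bias for each count. The Python 3 sketch below uses a hypothetical feature layout and made-up weights. It does not reproduce scikit-learn's `OneHotEncoder` output, whose column order differs.

```python
def one_hot_row(cf_rating, count, max_count=10):
    """Encode one user as [cf_rating, indicator for count 1, ..., indicator for count K]."""
    indicators = [0.0] * max_count
    indicators[count - 1] = 1.0  # counts are assumed to lie in 1..max_count
    return [float(cf_rating)] + indicators


def predict(weights, row):
    """Linear model without an intercept: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, row))


# Hypothetical weights: one slope for the Codeforces rating,
# followed by one bias per participation count (1 to 10).
weights = [1.0] + [-1000.0, -500.0, -250.0] + [0.0] * 7

row = one_hot_row(1500, 2)
print(row)
print(predict(weights, row))  # slope * 1500 + the bias for count 2
```

Because exactly one indicator is 1 in each row, the model is free to pick any bias for each count, so it is not forced to treat the effect of going from 1 to 2 participations as the same as going from 9 to 10.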

Regression equation

The fit gave the following equation:
\(\mathrm{AtCoder\ rating} \approx 1.06 \times \mathrm{Codeforces\ rating} + \mathrm{bias\ for\ the\ participation\ count}\)

Participation count	Bias
1	-1287
2	-822
3	-639
4	-418
5	-258
6	-255
7	-210
8	-114
9	-124
10	-102
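
Plugging numbers into the equation is straightforward. Here is a minimal Python 3 sketch using the coefficient 1.06 and the bias table above. The function name is my own, and raising an error for participation counts outside 1 to 10 is an assumption, since the table does not cover them.

```python
SLOPE = 1.06  # coefficient on the Codeforces rating, from the equation above

# Bias per AtCoder participation count, copied from the table above.
BIAS_BY_COUNT = {
    1: -1287, 2: -822, 3: -639, 4: -418, 5: -258,
    6: -255, 7: -210, 8: -114, 9: -124, 10: -102,
}


def predict_atcoder_rating(cf_rating, count):
    """Predicted AtCoder rating for a Codeforces rating and a participation count."""
    if count not in BIAS_BY_COUNT:
        # Assumption: the table only covers 1..10, so anything else is rejected.
        raise ValueError("no bias available for participation count {}".format(count))
    return SLOPE * cf_rating + BIAS_BY_COUNT[count]


for count in [1, 5, 10]:
    print(count, round(predict_atcoder_rating(2000, count)))
```

For the same Codeforces rating, the prediction for a user with 1 participation is more than a thousand points below the prediction for a user with 10.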

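Looking at the biases in order, they change little from around the fifth participation onward, so the rating may be starting to converge to where it should be.
When I tried predicting with this equation, errors of a few hundred points were unfortunately common.
Data from users with few participations seems to have large variance, so it might have been better to exclude them from the fit (?).
Drawing the regression lines for participation counts 1, 6 and 10 on the scatter plot, the fit looks reasonably okay.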
f:id:sucrose:20161003233209p:plain

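Summary

AtCoder and Codeforces ratings are strongly correlated.
As the participation count grows, it stops making a difference to the rating (?) (which is odd, since stronger people would seem likely to participate more often).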


Source code

# -*- coding: utf-8 -*-

import pyquery
import requests
import time
import scipy.stats
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.font_manager
import matplotlib.cm as cm

cf_rating = {}
cf_data = requests.get('http://codeforces.com/api/user.ratedList').json()
for i in cf_data['result']:
    cf_rating[i['handle']] = i['rating']

rating_atcoder = []
counts = []
rating_codeforces = []
for i in xrange(1, 29):
    table = pyquery.PyQuery(url='https://atcoder.jp/ranking?p={}'.format(i))
    for elm in table.find('tr')[1:]:
        tr = pyquery.PyQuery(elm)
        tds = tr.find('td')
        rank = int(pyquery.PyQuery(tds[0]).text())
        name = pyquery.PyQuery(tds[1]).text()
        rating = int(pyquery.PyQuery(tds[2]).text())
        count = int(pyquery.PyQuery(tds[4]).text())
        
        cf = cf_rating.get(name, np.nan)
        print rank, name, rating, cf, count
        rating_atcoder.append(rating)
        counts.append(count)
        rating_codeforces.append(cf)
    time.sleep(1)

df = pd.DataFrame({
    'rating_atcoder': rating_atcoder,
    'rating_codeforces': rating_codeforces,
    'count': counts
})

prop = matplotlib.font_manager.FontProperties(fname=r'C:\Windows\Fonts\meiryo.ttc', size=12)

# Draw one histogram per rating color band, from gray (lowest) to red (highest)
bins = range(0, 4001, 100)
color_bands = [
    (float('-inf'), 400, '#808080'),
    (400, 800, '#804000'),
    (800, 1200, '#008000'),
    (1200, 1600, '#00C0C0'),
    (1600, 2000, '#0000FF'),
    (2000, 2400, '#C0C000'),
    (2400, 2800, '#FF8000'),
    (2800, float('inf'), '#FF0000'),
]
for low, high, color in color_bands:
    in_band = (low <= df['rating_atcoder']) & (df['rating_atcoder'] < high)
    sns.plt.hist(df[in_band]['rating_atcoder'], bins=bins, histtype='stepfilled', color=color)
sns.plt.title(u'Distribution of AtCoder ratings', fontproperties=prop)
sns.plt.xlabel(u'AtCoder rating', fontproperties=prop)
sns.plt.ylabel(u'Number of users', fontproperties=prop)
sns.plt.show()

print 'rating: percentile'
for i in [0, 400, 800, 1200, 1600, 2000, 2400, 2800]:
    print '{}: {:.3}'.format(i, 100 - scipy.stats.percentileofscore(df['rating_atcoder'], i))

df = df.dropna()

sns.plt.scatter(df['rating_codeforces'], df['rating_atcoder'], c=df['count'], cmap=cm.gist_rainbow)
cb = sns.plt.colorbar(label=u'AtCoder participation count')
cb.ax.yaxis.label.set_font_properties(prop)
sns.plt.title(u'Relationship between AtCoder and Codeforces ratings', fontproperties=prop)
sns.plt.xlabel(u'Codeforces rating', fontproperties=prop)
sns.plt.ylabel(u'AtCoder rating', fontproperties=prop)
sns.plt.show()

print df.corr('spearman')

idx = []
value = []
for i in xrange(1, 10):
    idx.append(i)
    value.append(df[df['count'] >= i].corr('spearman')['rating_atcoder']['count'])

sns.plt.plot(idx, value)
sns.plt.title(u'Correlation between AtCoder participation count and AtCoder rating', fontproperties=prop)
sns.plt.xlabel(u'Minimum number of participations for a user to be included', fontproperties=prop)
sns.plt.ylabel(u"Spearman's rank correlation coefficient", fontproperties=prop)
sns.plt.show()

idx = []
value = []
for i in xrange(1, 11):
    idx.append(i)
    value.append(df[df['count'] == i].corr('spearman')['rating_atcoder']['rating_codeforces'])

sns.plt.plot(idx, value)
sns.plt.title(u'Correlation between Codeforces and AtCoder ratings by participation count', fontproperties=prop)
sns.plt.xlabel(u'Exact number of participations of the users included', fontproperties=prop)
sns.plt.ylabel(u"Spearman's rank correlation coefficient", fontproperties=prop)
sns.plt.show()


import sklearn.linear_model
model = sklearn.linear_model.RidgeCV(alphas=[0.0000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0], store_cv_values=True, fit_intercept=False)
import sklearn.preprocessing
enc = sklearn.preprocessing.OneHotEncoder(categorical_features=[1])

enc.fit(df[['rating_codeforces', 'count']])
model.fit(enc.transform(df[['rating_codeforces', 'count']]), df['rating_atcoder'])

print model.coef_, model.intercept_
print model.cv_values_.mean(axis=0)