唯物是真 @Scaled

The late-September 2016 release added a number of features to BigQuery.
Here are the ones that I personally thought looked useful.

Release Notes | BigQuery | Google Cloud Platform

cloud.google.com

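Standard SQL

BigQuery used to have its own SQL dialect, but it now supports standard SQL (the beta released in June has now left beta).
I had been using the older SQL all along, so migrating and getting used to the new one will take some effort, but there is a migration guide.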

Migrating from legacy SQL | BigQuery Documentation | Google Cloud Platform

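WITH clause

In standard SQL you can give a subquery a name and refer to it by that name, which frees you from queries nested so many levels deep that they become unreadable.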

WITH subquery_name AS (
  subquery
)

SELECT
  *
FROM subquery_name
WHERE MOD(`id`, 2) = 0
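
To try the WITH pattern locally, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for BigQuery. That substitution is an assumption, and the table `users`, its rows, and the name `named_subquery` are made up for illustration. SQLite spells the modulo test as `id % 2`.

```python
import sqlite3

# SQLite stands in for BigQuery here (assumption); table and CTE names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
)

query = """
WITH named_subquery AS (
  SELECT id, name FROM users
)
SELECT *
FROM named_subquery
WHERE id % 2 = 0  -- SQLite spelling of MOD(id, 2) = 0
ORDER BY id
"""

# The outer SELECT refers to the subquery by name instead of nesting it.
even_rows = conn.execute(query).fetchall()
print(even_rows)  # [(2, 'b'), (4, 'd')]
```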

UDF (User-Defined Functions)

It was already possible to write user-defined functions in JavaScript, but the syntax was hard to follow. In standard SQL they are much easier to write.

CREATE TEMPORARY FUNCTION timesTwo(x FLOAT64)
RETURNS FLOAT64
  LANGUAGE js AS """
  return x*2;
""";
User-Defined Functions | BigQuery Documentation | Google Cloud Platform

This lets you use the power of JavaScript from inside BigQuery.
For example, legacy SQL had a function, JSON_EXTRACT_SCALAR(), for pulling a value out of JSON. Standard SQL does not have it at the time of writing, but you can use a UDF instead.

JSON_EXTRACT in BigQuery Standard SQL? - Stack Overflow
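
As a rough sketch of that workaround, the snippet below only builds the text of a standard SQL query that defines a JavaScript UDF and calls it. The function name `jsonExtractScalar`, its parameter names, and the sample JSON are hypothetical. The query is not sent to BigQuery here and has not been verified against it.

```python
# Hypothetical UDF name, parameter names, and sample JSON.
# This only builds the query text; it does not contact BigQuery.
JSON_UDF_QUERY = '''
CREATE TEMPORARY FUNCTION jsonExtractScalar(json_text STRING, field_name STRING)
RETURNS STRING
LANGUAGE js AS """
  var value = JSON.parse(json_text)[field_name];
  return (value === undefined || value === null) ? null : String(value);
""";

SELECT jsonExtractScalar('{"handle": "alice", "rating": 1234}', 'handle') AS handle;
'''

print(JSON_UDF_QUERY)
```

The UDF definition follows the same CREATE TEMPORARY FUNCTION ... LANGUAGE js shape as the timesTwo example above. Unlike JSON_EXTRACT_SCALAR(), this sketch reads a single top-level key rather than a JSONPath.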

INSERT, UPDATE, DELETE

Until now BigQuery basically only allowed appending data, and in-table modifications such as UPDATE and DELETE were not permitted. These operations are now available (standard SQL only, beta release).

UPDATE table1 SET col1 = 1 WHERE col1 in (SELECT field1 from table2)
Data Manipulation Language | BigQuery Documentation | Google Cloud Platform

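However, UPDATE/DELETE is limited to 48 statements per table and 500 statements per project per day, which is quite restrictive, so it does not look usable for workloads that update frequently.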
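
To put those quotas in perspective, here is a small back-of-the-envelope calculation. The only inputs are the two daily limits quoted above; the variable names are just for illustration.

```python
# Daily UPDATE/DELETE quotas quoted in the text above.
PER_TABLE_PER_DAY = 48
PER_PROJECT_PER_DAY = 500

MINUTES_PER_DAY = 24 * 60

# Smallest average gap between UPDATE/DELETE statements on a single table.
min_gap_minutes = MINUTES_PER_DAY / PER_TABLE_PER_DAY

# Number of tables that can use their full per-table quota
# before the project-wide quota runs out.
tables_at_full_quota = PER_PROJECT_PER_DAY // PER_TABLE_PER_DAY

print(min_gap_minutes)       # 30.0
print(tables_at_full_quota)  # 10
```

In other words, a single table can be modified on average once every 30 minutes at most.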

Sharing queries

Queries can now be shared by URL.
Visibility can be set to either public or project-only.
Once you save a query with "Save Query", you can generate a URL for sharing it.

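AtCoder, the competitive programming site, has a new rating system.
I was curious how strongly it correlates with the rating on Codeforces, another competitive programming site, and how many contests it takes before participation count stops making a difference to the rating, so I looked into it.

For what it's worth, I take part in AtCoder now and then, but lately I can't solve anything and my rating has stopped moving (I'm weak).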
f:id:sucrose:20161002174028j:plain

Collecting the data

For AtCoder, I scrape all the data shown on the AtCoder rating ranking pages.
Codeforces has an API, so I use that.

Codeforces API - Codeforces

http://codeforces.com/api/user.ratedList returns information on every user who has participated at least once (the response is heavy).

Users with the same handle on AtCoder and Codeforces are treated as the same person.
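
The matching rule can be sketched as follows. The sample data is made up: `cf_result` mimics only the `handle` and `rating` fields of the `result` list that the source code at the end reads, and `atcoder_rows` stands in for the scraped ranking rows.

```python
# Hypothetical sample shaped like the `result` list of user.ratedList
# (only the two fields used here).
cf_result = [
    {"handle": "alice", "rating": 1900},
    {"handle": "bob", "rating": 1450},
]
# Hypothetical scraped AtCoder rows: (handle, rating, participation count).
atcoder_rows = [("alice", 2100, 6), ("carol", 800, 1)]

# Map Codeforces handle -> rating for quick lookup.
cf_rating = {user["handle"]: user["rating"] for user in cf_result}

# Keep only handles present on both sites; the match is exact and case-sensitive.
matched = [
    (handle, atcoder, cf_rating[handle], count)
    for handle, atcoder, count in atcoder_rows
    if handle in cf_rating
]
print(matched)  # [('alice', 2100, 1900, 6)]
```

This is only a heuristic: two different people can share a handle, and one person can use different handles on each site.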

Distribution of AtCoder ratings

See the following for how ratings map to colors:

https://atcoder.jp/post/14

[Figure: histogram of the AtCoder rating distribution, colored by rating band (f:id:sucrose:20161003010813p:plain)]

Rating	Top % of users at or above this rating
2800 (red)	2%
2400 (orange)	6%
2000 (yellow)	13%
1600 (blue)	26%
1200 (cyan)	46%
800 (green)	65%
400 (brown)	81%
0 (gray)	100%
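
A simplified sketch of how such a "top X%" column can be computed: count the share of users at or above each threshold. The ratings below are made up, and this plain count is an assumption that may differ slightly from the `scipy.stats.percentileofscore` call in the source code when ratings tie with a threshold.

```python
def top_percent(ratings, threshold):
    """Percentage of users whose rating is at least `threshold`."""
    at_or_above = sum(1 for r in ratings if r >= threshold)
    return 100.0 * at_or_above / len(ratings)


# Made-up ratings, not the scraped data.
sample = [50, 300, 450, 900, 1250, 1700, 2050, 2900]

for threshold in [0, 400, 800, 1200]:
    print(threshold, top_percent(sample, threshold))
# 0 100.0
# 400 75.0
# 800 62.5
# 1200 50.0
```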

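Relationship between AtCoder participation count and rating

Plotting AtCoder rating against Codeforces rating gives the scatter plot below.
The points are colored by AtCoder participation count, and you can see that the red points, users who have participated only once, tend to have low AtCoder ratings.
Looking at the colors for higher participation counts, when one rating goes up the other generally goes up too.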
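[Figure: scatter plot of Codeforces rating (x) vs. AtCoder rating (y), colored by AtCoder participation count (f:id:sucrose:20161003010948p:plain)]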
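On AtCoder your rating cannot get high until you have participated a certain number of times, so as an experiment I computed the correlation between participation count and rating (Spearman's rank correlation coefficient).

Spearman's rank correlation coefficient - Wikipedia

With all the data the correlation coefficient is about 0.7, but when I excluded users with fewer than 4 participations it dropped to nearly 0.
Restricting to a minimum participation count reduces the number of users and so costs some reliability, but among users who have participated at least a certain number of times there does not seem to be a strong positive correlation between participation count and rating.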

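[Figure: Spearman's rank correlation between AtCoder participation count and AtCoder rating, computed over users with at least N participations (f:id:sucrose:20161003011017p:plain)]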
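
For reference, Spearman's rank correlation is the Pearson correlation computed on ranks instead of raw values. Below is a minimal pure-Python sketch with average ranks for ties. The sample pairs are made up, the helper names are illustrative, and the actual analysis uses pandas' `corr('spearman')` as shown in the source code at the end.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up (participation count, rating) pairs.
counts = [1, 1, 2, 3, 5, 8]
ratings = [100, 300, 250, 900, 1400, 1300]
print(spearman(counts, ratings))
```

Because only the ordering matters, the coefficient is unaffected by how nonlinear the relationship between participation count and rating is, as long as it is monotonic.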

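Relationship between AtCoder and Codeforces ratings

As seen above, participation count appears to be related to rating, so I computed the correlation between AtCoder and Codeforces ratings separately for each AtCoder participation count.
Even for users with a single participation it is 0.7, and for two or more participations it is 0.8 or higher, which suggests a strong correlation.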
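[Figure: Spearman's rank correlation between Codeforces and AtCoder ratings, computed over users with exactly N participations (f:id:sucrose:20161003011031p:plain)]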

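Trying a regression

Since the Codeforces and AtCoder ratings appear to be strongly related, I fit a regression formula.
Participation count seems to affect the rating, so I predict the AtCoder rating from two inputs: the participation count and the Codeforces rating.

This time I used ridge regression from scikit-learn.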

1.1. Generalized Linear Models — scikit-learn 0.19.1 documentation

The effect of participation count does not look linear, so I converted it to a 1-of-K (one-hot) representation with OneHotEncoder before training.

4.3. Preprocessing data — scikit-learn 0.19.1 documentation
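
To illustrate what the 1-of-K conversion does, here is a minimal pure-Python sketch. The sample pairs, the helper `one_hot_rows`, and the column order are assumptions for illustration; the actual analysis uses scikit-learn's OneHotEncoder.

```python
def one_hot_rows(samples, categories):
    """Build rows of [one indicator per participation count..., codeforces_rating]."""
    rows = []
    for cf_rating, count in samples:
        indicators = [1.0 if count == c else 0.0 for c in categories]
        rows.append(indicators + [float(cf_rating)])
    return rows


# Made-up (Codeforces rating, participation count) pairs.
samples = [(1500, 1), (1800, 3), (2100, 2)]
categories = sorted({count for _, count in samples})  # [1, 2, 3]

for row in one_hot_rows(samples, categories):
    print(row)
# [1.0, 0.0, 0.0, 1500.0]
# [0.0, 0.0, 1.0, 1800.0]
# [0.0, 1.0, 0.0, 2100.0]
```

With exactly one indicator set per row and no separate intercept term, the coefficient learned for each indicator column acts as a bias specific to that participation count.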

Regression formula

I obtained the following formula:
\(\mathrm{AtCoder\ rating} \approx \mathrm{Codeforces\ rating} \times 1.06 + \mathrm{bias\ for\ the\ participation\ count}\)

Participation count	Bias
1	-1287
2	-822
3	-639
4	-418
5	-258
6	-255
7	-210
8	-114
9	-124
10	-102
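
The formula and the bias table above can be applied as in the following sketch. The function name and the example Codeforces rating of 1500 are made up for illustration; the slope and biases are copied from the table.

```python
SLOPE = 1.06
# Bias per AtCoder participation count, copied from the table above.
BIAS_BY_COUNT = {1: -1287, 2: -822, 3: -639, 4: -418, 5: -258,
                 6: -255, 7: -210, 8: -114, 9: -124, 10: -102}


def predict_atcoder_rating(codeforces_rating, count):
    """Apply the fitted formula; only counts 1 to 10 appear in the table."""
    return SLOPE * codeforces_rating + BIAS_BY_COUNT[count]


print(round(predict_atcoder_rating(1500, 1)))   # 303
print(round(predict_atcoder_rating(1500, 5)))   # 1332
print(round(predict_atcoder_rating(1500, 10)))  # 1488
```

The same Codeforces rating maps to very different AtCoder ratings depending on the participation count, so the outputs are rough estimates at best.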

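Looking at the biases in order, they stop changing much from around the 5th participation, so the rating may be starting to converge to where it should be.
When I used this formula to make predictions, errors of a few hundred points were unfortunately common.
Users with few participations probably have a large variance, so it might have been better to exclude them before fitting (?).
Drawing the regression lines for participation counts 1, 6, and 10 on the scatter plot, the fit looks more or less acceptable.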
[Figure: the scatter plot with the regression lines for participation counts 1, 6, and 10 (f:id:sucrose:20161003233209p:plain)]

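Summary

AtCoder and Codeforces ratings are strongly correlated.
As participation count grows, it stops making a difference to the rating (?) (which is odd, since you would expect stronger people to participate more).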

sucrose.hatenablog.com

Source code

# -*- coding: utf-8 -*-

import pyquery
import requests
import time
import scipy.stats
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.font_manager
import matplotlib.cm as cm

# Map Codeforces handle -> rating for every rated user (large response).
cf_rating = {}
cf_data = requests.get('http://codeforces.com/api/user.ratedList').json()
for i in cf_data['result']:
    cf_rating[i['handle']] = i['rating']

rating_atcoder = []
counts = []
rating_codeforces = []
# Scrape ranking pages 1 to 28 (page range is hard-coded).
for i in xrange(1, 29):
    table = pyquery.PyQuery(url='https://atcoder.jp/ranking?p={}'.format(i))
    for elm in table.find('tr')[1:]:
        tr = pyquery.PyQuery(elm)
        tds = tr.find('td')
        rank = int(pyquery.PyQuery(tds[0]).text())
        name = pyquery.PyQuery(tds[1]).text()
        rating = int(pyquery.PyQuery(tds[2]).text())
        count = int(pyquery.PyQuery(tds[4]).text())
        
        cf = cf_rating.get(name, np.nan)
        print rank, name, rating, cf, count
        rating_atcoder.append(rating)
        counts.append(count)
        rating_codeforces.append(cf)
    time.sleep(1)

df = pd.DataFrame({
    'rating_atcoder': rating_atcoder,
    'rating_codeforces': rating_codeforces,
    'count': counts
})

prop = matplotlib.font_manager.FontProperties(fname=r'C:\Windows\Fonts\meiryo.ttc', size=12)

sns.plt.hist(df[df['rating_atcoder'] < 400].reset_index()['rating_atcoder'], bins=range(0, 4001, 100), histtype='stepfilled', color='#808080')
sns.plt.hist(df[(400 <= df['rating_atcoder']) & (df['rating_atcoder'] < 800)].reset_index()['rating_atcoder'], bins=range(0, 4001, 100), histtype='stepfilled', color='#804000')
sns.plt.hist(df[(800 <= df['rating_atcoder']) & (df['rating_atcoder'] < 1200)].reset_index()['rating_atcoder'], bins=range(0, 4001, 100), histtype='stepfilled', color='#008000')
sns.plt.hist(df[(1200 <= df['rating_atcoder']) & (df['rating_atcoder'] < 1600)].reset_index()['rating_atcoder'], bins=range(0, 4001, 100), histtype='stepfilled', color='#00C0C0')
sns.plt.hist(df[(1600 <= df['rating_atcoder']) & (df['rating_atcoder'] < 2000)].reset_index()['rating_atcoder'], bins=range(0, 4001, 100), histtype='stepfilled', color='#0000FF')
sns.plt.hist(df[(2000 <= df['rating_atcoder']) & (df['rating_atcoder'] < 2400)].reset_index()['rating_atcoder'], bins=range(0, 4001, 100), histtype='stepfilled', color='#C0C000')
sns.plt.hist(df[(2400 <= df['rating_atcoder']) & (df['rating_atcoder'] < 2800)].reset_index()['rating_atcoder'], bins=range(0, 4001, 100), histtype='stepfilled', color='#FF8000')
sns.plt.hist(df[2800 <= df['rating_atcoder']].reset_index()['rating_atcoder'], bins=range(0, 4001, 100), histtype='stepfilled', color='#FF0000')
sns.plt.title(u'Distribution of AtCoder ratings', fontproperties=prop)
sns.plt.xlabel(u'AtCoder rating', fontproperties=prop)
sns.plt.ylabel(u'Number of users', fontproperties=prop)
sns.plt.show()

print 'rating: percentile'
for i in [0, 400, 800, 1200, 1600, 2000, 2400, 2800]:
    print '{}: {:.3}'.format(i, 100 - scipy.stats.percentileofscore(df['rating_atcoder'], i))

# Keep only users found on both sites (rows without a Codeforces rating are NaN).
df = df.dropna()

sns.plt.scatter(df['rating_codeforces'], df['rating_atcoder'], c=df['count'], cmap=cm.gist_rainbow)
cb = sns.plt.colorbar(label=u'AtCoder participation count')
cb.ax.yaxis.label.set_font_properties(prop)
sns.plt.title(u'AtCoder rating vs. Codeforces rating', fontproperties=prop)
sns.plt.xlabel(u'Codeforces rating', fontproperties=prop)
sns.plt.ylabel(u'AtCoder rating', fontproperties=prop)
sns.plt.show()

print df.corr('spearman')

idx = []
value = []
for i in xrange(1, 10):
    idx.append(i)
    value.append(df[df['count'] >= i].corr('spearman')['rating_atcoder']['count'])

sns.plt.plot(idx, value)
sns.plt.title(u'Correlation between AtCoder participation count and AtCoder rating', fontproperties=prop)
sns.plt.xlabel(u'Minimum participation count of the users included', fontproperties=prop)
sns.plt.ylabel(u"Spearman's rank correlation coefficient", fontproperties=prop)
sns.plt.show()
sns.plt.show()

idx = []
value = []
for i in xrange(1, 11):
    idx.append(i)
    value.append(df[df['count'] == i].corr('spearman')['rating_atcoder']['rating_codeforces'])

sns.plt.plot(idx, value)
sns.plt.title(u'Correlation between Codeforces and AtCoder ratings by participation count', fontproperties=prop)
sns.plt.xlabel(u'Exact participation count of the users included', fontproperties=prop)
sns.plt.ylabel(u"Spearman's rank correlation coefficient", fontproperties=prop)
sns.plt.show()
sns.plt.show()


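import sklearn.linear_model
import sklearn.preprocessing

# Ridge regression with the regularization strength chosen by cross-validation.
# No intercept: the one-hot participation-count columns act as per-count biases.
model = sklearn.linear_model.RidgeCV(alphas=[0.0000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0], store_cv_values=True, fit_intercept=False)
# One-hot encode column 1 (participation count); API as of scikit-learn 0.19.
enc = sklearn.preprocessing.OneHotEncoder(categorical_features=[1])

enc.fit(df[['rating_codeforces', 'count']])
model.fit(enc.transform(df[['rating_codeforces', 'count']]), df['rating_atcoder'])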

print model.coef_, model.intercept_
print model.cv_values_.mean(axis=0)