2021-01-15

Notes on using H2O AutoML for the usage-count prediction model from "Knock 38" of "Python実践データ分析100本ノック"

"Python実践データ分析100本ノック" is an excellent book for data-analysis beginners. Here I take the regression model for "usage-count prediction" introduced in the book and try H2O AutoML on it.

First, check the data used in Knock 38.

dataset.head()

年月	customer_id	count_pred	count_0	count_1	count_2	count_3	count_4	count_5
201810	AS002855	3	7.0	3.0	5.0	5.0	5.0	4.0
201810	AS009373	5	6.0	6.0	7.0	4.0	4.0	3.0
201810	AS015315	4	7.0	3.0	6.0	3.0	3.0	6.0
201810	AS015739	5	6.0	5.0	8.0	6.0	5.0	7.0
201810	AS019860	7	5.0	7.0	4.0	6.0	8.0	6.0

Apply linear regression (the regression described in Knock 38)

from sklearn import linear_model
import sklearn.model_selection

model = linear_model.LinearRegression()

# predict_data is the Knock 38 DataFrame (assumed to be the same data shown above as dataset)
X = predict_data[["count_0", "count_1", "count_2", "count_3", "count_4", "count_5", "period"]]
y = predict_data["count_pred"]
# Default split: 25% of the rows are held out, with no fixed random_state
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y)
model.fit(X_train, y_train)

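Check the RMSE

from sklearn.metrics import mean_squared_error
y_pred = model.predict(X_test)  # predictions for the held-out split
lm_mse = mean_squared_error(y_test, y_pred)
print('RMSE: ', lm_mse ** 0.5)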

> RMSE: 1.669
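
For reference, RMSE is just the square root of the mean of the squared errors. The following is a minimal sketch on three made-up values (not the book's data); the helper name rmse is my own and not part of scikit-learn.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values: the errors are 1, 0 and -2, so the MSE is (1 + 0 + 4) / 3
value = rmse([3, 5, 4], [2, 5, 6])
print(value)
```

Because the errors are squared before averaging, one large miss moves the RMSE more than several small ones.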

Apply H2O AutoML

Install H2O

!pip install h2o

Load the libraries

import h2o
from h2o.automl import H2OAutoML

import pandas as pd
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

Initialize H2O. According to the "Starting H2O" page of the official H2O 3.32.0.3 documentation, the simple call below is all that is needed in most cases.

h2o.init()

Prepare the data

target_col = "count_pred"
feature_cols = ["count_0", "count_1", "count_2", "count_3", "count_4", "count_5", "period"]

X_train = dataset[feature_cols]
y_train = dataset[target_col]
X_train_f, X_valid_f, y_train_f, y_valid_f = train_test_split(
    X_train, y_train, test_size=0.30, random_state=1)

train = pd.merge(X_train_f, y_train_f, left_index=True, right_index=True)
valid = pd.merge(X_valid_f, y_valid_f, left_index=True, right_index=True)

h2o_train = h2o.H2OFrame(train) 
h2o_valid = h2o.H2OFrame(valid)
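
The features and the target are merged back together because H2O trains from a single frame that contains both. As a sketch of what that merge does, the example below uses a tiny toy frame (the first five rows reuse count_0, count_1 and count_pred from the table above, the sixth row is invented) and checks that selecting the split's rows from the original DataFrame gives the same result.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for dataset; the last row is invented for illustration
dataset = pd.DataFrame({
    "count_0": [7.0, 6.0, 7.0, 6.0, 5.0, 4.0],
    "count_1": [3.0, 6.0, 3.0, 5.0, 7.0, 2.0],
    "count_pred": [3, 5, 4, 5, 7, 2],
})
feature_cols = ["count_0", "count_1"]
target_col = "count_pred"

X_train_f, X_valid_f, y_train_f, y_valid_f = train_test_split(
    dataset[feature_cols], dataset[target_col], test_size=0.30, random_state=1)

# Same approach as above: join features and target on the row index
merged = pd.merge(X_train_f, y_train_f, left_index=True, right_index=True)

# Alternative: pick the training rows straight from the original frame
selected = dataset.loc[X_train_f.index, feature_cols + [target_col]]

print(merged)
```

Either form yields one DataFrame per split with the target as the last column, which is the shape handed to h2o.H2OFrame.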

Run AutoML

aml = H2OAutoML(max_models=10, seed = 0, nfolds=5, keep_cross_validation_predictions=True)
aml.train(x=feature_cols, y=target_col,
          training_frame=h2o_train,
          leaderboard_frame=h2o_valid)

Check the leaderboard

lb = aml.leaderboard
lb.head(rows=lb.nrows)

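The RMSE improved slightly: 1.5624 for the leading AutoML model versus 1.669 for the linear regression. The two numbers come from different hold-out splits (the linear regression used an unseeded default split, AutoML a 30% split with random_state=1), so the comparison is only approximate.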

model_id	mean_residual_deviance	rmse	mse	mae	rmsle
StackedEnsemble_AllModels_AutoML_20210115_052140	2.44109	1.5624	2.44109	1.27232	0.313276
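
For a like-for-like comparison, the linear regression could be scored on the same split that feeds the AutoML leaderboard. The following is a rough sketch of that idea; the data here is synthetic (generated with NumPy, not the book's data), and the names baseline and baseline_rmse are my own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

target_col = "count_pred"
feature_cols = ["count_0", "count_1", "count_2", "count_3", "count_4", "count_5", "period"]

# Synthetic stand-in for the Knock 38 data (assumption: NOT the real data)
rng = np.random.default_rng(0)
dataset = pd.DataFrame(
    rng.integers(1, 10, size=(200, len(feature_cols))), columns=feature_cols
).astype(float)
dataset[target_col] = dataset[feature_cols[:6]].mean(axis=1) + rng.normal(0, 1, 200)

# Same split settings as the AutoML data preparation above
X_train_f, X_valid_f, y_train_f, y_valid_f = train_test_split(
    dataset[feature_cols], dataset[target_col], test_size=0.30, random_state=1)

# Fit the baseline on the training part and score it on the validation part
baseline = LinearRegression().fit(X_train_f, y_train_f)
baseline_rmse = np.sqrt(mean_squared_error(y_valid_f, baseline.predict(X_valid_f)))
print("linear regression RMSE on the shared validation split:", baseline_rmse)
```

With the real dataset in place of the synthetic one, this number would be directly comparable to the rmse column of the leaderboard, since leaderboard_frame was set to the same validation rows.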

2021-01-14

Notes on using H2O AutoML for the churn prediction model from "Knock 47" of "Python実践データ分析100本ノック"

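"Python実践データ分析100本ノック" is an excellent book for data-analysis beginners. Here I take the "churn prediction model" introduced in the book, which uses a decision tree, and try H2O AutoML on it.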

First, check the data used in Knock 47.

dataset.head()

count_1	routine_flg	period	is_deleted	campaign_name_入会費半額	campaign_name_入会費無料	class_name_オールタイム	class_name_デイタイム	gender_F
6.0	1.0	24	0.0	0	0	0	0	0
2.0	1.0	34	0.0	0	0	1	0	0
5.0	1.0	33	0.0	0	0	0	0	1
6.0	1.0	13	0.0	0	0	1	0	0
5.0	1.0	38	0.0	0	0	1	0	1

Store the explanatory variables and the target variable

target_col = 'is_deleted'
feature_cols = ['count_1', 'routine_flg', 'period', 'campaign_name_入会費半額',
       'campaign_name_入会費無料', 'class_name_オールタイム', 'class_name_デイタイム',
       'gender_F']

Apply a decision tree (the model described in Knock 47)

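import pandas as pd
import sklearn.model_selection
from sklearn.tree import DecisionTreeClassifier

# Undersample: keep every churned customer and draw an equal number of continuing ones.
# (The name exit shadows Python's built-in exit(); it is kept as in the original.)
exit = dataset.loc[dataset["is_deleted"]==1]
conti = dataset.loc[dataset["is_deleted"]==0].sample(len(exit))

X = pd.concat([exit, conti], ignore_index=True)
y = X["is_deleted"]
del X["is_deleted"]
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X,y)

DTCmodel = DecisionTreeClassifier(random_state=0)
DTCmodel.fit(X_train, y_train)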
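
The sampling step above balances the two classes by undersampling the continuing customers. Since sample() is called without a seed there, the selected rows (and so the scores) can change between runs. Below is a small sketch of the same balancing with a fixed random_state; the toy values and the names churned, continuing and balanced are my own.

```python
import pandas as pd

# Toy stand-in for the Knock 47 data (values invented for illustration)
dataset = pd.DataFrame({
    "count_1": [6.0, 2.0, 5.0, 6.0, 5.0, 1.0, 2.0, 7.0],
    "is_deleted": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
})

churned = dataset.loc[dataset["is_deleted"] == 1]
# random_state fixes which continuing customers are drawn
continuing = dataset.loc[dataset["is_deleted"] == 0].sample(len(churned), random_state=0)

balanced = pd.concat([churned, continuing], ignore_index=True)
class_counts = balanced["is_deleted"].value_counts()
print(class_counts)
```

After balancing, both classes have the same number of rows, so a model cannot score well simply by always predicting the majority class.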

Check the AUC

from sklearn.metrics import roc_auc_score

Y_score = DTCmodel.predict_proba(X_test)[:, 1]
print('auc = ', roc_auc_score(y_true=y_test, y_score=Y_score))

> auc = 0.9718441265387008
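
As a reminder of what this number means: AUC is the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below checks that reading against roc_auc_score on four made-up scores (not the model's output).

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# Count the pairs where the positive outranks the negative (ties count 0.5)
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
pairwise_auc = wins / (len(pos) * len(neg))

sk_auc = roc_auc_score(y_true, y_score)
print(pairwise_auc, sk_auc)  # both 0.75: 3 of the 4 pairs are ranked correctly
```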