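Notes on using H2O AutoML for the usage-count prediction model in "Knock 38" of 「Python実践データ分析100本ノック」
「Python実践データ分析100本ノック」 is an excellent book for data analysis beginners. This note takes the regression model for predicting usage counts that the book introduces and tries H2O AutoML on the same task.
First, check the data used in Knock 38.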
dataset.head()
年月 | customer_id | count_pred | count_0 | count_1 | count_2 | count_3 | count_4 | count_5 |
201810 | AS002855 | 3 | 7.0 | 3.0 | 5.0 | 5.0 | 5.0 | 4.0 |
201810 | AS009373 | 5 | 6.0 | 6.0 | 7.0 | 4.0 | 4.0 | 3.0 |
201810 | AS015315 | 4 | 7.0 | 3.0 | 6.0 | 3.0 | 3.0 | 6.0 |
201810 | AS015739 | 5 | 6.0 | 5.0 | 8.0 | 6.0 | 5.0 | 7.0 |
201810 | AS019860 | 7 | 5.0 | 7.0 | 4.0 | 6.0 | 8.0 | 6.0 |
Apply linear regression, the regression model described in Knock 38
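from sklearn import linear_model
import sklearn.model_selection

# predict_data: the DataFrame name used in the book; assumed here to hold the same table as dataset above
model = linear_model.LinearRegression()
X = predict_data[["count_0", "count_1", "count_2", "count_3", "count_4", "count_5", "period"]]
y = predict_data["count_pred"]
# No random_state is given, so the split (25% test by default) changes on every run
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y)
model.fit(X_train, y_train)
Check the RMSE
import numpy as np
from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_test)  # predictions on the held-out rows
lm_mse = mean_squared_error(y_test, y_pred)
print('RMSE: ', np.sqrt(lm_mse))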
-> RMSE: 1.669
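To make the metric concrete, here is a minimal sketch that computes RMSE by hand with the standard library only. It uses the five rows printed by dataset.head() and a naive baseline that predicts count_pred as the mean of count_0 to count_5 (assuming these columns are the usage counts of the preceding months). The baseline and the `rmse` helper are illustrations, not part of the book's code, and a score on five rows is not comparable to the RMSE reported above.

```python
import math

# The five rows shown by dataset.head(): (count_pred, [count_0 .. count_5])
rows = [
    (3, [7.0, 3.0, 5.0, 5.0, 5.0, 4.0]),
    (5, [6.0, 6.0, 7.0, 4.0, 4.0, 3.0]),
    (4, [7.0, 3.0, 6.0, 3.0, 3.0, 6.0]),
    (5, [6.0, 5.0, 8.0, 6.0, 5.0, 7.0]),
    (7, [5.0, 7.0, 4.0, 6.0, 8.0, 6.0]),
]


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared residuals."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


y_true = [target for target, _ in rows]
# Naive baseline (an assumption for illustration): mean of the six history columns
y_pred = [sum(history) / len(history) for _, history in rows]

print("RMSE of the naive baseline on these 5 rows:", round(rmse(y_true, y_pred), 3))
```

This is the same quantity that np.sqrt(mean_squared_error(...)) returns, so it can serve as a cross-check when the numbers look suspicious.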
Apply H2O AutoML
Install H2O
!pip install h2o
Load the libraries
import h2o
from h2o.automl import H2OAutoML
import pandas as pd
from sklearn.model_selection import train_test_split
import warnings

warnings.filterwarnings('ignore')
Initialize H2O. According to the "Starting H2O" page of the official H2O 3.32.0.3 documentation, the simple call below is sufficient in most cases.
h2o.init()
Prepare the data
target_col = "count_pred"
feature_cols = ["count_0", "count_1", "count_2", "count_3", "count_4", "count_5", "period"]

X_train = dataset[feature_cols]
y_train = dataset[target_col]

# Hold out 30% of the rows for validation; random_state fixes the split
X_train_f, X_valid_f, y_train_f, y_valid_f = train_test_split(
    X_train, y_train, test_size=0.30, random_state=1)

# H2O expects features and target in one frame, so join them back on the index
train = pd.merge(X_train_f, y_train_f, left_index=True, right_index=True)
valid = pd.merge(X_valid_f, y_valid_f, left_index=True, right_index=True)

h2o_train = h2o.H2OFrame(train)
h2o_valid = h2o.H2OFrame(valid)
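Because H2O wants the features and the target in a single frame, the split-then-merge step can also be written as one split over the combined columns. The following is a minimal sketch of that alternative, not the author's code: `split_frame` is a hypothetical helper name, and the toy DataFrame holds placeholder values rather than the book's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

target_col = "count_pred"
feature_cols = ["count_0", "count_1", "count_2", "count_3",
                "count_4", "count_5", "period"]


def split_frame(df, test_size=0.30, seed=1):
    """Split features and target together so no index-based merge is needed."""
    cols = feature_cols + [target_col]
    return train_test_split(df[cols], test_size=test_size, random_state=seed)


# Placeholder data for illustration only (not the book's dataset)
toy = pd.DataFrame({col: range(10) for col in feature_cols + [target_col]})
train, valid = split_frame(toy)

print(train.shape, valid.shape)
```

The resulting `train` and `valid` frames can be passed to h2o.H2OFrame in the same way as the merged frames above.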
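Run AutoML
# Train at most 10 base models with 5-fold cross-validation; seed fixes the randomness
aml = H2OAutoML(max_models=10, seed=0, nfolds=5,
                keep_cross_validation_predictions=True)
# The leaderboard is scored on the validation frame
aml.train(x=feature_cols, y=target_col,
          training_frame=h2o_train, leaderboard_frame=h2o_valid)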
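Once training has finished, the best model is available as aml.leader. The sketch below shows one way to score it on a hold-out frame; `leader_rmse` is a hypothetical helper name, and it assumes that `aml` is a trained H2OAutoML object and `frame` is an H2OFrame containing the target column.

```python
def leader_rmse(aml, frame):
    """Return the RMSE of the AutoML leader model on the given H2OFrame.

    Assumes `aml` has already been trained and `frame` contains the target column.
    """
    performance = aml.leader.model_performance(frame)
    return performance.rmse()
```

With the objects from this note the call would be leader_rmse(aml, h2o_valid). Since h2o_valid is also the leaderboard frame, the value should match the rmse of the top row of the leaderboard.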
Check the leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # show every row, not just the default first few
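RMSE improved slightly: the top leaderboard model scores 1.562 against 1.669 for the linear regression. The two numbers come from different hold-out splits (an unseeded default split for the linear regression, a seeded 30% split for AutoML), so treat the gap as indicative rather than exact.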
model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
StackedEnsemble_AllModels_AutoML_20210115_052140 | 2.44109 | 1.5624 | 2.44109 | 1.27232 | 0.313276 |
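As a quick sanity check on the figures above, the short sketch below confirms that the leaderboard rmse is the square root of the mse and expresses the gain over the linear regression as a relative change. All three constants are copied from this note. The variable names are illustrative.

```python
import math

lm_rmse = 1.669     # linear regression RMSE reported earlier in this note
aml_rmse = 1.5624   # rmse of the top leaderboard row
aml_mse = 2.44109   # mse of the top leaderboard row

# rmse should be the square root of mse (up to rounding in the printed table)
consistent = math.isclose(math.sqrt(aml_mse), aml_rmse, abs_tol=1e-4)

# Relative reduction in RMSE compared with the linear regression
improvement = (lm_rmse - aml_rmse) / lm_rmse

print("rmse consistent with mse:", consistent)
print(f"relative RMSE reduction: {improvement:.1%}")
```

In this row mean_residual_deviance and mse show the same value, so either column leads to the same rmse.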