1. Splitting the data
from sklearn.model_selection import train_test_split
X_TRAIN, X_TEST, Y_TRAIN, Y_TEST = train_test_split(x_train, y_train, test_size = 0.2, random_state = 10)
1.1 Check the shape of the data with .shape
If the dependent (target) variable has two or more columns, model.fit() raises an error, as shown below (a fix sketch follows the error output).
model.fit(X_TRAIN, Y_TRAIN)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-121-4ceb6b81c8dd> in <module>()
----> 1 model.fit(X_TRAIN, Y_TRAIN)
2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
1037
1038 raise ValueError(
-> 1039 "y should be a 1d array, got an array of shape {} instead.".format(shape)
1040 )
1041
ValueError: y should be a 1d array, got an array of shape (712, 2) instead.
Y_TRAIN.shape
(712, 2)
print(Y_TRAIN)
     PassengerId  Survived
57 58 0
717 718 1
431 432 1
633 634 0
163 164 0
.. ... ...
369 370 1
320 321 0
527 528 0
125 126 1
265 266 0
[712 rows x 2 columns]
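One way to avoid this error, assuming y_train is a DataFrame holding both PassengerId and Survived as above, is to pass only the Survived column to train_test_split so that y comes out one-dimensional. A minimal sketch using the variable names from the code above:
from sklearn.model_selection import train_test_split
# Assumption: y_train is the DataFrame containing PassengerId and Survived.
# Keep only the target column so Y_TRAIN / Y_TEST are 1-D Series.
X_TRAIN, X_TEST, Y_TRAIN, Y_TEST = train_test_split(
    x_train, y_train['Survived'], test_size=0.2, random_state=10
)
print(Y_TRAIN.shape)  # (712,) -> 1-D, so model.fit() no longer raises ValueError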
2. Building the classification model
import xgboost
print(dir(xgboost))
print(help(xgboost.XGBClassifier))
Help on class XGBClassifier in module xgboost.sklearn:
class XGBClassifier(XGBModel, sklearn.base.ClassifierMixin)
| XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, verbosity=1, silent=None, objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
|
| Implementation of the scikit-learn API for XGBoost classification.
|
| Parameters
| ----------
| max_depth : int
| Maximum tree depth for base learners.
| learning_rate : float
| Boosting learning rate (xgb's "eta")
| n_estimators : int
| Number of trees to fit.
| verbosity : int
| The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
| silent : boolean
| Whether to print messages while running boosting. Deprecated. Use verbosity instead.
| objective : string or callable
| Specify the learning task and the corresponding learning objective or
| a custom objective function to be used (see note below).
| booster: string
| Specify which booster to use: gbtree, gblinear or dart.
| nthread : int
| Number of parallel threads used to run xgboost. (Deprecated, please use ``n_jobs``)
| n_jobs : int
| Number of parallel threads used to run xgboost. (replaces ``nthread``)
| gamma : float
| Minimum loss reduction required to make a further partition on a leaf node of the tree.
| min_child_weight : int
| Minimum sum of instance weight(hessian) needed in a child.
| max_delta_step : int
| Maximum delta step we allow each tree's weight estimation to be.
| subsample : float
| Subsample ratio of the training instance.
| colsample_bytree : float
| Subsample ratio of columns when constructing each tree.
| colsample_bylevel : float
| Subsample ratio of columns for each level.
| colsample_bynode : float
| Subsample ratio of columns for each split.
| reg_alpha : float (xgb's alpha)
| L1 regularization term on weights
| reg_lambda : float (xgb's lambda)
| L2 regularization term on weights
| scale_pos_weight : float
| Balancing of positive and negative weights.
| base_score:
| The initial prediction score of all instances, global bias.
| seed : int
| Random number seed. (Deprecated, please use random_state)
| random_state : int
| Random number seed. (replaces seed)
| missing : float, optional
| Value in the data which needs to be present as a missing value. If
| None, defaults to np.nan.
| importance_type: string, default "gain"
| The feature importance type for the feature_importances_ property: either "gain",
| "weight", "cover", "total_gain" or "total_cover".
| \*\*kwargs : dict, optional
| Keyword arguments for XGBoost Booster object. Full documentation of parameters can
| be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst.
| Attempting to set a parameter via the constructor args and \*\*kwargs dict simultaneously
| will result in a TypeError.
|
| .. note:: \*\*kwargs unsupported by scikit-learn
|
| \*\*kwargs is unsupported by scikit-learn. We do not guarantee that parameters
| passed via this argument will interact properly with scikit-learn.
Create the model by referring to the function parameters shown above.
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=100)
model.fit(X_TRAIN, Y_TRAIN)
3. Making predictions
y_test_predicted = pd.DataFrame(model.predict(X_TEST))
3-1. To obtain the survival probability instead, use the predict_proba() function
y_test_predicted = pd.DataFrame(model.predict_proba(X_TEST))[0]  # probability of death (class 0)
y_test_predicted = pd.DataFrame(model.predict_proba(X_TEST))[1]  # probability of survival (class 1)
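For reference, predict_proba() returns one column per class in the order of model.classes_ (here 0, then 1), so column 1 holds the survival probability. A minimal sketch, assuming the model has already been fit:
import pandas as pd
proba = model.predict_proba(X_TEST)        # shape: (n_samples, 2)
print(model.classes_)                      # [0 1] -> column order of proba
y_test_predicted = pd.DataFrame(proba)[1]  # probability of survival (class 1)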
4. Evaluating the model
from sklearn.metrics import roc_auc_score
print(roc_auc_score(Y_TEST, y_test_predicted))
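Note that ROC AUC is normally computed on predicted probabilities rather than hard 0/1 labels, so a variant like the sketch below (an assumption, not the original code) usually gives a more meaningful score, provided Y_TEST contains only the Survived column:
from sklearn.metrics import roc_auc_score
# Use the survival probability (column 1 of predict_proba) for the AUC
y_test_proba = model.predict_proba(X_TEST)[:, 1]
print(roc_auc_score(Y_TEST, y_test_proba))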
5. Submitting the results
final = pd.concat([x_test_passenser_id, y_test_predicted], axis = 1)
print(final)
PassengerId 0
0 892 0.0
1 893 0.0
2 894 0.0
3 895 1.0
4 896 1.0
[418 rows x 2 columns]
5-1. The prediction result column is named 0, so it needs to be renamed
final = final.rename(columns={0:'Survived'})
print(final)
PassengerId Survived
0 892 0.0
1 893 0.0
2 894 0.0
3 895 1.0
4 896 1.0
[418 rows x 2 columns]
final.to_csv('result.csv', index = False)
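As a final check (not part of the original code), the saved file can be read back to confirm the column names and row count:
check = pd.read_csv('result.csv')
print(check.shape)   # expected (418, 2)
print(check.head())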