Practice: Building and Evaluating a Classification Model

by _><- 2022. 6. 18.

1. Splitting the data
from sklearn.model_selection import train_test_split

X_TRAIN, X_TEST, Y_TRAIN, Y_TEST = train_test_split(x_train, y_train, test_size = 0.2, random_state = 10)
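A quick way to sanity-check the split is to compare the resulting shapes with the expected sizes. The sketch below uses made-up stand-in data with 891 rows (an assumption: that is the size of the Kaggle Titanic training set, which is consistent with the 712 training rows shown later in this post). The column names and variable names are hypothetical.

```python
import math

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 891 rows (assumed Titanic training-set size).
n_rows = 891
features = pd.DataFrame({
    "Pclass": np.arange(n_rows) % 3 + 1,
    "Fare": np.linspace(5.0, 100.0, n_rows),
})
target = pd.Series(np.arange(n_rows) % 2, name="Survived")

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=10
)

# With a float test_size, the test count is rounded up and the rest is used for training.
expected_test = math.ceil(n_rows * 0.2)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
print(expected_test)
```

Because the target here is a single Series, y_tr comes out one-dimensional, which is what fit() expects.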
 
 
 
1.1 Check the shape of the data with shape
 
If the dependent variable (the target) has two or more columns, fit() raises an error.
 
model.fit(X_TRAIN, Y_TRAIN)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-121-4ceb6b81c8dd> in <module>()
----> 1 model.fit(X_TRAIN, Y_TRAIN)

2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
   1037 
   1038     raise ValueError(
-> 1039         "y should be a 1d array, got an array of shape {} instead.".format(shape)
   1040     )
   1041 

ValueError: y should be a 1d array, got an array of shape (712, 2) instead.
 
Y_TRAIN.shape
 
(712, 2)
 
print(Y_TRAIN)
     PassengerId  Survived
57             58         0
717           718         1
431           432         1
633           634         0
163           164         0
..            ...       ...
369           370         1
320           321         0
527           528         0
125           126         1
265           266         0

[712 rows x 2 columns]
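The error above goes away once the target is reduced to a single column. A minimal sketch, assuming the label frame looks like the printed Y_TRAIN (an ID column plus Survived); y_frame and y_1d are hypothetical names used only for illustration:

```python
import pandas as pd

# Hypothetical label frame shaped like the printed Y_TRAIN: an ID column plus the label.
y_frame = pd.DataFrame({
    "PassengerId": [58, 718, 432, 634, 164],
    "Survived": [0, 1, 1, 0, 0],
})
print(y_frame.shape)  # two columns, which fit() rejects as a target

# Keep only the label column. A single column is a one-dimensional Series.
y_1d = y_frame["Survived"]
print(y_1d.shape)
```

Selecting the column before calling train_test_split means both the training and validation targets are already one-dimensional.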
 
 
 
 
 
 
2. Building the classification model
 
 
import xgboost
print(dir(xgboost))              # list what the package exposes
help(xgboost.XGBClassifier)      # help() prints the documentation itself
 
Help on class XGBClassifier in module xgboost.sklearn:

class XGBClassifier(XGBModel, sklearn.base.ClassifierMixin)
 |  XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, verbosity=1, silent=None, objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
 |  
 |  Implementation of the scikit-learn API for XGBoost classification.
 |  
 |  Parameters
 |  ----------
 |  max_depth : int
 |      Maximum tree depth for base learners.
 |  learning_rate : float
 |      Boosting learning rate (xgb's "eta")
 |  n_estimators : int
 |      Number of trees to fit.
 |  verbosity : int
 |      The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
 |  silent : boolean
 |      Whether to print messages while running boosting. Deprecated. Use verbosity instead.
 |  objective : string or callable
 |      Specify the learning task and the corresponding learning objective or
 |      a custom objective function to be used (see note below).
 |  booster: string
 |      Specify which booster to use: gbtree, gblinear or dart.
 |  nthread : int
 |      Number of parallel threads used to run xgboost.  (Deprecated, please use ``n_jobs``)
 |  n_jobs : int
 |      Number of parallel threads used to run xgboost.  (replaces ``nthread``)
 |  gamma : float
 |      Minimum loss reduction required to make a further partition on a leaf node of the tree.
 |  min_child_weight : int
 |      Minimum sum of instance weight(hessian) needed in a child.
 |  max_delta_step : int
 |      Maximum delta step we allow each tree's weight estimation to be.
 |  subsample : float
 |      Subsample ratio of the training instance.
 |  colsample_bytree : float
 |      Subsample ratio of columns when constructing each tree.
 |  colsample_bylevel : float
 |      Subsample ratio of columns for each level.
 |  colsample_bynode : float
 |      Subsample ratio of columns for each split.
 |  reg_alpha : float (xgb's alpha)
 |      L1 regularization term on weights
 |  reg_lambda : float (xgb's lambda)
 |      L2 regularization term on weights
 |  scale_pos_weight : float
 |      Balancing of positive and negative weights.
 |  base_score:
 |      The initial prediction score of all instances, global bias.
 |  seed : int
 |      Random number seed.  (Deprecated, please use random_state)
 |  random_state : int
 |      Random number seed.  (replaces seed)
 |  missing : float, optional
 |      Value in the data which needs to be present as a missing value. If
 |      None, defaults to np.nan.
 |  importance_type: string, default "gain"
 |      The feature importance type for the feature_importances_ property: either "gain",
 |      "weight", "cover", "total_gain" or "total_cover".
 |  \*\*kwargs : dict, optional
 |      Keyword arguments for XGBoost Booster object.  Full documentation of parameters can
 |      be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst.
 |      Attempting to set a parameter via the constructor args and \*\*kwargs dict simultaneously
 |      will result in a TypeError.
 |  
 |      .. note:: \*\*kwargs unsupported by scikit-learn
 |  
 |          \*\*kwargs is unsupported by scikit-learn.  We do not guarantee that parameters
 |          passed via this argument will interact properly with scikit-learn.

Create the model, using the parameter list above as a reference.

from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=100)

# Y_TRAIN must be one-dimensional here, e.g. only the Survived column (see 1.1)
model.fit(X_TRAIN, Y_TRAIN)

 

3. Predicting the results

import pandas as pd

y_test_predicted = pd.DataFrame(model.predict(X_TEST))
 
 
3-1. To get the survival probability, use predict_proba()
 
y_test_predicted = pd.DataFrame(model.predict_proba(X_TEST))[0]   # probability of death (class 0)
 
 
y_test_predicted = pd.DataFrame(model.predict_proba(X_TEST))[1]   # probability of survival (class 1)
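The column order of predict_proba() follows the model's classes_ attribute, so column 0 is class 0 and column 1 is class 1 for a 0/1 target. The sketch below illustrates this on synthetic data. Note the swap: it uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBClassifier so that it runs without xgboost installed. The data and variable names are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data (hypothetical, not the Titanic set).
X, y = make_classification(n_samples=200, n_features=5, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# Stand-in classifier; XGBClassifier exposes the same fit/predict_proba interface.
clf = GradientBoostingClassifier(n_estimators=100, random_state=10)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)  # shape (n_samples, n_classes)
print(clf.classes_)              # column order of proba
p_class0 = proba[:, 0]           # probability of class 0
p_class1 = proba[:, 1]           # probability of class 1
print(np.allclose(p_class0 + p_class1, 1.0))
```

Slicing the array with proba[:, 1] gives the same values as wrapping it in a DataFrame and taking column 1.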
 
 
 
 
4. Evaluating the model
 
from sklearn.metrics import roc_auc_score

print(roc_auc_score(Y_TEST, y_test_predicted))
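ROC AUC is a ranking metric, so it is usually computed from the positive-class probability (as in 3-1) rather than from hard 0/1 predictions. A small worked example with made-up labels and scores: the AUC equals the share of (positive, negative) pairs in which the positive sample gets the higher score. pairwise_auc is a hypothetical helper written only for this illustration.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and positive-class scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]


def pairwise_auc(labels, values):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [v for l, v in zip(labels, values) if l == 1]
    neg = [v for l, v in zip(labels, values) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


auc_manual = pairwise_auc(y_true, scores)
auc_sklearn = roc_auc_score(y_true, scores)
print(auc_manual, auc_sklearn)
```

Here three of the four pairs are ordered correctly (0.35 loses only to 0.4), so both values come out to 0.75.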


5. Submitting the results

final = pd.concat([x_test_passenser_id, y_test_predicted], axis = 1)
print(final)
 
     PassengerId    0
0            892  0.0
1            893  0.0
2            894  0.0
3            895  1.0
4            896  1.0


[418 rows x 2 columns]

5-1. The prediction column is named 0 by default, so it needs to be renamed

final = final.rename(columns={0:'Survived'})
print(final)
     PassengerId  Survived
0            892       0.0
1            893       0.0
2            894       0.0
3            895       1.0
4            896       1.0


[418 rows x 2 columns]
final.to_csv('result.csv', index = False)
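As an alternative sketch, the submission frame can be built with both column names in one step, which skips the rename. It also avoids a pitfall of pd.concat with axis=1, which aligns rows by index and fills NaN where the indexes differ. The inputs below are made up, and casting to int is an assumption for the case where integer labels are wanted instead of 0.0/1.0.

```python
import pandas as pd

# Hypothetical inputs: IDs with a non-default index, predictions as floats.
passenger_ids = pd.Series([892, 893, 894], index=[10, 11, 12], name="PassengerId")
predictions = [0.0, 0.0, 1.0]

# Build from plain arrays so rows are paired by position, not by index label.
submission = pd.DataFrame({
    "PassengerId": passenger_ids.to_numpy(),
    "Survived": pd.Series(predictions).astype(int).to_numpy(),
})
print(submission)
```

Writing it out works the same way as above, e.g. submission.to_csv('result.csv', index = False).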