1. Splitting the data
from sklearn.model_selection import train_test_split
X_TRAIN, X_TEST, Y_TRAIN, Y_TEST = train_test_split(x_train, y_train, test_size = 0.2, random_state = 10)
1.1 Check the shape of the data with .shape
If the dependent (target) variable has two or more columns, model.fit() raises an error, as shown below (a fix sketch follows the error output).
model.fit(X_TRAIN, Y_TRAIN)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-121-4ceb6b81c8dd> in <module>()
----> 1 model.fit(X_TRAIN, Y_TRAIN)
2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
1037
1038 raise ValueError(
-> 1039 "y should be a 1d array, got an array of shape {} instead.".format(shape)
1040 )
1041
ValueError: y should be a 1d array, got an array of shape (712, 2) instead.
Y_TRAIN.shape
(712, 2)
print(Y_TRAIN)
     PassengerId  Survived
57 58 0
717 718 1
431 432 1
633 634 0
163 164 0
.. ... ...
369 370 1
320 321 0
527 528 0
125 126 1
265 266 0
[712 rows x 2 columns]
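One way to avoid this error, assuming y_train is a DataFrame holding both PassengerId and Survived as above, is to pass only the Survived column to train_test_split so that y comes out one-dimensional. A minimal sketch using the variable names from the code above:
from sklearn.model_selection import train_test_split
# Assumption: y_train is the DataFrame containing PassengerId and Survived.
# Keep only the target column so Y_TRAIN / Y_TEST are 1-D Series.
X_TRAIN, X_TEST, Y_TRAIN, Y_TEST = train_test_split(
    x_train, y_train['Survived'], test_size=0.2, random_state=10
)
print(Y_TRAIN.shape)  # (712,) -> 1-D, so model.fit() no longer raises ValueError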
2. Building the classification model
import xgboost
print(dir(xgboost))
print(help(xgboost.XGBClassifier))
Help on class XGBClassifier in module xgboost.sklearn:
class XGBClassifier(XGBModel, sklearn.base.ClassifierMixin)
| XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, verbosity=1, silent=None, objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
|
| Implementation of the scikit-learn API for XGBoost classification.
|
| Parameters
| ----------
| max_depth : int
| Maximum tree depth for base learners.
| learning_rate : float
| Boosting learning rate (xgb's "eta")
| n_estimators : int
| Number of trees to fit.
| verbosity : int
| The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
| silent : boolean
| Whether to print messages while running boosting. Deprecated. Use verbosity instead.
| objective : string or callable
| Specify the learning task and the corresponding learning objective or
| a custom objective function to be used (see note below).
| booster: string
| Specify which booster to use: gbtree, gblinear or dart.
| nthread : int
| Number of parallel threads used to run xgboost. (Deprecated, please use ``n_jobs``)
| n_jobs : int
| Number of parallel threads used to run xgboost. (replaces ``nthread``)
| gamma : float
| Minimum loss reduction required to make a further partition on a leaf node of the tree.
| min_child_weight : int
| Minimum sum of instance weight(hessian) needed in a child.
| max_delta_step : int
| Maximum delta step we allow each tree's weight estimation to be.
| subsample : float
| Subsample ratio of the training instance.
| colsample_bytree : float
| Subsample ratio of columns when constructing each tree.
| colsample_bylevel : float
| Subsample ratio of columns for each level.
| colsample_bynode : float
| Subsample ratio of columns for each split.
| reg_alpha : float (xgb's alpha)
| L1 regularization term on weights
| reg_lambda : float (xgb's lambda)
| L2 regularization term on weights
| scale_pos_weight : float
| Balancing of positive and negative weights.
| base_score:
| The initial prediction score of all instances, global bias.
| seed : int
| Random number seed. (Deprecated, please use random_state)
| random_state : int
| Random number seed. (replaces seed)
| missing : float, optional
| Value in the data which needs to be present as a missing value. If
| None, defaults to np.nan.
| importance_type: string, default "gain"
| The feature importance type for the feature_importances_ property: either "gain",
| "weight", "cover", "total_gain" or "total_cover".
| \*\*kwargs : dict, optional
| Keyword arguments for XGBoost Booster object. Full documentation of parameters can
| be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst.
| Attempting to set a parameter via the constructor args and \*\*kwargs dict simultaneously
| will result in a TypeError.
|
| .. note:: \*\*kwargs unsupported by scikit-learn
|
| \*\*kwargs is unsupported by scikit-learn. We do not guarantee that parameters
| passed via this argument will interact properly with scikit-learn.
Create the model by referring to the function parameters shown above.
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=100)
model.fit(X_TRAIN, Y_TRAIN)
3. Making predictions
y_test_predicted = pd.DataFrame(model.predict(X_TEST))
3-1. To obtain the survival probability instead, use the predict_proba() function
y_test_predicted = pd.DataFrame(model.predict_proba(X_TEST))[0]  # probability of death (class 0)
y_test_predicted = pd.DataFrame(model.predict_proba(X_TEST))[1]  # probability of survival (class 1)
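For reference, predict_proba() returns one column per class in the order of model.classes_ (here 0, then 1), so column 1 holds the survival probability. A minimal sketch, assuming the model has already been fit:
import pandas as pd
proba = model.predict_proba(X_TEST)        # shape: (n_samples, 2)
print(model.classes_)                      # [0 1] -> column order of proba
y_test_predicted = pd.DataFrame(proba)[1]  # probability of survival (class 1)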
4. Evaluating the model
from sklearn.metrics import roc_auc_score
print(roc_auc_score(Y_TEST, y_test_predicted))
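Note that ROC AUC is normally computed on predicted probabilities rather than hard 0/1 labels, so a variant like the sketch below (an assumption, not the original code) usually gives a more meaningful score, provided Y_TEST contains only the Survived column:
from sklearn.metrics import roc_auc_score
# Use the survival probability (column 1 of predict_proba) for the AUC
y_test_proba = model.predict_proba(X_TEST)[:, 1]
print(roc_auc_score(Y_TEST, y_test_proba))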
5. Submitting the results
final = pd.concat([x_test_passenser_id, y_test_predicted], axis = 1)
print(final)
PassengerId 0
0 892 0.0
1 893 0.0
2 894 0.0
3 895 1.0
4 896 1.0
[418 rows x 2 columns]
5-1. The prediction result column is named 0, so it needs to be renamed
final = final.rename(columns={0:'Survived'})
print(final)
PassengerId Survived
0 892 0.0
1 893 0.0
2 894 0.0
3 895 1.0
4 896 1.0
[418 rows x 2 columns]
final.to_csv('result.csv', index = False)
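As a final check (not part of the original code), the saved file can be read back to confirm the column names and row count:
check = pd.read_csv('result.csv')
print(check.shape)   # expected (418, 2)
print(check.head())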