Predicting Football Player Salaries

수타. 2023. 7. 3. 11:29

Now it is time to predict salaries with the preprocessed data.

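First, I used automated tools to get a sense of which model would perform best. AutoViz handles the automated exploratory visualization and LazyPredict handles the model comparison. Starting with the AutoViz installation: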

!pip install autoviz

from autoviz.AutoViz_Class import AutoViz_Class
import matplotlib.pyplot as plt
%matplotlib inline

#plt.style.use('')  # the plot style can be adjusted here

for pos in position:
  # each entry of position is the name of a per-position DataFrame stored in globals()
  globals()['AV_{}'.format(pos)] = AutoViz_Class()
  globals()['AV_{}'.format(pos)].AutoViz(
      filename='',
      dfte = globals()[pos],
      depVar = 'value',
      verbose = 0,
      max_rows_analyzed = globals()[pos].shape[0],
      max_cols_analyzed = globals()[pos].shape[1]
  )

!pip install lazypredict

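The position variable in the loop carries over from the previous post and is a list of the four positions. Since the features are likely to influence the target differently for each position, I split the data and predicted by position.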
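As a side note, the per-position bookkeeping can also be kept in a dictionary instead of globals(). The following is only a minimal sketch under stated assumptions: the DataFrames are synthetic stand-ins, the column names and three of the four position names are made up for illustration (only "Forward" appears in the original code), and the layout of four leading metadata columns followed by features and a final target column is inferred from the iloc[:,4:-1] and iloc[:,-1] slices used in this post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_frame(n_rows):
    """Build a hypothetical per-position table: 4 metadata columns, 3 features, 1 target."""
    return pd.DataFrame({
        "name": [f"player_{i}" for i in range(n_rows)],
        "club": "club",
        "nation": "nation",
        "position": "pos",
        "feat_a": rng.normal(size=n_rows),
        "feat_b": rng.normal(size=n_rows),
        "feat_c": rng.normal(size=n_rows),
        "value": rng.uniform(1e6, 1e8, size=n_rows),
    })

# Position names other than "Forward" are assumptions made for this sketch.
frames = {pos: make_frame(50) for pos in ["Forward", "Midfielder", "Defender", "Goalkeeper"]}

splits = {}
for pos, df in frames.items():
    X = df.iloc[:, 4:-1]  # skip the 4 leading metadata columns, stop before the target
    y = df.iloc[:, -1]    # the last column is the target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=100
    )
    splits[pos] = {
        "X_train": X_train, "X_test": X_test,
        "y_train": y_train, "y_test": y_test,
    }

print({pos: s["X_train"].shape for pos, s in splits.items()})
```

With this layout, splits["Forward"]["X_train"] plays the role of X_Forward_train, and looping over all positions no longer needs dynamically named variables.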

from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyRegressor

for pos in position:
  df = globals()[pos]
  # features: columns from index 4 up to (not including) the last one; target: the last column
  globals()['X_{}_train'.format(pos)],globals()['X_{}_test'.format(pos)],globals()['y_{}_train'.format(pos)],globals()['y_{}_test'.format(pos)] =  \
   train_test_split(df.iloc[:,4:-1],df.iloc[:,-1],test_size=0.2,random_state=100)

clf = LazyRegressor(verbose=0, predictions=True)

# fit many regressors at once and get back a table of scores per model
models,predictions = clf.fit(X_Forward_train,X_Forward_test,y_Forward_train,y_Forward_test)
models

Running this for each position and checking the results,

RandomForestRegressor came out as the best performer, as shown below, so I decided to use it.
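A ranking like this can be roughly double-checked without LazyPredict by cross-validating a few candidates by hand. The sketch below is not what LazyRegressor does internally; it is a hand-rolled comparison on synthetic data from make_regression, and the three candidate models are my own choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for one position's features and target.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, estimator in candidates.items():
    # With no scoring argument, regressors are scored with R^2.
    scores = cross_val_score(estimator, X, y, cv=5)
    results[name] = scores.mean()

# Print the candidates from best to worst mean R^2.
for name, mean_r2 in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean R^2 = {mean_r2:.3f}")
```

The ranking on this synthetic, purely linear data says nothing about the football data; the point is only the comparison loop itself, which can be run on each position's real features.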

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler # intended to improve performance
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate

# StandardScaler() rescales each feature to mean 0 and variance 1
scaler = StandardScaler()

After that, with the required modules imported as above, I trained and evaluated a model for each position:

for pos in position:
  data_scaled = scaler.fit_transform(globals()[pos].iloc[:,4:-1])
  target = globals()[pos].iloc[:,-1]
  x_train, x_test, y_train, y_test = train_test_split(data_scaled, target, test_size=0.3, random_state=46)
  ########## create the model
  model = RandomForestRegressor()
  ########## train the model
  model.fit(x_train, y_train)
  ########## evaluate the model
  print(pos)
  print(model.score(x_test, y_test))  # R^2 on the held-out test set
  neg_mse_scores = cross_val_score(model, data_scaled, target, scoring="neg_mean_squared_error", cv = 5)
  rmse_scores  = np.sqrt(-1 * neg_mse_scores)
  avg_rmse = np.mean(rmse_scores)

  # every value returned by cross_val_score(scoring="neg_mean_squared_error") is negative
  print('Average RMSE over 5 folds : {0:.3f} '.format(avg_rmse))
  print()

Looking at the performance for each position,

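the results came out as follows. The goalkeeper model scored noticeably higher than the other positions. My reading is that, compared with other positions, the data that can justify a goalkeeper's market value is limited and there are fewer variables, which made prediction relatively easy.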
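One caveat about the evaluation loop above: the scaler is fitted on the full dataset before splitting, so held-out rows contribute to the scaling statistics. A random forest is insensitive to this kind of monotonic feature scaling, so the scores are unlikely to change much, but putting the scaler inside a pipeline keeps each fold clean. The following is a minimal sketch on synthetic data, not the evaluation used for the results in this post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one position's features and target.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# The scaler is refitted on the training part of every fold.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

neg_mse = cross_val_score(pipe, X, y, scoring="neg_mean_squared_error", cv=5)
rmse_per_fold = np.sqrt(-neg_mse)  # flip the sign, then take the square root
r2_per_fold = cross_val_score(pipe, X, y, scoring="r2", cv=5)

print("RMSE per fold:", np.round(rmse_per_fold, 3))
print("R^2 per fold: ", np.round(r2_per_fold, 3))
```

Reporting the per-fold spread alongside the mean also helps judge whether a gap between positions is larger than the fold-to-fold noise.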

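Also, because I expanded the salary data from abbreviated units into full numbers, the larger values run into the hundreds of millions, which is why the RMSE values come out so large.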
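The effect of units on RMSE can be seen with a few made-up numbers. In this sketch the values and predictions are hypothetical; it only illustrates that RMSE scales linearly with the unit of the target, while RMSE divided by the mean of the target does not depend on the unit.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical market values and predictions, expressed in millions of euros.
y_true_millions = np.array([5.0, 12.0, 30.0, 64.0])
y_pred_millions = np.array([6.0, 10.0, 33.0, 60.0])

rmse_millions = rmse(y_true_millions, y_pred_millions)
rmse_euros = rmse(y_true_millions * 1e6, y_pred_millions * 1e6)

# Dividing by the mean target gives a unit-free number.
nrmse_millions = rmse_millions / y_true_millions.mean()
nrmse_euros = rmse_euros / (y_true_millions * 1e6).mean()

print(f"RMSE in millions of euros: {rmse_millions:.3f}")
print(f"RMSE in euros:             {rmse_euros:,.0f}")
print(f"RMSE / mean (either unit): {nrmse_millions:.3f}")
```

For comparing positions, a unit-free quantity such as this ratio or R^2 is easier to read than the raw RMSE.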

Next is the feature importance for each position.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
for pos in position:
  data_scaled = scaler.fit_transform(globals()[pos].iloc[:,4:-1])
  target = globals()[pos].iloc[:,-1]
  x_train, x_test, y_train, y_test = train_test_split(data_scaled, target, test_size=0.3, random_state=46)
  ########## create the model
  model = RandomForestRegressor()
  ########## train the model
  model.fit(x_train, y_train)

  # impurity-based importances, labelled with the original feature names
  ftr_importances_values = model.feature_importances_
  ftr_importances = pd.Series(ftr_importances_values, index = globals()[pos].iloc[:,4:-1].columns)
  ftr_top10 = ftr_importances.sort_values(ascending=False)[:10]

  plt.figure(figsize=(8,6))
  #plt.title('Top 10 Feature Importances')
  sns.barplot(x=ftr_top10, y=ftr_top10.index)
  plt.show()
  print()

For the other positions the FM data had a large influence, but for goalkeepers clean_sheets, that is, the number of matches finished without conceding a goal, mattered more than the FM data.
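The feature_importances_ attribute of a random forest is impurity-based and computed on the training data, so it can be worth cross-checking with permutation importance on held-out rows. The sketch below uses synthetic data with made-up feature names; it is an additional check I am suggesting, not something done in the original analysis.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 6 features, only 3 of which actually drive the target.
X_arr, y = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=5.0, random_state=0
)
feature_names = [f"feat_{i}" for i in range(X_arr.shape[1])]  # hypothetical names
X = pd.DataFrame(X_arr, columns=feature_names)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=46
)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importances from the fitted forest (they sum to 1).
impurity = pd.Series(model.feature_importances_, index=feature_names)

# Permutation importances: drop in test-set R^2 when one column is shuffled.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
permutation = pd.Series(perm.importances_mean, index=feature_names)

comparison = pd.DataFrame(
    {"impurity": impurity, "permutation": permutation}
).sort_values("impurity", ascending=False)
print(comparison)
```

If both columns agree on the top features, the conclusion about which inputs matter most for a position rests on firmer ground.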

from sklearn.model_selection import GridSearchCV

for pos in position:
  data_scaled = scaler.fit_transform(globals()[pos].iloc[:,4:-1])
  target = globals()[pos].iloc[:,-1]
  x_train, x_test, y_train, y_test = train_test_split(data_scaled, target, test_size=0.3, random_state=46)

  # hyperparameter grid to search
  params = { 'n_estimators' : [10, 100],
            'max_depth' : [6, 8, 10, 12],
            'min_samples_leaf' : [8, 12, 18],
            'min_samples_split' : [8, 16, 20]
              }

  # create a RandomForestRegressor and run GridSearchCV on it
  rf_reg = RandomForestRegressor(random_state = 0, n_jobs = -1)
  grid_cv = GridSearchCV(rf_reg, param_grid = params, cv = 3, n_jobs = -1)
  grid_cv.fit(x_train, y_train)
  print(pos)
  print('Best hyperparameters: ', grid_cv.best_params_)
  print()

As shown above, Grid Search can also be used to find the best hyperparameters for each position.
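Once the search has finished, the refitted best model and its cross-validated score are available on the GridSearchCV object. The sketch below runs on synthetic data with a deliberately small grid of my own choosing, and it scores with RMSE via scoring="neg_root_mean_squared_error", which is an assumption of this example rather than the setting used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for one position's features and target.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=46
)

# A small illustrative grid so the search runs quickly.
params = {
    "n_estimators": [10, 50],
    "max_depth": [6, 10],
    "min_samples_leaf": [1, 8],
}

grid_cv = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=params,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid_cv.fit(X_train, y_train)

best_cv_rmse = -grid_cv.best_score_  # the scorer is negated, so flip the sign
test_r2 = grid_cv.best_estimator_.score(X_test, y_test)  # R^2 of the refitted model

print("Best hyperparameters:", grid_cv.best_params_)
print(f"Best cross-validated RMSE: {best_cv_rmse:.3f}")
print(f"Test R^2 of the best model: {test_r2:.3f}")
```

Because refit defaults to True, best_estimator_ is already trained on all of x_train with the winning parameters and can be used directly for prediction.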

Finally, I fed in the data for Kim Min-jae, who had been linked with a move to the EPL.

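The prediction came out at 64 million euros, close to 90 billion won. In other words, I estimated that this is roughly what Kim Min-jae's market value would have been had he produced his Napoli performances in the EPL.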
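The post does not show the prediction code for a single player, so the following is only a sketch of how such a step might look. The feature names, training data and the new player's numbers are all hypothetical. The one point it is meant to illustrate is that a new row should go through the already fitted scaler with transform, not fit_transform, before being passed to the model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["feat_a", "feat_b", "feat_c"]  # hypothetical feature names

# Hypothetical training data standing in for one position's table.
X_train = pd.DataFrame(rng.normal(size=(100, 3)), columns=feature_names)
y_train = 1e7 + 5e6 * X_train["feat_a"] + rng.normal(scale=1e6, size=100)

# Fit the scaler and the model on the training data only.
scaler = StandardScaler().fit(X_train)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(scaler.transform(X_train), y_train)

# One new player with the same columns in the same order (made-up numbers).
new_player = pd.DataFrame([[1.2, -0.3, 0.5]], columns=feature_names)

# Reuse the fitted scaler; calling fit_transform here would rescale
# the single row against itself and produce all zeros.
predicted_value = model.predict(scaler.transform(new_player))[0]
print(f"Predicted value: {predicted_value:,.0f}")
```

A random forest averages training targets, so its prediction can never exceed the highest value seen in training, which is worth keeping in mind when estimating a top player.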

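That wraps things up. I did not cover data collection or the detailed process in these posts, but it is fair to say that most of the time on this project went into preprocessing. Working with raw data that I scraped myself, rather than a well-curated dataset, was very tricky. Predicting something with machine learning assumes that the outcome is fully explained by the data we collected, and that made it clear to me why things like stocks and the weather are so hard to predict. It also reminded me once again how important the underlying data is in machine learning.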

Thank you.
