
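Predicting the Likelihood of E-Signing a Loan from Credit Transaction History

Mentor: 오혜수
Hello! I'm a job seeker interested in machine learning!

Machine Learning Through Real-World Problems - Week 5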

 
 
 
 

Import Library

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sn

dataset = pd.read_csv('./P39-Financial-Data.csv')
 

EDA

dataset.head()
 
dataset.columns
 
dataset.describe()
 

Cleaning Data

Removing NaN

dataset.isna().any()

Histograms

dataset2 = dataset.drop(columns = ['entry_id', 'pay_schedule', 'e_signed'])
dataset2
 
fig = plt.figure(figsize=(15, 12))
plt.suptitle('Histograms of Numerical Columns', fontsize=20)
for i in range(dataset2.shape[1]):
    plt.subplot(6, 3, i + 1)
    f = plt.gca()
    # Title each subplot with its own column name
    f.set_title(dataset2.columns.values[i])
    # Use one bin per unique value, capped at 100 bins
    vals = np.size(dataset2.iloc[:, i].unique())
    if vals >= 100:
        vals = 100
    plt.hist(dataset2.iloc[:, i], bins=vals, color='#3F5D7D')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
 

Correlation with Response Variable

dataset3 = dataset2.corrwith(dataset.e_signed)
dataset3
 
dataset2.corrwith(dataset.e_signed).plot.bar(
    figsize=(20, 10), title='Correlation with E Signed',
    fontsize=15, rot=45, grid=True, color=plt.get_cmap('Paired').colors)
 
# Correlation Matrix
sn.set(style='white')

# Compute the correlation matrix
corr = dataset2.corr()
corr
 
# Generate a mask for the upper triangle (the correlation matrix is symmetric)
mask = np.triu(np.ones_like(corr, dtype=bool))

# Generate a custom diverging colormap
cmap = sn.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sn.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
           square=True, linewidths=.5, cbar_kws={'shrink': .5})
 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import random
import time

random.seed(100)
 

Data Preprocessing

 
Feature Engineering
dataset = dataset.drop(columns=['months_employed'])
dataset['personal_account_months'] = dataset.personal_account_m + dataset.personal_account_y * 12
dataset[['personal_account_m', 'personal_account_y','personal_account_months']].head()
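As a quick sanity check of the month conversion above, here is a minimal sketch on made-up values. The toy rows are assumptions for illustration, not rows from the CSV; only the column names mirror the dataset.

```python
import pandas as pd

# Made-up rows for illustration; only the column names mirror the dataset.
toy = pd.DataFrame({
    "personal_account_m": [3, 0, 11],
    "personal_account_y": [2, 5, 0],
})

# Total account age in months = leftover months + years * 12
toy["personal_account_months"] = toy.personal_account_m + toy.personal_account_y * 12

print(toy["personal_account_months"].tolist())  # [27, 60, 11]
```

Once the combined column exists, the two source columns carry no extra information, which is why they can be dropped.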
 
dataset = dataset.drop(columns=['personal_account_m', 'personal_account_y'])
dataset.head()
 
dataset.columns
 
 
One Hot Encoding
dataset = pd.get_dummies(dataset)
dataset.columns
 
dataset = dataset.drop(columns=['pay_schedule_semi-monthly'])
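Dropping one dummy column is the usual way to avoid the dummy variable trap: after one-hot encoding, the dummies of a single categorical column always sum to 1, so one of them is redundant. The sketch below illustrates this on a toy frame; the rows and the `income` column are assumptions for illustration, not data from the CSV.

```python
import pandas as pd

# Toy data for illustration; values are made up, not taken from the real CSV.
toy = pd.DataFrame({
    "income": [3000, 4200, 2500, 5100],
    "pay_schedule": ["weekly", "bi-weekly", "monthly", "semi-monthly"],
})

encoded = pd.get_dummies(toy, columns=["pay_schedule"])
dummy_cols = [c for c in encoded.columns if c.startswith("pay_schedule_")]

# Every row has exactly one active dummy, so the dummies always sum to 1.
row_sums = encoded[dummy_cols].astype(int).sum(axis=1)
print(row_sums.tolist())  # [1, 1, 1, 1]

# Dropping one level removes the redundancy; the dropped level becomes the baseline.
reduced = encoded.drop(columns=["pay_schedule_semi-monthly"])
print(list(reduced.columns))
```

A row whose remaining `pay_schedule_*` dummies are all zero is then implicitly the dropped category.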
 
Removing extra columns
response = dataset['e_signed']
users = dataset['entry_id']
dataset = dataset.drop(columns = ['e_signed', 'entry_id'])
dataset.head()
 
Splitting into Train and Test Set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset, response, test_size = 0.2, random_state = 0)
 
Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train2 = pd.DataFrame(sc_X.fit_transform(X_train))
X_test2 = pd.DataFrame(sc_X.transform(X_test))
X_train2.columns = X_train.columns.values
X_test2.columns = X_test.columns.values
X_train2.index = X_train.index.values
X_test2.index = X_test.index.values
X_train = X_train2
X_test = X_test2
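The scaling step above fits the scaler on the training set only and then restores the column names and row indices that `StandardScaler` discards. A compact sketch of the same idea follows; the helper name `scale_frames` and the toy frames are hypothetical, introduced only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_frames(train, test):
    """Fit a StandardScaler on `train` only, then scale both frames,
    keeping the original column names and row indices."""
    scaler = StandardScaler()
    train_scaled = pd.DataFrame(
        scaler.fit_transform(train), columns=train.columns, index=train.index
    )
    # The test set reuses the training mean and std, so no test information leaks in.
    test_scaled = pd.DataFrame(
        scaler.transform(test), columns=test.columns, index=test.index
    )
    return train_scaled, test_scaled, scaler


# Toy frames (made-up values) with non-default indices, like a shuffled split.
train = pd.DataFrame(
    {"age": [20.0, 30.0, 40.0], "income": [1000.0, 2000.0, 3000.0]},
    index=[10, 11, 12],
)
test = pd.DataFrame({"age": [50.0], "income": [4000.0]}, index=[99])

train_s, test_s, scaler = scale_frames(train, test)
print(train_s)
print(test_s)
```

Keeping the original indices matters later, when predictions are matched back to `entry_id`.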
 

Model Building

Comparing Models

 
Logistic Regression
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0, penalty = 'l1', solver="saga")
classifier.fit(X_train, y_train)
 
Predicting Test Set
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

results = pd.DataFrame([['Logistic Regression (Lasso)', acc, prec, rec, f1]], columns = ['Model', 'Accuracy', 'Precision', 'Recall','F1 Score'])

results
 
 
SVM (Linear)
from sklearn.svm import SVC
classifier = SVC(random_state = 0, kernel='linear')
classifier.fit(X_train, y_train)
 
Predicting Test Set
y_pred = classifier.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
model_results = pd.DataFrame([['SVM (Linear)', acc, prec, rec, f1]], columns = ['Model', 'Accuracy', 'Precision','Recall', 'F1 Score'])
results = pd.concat([results, model_results], axis = 0, ignore_index = True)
results
 
SVM (rbf)
from sklearn.svm import SVC
classifier = SVC(random_state=0, kernel = 'rbf')
classifier.fit(X_train, y_train)
 
Predicting Test Set
y_pred = classifier.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
model_results = pd.DataFrame([['SVM (RBF)', acc, prec, rec, f1]], columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])
results = pd.concat([results, model_results], ignore_index = True)
results
 
Random Forest (n=100)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=0, n_estimators = 100, criterion='entropy')
classifier.fit(X_train, y_train)
 
Predicting Test Set
y_pred = classifier.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
results.loc[len(results.index)] = ['Random Forest (n=100)', acc, prec, rec, f1]
results
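The same four metrics are computed after every model, so the scoring step could be wrapped in a small helper. The sketch below is one possible way to do that; `score_row` is a hypothetical name, and the labels are made up so the numbers can be checked by hand.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

COLUMNS = ["Model", "Accuracy", "Precision", "Recall", "F1 Score"]


def score_row(name, y_true, y_hat):
    """Return a one-row DataFrame with the four metrics used in this post."""
    return pd.DataFrame(
        [[
            name,
            accuracy_score(y_true, y_hat),
            precision_score(y_true, y_hat),
            recall_score(y_true, y_hat),
            f1_score(y_true, y_hat),
        ]],
        columns=COLUMNS,
    )


# Made-up labels: 3 true positives, 1 false negative, 1 false positive, 1 true negative.
y_true_demo = [1, 0, 1, 1, 0, 1]
y_hat_demo = [1, 0, 0, 1, 1, 1]

demo_results = pd.concat(
    [
        score_row("Model A", y_true_demo, y_hat_demo),
        score_row("Perfect", y_true_demo, y_true_demo),
    ],
    ignore_index=True,
)
print(demo_results)
```

With a helper like this, each model section reduces to fitting, predicting, and one `pd.concat` call.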
 
K-fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train,cv = 10)

print(f'Random Forest Classifier Accuracy: {accuracies.mean():.2f} (+/-{accuracies.std() * 2:.2f})')
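When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. The sketch below makes the splitter explicit instead, which allows reproducible shuffling. It runs on synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, and the smaller forest is an assumption chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X_train / y_train (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=20, criterion="entropy", random_state=0)

# An explicit splitter shows the fold strategy and lets us shuffle reproducibly.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
demo_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=folds, scoring="accuracy")

# Mean accuracy with a rough +/- 2 standard deviation band across the folds.
print(f"{demo_scores.mean():.2f} (+/- {demo_scores.std() * 2:.2f}) over {len(demo_scores)} folds")
```

The spread across folds is a rough indication of how much the single test-set score above could vary with a different split.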

Parameter Tuning

Applying Grid Search

 
Round 1 : Entropy
parameters = {"max_depth": [3, None], "max_features": [1, 5, 10], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 5, 10], "bootstrap": [True, False], "criterion": ["entropy"]}

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(estimator = classifier,  # Make sure classifier points to the RF model
                           param_grid = parameters, scoring = "accuracy", cv = 10, n_jobs = -1)
t0 = time.time()
grid_search = grid_search.fit(X_train, y_train)
t1 = time.time()
print(f'Took {(t1-t0):.2f} seconds')
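Grid search time grows with the number of parameter combinations multiplied by the number of folds, so it can help to count the candidates before launching the search. A minimal sketch using scikit-learn's `ParameterGrid` on the Round 1 grid:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as Round 1 (Entropy) above.
parameters = {
    "max_depth": [3, None],
    "max_features": [1, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 5, 10],
    "bootstrap": [True, False],
    "criterion": ["entropy"],
}

cv = 10
n_candidates = len(ParameterGrid(parameters))  # 2 * 3 * 3 * 3 * 2 * 1
n_fits = n_candidates * cv

print(n_candidates, n_fits)  # 108 1080
```

`GridSearchCV` also refits the best candidate once on the full training set, because `refit` defaults to `True`. Round 2 narrows the grid around the Round 1 winner, which keeps the second search much cheaper.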
 
rf_best_accuracy = grid_search.best_score_
rf_best_parameters = grid_search.best_params_
rf_best_accuracy, rf_best_parameters
 
Round 2 : Entropy
parameters = {'max_depth': [None], 'max_features': [3, 5, 7], 'min_samples_split': [8, 10, 12], 'min_samples_leaf': [1, 2, 3], 'bootstrap':[True], 'criterion':['entropy']}
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(estimator = classifier, param_grid = parameters, scoring = 'accuracy', cv = 10, n_jobs = -1)

t0 = time.time()
grid_search = grid_search.fit(X_train, y_train)
t1 = time.time()
print(f'Took {(t1 - t0):.2f} seconds')
 
rf_best_accuracy = grid_search.best_score_
rf_best_parameters = grid_search.best_params_
rf_best_accuracy, rf_best_parameters
 
Predicting Test Set
y_pred = grid_search.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred)
results.loc[len(results.index)] = ['Random Forest (n=100, GSx2 + Entropy)', acc, prec, rec, f1]

results
 
Round 1: Gini
parameters = {'max_depth': [3, None], 'max_features': [1, 5, 10], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 5, 10], 'bootstrap': [True, False], 'criterion': ['gini']}

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(estimator = classifier, param_grid = parameters, scoring = 'accuracy', cv = 10, n_jobs = -1)

t0 = time.time()
grid_search = grid_search.fit(X_train, y_train)
t1 = time.time()
print(f'Took {(t1 - t0):.2f} seconds')
rf_best_accuracy = grid_search.best_score_
rf_best_parameters = grid_search.best_params_
rf_best_accuracy, rf_best_parameters
 
Round 2: Gini
parameters = {'max_depth': [None], 'max_features': [8, 10, 12], 'min_samples_split': [2, 3, 4], 'min_samples_leaf': [8, 10, 12], 'bootstrap': [True], 'criterion': ['gini']}

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(estimator = classifier, param_grid = parameters, scoring = 'accuracy', cv = 10, n_jobs = -1)

t0 = time.time()
grid_search = grid_search.fit(X_train, y_train)
t1 = time.time()
print(f'Took {(t1-t0):.2f} seconds')
 
rf_best_accuracy = grid_search.best_score_
rf_best_parameters = grid_search.best_params_
rf_best_accuracy, rf_best_parameters
 
Predicting Test Set
y_pred = grid_search.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred)
results.loc[len(results.index)] = ['Random Forest (n = 100, GSx2 + Gini)', acc, prec, rec, f1]
results
 
Extra: Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
cm
 
df_cm = pd.DataFrame(cm, index = (0, 1), columns = (0, 1))
df_cm
 
plt.figure(figsize = (10, 7))
sn.set(font_scale=1.4)
sn.heatmap(df_cm, annot=True, fmt='g')

print(f'Test Data Accuracy: {accuracy_score(y_test, y_pred):.4f}')
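The confusion matrix contains everything needed to recompute the metrics in the results table. In scikit-learn's layout, rows are true labels and columns are predicted labels, so a binary matrix flattens to `tn, fp, fn, tp`. The counts below are made up for illustration and are not the results of the model above.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = true class, columns = predicted class.
cm_demo = np.array([[50, 10],
                    [5, 35]])

tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()
precision = tp / (tp + fp)   # of the predicted e-signers, how many really signed
recall = tp / (tp + fn)      # of the real e-signers, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```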

End Of Model

 
Formatting Final Results
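# Keep only the test rows: training rows have no e_signed value after the join
final_results = pd.concat([y_test, users], axis = 1).dropna()
# Align predictions by index so they match the right entry_id regardless of row order
final_results['predictions'] = pd.Series(y_pred, index = y_test.index)
final_results = final_results[['entry_id', 'e_signed', 'predictions']]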

final_results
 
 
 

 
 
This study was conducted using the Udemy course <【한글자막】 Machine Learning 완벽 실습 : 6가지 실제 사례 직접 해결하기>.

프밍 스터디 is run in partnership with Udemy Korea.
 
 

 
 