Random Forest

In this project we use a decision tree and a random forest to classify loans and compare how the two models perform.

Data: Lending data from 2007-2010 from LendingClub (https://www.lendingclub.com/info/download-data.action). The goal is to classify and predict whether or not the borrower paid back their loan in full.

**Data head**

| | credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | debt_consolidation | 0.1189 | 829.10 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
| 1 | 1 | credit_card | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.000000 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
| 2 | 1 | debt_consolidation | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.000000 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
| 3 | 1 | debt_consolidation | 0.1008 | 162.34 | 11.350407 | 8.10 | 712 | 2699.958333 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
| 4 | 1 | credit_card | 0.1426 | 102.92 | 11.299732 | 14.97 | 667 | 4066.000000 | 4740 | 39.5 | 0 | 1 | 0 | 0 |
| 5 | 1 | credit_card | 0.0788 | 125.13 | 11.904968 | 16.98 | 727 | 6120.041667 | 50807 | 51.0 | 0 | 0 | 0 | 0 |
| 6 | 1 | debt_consolidation | 0.1496 | 194.02 | 10.714418 | 4.00 | 667 | 3180.041667 | 3839 | 76.8 | 0 | 0 | 1 | 1 |
| 7 | 1 | all_other | 0.1114 | 131.22 | 11.002100 | 11.08 | 722 | 5116.000000 | 24220 | 68.6 | 0 | 0 | 0 | 1 |
| 8 | 1 | home_improvement | 0.1134 | 87.19 | 11.407565 | 17.25 | 682 | 3989.000000 | 69909 | 51.1 | 1 | 0 | 0 | 0 |
| 9 | 1 | debt_consolidation | 0.1221 | 84.12 | 10.203592 | 10.00 | 707 | 2730.041667 | 5630 | 23.0 | 1 | 0 | 0 | 0 |

**Data summary (info and describe)**

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
credit.policy        9578 non-null int64
purpose              9578 non-null object
int.rate             9578 non-null float64
installment          9578 non-null float64
log.annual.inc       9578 non-null float64
dti                  9578 non-null float64
fico                 9578 non-null int64
days.with.cr.line    9578 non-null float64
revol.bal            9578 non-null int64
revol.util           9578 non-null float64
inq.last.6mths       9578 non-null int64
delinq.2yrs          9578 non-null int64
pub.rec              9578 non-null int64
not.fully.paid       9578 non-null int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB

| | credit.policy | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9.578000e+03 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 |
| mean | 0.804970 | 0.122640 | 319.089413 | 10.932117 | 12.606679 | 710.846314 | 4560.767197 | 1.691396e+04 | 46.799236 | 1.577469 | 0.163708 | 0.062122 | 0.160054 |
| std | 0.396245 | 0.026847 | 207.071301 | 0.614813 | 6.883970 | 37.970537 | 2496.930377 | 3.375619e+04 | 29.014417 | 2.200245 | 0.546215 | 0.262126 | 0.366676 |
| min | 0.000000 | 0.060000 | 15.670000 | 7.547502 | 0.000000 | 612.000000 | 178.958333 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 0.103900 | 163.770000 | 10.558414 | 7.212500 | 682.000000 | 2820.000000 | 3.187000e+03 | 22.600000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 1.000000 | 0.122100 | 268.950000 | 10.928884 | 12.665000 | 707.000000 | 4139.958333 | 8.596000e+03 | 46.300000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 1.000000 | 0.140700 | 432.762500 | 11.291293 | 17.950000 | 737.000000 | 5730.000000 | 1.824950e+04 | 70.900000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 0.216400 | 940.140000 | 14.528354 | 29.960000 | 827.000000 | 17639.958330 | 1.207359e+06 | 119.000000 | 33.000000 | 13.000000 | 5.000000 | 1.000000 |
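The head, info, and describe outputs above are the kind pandas produces. A minimal sketch of how they could be generated follows. It inlines the first three rows of the data head so that it runs without the real file; the variable names `csv_text` and `loans` are assumptions, and in practice the frame would come from `pd.read_csv` on the downloaded CSV.

```python
import io

import pandas as pd

# First three rows of the data head, inlined so the sketch is self-contained.
# With the real file this would be pd.read_csv(<path to the downloaded CSV>).
csv_text = """credit.policy,purpose,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
1,debt_consolidation,0.1189,829.10,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0
1,credit_card,0.1071,228.22,11.082143,14.29,707,2760.000000,33623,76.7,0,0,0,0
1,debt_consolidation,0.1357,366.86,10.373491,11.63,682,4710.000000,3511,25.6,1,0,0,0
"""
loans = pd.read_csv(io.StringIO(csv_text))

print(loans.head())      # first rows of the frame
loans.info()             # column dtypes and non-null counts
print(loans.describe())  # summary statistics for the numeric columns
```

Note that `describe()` skips the `purpose` column by default because it is not numeric, which is why the summary table has 13 columns rather than 14.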

Data Exploration

**Histogram of the two FICO distributions overlaid, one for each credit.policy outcome.**

[Figure: overlaid FICO histograms, one per credit.policy value; x-axis: fico]

**Similar figure, except this time split by the not.fully.paid column.**

[Figure: overlaid FICO histograms, one per not.fully.paid value; x-axis: fico]
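An overlaid histogram like the ones described above could be drawn roughly as follows. This is a sketch, not the original plotting code: it uses only the ten `fico` and `not.fully.paid` values from the data head, and the colors, bin count, and figure size are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# fico and not.fully.paid values copied from the ten rows of the data head
loans = pd.DataFrame({
    "fico": [737, 707, 682, 712, 667, 727, 667, 722, 682, 707],
    "not.fully.paid": [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
})

fig, ax = plt.subplots(figsize=(10, 6))
# Draw one semi-transparent histogram per outcome on the same axes
for outcome, color in [(1, "blue"), (0, "red")]:
    subset = loans.loc[loans["not.fully.paid"] == outcome, "fico"]
    ax.hist(subset, bins=30, alpha=0.5, color=color,
            label=f"not.fully.paid={outcome}")
ax.set_xlabel("fico")
ax.legend()
```

Swapping `not.fully.paid` for `credit.policy` in the filter would give the first figure, once the full data set is loaded.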

**Countplot using seaborn showing the counts of loans by purpose, with the color hue defined by not.fully.paid.**

[Figure: countplot of loans by purpose, hue = not.fully.paid]

**Trend between FICO score and interest rate using jointplot.**

[Figure: jointplot of FICO score against interest rate]

**Lmplots to see whether the trend differs between not.fully.paid and credit.policy.**

[Figure: lmplots of the FICO score and interest rate trend, split by not.fully.paid and credit.policy]

Categorical Features

The purpose column is categorical. We need to convert it into dummy variables so that sklearn can work with it.
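One common way to do this conversion is `pd.get_dummies`. The sketch below applies it to four rows taken from the data head (only three of the 14 columns are kept for brevity). The use of `drop_first=True` is an inference: the final columns contain no `purpose_all_other`, which is what dropping the first category would produce. The names `loans` and `final_data` are assumptions.

```python
import pandas as pd

# Rows 0, 1, 7 and 8 of the data head, restricted to three columns
loans = pd.DataFrame({
    "purpose": ["debt_consolidation", "credit_card",
                "all_other", "home_improvement"],
    "fico": [737, 707, 722, 682],
    "not.fully.paid": [0, 0, 1, 0],
})

# Replace purpose with one indicator column per category.
# drop_first=True drops the first category in sorted order ("all_other"),
# which avoids a redundant column.
final_data = pd.get_dummies(loans, columns=["purpose"], drop_first=True)
print(final_data.columns.tolist())
```

A row with zeros in every `purpose_*` column then corresponds to the dropped category, `all_other`.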

**Final data head**

| | credit.policy | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid | purpose_credit_card | purpose_debt_consolidation | purpose_educational | purpose_home_improvement | purpose_major_purchase | purpose_small_business |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.1189 | 829.10 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 52.1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.000000 | 33623 | 76.7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.000000 | 3511 | 25.6 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.1008 | 162.34 | 11.350407 | 8.10 | 712 | 2699.958333 | 33667 | 73.2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0.1426 | 102.92 | 11.299732 | 14.97 | 667 | 4066.000000 | 4740 | 39.5 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

Model Evaluation of Decision Tree

Classification report and confusion matrix from predictions:

             precision    recall  f1-score   support

          0       0.85      0.82      0.84      2431
          1       0.19      0.23      0.21       443

avg / total       0.75      0.73      0.74      2874


[[1990  441]
 [ 341  102]]
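The figures in the report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes (the scikit-learn convention). As a check, the sketch below recomputes them in plain Python. The helper name `per_class_metrics` is ours for illustration and is not part of any library.

```python
# Confusion matrix from the decision tree output above.
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = [[1990, 441],
      [341, 102]]


def per_class_metrics(cm, k):
    """Return (precision, recall, f1, support) for class k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: predicted as k
    actual_k = sum(cm[k])                    # row sum: truly k (support)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual_k


for k in (0, 1):
    p, r, f1, n = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} "
          f"f1={f1:.2f} support={n}")

# Accuracy is the diagonal over the total number of test samples
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(f"accuracy={accuracy:.2f}")
```

This reproduces the 0.85 / 0.82 / 0.84 row for class 0 and the 0.19 / 0.23 / 0.21 row for class 1. The recall of 0.23 for class 1 means the tree catches only 102 of the 443 loans that were not fully paid.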

Model Evaluation of Random Forest

Classification report and confusion matrix from predictions:

             precision    recall  f1-score   support

          0       0.85      1.00      0.92      2431
          1       0.57      0.03      0.05       443

avg / total       0.81      0.85      0.78      2874


[[2422    9]
 [ 431   12]]
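For reference, reports like the two above can be produced with scikit-learn along the following lines. This is a sketch under stated assumptions, not the original notebook code: the data is synthetic (generated with `make_classification` to have 18 features and roughly 16% positives, similar to the loan data), and `n_estimators=100` and `random_state=101` are placeholder settings. The 30% test split is an inference from the support of 2874 out of 9578 rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan features (NOT the real data):
# 18 features and about 16% positives, like not.fully.paid.
X, y = make_classification(n_samples=2000, n_features=18, n_informative=6,
                           weights=[0.84, 0.16], random_state=101)

# Hold out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101)

models = {
    "decision tree": DecisionTreeClassifier(random_state=101),
    "random forest": RandomForestClassifier(n_estimators=100,
                                            random_state=101),
}
matrices = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    matrices[name] = confusion_matrix(y_test, predictions)
    print(name)
    print(classification_report(y_test, predictions))
    print(matrices[name])
```

With the real data, `X` would be every column of the final frame except `not.fully.paid`, and `y` would be `not.fully.paid`.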
