Bank Marketing Classification Project
Github Link: https://github.com/mlucio2000/Bank_Marketing_Classification_Project
For this project, I will be using machine learning to predict whether clients will subscribe to a term deposit. The data comes from direct marketing campaigns of a Portuguese banking institution. The campaigns were conducted by phone, and often more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
Source: Moro, S., Rita, P., and Cortez, P. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.
The data is a mix of categorical and numeric features. Ultimately, we will need to encode all categorical and binary features and scale the numeric features before using a Naive Bayes model. Below is a table I built to further describe the dataset we are working with.
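For reference, here is a minimal sketch of loading the data with pandas. The file name is an assumption about where the CSV was saved locally; the semicolon delimiter and the 'y' target column follow the UCI distribution of this dataset.

```python
import pandas as pd

# The UCI distribution ships the data as a semicolon-delimited CSV
# (the local file name "bank-full.csv" is an assumption).
df = pd.read_csv("bank-full.csv", sep=";")

# Separate the target ('y': yes/no term-deposit subscription) from the features.
X = df.drop(columns=["y"])
y = df["y"].map({"yes": 1, "no": 0})

print(X.dtypes)          # mix of object (categorical) and integer (numeric) columns
print(y.value_counts())  # class counts, showing the imbalance toward 'no'
```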
Data Pre-Processing
When working with machine learning models, the most important prerequisite for an effective model is a clean, processed dataset. For this dataset, we began by encoding all categorical and binary values with One-Hot Encoding and scaling the numeric values. After splitting the data into training and test sets, we had to address the fact that the dataset is heavily imbalanced; I used SMOTE for this. SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique that uses k-nearest neighbours to create "synthetic" samples for the minority class, which in this case is the "yes" results. A sketch of this pipeline is shown below.
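The sketch below continues from the loading snippet above and assumes scikit-learn and imbalanced-learn are installed. The column lists follow the UCI data dictionary, and the split ratio and random seeds are illustrative rather than the project's exact settings.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Feature names as listed in the UCI data dictionary for this dataset.
categorical_cols = ["job", "marital", "education", "default", "housing",
                    "loan", "contact", "month", "poutcome"]
numeric_cols = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]

# One-hot encode categorical/binary columns and scale numeric columns.
# sparse_output=False (scikit-learn >= 1.2) keeps dense arrays so the
# downstream Naive Bayes model can consume them directly.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])

# Split first, then fit the transformers on the training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)

# SMOTE synthesizes new minority-class ('yes') samples using k-nearest neighbours,
# applied only to the training split.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train_prep, y_train)
```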
Machine Learning Models
We will be comparing two machine learning models on this dataset: Naive Bayes and a neural network.
The first is Naive Bayes; here are the results on the training and test data:
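As a sketch, a Gaussian Naive Bayes classifier can be trained on the resampled data and evaluated on both splits. Gaussian Naive Bayes is an assumption here (the write-up does not name the exact variant), and the code continues from the preprocessing sketch above.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Fit Naive Bayes on the SMOTE-balanced training data.
nb = GaussianNB()
nb.fit(X_train_bal, y_train_bal)

# Per-class precision/recall/F1 on both splits shows whether SMOTE
# helped recall on the minority ('yes') class.
print(classification_report(y_train_bal, nb.predict(X_train_bal)))
print(classification_report(y_test, nb.predict(X_test_prep)))
```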
I then built the neural network model in Python with Keras and compared it to the Naive Bayes model.
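A minimal sketch of such a network with Keras is shown below; the layer sizes, optimizer, and epoch count are illustrative choices, not the project's exact configuration, and the training arrays come from the preprocessing sketch above.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Simple feed-forward network for binary classification (architecture is illustrative).
model = keras.Sequential([
    keras.Input(shape=(X_train_bal.shape[1],)),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Train on the SMOTE-balanced training set, holding out part of it for validation.
history = model.fit(X_train_bal, y_train_bal,
                    validation_split=0.2,
                    epochs=20, batch_size=64, verbose=0)

# Evaluate on the untouched test set for a fair comparison with Naive Bayes.
loss, acc = model.evaluate(X_test_prep, y_test, verbose=0)
print(f"Test accuracy: {acc:.3f}")
```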
Final Thoughts & Conclusion
Naive Bayes was accurate overall, but it still struggled even with SMOTE applied. The clear winner is the neural network model, which produced considerably stronger results. This was a good project for demonstrating how to handle a heavily imbalanced dataset.