European Bank Dataset

Leonid Shpaner and Jose Luis Estrada

The data relates to the direct marketing campaigns of a Portuguese banking institution. The campaigns were conducted by phone, and more than one contact with the same client was often required to assess whether the product (a bank term deposit) would be subscribed. The dataset is ordered by date (from May 2008 to November 2010).

1. Data Importing and Pre-processing

In [1]:
# First we import the necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
%matplotlib inline

Importing the dataset and describing characteristics such as dimensions, data types, file types, and import methods used

The .csv file, which is semicolon-separated, is loaded into a new dataframe, as required:

In [2]:
eurobank = pd.read_csv('bank_marketing.csv', sep=';')
In [3]:
eurobank.head()
Out[3]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 58.0 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44.0 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33.0 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47.0 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33.0 unknown single unknown no 1 no no NaN 5 may 198 1 -1 0 unknown no

After examining the first five rows of data as a visual point of reference, the shape of the dataset is inspected: precisely 45,211 rows and 17 columns.

In [4]:
print(eurobank.shape)
(45211, 17)

Further examination shows that most of the columns (variables) are objects, as reported by the datatypes call on the "eurobank" data frame (eurobank.dtypes). These objects are categorical variables. Six of the seventeen columns are numerical integers (int64), and one other numerical variable, age, is stored as a floating-point number (float64).

In [5]:
eurobank.dtypes
Out[5]:
age          float64
job           object
marital       object
education     object
default       object
balance        int64
housing       object
loan          object
contact       object
day            int64
month         object
duration       int64
campaign       int64
pdays          int64
previous       int64
poutcome      object
deposit       object
dtype: object

Cleaning, wrangling, and handling missing data

Next, a determination is made on which columns have missing values by running the following code:

In [6]:
eurobank.isnull().sum()
Out[6]:
age          1339
job             0
marital         0
education       0
default      1306
balance         0
housing         0
loan            0
contact      1383
day             0
month           0
duration        0
campaign        0
pdays           0
previous        0
poutcome        0
deposit         0
dtype: int64

To handle missing values, we drop rows with NAs in the 'default' and 'contact' variables and fill the missing numerical values for age with the median age, smoothing the data as shown below:

In [7]:
eurobank.dropna(subset=['default', 'contact'], inplace=True)                  
eurobank['age'].fillna(int(eurobank['age'].median()), inplace=True)
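As a quick verification (an added check, not part of the original output), re-running the null count should now report zero missing values across all columns:

# Verify that no missing values remain after dropping and imputing
print(eurobank.isnull().sum().sum())  # expected: 0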

Reducing redundant data and performing need-based discretization

Taking a further look, we determine that several variables are of no consequence to the marketing campaign's efficacy. For example, the housing and loan variables both describe binary loan categories and are redundant because they essentially report the same metric; whether a person's loan is tied to housing has no bearing on this particular marketing outreach, and there is enough information to move forward with establishing relationships among the remaining variables. Similarly, the contact variable is omitted because the contact method (cellular, telephone, or unknown) provides little relevant information. Day and month might present some interesting findings; however, since this dataset spans 2008 - 2010 without recording the year of each contact, these parameters are not well-defined enough to use, so we omit them from the model. Lastly, 'pdays' and 'previous' track prior-contact history (days since the last contact and the number of previous contacts). This exercise aims to establish the outcome of the marketing campaign via demographic representation, outreach, and subscription to the product, so the following relevant columns are included in a new subset to help achieve this goal.

In [8]:
cols_to_include = ['age', 'job','marital', 'education', 'default', 'balance', 'duration', 'campaign', 'poutcome','deposit']

eurobank2 = eurobank[cols_to_include]
eurobank2.head()
Out[8]:
age job marital education default balance duration campaign poutcome deposit
0 58.0 management married tertiary no 2143 261 1 unknown no
1 44.0 technician single secondary no 29 151 1 unknown no
2 33.0 entrepreneur married secondary no 2 76 1 unknown no
3 47.0 blue-collar married unknown no 1506 92 1 unknown no
5 35.0 management married tertiary no 231 139 1 unknown no

In the accompanying table, deposit outcome is examined by age group. Whereas clients between the ages of 35-40 made the most deposits (962), their subscription rate is only 11%. The highest subscription rate in any age group is a tie between 18-20 year olds and those 75+ years old, both at 42%. Given the small number of customers in these age brackets, these rates could very well change; even so, targeted marketing toward those age groups may prove beneficial.

In [9]:
print("\033[1m"+'Deposit Outcome by Age (Maximum Values):'+"\033[1m")

def deposit_by_age():
    pd.options.mode.chained_assignment = None  # default='warn'
    bins = [18, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
    names = ['18-20', '20-25', '25-30', '30-35', '35-40', \
            '40-45', '45-50', '50-55', '55-60', '60-65', '65-70', '70-75', '75+']

    d = dict(enumerate(names, 1))
    eurobank2['Age Range'] = np.vectorize(d.get)(np.digitize(eurobank2['age'],bins))

    deposit_yes = eurobank2.loc[eurobank2.deposit == 'yes'].groupby(['Age Range'])[['deposit']].count()
    deposit_yes.rename(columns={'deposit':'Deposit'}, inplace=True)

    deposit_no = eurobank2.loc[eurobank2.deposit == 'no'].groupby(['Age Range'])[['deposit']].count()
    deposit_no.rename(columns={'deposit':'No Deposit'}, inplace=True)

    merged_df = pd.concat([deposit_yes, deposit_no], axis = 1)
    merged_df['Deposit'] = merged_df['Deposit'].fillna(0)
    merged_df['No Deposit'] = merged_df['No Deposit'].fillna(0)
    max_values = merged_df.max()  # peak Deposit / No Deposit counts across age groups
    print(max_values)
    merged_df.loc['Total'] = merged_df.sum(numeric_only=True, axis=0)
    merged_df['% of Deposits'] = round((merged_df['Deposit'] / (merged_df['Deposit'] \
                                                                + merged_df['No Deposit']))* 100, 2)
    return merged_df.style.format("{:,.0f}")

deposit_by_age()
Deposit Outcome by Age (Maximum Values):
Deposit        962
No Deposit    7945
dtype: int64
Out[9]:
Deposit No Deposit % of Deposits
Age Range
18-20 18 25 42
20-25 179 527 25
25-30 647 3,412 16
30-35 961 7,945 11
35-40 962 7,941 11
40-45 492 5,186 9
45-50 483 4,520 10
50-55 365 3,709 9
55-60 345 3,230 10
60-65 237 656 27
65-70 92 141 39
70-75 92 137 40
75+ 112 155 42
Total 4,985 37,584 12
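As an aside (an illustrative alternative, not part of the original analysis), the same need-based discretization can be sketched with pandas' own pd.cut, which here is equivalent to the np.digitize approach used above:

# Equivalent binning with pd.cut: left-closed intervals, with 75+ capped by np.inf
bins = [18, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, np.inf]
names = ['18-20', '20-25', '25-30', '30-35', '35-40', '40-45',
         '45-50', '50-55', '55-60', '60-65', '65-70', '70-75', '75+']
eurobank2['Age Range'] = pd.cut(eurobank2['age'], bins=bins, labels=names, right=False)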

Transforming data appropriately using techniques such as aggregation, normalization, and feature construction

The summary statistics table broken down by job type is presented below. While there are several ways to aggregate this data, looking at balances by job category sheds light on summary statistics like the mean, median, and standard deviation. For example, retired clients exhibit the highest mean bank balance of 1,963.06. The next highest mean balance, 1,794.90, belongs to individuals who did not specify a career profile (unknown), and the third highest, 1,768.30, to management professionals. Given that retirement income, vis-à-vis savings, generally produces larger bank balances, this makes sense contextually; and since management professionals are, on average, higher wage earners than hourly workers, their larger balances make sense for this sample as well. The interesting finding is that blue-collar workers recorded lower average bank balances (1,079.62) than the unemployed (1,499.55). The job categories rank in the same order by median bank balance, but the medians themselves are all below 1,000.00.

The highest standard deviation, 4,420.12, belongs to retired individuals, indicating that their balances are dispersed over a wider range of values further from the mean. This makes sense because retirement income, generally speaking, casts a wide net over a heterogeneous group of individuals. A similar assertion can be made about entrepreneurs, another broad category, whose standard deviation is 4,242.79.

In [10]:
print("\033[1m"+'Bank Balance Summary Statistics by Job:'+"\033[1m")

def summary_by_job():
    pd.options.display.float_format = '{:,.2f}'.format
    cols_to_include = ['age', 'job','marital', 'education', 'default', 'balance', 'duration',\
                       'campaign', 'poutcome','deposit']

    new = eurobank[cols_to_include]
    new = new.groupby('job')['balance'].agg(["mean", "median", "std", "min", "max"])
    new.loc['Total'] = new.sum(numeric_only=True, axis=0)
    column_rename = {'mean': 'Mean', 'median': 'Median','std': 'Standard Deviation',\
                     'min':'Minimum','max': 'Maximum'}
    dfsummary = new.rename(columns = column_rename)
    return dfsummary

summary_by_job()
Bank Balance Summary Statistics by Job:
Out[10]:
Mean Median Standard Deviation Minimum Maximum
job
admin. 1,141.83 401.00 2,682.21 -1,601.00 64,343.00
blue-collar 1,079.62 390.00 2,249.38 -8,019.00 66,653.00
entrepreneur 1,554.25 358.50 4,242.79 -2,082.00 59,649.00
housemaid 1,393.90 399.00 3,028.83 -1,941.00 45,141.00
management 1,768.30 572.00 3,846.95 -6,847.00 102,127.00
retired 1,963.06 771.00 4,420.12 -1,598.00 81,204.00
self-employed 1,646.29 526.00 3,659.48 -3,313.00 52,587.00
services 1,002.60 339.00 2,186.20 -2,122.00 57,435.00
student 1,386.08 492.00 2,468.55 -679.00 24,025.00
technician 1,250.14 422.00 2,530.20 -2,827.00 45,248.00
unemployed 1,499.55 532.50 3,124.95 -1,270.00 44,134.00
unknown 1,794.90 690.00 3,011.71 -295.00 19,706.00
Total 17,480.51 5,893.00 37,451.37 -32,594.00 662,252.00

The summary statistics of bank balances by marital status show that married individuals have the highest average bank balance, while single individuals have the second highest. Divorced clients show the lowest average balance, perhaps in part due to income loss from legal filings, paperwork, and attorney fees. The ranking (highest to lowest) is the same for median balances, though the medians themselves are lower. Whereas married clients show the highest average balance, the standard deviation is also the highest in this category.

In [11]:
print("\033[1m"+'Bank Balance Summary Statistics by Marital Status:'+"\033[1m")

def summary_by_marital():
    pd.options.display.float_format = '{:,.2f}'.format
    new2 = eurobank2.groupby('marital')['balance'].agg(["mean", "median", "std", "min", "max"])
    new2.loc['Total'] = new2.sum(numeric_only=True, axis=0)
    column_rename = {'mean': 'Mean', 'median': 'Median','std': 'Standard Deviation',\
                         'min':'Minimum','max': 'Maximum'}
    dfsummary2 = new2.rename(columns = column_rename)
    return dfsummary2
    
summary_by_marital()
Bank Balance Summary Statistics by Marital Status:
Out[11]:
Mean Median Standard Deviation Minimum Maximum
marital
divorced 1,179.32 353.00 2,943.52 -4,057.00 66,721.00
married 1,427.87 477.00 3,160.23 -6,847.00 98,417.00
single 1,302.96 441.00 2,886.05 -8,019.00 102,127.00
Total 3,910.15 1,271.00 8,989.79 -18,923.00 267,265.00

The table below illustrates deposits by marital status. In examining this data, one must consider what it means from a marketing perspective. Each campaign has its own premise for onboarding new clients to higher-tier services and offerings, which is why large conglomerate banks make promotional offers via sales calls, pamphlets, and other promotional materials. In this case, the outcome of a successful campaign is a customer making a deposit.

Furthermore, it becomes apparent that the highest count of deposits comes from married clients (2,601), but the largest share of non-deposits (23,050) also comes from married clients; only 10% of married clients subscribed, the lowest rate of any marital status category. Divorced customers, the smallest of the marital categories, came in second with a 12% subscription rate. Single clients subscribed at the highest rate, 15%, making 1,794 deposits out of only 12,017 single clients. All in all, 4,985 of all 42,569 clients (12%) made a deposit, thereby subscribing to this marketing campaign.

In [12]:
print("\033[1m"+'Deposit Outcome by Marital Status:'+"\033[1m")

def deposit_by_marital():
    marital_status_yes = eurobank2.loc[eurobank2.deposit == 'yes'].groupby\
    (['marital'])[['deposit']].count()
    marital_status_yes.rename(columns={'deposit':'Deposit'}, inplace=True)
    marital_status_no = eurobank2.loc[eurobank2.deposit == 'no'].groupby\
    (['marital'])[['deposit']].count()
    marital_status_no.rename(columns={'deposit':'No Deposit'}, inplace=True)
    merged_df = pd.concat([marital_status_yes, marital_status_no], axis = 1)
    merged_df.loc['Total'] = merged_df.sum(numeric_only=True, axis=0)
    merged_df['# of Clients'] = merged_df.sum(axis=1)
    merged_df['% Deposited'] = round((merged_df['Deposit'] / (merged_df['Deposit'] \
                                                              + merged_df['No Deposit']))* 100, 2)
    return merged_df.style.format("{:,.0f}")

deposit_by_marital()
Deposit Outcome by Marital Status:
Out[12]:
Deposit No Deposit # of Clients % Deposited
marital
divorced 590 4,311 4,901 12
married 2,601 23,050 25,651 10
single 1,794 10,223 12,017 15
Total 4,985 37,584 42,569 12

2. Data Analysis and Visualization

Identifying categorical, ordinal, and numerical variables within data

Upon further examination of the dataset in its entirety, it is apparent that each variable (column) has its own characteristics in terms of type. Age, balance, duration, campaign, pdays, and previous are numerical variables, whereas job, marital, education, default, housing, loan, contact, poutcome, and deposit are categorical. Month, though categorical at first glance, is truly an ordinal variable, as it follows a strict order. To aid in these determinations, the 'eurobank.dtypes' and 'eurobank.head()' calls were already run in the prior section; the classification can also be sketched programmatically, as shown below.
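As a minimal sketch (an addition for illustration), pandas' select_dtypes separates the numerical columns from the object (categorical) ones; month would then be promoted from the categorical list to ordinal by hand:

# Separate numerical from object (categorical) columns
numerical_cols = eurobank.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = eurobank.select_dtypes(include=['object']).columns.tolist()
print('Numerical:  ', numerical_cols)
print('Categorical:', categorical_cols)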

The following code block examines the summary statistics as measures of centrality for the selected variables. The average age of the bank's clients is approximately 41 years old, which is only about two years older than the median, so there is no remarkable difference. The minimum age is 18, which makes sense because that is close to the minimum age when banking clients can open new accounts. Not surprisingly, the maximum age of the clientele is 95 years old.

In [13]:
summary_stats = round(eurobank2.describe().T,2)
summary_stats.drop(['count'], axis=1, inplace=True)
summary_stats.rename (columns={'mean':'Mean', 'std': 'Standard Deviation', 'min': 'Minimum', \
                               '25%':'Q1', '50%': 'Median', '75%':'Q3', 'max':'Maximum'}, inplace=True)
summary_stats
Out[13]:
Mean Standard Deviation Minimum Q1 Median Q3 Maximum
age 40.86 10.44 18.00 33.00 39.00 48.00 95.00
balance 1,363.99 3,061.66 -8,019.00 72.00 450.00 1,430.00 102,127.00
duration 258.32 257.19 0.00 103.00 180.00 319.00 3,881.00
campaign 2.76 3.09 1.00 1.00 2.00 3.00 63.00

Providing measures of centrality and distribution with visualizations

We examine the age distribution visually as a boxplot; the distribution is approximately normal but skewed to the right.

In [14]:
# Boxplot of age as another way of showing distribution
fig = plt.figure(figsize = (8,4))
plt.title ('Age Distribution of Clients With Outliers')
plt.xlabel('Client Age')
plt.ylabel('Clients')

sns.boxplot(data=eurobank2['age'], palette="PuBu", orient='h')
plt.show()

# Computing IQR
Q1 = eurobank2['age'].quantile(0.25)
Q3 = eurobank2['age'].quantile(0.75)
IQR = Q3-Q1

# Computing Summary Statistics
mean_1 = round(eurobank2['age'].mean(),2)
std_1 = round(eurobank2['age'].std(),2)
median_1 = round(eurobank2['age'].median(),2)
print('The first quartile is %s. '%Q1)
print('The third quartile is %s. '%Q3)
print('The IQR is %s.'%IQR)
print('The mean is %s.'%mean_1)
print('The standard deviation is %s.'%std_1)
print('The median is %s.'%median_1)
The first quartile is 33.0. 
The third quartile is 48.0. 
The IQR is 15.0.
The mean is 40.86.
The standard deviation is 10.44.
The median is 39.0.

The outliers from roughly age 70 onward suggest that clients past retirement age are few and far between. These outliers are subsetted into a new data frame to compare the mean age with and without them.

In [15]:
lower_bound = Q1-(1.5*IQR)
upper_bound = Q3+(1.5*IQR)

outliers_eurobank2 = eurobank2.loc[(eurobank2['age'] < lower_bound) | (eurobank2['age'] > upper_bound)]
average_age = round(eurobank2.age.mean(), 2)
print('The average of age in the original data frame is %s.' % average_age)

average_age_outliers_eurobank2 = round(outliers_eurobank2.age.mean(),2)
print('The average of age of outliers is %s.' % average_age_outliers_eurobank2)
The average of age in the original data frame is 40.86.
The average of age of outliers is 76.8.

The average age of the outlier clients in this dataset is 76.8 years. In this analysis, while we are generally interested in measures sensitive to outliers (the mean and standard deviation), we are equally interested in measures resistant to them (the median) as a basis of comparison.

Removing the outliers from the age variable changes little of observable significance, as shown below. For example, the average age is reduced by only 0.37 years and the standard deviation by 0.62, with no difference in the median age with or without the outliers.

In [16]:
# Boxplot of age without outliers as another way of showing distribution
fig = plt.figure(figsize = (8,4))
plt.title ('Age Distribution of Clients Without Outliers')
plt.xlabel('Client Age')
plt.ylabel('Clients')
sns.boxplot(data=eurobank2['age'], palette="PuBu", orient='h', showfliers = False)
plt.show()

filter = (eurobank2['age'] >= Q1 - 1.5 * IQR) & (eurobank2['age'] <= Q3 + 1.5 *IQR)
summary_stats_2 = round(eurobank2.loc[filter].describe().T,2)
summary_stats_2.drop(['count'], axis=1, inplace=True)
summary_stats_2.rename (columns={'mean':'Mean', 'std': 'Standard Deviation', 'min': 'Minimum', \
                                 '25%':'Q1', '50%': 'Median', '75%':'Q3', 'max':'Maximum'}, inplace=True)
summary_stats_2
Out[16]:
Mean Standard Deviation Minimum Q1 Median Q3 Maximum
age 40.49 9.82 18.00 33.00 39.00 48.00 70.00
balance 1,347.12 2,996.60 -8,019.00 71.00 444.00 1,411.00 102,127.00
duration 257.48 256.99 0.00 103.00 179.00 318.00 3,881.00
campaign 2.77 3.10 1.00 1.00 2.00 3.00 63.00
In [17]:
# Computing Means, Standard Deviations, and Observing Differences
mean_2 = round(eurobank2.loc[filter]['age'].mean(),2)
mean_difference = round((mean_1 - mean_2),2)
std_2 = round(eurobank2.loc[filter]['age'].std(),2)
std_difference = round((std_1 - std_2),2)
median_2 = round(eurobank2.loc[filter]['age'].median(),2)
median_difference = round((median_1 - median_2),2)

print('The mean age without outliers is %s.'%mean_2)
print('The standard deviation of the age variable without outliers is %s.'%std_2)
print('The median age without outliers is %s.'%median_2)

print("\n")

print('The difference between the mean with outliers and the \
mean without outliers is %s.'%mean_difference)
print('The difference between the standard deviation with outliers and the \
standard deviation without outliers is %s.'%std_difference)
print('The difference between the median with outliers and the \
median without outliers is %s.'%median_difference)
The mean age without outliers is 40.49.
The standard deviation of the age variable without outliers is 9.82.
The median age without outliers is 39.0.


The difference between the mean with outliers and the mean without outliers is 0.37.
The difference between the standard deviation with outliers and the standard deviation without outliers is 0.62.
The difference between the median with outliers and the median without outliers is 0.0.

A histogram showing the distribution of the age variable is plotted below.

In [18]:
# Plotting a histogram to show distribution
fig = plt.figure(figsize = (8,8))
plt.hist(eurobank['age'], bins=10, color='steelblue', alpha=0.9, rwidth=.97)
plt.title ('Age Distribution of Clients')
plt.xlabel('Age') 
plt.ylabel('Count')
plt.show()

# mode = eurobank2.mode('age', numeric_only=True)
# print('The mode is %s.'%mode)

print('The information below shows the mode for the age variable:')
eurobank.loc[:,'age'].mode()
The information below shows the mode for the age variable:
Out[18]:
0   39.00
dtype: float64

There are more clients aged 39 than any other age. While age is one relevant variable in this dataset, it is worthwhile to examine the distribution of account balances as follows:

In [19]:
# Plotting a histogram to show distribution
fig = plt.figure(figsize = (8,8))
plt.hist(eurobank2['balance'], bins=10, color='steelblue', alpha=0.9, rwidth=0.7)
plt.title ('Balance Distribution of Clients')
plt.xlabel('Balance') 
plt.ylabel('Count')
plt.show()

print('The following values represent the number of data points between each bin:')

counts, bin_edges = np.histogram(eurobank2['balance'], bins=10)
print(counts)
The following values represent the number of data points between each bin:
[37269  4927   281    54    20     7     6     1     2     2]

The overwhelming majority of clients fall into the lowest balance bin. Because the ten equal-width bins span -8,019.00 to 102,127.00, that first bin alone covers roughly -8,019 to 3,000, so most balances cluster near zero; this is consistent with the earlier summary table, which showed a median balance of 450.00 and a minimum of -8,019.00. While there appear to be only four bins in this figure, the bin counts clearly show otherwise. Effectively all of the dataset sits below $40,000, and most balances are small. This can potentially yield some additional insights in terms of the marketing campaign's direction and its predictability.
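As a quick check (assuming the same ten equal-width bins as above), printing the bin edges makes the skew explicit, since the first edge pair alone brackets most of the clients:

# The bin edges from np.histogram above; the first bin spans roughly -8,019 to 3,000
print(np.round(bin_edges, 2))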

In [20]:
campaign_age = eurobank2.groupby('age', as_index=False)['campaign'].sum()
fig = plt.figure(figsize = (15,8))
plt.barh(campaign_age['age'], campaign_age['campaign'])
plt.xlabel("Campaign")
plt.ylabel("Age")
plt.title("Marketing Campaigns by Age")
plt.show()

Performing exploratory analysis in combination with visualization techniques to discover patterns and features of interest

Now, to prepare the dataset for a regression model, we assign numerical values to the categorical variables deposit and poutcome (prior campaign outcome) as follows:

In [21]:
eurobank2['deposit'] = eurobank2.deposit.map(dict(yes=1, no=0))
eurobank2['poutcome'] = eurobank2.poutcome.map(dict(unknown=0, other=1, failure=2, success=3))

Next, we examine correlations visually:

In [22]:
sns.set_context("paper", font_scale=1)
plot = sns.pairplot(eurobank2, diag_kind='hist', dropna = True, palette='Blues')
plot.fig.suptitle("Pairplot of European Bank Marketing Dataset", y = 1.0001) 
plt.show()

Diagnosing for correlations between variables and determining independent and dependent variables

At first glance, it does not appear that any of these variables have a noticeable degree of correlation. Therefore, a correlation heatmap with numeric indices is produced below:

In [23]:
# Examining the correlations between variables using a heat map
heatmap = eurobank2[['age','balance','duration','campaign','poutcome','deposit']]
sns.set_context("paper", rc={"axes.labelsize":12}, font_scale = 1.3)
correlations = heatmap.corr()
plt.figure(figsize = (9,9))
ax = sns.heatmap(correlations[['age','balance','duration','campaign','poutcome', 'deposit']], \
                 annot = True, square = True, cbar_kws={'label': 'Correlation Index'})
ax.set_title('Heatmap of Correlations in European Bank Marketing Campaign')
ax.set_ylim(len(correlations), -0.5)
plt.show()

From here we can see that the duration of the marketing call is moderately positively correlated with deposit (0.40). The second highest correlation, between the prior marketing campaign outcome (poutcome) and deposit (0.22), is a low positive correlation. Lastly, a low negative correlation (-0.11) is seen between campaign and poutcome.

The following scatterplot models the (at best moderate) relationship between deposit and duration: longer contact attempts are associated with a higher rate of deposits (subscriptions) in the marketing campaign.

In [24]:
sns.regplot(data = eurobank2, x = 'duration', y = 'deposit')
plt.title('Deposit vs. Duration of Marketing Call')
plt.show()
print('The correlation coefficient r is equal to %s.' % \
      round(np.corrcoef(eurobank2.duration,eurobank2.deposit)[0,1], 2))
The correlation coefficient r is equal to 0.4.

3. Data Analytics

Determining the need for a supervised or unsupervised learning method and identifying dependent and independent variables

In considering linear regression as a potential model for this dataset, a determination is made that such a model holds little to no merit, given the weak variable associations (relationships). The eurobank dataset therefore requires a logistic regression model to make predictions from the current variables. This is an example of supervised learning.

Since the marketing strategy focused on who opened an account and made an initial deposit, the analysis requires changing the outputs from yes and no to 1 and 0. It also needs dummy variables so that the model can use the categorical data. After this preprocessing, the data is split into train and test sets to measure accuracy. Not all variables will be selected for the final model, so the analysis uses the Recursive Feature Elimination (RFE) method to reduce the number of variables.

The final portion of the analysis presents the fitted model, its accuracy, and a confusion matrix summarizing overall performance.

In [25]:
# Change yes and no answers in the deposit variable (column) to 0 (no) and 1 (yes) for logistic regression
eurobank['deposit'] = eurobank.deposit.map(dict(yes=1, no=0))
In [26]:
# Create dummy variables for the categorical variables
categorical_variables = ['job','marital','education','default','housing','loan','contact','month','poutcome']
for var in categorical_variables:
    dummies = pd.get_dummies(eurobank[var], prefix=var)
    eurobank = eurobank.join(dummies)
data_vars = eurobank.columns.values.tolist()
keep = [i for i in data_vars if i not in categorical_variables]
In [27]:
# Print all variables including dummy variables
data_final = eurobank[keep]
data_final.columns.values
Out[27]:
array(['age', 'balance', 'day', 'duration', 'campaign', 'pdays',
       'previous', 'deposit', 'job_admin.', 'job_blue-collar',
       'job_entrepreneur', 'job_housemaid', 'job_management',
       'job_retired', 'job_self-employed', 'job_services', 'job_student',
       'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'education_primary', 'education_secondary', 'education_tertiary',
       'education_unknown', 'default_no', 'default_yes', 'housing_no',
       'housing_yes', 'loan_no', 'loan_yes', 'contact_cellular',
       'contact_telephone', 'contact_unknown', 'month_apr', 'month_aug',
       'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'poutcome_failure', 'poutcome_other', 'poutcome_success',
       'poutcome_unknown'], dtype=object)
In [28]:
# Named dependent variable to y (deposit), while the rest are independent (X)
X = data_final.loc[:, data_final.columns != 'deposit']
y = data_final.loc[:, data_final.columns == 'deposit']
# Split the data into 80% train and 20% test, then oversample the minority
# class (deposit = 1) in the training set with SMOTE to balance the classes
os = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
columns = X_train.columns
os_data_X, os_data_y = os.fit_sample(X_train, y_train)  # fit_resample() in newer imblearn
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=['deposit'])
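As an added sanity check (not part of the original output), the oversampled training set should now contain equal counts of both deposit classes:

# SMOTE balances the training classes; both counts should be equal
print(os_data_y['deposit'].value_counts())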

Recursive Feature Elimination (RFE) will identify which variables are significant for our model. It serves as a first filter, and a logistic regression analysis serves as a second filter to choose the most significant variables.

In [29]:
import warnings
warnings.filterwarnings("ignore")
data_final_vars=data_final.columns.values.tolist()
y=['deposit']
X=[i for i in data_final_vars if i not in y]
# Use Recursive Feature Elimination (RFE) to select significant features for our model
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=20)
rfe = rfe.fit(os_data_X, os_data_y.values.ravel())
print(rfe.support_)
print(rfe.ranking_)
[False False False False False False False  True  True  True  True  True
 False  True  True  True  True  True  True  True  True  True  True  True
  True  True False False  True  True False False False False False False
 False False False False False False False False False False False False
 False False False]
[27 32 28 29 25 31 26  1  1  1  1  1  2  1  1  1  1  1  1  1  1  1  1  1
  1  1 24 23  1  1  7  6  5  4  3 17 12 21 15 11 14 18 22 16 13 30 20  9
 10 19  8]
In [30]:
# Used variables that were given by the RFE output
cols = ['job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid', 'job_retired', \
        'job_self-employed', 'job_student', 'job_unemployed', 'job_unknown', 'marital_divorced', \
        'marital_married', 'marital_single', 'education_primary', 'education_secondary', 'education_tertiary', \
        'default_yes', 'housing_no', 'housing_yes', 'loan_no']
X = os_data_X[cols]
y = os_data_y['deposit']

Approaching the end result, we have a sound logistic model that maps job type, marital status, educational background, and housing strata onto an effective targeted marketing campaign. Since all of the p-values associated with these variables are statistically significant (less than the alpha value of 0.05), they are all good candidates for measuring viability in marketing outreach.

In [31]:
# No features are removed since all are significant (<0.05)
logit_model = sm.Logit(y,X)
result = logit_model.fit()
print(result.summary2())
Optimization terminated successfully.
         Current function value: 0.479469
         Iterations 7
                           Results: Logit
====================================================================
Model:               Logit             Pseudo R-squared:  0.308     
Dependent Variable:  deposit           AIC:               57647.1741
Date:                2020-12-13 17:10  BIC:               57818.2380
No. Observations:    60076             Log-Likelihood:    -28805.   
Df Model:            18                LL-Null:           -41642.   
Df Residuals:        60057             LLR p-value:       0.0000    
Converged:           1.0000            Scale:             1.0000    
No. Iterations:      7.0000                                         
--------------------------------------------------------------------
                     Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
--------------------------------------------------------------------
job_admin.          -1.1086   0.0447 -24.7750 0.0000 -1.1963 -1.0209
job_blue-collar     -1.2743   0.0392 -32.4982 0.0000 -1.3511 -1.1974
job_entrepreneur    -1.6783   0.1046 -16.0430 0.0000 -1.8833 -1.4732
job_housemaid       -1.9375   0.1151 -16.8301 0.0000 -2.1632 -1.7119
job_retired         -0.5353   0.0550  -9.7354 0.0000 -0.6430 -0.4275
job_self-employed   -1.4575   0.0846 -17.2257 0.0000 -1.6233 -1.2916
job_student         -0.9760   0.0805 -12.1232 0.0000 -1.1337 -0.8182
job_unemployed      -1.3534   0.0863 -15.6813 0.0000 -1.5226 -1.1843
job_unknown         -2.4720   0.1937 -12.7595 0.0000 -2.8517 -2.0923
marital_divorced    -1.8412   0.0498 -36.9523 0.0000 -1.9388 -1.7435
marital_married     -1.2229   0.0296 -41.3819 0.0000 -1.2809 -1.1650
marital_single      -1.1418   0.0333 -34.3293 0.0000 -1.2070 -1.0767
education_primary   -1.6621   0.0515 -32.2933 0.0000 -1.7630 -1.5612
education_secondary -0.9882   0.0277 -35.6692 0.0000 -1.0426 -0.9339
education_tertiary  -1.3234   0.0303 -43.7156 0.0000 -1.3828 -1.2641
default_yes         -0.9444   0.1359  -6.9505 0.0000 -1.2107 -0.6781
housing_no           0.1569   0.0292   5.3754 0.0000  0.0997  0.2141
housing_yes         -0.7991   0.0313 -25.5400 0.0000 -0.8604 -0.7377
loan_no              2.4685   0.0286  86.3215 0.0000  2.4124  2.5245
====================================================================
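As an interpretive aside (an addition to the original analysis), exponentiating the fitted logit coefficients converts them into odds ratios, which are often easier to read than raw log-odds:

# Odds ratios: e.g., the 2.47 coefficient on loan_no implies roughly 11.8x odds
odds_ratios = np.exp(result.params)
print(odds_ratios.sort_values(ascending=False).head())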

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Out[32]:
LogisticRegression()

The accuracy of this logistic regression model on the test set is determined to be 0.87, as shown below:

In [33]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
Accuracy of logistic regression classifier on test set: 0.87

The confusion matrix shows whether the predictions contain false negatives (Type II errors) or false positives (Type I errors). With scikit-learn's row/column convention, the first row shows 5,782 true negatives and 186 false positives, while the second row shows 1,333 false negatives and 4,715 true positives. The predictions total 10,497 correct results and 1,519 incorrect results.

In [34]:
# The output shows 5782 and 4715 correct predictions
# It also shows 186 and 1333 incorrect predictions
from sklearn.metrics import confusion_matrix
confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)
[[5782  186]
 [1333 4715]]
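As a hedged follow-up sketch, scikit-learn's classification_report summarizes precision, recall, and F1-score per class, giving a fuller picture of performance than accuracy alone:

# Per-class precision, recall, and F1 for the logistic regression classifier
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))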

Ultimately, modeling an effective subscription outcome vis-à-vis deposits hinges on a single variable relationship, that between duration and deposit, with a correlation coefficient of roughly 0.40. Squaring that value yields a coefficient of determination of about 16%, meaning call duration explains only 16% of the variability in deposit outcome. While increasing the duration of the contact attempt can benefit the subscription rate, it will only do so slightly. Therefore, a revised model based on logistic regression is carried out to solve the overarching dilemma of targeted marketing to a select demographic, with 87% accuracy.
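As a worked check of the figure above:

# Squaring the correlation coefficient gives the coefficient of determination
r = 0.40                 # correlation between duration and deposit
print(round(r ** 2, 2))  # 0.16 -> duration explains about 16% of the variance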