Data Preparation - Preprocessing

Team Members: Leonid Shpaner, Christopher Robinson, and Jose Luis Estrada


This notebook covers preprocessing the dataset through various methods of imputation, reclassifying useful categorical features, feature engineering, and removing columns not used for model development (exclusion criteria). It culminates with the data being partitioned into train, validation, and test hold-out sets.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
Mounted at /content/drive
In [2]:
%cd /content/drive/Shared drives/Capstone - Best Group/GitHub Repository/navigating_crime/Code Library
/content/drive/Shared drives/Capstone - Best Group/GitHub Repository/navigating_crime/Code Library
In [3]:
####################################
## import the requisite libraries ##
####################################
import os
import csv
import pandas as pd
import numpy as np

# import data_types .py file created for inspecting data types
from functions import data_types

# library for one-hot encoding
from sklearn.preprocessing import OneHotEncoder

# import library from sci-kit learn for splitting the data
from sklearn.model_selection import train_test_split
In [4]:
# check current working directory
current_directory = os.getcwd()
current_directory
Out[4]:
'/content/drive/Shared drives/Capstone - Best Group/GitHub Repository/navigating_crime/Code Library'

Assign Paths to Folders

In [5]:
# path to the data file
data_file = '/content/drive/Shareddrives/Capstone - Best Group/' \
          + 'Final_Data_20220719/LA_Streets_with_Crimes_By_Division.csv'

# path to data folder
data_folder = '/content/drive/Shareddrives/Capstone - Best Group/' \
            + 'GitHub Repository/navigating_crime/Data Folder/'

# path to the image library
eda_image_path = '/content/drive/Shareddrives/Capstone - Best Group/' \
               + 'GitHub Repository/navigating_crime/Image Folder/EDA Images'
In [6]:
# read in the csv file to a dataframe using pandas
df = pd.read_csv(data_file, low_memory=False).set_index('OBJECTID')
df.head()
Out[6]:
Shape Join_Count TARGET_FID Join_Count_1 TARGET_FID_1 JOIN_FID FullName Type Elevation Surface ... LAT LON OBJECTID_1 APREC PREC AREA_1 PERIMETER SHAPE_Leng Shape_Length_1 Shape_Length
OBJECTID
18 (-13168057.17375, 3992141.1804500036) 1 18 1 18 -1 NaN Alley Unknown Unknown ... 33.7261 -118.2907 20.0 HARBOR 5.0 8.928780e+08 272451.139908 0.807721 0.807721 120.220887
36 (-13180677.51995, 4031566.0669) 1 36 1 36 -1 NaN Secondary Surface Paved ... 34.0205 -118.4040 17.0 PACIFIC 14.0 7.176129e+08 246934.321606 0.757182 0.757182 15.473452
37 (-13180677.51995, 4031566.0669) 1 36 1 37 -1 NaN Secondary Surface Paved ... 34.0205 -118.4040 17.0 PACIFIC 14.0 7.176129e+08 246934.321606 0.757182 0.757182 15.473452
44 (-13160890.369035002, 4035687.0216437643) 1 43 1 44 -1 NaN Minor Surface Paved ... 34.0509 -118.2265 11.0 HOLLENBECK 4.0 4.330323e+08 111317.952873 0.329096 0.329096 118.766240
56 (-13165346.8195, 4030644.1469999994) 1 55 1 56 -1 NaN Secondary Surface Paved ... 34.0138 -118.2663 16.0 NEWTON 13.0 2.723760e+08 81476.922847 0.241128 0.241128 9.731604

5 rows × 80 columns

In [7]:
# show the columns of the dataframe for inspection
df.columns
Out[7]:
Index(['Shape', 'Join_Count', 'TARGET_FID', 'Join_Count_1', 'TARGET_FID_1',
       'JOIN_FID', 'FullName', 'Type', 'Elevation', 'Surface', 'Status',
       'DrivingDir', 'From_L', 'From_R', 'To_L', 'To_R', 'Parity_L',
       'Parity_R', 'StPreDir', 'StPreMod', 'StPreType', 'StArticle', 'StName',
       'StPostType', 'StPostDir', 'StPostMod', 'Zip_L', 'Zip_R', 'LCity_L',
       'LCity_R', 'NameCat_L', 'NameCat_R', 'Accuracy', 'Jurisdiction',
       'Source', 'SourceID', 'UpdateDate', 'MSAG_LCity', 'MSAG_RCity',
       'MSAG_LESN', 'MSAG_RESN', 'Crime_Cost', 'StreetOID', 'Field1', 'DR_NO',
       'Date_Rptd', 'DATE_OCC', 'TIME_OCC', 'AREA', 'AREA_NAME', 'Rpt_Dist_No',
       'Part_1_2', 'Crm_Cd', 'Crm_Cd_Desc', 'Mocodes', 'Vict_Age', 'Vict_Sex',
       'Vict_Descent', 'Premis_Cd', 'Premis_Desc', 'Weapon_Used_Cd',
       'Weapon_Desc', 'Status_1', 'Status_Desc', 'Crm_Cd_1', 'Crm_Cd_2',
       'Crm_Cd_3', 'Crm_Cd_4', 'LOCATION', 'Cross_Street', 'LAT', 'LON',
       'OBJECTID_1', 'APREC', 'PREC', 'AREA_1', 'PERIMETER', 'SHAPE_Leng',
       'Shape_Length_1', 'Shape_Length'],
      dtype='object')
In [8]:
# re-inspect the shape of the dataframe. This is also done on EDA file.
print('There are', df.shape[0], 'rows and', df.shape[1], 'columns in the dataframe.')
There are 183362 rows and 80 columns in the dataframe.

Introduction

The data is ingested from a shared Google Drive path and, in its raw form, contains 183,362 rows and 80 columns. The preprocessing stage begins by inspecting each column's data type along with its respective null count.
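The data_types helper imported from functions.py is not reproduced in this notebook; a minimal sketch of what it returns (one row per column listing its data type, null count, and percent of nulls, with the rounding seen in the output below) might look like the following, though the actual implementation may differ:

def data_types(df):
    # summarize each column's dtype, null count, and percent of nulls
    summary = pd.DataFrame({'Column/Variable': df.columns,
                            'Data Type': df.dtypes.astype(str).values,
                            '# of Nulls': df.isnull().sum().values})
    summary['Percent Null'] = round(summary['# of Nulls'] / len(df) * 100, 0)
    return summary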

In [9]:
# run the python file created for showing columns, data types, and their
# respective null counts (this was imported as a library at the top of this nb)
dtypes = data_types(df)
pd.set_option('display.max_rows', None)
dtypes = dtypes.sort_values(by='# of Nulls', ascending=False)
In [10]:
# subset only the columns that contain some nulls but are not almost
# entirely null (< 84% null); many of these will subsequently be removed
dat_typ = dtypes[(dtypes['Percent Null']<84) & (dtypes['# of Nulls']>0)]

# any column with '_L' and '_R' is of no interest and will be dropped
dat_typ = dat_typ[dat_typ['Column/Variable'].str.contains('_L')==False]
dat_typ = dat_typ[dat_typ['Column/Variable'].str.contains('_R')==False]
dat_typ

# object types are typically removed for ML processing, but some imputation
# may be necessary, which is why they are examined prior
Out[10]:
Column/Variable Data Type # of Nulls Percent Null
18 StPreDir object 83241 45.0
69 Cross_Street object 74575 41.0
23 StPostType object 6189 3.0
61 Weapon_Desc object 3274 2.0
60 Weapon_Used_Cd float64 3274 2.0
6 FullName object 1357 1.0
22 StName object 1357 1.0
75 AREA_1 float64 686 0.0
76 PERIMETER float64 686 0.0
74 PREC float64 686 0.0
73 APREC object 686 0.0
72 OBJECTID_1 float64 686 0.0
54 Mocodes object 219 0.0
57 Vict_Descent object 171 0.0
56 Vict_Sex object 167 0.0
33 Jurisdiction object 28 0.0

For example, there are 3,274 missing values for weapon description. Imputation is necessary to assign the corresponding numerical value in the weapon used code column, which is missing the same 3,274 observations.

The victim descent column has only 171 missing values. The rows containing them could be dropped, but they are instead imputed with 'Unknown' since those ethnicities genuinely remain unknown.

The same holds true for victim sex, which has 167 missing values.

Ethnicity and age are important features that will be numericized and adapted for use with machine learning. Jurisdiction and the other string columns presented herein will subsequently be omitted from the development (modeling) set.

Imputation Strategies

Weapons used in crimes contain useful information, so the missing values in the weapon used code column are imputed with a value of 500, which corresponds to 'unknown' in the data dictionary. Similarly, the missing values in victim sex and ethnicity are imputed with the categorical value 'X' for 'unknown.' The 211 rows with missing zip codes are dropped since they cannot be logically imputed.

In [11]:
# The data dictionary code for an unknown weapon is 500, and its description
# is 'UNKNOWN WEAPON/OTHER WEAPON'. Missing values in 'Weapon_Desc' and
# 'Weapon_Used_Cd' occur in the same rows, so the description is imputed
# first and the matching code of 500 is imputed second.

# impute missing weapon descriptions
df['Weapon_Desc'] = df['Weapon_Desc'].fillna('UNKNOWN WEAPON/OTHER WEAPON')

# impute missing 'Weapon_Used_Cd' with 500 for reasons discussed above.
df['Weapon_Used_Cd'] = df['Weapon_Used_Cd'].fillna(500)

# impute missing values for victim's gender with 'X,' since this corresponds to
# 'Unknown' in the data dictionary
df['Vict_Sex'] = df['Vict_Sex'].fillna('X')

# impute missing values for victim's race with 'X,' since this corresponds to
# 'Unknown' in the data dictionary
df['Vict_Descent'] = df['Vict_Descent'].fillna('X')

# Since there are only 211 rows with missing zip codes, those rows are dropped
df.dropna(subset=['Zip_R'], inplace=True)
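
A quick sanity check (not part of the original cells) can verify that the imputed columns contain no remaining nulls and confirm the new row count after dropping the missing zip codes:

# confirm the four imputed columns no longer contain missing values
imputed_cols = ['Weapon_Desc', 'Weapon_Used_Cd', 'Vict_Sex', 'Vict_Descent']
print(df[imputed_cols].isnull().sum())

# confirm the rows with missing zip codes were dropped
print('Rows remaining after dropping missing zip codes:', len(df))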

Feature Engineering

There exist two date columns, date reported and date occurred, of which only the latter is used to extract the year and month into two new columns, since a date object by itself cannot be used by a machine learning algorithm. In addition, the street premises descriptions are re-labeled from all-uppercase strings to title-case labels (with underscores replacing spaces) and subsequently encoded into separate binarized columns. Similarly, new columns for time of day (derived from military time) and city neighborhoods (derived from area name) are encoded into binarized columns, respectively.

Most importantly, the target column (Crime_Code) is derived from the 'Crm_Cd' column, where 1 denotes any crime code from 1-499 (more severe) and 0 denotes any crime code of 500 or above (less severe).

In [12]:
# create new column only for year of 'Date_Occurred'
df['Year'] = pd.DatetimeIndex(df['DATE_OCC']).year

# create new numeric column for month of year
df['Month_No'] = pd.DatetimeIndex(df['DATE_OCC']).month
# create new string column for month of year
df['Month'] = df['Month_No'].map({1: 'January', 2: 'February', 3: 'March',
                                  4: 'April', 5: 'May', 6: 'June', 7: 'July',
                                  8: 'August', 9: 'September', 10: 'October',
                                  11: 'November', 12: 'December'})

# bin the victim ages into 5-year groups using pd.cut()
df['age_bin'] = pd.cut(df['Vict_Age'],
                       bins=[-1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 
                             60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 
                             115, 120],
                       labels=[' 0-5', ' 5-10', '10-15', '15-20', '20-25',
                               '25-30', '30-35', '35-40', '40-45', '45-50',
                               '50-55', '55-60', '60-65', '65-70', '70-75',
                               '75-80', '80-85', '85-90', '90-95', '95-100', 
                               '100-105', '105-110', '110-115', '115-120'])

# binarize the 'Crm_Cd' column; this is the target column
# 1 is for any crime code from 1-499 (more severe)
# 0 is for any crime code of 500 or above (less severe)
df['Crime_Code'] = df['Crm_Cd'].apply(lambda value: 1 if value <= 499 else \
                                     (0 if value >= 500 else 2))

# re-categorize ground truth as serious vs. not serious crimes where serious 
# crimes are '1' and less serious crimes are '0'
df['crime_severity'] = df['Crime_Code'].map({0:'Less Serious', 1:'More Serious'})

# create a new column of premises types by re-labeling the all-uppercase
# premises descriptions to title case and replacing spaces with underscores
df['Premises'] = df['Premis_Desc'].replace({'STREET': 'Street',
                                            'SIDEWALK': 'Sidewalk',
                                            'PARKING LOT': 'Parking_Lot',
                                            'ALLEY': 'Alley',
                                            'DRIVEWAY': 'Driveway',
                                            'PARK/PLAYGROUND': 'Park_Playground',
                                            'VACANT LOT': 'Vacant_Lot',
                                            'TUNNEL': 'Tunnel',
                                            'PEDESTRIAN OVERCROSSING': \
                                            'Pedestrian_Overcrossing'})

# replace strings with spaces of neighborhoods within 'AREA_NAME' to 
# strings without spaces by replacing the space with an underscore since 
# this is a more acceptable format for a final dataframe
df['AREA_NAME'] = df['AREA_NAME'].replace({'77th Street': '77th_Street',
                                           'N Hollywood': 'N_Hollywood',
                                           'Van Nuys': 'Van_Nuys',
                                           'West Valley': 'West_Valley',
                                           'West LA': 'West_LA'})

# perform one-hot encoding on the 'Premises' column using pd.get_dummies
premise_encode = pd.get_dummies(df['Premises'])
# join these new columns back to the original dataframe
df = premise_encode.join(df)

# map military time (HHMM) from 'TIME_OCC' to a time-of-day label
def applyFunc(s):
    if s==0:
        return 'Midnight'
    elif s<=1159:
        return 'Morning'
    elif s>=1200 and s<=1800:
        return 'Afternoon'
    elif s>=1800 and s<=2200:
        return 'Evening'
    elif s>2200:
        return 'Night'

df['Time_of_Day'] = df['TIME_OCC'].apply(applyFunc)

# dummy encode military time into 'Morning', 'Afternoon', 'Evening', 'Night',
# and 'Midnight' time of day strings, respectively, making them their own
# separate and distinct columns
military_encode = pd.get_dummies(df['Time_of_Day'])
df = military_encode.join(df)

# dummy encode area names into their own separate and distinct columns
area_encode = pd.get_dummies(df['AREA_NAME'])
df = area_encode.join(df)

Reclassifying Useful Categorical Features

Gender and ethnicity are critical factors in making informed decisions about crime. New columns are created for both variables to better represent their characteristics. For example, unidentified genders are replaced with unknowns, and one-letter abbreviations for victim descent are re-categorized into full-text ethnicity descriptions from the city of Los Angeles' data dictionary.

In [13]:
# narrow down the victim sex column to three categories (M, F, X=Unknown).
# Currently there are four values, including 'H', which does not appear in
# the data dictionary and is therefore re-categorized to 'X' (unknown).
df['Victim_Sex'] = df['Vict_Sex'].map({'F':'F', 'H':'X', 'M':'M', 'X':'X'})
df['Gender'] = df['Victim_Sex'].map({'F':1, 'M':2, 'X':3})

# reclassify letters in victim description back to full race description from
# data dictionary nomenclature and add as new column; this will be helpful for 
# additional EDA post pre-processing.
df['Victim_Desc'] = df['Vict_Descent'].map({'A':'Other Asian', 'B':'Black', 'C':
                                         'Chinese', 'D': 'Cambodian', 'F':
                                         'Filipino', 'G': 'Guamanian', 'H':
                                         'Hispanic/Latin/Mexican', 'I':
                                         'American Indian/Alaskan Native', 'J':
                                         'Japanese', 'K': 'Korean', 'L': 'Laotian',
                                         'O': 'Other', 'P': 'Pacific Islander',
                                         'S': 'Samoan', 'U': 'Hawaiian', 'V':
                                         'Vietnamese', 'W': 'White', 'X': 'Unknown',
                                         'Z': 'Asian Indian'})

# write out the preliminarily preprocessed df to new .csv file
df.to_csv('/content/drive/Shareddrives/Capstone - Best Group/Final_Data_20220719'
         +'/df.csv')

Remove Columns Not Used for Model Development (Exclusion Criteria)

Any columns whose rows are entirely null, along with all object (string) data types, are immediately dropped from the dataset, with the remaining columns cast into a new dataframe for model development. Furthermore, only data from the year 2022 is included, since older retrospective data does not provide an accurate enough depiction of current crime in the city of Los Angeles. The longitude and latitude columns are dropped since the encoded city-neighborhood columns carry the geographic information in their place. Moreover, columns such as retained joins, identifiers, report numbers, and the categorical columns already encoded into the original dataframe are omitted. Columns that present only one unique value (Pedestrian_Overcrossing and Tunnel) are removed. Lastly, any column (military time) exceeding a Pearson correlation threshold of r = 0.75 with another predictor is dropped to avoid between-predictor collinearity, and all remaining numeric values in the dataframe are cast to integers.

In [14]:
# make a unique list of cols to remove that are completely missing
cols_remove_null = data_types(df)[(data_types(df)["Percent Null"]==100)] \
                            ['Column/Variable'].unique()
                            
# make a unique list of cols to remove that are non-numeric
cols_remove_string = data_types(df)[(data_types(df)["Data Type"]=='object')] \
                              ['Column/Variable'].unique()

intersect = np.intersect1d(cols_remove_null, cols_remove_string)
print('The following columns intersect as being objects and being null:',
      intersect, '\n')

# Since these columns also exist as objects, they can be removed from the
# list of nulls so that they are not doubled up between the two lists.
refined_null = [x for x in cols_remove_null if x not in cols_remove_string]

# set-up new df without dropped columns (dataframe used later for modeling)
df_prep = df.drop(columns=refined_null) 

# Subset dataframe for observations only taking place in year 2022
df_prep = df_prep[df_prep['Year']==2022]

print('Removed Fully Null columns:')
for c in cols_remove_null:
    print(f'\t{c}')
print()

# these columns are to be removed b/c they are object datatypes 
# (i.e., free text, strings)
df_prep = df_prep.drop(columns=cols_remove_string)

print('Removed string and object columns:')
for c in cols_remove_string:
    print(f'\t{c}')

# drop latitude & longitude columns; they will not be used to inform ML model(s)
print()
print('Removed Latitude and Longitude Columns')
for col in df_prep.columns:
    if 'LAT' in col:
        df_prep = df_prep.drop(columns=col)
        print(f'\t{col}')
    elif 'LON' in col:
        df_prep = df_prep.drop(columns=col)
        print(f'\t{col}')

# drop all raw crime code columns, since the binarized Crime_Code column
# derived from 'Crm_Cd' serves as the target
print()
print('Removed Extraneous Crm_Cd Columns:')
for col in df_prep.columns:
    if 'Crm_Cd' in col:
        df_prep = df_prep.drop(columns=col)
        print(f'\t{col}')
print()

# drop any columns that persisted after join or feature engineering
print()
print('Removed Columns Persisting After JOIN:')

for col in df_prep.columns:
    if 'JOIN_' in col:
        df_prep = df_prep.drop(columns=col)
        print(f'\t{col}')
    elif 'Join_' in col:
        df_prep = df_prep.drop(columns=col)
        print(f'\t{col}')
    elif 'OBJECTID' in col:
        df_prep = df_prep.drop(columns=col)
        print(f'\t{col}')
    elif 'Premis' in col:
        df_prep = df_prep.drop(columns=col)
        print(f'\t{col}')
    elif 'Time' in col:
        df_prep = df_prep.drop(columns=col)
        print(f'\t{col}')   

# removed 'Victim_Sex' column because numericized as new 'Gender' column
# removed 'Victim_Desc' (ethnicity) since these cannot be numerically 
# classified without introducing bias

# drop any columns with '_L' or '_R' nomenclature since these are positional
# street indexes which are used to compile other location-based information

# drop the 'DR_NO' column since it is only a report record identifier
df_prep = df_prep.drop(columns=['From_L', 'From_R', 'To_L', 'To_R', 'Zip_L', 
                                'Zip_R', 'SHAPE_Leng', 'Shape_Length_1',
                                'Shape_Length', 'DR_NO', 'TARGET_FID', 'age_bin',
                                'TARGET_FID_1', 'Accuracy', 'Crime_Cost',
                                'StreetOID', 'Field1', 'AREA', 'AREA_1', 
                                'Part_1_2', 'Weapon_Used_Cd', 'PREC', 'Year',
                                'PERIMETER', 'Rpt_Dist_No'])

print()
print('Removed columns with only one unique value:')
for col in df_prep:
    # remove any columns that have only one unique value in the data
    if df_prep[col].nunique() == 1:
        df_prep = df_prep.drop(columns=col)
        print(f'\t{col}')

# identify and drop any columns exceeding a correlation threshold of 0.75
corr_df = df_prep.reset_index(drop=True) # omit index by subsetting in new df
corr_matrix = corr_df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns with a correlation greater than 0.75
to_drop = [column for column in upper.columns if any(upper[column] > 0.75)]
# Drop features exceeding the correlation threshold of 0.75
df_prep = df_prep.drop(columns=to_drop)
print()
print('These are the columns we should drop: %s'%to_drop)

# cast all numeric variables in preprocessed dataset to integers
df_prep = df_prep.astype(int)
The following columns intersect as being objects and being null: ['MSAG_LCity' 'MSAG_RCity' 'StPostDir' 'StPreMod'] 

Removed Fully Null columns:
	StPreMod
	StPostDir
	StPostMod
	MSAG_LCity
	MSAG_RCity
	MSAG_LESN
	MSAG_RESN
	Crm_Cd_4

Removed string and object columns:
	Shape
	FullName
	Type
	Elevation
	Surface
	Status
	DrivingDir
	Parity_L
	Parity_R
	StPreDir
	StPreMod
	StPreType
	StArticle
	StName
	StPostType
	StPostDir
	LCity_L
	LCity_R
	NameCat_L
	NameCat_R
	Jurisdiction
	Source
	SourceID
	UpdateDate
	MSAG_LCity
	MSAG_RCity
	Date_Rptd
	DATE_OCC
	AREA_NAME
	Crm_Cd_Desc
	Mocodes
	Vict_Sex
	Vict_Descent
	Premis_Desc
	Weapon_Desc
	Status_1
	Status_Desc
	LOCATION
	Cross_Street
	APREC
	Month
	crime_severity
	Premises
	Time_of_Day
	Victim_Sex
	Victim_Desc

Removed Latitude and Longitude Columns
	LAT
	LON

Removed Extraneous Crm_Cd Columns:
	Crm_Cd
	Crm_Cd_1
	Crm_Cd_2
	Crm_Cd_3


Removed Columns Persisting After JOIN:
	Join_Count
	Join_Count_1
	JOIN_FID
	Premis_Cd
	OBJECTID_1

Removed columns with only one unique value:
	Pedestrian_Overcrossing
	Tunnel

These are the columns we should drop: ['TIME_OCC']
In [15]:
# describe the new preprocessed dataset as the development set and list its cols
development_cols = df_prep.columns.to_list()
print('Development Data Column List:', '\n')
for x in development_cols:
    print(x)
print()
print('There are', df_prep.shape[0], 'rows and', df_prep.shape[1], 
      'columns in the development set.')
Development Data Column List: 

77th_Street
Central
Devonshire
Foothill
Harbor
Hollenbeck
Hollywood
Mission
N_Hollywood
Newton
Northeast
Olympic
Pacific
Rampart
Southeast
Southwest
Topanga
Van_Nuys
West_LA
West_Valley
Wilshire
Afternoon
Evening
Morning
Night
Alley
Driveway
Park_Playground
Parking_Lot
Sidewalk
Street
Vacant_Lot
Vict_Age
Month_No
Crime_Code
Gender

There are 42072 rows and 36 columns in the development set.

Data types are examined once more on this preprocessed dataset to ensure that all data types are integers and no columns remain null. At this stage, there are 42,072 rows and 36 columns.

In [16]:
# run the python file for checking data types again 
# this time, for the development set
data_types(df_prep)
Out[16]:
Column/Variable Data Type # of Nulls Percent Null
0 77th_Street int64 0 0.0
1 Central int64 0 0.0
2 Devonshire int64 0 0.0
3 Foothill int64 0 0.0
4 Harbor int64 0 0.0
5 Hollenbeck int64 0 0.0
6 Hollywood int64 0 0.0
7 Mission int64 0 0.0
8 N_Hollywood int64 0 0.0
9 Newton int64 0 0.0
10 Northeast int64 0 0.0
11 Olympic int64 0 0.0
12 Pacific int64 0 0.0
13 Rampart int64 0 0.0
14 Southeast int64 0 0.0
15 Southwest int64 0 0.0
16 Topanga int64 0 0.0
17 Van_Nuys int64 0 0.0
18 West_LA int64 0 0.0
19 West_Valley int64 0 0.0
20 Wilshire int64 0 0.0
21 Afternoon int64 0 0.0
22 Evening int64 0 0.0
23 Morning int64 0 0.0
24 Night int64 0 0.0
25 Alley int64 0 0.0
26 Driveway int64 0 0.0
27 Park_Playground int64 0 0.0
28 Parking_Lot int64 0 0.0
29 Sidewalk int64 0 0.0
30 Street int64 0 0.0
31 Vacant_Lot int64 0 0.0
32 Vict_Age int64 0 0.0
33 Month_No int64 0 0.0
34 Crime_Code int64 0 0.0
35 Gender int64 0 0.0
In [17]:
df_prep = df_prep.copy()
df_prep.shape
Out[17]:
(42072, 36)

Train-Test-Validation Split

The data is split into fifty percent training, twenty-five percent validation, and twenty-five percent test (hold-out) sets, respectively, and the three resulting files are saved to the data folder path.

In [18]:
# set random state for reproducibility
rstate = 222
In [19]:
# split the data into a 50% train set, 25% validation set, and 25% test set
train_prop = 0.50
valid_prop = 0.25
test_prop = 0.25
size_train = np.int64(round(train_prop*len(df_prep),2))
size_valid = np.int64(round(valid_prop*len(df_prep),2))
size_test = np.int64(round(test_prop*len(df_prep),2))
size_total = size_test + size_valid + size_train

# first split off the training set; the remainder is then split into the
# validation and test sets
train, test = train_test_split(df_prep, train_size=size_train,\
                              random_state=rstate)
valid, test = train_test_split(test, train_size=size_valid,\
                              random_state=rstate)

print('Training size:', size_train)
print('Validation size:', size_valid)
print('Test size:', size_test)
print('Total size:', size_train + size_valid + size_test, '\n')
print('Training percentage:', round(size_train/(size_total),2))
print('Validation percentage:', round(size_valid/(size_total),2))
print('Test percentage:', round(size_test/(size_total),2))
print('Total percentage:', (size_train/size_total + size_valid/size_total \
                          + size_test/size_total)*100)
Training size: 21036
Validation size: 10518
Test size: 10518
Total size: 42072 

Training percentage: 0.5
Validation percentage: 0.25
Test percentage: 0.25
Total percentage: 100.0
In [20]:
# assign the train, validation, and test splits to their own dataframes
train_set = train
valid_set = valid
test_set = test

Save Train, Validation, and Test Sets as Separate .CSV Files

In [21]:
# write out the train, validation, and test sets to the data folder as their
# own unique .csv files. The train and validation sets may be used freely
# during model development; however, the test set may only be evaluated once.
train_set.to_csv(data_folder + '/train_set.csv')
valid_set.to_csv(data_folder + '/valid_set.csv')
test_set.to_csv(data_folder + '/test_set.csv')
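
A downstream modeling notebook can then read these files back in; a minimal sketch, assuming the OBJECTID index written above is restored on load:

# read the saved splits back in, restoring the OBJECTID index
train_set = pd.read_csv(data_folder + '/train_set.csv', index_col='OBJECTID')
valid_set = pd.read_csv(data_folder + '/valid_set.csv', index_col='OBJECTID')
test_set = pd.read_csv(data_folder + '/test_set.csv', index_col='OBJECTID')

# separate the predictors from the Crime_Code target for model development
X_train = train_set.drop(columns='Crime_Code')
y_train = train_set['Crime_Code']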