Team Members: Leonid Shpaner, Christopher Robinson, and Jose Luis Estrada
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
%cd /content/drive/Shared drives/Capstone - Best Group/GitHub Repository/navigating_crime/Code Library
####################################
## import the requisite libraries ##
####################################
import os
import csv
import pandas as pd
import numpy as np
# import the data_types function from the functions.py file created for
# inspecting data types
from functions import data_types
# library for one-hot encoding
from sklearn.preprocessing import OneHotEncoder
# import the function from scikit-learn for splitting the data
from sklearn.model_selection import train_test_split
# check current working directory
current_directory = os.getcwd()
current_directory
# path to the data file
data_file = '/content/drive/Shareddrives/Capstone - Best Group/' \
+ 'Final_Data_20220719/LA_Streets_with_Crimes_By_Division.csv'
# path to data folder
data_folder = '/content/drive/Shareddrives/Capstone - Best Group/' \
+ 'GitHub Repository/navigating_crime/Data Folder/'
# path to the image library
eda_image_path = '/content/drive/Shareddrives/Capstone - Best Group/' \
+ 'GitHub Repository/navigating_crime/Image Folder/EDA Images'
# read in the csv file to a dataframe using pandas
df = pd.read_csv(data_file, low_memory=False).set_index('OBJECTID')
df.head()
# show the columns of the dataframe for inspection
df.columns
# re-inspect the shape of the dataframe; this is also done in the EDA file
print('There are', df.shape[0], 'rows and', df.shape[1],
      'columns in the dataframe.')
The data is ingested from a shared Google Drive path and, in its raw form, contains 183,362 rows and 80 columns. The preprocessing stage begins by inspecting each column's data type along with its respective null count.
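The data_types helper imported from functions.py is not reproduced in this notebook. Below is a minimal sketch of what it presumably returns, assuming a summary dataframe with the 'Column/Variable', 'Data Type', '# of Nulls', and 'Percent Null' columns referenced in the cells that follow.
# a minimal sketch of the data_types helper (assumed implementation); the
# actual version lives in functions.py and may differ in detail
import pandas as pd

def data_types_sketch(frame):
    # summarize each column's dtype, null count, and null percentage
    return pd.DataFrame({
        'Column/Variable': frame.columns,
        'Data Type': frame.dtypes.astype(str).values,
        '# of Nulls': frame.isnull().sum().values,
        'Percent Null': (frame.isnull().mean() * 100).round(2).values,
    })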
# run the python file created for showing columns, data types, and their
# respective null counts (this was imported as a library at the top of this nb)
dtypes = data_types(df)
pd.set_option('display.max_rows', None)
dtypes = dtypes.sort_values(by='# of Nulls', ascending=False)
# subset only the columns that contain nulls (and are less than 84% null);
# many of these columns are subsequently removed
dat_typ = dtypes[(dtypes['Percent Null']<84) & (dtypes['# of Nulls']>0)]
# any column with '_L' or '_R' in its name is of no interest and will be dropped
dat_typ = dat_typ[dat_typ['Column/Variable'].str.contains('_L')==False]
dat_typ = dat_typ[dat_typ['Column/Variable'].str.contains('_R')==False]
dat_typ
# object (string) columns are typically removed before ML processing, but some
# imputation may be necessary, which is why they are examined first
For example, there are 3,274 missing values in the weapon description column. Imputation is necessary to assign the proper numerical value to the weapon used code column, which is missing the same 3,274 observations.
The victim descent column has only 171 missing values. The corresponding rows could be dropped, but they are instead imputed with 'Unknown' since the ethnicities simply remain unknown.
The same holds true for victim's sex, which has 167 missing values.
Ethnicity and age are important features that will be converted to numeric form for use with machine learning. Jurisdiction and the other string columns presented herein are subsequently omitted from the development (modeling) set.
Weapons used in crimes contain useful information, so the missing rows are imputed with a value of 500, which corresponds to 'unknown' in the data dictionary. Similarly, the missing values in victim's sex and ethnicity are each imputed with the categorical value "X" for "unknown." The 211 missing values in the zip code column are dropped since they cannot be logically imputed.
# Since the code for unknown weapons is 500, missing values for 'Weapon_Used_Cd'
# where 'Weapon_Desc' is 'UNKNOWN WEAPON/OTHER WEAPON' will be 500.
# However, since missing values in one column correspond to the other,
# missing values in 'Weapon_Desc' will need to be imputed first with
# 'UNKNOWN WEAPON/OTHER WEAPON'
# impute missing weapon descriptions
df['Weapon_Desc'] = df['Weapon_Desc'].fillna('UNKNOWN WEAPON/OTHER WEAPON')
# impute missing 'Weapon_Used_Cd' with 500 for reasons discussed above.
df['Weapon_Used_Cd'] = df['Weapon_Used_Cd'].fillna(500)
# impute missing values for victim's gender with 'X,' since this corresponds to
# 'Unknown' in the data dictionary
df['Vict_Sex'] = df['Vict_Sex'].fillna('X')
# impute missing values for victim's race with 'X,' since this corresponds to
# 'Unknown' in the data dictionary
df['Vict_Descent'] = df['Vict_Descent'].fillna('X')
# Since there are only 211 rows with missing zip codes, those rows are dropped
df.dropna(subset=['Zip_R'], inplace=True)
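As a quick sanity check (not part of the original notebook), it is worth confirming that the imputations and row drops above left no missing values behind in the affected columns.
# verify that the imputed/dropped columns no longer contain missing values
imputed_cols = ['Weapon_Desc', 'Weapon_Used_Cd', 'Vict_Sex', 'Vict_Descent', 'Zip_R']
print(df[imputed_cols].isnull().sum())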
There are two date columns, date reported and date occurred; only the latter is used, with the year and month extracted into two new columns, since a raw date object cannot be fed directly into a machine learning algorithm. In addition, descriptions of street premises are reformatted (title case, with underscores in place of spaces) in a new column and subsequently encoded into separate binarized columns. Similarly, new columns for time of day (derived from military time) and city neighborhoods (derived from the area name) are encoded into binarized columns.
More importantly, the target column (Crime_Code) is derived from the values in the 'Crm_Cd' column, where 1 denotes any crime code in the range 1-499 (more severe) and 0 denotes any code of 500 or above (less severe).
# create new column only for year of 'Date_Occurred'
df['Year'] = pd.DatetimeIndex(df['DATE_OCC']).year
# create new numeric column for month of year
df['Month_No'] = pd.DatetimeIndex(df['DATE_OCC']).month
# create new string column for month of year
df['Month'] = df['Month_No'].map({1: 'January', 2: 'February', 3: 'March',
4: 'April', 5: 'May', 6: 'June', 7: 'July',
8: 'August', 9: 'September', 10: 'October',
11: 'November', 12: 'December'})
# bin the ages using pd.cut() function
#
df['age_bin'] = pd.cut(df['Vict_Age'],
bins=[-1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55,
60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110,
115, 120],
labels=[' 0-5', ' 5-10', '10-15', '15-20', '20-25',
'25-30', '30-35', '35-40', '40-45', '45-50',
'50-55', '55-60', '60-65', '65-70', '70-75',
'75-80', '80-85', '85-90', '90-95', '95-100',
'100-105', '105-110', '110-115', '115-120'])
# binarize the 'Crm_Cd' column; this is the target column
# 1 is for any crime code in the range 1-499 (more severe)
# 0 is for any code of 500 and above (less severe)
df['Crime_Code'] = df['Crm_Cd'].apply(lambda value: 1 if value <= 499 else \
                                     (0 if value >= 500 else 2))
# re-categorize ground truth as serious vs. not serious crimes where serious
# crimes are '1' and less serious crimes are '0'
df['crime_severity'] = df['Crime_Code'].map({0:'Less Serious', 1:'More Serious'})
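Because this binarized column drives the rest of the analysis, a quick look at the resulting class balance is worthwhile; a brief sketch (the counts themselves are not reported here):
# inspect the class balance of the binarized target (sketch)
print(df['Crime_Code'].value_counts())
print(df['crime_severity'].value_counts(normalize=True).round(3))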
# create a new column of premises types by converting the all-caps premises
# descriptions to title case and replacing spaces with underscores
df['Premises'] = df['Premis_Desc'].replace({'STREET': 'Street',
'SIDEWALK': 'Sidewalk',
'PARKING LOT': 'Parking_Lot',
'ALLEY': 'Alley',
'DRIVEWAY': 'Driveway',
'PARK/PLAYGROUND': 'Park_Playground',
'VACANT LOT': 'Vacant_Lot',
'TUNNEL': 'Tunnel',
'PEDESTRIAN OVERCROSSING': \
'Pedestrian_Overcrossing'})
# replace strings with spaces of neighborhoods within 'AREA_NAME' to
# strings without spaces by replacing the space with an underscore since
# this is a more acceptable format for a final dataframe
df['AREA_NAME'] = df['AREA_NAME'].replace({'77th Street': '77th_Street',
'N Hollywood': 'N_Hollywood',
'Van Nuys': 'Van_Nuys',
'West Valley': 'West_Valley',
'West LA': 'West_LA'})
# perform one-hot encoding on the 'Premises' column using pd.get_dummies
premise_encode = pd.get_dummies(df['Premises'])
# join these new columns back to the original dataframe
df = premise_encode.join(df)
# helper function to bin military time into times of day
def applyFunc(s):
    if s == 0:
        return 'Midnight'
    elif s <= 1159:
        return 'Morning'
    elif s >= 1200 and s <= 1800:
        return 'Afternoon'
    elif s >= 1800 and s <= 2200:
        return 'Evening'
    elif s > 2200:
        return 'Night'

# map each military time in 'TIME_OCC' to its time-of-day label
df['Time_of_Day'] = df['TIME_OCC'].apply(applyFunc)
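A few sample military-time values (hypothetical inputs, chosen only for illustration) show how the boundaries above resolve; note that 1800 maps to 'Afternoon' because the first matching branch wins.
# illustrate the time-of-day boundaries with a few sample military times
for t in [0, 600, 1159, 1200, 1800, 1930, 2200, 2300]:
    print(t, '->', applyFunc(t))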
# dummy encode military time into 'Morning', 'Afternoon', 'Evening', 'Night',
# and 'Midnight' time of day strings, respectively, making them their own
# separate and distinct columns
military_encode = pd.get_dummies(df['Time_of_Day'])
df = military_encode.join(df)
# dummy encode area names into their own separate and distinct columns
area_encode = pd.get_dummies(df['AREA_NAME'])
df = area_encode.join(df)
Gender and ethnicity are critical factors in making informed decisions about crime. New columns are created for both variables to better represent their characteristics. For example, unidentified genders are replaced with unknowns, and one-letter abbreviations for victim descent are re-categorized into full-text ethnicity descriptions from the city of Los Angeles' data dictionary.
# narrow the victim sex column down to three categories (M, F, X=Unknown).
# Currently there are four, including "H". "H" is re-categorized to "X" since it
# is not in the data dictionary and thus remains unknown.
df['Victim_Sex'] = df['Vict_Sex'].map({'F':'F', 'H':'X', 'M':'M', 'X':'X'})
df['Gender'] = df['Victim_Sex'].map({'F':1, 'M':2, 'X':3})
# reclassify letters in victim description back to full race description from
# data dictionary nomenclature and add as new column; this will be helpful for
# additional EDA post pre-processing.
df['Victim_Desc'] = df['Vict_Descent'].map({'A':'Other Asian', 'B':'Black', 'C':
'Chinese', 'D': 'Cambodian', 'F':
'Filipino', 'G': 'Guamanian', 'H':
'Hispanic/Latin/Mexican', 'I':
'American Indian/Alaskan Native', 'J':
'Japanese', 'K': 'Korean', 'L': 'Laotian',
'O': 'Other', 'P': 'Pacific Islander',
'S': 'Samoan', 'U': 'Hawaiian', 'V':
'Vietnamese', 'W': 'White', 'X': 'Unknown',
'Z': 'Asian Indian'})
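Since .map() returns NaN for any code that is absent from the supplied dictionary, a quick check that the recoding introduced no new missing values is a reasonable safeguard (a sketch, not in the original notebook):
# .map() leaves NaN for unmapped codes; confirm no new missing values appeared
print(df[['Victim_Sex', 'Gender', 'Victim_Desc']].isnull().sum())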
# write out the preliminarily preprocessed df to new .csv file
df.to_csv('/content/drive/Shareddrives/Capstone - Best Group/Final_Data_20220719'
+'/df.csv')
Any columns that are entirely null, as well as object (string) columns, are dropped from the dataset; the result is cast into a new dataframe (df_prep) so the original is retained for posterity. Furthermore, only data from the year 2022 is kept, since older retrospective data does not accurately depict current crime in the city of Los Angeles. Longitude and latitude columns are dropped because the encoded city neighborhood columns carry the geographic information in their place. Moreover, columns such as retained join keys, identification numbers, report numbers, and the categorical columns that were encoded onto the original dataframe are omitted. Columns that present only one unique value (Pedestrian Overcrossing and Tunnel) are removed. Lastly, any column (military time) exceeding a Pearson correlation threshold of r = 0.75 is dropped to avoid between-predictor collinearity, and all remaining numeric values in the dataframe are cast to integer type.
# make a unique list of cols to remove that are completely missing
cols_remove_null = data_types(df)[(data_types(df)["Percent Null"]==100)] \
['Column/Variable'].unique()
# make a unique list of cols to remove that are non-numeric
cols_remove_string = data_types(df)[(data_types(df)["Data Type"]=='object')] \
['Column/Variable'].unique()
intersect = np.intersect1d(cols_remove_null, cols_remove_string)
print('The following columns intersect as being objects and being null:',
intersect, '\n')
# Since these columns exist as objects too, they can be removed from the
# list of nulls such that they are not doubled up between the two.
refined_null = [x for x in cols_remove_null if x not in cols_remove_string]
# set-up new df without dropped columns (dataframe used later for modeling)
df_prep = df.drop(columns=refined_null)
# Subset dataframe for observations only taking place in year 2022
df_prep = df_prep[df_prep['Year']==2022]
print('Removed Fully Null columns:')
for c in cols_remove_null:
print(f'\t{c}')
print()
# these columns are to be removed b/c they are object datatypes
# (i.e., free text, strings)
df_prep = df_prep.drop(columns=cols_remove_string)
print('Removed string and object columns:')
for c in cols_remove_string:
print(f'\t{c}')
# drop latitude & longitude columns; they will not be used to inform ML model(s)
print()
print('Removed Latitude and Longitude Columns:')
for col in df_prep.columns:
if 'LAT' in col:
df_prep = df_prep.drop(columns=col)
print(f'\t{col}')
elif 'LON' in col:
df_prep = df_prep.drop(columns=col)
print(f'\t{col}')
# drop the raw crime code columns ('Crm_Cd' and any numbered variants), since
# the binarized 'Crime_Code' column serves as the target
print()
print('Removed Extraneous Crm_Cd Columns:')
for col in df_prep.columns:
if 'Crm_Cd' in col:
df_prep = df_prep.drop(columns=col)
print(f'\t{col}')
print()
# drop any columns that persisted after join or feature engineering
print()
print('Removed Columns Persisting After JOIN:')
for col in df_prep.columns:
if 'JOIN_' in col:
df_prep = df_prep.drop(columns=col)
print(f'\t{col}')
elif 'Join_' in col:
df_prep = df_prep.drop(columns=col)
print(f'\t{col}')
elif 'OBJECTID' in col:
df_prep = df_prep.drop(columns=col)
print(f'\t{col}')
elif 'Premis' in col:
df_prep = df_prep.drop(columns=col)
print(f'\t{col}')
elif 'Time' in col:
df_prep = df_prep.drop(columns=col)
print(f'\t{col}')
# removed 'Victim_Sex' column because numericized as new 'Gender' column
# removed 'Victim_Desc' (ethnicity) since these cannot be numerically
# classified without introducing bias
# drop any columns with '_L' or '_R' nomenclature since these are positional
# street indexes which are used to compile other location-based information
# drop the 'DR_NO' column since these are associated with records
df_prep = df_prep.drop(columns=['From_L', 'From_R', 'To_L', 'To_R', 'Zip_L',
'Zip_R', 'SHAPE_Leng', 'Shape_Length_1',
'Shape_Length', 'DR_NO', 'TARGET_FID', 'age_bin',
'TARGET_FID_1', 'Accuracy', 'Crime_Cost',
'StreetOID', 'Field1', 'AREA', 'AREA_1',
'Part_1_2', 'Weapon_Used_Cd', 'PREC', 'Year',
'PERIMETER', 'Rpt_Dist_No'])
print()
print('Removed columns with only one unique value:')
for col in df_prep:
# remove any columns that have only one unique value in the data
if df_prep[col].nunique() == 1:
df_prep = df_prep.drop(columns=col)
print(f'\t{col}')
# identify and drop any columns exceeding a correlation threshold of 0.75
corr_df = df_prep.reset_index(drop=True) # omit index by subsetting in new df
corr_matrix = corr_df.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find feature columns with an absolute correlation greater than 0.75
to_drop = [column for column in upper.columns if any(upper[column] > 0.75)]
# Drop features exceeding the correlation threshold of 0.75
df_prep = df_prep.drop(columns=to_drop)
print()
print('Columns dropped for exceeding the correlation threshold: %s' % to_drop)
# cast all numeric variables in preprocessed dataset to integers
df_prep = df_prep.astype(int)
# describe the new preprocessed dataset as the development set and list its cols
development_cols = df_prep.columns.to_list()
print('Development Data Column List:', '\n')
for x in development_cols:
print(x)
print()
print('There are', df_prep.shape[0], 'rows and', df_prep.shape[1],
'columns in the development set.')
Data types are examined once more on this preprocessed dataset to ensure that all data types are integers and no columns remain null. At this stage, there are 42,072 rows and 38 columns.
# run the python file for checking data types again
# this time, for the development set
data_types(df_prep)
df_prep = df_prep.copy()
df_prep.shape
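A minimal programmatic check of the integer-only, fully populated claim above (a sketch):
# confirm every remaining column is integer-typed and fully populated (sketch)
print(df_prep.dtypes.value_counts())
print('Total nulls remaining:', df_prep.isnull().sum().sum())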
The data is split into training (fifty percent), validation (twenty-five percent), and holdout test (twenty-five percent) sets, and three separate files are saved to the data folder path.
# set random state for reproducibility
rstate = 222
# allocate 50% of the data to the train set, 25% to the validation set,
# and 25% to the test set
train_prop = 0.50
valid_prop = 0.25
test_prop = 0.25
size_train = np.int64(round(train_prop*len(df_prep),2))
size_valid = np.int64(round(valid_prop*len(df_prep),2))
size_test = np.int64(round(test_prop*len(df_prep),2))
size_total = size_test + size_valid + size_train
# split the data into the development (train and validation)
# and test set, respectively
train, test = train_test_split(df_prep, train_size=size_train,\
random_state=rstate)
valid, test = train_test_split(test, train_size=size_valid,\
random_state=rstate)
print('Training size:', size_train)
print('Validation size:', size_valid)
print('Test size:', size_test)
print('Total size:', size_train + size_valid + size_test, '\n')
print('Training percentage:', round(size_train/(size_total),2))
print('Validation percentage:', round(size_valid/(size_total),2))
print('Test percentage:', round(size_test/(size_total),2))
print('Total percentage:', (size_train/size_total + size_valid/size_total \
+ size_test/size_total)*100)
# assign the train, validation, and test splits to their own respective dataframes
train_set = train
valid_set = valid
test_set = test
# write out the train, validation, and test sets to .csv files to the data folder
# such that they are contained in their own unique files
# the train and validation sets can be used without prejudice; however, models
# may ONLY be evaluated on the test set once
train_set.to_csv(data_folder + '/train_set.csv')
valid_set.to_csv(data_folder + '/valid_set.csv')
test_set.to_csv(data_folder + '/test_set.csv')
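When the downstream modeling notebooks pick these files back up, they would presumably be re-read from the same data folder and separated into features and the Crime_Code target along these lines (a sketch only; index_col=0 assumes the default index written out above, and the exact loading code belongs to the modeling stage):
# sketch of how a modeling notebook might reload the saved splits;
# index_col=0 assumes the default pandas index written out above
train_set = pd.read_csv(data_folder + '/train_set.csv', index_col=0)
valid_set = pd.read_csv(data_folder + '/valid_set.csv', index_col=0)

# separate predictors from the binarized target
X_train = train_set.drop(columns=['Crime_Code'])
y_train = train_set['Crime_Code']
X_valid = valid_set.drop(columns=['Crime_Code'])
y_valid = valid_set['Crime_Code']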