Exploratory Data Analysis - Phase II

Team Members: Leonid Shpaner, Christopher Robinson, and Jose Luis Estrada


This notebook takes a more granular look at the columns of interest using boxplots, stacked bar graphs, and histograms, and culminates in a correlation matrix.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
Mounted at /content/drive
In [2]:
%cd /content/drive/Shared drives/Capstone - Best Group/GitHub Repository/navigating_crime/Code Library
/content/drive/Shared drives/Capstone - Best Group/GitHub Repository/navigating_crime/Code Library
In [3]:
####################################
## import the requisite libraries ##
####################################
import os
import csv
import pandas as pd
import numpy as np

# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
# suppress all warnings (including FutureWarning) for cleaner output
warnings.filterwarnings('ignore')
In [4]:
# check current working directory
current_directory = os.getcwd()
current_directory
Out[4]:
'/content/drive/Shared drives/Capstone - Best Group/GitHub Repository/navigating_crime/Code Library'

Assign Paths to Folders

In [5]:
# path to the data file
data_frame = '/content/drive/Shareddrives/Capstone - Best Group/' \
           + 'Final_Data_20220719/df.csv'

# path to data folder
data_folder = '/content/drive/Shareddrives/Capstone - Best Group/' \
            +  'GitHub Repository/navigating_crime/Data Folder/'

# path to the training file
train_path = '/content/drive/Shareddrives/Capstone - Best Group/' \
           + 'GitHub Repository/navigating_crime/Data Folder/train_set.csv'

# path to the image library
eda_image_path = '/content/drive/Shareddrives/Capstone - Best Group/' \
               + 'GitHub Repository/navigating_crime/Image Folder/EDA Images'
In [6]:
# bring in original dataframe as preprocessed in the 
# data_preparation.ipynb file
df = pd.read_csv(data_frame, low_memory=False).set_index('OBJECTID')
In [7]:
# re-inspect the shape of the dataframe. 
print('There are', df.shape[0], 'rows and', df.shape[1], 
      'columns in the dataframe.')
There are 183151 rows and 125 columns in the dataframe.

Age Range Statistics

The top three age ranges of crime victims are 25-30, 20-25, and 30-35, reporting 25,792, 22,235, and 21,801 crimes, respectively.

In [8]:
# this bar_plot library was created as a bar_plot.py file during the EDA
# Phase I stage; it can be accessed in that respective notebook
from functions import bar_plot
bar_plot(15, 10, df, False, 'bar', 'Bar Graph of Age Ranges', 0, 
         "Victims' Age Range", 'Count', 'age_bin', 100)
plt.savefig(eda_image_path + '/age_range_bargraph.png', bbox_inches='tight')
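For reference, the figure above can be approximated without the shared helper. The following is a minimal, hypothetical stand-in for bar_plot using pandas' built-in plotting; it assumes age_bin is an ordered categorical (otherwise sort_index will order the bins lexicographically).

# minimal stand-in for bar_plot (the real helper is shared across notebooks);
# plot the value counts of the age_bin column as a bar chart
ax = df['age_bin'].value_counts().sort_index().plot(
    kind='bar', figsize=(15, 10), rot=0)
ax.set_title('Bar Graph of Age Ranges')
ax.set_xlabel("Victims' Age Range")
ax.set_ylabel('Count')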

Contingency Table

A contingency table summarizes the data in any column of interest against the values in the target column (crime severity).

In [9]:
from functions import cont_table
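The actual cont_table implementation lives in functions.py; as a hedged sketch of the idea, an equivalent can be built on pd.crosstab, producing per-class counts, row totals, and the share of the more serious class (cont_table_sketch and its signature below are illustrative, not the project's API).

def cont_table_sketch(df, target, neg_label, column, pos_label):
    # cross-tabulate the column of interest against the target,
    # appending row and column totals
    table = pd.crosstab(df[column], df[target],
                        margins=True, margins_name='Total')
    # share of the positive ('More Serious') class per row
    table['% ' + pos_label] = (100 * table[pos_label]
                               / table['Total']).round(2)
    return table

# e.g., cont_table_sketch(df, 'crime_severity', 'Less Serious',
#                         'age_bin', 'More Serious')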

Summary Statistics

Calling summ_stats from the functions.py library provides summary statistics for any numeric column in the dataframe, grouped by a categorical column of interest.

In [10]:
from functions import summ_stats
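Likewise, the real summ_stats is defined in functions.py; assuming it simply aggregates a numeric column grouped by a categorical one, a minimal sketch might look like this (summ_stats_sketch is illustrative only):

def summ_stats_sketch(df, group_col, num_col):
    # group the numeric column by the categorical one, then aggregate
    stats = df.groupby(group_col)[num_col].agg(
        ['mean', 'median', 'std', 'min', 'max']).round(2)
    stats.columns = ['Mean', 'Median', 'Standard Deviation',
                     'Minimum', 'Maximum']
    return stats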

Status Description by Age

In [11]:
summ_stats(df, 'Status_Desc', 'Vict_Age')
Summary Statistics by Age
Out[11]:
Mean Median Standard Deviation Minimum Maximum
Status_Desc
Adult Arrest 33.00 32.00 17.85 0.00 99.00
Adult Other 35.69 34.00 15.24 0.00 99.00
Invest Cont 34.64 33.00 16.96 0.00 120.00
Juv Arrest 26.26 23.00 16.39 0.00 81.00
Juv Other 23.35 17.00 16.01 0.00 76.00

Victim Sex by Age

In [12]:
summ_stats(df, 'Victim_Sex', 'Vict_Age')
Summary Statistics by Age
Out[12]:
Mean Median Standard Deviation Minimum Maximum
Victim_Sex
F 34.44 32.00 15.19 0.00 99.00
M 36.32 35.00 16.46 0.00 99.00
X 5.69 0.00 12.95 0.00 120.00

Stacked Bar Plots

This function provides a stacked and a normalized bar graph of any column of interest, colored by the ground-truth (crime severity) column.

In [13]:
from functions import stacked_plot
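The project's stacked_plot helper takes explicit sizing, labeling, and layout arguments, as the calls below show. As a hedged sketch of the underlying idea, a stripped-down version could draw raw and row-normalized stacked bars side by side (stacked_plot_sketch is illustrative, not the functions.py implementation):

def stacked_plot_sketch(df, col, target, figsize=(15, 5)):
    # two-way counts of the column of interest vs. the target
    counts = pd.crosstab(df[col], df[target])
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)
    # raw stacked counts
    counts.plot(kind='bar', stacked=True, ax=ax1, title='Raw Counts')
    # divide each row by its total so bars show within-group proportions
    counts.div(counts.sum(axis=1), axis=0).plot(
        kind='bar', stacked=True, ax=ax2, title='Normalized')
    fig.tight_layout()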

Crime Severity by Age Group

Crime severity splits at a roughly even ratio within each age group, the exception being the highest age range (115-120), where crime incidence is low but every recorded crime is more serious. Moreover, it is interesting to note that there are 17,611 more serious crimes than less serious ones, with more serious crimes comprising a majority (54.81%) of all cases.

In [14]:
age_table = cont_table(df, 'crime_severity', 'Less Serious', 'age_bin', 
                       'More Serious').data
age_table
Out[14]:
Less Serious More Serious Total % More Serious
age_bin
0-5 6,731.00 5,516.00 12,247.00 45.04
5-10 343.00 387.00 730.00 53.01
10-15 2,329.00 1,932.00 4,261.00 45.34
15-20 6,348.00 8,465.00 14,813.00 57.15
20-25 9,697.00 12,538.00 22,235.00 56.39
25-30 10,975.00 14,817.00 25,792.00 57.45
30-35 9,708.00 12,093.00 21,801.00 55.47
35-40 8,583.00 10,716.00 19,299.00 55.53
40-45 6,744.00 8,802.00 15,546.00 56.62
45-50 5,802.00 6,975.00 12,777.00 54.59
50-55 5,218.00 6,308.00 11,526.00 54.73
55-60 4,360.00 4,950.00 9,310.00 53.17
60-65 3,011.00 3,462.00 6,473.00 53.48
65-70 1,638.00 1,797.00 3,435.00 52.31
70-75 697.00 799.00 1,496.00 53.41
75-80 326.00 445.00 771.00 57.72
80-85 124.00 183.00 307.00 59.61
85-90 65.00 68.00 133.00 51.13
90-95 17.00 38.00 55.00 69.09
95-100 54.00 86.00 140.00 61.43
115-120 NaN 4.00 4.00 0.00
Total 82,770.00 100,381.00 183,151.00 54.81
In [15]:
stacked_plot(15, 10, 10, df, 'age_bin', 'crime_severity', 'Less Serious', 'bar', 
             'Crime Severity by Age Group', 'Age Group', 'Count', 0.9, 0,
             'Crime Severity by Age Group - Normalized', 'Age Group', 
             'Frequency')
plt.savefig(eda_image_path + '/age_crime_stacked_bar.png', bbox_inches='tight')

Crime Severity by Street Type

It is interesting to note that the fewest crimes occur in alleys, pedestrian walkways, private roads (paved or unpaved), and trails, and that, with the exception of private roads, the crimes that do occur in these locations also tend to be less severe.

In [16]:
street_type_table = cont_table(df, 'crime_severity', 'Less Serious', 
                               'Type', 'More Serious').data
street_type_table
Out[16]:
Less Serious More Serious Total % More Serious
Type
Alley 240.00 190.00 430.00 44.19
Minor 41,420.00 49,066.00 90,486.00 54.22
Pedestrian Walkway 211.00 49.00 260.00 18.85
Primary 29,291.00 35,953.00 65,244.00 55.11
Private Road 31.00 38.00 69.00 55.07
Secondary 11,562.00 15,077.00 26,639.00 56.60
Trail 15.00 6.00 21.00 28.57
Unpaved Road NaN 2.00 2.00 0.00
Total 82,770.00 100,381.00 183,151.00 54.81
In [17]:
stacked_plot(15, 10, 10, df, 'Type', 'crime_severity', 'Less Serious', 'bar', 
             'Crime Severity by Street Type', 'Street Type', 'Count', 0.9, 0, 
             'Crime Severity by Street Type - Normalized', 'Street Type',
             'Frequency')
plt.savefig(eda_image_path + '/streettype_stacked_bar.png', bbox_inches='tight')

Crime Severity by Sex

There are more male than female victims in this dataset, and both the raw and normalized distributions show that more serious crimes occur with a higher prevalence among males (69,370, or 64.24%) than among females (28,565, or 41.70%). For victims of unknown sex, the prevalence of more serious crimes is 36.70%.

In [18]:
sex_table = cont_table(df, 'crime_severity', 'Less Serious', 'Victim_Sex', 
                       'More Serious').data
sex_table
Out[18]:
Less Serious More Serious Total % More Serious
Victim_Sex
F 39936 28565 68501 41.70
M 38615 69370 107985 64.24
X 4219 2446 6665 36.70
Total 82770 100381 183151 54.81
In [19]:
stacked_plot(10, 10, 10, df, 'Victim_Sex', 'crime_severity', 'Less Serious', 
             'bar', 'Crime Severity by Sex', 'Sex', 'Count', 
             0.9, 0, 'Crime Severity by Sex - Normalized', 'Sex', 'Frequency')
plt.savefig(eda_image_path + '/victim_sex_stacked_bar.png', bbox_inches='tight')

Crime Severity by Time of Day

It is interesting to note that more serious crimes occur more often in the morning (35,396) than at any other time of day, with more serious night crimes accounting for only 9,814 (approximately 10%) of all such crimes.

In [20]:
time_table = cont_table(df, 'crime_severity', 'Less Serious', 
                        'Time_of_Day', 'More Serious').data
time_table
Out[20]:
Less Serious More Serious Total % More Serious
Time_of_Day
Afternoon 29609 31406 61015 51.47
Evening 18547 23765 42312 56.17
Morning 28105 35396 63501 55.74
Night 6509 9814 16323 60.12
Total 82770 100381 183151 54.81
In [21]:
stacked_plot(10, 10, 10, df, 'Time_of_Day', 'crime_severity', 'Less Serious', 
             'bar', 'Time of Day by Crime Severity', 'Time of Day', 'Count', 
             0.9, 0, 'Time of Day by Crime Severity - Normalized', 
             'Time of Day', 'Frequency')
plt.savefig(eda_image_path + '/time_day_stacked_bar.png', bbox_inches='tight')

Crime Severity by Month

June records the most more serious crimes (10,852) of any month, so there is a higher prevalence of more serious crimes mid-year than at any other time of year.

In [22]:
month_table = cont_table(df, 'crime_severity', 'Less Serious', 
                         'Month', 'More Serious').data
month_table
Out[22]:
Less Serious More Serious Total % More Serious
Month
April 8165 9530 17695 53.86
August 6002 7468 13470 55.44
December 4241 5843 10084 57.94
February 7960 9114 17074 53.38
January 8376 9457 17833 53.03
July 6690 8404 15094 55.68
June 8756 10852 19608 55.34
March 8092 9216 17308 53.25
May 8841 10548 19389 54.40
November 4785 6273 11058 56.73
October 5435 7018 12453 56.36
September 5427 6658 12085 55.09
Total 82770 100381 183151 54.81
In [23]:
stacked_plot(15, 10, 10, df, 'Month', 'crime_severity', 'Less Serious', 
             'bar', 'Month by Crime Severity', 'Month', 'Count', 
             0.9, 0, 'Month by Crime Severity - Normalized', 
             'Month', 'Frequency')
plt.savefig(eda_image_path + '/month_stacked_bar.png', bbox_inches='tight')

Crime Severity by Victim Descent

In terms of ethnicity, members of the Hispanic/Latin/Mexican demographic account for 51,601 more serious crimes. With an additional 40,226 less serious crimes, this demographic accounts for 91,827 crimes in total, roughly half (50.14%) of all crimes in the data.

In [24]:
descent_table = cont_table(df, 'crime_severity', 'Less Serious', 
                           'Victim_Desc', 'More Serious').data
descent_table
Out[24]:
Less Serious More Serious Total % More Serious
Victim_Desc
American Indian/Alaskan Native 28.00 16.00 44.00 36.36
Black 16,598.00 26,092.00 42,690.00 61.12
Filipino 23.00 33.00 56.00 58.93
Guamanian 16.00 4.00 20.00 20.00
Hispanic/Latin/Mexican 40,226.00 51,601.00 91,827.00 56.19
Japanese 5.00 9.00 14.00 64.29
Korean 129.00 162.00 291.00 55.67
Other 5,043.00 4,877.00 9,920.00 49.16
Other Asian 1,485.00 1,541.00 3,026.00 50.93
Samoan 6.00 0.00 6.00 0.00
Unknown 4,948.00 3,375.00 8,323.00 40.55
Vietnamese 4.00 7.00 11.00 63.64
White 14,259.00 12,650.00 26,909.00 47.01
Chinese NaN 9.00 9.00 0.00
Hawaiian NaN 1.00 1.00 0.00
Laotian NaN 4.00 4.00 0.00
Total 82,770.00 100,381.00 183,151.00 54.81
In [25]:
stacked_plot(10,10, 10, df, 'Victim_Desc', 'crime_severity', 'Less Serious', 
             'barh', 'Victim Descent by Crime Severity', 'Count', 
             'Victim Description', 0.9, 0, 
             'Victim Descent by Crime Severity - Normalized', 
             'Frequency', 'Victim Descent')
plt.savefig(eda_image_path + '/victim_desc_stacked_bar.png', bbox_inches='tight')

Crime Severity by Neighborhood

In terms of neighborhoods based on police districts, the 77th Street region shows the highest count of more serious crimes (14,350) of any district; the Southeast area is second (10,142).

In [26]:
area_table = cont_table(df, 'crime_severity', 'Less Serious', 
                        'AREA_NAME', 'More Serious').data
area_table
Out[26]:
Less Serious More Serious Total % More Serious
AREA_NAME
77th_Street 7330 14350 21680 66.19
Central 7733 9395 17128 54.85
Devonshire 2398 1846 4244 43.50
Foothill 2347 2421 4768 50.78
Harbor 3459 3735 7194 51.92
Hollenbeck 3102 4113 7215 57.01
Hollywood 5058 5129 10187 50.35
Mission 2716 3278 5994 54.69
N_Hollywood 3789 3039 6828 44.51
Newton 5589 8309 13898 59.79
Northeast 2692 2803 5495 51.01
Olympic 5150 5203 10353 50.26
Pacific 2814 1729 4543 38.06
Rampart 5086 6397 11483 55.71
Southeast 5237 10142 15379 65.95
Southwest 4844 6126 10970 55.84
Topanga 2155 1933 4088 47.28
Van_Nuys 3035 2769 5804 47.71
West_LA 2478 1164 3642 31.96
West_Valley 2505 2544 5049 50.39
Wilshire 3253 3956 7209 54.88
Total 82770 100381 183151 54.81
In [27]:
stacked_plot(10, 10, 10, df, 'AREA_NAME', 'crime_severity', 'Less Serious', 
             'barh', 'Neighborhood by Crime Severity', 'Count', 'Neighborhood', 
             0.9, 0, 'Neighborhood by Crime Severity - Normalized', 'Frequency', 
             'Neighborhood')
plt.savefig(eda_image_path + '/area_stacked_bar.png', bbox_inches='tight')

Crime Severity by Premises

It is equally important to note that most crimes (100,487, or roughly 55%) occur on the street, with 57.51% of those attributed to more serious crimes.

In [28]:
premis_table = cont_table(df, 'crime_severity', 'Less Serious', 
                          'Premises', 'More Serious').data
premis_table
Out[28]:
Less Serious More Serious Total % More Serious
Premises
Alley 2199 3312 5511 60.10
Driveway 2521 1742 4263 40.86
Park_Playground 2156 2076 4232 49.05
Parking_Lot 10503 10064 20567 48.93
Pedestrian_Overcrossing 5 9 14 64.29
Sidewalk 22634 25313 47947 52.79
Street 42700 57787 100487 57.51
Tunnel 13 16 29 55.17
Vacant_Lot 39 62 101 61.39
Total 82770 100381 183151 54.81
In [29]:
stacked_plot(10, 10, 10, df, 'Premises', 'crime_severity', 'Less Serious', 'barh', 
             'Premises by Crime Severity', 'Count', 'Premises', 0.9, 0, 
             'Premises by Crime Severity - Normalized', 'Frequency', 
             'Premises')
plt.savefig(eda_image_path + '/premises_stacked_bar.png', bbox_inches='tight')

Histogram Distributions Colored by Crime Code Target

In [30]:
# read in the train set since we are inspecting ground truth only on this subset
# to avoid exposing information from the unseen data 
train_set = pd.read_csv(train_path).set_index('OBJECTID')
In [31]:
# Histogram Distributions Colored by Crime Code Target
def colored_hist(target, nrows, ncols, x, y, w_pad, h_pad):
    '''
    This function shows a histogram for the entire dataframe, colored by 
    the ground truth column.
    Inputs:
        target: ground truth column
        nrows: number of histogram rows to include in plot
        ncols: number of histogram cols to include in plot
        x: x-axis figure size
        y: y-axis figure size
        w_pad: width padding for plot
        h_pad: height padding for plot
    '''
    # feature columns to enumerate (drop the ground-truth column)
    features = train_set.drop(columns=[target])
    # set the plot size dimensions
    fig, axes = plt.subplots(nrows, ncols, figsize=(x, y))
    ax = axes.flatten()
    for i, col in enumerate(features):
        # pivot on the target so each class gets its own stacked histogram
        train_set.pivot(columns=target)[col].plot(kind='hist', density=True,
                                                  stacked=True, ax=ax[i])
        ax[i].set_title(col)
        ax[i].set_xlabel('Values')
        ax[i].legend(loc='upper right')
        
    fig.tight_layout(w_pad=w_pad, h_pad=h_pad)
In [32]:
colored_hist('Crime_Code', 5, 7, 25, 20, 6, 12)
plt.savefig(eda_image_path + '/colored_hist.png', bbox_inches='tight')

Examining Possible Correlations

In [33]:
# this function is defined and used only once; hence, it remains in 
# this notebook
def corr_plot(df, x, y):
    '''
    This function plots a correlation matrix for the dataframe
    Inputs:
        df: dataframe to ingest into the correlation matrix plot
        x: x-axis size
        y: y-axis size
    '''
    # correlation matrix title (bold escape code, then reset)
    print("\033[1m" + 'LA Crime Data: Correlation Matrix'
                    + "\033[0m")
    # compute the Pearson correlation matrix once
    corr = df.corr(method='pearson')
    matrix = np.triu(corr)  # mask for the upper triangle
    plt.figure(figsize=(x, y))
    # pass the precomputed corr, masking the upper triangle
    sns.heatmap(corr,
                annot=True, linewidths=.5, cmap='coolwarm', mask=matrix,
                square=True,
                cbar_kws={'label': 'Correlation Index'})
In [34]:
# subset train set without index into new corr_df dataframe
corr_df = train_set.reset_index(drop=True) 
   
# plot the correlation matrix
corr_plot(corr_df, 25, 25)
plt.savefig(eda_image_path + '/correlation_plot.png', bbox_inches='tight')
LA Crime Data: Correlation Matrix

No multicollinearity is detected at a threshold of |r| = 0.75.
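This threshold can be checked programmatically. The hypothetical snippet below lists any feature pair in the upper triangle of the correlation matrix whose absolute Pearson correlation meets 0.75; per the statement above, none are expected.

# flag feature pairs with |r| >= 0.75 in the upper triangle of the matrix
corr_abs = corr_df.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs >= 0.75]
print(high if not high.empty else 'No feature pairs with |r| >= 0.75')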