Exploratory Data Analysis - Phase II

Team Members: Leonid Shpaner, Christopher Robinson, and Jose Luis Estrada


This notebook takes a more granular look at the columns of interest using boxplots, stacked bar graphs, and histograms, and culminates in a correlation matrix.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
Mounted at /content/drive
In [2]:
%cd /content/drive/Shared drives/Capstone - Best Group/GitHub Repository/navigating_crime/Code Library
/content/drive/Shared drives/Capstone - Best Group/GitHub Repository/navigating_crime/Code Library
In [3]:
####################################
## import the requisite libraries ##
####################################
import os
import csv
import pandas as pd
import numpy as np

# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
# suppress all warnings (including FutureWarning) for cleaner output
warnings.filterwarnings('ignore')
In [4]:
# check current working directory
current_directory = os.getcwd()
current_directory
Out[4]:
'/content/drive/Shared drives/Capstone - Best Group/GitHub Repository/navigating_crime/Code Library'

Assign Paths to Folders

In [5]:
# path to the data file
data_frame = '/content/drive/Shareddrives/Capstone - Best Group/' \
           + 'Final_Data_20220719/df.csv'

# path to data folder
data_folder = '/content/drive/Shareddrives/Capstone - Best Group/' \
            +  'GitHub Repository/navigating_crime/Data Folder/'

# path to the training file
train_path = '/content/drive/Shareddrives/Capstone - Best Group/' \
           + 'GitHub Repository/navigating_crime/Data Folder/train_set.csv'

# path to the image library
eda_image_path = '/content/drive/Shareddrives/Capstone - Best Group/' \
               + 'GitHub Repository/navigating_crime/Image Folder/EDA Images'
In [6]:
# bring in original dataframe as preprocessed in the 
# data_preparation.ipynb file
df = pd.read_csv(data_frame, low_memory=False).set_index('OBJECTID')
In [7]:
# re-inspect the shape of the dataframe. 
print('There are', df.shape[0], 'rows and', df.shape[1], 
      'columns in the dataframe.')
There are 183151 rows and 125 columns in the dataframe.

Age Range Statistics

The top three age ranges of crime victims are 25-30, 20-25, and 30-35, reporting 25,792, 22,235, and 21,801 crimes, respectively.

In [8]:
# this bar_plot library was created as a bar_plot.py file during the EDA
# Phase I stage; it can be accessed in that respective notebook
from functions import bar_plot
bar_plot(15, 10, df, False, 'bar', 'Bar Graph of Age Ranges', 0, 
         "Victims' Age Range", 'Count', 'age_bin', 100)
plt.savefig(eda_image_path + '/age_range_bargraph.png', bbox_inches='tight')
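For reference, the figure above can be approximated without the shared helper. The following is a minimal, hypothetical stand-in for bar_plot using pandas' built-in plotting; it assumes age_bin is an ordered categorical (otherwise sort_index will order the bins lexicographically).

# minimal stand-in for bar_plot (the real helper is shared across notebooks);
# plot the value counts of the age_bin column as a bar chart
ax = df['age_bin'].value_counts().sort_index().plot(
    kind='bar', figsize=(15, 10), rot=0)
ax.set_title('Bar Graph of Age Ranges')
ax.set_xlabel("Victims' Age Range")
ax.set_ylabel('Count')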

Contingency Table

A contingency table summarizes the data in any column of interest against the values in the target column (crime severity).

In [9]:
from functions import cont_table
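The actual cont_table implementation lives in functions.py; as a hedged sketch of the idea, an equivalent can be built on pd.crosstab, producing per-class counts, row totals, and the share of the more serious class (cont_table_sketch and its signature below are illustrative, not the project's API).

def cont_table_sketch(df, target, neg_label, column, pos_label):
    # cross-tabulate the column of interest against the target,
    # appending row and column totals
    table = pd.crosstab(df[column], df[target],
                        margins=True, margins_name='Total')
    # share of the positive ('More Serious') class per row
    table['% ' + pos_label] = (100 * table[pos_label]
                               / table['Total']).round(2)
    return table

# e.g., cont_table_sketch(df, 'crime_severity', 'Less Serious',
#                         'age_bin', 'More Serious')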

Summary Statistics

Calling summ_stats from the functions.py library provides summary statistics for any numeric column in the dataframe, grouped by a categorical column of interest.

In [10]:
from functions import summ_stats
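Likewise, the real summ_stats is defined in functions.py; assuming it simply aggregates a numeric column grouped by a categorical one, a minimal sketch might look like this (summ_stats_sketch is illustrative only):

def summ_stats_sketch(df, group_col, num_col):
    # group the numeric column by the categorical one, then aggregate
    stats = df.groupby(group_col)[num_col].agg(
        ['mean', 'median', 'std', 'min', 'max']).round(2)
    stats.columns = ['Mean', 'Median', 'Standard Deviation',
                     'Minimum', 'Maximum']
    return stats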

Status Description by Age

In [11]:
summ_stats(df, 'Status_Desc', 'Vict_Age')
Summary Statistics by Age
Out[11]:
Mean Median Standard Deviation Minimum Maximum
Status_Desc
Adult Arrest 33.00 32.00 17.85 0.00 99.00
Adult Other 35.69 34.00 15.24 0.00 99.00
Invest Cont 34.64 33.00 16.96 0.00 120.00
Juv Arrest 26.26 23.00 16.39 0.00 81.00
Juv Other 23.35 17.00 16.01 0.00 76.00

Victim Sex by Age

In [12]:
summ_stats(df, 'Victim_Sex', 'Vict_Age')
Summary Statistics by Age
Out[12]:
Mean Median Standard Deviation Minimum Maximum
Victim_Sex
F 34.44 32.00 15.19 0.00 99.00
M 36.32 35.00 16.46 0.00 99.00
X 5.69 0.00 12.95 0.00 120.00

Stacked Bar Plots

This function provides a stacked and a normalized bar graph of any column of interest, colored by the ground-truth (crime severity) column.

In [13]:
from functions import stacked_plot
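The project's stacked_plot helper takes explicit sizing, labeling, and layout arguments, as the calls below show. As a hedged sketch of the underlying idea, a stripped-down version could draw raw and row-normalized stacked bars side by side (stacked_plot_sketch is illustrative, not the functions.py implementation):

def stacked_plot_sketch(df, col, target, figsize=(15, 5)):
    # two-way counts of the column of interest vs. the target
    counts = pd.crosstab(df[col], df[target])
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)
    # raw stacked counts
    counts.plot(kind='bar', stacked=True, ax=ax1, title='Raw Counts')
    # divide each row by its total so bars show within-group proportions
    counts.div(counts.sum(axis=1), axis=0).plot(
        kind='bar', stacked=True, ax=ax2, title='Normalized')
    fig.tight_layout()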

Crime Severity by Age Group

Crime severity splits at a roughly even ratio within each age group, the exception being the highest age range (115-120), where crime incidence is low but every recorded crime is more serious. Moreover, it is interesting to note that there are 17,611 more serious crimes than less serious ones, with more serious crimes comprising a majority (54.81%) of all cases.

In [14]:
age_table = cont_table(df, 'crime_severity', 'Less Serious', 'age_bin', 
                       'More Serious').data
age_table
Out[14]:
Less Serious More Serious Total % More Serious
age_bin
0-5 6,731.00 5,516.00 12,247.00 45.04
5-10 343.00 387.00 730.00 53.01
10-15 2,329.00 1,932.00 4,261.00 45.34
15-20 6,348.00 8,465.00 14,813.00 57.15
20-25 9,697.00 12,538.00 22,235.00 56.39
25-30 10,975.00 14,817.00 25,792.00 57.45
30-35 9,708.00 12,093.00 21,801.00 55.47
35-40 8,583.00 10,716.00 19,299.00 55.53
40-45 6,744.00 8,802.00 15,546.00 56.62
45-50 5,802.00 6,975.00 12,777.00 54.59
50-55 5,218.00 6,308.00 11,526.00 54.73
55-60 4,360.00 4,950.00 9,310.00 53.17
60-65 3,011.00 3,462.00 6,473.00 53.48
65-70 1,638.00 1,797.00 3,435.00 52.31
70-75 697.00 799.00 1,496.00 53.41
75-80 326.00 445.00 771.00 57.72
80-85 124.00 183.00 307.00 59.61
85-90 65.00 68.00 133.00 51.13
90-95 17.00 38.00 55.00 69.09
95-100 54.00 86.00 140.00 61.43
115-120 NaN 4.00 4.00 0.00
Total 82,770.00 100,381.00 183,151.00 54.81
In [15]:
stacked_plot(15, 10, 10, df, 'age_bin', 'crime_severity', 'Less Serious', 'bar', 
             'Crime Severity by Age Group', 'Age Group', 'Count', 0.9, 0,
             'Crime Severity by Age Group - Normalized', 'Age Group', 
             'Frequency')
plt.savefig(eda_image_path + '/age_crime_stacked_bar.png', bbox_inches='tight')

Crime Severity by Street Type

It is interesting to note that the fewest crimes occur in alleys, pedestrian walkways, private roads (paved or unpaved), and trails, and that, with the exception of private roads, the crimes that do occur in these locations also tend to be less severe.

In [16]:
street_type_table = cont_table(df, 'crime_severity', 'Less Serious', 
                               'Type', 'More Serious').data
street_type_table
Out[16]:
Less Serious More Serious Total % More Serious
Type
Alley 240.00 190.00 430.00 44.19
Minor 41,420.00 49,066.00 90,486.00 54.22
Pedestrian Walkway 211.00 49.00 260.00 18.85
Primary 29,291.00 35,953.00 65,244.00 55.11
Private Road 31.00 38.00 69.00 55.07
Secondary 11,562.00 15,077.00 26,639.00 56.60
Trail 15.00 6.00 21.00 28.57
Unpaved Road NaN 2.00 2.00 0.00
Total 82,770.00 100,381.00 183,151.00 54.81
In [17]:
stacked_plot(15, 10, 10, df, 'Type', 'crime_severity', 'Less Serious', 'bar', 
             'Crime Severity by Street Type', 'Street Type', 'Count', 0.9, 0, 
             'Crime Severity by Street Type - Normalized', 'Street Type',
             'Frequency')
plt.savefig(eda_image_path + '/streettype_stacked_bar.png', bbox_inches='tight')

Crime Severity by Sex

There are more male than female victims in this dataset, and both the raw and normalized distributions show that more serious crimes occur with a higher prevalence among males (69,370, or 64.24%) than among females (28,565, or 41.70%). For victims of unknown sex, the prevalence of more serious crimes is 36.70%.

In [18]:
sex_table = cont_table(df, 'crime_severity', 'Less Serious', 'Victim_Sex', 
                       'More Serious').data
sex_table
Out[18]:
Less Serious More Serious Total % More Serious
Victim_Sex
F 39936 28565 68501 41.70
M 38615 69370 107985 64.24
X 4219 2446 6665 36.70
Total 82770 100381 183151 54.81
In [19]:
stacked_plot(10, 10, 10, df, 'Victim_Sex', 'crime_severity', 'Less Serious', 
             'bar', 'Crime Severity by Sex', 'Sex', 'Count', 
             0.9, 0, 'Crime Severity by Sex - Normalized', 'Sex', 'Frequency')
plt.savefig(eda_image_path + '/victim_sex_stacked_bar.png', bbox_inches='tight')

Crime Severity by Time of Day

It is interesting to note that more serious crimes occur more often in the morning (35,396) than at any other time of day, with more serious night crimes accounting for only 9,814 (approximately 10%) of all such crimes.

In [20]:
time_table = cont_table(df, 'crime_severity', 'Less Serious', 
                        'Time_of_Day', 'More Serious').data
time_table
Out[20]:
Less Serious More Serious Total % More Serious
Time_of_Day
Afternoon 29609 31406 61015 51.47
Evening 18547 23765 42312 56.17
Morning 28105 35396 63501 55.74
Night 6509 9814 16323 60.12
Total 82770 100381 183151 54.81
In [21]:
stacked_plot(10, 10, 10, df, 'Time_of_Day', 'crime_severity', 'Less Serious', 
             'bar', 'Time of Day by Crime Severity', 'Time of Day', 'Count', 
             0.9, 0, 'Time of Day by Crime Severity - Normalized', 
             'Time of Day', 'Frequency')
plt.savefig(eda_image_path + '/time_day_stacked_bar.png', bbox_inches='tight')

Crime Severity by Month

June records the most more serious crimes (10,852) of any month, so there is a higher prevalence of more serious crimes mid-year than at any other time of year.

In [22]:
month_table = cont_table(df, 'crime_severity', 'Less Serious', 
                         'Month', 'More Serious').data
month_table
Out[22]:
Less Serious More Serious Total % More Serious
Month
April 8165 9530 17695 53.86
August 6002 7468 13470 55.44
December 4241 5843 10084 57.94
February 7960 9114 17074 53.38
January 8376 9457 17833 53.03
July 6690 8404 15094 55.68
June 8756 10852 19608 55.34
March 8092 9216 17308 53.25
May 8841 10548 19389 54.40
November 4785 6273 11058 56.73
October 5435 7018 12453 56.36
September 5427 6658 12085 55.09
Total 82770 100381 183151 54.81
In [23]:
stacked_plot(15, 10, 10, df, 'Month', 'crime_severity', 'Less Serious', 
             'bar', 'Month by Crime Severity', 'Month', 'Count', 
             0.9, 0, 'Month by Crime Severity - Normalized', 
             'Month', 'Frequency')
plt.savefig(eda_image_path + '/month_stacked_bar.png', bbox_inches='tight')

Crime Severity by Victim Descent

In terms of ethnicity, members of the Hispanic/Latin/Mexican demographic account for 51,601 more serious crimes. With an additional 40,226 less serious crimes, this demographic accounts for 91,827 crimes in total, roughly half (50.14%) of all crimes in the data.

In [24]:
descent_table = cont_table(df, 'crime_severity', 'Less Serious', 
                           'Victim_Desc', 'More Serious').data
descent_table
Out[24]:
Less Serious More Serious Total % More Serious
Victim_Desc
American Indian/Alaskan Native 28.00 16.00 44.00 36.36
Black 16,598.00 26,092.00 42,690.00 61.12
Filipino 23.00 33.00 56.00 58.93
Guamanian 16.00 4.00 20.00 20.00
Hispanic/Latin/Mexican 40,226.00 51,601.00 91,827.00 56.19
Japanese 5.00 9.00 14.00 64.29
Korean 129.00 162.00 291.00 55.67
Other 5,043.00 4,877.00 9,920.00 49.16
Other Asian 1,485.00 1,541.00 3,026.00 50.93
Samoan 6.00 0.00 6.00 0.00
Unknown 4,948.00 3,375.00 8,323.00 40.55
Vietnamese 4.00 7.00 11.00 63.64
White 14,259.00 12,650.00 26,909.00 47.01
Chinese NaN 9.00 9.00 0.00
Hawaiian NaN 1.00 1.00 0.00
Laotian NaN 4.00 4.00 0.00
Total 82,770.00 100,381.00 183,151.00 54.81
In [25]:
stacked_plot(10,10, 10, df, 'Victim_Desc', 'crime_severity', 'Less Serious', 
             'barh', 'Victim Descent by Crime Severity', 'Count', 
             'Victim Description', 0.9, 0, 
             'Victim Descent by Crime Severity - Normalized', 
             'Frequency', 'Victim Descent')
plt.savefig(eda_image_path + '/victim_desc_stacked_bar.png', bbox_inches='tight')

Crime Severity by Neighborhood

In terms of neighborhoods based on police districts, the 77th Street region shows the highest count of more serious crimes (14,350) of any district; the Southeast area is second (10,142).

In [26]:
area_table = cont_table(df, 'crime_severity', 'Less Serious', 
                        'AREA_NAME', 'More Serious').data
area_table
Out[26]:
Less Serious More Serious Total % More Serious
AREA_NAME
77th_Street 7330 14350 21680 66.19
Central 7733 9395 17128 54.85
Devonshire 2398 1846 4244 43.50
Foothill 2347 2421 4768 50.78
Harbor 3459 3735 7194 51.92
Hollenbeck 3102 4113 7215 57.01
Hollywood 5058 5129 10187 50.35
Mission 2716 3278 5994 54.69
N_Hollywood 3789 3039 6828 44.51
Newton 5589 8309 13898 59.79
Northeast 2692 2803 5495 51.01
Olympic 5150 5203 10353 50.26
Pacific 2814 1729 4543 38.06
Rampart 5086 6397 11483 55.71
Southeast 5237 10142 15379 65.95
Southwest 4844 6126 10970 55.84
Topanga 2155 1933 4088 47.28
Van_Nuys 3035 2769 5804 47.71
West_LA 2478 1164 3642 31.96
West_Valley 2505 2544 5049 50.39
Wilshire 3253 3956 7209 54.88
Total 82770 100381 183151 54.81
In [27]:
stacked_plot(10, 10, 10, df, 'AREA_NAME', 'crime_severity', 'Less Serious', 
             'barh', 'Neighborhood by Crime Severity', 'Count', 'Neighborhood', 
             0.9, 0, 'Neighborhood by Crime Severity - Normalized', 'Frequency', 
             'Neighborhood')
plt.savefig(eda_image_path + '/area_stacked_bar.png', bbox_inches='tight')

Crime Severity by Premises

It is equally important to note that most crimes (100,487, or roughly 55%) occur on the street, with 57.51% of those attributed to more serious crimes.

In [28]:
premis_table = cont_table(df, 'crime_severity', 'Less Serious', 
                          'Premises', 'More Serious').data
premis_table
Out[28]:
Less Serious More Serious Total % More Serious
Premises
Alley 2199 3312 5511 60.10
Driveway 2521 1742 4263 40.86
Park_Playground 2156 2076 4232 49.05
Parking_Lot 10503 10064 20567 48.93
Pedestrian_Overcrossing 5 9 14 64.29
Sidewalk 22634 25313 47947 52.79
Street 42700 57787 100487 57.51
Tunnel 13 16 29 55.17
Vacant_Lot 39 62 101 61.39
Total 82770 100381 183151 54.81
In [29]:
stacked_plot(10, 10, 10, df, 'Premises', 'crime_severity', 'Less Serious', 'barh', 
             'Premises by Crime Severity', 'Count', 'Premises', 0.9, 0, 
             'Premises by Crime Severity - Normalized', 'Frequency', 
             'Premises')
plt.savefig(eda_image_path + '/premises_stacked_bar.png', bbox_inches='tight')

Histogram Distributions Colored by Crime Code Target

In [30]:
# read in the train set since we are inspecting ground truth only on this subset
# to avoid exposing information from the unseen data 
train_set = pd.read_csv(train_path).set_index('OBJECTID')
In [31]:
# Histogram Distributions Colored by Crime Code Target
def colored_hist(target, nrows, ncols, x, y, w_pad, h_pad):
    '''
    This function shows a histogram for the entire dataframe, colored by 
    the ground truth column.
    Inputs:
        target: ground truth column
        nrows: number of histogram rows to include in plot
        ncols: number of histogram cols to include in plot
        x: x-axis figure size
        y: y-axis figure size
        w_pad: width padding for plot
        h_pad: height padding for plot
    '''
    # feature columns to enumerate (drop the ground-truth column)
    features = train_set.drop(columns=[target])
    # set the plot size dimensions
    fig, axes = plt.subplots(nrows, ncols, figsize=(x, y))
    ax = axes.flatten()
    for i, col in enumerate(features):
        # pivot on the target so each class gets its own stacked histogram
        train_set.pivot(columns=target)[col].plot(kind='hist', density=True,
                                                  stacked=True, ax=ax[i])
        ax[i].set_title(col)
        ax[i].set_xlabel('Values')
        ax[i].legend(loc='upper right')
        
    fig.tight_layout(w_pad=w_pad, h_pad=h_pad)
In [32]:
colored_hist('Crime_Code', 5, 7, 25, 20, 6, 12)
plt.savefig(eda_image_path + '/colored_hist.png', bbox_inches='tight')

Examining Possible Correlations

In [33]:
# this function is defined and used only once; hence, it remains in 
# this notebook
def corr_plot(df, x, y):
    '''
    This function plots a correlation matrix for the dataframe
    Inputs:
        df: dataframe to ingest into the correlation matrix plot
        x: x-axis size
        y: y-axis size
    '''
    # correlation matrix title (bold escape code, then reset)
    print("\033[1m" + 'LA Crime Data: Correlation Matrix'
                    + "\033[0m")
    # compute the Pearson correlation matrix once
    corr = df.corr(method='pearson')
    matrix = np.triu(corr)  # mask for the upper triangle
    plt.figure(figsize=(x, y))
    # pass the precomputed corr, masking the upper triangle
    sns.heatmap(corr,
                annot=True, linewidths=.5, cmap='coolwarm', mask=matrix,
                square=True,
                cbar_kws={'label': 'Correlation Index'})
In [34]:
# subset train set without index into new corr_df dataframe
corr_df = train_set.reset_index(drop=True) 
   
# plot the correlation matrix
corr_plot(corr_df, 25, 25)
plt.savefig(eda_image_path + '/correlation_plot.png', bbox_inches='tight')
LA Crime Data: Correlation Matrix

No multicollinearity is detected at a threshold of |r| = 0.75.
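This threshold can be checked programmatically. The hypothetical snippet below lists any feature pair in the upper triangle of the correlation matrix whose absolute Pearson correlation meets 0.75; per the statement above, none are expected.

# flag feature pairs with |r| >= 0.75 in the upper triangle of the matrix
corr_abs = corr_df.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs >= 0.75]
print(high if not high.empty else 'No feature pairs with |r| >= 0.75')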