Neural Networks and Machine Learning in R

Leonid Shpaner

Instructions:

In this project, you will:

  • Examine common architectures and activation functions for neural networks
  • Examine how to optimize the parameters in a neural network
  • Make predictions using neural networks in R
  • Practice Deep Learning using R
  • Apply cross-validation to neural network model development and validation
  • Tune parameters in a neural network using a grid search
  • Use the lime package in R to identify which variables are driving the recommendations your neural network is making

Part One - Projecting Data for Neural Networks

In this part of the project, you will project data for neural networks using the file ElectionData.csv, which contains the fraction of votes by county earned by President Trump and Secretary Clinton in the 2016 US Presidential election, sorted by county FIPS code (FIPS stands for Federal Information Processing Standards and is a geographic identifier). The data file also includes several variables that may be useful to predict these fractions. Use the data to develop a neural network model for either individual.

This part of the project requires some work in RStudio, located on the project page in Canvas. Use that space, along with the provided script and data file, to perform the work, then use this document to answer questions on what you discover.

  1. How well did your model predict the election results?
  2. Before discussing the model’s predictive ability, it is necessary to first conduct a comprehensive analysis in which the features of the dataset are examined in the following manner.
    1. Five highly correlated predictors above a threshold of r = 0.75 are omitted from the model: Percent.White.Not.Hispanic, Percent.foreign.born, PercentLangDiffEnglish, PercentWhite, and Bachlorsorhigher.
    2. Correlation: various economic indicators may shed light on election results. These important variable relationships are examined in the ensuing scatter plot diagrams. Plot A demonstrates a strong negative correlation (r=-0.98) between Clinton’s fraction of votes and Trump’s. In Plot B, as income per capita decreases, the percent below poverty increases, so there exists a strong negative relationship (r=-0.73) between the two. Plot C shows a weak (almost moderate) negative relationship (r=-0.34) between home ownership and percent below poverty; as the percent below poverty increases, home ownership decreases. Lastly, persons per house shares only a low correlation (r=0.21) with percent below poverty, so little can be concluded about that relationship.
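
      These relationships, and the r = 0.75 filter described in the previous item, might be reproduced with a minimal sketch along the following lines (the data frame name election and the use of the caret package are assumptions, not the report's actual code):

      # Sketch: correlation matrix and high-correlation filter (assumed object names)
      library(caret)
      election <- read.csv("ElectionData.csv")

      # Correlation matrix over the numeric predictors only,
      # excluding the response columns (names assumed) before screening
      num_cols <- sapply(election, is.numeric)
      pred_mat <- election[, num_cols & !(names(election) %in% c("Trump", "Clinton"))]
      cor_mat  <- cor(pred_mat, use = "pairwise.complete.obs")

      # Columns whose pairwise correlation exceeds the r = 0.75 cutoff
      high_corr <- findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)
      high_corr                                    # expected to flag the five columns above
      election  <- election[, !(names(election) %in% high_corr)]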

    3. The data undergoes a 70:30 train-test split, with 70% allocated to the training set and 30% allocated to the test set.
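
      A minimal sketch of how such a split might be produced (the seed and the object names trainingdata and testingdata are assumptions; trainingdata matches the name used in the glm() call below):

      set.seed(123)                                 # assumed seed for reproducibility
      n         <- nrow(election)
      train_idx <- sample(seq_len(n), size = round(0.70 * n))
      trainingdata <- election[train_idx, ]         # roughly 70% of the rows
      testingdata  <- election[-train_idx, ]        # the remaining ~30%
      dim(trainingdata); dim(testingdata)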

    4. Training Dimensions: 2186 27  
      Testing Dimensions: 957 27  
      Training Dimensions Percentage: 0.7  
      Testing Dimensions Percentage: 0.3 
      


    5. Prior to calling on the neural network, a generalized linear model is fitted to the training data, and the following output is obtained. Most features are statistically significant at the \(\alpha\) = 0.05 significance level, although several (for example, Population, PercentNativeAmerican, and IncomeperCapita) are not. The mean absolute error between the true and fitted values was 0.1349895 (fraction of votes), a fairly small error, suggesting that the model's predictions are reasonable.

    6. Call:
      glm(formula = Trump ~ ., data = trainingdata)
      
      Deviance Residuals: 
             Min          1Q      Median          3Q         Max  
      -9.992e-16  -3.331e-16  -1.110e-16   2.220e-16   1.166e-15  
      
      Coefficients: (2 not defined because of singularities)
                               Estimate Std. Error    t value Pr(>|t|)    
      (Intercept)             1.112e-17  3.416e-16  3.300e-02 0.974031    
      fips                    1.419e-21  5.556e-22  2.555e+00 0.010698 *  
      US.regions             -1.938e-16  3.539e-17 -5.477e+00 4.83e-08 ***
      PacificCoast           -4.894e-16  1.230e-16 -3.980e+00 7.11e-05 ***
      MountainsPlains        -5.678e-16  8.248e-17 -6.885e+00 7.57e-12 ***
      South                  -5.426e-16  5.116e-17 -1.061e+01  < 2e-16 ***
      Midwest                        NA         NA         NA       NA    
      Northeast                      NA         NA         NA       NA    
      Population              5.198e-23  4.696e-23  1.107e+00 0.268464    
      PercentFemale           1.303e-17  3.783e-18  3.444e+00 0.000583 ***
      PercentAfricanAmerican  1.064e-17  8.595e-19  1.238e+01  < 2e-16 ***
      PercentNativeAmerican   3.026e-19  1.416e-18  2.140e-01 0.830823    
      PercentAsian            7.091e-18  4.264e-18  1.663e+00 0.096505 .  
      PercentHawaianPI        4.762e-18  8.261e-18  5.760e-01 0.564363    
      PercentTwoorMore       -5.311e-18  7.205e-18 -7.370e-01 0.461098    
      PercentHispanic        -4.935e-19  8.324e-19 -5.930e-01 0.553334    
      HighSchoolGrad         -1.252e-17  1.774e-18 -7.054e+00 2.32e-12 ***
      HomeOwnership          -3.491e-18  1.367e-18 -2.554e+00 0.010711 *  
      PersonsPerHouse        -3.482e-17  4.271e-17 -8.150e-01 0.415082    
      IncomeperCapita        -1.327e-21  3.025e-21 -4.380e-01 0.661080    
      PercentBelowPoverty    -6.857e-18  2.609e-18 -2.629e+00 0.008636 ** 
      PopPerSqMile            1.119e-20  8.475e-21  1.320e+00 0.187033    
      output_train            1.000e+00  5.303e-17  1.886e+16  < 2e-16 ***
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      
      (Dispersion parameter for gaussian family taken to be 1.43169e-31)
      
          Null deviance: 6.6136e+01  on 2185  degrees of freedom
      Residual deviance: 3.0996e-28  on 2165  degrees of freedom
      AIC: -149026
      
      Number of Fisher Scoring iterations: 1
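
      For reference, a minimal sketch of how the mean absolute errors quoted in this report might be computed for the fitted GLM (testingdata is the assumed name of the held-out set):

      glm_fit   <- glm(Trump ~ ., data = trainingdata)

      # In-sample (training) MAE between the true and fitted fractions of votes
      mae_train <- mean(abs(trainingdata$Trump - fitted(glm_fit)))

      # Out-of-sample (test) MAE
      pred_test <- predict(glm_fit, newdata = testingdata)
      mae_test  <- mean(abs(testingdata$Trump - pred_test))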
      
  1. Do you think your model will generalize well to new data? Why or why not?
  2. The model should generalize well to new data because the mean absolute error on the out-of-sample (test) data is 0.1285449, close to the training MAE of 0.1362437, a negligible difference of 0.007698762.

  3. What could you do to improve the model?
  4. Apart from eliminating the highly correlated predictors, additional feature-reduction techniques such as Principal Component Analysis (PCA) can be applied (a minimal sketch appears below). Furthermore, the number of hidden layers and nodes can be changed, and the model hyperparameters tuned.
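
    As an illustration, a minimal PCA sketch on the numeric predictors (the object names and the choice of ten components are assumptions):

    # Principal components on the scaled numeric predictors (excluding the response)
    preds   <- trainingdata[, setdiff(names(trainingdata), "Trump")]
    pca_fit <- prcomp(preds[, sapply(preds, is.numeric)],
                      center = TRUE, scale. = TRUE)
    summary(pca_fit)                                # proportion of variance explained per component
    train_pcs <- as.data.frame(pca_fit$x[, 1:10])   # keep, say, the first ten components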

  5. In the space below, describe your neural network structure (i.e., how many hidden layers were in the network?). How many nodes are in each hidden layer? Which activation function did you use? Which independent variables did you include?
  6. The neural network structure includes 2 hidden layers with 5 and 3 nodes, respectively:
    library(neuralnet)
    nn_train <- neuralnet(f_train, data = trainingdata, hidden = c(5, 3), linear.output = TRUE)
    


    Because linear.output was set to TRUE, no activation function was applied to the output node, which is appropriate for a regression target; the hidden layers used the neuralnet package's default logistic activation (act.fct = "logistic"). As corroborated in the prior answers, all independent variables were used, except for the five highly correlated predictors that were identified via the correlation matrix.
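
    A minimal sketch of how this network might be used to score the held-out data and compute the test MAE reported above (compute() is from the neuralnet package; object names are assumptions):

    # Predictor columns are taken from the same formula used to train the network
    test_X  <- testingdata[, all.vars(f_train)[-1]]
    nn_pred <- compute(nn_train, test_X)$net.result

    # Out-of-sample mean absolute error for the fraction of votes
    mean(abs(testingdata$Trump - nn_pred))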

Part Two

In this part of the course project, you will use H2O and LIME for neural network predictions on a data set predicting the sale prices of houses. This part of the project requires some work in RStudio, located on the project page in Canvas. Use that space, along with the provided script and data file, to perform the work, then use this document to record your accuracy measures.

Additional information:
Many attributes in this dataset are categorical. For example, ASSESSMENT_NBHD (assessment neighborhood) will need to be converted from a neighborhood name to numerical data. Consider carefully how you might do this. There are also a lot of variables, and it will be easy to overfit your model. To this end, the h2o.deeplearning() function in H2O has an argument called l2. Investigate this setting and see if you find it valuable. There is also a function h2o.varimp_plot() that will plot the relative importance of all the inputs to your model. That might help you decide how to select variables for inclusion in your final model. When constructing your model, also consider the following:

  • Explore the data set with scatter plots and compute the correlation matrix.
  • Split the entire dataset into training and test sets, and convert both into H2O data frames. The test set should contain a random sample of roughly 30% of the entire data.
  • Select several data points from the training data set to analyze with LIME later on.
  • Fit the neural network using h2o.deeplearning()
    • Decide which independent variables to include.
    • Decide how many hidden layers to include in the network.
    • Decide how many nodes will be in each hidden layer.
    • Decide which activation function to use (for reference, h2o supports the following: “Tanh”, “TanhWithDropout”, “Rectifier”, “RectifierWithDropout”, “Maxout”, “MaxoutWithDropout” ).
    • Specify whether or not to use an adaptive learning rate (argument: adaptive_rate).
    • L1 regularization can result in many weights being set to 0, add stability, and improve generalization. Set the l1 argument (larger values correspond to more regularization).
    • L2 regularization can result in many weights being set to small values, add stability, and improve generalization. Set the l2 argument (larger values correspond to more regularization). Note: if the L1 coefficient is higher than the L2 coefficient, the algorithm will favor L1 regularization and vice versa.
    • Specify how many epochs the training algorithm should run for.
    • Set the random seed argument (seed) to guarantee the same results each time you run the training algorithm.
  • Plot the training and validation loss vs. the training epochs.
  • Use the summary() function to examine the fit model.
  • Use the data points selected earlier to compute and analyze the neural network’s predictions using the lime() function from the lime package.
    • Visualize the results of this analysis.
  • In this course project document, analyze the model and summarize your findings. How well does your model predict house price? Do you think it will generalize well to new data? Which variables ended up being most important? What could be done to improve the model?
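
As a point of reference, here is a minimal sketch of the data preparation steps listed above (the file name housing.csv and the response column PRICE are assumptions):

    library(h2o)
    h2o.init()

    housing <- read.csv("housing.csv", stringsAsFactors = TRUE)  # factor columns become categoricals in H2O
    housing_h2o <- as.h2o(housing)

    # Roughly 70/30 split into training and test H2O frames
    splits <- h2o.splitFrame(housing_h2o, ratios = 0.7, seed = 123)
    train  <- splits[[1]]
    test   <- splits[[2]]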
  1. How well did your model predict the house prices?
  2. First and foremost, it is important to discuss data exploration. Typically, modeling commences after the data has been sufficiently explored. However, in this exercise, the data is modeled and explored in an alternating fashion. The dataset consists of 108 variables which are subsequently reduced to 90.

    1. Estimate the Deep Neural Network
      The figure below shows the relationship between mean absolute error and the number of epochs for the training and validation sets, respectively. Generally, as the number of epochs increases, the MAE decreases, but the gap between the training and validation curves widens. The neural network is tuned with the Tanh activation function over 1,000 epochs, with four hidden layers of four nodes each, an l1 rate of 0.0001, and an l2 rate of 0.01. Moreover, the adaptive_rate, variable_importances, and reproducible hyperparameters are set to TRUE. Cross-validation is carried out over 3 folds.
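
      A minimal sketch of the h2o.deeplearning() call implied by these settings (the predictor and response names, and the use of the test split as the validation frame, are assumptions):

      predictors <- setdiff(names(train), "PRICE")

      dl_model <- h2o.deeplearning(
        x = predictors, y = "PRICE",
        training_frame = train,
        validation_frame = test,        # held-out frame used for the validation curve
        nfolds = 3,                     # 3-fold cross-validation
        hidden = c(4, 4, 4, 4),         # four hidden layers of four nodes each
        activation = "Tanh",
        epochs = 1000,
        l1 = 1e-4, l2 = 0.01,
        adaptive_rate = TRUE,
        variable_importances = TRUE,
        reproducible = TRUE,
        seed = 123
      )

      h2o.scoreHistory(dl_model)                      # error by epoch
      h2o.mae(dl_model, train = TRUE, valid = TRUE)   # training vs. validation MAE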

    2. Additional Exploratory Data Analysis (EDA)
    3. The 20 most important variables will be taken into consideration, but scatter plots on the full dataset (not just the training set) are created only for columns with quantitative, continuous values. Price vs. rooms and price vs. bathrooms both exhibit low correlations (r=0.16 and r=-0.21, respectively), so there is not much to conclude here.

    4. Correlation Matrix
      The following correlation matrix is compiled from the top 20 features and does not show any between-predictor relationships above the r=0.75 threshold.
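
      A minimal sketch of how the top-20 features and their correlation matrix might be assembled (object names follow the sketches above; column names are assumed):

      # Names of the 20 most important predictors from the fitted model
      vi    <- as.data.frame(h2o.varimp(dl_model))
      top20 <- head(vi$variable, 20)

      # Correlation matrix restricted to the numeric members of the top 20
      num_top <- housing[, intersect(top20, names(housing)), drop = FALSE]
      num_top <- num_top[, sapply(num_top, is.numeric), drop = FALSE]
      round(cor(num_top, use = "pairwise.complete.obs"), 2)

      # Example scatter plot for one continuous pair discussed above
      plot(housing$BATHROOMS, housing$PRICE, xlab = "Bathrooms", ylab = "Price")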

    5. Re-estimate the Deep Neural Network
    6. The model is re-trained with the top 20 features over the same hyperparameters as the original model. There exists a narrower gap between the training and validation MAE scores over roughly the first twenty epochs. However, the gap progressively widens.

    7. LIME Analysis
    8. Using only the top 20 input features and the lime package, the local surrogate models explain a substantial amount of the variation according to their \(R^2\) values: the fit starts at 0.68, rises to a high of 0.71, steps down to roughly 0.65, and finally drops to 0.43, a moderate amount of variation.

      ##  case   model_r2   model_intercept   model_prediction    feature     prediction 
      ## ------ ---------- ----------------- ------------------ ------------ ------------
      ##  5009    0.6843        517791             513531           Ward        617066   
      ##  5009    0.6843        517791             513531        BATHROOMS      617066   
      ##  5009    0.6843        517791             513531        FIREPLACES     617066   
      ##  5009    0.6843        517791             513531         LANDAREA      617066   
      ##  480     0.7096        729583             304552           Ward        253949   
      ##  480     0.7096        729583             304552        BATHROOMS      253949   
      ##  480     0.7096        729583             304552        FIREPLACES     253949   
      ##  480     0.7096        729583             304552         LANDAREA      253949   
      ##  7067    0.6472        430790             640012           Ward        628222   
      ##  7067    0.6472        430790             640012        BATHROOMS      628222   
      ##  7067    0.6472        430790             640012        FIREPLACES     628222   
      ##  7067    0.6472        430790             640012         LANDAREA      628222   
      ##  5089    0.664         418250             692433           Ward        568733   
      ##  5089    0.664         418250             692433        BATHROOMS      568733   
      ##  5089    0.664         418250             692433        FIREPLACES     568733   
      ##  5089    0.664         418250             692433         LANDAREA      568733   
      ##  4938    0.4334        401948             824680           Ward        824680   
      ##  4938    0.4334        401948             824680           EYB         824680   
      ##  4938    0.4334        401948             824680        BATHROOMS      824680   
      ##  4938    0.4334        401948             824680         YearSold      824680   
      


      Moreover, the mean price prediction of $578,529.80 differs by only $19,528.26 from the actual mean housing price of $598,058.10. That is an almost negligible difference of approximately 3%, so the model predicted well.

      The ensuing plots show how the selected features explain the fit for the five randomly chosen cases, with a mean local \(R^2\) of roughly 63%.
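
      A minimal sketch of how these LIME explanations might be produced for an H2O model (the case selection and object names are assumptions):

      library(lime)

      # Explainer built on the training predictors as a plain data frame
      train_df  <- as.data.frame(train)[, predictors]
      explainer <- lime(train_df, dl_model)

      # Explain a handful of held-out cases with four features each
      cases       <- as.data.frame(test)[1:5, predictors]
      explanation <- explain(cases, explainer, n_features = 4)
      plot_features(explanation)   # one panel per case, showing feature weights and the local R^2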

  1. Do you think your model will generalize well to new data? Why or why not?
  2. The 613.31 difference in mean absolute error between the training (80,587.15) and validation (81,200.46) sets is negligible. However, as the epochs progress, that gap becomes wider, so the model will not generalize well to new data unless the number of epochs is limited.

  3. Which variables ended up being the most important and why do you think this is so?
  4. Using the h2o.varimp() function call, the following variables ended up being most important:

    ## Variable Importances: 
    ##           variable relative_importance scaled_importance percentage
    ## 1             Ward            1.000000          1.000000   0.162422
    ## 2              EYB            0.584610          0.584610   0.094954
    ## 3         OldCity1            0.581093          0.581093   0.094382
    ## 4        BATHROOMS            0.456646          0.456646   0.074170
    ## 5               NE            0.403440          0.403440   0.065528
    ## 6            Ward4            0.322812          0.322812   0.052432
    ## 7      CapitolHill            0.273834          0.273834   0.044477
    ## 8            Ward7            0.268182          0.268182   0.043559
    ## 9         LANDAREA            0.250221          0.250221   0.040641
    ## 10        OldCity2            0.237900          0.237900   0.038640
    ## 11 CongressHeights            0.221452          0.221452   0.035969
    ## 12       RowInside            0.216636          0.216636   0.035187
    ## 13        Deanwood            0.206530          0.206530   0.033545
    ## 14          RowEnd            0.198102          0.198102   0.032176
    ## 15      FIREPLACES            0.194298          0.194298   0.031558
    ## 16           Ward3            0.185223          0.185223   0.030084
    ## 17        YearSold            0.178580          0.178580   0.029005
    ## 18       RiggsPark            0.165993          0.165993   0.026961
    ## 19       Eckington            0.108953          0.108953   0.017696
    ## 20           Ward5            0.102287          0.102287   0.016614
    


    Features such as the number of bathrooms, land area, and the neighborhood indicators (e.g., Ward), among others, are important predictors in real estate data because they directly drive price.
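
    The same ranking can be visualized directly; a minimal sketch using the model object assumed above:

    # Bar plot of the scaled variable importances
    h2o.varimp_plot(dl_model, num_of_features = 20)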

  5. What could you do to improve the model?
  6. Additional preprocessing, such as one-hot encoding other columns with many classes, could prove beneficial. Understanding the dataset better would also lead to better preprocessing, which would ultimately lead to a better model. Additional iterations with 5-fold cross-validation and a more robust hyperparameter tuning mechanism, such as grid search, could also be implemented (a sketch of such a search appears below). Grid search on a CPU with 4 cores and 8 logical processors at a base speed of 2.30 GHz would be prohibitively slow, so a more powerful machine with capable GPUs may be necessary.
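
    A minimal sketch of what such a grid search might look like with h2o.grid() (the hyperparameter values are illustrative assumptions):

    # Candidate hyperparameter values to sweep over
    hyper_params <- list(
      hidden     = list(c(4, 4), c(8, 8), c(16, 16)),
      l2         = c(1e-3, 1e-2),
      activation = c("Tanh", "Rectifier")
    )

    dl_grid <- h2o.grid(
      algorithm      = "deeplearning",
      grid_id        = "dl_grid",
      x              = predictors, y = "PRICE",
      training_frame = train,
      nfolds         = 5,            # 5-fold cross-validation, as suggested above
      epochs         = 100,
      seed           = 123,
      hyper_params   = hyper_params
    )

    # Models ranked by cross-validated MAE
    h2o.getGrid("dl_grid", sort_by = "mae", decreasing = FALSE)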

  7. In the space below, describe your neural network structure (i.e., which independent variables did you use?). How many hidden layers were in the network? How many nodes were in each hidden layer? Which activation function did you use? Did you use an adaptive learning rate? How many epochs did the training algorithm run for?
  8. The final (third-iteration) model’s neural network structure includes all twenty of the most important features listed in the answer above. Two hidden layers with two nodes each were used, with the Tanh activation function. The adaptive learning rate (adaptive_rate) was passed in and set to TRUE, even though that is its default value, and the model was trained for 1,000 epochs.