class: center, middle, inverse, title-slide # Predictive Modeling ## DATA 606 - Statistics & Probability for Data Analytics ### Jason Bryer, Ph.D. and Angela Lui, Ph.D. ### May 1, 2024 --- # One Minute Paper Results .pull-left[ **What was the most important thing you learned during this class?** <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> ] .pull-right[ **What important question remains unanswered for you?** <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] --- class: inverse, middle, center # Classification and Regression Trees (CART) --- # Classification and Regression Trees The goal of CART methods is to find best predictor in X of some outcome, y. CART methods do this recursively using the following procedures: * Find the best predictor in X for y. * Split the data into two based upon that predictor. * Repeat 1 and 2 with the split data sets until a stopping criteria has been reached. There are a number of possible stopping criteria including: Only one data point remains. * All data points have the same outcome value. * No predictor can be found that sufficiently splits the data. --- # Recursive Partitioning Logic of CART .pull-left[ Consider the scatter plot to the right with the following characteristics: * Binary outcome, G, coded “A” or “B”. * Two predictors, x and z * The vertical line at z = 3 creates the first partition. * The double horizontal line at x = -4 creates the second partition. * The triple horizontal line at x = 6 creates the third partition. ] .pull-right[ ![](images/Partitioning.png) ] --- # Tree Structure .pull-left[ * The root node contains the full data set. * The data are split into two mutually exclusive pieces. Cases where x > ci go to the right, cases where x <= ci go to the left. * Those that go to the left reach a terminal node. * Those on the right are split into two mutually exclusive pieces. Cases where z > c2 go to the right and terminal node 3; cases where z <= c2 go to the left and terminal node 2. ] .pull-right[ ![](images/TreeDiagram.png) ] --- # Sum of Squared Errors The sum of squared errors for a tree *T* is: `$$S=\sum _{ c\in leaves(T) }^{ }{ \sum _{ i\in c }^{ }{ { (y-{ m }_{ c }) }^{ 2 } } }$$` Where, `\({ m }_{ c }=\frac { 1 }{ n } \sum _{ i\in c }^{ }{ { y }_{ i } }\)`, the prediction for leaf \textit{c}. Or, alternatively written as: `$$S=\sum _{ c\in leaves(T) }^{ }{ { n }_{ c }{ V }_{ c } }$$` Where `\(V_{c}\)` is the within-leave variance of leaf \textit{c}. Our goal then is to find splits that minimize S. --- # Advantages of CART Methods * Making predictions is fast. * It is easy to understand what variables are important in making predictions. * Trees can be grown with data containing missingness. For rows where we cannot reach a leaf node, we can still make a prediction by averaging the leaves in the sub-tree we do reach. * The resulting model will inherently include interaction effects. There are many reliable algorithms available. --- # Regression Trees In this example we will predict the median California house price from the house’s longitude and latitude. ```r str(calif) ``` ``` ## 'data.frame': 20640 obs. of 10 variables: ## $ MedianHouseValue: num 452600 358500 352100 341300 342200 ... ## $ MedianIncome : num 8.33 8.3 7.26 5.64 3.85 ... ## $ MedianHouseAge : num 41 21 52 52 52 52 52 52 42 52 ... ## $ TotalRooms : num 880 7099 1467 1274 1627 ... ## $ TotalBedrooms : num 129 1106 190 235 280 ... ## $ Population : num 322 2401 496 558 565 ... ## $ Households : num 126 1138 177 219 259 ... ## $ Latitude : num 37.9 37.9 37.9 37.9 37.9 ... ## $ Longitude : num -122 -122 -122 -122 -122 ... ## $ cut.prices : Factor w/ 4 levels "[1.5e+04,1.2e+05]",..: 4 4 4 4 4 4 4 3 3 3 ... ``` --- # Tree 1 ```r treefit <- tree(log(MedianHouseValue) ~ Longitude + Latitude, data=calif) plot(treefit); text(treefit, cex=0.75) ``` <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> --- # Tree 1 <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- # Tree 1 ```r summary(treefit) ``` ``` ## ## Regression tree: ## tree(formula = log(MedianHouseValue) ~ Longitude + Latitude, ## data = calif) ## Number of terminal nodes: 12 ## Residual mean deviance: 0.1662 = 3429 / 20630 ## Distribution of residuals: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -2.75900 -0.26080 -0.01359 0.00000 0.26310 1.84100 ``` Here “deviance” is the mean squared error, or root-mean-square error of `\(\sqrt{.166} = 0.41\)`. --- # Tree 2, Reduce Minimum Deviance We can increase the fit but changing the stopping criteria with the mindev parameter. ```r treefit2 <- tree(log(MedianHouseValue) ~ Longitude + Latitude, data=calif, mindev=.001) summary(treefit2) ``` ``` ## ## Regression tree: ## tree(formula = log(MedianHouseValue) ~ Longitude + Latitude, ## data = calif, mindev = 0.001) ## Number of terminal nodes: 68 ## Residual mean deviance: 0.1052 = 2164 / 20570 ## Distribution of residuals: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -2.94700 -0.19790 -0.01872 0.00000 0.19970 1.60600 ``` With the larger tree we now have a root-mean-square error of 0.32. --- # Tree 2, Reduce Minimum Deviance <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- # Tree 3, Include All Variables However, we can get a better fitting model by including the other variables. ```r treefit3 <- tree(log(MedianHouseValue) ~ ., data=calif) summary(treefit3) ``` ``` ## ## Regression tree: ## tree(formula = log(MedianHouseValue) ~ ., data = calif) ## Variables actually used in tree construction: ## [1] "cut.prices" ## Number of terminal nodes: 4 ## Residual mean deviance: 0.03608 = 744.5 / 20640 ## Distribution of residuals: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -1.718000 -0.127300 0.009245 0.000000 0.130000 0.358600 ``` With all the available variables, the root-mean-square error is 0.11. --- # Classification Trees Predicting who survived the Titanic. * `pclass`: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd) * `survival`: A Boolean indicating whether the passenger survived or not (0 = No; 1 = Yes); this is our target * `name`: A field rich in information as it contains title and family names * `sex`: male/female * `age`: Age, a significant portion of values are missing * `sibsp`: Number of siblings/spouses aboard * `parch`: Number of parents/children aboard * `ticket`: Ticket number. * `fare`: Passenger fare (British Pound). * `cabin`: Does the location of the cabin influence chances of survival? * `embarked`: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) * `boat`: Lifeboat, many missing values * `body`: Body Identification Number * `home.dest`: Home/destination --- # Classification using `rpart` ```r (titanic.rpart <- rpart(survived ~ pclass + sex + age + sibsp, data=titanic.train)) ``` ``` ## n= 981 ## ## node), split, n, deviance, yval ## * denotes terminal node ## ## 1) root 981 231.651400 0.3822630 ## 2) sex=male 629 97.723370 0.1923688 ## 4) pclass>=1.5 498 63.004020 0.1485944 ## 8) age>=9.5 454 49.092510 0.1233480 * ## 9) age< 9.5 44 10.636360 0.4090909 ## 18) pclass>=2.5 35 6.685714 0.2571429 * ## 19) pclass< 2.5 9 0.000000 1.0000000 * ## 5) pclass< 1.5 131 30.137400 0.3587786 * ## 3) sex=female 352 70.715910 0.7215909 ## 6) pclass>=2.5 158 39.272150 0.4620253 * ## 7) pclass< 2.5 194 12.128870 0.9329897 * ``` --- # Classification using `rpart` ```r plot(titanic.rpart); text(titanic.rpart, use.n=TRUE, cex=1) ``` <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- # Classification using `ctree` ```r (titanic.ctree <- ctree(survived ~ pclass + sex + age + sibsp, data=titanic.train)) ``` ``` ## ## Conditional inference tree with 8 terminal nodes ## ## Response: survived ## Inputs: pclass, sex, age, sibsp ## Number of observations: 981 ## ## 1) sex == {female}; criterion = 1, statistic = 267.418 ## 2) pclass <= 2; criterion = 1, statistic = 91.487 ## 3) pclass <= 1; criterion = 0.99, statistic = 9.116 ## 4)* weights = 108 ## 3) pclass > 1 ## 5)* weights = 86 ## 2) pclass > 2 ## 6) sibsp <= 2; criterion = 0.957, statistic = 6.496 ## 7)* weights = 142 ## 6) sibsp > 2 ## 8)* weights = 16 ## 1) sex == {male} ## 9) pclass <= 1; criterion = 1, statistic = 24.468 ## 10)* weights = 131 ## 9) pclass > 1 ## 11) age <= 9; criterion = 0.996, statistic = 10.948 ## 12) pclass <= 2; criterion = 1, statistic = 15.971 ## 13)* weights = 9 ## 12) pclass > 2 ## 14)* weights = 35 ## 11) age > 9 ## 15)* weights = 454 ``` --- # Classification using `ctree` ```r plot(titanic.ctree) ``` <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- # Ensemble Methods Ensemble methods use multiple models that are combined by weighting, or averaging, each individual model to provide an overall estimate. Each model is a random sample of the sample. Common ensemble methods include: * *Boosting* - Each successive trees give extra weight to points incorrectly predicted by earlier trees. After all trees have been estimated, the prediction is determined by a weighted “vote” of all predictions (i.e. results of each individual tree model). * *Bagging* - Each tree is estimated independent of other trees. A simple “majority vote” is take for the prediction. * *Random Forests* - In addition to randomly sampling the data for each model, each split is selected from a random subset of all predictors. * *Super Learner* - An ensemble of ensembles. See https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html --- class: font90 # Random Forests The random forest algorithm works as follows: 1. Draw `\(n_{tree}\)` bootstrap samples from the original data. 2. For each bootstrap sample, grow an unpruned tree. At each node, randomly sample `\(m_{try}\)` predictors and choose the best split among those predictors selected<footnote>Bagging is a special case of random forests where `\(m_{try} = p\)` where *p* is the number of predictors</footnote>. 3. Predict new data by aggregating the predictions of the ntree trees (majority votes for classification, average for regression). Error rates are obtained as follows: 1. At each bootstrap iteration predict data not in the bootstrap sample (what Breiman calls “out-of-bag”, or OOB, data) using the tree grown with the bootstrap sample. 2. Aggregate the OOB predictions. On average, each data point would be out-of-bag 36% of the times, so aggregate these predictions. The calculated error rate is called the OOB estimate of the error rate. --- # Random Forests: Titanic ```r titanic.rf <- randomForest(factor(survived) ~ pclass + sex + age + sibsp, data = titanic.train, ntree = 5000, importance = TRUE) ``` ```r importance(titanic.rf) ``` ``` ## 0 1 MeanDecreaseAccuracy MeanDecreaseGini ## pclass 98.53696 113.42441 134.5831 45.95406 ## sex 231.50133 302.50534 308.4630 129.84180 ## age 96.56779 59.30569 121.2686 58.59634 ## sibsp 79.02451 -17.28700 62.0706 17.19954 ``` --- # Random Forests: Titanic (cont.) ```r importance(titanic.rf) ``` ``` ## 0 1 MeanDecreaseAccuracy MeanDecreaseGini ## pclass 98.53696 113.42441 134.5831 45.95406 ## sex 231.50133 302.50534 308.4630 129.84180 ## age 96.56779 59.30569 121.2686 58.59634 ## sibsp 79.02451 -17.28700 62.0706 17.19954 ``` --- # Random Forests: Titanic ```r min_depth_frame <- min_depth_distribution(titanic.rf) ``` ```r plot_min_depth_distribution(min_depth_frame) ``` <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> --- class: inverse, middle, center # Predictive Modeling --- # Example: Hours Studying Predicting Passing ```r study <- data.frame( Hours=c(0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00, 3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50), Pass=c(0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1) ) study[sample(nrow(study), 5),] ``` ``` ## Hours Pass ## 8 2.00 0 ## 16 4.25 1 ## 6 1.75 0 ## 18 4.75 1 ## 20 5.50 1 ``` ```r tab <- describeBy(study$Hours, group = study$Pass, mat = TRUE, skew = FALSE) tab$group1 <- as.integer(as.character(tab$group1)) ``` --- # Prediction Odds (or probability) of passing if studied **zero** hours? `$$log(\frac{p}{1-p}) = -4.078 + 1.505 \times 0$$` `$$\frac{p}{1-p} = exp(-4.078) = 0.0169$$` `$$p = \frac{0.0169}{1.169} = .016$$` -- Odds (or probability) of passing if studied **4** hours? `$$log(\frac{p}{1-p}) = -4.078 + 1.505 \times 4$$` `$$\frac{p}{1-p} = exp(1.942) = 6.97$$` `$$p = \frac{6.97}{7.97} = 0.875$$` --- # Fitted Values ```r study[1,] ``` ``` ## Hours Pass ## 1 0.5 0 ``` ```r logistic <- function(x, b0, b1) { return(1 / (1 + exp(-1 * (b0 + b1 * x)) )) } logistic(.5, b0=-4.078, b1=1.505) ``` ``` ## [1] 0.03470667 ``` --- # Model Performance The use of statistical models to predict outcomes, typically on new data, is called predictive modeling. Logistic regression is a common statistical procedure used for prediction. We will utilize a **confusion matrix** to evaluate accuracy of the predictions. <img src="images/Confusion_Matrix.png" width="1100" style="display: block; margin: auto;" /> --- class: font80 # Predicting Heart Attacks Source: https://www.kaggle.com/datasets/imnikhilanand/heart-attack-prediction?select=data.csv ```r heart <- read.csv('../course_data/heart_attack_predictions.csv') heart <- heart |> mutate_if(is.character, as.numeric) |> select(!c(slope, ca, thal)) str(heart) ``` ``` ## 'data.frame': 294 obs. of 11 variables: ## $ age : int 28 29 29 30 31 32 32 32 33 34 ... ## $ sex : int 1 1 1 0 0 0 1 1 1 0 ... ## $ cp : int 2 2 2 1 2 2 2 2 3 2 ... ## $ trestbps: num 130 120 140 170 100 105 110 125 120 130 ... ## $ chol : num 132 243 NA 237 219 198 225 254 298 161 ... ## $ fbs : num 0 0 0 0 0 0 0 0 0 0 ... ## $ restecg : num 2 0 0 1 1 0 0 0 0 0 ... ## $ thalach : num 185 160 170 170 150 165 184 155 185 190 ... ## $ exang : num 0 0 0 0 0 0 0 0 0 0 ... ## $ oldpeak : num 0 0 0 0 0 0 0 0 0 0 ... ## $ num : int 0 0 0 0 0 0 0 0 0 0 ... ``` Note: `num` is the diagnosis of heart disease (angiographic disease status) (i.e. Value 0: < 50% diameter narrowing -- Value 1: > 50% diameter narrowing) --- # Missing Data We will save this for another day... ```r complete.cases(heart) |> table() ``` ``` ## ## FALSE TRUE ## 33 261 ``` ```r mice_out <- mice::mice(heart, m = 1) ``` ``` ## ## iter imp variable ## 1 1 trestbps chol fbs restecg thalach exang ## 2 1 trestbps chol fbs restecg thalach exang ## 3 1 trestbps chol fbs restecg thalach exang ## 4 1 trestbps chol fbs restecg thalach exang ## 5 1 trestbps chol fbs restecg thalach exang ``` ```r heart <- mice::complete(mice_out) ``` --- # Data Setup We will split the data into a training set (70% of observations) and validation set (30%). ```r train.rows <- sample(nrow(heart), nrow(heart) * .7) heart_train <- heart[train.rows,] heart_test <- heart[-train.rows,] ``` This is the proportions of survivors and defines what our "guessing" rate is. That is, if we guessed no one had a heart attack, we would be correct 62% of the time. ```r (heart_attack <- table(heart_train$num) %>% prop.table) ``` ``` ## ## 0 1 ## 0.604878 0.395122 ``` --- class: font80 # Model Training ```r lr.out <- glm(num ~ ., data=heart_train, family=binomial(link = 'logit')) summary(lr.out) ``` ``` ## ## Call: ## glm(formula = num ~ ., family = binomial(link = "logit"), data = heart_train) ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -5.895542 3.311490 -1.780 0.07502 . ## age 0.033586 0.033259 1.010 0.31257 ## sex 1.422810 0.587957 2.420 0.01552 * ## cp 1.059011 0.264157 4.009 6.1e-05 *** ## trestbps -0.016868 0.013816 -1.221 0.22212 ## chol 0.007188 0.003313 2.170 0.03001 * ## fbs 1.474454 0.984983 1.497 0.13441 ## restecg -0.467892 0.551265 -0.849 0.39601 ## thalach -0.008747 0.010904 -0.802 0.42244 ## exang 1.182195 0.581334 2.034 0.04199 * ## oldpeak 1.056386 0.324408 3.256 0.00113 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 275.10 on 204 degrees of freedom ## Residual deviance: 148.47 on 194 degrees of freedom ## AIC: 170.47 ## ## Number of Fisher Scoring iterations: 5 ``` --- # Predicted Values ```r heart_train$prediction <- predict(lr.out, type = 'response', newdata = heart_train) ggplot(heart_train, aes(x = prediction, color = num == 1)) + geom_density() ``` <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" /> --- # Results ```r heart_train$prediction_class <- heart_train$prediction > 0.5 tab <- table(heart_train$prediction_class, heart_train$num) %>% prop.table() %>% print() ``` ``` ## ## 0 1 ## FALSE 0.55121951 0.10243902 ## TRUE 0.05365854 0.29268293 ``` For the training set, the overall accuracy is 84.39%. Recall that 60.49% people did not have a heart attach. Therefore, the simplest model would be to predict that no one had a heart attack, which would mean we would be correct 60.49% of the time. Therefore, our prediction model is 23.9% better than guessing. --- # Checking with the validation dataset ```r (survived_test <- table(heart_test$num) %>% prop.table()) ``` ``` ## ## 0 1 ## 0.7191011 0.2808989 ``` ```r heart_test$prediction <- predict(lr.out, newdata = heart_test, type = 'response') heart_test$prediciton_class <- heart_test$prediction > 0.5 tab_test <- table(heart_test$prediciton_class, heart_test$num) %>% prop.table() %>% print() ``` ``` ## ## 0 1 ## FALSE 0.59550562 0.05617978 ## TRUE 0.12359551 0.22471910 ``` The overall accuracy is 82.02%, or 10.1% better than guessing. --- class: font90 # Receiver Operating Characteristic (ROC) Curve The ROC curve is created by plotting the true positive rate (TPR; AKA sensitivity) against the false positive rate (FPR; AKA probability of false alarm) at various threshold settings. .pull-left[ In a classification model, outcomes are either as positive (*p*) or negative (*n*). There are then four possible outcomes: * **true positive** (TP) The outcome from a prediction is *p* and the actual value is also *p*. * **false positive** (FP) The actual value is *n*. * **true negative** (TN) Both the prediction outcome and the actual value are *n*. * **false negative** (FN) The prediction outcome is *n* while the actual value is *p*. ] .pull-right[ ![](images/rocdiagram.png) ```r roc <- calculate_roc(heart_train$prediction, heart_train$num == 1) summary(roc) ``` ``` ## AUC = 0.912 ## Cost of false-positive = 1 ## Cost of false-negative = 1 ## Threshold with minimum cost = 0.606 ``` ] --- # ROC Curve .center[ ![](images/ROCspace.png) ] --- # ROC Curve ```r plot(roc, curve = 'accuracy') ``` <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" /> --- # ROC Curve ```r plot(roc) ``` <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" /> --- class: font90 # Caution on Interpreting Accuracy - [Loh, Sooo, and Zing](http://cs229.stanford.edu/proj2016/report/LohSooXing-PredictingSexualOrientationBasedOnFacebookStatusUpdates-report.pdf) (2016) predicted sexual orientation based on Facebook Status. - They reported model accuracies of approximately 90% using SVM, logistic regression and/or random forest methods. -- - [Gallup](https://news.gallup.com/poll/234863/estimate-lgbt-population-rises.aspx) (2018) poll estimates that 4.5% of the Americal population identifies as LGBT. -- - *My proposed model:* I predict all Americans are heterosexual. - The accuracy of my model is 95.5%, or *5.5% better than Facebook's model!* - Predicting "rare" events (i.e. when the proportion of one of the two outcomes large) is difficult and requires independent (predictor) variables that strongly associated with the dependent (outcome) variable. --- # Fitted Values Revisited What happens when the ratio of true-to-false increases (i.e. want to predict "rare" events)? Let's simulate a dataset where the ratio of true-to-false is 10-to-1. We can also define the distribution of the dependent variable. Here, there is moderate separation in the distributions. ```r test.df2 <- getSimulatedData( treat.mean=.6, control.mean=.4) ``` The `multilevelPSA::psrange` function will sample with varying ratios from 1:10 to 1:1. It takes multiple samples and averages the ranges and distributions of the fitted values from logistic regression. ```r psranges2 <- psrange(test.df2, test.df2$treat, treat ~ ., samples=seq(100,1000,by=100), nboot=20) ``` --- # Fitted Values Revisited (cont.) ```r plot(psranges2) ``` <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" /> --- class: left, font140 # One Minute Paper .pull-left[ 1. What was the most important thing you learned during this class? 2. What important question remains unanswered for you? ] .pull-right[ <img src="09-Predictive_Modeling_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" /> ] https://forms.gle/Jcw55CYvc6Ym8A5F7