class: center, middle, inverse, title-slide .title[ # An introduction to machine learning ] .author[ ### INFO 5940
Cornell University ] --- class: inverse, middle # What is machine learning? --- class: middle <img src="../../../../../../../../img/amazon-recommendations.png" width="80%" style="display: block; margin: auto;" /> --- class: middle <img src="https://miro.medium.com/max/1400/1*j5aWfH9t1_EZPJC92CJ7oQ.png" width="80%" style="display: block; margin: auto;" /> .footnote[https://medium.com/tmobile-tech/small-data-big-value-f783ceca4fdb] --- class: middle <img src="https://techcrunch.com/wp-content/uploads/2017/12/facebook-facial-recognition-photo-review.png?w=730&crop=1" width="80%" style="display: block; margin: auto;" /> --- class: middle, inverse # What is machine learning? --- class: middle, center <img src="https://imgs.xkcd.com/comics/machine_learning.png" width="40%" style="display: block; margin: auto;" /> --- class: middle <img src="images/intro.002.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: middle <img src="images/intro.003.jpeg" width="80%" style="display: block; margin: auto;" /> --- ## Types of machine learning - Supervised - Unsupervised --- <img src="images/all-of-ml.jpg" width="80%" style="display: block; margin: auto;" /> .footnote[Credit: <https://vas3k.com/blog/machine_learning/>] --- ## Examples of supervised learning - Ad click prediction - Supreme Court decision making - Police misconduct prediction --- ## Two modes -- .pull-left[ ### Classification Is your hospital in a state of crisis care? ] -- .pull-left[ ### Regression What percentage of hospital beds are occupied? ] --- ## Two cultures .pull-left[ ### Statistics - model first - inference emphasis ] .pull-right[ ### Machine Learning - data first - prediction emphasis ] --- name: train-love background-image: url(images/train.jpg) background-size: contain background-color: #f6f6f6 --- template: train-love class: center, top # Statistics --- template: train-love class: bottom > *"Statisticians, like artists, have the bad habit of falling in love with their models."* > > — George Box --- ## Predictive modeling ```r library(tidymodels) ``` ``` ## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ── ``` ``` ## ✔ broom 1.0.0 ✔ recipes 1.0.1 ## ✔ dials 1.0.0 ✔ rsample 1.1.0 ## ✔ dplyr 1.0.9 ✔ tibble 3.1.8 ## ✔ ggplot2 3.3.6 ✔ tidyr 1.2.0 ## ✔ infer 1.0.2 ✔ tune 1.0.0 ## ✔ modeldata 1.0.0 ✔ workflows 1.0.0 ## ✔ parsnip 1.0.0 ✔ workflowsets 1.0.0 ## ✔ purrr 0.3.4 ✔ yardstick 1.0.0 ``` ``` ## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ── ## ✖ purrr::discard() masks scales::discard() ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ## ✖ recipes::step() masks stats::step() ## • Search for functions across packages at https://www.tidymodels.org/find/ ``` --- ## Bechdel Test <img src="https://fivethirtyeight.com/wp-content/uploads/2014/04/hickey-bechdel-11.png" width="60%" style="display: block; margin: auto;" /> .footnote[Source:[FiveThirtyEight](https://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/)] --- ## Bechdel Test 1. It has to have at least two [named] women in it 1. Who talk to each other 1. About something besides a man -- ```r library(rcis) data("bechdel") glimpse(bechdel) ## Rows: 1,394 ## Columns: 10 ## $ year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, … ## $ title <chr> "12 Years a Slave", "2 Guns", "42", … ## $ test <fct> Fail, Fail, Fail, Fail, Fail, Pass, … ## $ budget_2013 <dbl> 2.00, 6.10, 4.00, 22.50, 9.20, 1.20,… ## $ domgross_2013 <dbl> 5.310703, 7.561246, 9.502021, 3.8362… ## $ intgross_2013 <dbl> 15.860703, 13.249301, 9.502021, 14.5… ## $ rated <chr> "R", "R", "PG-13", "PG-13", "R", "R"… ## $ metascore <dbl> 97, 55, 62, 29, 28, 55, 48, 33, 90, … ## $ imdb_rating <dbl> 8.3, 6.8, 7.6, 6.6, 5.4, 7.8, 5.7, 5… ## $ genre <chr> "Biography", "Action", "Biography", … ``` --- ## Bechdel test data - N = 1394 - 1 categorical outcome: `test` - 9 predictors --- class: middle <img src="index_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" /> --- class: inverse, middle # What is the goal of machine learning? -- ## Build .white[models] that -- ## generate .white[accurate predictions] -- ## for .white[future, yet-to-be-seen data]. -- .footnote[Max Kuhn & Kjell Johnston, http://www.feat.engineering/] --- ## Machine learning We'll use this goal to drive learning of 3 core `tidymodels` packages: - `parsnip` - `yardstick` - `rsample` --- class: inverse, middle ## 🔨 Build models with `parsnip` --- class: middle, center, frame ## parsnip <iframe src="https://parsnip.tidymodels.org" width="100%" height="400px" data-external="1"></iframe> --- ## `glm()` ```r glm(test ~ metascore, family = binomial, data = bechdel) ## ## Call: glm(formula = test ~ metascore, family = binomial, data = bechdel) ## ## Coefficients: ## (Intercept) metascore ## 0.052274 -0.004563 ## ## Degrees of Freedom: 1393 Total (i.e. Null); 1392 Residual ## Null Deviance: 1916 ## Residual Deviance: 1914 AIC: 1918 ``` --- class: center <img src="images/predicting/predicting.001.jpeg" width="80%" style="display: block; margin: auto;" /> --- ## To specify a model with `parsnip` 1. Pick a model 1. Set the engine 1. Set the mode (if needed) --- ## To specify a model with `parsnip` ```r logistic_reg() %>% set_engine("glm") %>% set_mode("classification") ## Logistic Regression Model Specification (classification) ## ## Computational engine: glm ``` --- ## To specify a model with `parsnip` ```r decision_tree() %>% set_engine("C5.0") %>% set_mode("classification") ## Decision Tree Model Specification (classification) ## ## Computational engine: C5.0 ``` --- ## To specify a model with `parsnip` ```r nearest_neighbor() %>% set_engine("kknn") %>% set_mode("classification") ## K-Nearest Neighbor Model Specification (classification) ## ## Computational engine: kknn ``` --- ## 1\. Pick a model All available models are listed at <https://www.tidymodels.org/find/parsnip/> <iframe src="https://www.tidymodels.org/find/parsnip/" width="100%" height="400px" data-external="1"></iframe> --- ## `logistic_reg()` Specifies a model that uses logistic regression ```r logistic_reg(penalty = NULL, mixture = NULL) ``` --- ## `logistic_reg()` Specifies a model that uses logistic regression ```r logistic_reg( mode = "classification", # "default" mode, if exists penalty = NULL, # model hyper-parameter mixture = NULL # model hyper-parameter ) ``` --- ## `set_engine()` Adds an engine to power or implement the model. ```r logistic_reg() %>% set_engine(engine = "glm") ``` --- ## `set_mode()` Sets the class of problem the model will solve, which influences which output is collected. Not necessary if mode is set in Step 1. ```r logistic_reg() %>% set_mode(mode = "classification") ``` --- class: your-turn ## Your turn 1 Run the chunk in your .Rmd and look at the output. Then, copy/paste the code and edit to create: + a decision tree model for classification + that uses the `C5.0` engine. Save it as `tree_mod` and look at the object. What is different about the output? *Hint: you'll need https://www.tidymodels.org/find/parsnip/*
03
:
00
--- ```r lr_mod ## Logistic Regression Model Specification (classification) ## ## Computational engine: glm tree_mod <- decision_tree() %>% set_engine(engine = "C5.0") %>% set_mode("classification") tree_mod ## Decision Tree Model Specification (classification) ## ## Computational engine: C5.0 ``` --- class: inverse, middle ## Now we've built a model. -- ## But, how do we .white[use] a model? -- ## First - what does it mean to use a model? --- class: inverse, middle, center ![](https://media.giphy.com/media/fhAwk4DnqNgw8/giphy.gif) Statistical models learn from the data. Many learn model parameters, which *can* be useful as values for inference and interpretation. --- ## A fitted model .pull-left[ ```r lr_mod %>% fit(test ~ metascore + imdb_rating, data = bechdel) %>% broom::tidy() ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 2.70 0.436 6.20 5.64e-10 ## 2 metascore 0.0202 0.00481 4.20 2.66e- 5 ## 3 imdb_rating -0.606 0.0889 -6.82 8.87e-12 ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-27-1.png" width="80%" style="display: block; margin: auto;" /> ] --- ## "All models are wrong, but some are useful" <img src="index_files/figure-html/unnamed-chunk-29-1.png" width="80%" style="display: block; margin: auto;" /> --- ## "All models are wrong, but some are useful" <img src="index_files/figure-html/unnamed-chunk-31-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Predict new data ```r bechdel_new <- tibble(metascore = c(40, 50, 60), imdb_rating = c(6, 6, 6), test = c("Fail", "Fail", "Pass")) %>% mutate(test = factor(test, levels = c("Fail", "Pass"))) bechdel_new ## # A tibble: 3 × 3 ## metascore imdb_rating test ## <dbl> <dbl> <fct> ## 1 40 6 Fail ## 2 50 6 Fail ## 3 60 6 Pass ``` --- ## Predict old data ```r tree_mod %>% fit(test ~ metascore + imdb_rating, data = bechdel) %>% predict(new_data = bechdel) %>% mutate(true_bechdel = bechdel$test) %>% accuracy(truth = true_bechdel, estimate = .pred_class) ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.580 ``` --- ## Predict new data .pull-left[ ### out with the old... ```r tree_mod %>% fit(test ~ metascore + imdb_rating, data = bechdel) %>% predict(new_data = bechdel) %>% mutate(true_bechdel = bechdel$test) %>% accuracy(truth = true_bechdel, estimate = .pred_class) ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.580 ``` ] .pull-right[ ### in with the 🆕 ```r tree_mod %>% fit(test ~ metascore + imdb_rating, data = bechdel) %>% * predict(new_data = bechdel_new) %>% * mutate(true_bechdel = bechdel_new$test) %>% accuracy(truth = true_bechdel, estimate = .pred_class) ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.333 ``` ] --- ## `fit()` Train a model by fitting a model. Returns a parsnip model fit. ```r fit(tree_mod, test ~ metascore + imdb_rating, data = bechdel) ``` --- ## `fit()` Train a model by fitting a model. Returns a parsnip model fit. ```r tree_mod %>% # parsnip model fit(test ~ metascore + imdb_rating, # a formula data = bechdel # dataframe ) ``` --- ## `fit()` Train a model by fitting a model. Returns a parsnip model fit. ```r tree_fit <- tree_mod %>% # parsnip model fit(test ~ metascore + imdb_rating, # a formula data = bechdel # dataframe ) ``` --- class: center <img src="images/predicting/predicting.001.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: center <img src="images/predicting/predicting.003.jpeg" width="80%" style="display: block; margin: auto;" /> --- ## `predict()` Use a fitted model to predict new `y` values from data. Returns a tibble. ```r predict(tree_fit, new_data = bechdel_new) ``` --- ```r tree_fit %>% predict(new_data = bechdel_new) ## # A tibble: 3 × 1 ## .pred_class ## <fct> ## 1 Pass ## 2 Pass ## 3 Pass ``` --- ## Axiom The best way to measure a model's performance at predicting new data is to .display[predict new data]. --- ## Data splitting <img src="index_files/figure-html/all-split-1.png" width="80%" style="display: block; margin: auto;" /> --- class: inverse, middle # ♻️ Resample models with `rsample` --- ## `rsample` <iframe src="https://tidymodels.github.io/rsample/" width="100%" height="400px" data-external="1"></iframe> --- ## `initial_split()*` "Splits" data randomly into a single testing and a single training set. ```r initial_split(data, prop = 3/4) ``` .footnote[`*` from `rsample`] --- ```r bechdel_split <- initial_split(data = bechdel, strata = test, prop = 3/4) bechdel_split ## <Training/Testing/Total> ## <1045/349/1394> ``` --- ## `training()` and `testing()*` Extract training and testing sets from an `rsplit` ```r training(bechdel_split) testing(bechdel_split) ``` .footnote[`*` from `rsample`] --- ```r bechdel_train <- training(bechdel_split) bechdel_train ## # A tibble: 1,045 × 10 ## year title test budge…¹ domgr…² intgr…³ rated metas…⁴ ## <dbl> <chr> <fct> <dbl> <dbl> <dbl> <chr> <dbl> ## 1 2013 12 Yea… Fail 2 5.31 15.9 R 97 ## 2 2013 2 Guns Fail 6.1 7.56 13.2 R 55 ## 3 2013 42 Fail 4 9.50 9.50 PG-13 62 ## 4 2013 47 Ron… Fail 22.5 3.84 14.6 PG-13 29 ## 5 2013 A Good… Fail 9.2 6.73 30.4 R 28 ## 6 2013 After … Fail 13 6.05 24.4 PG-13 33 ## 7 2013 Cloudy… Fail 7.8 12.0 27.2 PG 59 ## 8 2013 Don Jon Fail 0.55 2.45 2.64 R 66 ## 9 2013 Escape… Fail 7 2.52 10.4 R 49 ## 10 2013 Gangst… Fail 6 4.60 10.4 R 40 ## # … with 1,035 more rows, 2 more variables: ## # imdb_rating <dbl>, genre <chr>, and abbreviated ## # variable names ¹budget_2013, ²domgross_2013, ## # ³intgross_2013, ⁴metascore ## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names ``` --- class: center <img src="images/predicting/predicting.001.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: center <img src="images/predicting/predicting.003.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: center <img src="images/predicting/predicting.004.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: center <img src="images/predicting/predicting.006.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: center <img src="images/predicting/predicting.007.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: center <img src="images/predicting/predicting.008.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: center <img src="images/predicting/predicting.009.jpeg" width="80%" style="display: block; margin: auto;" /> --- ## Your turn 2 Fill in the blanks. Use `initial_split()`, `training()`, and `testing()` to: 1. Split **bechdel** into training and test sets. Save the rsplit! 2. Extract the training data and fit your classification tree model. 3. Predict the testing data, and save the true `test` values. 4. Measure the accuracy of your model with your test set. Keep `set.seed(100)` at the start of your code.
04
:
00
--- ```r set.seed(100) # Important! bechdel_split <- initial_split(bechdel, strata = test, prop = 3/4) bechdel_train <- training(bechdel_split) bechdel_test <- testing(bechdel_split) tree_mod %>% fit(test ~ metascore + imdb_rating, data = bechdel_train) %>% predict(new_data = bechdel_test) %>% mutate(true_bechdel = bechdel_test$test) %>% accuracy(truth = true_bechdel, estimate = .pred_class) ``` --- class: inverse, middle # Goal of Machine Learning ## 🎯 generate accurate predictions --- ## Axiom Better Model = Better Predictions (Lower error rate) --- ## `accuracy()*` Calculates the accuracy based on two columns in a dataframe: The .display[truth]: `\({y}_i\)` The predicted .display[estimate]: `\(\hat{y}_i\)` ```r accuracy(data, truth, estimate) ``` .footnote[`*` from `yardstick`] --- ```r tree_mod %>% fit(test ~ metascore + imdb_rating, data = bechdel_train) %>% predict(new_data = bechdel_test) %>% mutate(true_bechdel = bechdel_test$test) %>% * accuracy(truth = true_bechdel, estimate = .pred_class) ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.547 ``` --- class: center <img src="images/predicting/predicting.006.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: center <img src="images/predicting/predicting.007.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: center <img src="images/predicting/predicting.008.jpeg" width="80%" style="display: block; margin: auto;" /> --- class: center <img src="images/predicting/predicting.009.jpeg" width="80%" style="display: block; margin: auto;" /> --- ## Your Turn 3 What would happen if you repeated this process? Would you get the same answers? Note your accuracy from above. Then change your seed number and rerun just the last code chunk above. Do you get the same answer? Try it a few times with a few different seeds.
02
:
00
--- .pull-left[ ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.553 ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.559 ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.567 ``` ] -- .pull-right[ ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.539 ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.504 ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.550 ``` ] --- ## Quiz Why is the new estimate different? --- ## Data Splitting -- <img src="index_files/figure-html/unnamed-chunk-74-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-75-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-76-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-77-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-78-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-79-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-80-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-81-1.png" width="80%" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/unnamed-chunk-82-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-83-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-84-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-85-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-86-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-87-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-88-1.png" width="80%" style="display: block; margin: auto;" /> -- <img src="index_files/figure-html/unnamed-chunk-89-1.png" width="80%" style="display: block; margin: auto;" /> -- .right[Mean accuracy] --- ## Resampling Let's resample 10 times then compute the mean of the results... --- ```r acc %>% tibble::enframe(name = "accuracy") ## # A tibble: 10 × 2 ## accuracy value ## <int> <dbl> ## 1 1 0.550 ## 2 2 0.625 ## 3 3 0.599 ## 4 4 0.587 ## 5 5 0.605 ## 6 6 0.593 ## 7 7 0.542 ## 8 8 0.530 ## 9 9 0.564 ## 10 10 0.599 mean(acc) ## [1] 0.5793696 ``` --- ## Guess Which do you think is a better estimate? The best result or the mean of the results? Why? --- ## But also... Fit with .display[training set] Predict with .display[testing set] -- Rinse and repeat? --- ## There has to be a better way... ```r acc <- vector(length = 10, mode = "double") for (i in 1:10) { new_split <- initial_split(bechdel) new_train <- training(new_split) new_test <- testing(new_split) acc[i] <- lr_mod %>% fit(test ~ metascore + imdb_rating, data = new_train) %>% predict(new_test) %>% mutate(truth = new_test$test) %>% accuracy(truth, .pred_class) %>% pull(.estimate) } ``` --- background-image: url(images/diamonds.jpg) background-size: contain background-position: left class: middle, center background-color: #f5f5f5 .pull-right[ ## The .display[testing set] is precious... ## we can only use it once! ] --- background-image: url(images/diamonds.jpg) background-size: contain background-position: left class: middle, center background-color: #f5f5f5 .pull-right[ ## How can we use the training set to compare, evaluate, and tune models? ] --- class: middle <img src="https://www.tidymodels.org/start/resampling/img/resampling.svg" width="80%" style="display: block; margin: auto;" /> --- ## Cross-validation <img src="images/cross-validation/Slide2.png" width="80%" style="display: block; margin: auto;" /> --- ## Cross-validation <img src="images/cross-validation/Slide3.png" width="80%" style="display: block; margin: auto;" /> --- ## Cross-validation <img src="images/cross-validation/Slide4.png" width="80%" style="display: block; margin: auto;" /> --- ## Cross-validation <img src="images/cross-validation/Slide5.png" width="80%" style="display: block; margin: auto;" /> --- ## Cross-validation <img src="images/cross-validation/Slide6.png" width="80%" style="display: block; margin: auto;" /> --- ## Cross-validation <img src="images/cross-validation/Slide7.png" width="80%" style="display: block; margin: auto;" /> --- ## Cross-validation <img src="images/cross-validation/Slide8.png" width="80%" style="display: block; margin: auto;" /> --- ## Cross-validation <img src="images/cross-validation/Slide9.png" width="80%" style="display: block; margin: auto;" /> --- ## Cross-validation <img src="images/cross-validation/Slide10.png" width="80%" style="display: block; margin: auto;" /> --- ## Cross-validation <img src="images/cross-validation/Slide11.png" width="80%" style="display: block; margin: auto;" /> --- ## V-fold cross-validation ```r vfold_cv(data, v = 10, ...) ``` --- exclude: true --- ## Guess How many times does an observation/row appear in the assessment set? <img src="index_files/figure-html/vfold-tiles-1.png" width="80%" style="display: block; margin: auto;" /> --- <img src="index_files/figure-html/unnamed-chunk-105-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Quiz If we use 10 folds, which percent of our data will end up in the training set and which percent in the testing set for each fold? -- 90% - training 10% - test --- ## Your Turn 4 Run the code below. What does it return? ```r set.seed(100) bechdel_folds <- vfold_cv(data = bechdel_train, v = 10, strata = test) bechdel_folds ```
01
:
00
--- ```r set.seed(100) bechdel_folds <- vfold_cv(data = bechdel_train, v = 10, strata = test) bechdel_folds ## # 10-fold cross-validation using stratification ## # A tibble: 10 × 2 ## splits id ## <list> <chr> ## 1 <split [940/105]> Fold01 ## 2 <split [940/105]> Fold02 ## 3 <split [940/105]> Fold03 ## 4 <split [940/105]> Fold04 ## 5 <split [940/105]> Fold05 ## 6 <split [940/105]> Fold06 ## 7 <split [941/104]> Fold07 ## 8 <split [941/104]> Fold08 ## 9 <split [941/104]> Fold09 ## 10 <split [942/103]> Fold10 ``` --- ## We need a new way to fit ```r split1 <- bechdel_folds %>% pluck("splits", 1) split1_train <- training(split1) split1_test <- testing(split1) tree_mod %>% fit(test ~ ., data = split1_train) %>% predict(split1_test) %>% mutate(truth = split1_test$test) %>% accuracy(truth, .pred_class) # rinse and repeat split2 <- ... ``` --- ## `fit_resamples()` Trains and tests a resampled model. ```r tree_mod %>% fit_resamples( test ~ metascore + imdb_rating, resamples = bechdel_folds ) ``` --- ```r tree_mod %>% fit_resamples( test ~ metascore + imdb_rating, resamples = bechdel_folds ) ## # Resampling results ## # 10-fold cross-validation using stratification ## # A tibble: 10 × 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [940/105]> Fold01 <tibble [2 × 4]> <tibble> ## 2 <split [940/105]> Fold02 <tibble [2 × 4]> <tibble> ## 3 <split [940/105]> Fold03 <tibble [2 × 4]> <tibble> ## 4 <split [940/105]> Fold04 <tibble [2 × 4]> <tibble> ## 5 <split [940/105]> Fold05 <tibble [2 × 4]> <tibble> ## 6 <split [940/105]> Fold06 <tibble [2 × 4]> <tibble> ## 7 <split [941/104]> Fold07 <tibble [2 × 4]> <tibble> ## 8 <split [941/104]> Fold08 <tibble [2 × 4]> <tibble> ## 9 <split [941/104]> Fold09 <tibble [2 × 4]> <tibble> ## 10 <split [942/103]> Fold10 <tibble [2 × 4]> <tibble> ``` --- ## `collect_metrics()` Unnest the metrics column from a tidymodels `fit_resamples()` ```r _results %>% collect_metrics(summarize = TRUE) ``` -- .footnote[`TRUE` is actually the default; averages across folds] --- ```r tree_mod %>% fit_resamples( test ~ metascore + imdb_rating, resamples = bechdel_folds ) %>% collect_metrics(summarize = FALSE) ## # A tibble: 20 × 5 ## id .metric .estimator .estimate .config ## <chr> <chr> <chr> <dbl> <chr> ## 1 Fold01 accuracy binary 0.610 Preprocessor1_Model1 ## 2 Fold01 roc_auc binary 0.587 Preprocessor1_Model1 ## 3 Fold02 accuracy binary 0.438 Preprocessor1_Model1 ## 4 Fold02 roc_auc binary 0.434 Preprocessor1_Model1 ## 5 Fold03 accuracy binary 0.6 Preprocessor1_Model1 ## 6 Fold03 roc_auc binary 0.570 Preprocessor1_Model1 ## 7 Fold04 accuracy binary 0.562 Preprocessor1_Model1 ## 8 Fold04 roc_auc binary 0.547 Preprocessor1_Model1 ## 9 Fold05 accuracy binary 0.6 Preprocessor1_Model1 ## 10 Fold05 roc_auc binary 0.572 Preprocessor1_Model1 ## # … with 10 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` --- ## 10-fold CV - 10 different analysis/assessment sets - 10 different models (trained on .display[analysis] sets) - 10 different sets of performance statistics (on .display[assessment] sets) --- ## Your Turn 5 Modify the code below to use `fit_resamples` and `bechdel_folds` to cross-validate the classification tree model. What is the ROC AUC that you collect at the end? ```r set.seed(100) tree_mod %>% fit(test ~ metascore + imdb_rating, data = bechdel_train) %>% predict(new_data = bechdel_test) %>% mutate(true_bechdel = bechdel_test$test) %>% accuracy(truth = true_bechdel, estimate = .pred_class) ```
03
:
00
--- ```r set.seed(100) tree_mod %>% fit_resamples(test ~ metascore + imdb_rating, resamples = bechdel_folds) %>% collect_metrics() ## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.560 10 0.0157 Preprocessor1_Mod… ## 2 roc_auc binary 0.547 10 0.0154 Preprocessor1_Mod… ``` --- ## How did we do? ```r tree_mod %>% fit(test ~ metascore + imdb_rating, data = bechdel_train) %>% predict(bechdel_test) %>% mutate(truth = bechdel_test$test) %>% accuracy(truth, .pred_class) ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.550 ``` ``` ## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.560 10 0.0157 Preprocessor1_Mod… ## 2 roc_auc binary 0.547 10 0.0154 Preprocessor1_Mod… ```