class: center, middle, inverse, title-slide # Reproducibility with R and Python ## 2021 DSS Bootcamp ### Colin Rundel ### 08-02-21 --- exclude: true --- class: inverse, center, middle # Preliminaries --- ## GitHub account To get everyone on the same page: - If you don't have one, sign up for a GitHub account (it takes 1 min.) - Go to https://github.com/join - Enter your information (username, email address and a password). --- ## Account names A few suggestions on picking a GitHub username: - Incorporate your actual name! People like to know who they’re dealing with. It makes your username easier for people to guess or remember. - Pick a username you will be comfortable revealing to your future boss. - Shorter is better than longer. - Be as unique as possible in as few characters as possible. In some settings GitHub auto-completes or suggests usernames. - Make it timeless. Don’t highlight your current university, employer, or place of residence. - Avoid words laden with special meaning in programming. --- ## Accessing RStudio Workbench To get started as quickly as possible, the preferred method is to use DSS RStudio server(s). To access RStudio Workbench: 1. If off campus, use the VPN to create a secure connection from your computer to Duke. If you are on campus, be sure you are connected to the Dukeblue network. 2. Navigate to - http://rstudio.stat.duke.edu:8787 3. Log-in with your Duke NetID and password. If you are having trouble accessing RStudio Pro see the next slide. --- ## DSS RStudio alternatives If you cannot access RStudio via the DSS servers: - Install R and RStudio locally on your computer - Download R on your computer [here](http://archive.linux.duke.edu/cran/) - Download RStudio [here](https://www.rstudio.com/products/rstudio/download/) - Use a docker container from Duke OIT 1. Go to https://vm-manage.oit.duke.edu/containers 2. Log in with your Duke NetID and password 3. Find `RStudio - statistics application with Rmarkdown and knitr support` 4. Click the link to create your environment --- ## Bootcamp materials - Everything from this computing bootcamp is available at https://dukestatsci.github.io/computing_bootcamp_2021/ - GitHub repo is available at https://github.com/dukestatsci/computing_bootcamp_2021. - You'll learn how to get this on your computer shortly. --- class: inverse, center, middle # git and GitHub --- ## Why version control? .pull-left[ - Simple formal system for tracking changes to a project over time - Time machine for your projects + Track blame and/or praise + Remove the fear of breaking things - Learning curve can be a bit steep, but when you need it you *REALLY* need it ] .pull-right[ ![](images/local-vc-system.png) ] --- ## Why git? .pull-left[ - Distributed + Work online or offline + Collaborate with large groups - Popular and Successful + Active development + Shiny new tools and ecosystems + Fast - Tracks any type of file - Branching + Smarter merges ] .pull-right[ <img src="images/distributed-vc-system.png" width="75%" style="display: block; margin: auto;" /> ] --- ## Verifying git exists git is already set-up on the DSS servers. In the terminal tab you can verify this by ```bash [cr173@numeric1 ~]$ git --version git version 2.20.1 ``` ```bash [cr173@numeric1 ~]$ which git /usr/bin/git ``` <br> To install git on your own computer follow the directions in [Happy Git and GitHub for the useR](https://happygitwithr.com/install-git.html). --- ## Git sitrep ```r usethis::git_sitrep() ## Git config (global) ## ● Name: <unset> ## ● Email: <unset> ## ● Vaccinated: FALSE ## ℹ See `?git_vaccinate` to learn more ## ℹ Defaulting to https Git protocol ## ● Default Git protocol: 'https' ## GitHub ## ● Default GitHub host: 'https://github.com' ## ● Personal access token for 'https://github.com': <unset> ## x Call `gh_token_help()` for help configuring a token ## Git repo for current project ## ℹ No active usethis project ``` --- class: inverse, center, middle # git and GitHub live demo --- ## Configure git The following will tell git who you are, and other common configuration tasks. ```r usethis::use_git_config( user.name = "Colin Rundel", user.email = "rundel@gmail.com", push.default = "simple" ) ``` You will need to do this configuration once on each machine in which you choose to use git. This can also be done via the terminal with, ```bash $ git config --global user.name "Colin Rundel" $ git config --global user.email "rundel@gmail.com" $ git config --global push.default simple ``` --- ## Configure git verification To verify you configured git correctly, run ```r usethis::git_sitrep() ## Git config (global) ## ● Name: 'Colin Rundel' ## ● Email: 'rundel@gmail.com' ## ● Vaccinated: FALSE ## ℹ See `?git_vaccinate` to learn more ## ● Default Git protocol: 'https' ## GitHub ## ● Default GitHub host: 'https://github.com' ## ● Personal access token for 'https://github.com': <unset> ## x Call `gh_token_help()` for help configuring a token ## Git repo for current project ## ℹ No active usethis project ``` You should see output similar to above. --- ## Configure SSH and GitHub (authentication) We will be authenticating with GitHub using SSH based public / private keys. We can create a new key pair (if necessary) by running the following in RStudio's *console*: ```r credentials::ssh_setup_github() ``` -- ```r ## No SSH key found. Generate one now? ## ## 1: Yes ## 2: No ## ## Selection: 1 ## Generating new RSA keyspair at: /home/guest/.ssh/id_rsa ## Your public key: ## *## ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC/wH7pT3UXdOMJSX2wMaPVTyGnYkS8OPmcfjct6h8Q+44/9UG3sOibjjUCxIxVeCWAoYFB0rDI3/Ljf2EWozLlpeGzAe7xsg6A+MHtUObZnfzXSB/NnOhZymD2u8Nh+py07aojVdKAPBkRH3nHA+rljidc3gXZkqseetYEI1N79OQUshp2P+Qm6Vab4I5OCnfAwLFkR7Sw7J9hvZN1qUmM0DB0WTWSlNmPSMsASMe/6Nz30IRoBh35Z7tgF79rlIW385giCkEeD20Le9EOueGoTWarJWylE1RWnUyig2mZ9JK/rYTw4KBXacPhBwn+MgGC+r8xY5IEX78xkeXW9q2z ## ## Please copy the line above to GitHub: https://github.com/settings/ssh/new ## Would you like to open a browser now? ## ## 1: Yes ## 2: No ## ## Selection: 1 ``` --- ## Getting started today In order to get started, you need to obtain today's files from GitHub. The steps below will give you access. 1. Log in to GitHub 2. Navigate to https://github.com/dukestatsci/computing_bootcamp_2021 3. Fork the repository 4. Copy the link under `Clone or Download` to clone with *SSH*. 5. In RStudio go to `File > New Project > Version Control > Git` 6. Paste the URL, that you copied in step 4, in the box under `Repository URL:` You now should have all the files in the repository in a directory on the server or your own computer. --- ## Version control best practices - Commit early, often, and with complete code. - Write clear and concise commit summary messages. - Test code before you commit. - Use branches. - Communicate with your team. --- ## git and GitHub resources - git's [Pro Git](https://git-scm.com/book/en/v2) book, Chapters [Getting Started](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control) and [Git Basics](https://git-scm.com/book/en/v2/Git-Basics-Getting-a-Git-Repository) will be most useful if you are new to git and GitHub - [Git cheatsheet](https://www.atlassian.com/git/tutorials/atlassian-git-cheatsheet) by Atlassian - GitHub's interactive [tutorial](https://try.github.io/) - [Free online course](https://www.udacity.com/course/version-control-with-git--ud123) from Udacity - [Happy Git with R](http://happygitwithr.com/) by Jenny Bryan --- class: inverse, center, middle # Responsible research and reproducibility --- ## Seizure study retracted after authors realize data got "terribly mixed" - From the authors of **Low Dose Lidocaine for Refractory Seizures in Preterm Neonates**: - *"The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness."* <br/> *Source*: http://retractionwatch.com/2013/02/01/seizure-study-retracted-after-authors-realize-data-got-terribly-mixed/ --- ## Bad spreadsheet merge kills depression paper, quick fix resurrects it - The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct. - **Original conclusion:** Lower levels of CSF IL-6 were associated with current depression and with future depression [...]. - **Revised conclusion:** Higher levels of CSF IL-6 and IL-8 were associated with current depression [...]. <br/> *Source*: http://retractionwatch.com/2014/07/01/bad-spreadsheet-merge-kills-depression-paper-quick-fix-resurrects-it/ --- ## Study of social media retracted when authors can’t provide data - "*A business journal has retracted a 2016 paper about how social media can encourage young consumers to become devoted to particular brands, after discovering flaws in the data and findings.*" - Reasons for retraction: - Error in data - Error in results and/or conclusions - Results not reproducible <br/> *Source*: http://retractionwatch.com/2017/07/31/study-social-media-retracted-authors-cant-provide-data/ --- ## Heart pulls sodium meta-analysis over duplicated, and now missing, data - "*The journal Heart has retracted a 2012 meta-analysis after learning that two of the six studies included in the review contained duplicated data. Those studies, it so happens, were conducted by one of the co-authors.*" - From the retraction notice, "*The Committee considered that without sight of the raw data on which the two papers containing the duplicate data were based, their reliability could not be substantiated. Following inquiries, it turns out that the raw data are no longer available having been lost as a result of computer failure.*" - Reasons for retraction: - Duplication of data - Results not reproducible <br/> *Source*: http://retractionwatch.com/2013/05/02/heart-pulls-sodium-meta-analysis-over-duplicated-and-now-missing-data/ --- ## Teaching Reproducibility 1. Convince researchers to adopt a reproducible research workflow. 2. Train new researchers who don’t have any other workflow. --- ## Donald Knuth "Literate Programming" (1983) "Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a *computer* what to do, let us concentrate rather on explaining to *human beings* what we want a computer to do." "The practitioner of literate programming [...] strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other." - These ideas have been around for years! - Tools for putting them to practice have also been around. - They have never been as accessible as the current tools. --- ## Reproducibility checklist - Are the tables and figures reproducible from the code and data? <br><br> - Does the code actually do what you think it does? <br><br> - In addition to what was done, is it clear *why* it was done? (e.g., how were parameter settings chosen?) <br><br> - Can the code be used for other data, especially future updates to the current data? <br><br> - Can you extend the code to do other things? <br><br> --- ## Ambitious goal We need an environment where - data, analysis, and results are tightly connected, or better yet, inseparable, - reproducibility is built in, + the original data remains untouched + all data manipulations and analyses are inherently documented - documentation is human readable and syntax is minimal. --- ## Toolkits <img src="images/rmarkdown_logo.png" width="20%" style="display: block; margin: auto;" /><img src="images/jupyter.png" width="20%" style="display: block; margin: auto;" /> --- ## Reproducible data analysis - Scriptability `\(\rightarrow\)` R *or* Python - Literate programming `\(\rightarrow\)` R Markdown *or* Jupyter Notebooks - Version control `\(\rightarrow\)` git / GitHub -- <br><br> Could these tools have prevented some of the aforementioned retractions? --- ## What is markdown? - Markdown is a lightweight markup language for creating HTML (and other formatted) documents. - Markup languages are designed to produce documents from human readable text (and annotations). - Some of you may be familiar with LaTeX. This is another (less human friendly) markup language for creating pdf documents. - Why markdown is great: - Easy to learn and use. - Focus on **content**, rather than **coding** and debugging **errors**. - Once you have the basics down, you can get fancy via HTML, JavaScript, and CSS. - Used by both RMarkdown and Jupyter Notebooks <br/> R supports R Markdown - an authoring framework for data science and statistical analysis. --- ## What is R Markdown? <center> <iframe src="https://player.vimeo.com/video/178485416?color=428bca" width="640" height="400" frameborder="0" allow="autoplay; fullscreen" allowfullscreen></iframe> </center> <p><a href="https://vimeo.com/178485416">What is R Markdown?</a> from <a href="https://vimeo.com/rstudioinc">RStudio, Inc.</a> on <a href="https://vimeo.com">Vimeo</a>.</p> --- ## R Markdown .pull-left[ **Something simple** <br/><br/> ![](images/simple-rmd.png) ] .pull-right[ **Something fancy** <br/><br/> ![](images/fancy-rmd.png) ] --- ## R Markdown resources - In RStudio, go to `Help > Cheatsheets` and select - R Markdown Cheat Sheet - R Markdown Reference Guide - Check out the official R Markdown book: [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/) by Yihui Xie, J. J. Allaire, and Garrett Grolemund - Check out [bookdown: Authoring Books and Technical Documents with R Markdown](https://bookdown.org/yihui/bookdown/) by Yihui Xie. - Take a look at [RPubs](http://rpubs.com/) web published R Markdown documents. --- class: inverse, center, middle # RMarkdown demo --- ## R packages - Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. - In the following exercises we'll use the `tidyverse` package. - The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. - The core tidyverse packages consists of `ggplot2`, `tibble`, `tidyr`, `readr`, `purrr`, `dplyr`, `stringr`, and `forcats` packages. - This package is already installed for you on the DSS servers. If needed, you can install it by running the following in the `Console`: ```r install.packages("tidyverse") ``` You only need to install a package once, but you must load it with function `library()` each R session. --- ## A note on environments - Your R Markdown document and your Console do not share their global environments. - This is good for reproducibility, but can sometimes result in frustrating errors. - This also means any packages or data needed for your analysis need to be loaded in your R Markdown document as well. --- ## Unvotes data analysis To get started, - open `examples/unvotes.Rmd`, - try knitting the entire document and examine the results. - try changing one or more of the selected countries, re-knit the document and observe any changes. - commit and push your changes to GitHub (include the newly generated `unvotes.html` file as well) <br/> --- ## R Markdown suggestions - Remember to name your code chunks - Familiarize yourself with chunk options (https://yihui.org/knitr/options/) - Use global chunk options to reduce duplication - Load packages at the start of a document, generally the chunk after your set-up chunk - Familiarize yourself with various outputs: Make slides with `output: ioslides_presentation` or `xaringan`, make websites with `blogdown`, author a book with `bookdown`, etc. <br/><br/> These slides were made with R Markdown and `xaringan`. --- ## R programming resources - Style - [Tidyverse style guide](http://style.tidyverse.org/) - [Google's R style guide](https://google.github.io/styleguide/Rguide.html) - Beginner - [swirl](https://swirlstats.com/): swirl teaches you R programming and data science interactively, at your own pace, and right in the R console - [R manuals](https://cran.r-project.org/manuals.html) - [R for Data Science](https://r4ds.had.co.nz) by Hadley Wickham and Garret Grolemund - [R Cookbook](https://www.amazon.com/gp/product/0596809158/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0596809158&linkCode=as2&tag=cooforr09-20) by Paul Teetor - Next steps - [Advanced R](https://adv-r.hadley.nz/) by Hadley Wickham - [R Packages](http://r-pkgs.had.co.nz/) by Hadley Wickham - Miscellaneous - All available [CRAN packages](https://cran.r-project.org/web/packages/available_packages_by_name.html), sorted by name --- ## More R / RStudio resources - Some useful resources from RStudio: https://www.rstudio.com/resources/cheatsheets/ - RStudio IDE Cheat Sheet - R Markdown Cheat Sheet - R Markdown Reference Guide - Data Import Cheat Sheet - Data Transformation Cheat Sheet - Data Visualization Cheat Sheet Some of the above cheat sheets are available in RStudio: `Help > Cheatsheets` --- class: inverse, center, middle # Jupyter notebook demo --- ## Why python? .center.middle[ ![](images/kd_nuggets.png) ] *Source*: https://www.kdnuggets.com/2020/06/data-science-tools-popularity-animated.html --- ## Stack Overflow trends To see how technologies have trended over time based on use of their tags since 2008 we can look at Stack Overflow trends. <img src="images/stack_overflow.png" width="66%" style="display: block; margin: auto;" /> --- ## RStudio Workbench + Jupyter If you return to http://rstudio.stat.duke.edu:8787 you can launch a new session and select Jupyter Lab as your editor of choice. <br/><br/> <img src="images/workbench_jupyter.png" width="66%" style="display: block; margin: auto;" /> --- ## Overview of the notebook Bimodal interface: edit mode and command mode <br> Click in a cell or hit `enter` to enter edit mode .center[ ![](images/edit-mode.png) ] When in edit mode you can type code or write text with markdown. <br> Hit `esc` to enter command mode .center[ ![](images/command_mode.png) ] When in command mode you can make edits to the notebook, but not individual cells. --- ## Notebook shortcuts In **edit** mode: - Run cell and add new cell: `shift + enter` - Add a line within a cell: `enter` -- In **command** mode: - Save the notebook: `s` - Change cell to markdown: `m` - Change cell to code: `y` - Cut, copy, paste, delete a cell: `x`, `c`, `v`, `d` - Add a cell above, below: `a`, `b` <br><br> The point-and-click interface is also an option to execute these commands. --- ## Jupyter and Terminal - Jupyter Lab provides a direct interface to the terminal (similar to RStudio) under Launcher > Other - Terminal commands can also be included in notebooks by prefixing with `!`, e.g. ```python !pip install --user statsmodels ``` --- ## Jupyter and Git - The departmental server has the git jupyter lab [extension](https://github.com/jupyterlab/jupyterlab-git) installed. - This provides a GUI similar to RStudio's for interacting with Git repositories - Navigate to a repo's root directory and then switch to the Git pane. --- ## Unvotes data analysis To get started, - open `examples/unvotes.ipynb`, - Render the entire document by selecting Run > Run All Cells - Try changing one or more of the selected countries, rerunning the document, observe any changes. - commit and push your changes to GitHub (include the newly generated `unvotes.html` file as well) <br/> --- ## Jupyter notebook versus R Markdown - Similar to R Markdown, Jupyter notebooks allow you to write code and text in one easy to read document that is reproducible and easy to share with others. - A Jupyter notebook does not knit to an HTML, PDF or Word file. However, you can embed HTML into a notebook. - Exports are possible with packages like `nbcovert` - For a more detailed comparison see [The First Notebook War](https://yihui.name/en/2018/09/notebook-war/). --- ## Additional Python resources - Style - [PEP 8](https://www.python.org/dev/peps/pep-0008/): standard Python style - [PEP 257](https://www.python.org/dev/peps/pep-0257/): documentation conventions - Beginner - [Python](https://www.python.org/): official documentation and tutorial - [Jupyter](https://jupyter.org/): web notebook interface, reproducible research - [A Byte of Python](https://python.swaroopch.com/) - [Python Crash Course](https://ehmatthes.github.io/pcc/) - [Python Crash Course - Cheat Sheets](https://ehmatthes.github.io/pcc/cheatsheets/README.html) - Next steps - [Python Package Index](https://pypi.org) - [Problem Solving with Algorithms and Data Structures using Python](https://runestone.academy/runestone/books/published/pythonds/index.html) - Miscellaneous - [Python 3 Module of the Week](https://pymotw.com/3/)