Responsible research and reproducibility

2022 DSS Bootcamp

Colin Rundel

08-25-22

Seizure study retracted after authors realize data got “terribly mixed”

From the authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates:
“The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.”

Bad spreadsheet merge kills depression paper, quick fix resurrects it

The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.
Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].
Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].

Heart pulls sodium meta-analysis over duplicated, and now missing, data

“The journal Heart has retracted a 2012 meta-analysis after learning that two of the six studies included in the review contained duplicated data. Those studies, it so happens, were conducted by one of the co-authors.”
From the retraction notice, “The Committee considered that without sight of the raw data on which the two papers containing the duplicate data were based, their reliability could not be substantiated. Following inquiries, it turns out that the raw data are no longer available having been lost as a result of computer failure.”
Reasons for retraction:
- Duplication of data
- Results not reproducible

Teaching Reproducibility

Convince researchers to adopt a reproducible research workflow.
Train new researchers who don’t have any other workflow.

Reproducibility checklist

Are the tables and figures reproducible from the code and data?
Does the code actually do what you think it does?
In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)
Can the code be used for other data, especially future updates to the current data?
Can you extend the code to do other things?
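
A few of these checklist items can be made concrete in code. The following is a minimal sketch and not part of the bootcamp materials: the function `bootstrap_mean`, its parameters, and the sample values are hypothetical. It fixes a random seed so results can be regenerated, records why a parameter value was chosen, and checks that the code does what we think it does.

```python
import random
import statistics


def bootstrap_mean(values, n_resamples=1000, seed=2022):
    """Return a bootstrap estimate of the mean of `values`.

    n_resamples=1000 is an illustrative default chosen only to keep the
    example fast; the seed is fixed so reruns give identical output.
    """
    rng = random.Random(seed)  # local generator: no hidden global state
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(values, k=len(values))
        means.append(statistics.fmean(resample))
    return statistics.fmean(means)


# Made-up numbers, used only to exercise the function.
data = [2.1, 3.4, 1.9, 5.0, 4.2]

# Same inputs and seed must give the same answer (reproducible).
first = bootstrap_mean(data)
second = bootstrap_mean(data)
assert first == second

# Sanity check: the estimate must lie within the range of the data.
assert min(data) <= first <= max(data)
```

Because the function takes its data as an argument, it can be rerun unchanged when the data are updated.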

Ambitious goal

We need an environment where

data, analysis, and results are tightly connected, or better yet, inseparable,
reproducibility is built in,
- the original data remains untouched
- all data manipulations and analyses are inherently documented
documentation is human readable and syntax is minimal.
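
One way to approach "the original data remains untouched" is to read the raw file only, write every derived result to a new file, and verify with a checksum that the raw file did not change. This is a minimal sketch under assumptions: the file names, the `value` column, and the helpers `sha256_of` and `clean_copy` are hypothetical.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def clean_copy(raw_path, out_path):
    """Write a cleaned copy of raw_path to out_path; never modify raw_path."""
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Documented manipulation: drop rows with a missing value.
            if row["value"].strip() != "":
                writer.writerow(row)


# Demo in a temporary directory so the example is self-contained.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.csv"
raw.write_text("id,value\n1,3.2\n2,\n3,4.8\n")  # made-up data

before = sha256_of(raw)
clean_copy(raw, workdir / "clean.csv")
after = sha256_of(raw)

assert before == after  # the original data is untouched
print((workdir / "clean.csv").read_text())
```

The cleaning rule lives in the script, so the manipulation is documented by the code that performs it.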

Donald Knuth “Literate Programming” (1983)

“Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

“The practitioner of literate programming […] strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.”

These ideas have been around for years!
Tools for putting them to practice have also been around.
They have never been as accessible as the current tools.

Toolkits

Reproducible data analysis

Scriptability → R / Python
Literate programming → R Markdown / Jupyter Notebooks / Quarto
Version control → git / GitHub

Could these tools have prevented some of the aforementioned retractions?
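
As one possible answer for the merge-related retractions: a scripted merge can check identification codes before combining tables, so a mismatch stops the analysis instead of silently propagating. The sketch below uses plain Python. The function `merge_on_id`, the column names, and all values are hypothetical and not taken from any of the studies.

```python
def merge_on_id(left, right, key="id"):
    """Inner-join two lists of dicts on `key`, refusing ambiguous input.

    Raises ValueError if a key is duplicated within a table or present in
    only one table, instead of silently producing a mismatched data set.
    """
    def index(rows, name):
        seen = {}
        for row in rows:
            if row[key] in seen:
                raise ValueError(f"duplicate {key}={row[key]!r} in {name}")
            seen[row[key]] = row
        return seen

    left_idx, right_idx = index(left, "left"), index(right, "right")
    unmatched = set(left_idx) ^ set(right_idx)
    if unmatched:
        raise ValueError(f"unmatched {key} values: {sorted(unmatched)}")
    return [{**left_idx[k], **right_idx[k]} for k in left_idx]


# Made-up records; note the two tables list the ids in different orders.
labs = [{"id": 101, "lab_value": 1.8}, {"id": 102, "lab_value": 2.4}]
survey = [{"id": 102, "survey_score": 3}, {"id": 101, "survey_score": 7}]

merged = merge_on_id(labs, survey)
for row in merged:
    print(row)
```

Rows are matched by id rather than by position, which is the kind of error a copy-and-paste merge in a spreadsheet can introduce without warning.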

What is markdown?

Markdown is a lightweight markup language for creating HTML (and other formatted) documents.
Markup languages are designed to produce documents from human readable text (and annotations).
Some of you may be familiar with LaTeX, another (less human-friendly) markup language, used for creating PDF documents.
Why markdown is great:
- Easy to learn and use.
- Focus on content, rather than coding and debugging errors.
- Once you have the basics down, you can get fancy via HTML, JavaScript, and CSS.
- Used by R Markdown, Jupyter Notebooks, and Quarto.

What is Quarto?

rstudio::conf 2022 Keynotes - Hello Quarto: Share • Collaborate • Teach • Reimagine - RStudio

R Markdown / Quarto

Something simple

Something fancy

R Markdown resources

In RStudio, go to Help > Cheatsheets and select
- R Markdown Cheat Sheet
- R Markdown Reference Guide
Check out the official R Markdown book: R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, and Garrett Grolemund
Check out bookdown: Authoring Books and Technical Documents with R Markdown by Yihui Xie.
Take a look at RPubs web published R Markdown documents.

Quarto resources

Much of the syntax is shared with R Markdown, so the previous resources are a good place to start
quarto.org
Tom Mock’s Intro to Quarto webinar
RStudioConf 2022 workshops
- Getting Started with Quarto
- From R Markdown to Quarto

Quarto / RMarkdown demo

R packages

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.
In the following exercises we’ll use the tidyverse package.
- The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
- The core tidyverse packages are ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats.
This package is already installed for you on the DSS servers. If needed, you can install it by running the following in the Console:
```
install.packages("tidyverse")
```

You only need to install a package once, but you must load it with library() in each new R session.

A note on environments

Your R Markdown document and your Console do not share their global environments.
This is good for reproducibility, but can sometimes result in frustrating errors.
This also means any packages or data needed for your analysis need to be loaded in your R Markdown document as well.
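
The same idea can be illustrated outside R. The sketch below is an analogy in Python, not a description of how knitr or Quarto is implemented: it runs a snippet of code in a fresh, empty namespace, so an object that exists only in the interactive session is not visible to it. All variable names and values are hypothetical.

```python
# Stand-in for an object created interactively in the Console.
console_env = {"countries": ["A", "B"]}

# Stand-in for document code that forgot to create `countries` itself.
document_code = "n = len(countries)"

# Rendering starts from a fresh environment, not from console_env.
document_env = {}
try:
    exec(document_code, document_env)
    outcome = "rendered"
except NameError as err:
    outcome = f"failed: {err}"

print(outcome)

# The fix: define everything the document needs inside the document.
fixed_code = 'countries = ["A", "B"]\nn = len(countries)'
fixed_env = {}
exec(fixed_code, fixed_env)
assert fixed_env["n"] == 2
```

The failure is the useful part: a document that only works because of leftover objects in the Console is not reproducible.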

Unvotes data analysis

To get started,

open examples/unvotes.qmd,
try rendering the entire document and examine the results,
try changing one or more of the selected countries, re-render the document, and observe any changes,
commit and push your changes to GitHub (include the newly generated unvotes.html file as well).

R Markdown / quarto suggestions

Remember to name your code chunks
Familiarize yourself with chunk options (https://yihui.org/knitr/options/)
- Use global chunk options to reduce duplication
- Using #| syntax enables tab completion for chunk options
Load packages at the start of a document, generally the chunk after your setup chunk
Familiarize yourself with various output formats: Make slides with revealjs, pdfs, books, etc.

R programming resources

Style
- Tidyverse style guide
- Google’s R style guide
Beginner
- swirl: swirl teaches you R programming and data science interactively, at your own pace, and right in the R console
- R manuals
- R for Data Science by Hadley Wickham and Garrett Grolemund
- R Cookbook by Paul Teetor
Next steps
- Advanced R by Hadley Wickham
- R Packages by Hadley Wickham
Miscellaneous
- All available CRAN packages, sorted by name

More R / RStudio resources

Some useful resources from RStudio: https://www.rstudio.com/resources/cheatsheets/
- RStudio IDE Cheat Sheet
- R Markdown Cheat Sheet
- R Markdown Reference Guide
- Data Import Cheat Sheet
- Data Transformation Cheat Sheet
- Data Visualization Cheat Sheet

Some of the above cheat sheets are available in RStudio: Help > Cheatsheets

Jupyter notebook demo

Why python?

Source: https://www.kdnuggets.com/2020/06/data-science-tools-popularity-animated.html

Stack Overflow trends

To see how technologies have trended over time based on use of their tags since 2008 we can look at Stack Overflow trends.

RStudio Workbench + Jupyter

If you return to http://rstudio.stat.duke.edu:8787 you can launch a new session and select Jupyter Lab as your editor of choice.

Overview of the notebook

Bimodal interface: edit mode and command mode

Click in a cell or hit enter to enter edit mode

When in edit mode you can type code or write text with markdown.

Hit esc to enter command mode

When in command mode you can make edits to the notebook, but not individual cells.

Notebook shortcuts

In edit mode:

Run cell and add new cell: shift + enter
Add a line within a cell: enter

In command mode:

Save the notebook: s
Change cell to markdown: m
Change cell to code: y
Cut, copy, paste a cell: x, c, v
Delete a cell: d, d (press d twice)
Add a cell above, below: a, b

The point-and-click interface is also an option to execute these commands.

Jupyter and Terminal

Jupyter Lab provides a direct interface to the terminal (similar to RStudio) under Launcher > Other
Terminal commands can also be included in notebooks by prefixing with !, e.g.

!pip install --user statsmodels
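
Before installing from a notebook, it can help to check whether a package is already importable. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the top-level `module_name` can be imported here."""
    return importlib.util.find_spec(module_name) is not None


for name in ["json", "statsmodels"]:
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")

# The interpreter the kernel is running; `!pip` may target a different one.
print(sys.executable)
```

In IPython-based kernels, `%pip install` is an alternative to `!pip install` that is intended to install into the environment of the running kernel.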

Jupyter and Git

The departmental server has the git jupyter lab extension installed.
This provides a GUI similar to RStudio’s for interacting with Git repositories
Navigate to a repo’s root directory and then switch to the Git pane.

Unvotes data analysis

To get started,

open examples/unvotes.ipynb,
run the entire document by selecting Run > Run All Cells,
try changing one or more of the selected countries, rerun the document, and observe any changes,
commit and push your changes to GitHub (include the newly generated unvotes.html file as well).

Jupyter notebook versus R Markdown

Similar to R Markdown, Jupyter notebooks allow you to write code and text in one easy to read document that is reproducible and easy to share with others.
Jupyter notebooks include the text, code and computational output.
A Jupyter notebook does not knit to an HTML, PDF or Word file. However, you can embed HTML into a notebook.
- Exports are possible with packages like nbconvert
For a more detailed comparison see The First Notebook War.
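
To see what "text, code and computational output" in one file means in practice: an .ipynb file is a JSON document holding a list of cells. The sketch below builds a hand-written approximation of the nbformat 4 layout and inspects it with the standard library. The cell contents are made up for illustration, and real notebooks carry more metadata.

```python
import json

# A hand-written, minimal notebook in the nbformat 4 layout (assumed shape).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Example"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    "data": {"text/plain": ["2"]},
                }
            ],
        },
    ],
}

# Round-trip through JSON text, as if reading an .ipynb file from disk.
loaded = json.loads(json.dumps(notebook))

for cell in loaded["cells"]:
    n_outputs = len(cell.get("outputs", []))
    print(cell["cell_type"], "|", "".join(cell["source"]), "|", n_outputs)
```

Because stored outputs are part of the file, a notebook can display results that no longer match its code unless all cells are rerun in order.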

Jupyter notebook + quarto

Quarto was built to unify the scientific publishing process across the most commonly used notebook formats, and this includes Jupyter notebooks.

Specifically, Quarto has a couple of neat tricks:
- Renders ipynb files using the jupyter engine
- Converts between ipynb and qmd files seamlessly
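
Both tricks are exposed through the Quarto command line as `quarto render` and `quarto convert`. The sketch below only builds the commands and runs them if a `quarto` executable is found on the PATH. The helper names `quarto_command` and `run_if_available` are hypothetical, and the file path is the one from the exercise.

```python
import shutil
import subprocess


def quarto_command(action, path):
    """Build a `quarto render` or `quarto convert` command as a list."""
    if action not in {"render", "convert"}:
        raise ValueError(f"unsupported action: {action}")
    return ["quarto", action, path]


def run_if_available(cmd):
    """Run cmd only when its executable is on PATH; otherwise do nothing."""
    if shutil.which(cmd[0]) is None:
        print(f"{cmd[0]} not found on PATH; command not run")
        return None
    return subprocess.run(cmd, check=True)


# Adjust the path to wherever the notebook lives.
render_cmd = quarto_command("render", "examples/unvotes.ipynb")
convert_cmd = quarto_command("convert", "examples/unvotes.ipynb")
print(" ".join(render_cmd))
print(" ".join(convert_cmd))

# To actually run one of them: run_if_available(render_cmd)
```

Passing the command as a list avoids shell quoting problems when a path contains spaces.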

Additional Python resources

Style
- PEP 8: standard Python style
- PEP 257: documentation conventions
Beginner
- Python: official documentation and tutorial
- Jupyter: web notebook interface, reproducible research
- A Byte of Python
- Python Crash Course
- Python Crash Course - Cheat Sheets
Next steps
- Python Package Index
- Problem Solving with Algorithms and Data Structures using Python
Miscellaneous
- Python 3 Module of the Week
