Responsible research
and reproducibility

2022 DSS Bootcamp

Colin Rundel

08-25-22

Seizure study retracted after authors realize data got “terribly mixed”

  • From the authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates:

  • “The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.”

Bad spreadsheet merge kills depression paper, quick fix resurrects it

  • The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.

  • Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].

  • Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].

Study of social media retracted when authors can’t provide data

  • A business journal has retracted a 2016 paper about how social media can encourage young consumers to become devoted to particular brands, after discovering flaws in the data and findings.

  • Reasons for retraction:

    • Error in data
    • Error in results and/or conclusions
    • Results not reproducible

Heart pulls sodium meta-analysis over duplicated, and now missing, data

  • The journal Heart has retracted a 2012 meta-analysis after learning that two of the six studies included in the review contained duplicated data. Those studies, it so happens, were conducted by one of the co-authors.

  • From the retraction notice, “The Committee considered that without sight of the raw data on which the two papers containing the duplicate data were based, their reliability could not be substantiated. Following inquiries, it turns out that the raw data are no longer available having been lost as a result of computer failure.

  • Reasons for retraction:

    • Duplication of data
    • Results not reproducible

Teaching Reproducibility

  1. Convince researchers to adopt a reproducible research workflow.

  2. Train new researchers who don’t have any other workflow.

Reproducibility checklist

  • Are the tables and figures reproducible from the code and data?

  • Does the code actually do what you think it does?

  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)

  • Can the code be used for other data, especially future updates to the current data?

  • Can you extend the code to do other things?

Ambitious goal

We need an environment where

  • data, analysis, and results are tightly connected, or better yet, inseparable,

  • reproducibility is built in,

    • the original data remains untouched
    • all data manipulations and analyses are inherently documented
  • documentation is human readable and syntax is minimal.

Donald Knuth “Literate Programming” (1983)

“Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

“The practitioner of literate programming […] strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.”

  • These ideas have been around for years!

  • Tools for putting them to practice have also been around.

  • They have never been as accessible as the current tools.

Toolkits

Reproducible data analysis

  • Scriptability \(\rightarrow\) R / Python

  • Literate programming \(\rightarrow\) R Markdown / Jupyter Notebooks / Quarto

  • Version control \(\rightarrow\) git / GitHub



Could these tools have prevented some of the aforementioned retractions?

What is markdown?

  • Markdown is a lightweight markup language for creating HTML (and other formatted) documents.

  • Markup languages are designed to produce documents from human readable text (and annotations).

  • Some of you may be familiar with LaTeX. This is another (less human friendly) markup language for creating pdf documents.

  • Why markdown is great:

    • Easy to learn and use.
    • Focus on content, rather than coding and debugging errors.
    • Once you have the basics down, you can get fancy via HTML, JavaScript, and CSS.
    • Used by RMarkdown, Jupyter Notebooks, and Quarto

What is Quarto?

R Markdown / Quarto

Something simple

Something fancy

R Markdown resources

Quarto resources

Quarto / RMarkdown demo

R packages

  • Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.

  • In the following exercises we’ll use the tidyverse package.

    • The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
    • The core tidyverse packages consists of ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats packages.
  • This package is already installed for you on the DSS servers. If needed, you can install it by running the following in the Console:

    install.packages("tidyverse")

You only need to install a package once, but you must load it with function library() each R session.

A note on environments

  • Your R Markdown document and your Console do not share their global environments.

  • This is good for reproducibility, but can sometimes result in frustrating errors.

  • This also means any packages or data needed for your analysis need to be loaded in your R Markdown document as well.

Unvotes data analysis

To get started,

  • open examples/unvotes.qmd,

  • try Rendering the entire document and examine the results.

  • try changing one or more of the selected countries, re-knit the document and observe any changes.

  • commit and push your changes to GitHub (include the newly generated unvotes.html file as well)

R Markdown / quarto suggestions

  • Remember to name your code chunks

  • Familiarize yourself with chunk options (https://yihui.org/knitr/options/)

    • Use global chunk options to reduce duplication
    • Using #| syntax enables tab completion for chunk options
  • Load packages at the start of a document, generally the chunk after your setup chunk

  • Familiarize yourself with various output formats: Make slides with revealjs, pdfs, books, etc.

R programming resources

More R / RStudio resources

  • Some useful resources from RStudio: https://www.rstudio.com/resources/cheatsheets/

    • RStudio IDE Cheat Sheet
    • R Markdown Cheat Sheet
    • R Markdown Reference Guide
    • Data Import Cheat Sheet
    • Data Transformation Cheat Sheet
    • Data Visualization Cheat Sheet

Some of the above cheat sheets are available in RStudio: Help > Cheatsheets

Jupyter notebook demo

Why python?

.center.middle[

]

Source: https://www.kdnuggets.com/2020/06/data-science-tools-popularity-animated.html

RStudio Workbench + Jupyter

If you return to http://rstudio.stat.duke.edu:8787 you can launch a new session and select Jupyter Lab as your editor of choice.



Overview of the notebook

Bimodal interface: edit mode and command mode


Click in a cell or hit enter to enter edit mode .center[ ]

When in edit mode you can type code or write text with markdown.


Hit esc to enter command mode .center[ ]

When in command mode you can make edits to the notebook, but not individual cells.

Notebook shortcuts

In edit mode:

  • Run cell and add new cell: shift + enter
  • Add a line within a cell: enter

In command mode:

  • Save the notebook: s
  • Change cell to markdown: m
  • Change cell to code: y
  • Cut, copy, paste, delete a cell: x, c, v, d
  • Add a cell above, below: a, b



The point-and-click interface is also an option to execute these commands.

Jupyter and Terminal

  • Jupyter Lab provides a direct interface to the terminal (similar to RStudio) under Launcher > Other

  • Terminal commands can also be included in notebooks by prefixing with !, e.g.

!pip install --user statsmodels

Jupyter and Git

  • The departmental server has the git jupyter lab extension installed.

  • This provides a GUI similar to RStudio’s for interacting with Git repositories

  • Navigate to a repo’s root directory and then switch to the Git pane.

Unvotes data analysis

To get started,

  • open examples/unvotes.ipynb,

  • Render the entire document by selecting Run > Run All Cells

  • Try changing one or more of the selected countries, rerunning the document, observe any changes.

  • commit and push your changes to GitHub (include the newly generated unvotes.html file as well)

Jupyter notebook versus R Markdown

  • Similar to R Markdown, Jupyter notebooks allow you to write code and text in one easy to read document that is reproducible and easy to share with others.

  • Jupyter notebooks include the text, code and computational output.

  • A Jupyter notebook does not knit to an HTML, PDF or Word file. However, you can embed HTML into a notebook.

    • Exports are possible with packages like nbcovert
  • For a more detailed comparison see The First Notebook War.

Jupyter notebook + quarto

Quarto was build to unify the scientific publish process across the most commonly used notebook formats and this include Jupyter notebooks.

Specifically, quarto has a couple of neat tricks: - Render ipynb files using the jupyter engine

  • Converts between ipynb and qmd files seamlessly

Additional Python resources