2022 DSS Bootcamp
Colin Rundel
08-25-22
From the authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates:
“The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.”
The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.
Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].
Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].
“A business journal has retracted a 2016 paper about how social media can encourage young consumers to become devoted to particular brands, after discovering flaws in the data and findings.”
Reasons for retraction:
“The journal Heart has retracted a 2012 meta-analysis after learning that two of the six studies included in the review contained duplicated data. Those studies, it so happens, were conducted by one of the co-authors.”
From the retraction notice, “The Committee considered that without sight of the raw data on which the two papers containing the duplicate data were based, their reliability could not be substantiated. Following inquiries, it turns out that the raw data are no longer available having been lost as a result of computer failure.”
Reasons for retraction:
Convince researchers to adopt a reproducible research workflow.
Train new researchers who don’t have any other workflow.
We need an environment where
data, analysis, and results are tightly connected, or better yet, inseparable,
reproducibility is built in,
documentation is human readable and syntax is minimal.
“Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
“The practitioner of literate programming […] strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.”
These ideas have been around for years!
Tools for putting them to practice have also been around.
They have never been as accessible as the current tools.
Scriptability \(\rightarrow\) R / Python
Literate programming \(\rightarrow\) R Markdown / Jupyter Notebooks / Quarto
Version control \(\rightarrow\) git / GitHub
Could these tools have prevented some of the aforementioned retractions?
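As a rough illustration of how a scripted workflow can surface identifier problems before results are computed, the sketch below joins two small record sets on an ID code and reports IDs that appear in only one source. The data, field names, and the `merge_by_id` helper are all hypothetical and are not taken from any of the retracted papers; this shows one kind of check, not a claim about what would have prevented those errors.

```python
# Hypothetical records from two sources, keyed by an identification code.
lab_results = {
    "P001": {"il6": 3.1},
    "P002": {"il6": 5.4},
    "P003": {"il6": 2.2},
}
survey = {
    "P001": {"depressed": False},
    "P002": {"depressed": True},
    "P004": {"depressed": False},
}


def merge_by_id(left, right):
    """Inner-join two dicts of records on their keys and list unmatched IDs."""
    shared = left.keys() & right.keys()
    matched = {key: {**left[key], **right[key]} for key in shared}
    only_left = sorted(left.keys() - right.keys())
    only_right = sorted(right.keys() - left.keys())
    return matched, only_left, only_right


merged, only_lab, only_survey = merge_by_id(lab_results, survey)

# Report mismatches explicitly instead of silently dropping or mixing rows.
print("merged IDs:", sorted(merged))
print("only in lab results:", only_lab)
print("only in survey:", only_survey)
```

Because the check lives in a script, it is rerun every time the analysis is rerun, and version control records when it changed.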
Markdown is a lightweight markup language for creating HTML (and other formatted) documents.
Markup languages are designed to produce documents from human readable text (and annotations).
Some of you may be familiar with LaTeX. This is another (less human friendly) markup language for creating PDF documents.
Why markdown is great:
Something simple
Something fancy
In RStudio, go to Help > Cheatsheets
and select
Check out the official R Markdown book: R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, and Garrett Grolemund
Check out bookdown: Authoring Books and Technical Documents with R Markdown by Yihui Xie.
Take a look at RPubs for web-published R Markdown documents.
Much of the syntax is shared with R Markdown, so the previous resources are a good place to start
Tom Mock’s Intro to Quarto webinar
RStudioConf 2022 workshops
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.
In the following exercises we'll use the tidyverse package, which includes the ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats packages.
This package is already installed for you on the DSS servers. If needed, you can install it by running install.packages("tidyverse") in the Console.
You only need to install a package once, but you must load it with the library() function in each R session.
Your R Markdown document and your Console do not share their global environments.
This is good for reproducibility, but can sometimes result in frustrating errors.
This also means any packages or data needed for your analysis need to be loaded in your R Markdown document as well.
To get started,
open examples/unvotes.qmd,
try rendering the entire document and examine the results,
try changing one or more of the selected countries, re-render the document, and observe any changes,
commit and push your changes to GitHub (include the newly generated unvotes.html file as well).
Remember to name your code chunks
Familiarize yourself with chunk options (https://yihui.org/knitr/options/)
The #| syntax enables tab completion for chunk options
Load packages at the start of a document, generally the chunk after your setup chunk
Familiarize yourself with various output formats: make slides with revealjs, PDFs, books, etc.
Style
Beginner
Next steps
Miscellaneous
Some useful resources from RStudio: https://www.rstudio.com/resources/cheatsheets/
Some of the above cheat sheets are available in RStudio: Help > Cheatsheets
Source: https://www.kdnuggets.com/2020/06/data-science-tools-popularity-animated.html
To see how technologies have trended over time based on use of their tags since 2008 we can look at Stack Overflow trends.
If you return to http://rstudio.stat.duke.edu:8787 you can launch a new session and select Jupyter Lab as your editor of choice.
Bimodal interface: edit mode and command mode
Click in a cell or hit enter to enter edit mode.
When in edit mode you can type code or write text with markdown.
Hit esc to enter command mode.
When in command mode you can make edits to the notebook, but not individual cells.
In edit mode:
shift + enter
enter
In command mode:
s
m
y
x, c, v, d
a, b
The point-and-click interface is also an option to execute these commands.
Jupyter Lab provides a direct interface to the terminal (similar to RStudio) under Launcher > Other
Terminal commands can also be included in notebooks by prefixing them with !.
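A minimal sketch of the idea follows. The `!` prefix is notebook (IPython) syntax rather than plain Python, so it appears here only in a comment; the command shown is an arbitrary illustration, not the one from the original slide. The runnable part uses the standard library's `subprocess` module to do roughly the same thing from ordinary Python code.

```python
import subprocess
import sys

# In a notebook cell you could write a shell command directly, for example:
#   !python -c "print('hello from the shell')"
#
# Outside a notebook, a comparable call can be made with subprocess.
# sys.executable is used so the example does not depend on a particular shell.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from the shell')"],
    capture_output=True,  # collect stdout/stderr instead of printing directly
    text=True,            # decode the output as str rather than bytes
    check=True,           # raise an error if the command fails
)

print(result.stdout.strip())
```

The shell escape is convenient for quick, interactive checks; `subprocess` is the more explicit choice when the command's output needs to be used by later code.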
The departmental server has the Git extension for Jupyter Lab installed.
This provides a GUI similar to RStudio’s for interacting with Git repositories
Navigate to a repo’s root directory and then switch to the Git pane.
To get started,
open examples/unvotes.ipynb,
render the entire document by selecting Run > Run All Cells,
try changing one or more of the selected countries, rerun the document, and observe any changes,
commit and push your changes to GitHub (include the newly generated unvotes.html file as well).
Similar to R Markdown, Jupyter notebooks allow you to write code and text in one easy to read document that is reproducible and easy to share with others.
Jupyter notebooks include the text, code and computational output.
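To make "text, code and computational output" concrete: an .ipynb file is a JSON document whose cells carry all three. The sketch below builds a tiny notebook structure by hand, using only the standard library. The cell contents are invented for illustration, and the structure is a simplified reading of the version 4 notebook format rather than a complete, validated notebook.

```python
import json

# A hand-built, minimal notebook: one markdown cell and one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",  # the text
            "metadata": {},
            "source": ["# A tiny analysis\n", "Adding two numbers."],
        },
        {
            "cell_type": "code",  # the code
            "execution_count": 1,
            "metadata": {},
            "source": ["print(1 + 1)"],
            # the computational output is stored alongside the code
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

# Serialize the way it would be stored on disk, then read it back.
as_text = json.dumps(notebook, indent=1)
loaded = json.loads(as_text)

for cell in loaded["cells"]:
    print(cell["cell_type"], "->", "".join(cell["source"]))
    for output in cell.get("outputs", []):
        print("  stored output:", "".join(output["text"]).strip())
```

Because outputs are saved in the file, a notebook can show results that no longer match its code if cells were edited or run out of order, which is one reason to rerun all cells before sharing.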
A Jupyter notebook does not knit to an HTML, PDF or Word file. However, you can embed HTML into a notebook, and nbconvert can convert notebooks to other formats.
For a more detailed comparison see The First Notebook War.
Quarto was built to unify the scientific publishing process across the most commonly used notebook formats, and this includes Jupyter notebooks.
Specifically, Quarto has a couple of neat tricks:
- Render ipynb files using the jupyter engine
- Convert between ipynb and qmd files seamlessly