As a researcher, you probably meet inquiries about previously published research with one of two feelings: panic-filled regret or calm authority. Often the difference is time; it’s easier to talk about the project you worked on last week than the one from last decade. Fielding questions about old protocols or software is a lot like watching someone critically examine the finger painting you did as a child. You know it’s not perfect and you would do several things differently in hindsight, but it is the method in the public record. The rate of change seems even faster in software development, where new technologies redefine best practices and standards at a dizzying pace.
Perhaps the most challenging problem is when researchers outside of your institution fail to reproduce results. How can you troubleshoot the software on every system or determine what missing piece is required to get things working? What was that magic bash command you wrote 5 years ago?
I recently had a similar problem where the published protocol was missing details necessary to reproduce our results. The culprit was a lab resource that was used for years without fully documenting how it was created. What we needed was a way to capture all the specifics of the entire workflow to simplify distribution and create a gold standard to point at and say “It works here!”
Concerns for reproduction
When preparing your work for distribution, there are a few questions to answer before picking a solution:
- What is the input data?
This includes all the input data, including non-public lab resources. If you can’t download it from a reputable repository, your workflow should generate it from public resources.
- What code was run?
Git is a great start to packaging your software in a maintainable way. If you have CI/CD set up, you likely know your software will work on other systems and have identified at least some of the dependencies.
- What environment and dependencies were present?
Hopefully, you already use conda or have modules recorded with specific versions. Even then, when you install on a different system you may be surprised that tools you thought were standard must be specified. This includes compilers, libraries, and generally anything added to your $PATH.
- How were results really generated?
Generally, the git repo will contain a software package that is agnostic to the input files. When you ran an analysis for publication, you likely made minor tweaks to input parameters or file locations, or did some post-processing of the initial results. All of these steps need to be recorded if the analysis is going to be reproducible.
- Will the solution survive 10 years? 100 years?
This question is the most philosophical. In practical terms, though, think about which public resources and platforms you rely on. If a git repo is deleted, will it break all your dependencies? What if a WordPress page disappears? See if some sources can be changed or if a dependency is better placed in your repo.
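One lightweight way to start answering the first and fourth questions is to write a manifest next to your results. The following is only a minimal sketch, not part of the original workflow: the function names (sha256_of, build_manifest, write_manifest) and the JSON layout are made up for illustration. It records a checksum for every input file, the command line that was used, and basic interpreter details.

```python
import hashlib
import json
import platform
import sys
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir, argv=None):
    """Collect input checksums, the command line, and interpreter details."""
    data_dir = Path(data_dir)
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    return {
        # Keys are paths relative to data_dir, so the manifest is portable.
        "inputs": {p.relative_to(data_dir).as_posix(): sha256_of(p) for p in files},
        "command": list(sys.argv if argv is None else argv),
        "python": platform.python_version(),
        "platform": platform.platform(),
    }


def write_manifest(data_dir, out_path, argv=None):
    """Write the manifest as JSON (hypothetical layout) and return it."""
    manifest = build_manifest(data_dir, argv)
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

A manifest like this does not replace a full environment specification, but comparing checksums is a quick way to confirm that someone else is starting from the same inputs you did.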
Codeocean as a platform for reproducible research
Based on the above concerns, I decided to release a new version of the software including the exact workflow to replicate the published results. Since this was also the first public release of the lab resource, we would need to host the results so others could easily download them without running 100 hours of computation themselves. To host a reproducible workflow with attached data, I decided to use Codeocean.
Codeocean is attempting to address the “Reproducibility Crisis” in science by providing “capsules” containing code, data, and published results. If you run a published capsule you should get the same result. This simplifies distribution for a researcher as storage and computation are provided externally.
Specifically, Codeocean runs in an AWS virtual private cloud. Dependencies and the environment are specified by a Dockerfile, which is generated through a polished GUI. The capsule is tightly bound to a simplified git model, and source code can be edited through a browser window. The core of distribution is the “reproducible run”, which requires headless (non-interactive) execution to produce all results without network connections.
Projects in Codeocean are called “capsules” and consist primarily of four directories:
- environment
Contains the Dockerfile and any post-installation scripts to further create the computational environment. Packages can be installed through apt-get, conda, pip, CRAN, and more. It should be included in version control.
- code
The code directory should contain the contents of your git repo, including the README file. Place files under version control as you normally would and note that a .gitignore in the root directory of the capsule is respected by Codeocean.
- data
This should contain all of your input data. The contents can be placed under version control and git LFS is available. However, if your data is going to be static, leave it out of git and Codeocean will make copies after publication. Note that when a run or interactive session starts, the entire contents of data are copied to the workstation. As such, if your data directory is larger than a few GB you will notice a slowdown. See if the dataset is available through Codeocean or as a public AWS S3 bucket.
- results
Results are treated differently with Codeocean. In a reproducible run or interactive session, any files outside of the data or results folder are lost after the session ends. Data follows along with the capsule while results appear as output for each run in the timeline to the right.
When to use Codeocean
Codeocean is a great option for hosting and reproducing small to medium projects or subsets of larger analyses. The default resources are quite modest: 10 hrs/week of compute time and 20 GB of storage. If your capsule takes 100 hours to complete, it would use up about ten weeks of that quota, so it will be hard for a general researcher to run it on Codeocean.
One option would be to use a representative set of parameters or samples to demonstrate how the results could be produced. Interested parties could then export the capsule and run it locally.
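If you go the subset route, it helps to make the selection itself reproducible. Here is a minimal sketch, assuming your samples can be identified by string IDs; the function name representative_subset and the seed value are hypothetical, and a random draw is only one way to choose a "representative" set.

```python
import random


def representative_subset(sample_ids, k, seed=0):
    """Pick k sample IDs reproducibly: the same IDs and seed give the same subset."""
    ordered = sorted(sample_ids)      # remove any dependence on input order
    if k >= len(ordered):
        return ordered
    rng = random.Random(seed)         # private generator; global random state is untouched
    return sorted(rng.sample(ordered, k))
```

Record the seed alongside the capsule, and keep the Python version pinned in the environment, since the sampling algorithm is not guaranteed to stay identical across interpreter versions.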
Another interesting use case is setting up a compute environment for a domain-specific course or lab. With fully specified dependencies and data in place, students could easily follow along with how to perform an analysis.
Moving from Git to Codeocean
1) Create a new capsule, cloning from the existing repo.
2) Specify the environment. You will likely have to add new dependencies as you attempt to run the analysis.
3) Move all code, including the README, to the code directory.
4) Split your workflow into download/network steps and headless analysis. If your analysis expects user interaction, you will need to add a command-line interface to accommodate headless execution.
5) Modify the paths to use the data and results directories.
6) If needed, run the script to download data in an interactive terminal. Note that if a process uses less than 2% of the CPU for an hour it will be terminated. Downloads from slow connections may have to be run locally and uploaded separately.
7) Commit all changes. It’s a good habit to commit before each reproducible run so commits stay small.
8) Perform a reproducible run. If you run into problems, modify your code, commit, and run again.
9) Once you have your results, you can submit the capsule for publication:
– Make sure metadata is complete
– A reproducible run must be performed after the latest commit
– Your README must be present in the code folder
– Capsule has a descriptive title someone could find and understand while browsing
A published capsule will receive a DOI and be publicly available. Anyone can find and download your data, code, and results. Be mindful of sensitive datasets! Published capsules are static and cannot be modified. To make a change, you can amend and publish a subsequent version.
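Steps 4 and 5 above can be sketched as a small command-line entry point. This is an illustration under stated assumptions rather than the workflow used for the publication: the default locations ../data and ../results are assumed to sit next to the code directory, the --pattern option and the summary.tsv output are invented, and the "analysis" is a placeholder that just counts lines per input file.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Every setting arrives as an option, so no interactive prompts are needed."""
    parser = argparse.ArgumentParser(description="Headless analysis entry point (sketch).")
    # Assumed defaults: data and results directories next to the code directory.
    parser.add_argument("--data-dir", type=Path, default=Path("../data"))
    parser.add_argument("--results-dir", type=Path, default=Path("../results"))
    parser.add_argument("--pattern", default="*.txt", help="hypothetical input file pattern")
    return parser.parse_args(argv)


def run(data_dir, results_dir, pattern="*.txt"):
    """Placeholder analysis: count the lines in each input file and write a summary."""
    results_dir.mkdir(parents=True, exist_ok=True)
    out_path = results_dir / "summary.tsv"
    with open(out_path, "w") as out:
        out.write("file\tlines\n")
        for path in sorted(data_dir.glob(pattern)):
            with open(path) as handle:
                n_lines = sum(1 for _ in handle)
            out.write(f"{path.name}\t{n_lines}\n")
    return out_path


def main(argv=None):
    """Parse options, run the analysis, and return the path of the summary file."""
    args = parse_args(argv)
    return run(args.data_dir, args.results_dir, args.pattern)
```

Calling main() from the script that the reproducible run executes keeps all inputs under the data directory and all outputs under the results directory, and the same function can be pointed at local folders when someone exports the capsule.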
Summary
I found working in Codeocean to be a pleasant experience. Reproducing computational results is a challenge, and Codeocean adds features that make this task approachable for a domain scientist. If you are comfortable with GitHub, you can use Codeocean.
Because you have to set up a compute environment from scratch, missing dependencies show up immediately. The environment GUI is much easier than writing a Dockerfile, while the optional post-install script provides full flexibility. Though the git model is simplified, you are constantly reminded of uncommitted changes, which gently nudges you to make smaller commits. Overall the platform promotes computational best practices without complicating your work. Hopefully, it will become a more common prerequisite to publishing new analyses!
Positives
- Forces separation of input data, results, and code
- Forces full specification of the environment
- Forces headless execution
- Uncommitted changes are always at the front of the display
- Built-in support for Jupyter and RStudio
- GPU nodes are available
Limitations
- The git model available through the main GUI uses a single branch
- Collaboration seems prone to clobbering due to the lack of branching. You need to coordinate separately on who is currently doing work, or use git through the interactive terminal
- The file layout doesn’t directly translate from most git repos. You need to clone and then move the contents into code
- Default resources are low (10 hrs/week, 20 GB)
- Workstations are shared and have 16 cores, 128 GB RAM