Frequently Asked Questions
How is this different from existing efforts in open science?
Recent interest in open science has been a positive step toward improving the reproducibility of scientific research, but the open science movement has largely focused on making research transparent rather than explicitly facilitating its reproducibility. Many of the tools and practices of open science (open data, open code, pre-registration, etc.) are important steps toward reproducibility, but they are not sufficient on their own: authors must still provide detailed documentation and instructions to ensure that others can independently replicate their work. This work is often left undone, and even when it is attempted, writing adequate documentation and troubleshooting cross-platform execution errors can be a time-consuming and error-prone process.
A major goal for this project is to remove this need for manual documentation by making it easier for scientists to create projects that are inherently reproducible.
Why containerization?
Aren’t tools like Python’s pip (and venv, conda, mamba, etc.) and R’s .Rproj (and Renviron, packrat, etc.) sufficient?
Existing software projects around reproducible/open science are often language-specific. The power of containerization is in its universality. Nearly any scientific environment can be packaged into a containerized interface, opening up the possibility for language-agnostic reproducibility standards and frameworks that can transcend any particular programming language.
Containers meet or exceed the standards of other packaging tools on several dimensions. First, containers can include and run each of the tools listed above (and many more) as a dependency. Language-specific tools like venv or .Rproj typically accommodate only dependencies within their own ecosystem, so containers support a strictly wider range of workflows. Further, since each of the tools listed above requires installing standalone software (a working Python distribution, RStudio, the Anaconda distribution, etc.), the need to install Docker should not be considered a significant disadvantage of containers. It might be argued that containers are more confusing or less user-friendly than other packaging systems; this is exactly why reproduce.work exists: to lower the technical barriers for scientists who want to take advantage of containerization technologies.
Further, due to the vagaries of operating systems, tools like Python virtual environments and R project files still leave a lot of room for technical decay. A venv workflow that worked on one machine is not guaranteed to work on another machine with a different operating system, or even on your own machine after an OS update. This limits the longevity of projects that rely exclusively on these types of tools. By packaging scientific computing environments into containers, scientists can ensure that their analyses can be replicated exactly, irrespective of changes in underlying technology or platforms.
If you adopt containerization in earnest, you should never need to install Python, R, or any other scientific computing software directly on your machine again. Further, if you use a separate container for each of your projects, you should never again fall into dependency hell, sorting out version conflicts, incompatible packages, or arcane OS-specific library requirements (i.e., you should never need to track down a .dll file or debug a 'gcc' failed message again).
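To make the per-project idea concrete, here is a minimal sketch of running one project's analysis inside its own pinned container image. It uses Python's subprocess module purely for illustration (in practice you would run the equivalent command from a shell, or let a tool like the reproduce.work CLI handle it), and the image tag and entry-point script are hypothetical placeholders rather than anything prescribed by reproduce.work.

```python
import subprocess
from pathlib import Path

# Hypothetical placeholders: any pinned image and entry-point script would do.
IMAGE = "python:3.11-slim"
PROJECT_DIR = Path.cwd()

# Mount the project into the container and run the analysis there, so the
# analysis uses the container's Python rather than whatever happens to be
# installed on the host.
subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", f"{PROJECT_DIR}:/work",  # share the project directory
        "-w", "/work",                 # run from inside that directory
        IMAGE,
        "python", "analysis.py",       # hypothetical entry point
    ],
    check=True,
)
```

Because the image tag pins the environment, two projects can depend on conflicting package versions without ever interfering with each other.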
What is the long-term vision for this project?
I have started writing up a bigger vision for reproduce.work in a rough-draft manuscript, which can be found here.
As it currently stands, the reproduce.work CLI tool is essentially a simplified Docker interface for scientists. However, the potential applications of containerization technology in science more broadly go far beyond this, opening up possibilities for automating much of the existing work that falls under the umbrella of “open” science.
Part of this bigger vision includes the development of a framework for metadata publishing that allows scientific projects to be verified automatically, in a CI/CD-friendly manner. It is not hard to think of deterministic checks (fully programmatic scripts that could be run in the cloud or independently verified and signed) that would be valuable for projects that make use of scientific computing. (In theory, even non-computational academic projects could benefit from some aspects of automated review, e.g., plagiarism checks, citation checks, etc.)
For example, if scientists were to conduct their analyses and author their manuscripts inside containerized environments, it would not only be trivial to reproduce their published reports from scratch, but also possible to automatically verify that the statistical results, figures, and data published in scientific manuscripts exactly match the output of the analysis code. One could also automatically check whether published analyses match pre-registered analyses. This entirely removes the possibility of human transcription errors and greatly reduces the burden on reviewers and editors to conduct these checks manually.
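As a sketch of what such a check could look like (none of this is part of reproduce.work today; the results file, its format, and the function names below are hypothetical), a verification script might recompute a project's numbers inside its container and compare them against the values published alongside the manuscript:

```python
import json
import math

def load_reported_results(path: str) -> dict:
    """Numbers the manuscript claims, e.g. {"treatment_effect": 0.42}."""
    with open(path) as f:
        return json.load(f)

def recompute_results() -> dict:
    """Placeholder: in practice this would re-run the containerized analysis
    and collect its numerical outputs."""
    return {"treatment_effect": 0.42, "n_observations": 1000}

def verify(reported: dict, recomputed: dict, rel_tol: float = 1e-9) -> list:
    """Return a list of mismatches between reported and recomputed values."""
    mismatches = []
    for key, reported_value in reported.items():
        if key not in recomputed:
            mismatches.append(f"{key}: missing from recomputed output")
        elif not math.isclose(reported_value, recomputed[key], rel_tol=rel_tol):
            mismatches.append(
                f"{key}: manuscript reports {reported_value}, "
                f"analysis produced {recomputed[key]}"
            )
    return mismatches

if __name__ == "__main__":
    problems = verify(load_reported_results("reported_results.json"),
                      recompute_results())
    if problems:
        raise SystemExit("Verification failed:\n" + "\n".join(problems))
    print("All reported results match the analysis output.")
```

A check like this could run in the cloud on every submission, with its pass/fail result signed and published as part of the project's metadata.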
It may be possible for these technologies to mitigate (or at least significantly raise the costs of) certain types of fraud, but this is not the main objective of reproduce.work. Rather, the primary vision is to make the best scientists better by making it easier to document and share their work in ways that meet basic standards of rigor and reproducibility and that are self-verifying and self-evident.
What’s more, as LLMs proliferate, one wonders whether the application of these tools to academic publishing will be a blessing or a curse. One might be tempted to expand the types of checks to include things like “Get the consensus view of the 10 top-performing AI models on {this_leaderboard} in response to this battery of prompts: [{prompts}]”. There is also the possibility that, as LLMs become more integrated into scientific workflows, executability, verifiability, and reproducibility will matter even more, since the proliferation of AI-generated content may diminish the trust we place in media and projects consisting of pure text.
My hope is that reproduce.work can contribute to some of the upside associated with these possibilities. Much of what is described above is possible with current containerization technologies, but it will require more development and coordination to become a reality.