6 Stable Computing Environments

Duration: 25 - 40 minutes

Learning objectives

Learn what a computing environment is.
Understand why stable computing environments are important for reproducible research.
Know about use cases for containers.
Get an overview of Docker and its functionality.

6.1 Opening demo

Create a Python script called plot.py with the following content:

import matplotlib.pyplot as plt
import numpy as np

# Generate some example data
x = np.random.rand(100)  # Generate 100 random numbers between 0 and 1
y = 2 * x + 1 + np.random.randn(100)  # Apply a linear relationship with noise

# Create the scatter plot
plt.figure(figsize=(8, 6))  # Set the figure size
plt.scatter(x, y)  # Create the scatter plot

# Add labels and title
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Simulated Data")

# Add a grid and save the plot to a file
plt.grid(True)  # Add a grid for better readability
plt.savefig("plot.png")

Try running the script by executing

python plot.py

Does it work?

You will likely get an error message saying that matplotlib is not installed. Python itself may also be missing from your system, in which case you will see an error message such as: python: command not found.

If you have Docker installed, you can run the script in a Docker container with the following command:

docker run -v "$(pwd)":/app -w /app czentye/matplotlib-minimal python plot.py

Note

This demonstration consists of a Python script requiring two dependencies: numpy and matplotlib. If either of these dependencies is missing, the script will not run.

When the script runs in a Docker container, the image provides the dependencies, so the script runs successfully and creates a file plot.png containing a scatter plot.
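
To see which of the two dependencies is missing on your own system before running plot.py, a small check like the following can help. This is a minimal sketch using only the Python standard library; the function name missing_modules is chosen here for illustration and is not part of the demo.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # plot.py imports these two top-level packages.
    missing = missing_modules(["numpy", "matplotlib"])
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All dependencies found; plot.py should be able to import them.")
```

The check only tells you whether a package can be located, not whether its version is the one the script was written for.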

6.2 What is a computing environment?

The computing environment is the set of software tools and libraries that are used to run a program. It includes the operating system, the programming language, and any other software that is required to run the program.

Research software typically has multiple dependencies, such as libraries, compilers, and other software tools.

In addition, software tools and libraries are constantly updated, which can lead to compatibility issues. For example, a program that was written to run on a specific version of a library may not work on a newer version of the library. Also, functionality may change between versions, or bugs may have been fixed or newly introduced.
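
The idea of a program being written for a specific range of library versions can be sketched as a simple check. The example below assumes plain dotted version numbers such as 1.26.4; real version strings can be more complex, and the third-party packaging library handles those cases. The function names and the version bounds in the usage part are illustrative assumptions.

```python
def parse_version(text, width=3):
    """Turn a plain dotted version such as "1.26.4" into a tuple of ints.

    Short versions are padded with zeros, so "1.20" becomes (1, 20, 0).
    """
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def is_supported(installed, minimum, below):
    """Return True if minimum <= installed < below, compared numerically."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(below)


if __name__ == "__main__":
    # Hypothetical requirement: tested with versions from 1.20 up to, but not including, 2.0.
    for version in ["1.9.3", "1.26.4", "2.0.0"]:
        print(version, "supported:", is_supported(version, "1.20", "2.0"))
```

Comparing tuples of integers rather than strings matters here: as text, "1.9.3" would sort after "1.20", although it is the older release.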

For reproducibility it is thus important to record the computing environment in which the software was developed and tested.

Tools to record the computing environment include:

conda
Python venv
R: renv, packrat
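
Independently of these tools, the kind of information such a record contains can be illustrated with the Python standard library. The following is a minimal sketch, not a replacement for conda or venv; the function name snapshot_environment and the choice of recorded fields are assumptions made for this example.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Collect interpreter, operating system, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    snapshot = snapshot_environment()
    print(json.dumps(snapshot, indent=2))
```

Saving this output next to your results documents which interpreter and package versions produced them. It does not capture system libraries outside of Python.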

Note

How are dependencies managed in MATLAB?

However, recording the computing environment is not enough. It is also important to be able to recreate the computing environment, including underlying system libraries.
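
One way to check whether a recreated environment matches a recorded one is to compare the package versions of both. The sketch below assumes that each environment is described by a mapping from package name to version string; the function name find_drift and the version numbers in the usage part are illustrative only. It covers Python packages, not the underlying system libraries.

```python
def find_drift(recorded, current):
    """Compare two {package: version} mappings.

    Returns a dict mapping each differing package to a tuple
    (recorded_version, current_version). A value of None means the
    package is absent on that side.
    """
    drift = {}
    for name in sorted(set(recorded) | set(current)):
        old, new = recorded.get(name), current.get(name)
        if old != new:
            drift[name] = (old, new)
    return drift


if __name__ == "__main__":
    # Illustrative values, not taken from a real project.
    recorded = {"numpy": "1.26.4", "matplotlib": "3.8.4"}
    current = {"numpy": "2.0.0", "matplotlib": "3.8.4", "pillow": "10.3.0"}
    for name, (old, new) in find_drift(recorded, current).items():
        print(f"{name}: recorded {old}, found {new}")
```

An empty result means the two mappings agree; any entry points to a package that was upgraded, downgraded, added, or removed.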

6.3 Why are stable computing environments important for reproducible research?

Reproducibility: A stable computing environment ensures that the software will run the same way every time it is executed. This is important for reproducibility, as it allows others to run the software and obtain the same results.
Portability: A stable computing environment can be easily shared with others, allowing them to run the software on their own systems.
Collaboration: A stable computing environment allows multiple researchers to work on the same project without worrying about compatibility issues.
Longevity: A stable computing environment ensures that the software will continue to run in the future, even as software tools and libraries are updated.

The house analogy

Imagine your research project as your home that provides functionality for cooking, sleeping, working/studying, etc.

When you install all software tools and libraries on your laptop, your kitchen, living room and bed are all in the same room. Everything is inseparable, and possibly interdependent: If you blow a fuse in the kitchen, the power in the whole room goes out.

With virtual environments, you can think of your research project as a house with separate rooms. Each room has its own purpose and is independent of the others. If you blow a fuse in the kitchen, the power in the living room and bedroom is still on. However, if it rains through the roof, all rooms might be affected.

When you containerize your software environments, each room is a shipping container. Each part can be shipped to another location, and still work as intended.

6.4 Use cases

Resolve dependencies and simplify installation: Containers can be used to package software and its dependencies into a single unit that can be easily installed and run on any system. This is useful for software that has complex dependencies or requires specific versions of libraries.
Reproducibility: Containers can be used to create reproducible computing environments that can be easily shared with others. This is important for reproducible research, as it allows others to run the software and obtain the same results. For example, the BioContainers initiative provides a collection of bioinformatics tools in containers.
Scalability: Containers can be easily scaled up to run on multiple machines, allowing for parallel processing of large datasets. This is important for high-performance computing applications, such as machine learning and data analysis.
Development: Containers can be used to create isolated development environments that are separate from the host system. This is useful for testing new software tools and libraries without affecting the host system.
Cloud computing: With the advent of cloud computing, containerization has become an important tool for deploying software in the cloud. Containers can be easily deployed on cloud platforms such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure without worrying about the underlying operating system and installed libraries on the host system.

7 Docker

Docker is a containerization tool that is widely used in the research community. Docker allows you to package software and its dependencies into a container that can be easily shared and run on any system where Docker is available.

Some important concepts in Docker are:

Dockerfile: A Dockerfile is a text file that contains instructions for building a Docker image. The Dockerfile specifies the base image, software tools, and libraries that are required to run the software. Think of the Dockerfile as the recipe for building a container.
Images: An image is a read-only template that contains the software and its dependencies. Images are used to create containers. The image is the packaged version of a kitchen, with everything you need to follow the recipe.
Containers: A container is a running instance of an image. Containers are isolated from each other and from the host system. The container is the operating kitchen that produces the meal.
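
The relationship between the three concepts can be sketched for the plot.py demo. The example below is written in Python so that it can be inspected without Docker installed: it holds a hypothetical Dockerfile as a string and assembles the docker build and docker run commands as lists. The base image tag python:3.12-slim, the unpinned package list, and the image name plot-demo are assumptions for illustration; a real project would pin versions.

```python
import shlex
from pathlib import Path

# Hypothetical Dockerfile (the recipe). Base image and packages are
# assumptions for this sketch; versions are deliberately left unpinned.
DOCKERFILE = """\
FROM python:3.12-slim
RUN pip install --no-cache-dir numpy matplotlib
WORKDIR /app
"""


def write_dockerfile(directory):
    """Write the recipe to <directory>/Dockerfile and return its path."""
    path = Path(directory) / "Dockerfile"
    path.write_text(DOCKERFILE)
    return path


def build_command(tag, context_dir="."):
    """Command that builds an image (the packaged kitchen) from the Dockerfile."""
    return ["docker", "build", "-t", tag, str(context_dir)]


def run_command(tag, host_dir, script="plot.py"):
    """Command that starts a container (the operating kitchen) from the image.

    The host directory is mounted at /app so the container can read the
    script and write plot.png back to the host.
    """
    return ["docker", "run", "--rm", "-v", f"{host_dir}:/app",
            "-w", "/app", tag, "python", script]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    print(shlex.join(build_command("plot-demo")))
    print(shlex.join(run_command("plot-demo", Path.cwd())))
```

Building happens once and produces the image; every docker run afterwards creates a fresh container from that same image.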