
Earlier this week, I had the unfortunate but necessary task of rebuilding an important machine at work, due to the limited way I had created the initial testing bench for our single-cell RNA sequencing experiments. As I'm essentially the only person in my research group comfortable working in a Linux environment, I thought I'd document the steps I took to build a bioinformatics in silico lab. This will not only serve as an instruction manual for the future, but also demonstrate how straightforward it can be to start working with Linux.*

* Every Linux user will give you a half-laugh while encouraging you to go forward, along with this disclaimer: Linux often brings reliability, simplicity, and reproducibility to complex tasks (often for free). It also often turns the simplest of tasks into a hair-pulling endeavor. You will develop a love-hate relationship with it (not unlike Windows or macOS).

Hardware

We'll be setting up a next-generation genomics workbench on one of the machines I have available. These are in no way the minimum requirements for running the various parts of a single-cell RNA sequencing experiment; the specifications below are provided purely as a reference. As a general rule of thumb, more CPU cores, more RAM, and an (NVIDIA) GPU all help speed up the workflow. If you are starting from the raw FASTQ files and loading them into Cell Ranger, you will have greater hardware requirements than what most personal laptops offer. Downstream analysis can be performed without a GPU, but it will take many times longer to complete.

  • Operating System: Debian GNU/Linux 12
  • KDE Plasma Version: 5.27.5
  • KDE Frameworks Version: 5.103.0
  • Qt Version: 5.15.8
  • Kernel Version: 6.1.0-9-amd64 (64-bit)
  • Graphics Platform: X11
  • Processors: 32 × Intel® Xeon® CPU E5-2690 0 @ 2.90GHz
  • Memory: 125.8 GiB of RAM
  • Graphics Processor: NVIDIA GeForce RTX 3060/PCIe/SSE2
  • Manufacturer: Hewlett-Packard
  • Product Name: HP Z820 Workstation

Debian as your operating system

When you first look into the Linux world, it can be daunting to see so many different types of distributions ("distros" for short). The open-source model allows people to collaboratively work on a project. If a group within the larger community wants to take the project in a slightly different direction, they can do so by "forking" it. Forks can keep most of the functionality of the main project while introducing their own unique features or approaches.


Ubuntu is often recommended as a starting place for people entering the Linux world, as it offers one of the best out-of-the-box experiences. The reason I am recommending Debian (the distro Ubuntu is based on) is that it opens up more possibilities for virtualization with Proxmox down the road (which is out of the scope of this article and will be discussed in the future).

I use Ventoy to create a bootable flash drive with the latest Debian ISO (for reference, I am installing Debian 12.0.0 at the time of writing this post). The "DVD" type images are a larger initial download, but they help speed things along, especially if you are trying to create a mostly offline operating system.

I would recommend creating a separate partition for the datalad RIA file storage, where we will keep a local archive of all of our experiment files.
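If you do carve out that separate partition, mounting it at a predictable location makes life easier later. Here is a sketch of what the /etc/fstab entry could look like; the device name, filesystem, and mount point are placeholders for illustration, so adjust them to your own disk layout:

# Hypothetical /etc/fstab entry for a dedicated data partition (adjust device and path)
/dev/sdb1    /mnt/ria-store    ext4    defaults    0    2

# Create the mount point and mount everything listed in /etc/fstab
sudo mkdir -p /mnt/ria-store
sudo mount -a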

Initial configuration

Add the primary user to the sudo group

In order for you to install anything on the computer, you will need administrator privileges. This may be confusing, since during installation you entered a password for both the root administrator account and your own personal account (e.g. pranav is my username). Despite all of this, on a brand new install of Debian, you do not have permission to "elevate" yourself to an admin from your main account (e.g. pranav).

# Change user to root (the "-" loads root's full environment, including /usr/sbin in PATH)
su -

# Now as the ROOT user
#-----------------------
# Add your username to the sudo group
usermod -aG sudo pranav

# Restart the computer
shutdown -r now

Setup repository list

In case you made a mistake, as I did, and forgot to configure the list of repositories during the operating system installation, you can add them by editing /etc/apt/sources.list with sudo nano /etc/apt/sources.list

# Repository list for Debian 12 (includes a few additions to the defaults)
deb http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
deb-src http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware

deb http://security.debian.org/debian-security bookworm-security main contrib non-free
deb-src http://security.debian.org/debian-security bookworm-security main contrib non-free

deb http://deb.debian.org/debian/ bookworm-updates main contrib non-free
deb-src http://deb.debian.org/debian/ bookworm-updates main contrib non-free

Next, update your installation with sudo apt update && sudo apt upgrade -y.

Install NVIDIA drivers


If you have connected your monitors to an NVIDIA GPU over DisplayPort, you may have been experiencing a (potentially seizure-inducing) flicker on your screen so far. Thankfully, if you've made it this far with the flicker, fixing it is our next step, and it's quite simple. Install the nvidia-detect package from apt with the command sudo apt install nvidia-detect. For my GPU, I received the following output:

Detected NVIDIA GPUs:
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA104 [GeForce RTX 3060] [10de:2487] (rev a1)
Checking card: NVIDIA Corporation NVIDIA Corporation GA104 [GeForce RTX 3060] (rev a1)
Your card is supported by all driver versions.
Your card is also supported by the Tesla drivers series.
Your card is also supported by the Tesla 470 drivers series.
It is recommended to install the
nvidia-driver
package.

Assuming you have the apt repositories configured as above, you can then install the driver with sudo apt install nvidia-driver firmware-misc-nonfree. Then, as advised, restart your computer with sudo shutdown -r now.

Install KDE Plasma Desktop

KDE Plasma 5 on Debian 12

One of the best parts of Linux is the incredible capacity to personalize everything, including the look and feel of your desktop. While I do not have strong feelings about one desktop environment versus another, I feel that KDE Plasma offers one of the best "out of the box" experiences for colleagues who may be hesitant to work on a Linux machine due to a lack of familiarity. KDE largely resembles a Windows 10 environment, with a taskbar across the bottom and something that resembles a "Start" menu.

You can install KDE Plasma Desktop, which is one of the smaller installations, with sudo apt install kde-plasma-desktop

Install Python via Mamba


Without getting too far into the details, it is highly recommended that you use an environment manager for Python. conda by Anaconda is probably the most widely adopted package and environment manager, especially in data science. The one drawback of the original conda is that it takes forever to resolve which versions of packages are compatible and what should be downloaded. Mamba is a drop-in replacement for conda which runs significantly faster. You can get a version with conda-forge preconfigured, known as "Mambaforge" (the Miniforge installer used below now ships with mamba included).

Update: the mamba solver (libmamba) is now the default inside of conda. You do not need to install Mambaforge separately.

# Create a tmp folder inside of your user directory
mkdir -p ~/tmp && cd ~/tmp

# Download the Miniforge installer (which includes mamba) and run it
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh"
bash Miniforge3-Linux-x86_64.sh

# Accept the license terms and allow the "init" script to run

Configure SSH for secure remote access

Again, for the purposes of this article, I will assume that you know the basics about SSH.

# Make a new directory for SSH

mkdir -p ~/.ssh
nano ~/.ssh/authorized_keys

# Download your SSH public keys (which you may have readily available at github.com/<yourusername>.keys)
curl -s https://github.com/pranavmishra90.keys >> ~/.ssh/authorized_keys

# Paste in or generate a new private key (whichever you prefer)
# Make sure that the permissions on the private key are set correctly
chmod 400 ~/.ssh/id_ed25519

# Change your SSH configuration
sudo nano /etc/ssh/sshd_config

Setting up your sshd configuration file is a long discussion in itself. Because I am preparing to use Tailscale, which can take over port 22, I will change the default SSH port. After adjusting the file, restart the SSH service with sudo systemctl restart ssh
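For reference, these are the kinds of lines I touch in /etc/ssh/sshd_config. The port number is only an example, and disabling password logins assumes you have already added a working public key as above:

# Excerpt from /etc/ssh/sshd_config (example values, not a complete file)
Port 2222                  # move SSH off the default port 22
PasswordAuthentication no  # allow key-based logins only
PermitRootLogin no         # block direct root logins over SSH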

Personalize your settings with dotfiles / Yadm

yadm is a free, open-source program whose name stands for Yet Another Dotfile Manager. When you are frequently jumping between different Linux machines, you want your command-line personalization to be uniform. Command line interface (CLI) personalization is often saved in "dotfiles", files that begin with a dot (e.g. ~/.bash_profile) and usually live in your home directory ~/. First, install yadm with sudo apt install yadm. Then download your dotfiles with yadm clone <url>.

Want to see what some of these configurations look like? Check out my dotfiles on GitHub at pranavmishra90/dotfiles. You can load them in with yadm clone https://github.com/pranavmishra90/dotfiles.git (note: for my own repository, I use SSH authentication with GitHub, so the corresponding command is yadm clone git@github.com:pranavmishra90/dotfiles.git).

Install Docker

In the past, I've almost exclusively used the standard Docker Engine, which you can find the installation instructions for here. Today, I wanted to explore Docker Desktop a bit further, with its built-in capacity to leverage Tailscale on a per-container basis. The official documentation on the Docker website is usually all that's needed. It's important to know that the Desktop version of Docker is not entirely FOSS (free and open source software); you will have to check whether the license terms apply to your use case.


You will need to have a PGP key in place for your user account in Debian. PGP is one of the oldest and most widely established encryption technologies. Beyond allowing you to encrypt anything from text, to email, to files, one of the most common uses of PGP is to "sign" files, as a way of proving that a file was created exactly by the person who claims to be the creator (and not a malicious impostor).
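If you do not already have a key, the rough sequence looks like the following. Docker Desktop uses the key to initialize the pass credential store; treat this as a sketch and follow the interactive prompts:

# Generate a new GPG key pair (follow the interactive prompts)
gpg --generate-key

# Find the ID of the key you just created
gpg --list-secret-keys --keyid-format=long

# Install pass and initialize the credential store with that key ID (placeholder shown)
sudo apt install pass
pass init <your-gpg-key-id>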

Install Zsh

Zsh is a newer alternative to the standard bash shell. While many things look and work better in zsh than in bash, I often find myself switching back to bash for compatibility with certain commands. It's a slight annoyance to have to type bash while you're trying to get something done quickly. I would also recommend the Powerlevel10k theme to get zsh looking and working exactly how you'd like.

# Zsh installation
sudo apt install zsh

# PowerLevel10k installation
git clone --depth=1 https://github.com/romkatv/powerlevel10k.git ~/powerlevel10k
echo 'source ~/powerlevel10k/powerlevel10k.zsh-theme' >>~/.zshrc
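To avoid launching zsh manually every session, you can make it your login shell with chsh (assuming zsh installed to the standard path); the change takes effect at your next login:

# Set zsh as the default shell for the current user
chsh -s $(which zsh)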

Preparing your bioinformatics workbench

Install NVIDIA CUDA to leverage your GPU for machine learning

Official installation instructions can be found on NVIDIA's website for CUDA. Again, the documentation is well done, and I will follow along with it.

(base) pranav: temp : lspci | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3060] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)

(base) pranav: temp : uname -m && cat /etc/*release
x86_64
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

When checking for the gcc compiler, I received the following error, indicating that I had to install gcc

(base) pranav: temp : gcc --version
-bash: gcc: command not found

gcc can be easily installed with:

sudo apt update && sudo apt install build-essential
sudo apt-get install manpages-dev

# Confirm that gcc is working with
gcc --version

gcc (Debian 12.2.0-14) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

At the time of writing, Debian 12 is about 1 month old. We will be using the Debian 11 installer from the NVIDIA website, since it is the closest one available.

wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda
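After a reboot, a quick sanity check is to confirm that the driver can actually see the card; nvidia-smi ships with the NVIDIA driver and should list the GPU along with the driver and CUDA versions:

# Confirm the GPU is visible to the driver
nvidia-smi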

Install VSCode

# Download the latest installer
cd ~/tmp
# Quote the URL so the "&" is not interpreted by the shell, and name the output file
wget "https://code.visualstudio.com/sha/download?build=stable&os=linux-deb-x64" -O code_1.80.0-1688479026_amd64.deb

sudo dpkg -i code_1.80.0-1688479026_amd64.deb

VSCode is one of the most well-regarded open source projects. The official build, maintained by Microsoft, has a number of features which allow for integration with GitHub and the extensions marketplace. Open-source purists may note that some of these features come with a trade-off of telemetry and proprietary code. If this bothers you, there are forks available. For me, the benefits of the official VSCode (or related platforms like coder/coder) outweigh that trade-off, since the official version gives maximum feature availability. The biggest "quality of life" improvements come from the GitHub settings sync, repository integration, and the capacity to share, tunnel, and collaborate with other users. I would still recommend limiting telemetry, though.

You can find this in the settings (Control + ,) under "Telemetry: Telemetry Level". To see the telemetry settings for all of VSCode, including extensions, search for "telemetry". I'm usually someone who keeps telemetry at the absolute minimum. However, for certain open source projects that I love, I enable the minimal level rather than turning it off entirely; it can provide valuable insight for improving the programs you heavily rely on.
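If you prefer editing settings.json directly, the same option can be set there; "error" is one of the accepted levels, alongside "all", "crash", and "off":

// User settings.json
{
    // Only send error-level telemetry; use "off" to disable it entirely
    "telemetry.telemetryLevel": "error"
}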


Install Quarto

# Download the latest version of quarto: https://quarto.org/docs/get-started/
cd ~/tmp
wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.3.433/quarto-1.3.433-linux-amd64.deb

# Install quarto
sudo dpkg -i quarto-1.3.433-linux-amd64.deb

Single-Cell RNA Sequencing: the essential packages

With mamba installed, you can get a "turnkey" solution to having most of what you need by creating a new environment with the necessary packages. I have a properly formatted YAML template for conda / mamba, called environment.yml. I've included a gist below, which shows the latest version of my Python environment for scRNAseq experiments.

View this gist on GitHub: https://gist.github.com/pranavmishra90/40e71f8928695663765ef48d117f4ad8
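For readers who just want the shape of the file, here is a minimal sketch of what such an environment.yml can look like. The package list is illustrative of common scRNA-seq tooling, not a copy of my actual gist:

# environment.yml -- illustrative example only
name: scrnaseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - scanpy        # single-cell analysis toolkit
  - anndata       # annotated data matrices
  - jupyterlab    # interactive notebooks
  - pip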

With datalad, I can keep the conda / mamba environment file in sync between all of my projects:


datalad download-url --nosave --path ./python/environment.yml -o "https://gist.githubusercontent.com/pranavmishra90/40e71f8928695663765ef48d117f4ad8/raw/environment.yml"

scRNAseq projects (or bioinformatics projects in general) are inherently complex, with a large number of files, some of which may be in the hundreds-of-gigabytes range. Staying organized is critical for experiment reproducibility (and your own sanity!). I can (and hope to) write a series of posts on creating reproducible research environments. For now, I will describe the overall project setup.

Create a “work” directory containing all of your projects

My work directory looks like:

(base) pranav@singulab work : tree -d -L 2
.
├── bfgi
│   ├── animal-studies
│   ├── code
│   ├── deploy
│   ├── documentation
│   ├── human-studies
│   ├── licenses
│   └── templates
├── oa-ptoa
│   ├── cell-culture
│   ├── code
│   ├── deploy
│   ├── explant-PTOA
│   ├── licenses
│   ├── microCT
│   ├── notebook
│   ├── output
│   ├── synovial-fluid
│   ├── templates
│   └── tribology
├── pranavmishra90
│   ├── Curriculum-Vitae_Pranav-Mishra
│   ├── genomics-studio
│   ├── laboratory-labels
│   ├── protocol-timer
│   └── research-reference
└── prrx1
    ├── code
    ├── deploy
    ├── figures
    ├── licenses
    ├── notebook
    └── scRNA-seq

Under the main ~/work directory, I have the three main projects that I am actively working on. Nested underneath a directory with my git username, at ~/work/pranavmishra90, I've quickly pulled down a few important repositories which I use regularly (or should update soon, like my CV!).

I was able to rapidly retrieve all of my projects on a brand new Debian installation with the magic of datalad, which uses git and git-annex in the background. More on datalad in upcoming posts!
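As a taste of what that retrieval looks like, the commands below clone a dataset and then fetch its annexed file contents; the URL and paths are placeholders, not my actual project remotes:

# Clone a datalad dataset (a lightweight git clone; large files are not downloaded yet)
datalad clone https://github.com/<user>/<dataset>.git ~/work/<dataset>

# Fetch the annexed file contents, recursing into any subdatasets
cd ~/work/<dataset>
datalad get -r .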

Use separate environments in Python and R for each of your projects

Mixing and matching environments can lead to hair-pulling headaches: accidentally installing or 'upgrading' one package can automatically downgrade many others, breaking all of your code.
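In practice, this just means creating one named environment per project from that project's environment.yml; the environment name below is only an example:

# Create an isolated environment for a single project
mamba env create -n prrx1-scrnaseq -f environment.yml

# Activate it whenever you work on that project
mamba activate prrx1-scrnaseq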

It’s even better to work inside an entire virtual environment, either as a Virtual Machine or a Docker-like container (more on this in the future). If you would like to get started using something like this immediately, I have a few Docker images / GitHub repositories that you might find useful:

pranavmishra90/SinguLab: An 'all-in-one' Python / Jupyter notebook based containerized lab with many of the components I've described above pre-installed.

pranavmishra90/genomics-studio: RStudio with Seurat, Monocle, and more pre-installed

pranavmishra90/quarto-containerized: Create publication-quality websites, PDF documents, wikis, and more, all from one markdown document which can execute Python, R, and LaTeX code to render content, using Quarto.

Need more guidance or have any suggestions? Reach out to me

If you have a question or comment on the process I describe above, feel free to reach me through my website, social media, or email!

Dr. Pranav Mishra

Dr. Pranav Mishra is a physician-scientist in-training. He is currently a post-doctoral research fellow in the Departments of Surgery and Orthopedic Surgery at Rush University. As a hobby, he enjoys learning about programming, self-hosting, and open-source software through personal projects. This sparked his interest in data science / bioinformatics, developing reproducible and auditable environments for his research projects. You can find his contact information listed on his website and connect with him on most socials @drpranavmishra.