Leipzig Summer School, Day 5 Viromics

Introduction & Setup

Overview

Teaching: 30 min
Exercises: 0 min
Questions
Objectives
  • Overall view of the workshop

  • Getting to know each other

  • Setup

Key Points


Lecture on Viruses and viromics

Overview

Teaching: 30 min
Exercises: 0 min
Questions
Objectives

Key Points


Break and Questions

Overview

Teaching: 30 min
Exercises: 0 min
Questions
Objectives

Key Points


Lecture on RDM and Virus Infrastructure

Overview

Teaching: 30 min
Exercises: 0 min
Questions
Objectives

Key Points


Lunch break

Overview

Teaching: 60 min
Exercises: 0 min
Questions
Objectives

Key Points


Project Explanation

Overview

Teaching: 30 min
Exercises: 0 min
Questions
Objectives

Hands-on Workshop: Viromics

Description

Metagenomics

The emergence of Next Generation Sequencing (NGS) has facilitated the development of metagenomics. In metagenomic studies, DNA from all the organisms in a mixed sample is sequenced in a massively parallel way (or RNA, in the case of metatranscriptomics). The goal of these studies is usually to identify certain microbes in a sample, or to taxonomically or functionally characterize a microbial community. There are different ways to process and analyze metagenomes, such as the targeted amplification and sequencing of the 16S ribosomal RNA gene (amplicon sequencing, used for taxonomic profiling) or shotgun sequencing of the complete genomes in the sample.

After primary processing of the NGS data (which we will not perform in this exercise), a common approach is to compare the metagenomic sequencing reads to reference databases composed of genome sequences of known organisms. Sequence similarity indicates that the microbes in the sample are genomically related to the organisms in the database. By counting the sequencing reads that are related to certain taxa, or that encode certain functions, we can get an idea of the ecology and functioning of the sampled metagenome.
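The read-counting idea can be sketched with a toy example (the read-to-taxon table below is entirely hypothetical, standing in for the output of a real classifier):

```shell
# Hypothetical read-to-taxon assignments, one read per line
printf 'read1\tSiphoviridae\nread2\tSiphoviridae\nread3\tMicroviridae\n' > assignments.tsv

# Count reads per taxon to build a simple taxonomic profile
cut -f2 assignments.tsv | sort | uniq -c | sort -rn
```

Real profilers do essentially this at scale, with read-to-reference matching as the hard part.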

When the sample is composed mostly of viruses, we speak of metaviromics. Viruses are the most abundant biological entities on Earth, and the majority of them are yet to be discovered. This means that the fraction of viruses described in the databases is a small representation of the actual viral diversity. Because of this, a high percentage of the sequencing data in metaviromic studies shows no similarity to any sequence in the databases. This unknown, or at least uncharacterizable, fraction is sometimes called viral dark matter. As additional viruses are discovered and described and we expand our view of the virosphere, we will increasingly be able to understand the role of viruses in microbial ecosystems.

In this hands-on portion of the workshop, we will go through key steps in viromics.

0. Setup and choosing a dataset

Conda enables us to create multiple separate environments on our computer, where different programs can be installed without affecting the global environment or each other.

Why is this useful? Most of the time, tools rely on other programs to run correctly - these programs are the tool’s dependencies. For example, you cannot run a tool written in Python 3 on a machine that only has Python 2 installed (or no Python at all!).

So, why not just install everything into one big global environment that can run everything? We will focus on two reasons: findability and compatibility. The issue with findability is that a tool sometimes will not “find” its dependency even though it is installed. To reuse the Python example: if you try to run a tool that requires Python 2, but your global environment’s default Python version is 3.7, the tool will not run properly, even if Python 2 is technically installed. You would then have to switch Python versions manually every time you run a different tool, which is tedious and error-prone. The issue with compatibility is that some tools will simply uninstall versions of programs or packages that are incompatible with them, and then reinstall the versions that work for them, thereby “breaking” another tool (this might or might not have happened during the preparation of this course ;) ). To summarize: keeping tools in separate conda environments can save you a lot of pain.
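As a sketch, an isolated environment for a legacy tool could be described in its own YML file; the file name and contents below are illustrative, not the workshop's actual environment file:

```shell
# Write an illustrative environment file that pins an old Python for a legacy tool
cat > legacy-tool.yml <<'EOF'
name: legacy-tool
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=2.7
EOF

# conda env create -f legacy-tool.yml   # would build the isolated environment
# conda activate legacy-tool            # switch into it before running the tool
cat legacy-tool.yml
```

Activating the environment swaps the default Python without touching the global installation.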

1. Exploring Viromes

Tools: Seqtk, FastQC

Choose a dataset and explore it.

Describe fastq and fasta file formats
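As a quick illustration of the two formats, here is a single hypothetical read written as FASTQ (four lines per record: header, sequence, separator, quality string) and converted to FASTA (two lines per record: header, sequence):

```shell
# One FASTQ record: @header, sequence, '+' separator, per-base quality string
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > example.fastq

# Convert to FASTA: keep the header (swapping '@' for '>') and sequence lines only
awk 'NR % 4 == 1 {sub(/^@/, ">"); print} NR % 4 == 2 {print}' example.fastq > example.fasta
cat example.fasta
```

Note that the conversion discards the quality scores, which is why FASTQ is the standard for raw reads and FASTA for assembled sequences.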

2. Quality Control and Taxonomic Profiling

Tools: Fastp, Kraken2 (using nf-core/taxprofiler)

Sequencing Quality Questions

Why do we need quality control?

3. Binning and Evaluation

Tools: MaxBin2, CheckV

Questions How many viral contigs are there in the assembly? How many bins were found by the tool? How many high-quality bins do you find? What was the completeness of the bins?

4. Virus Identification

Tool: geNomad

How does geNomad work?

Taxonomic annotation of genomes with geNomad

Additional Considerations

  1. Data Management and Storage: Ensure you have adequate storage and a robust data management plan in place, as viromics data can be extensive. Here we use a tiny dataset to keep processing times short.
  2. Computational Resources: Some of these tools require significant computational resources. Ensure you have access to high-performance computing facilities if necessary.
  3. Documentation and Reproducibility: Keep detailed records of your commands, parameters, and tool versions so your analysis can be reproduced.
  4. Visualization and Interpretation: After the analysis, use visualization tools such as Krona for taxonomic profiles or Bandage for assembly graphs to better interpret the results.

By following these steps and leveraging the described tools, you will be able to perform an accurate analysis of a viromics dataset.

Key Points


Exploring viromics resources and files

Overview

Teaching: 0 min
Exercises: 30 min
Questions
  • What are the data sources in viromics?

Objectives
  • Exploring viromics resources and files

1. Download viromes and generate a summary report

Tools: Seqtk, FastQC, Fastp

Step 1: Exploring Viromics Resources and Files

Sample 1: human faecal virome sample (BioProject PRJEB47625)

A sample from a project characterizing viral communities associated with the human faecal virome.

Project description: The raw sequencing reads were derived from the human faecal virome (BioProject PRJEB47625). Total VLP DNA isolated from faecal samples provided by three donors was subjected to whole-genome shotgun metagenomic sequencing using the Illumina HiSeq X Ten platform. In this study, we compared the use of PCR and PCR-free methods for sequence library construction to assess the impact of PCR amplification bias on the human faecal virome.

To download a dataset from a BioProject, there are multiple tools, including Entrez Direct and the SRA Toolkit, that need to be installed on your system. Alternatively, use the SRA Explorer online tool to find the list of FastQ files belonging to this BioProject on the SRA FTP server.

Reads: https://www.ebi.ac.uk/ena/browser/view/PRJEB47625
Assemblies: https://zenodo.org/records/10650983
Article: https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.001236

Prerequisites

To successfully download the required files, you need to have either wget or curl installed on your system. These tools are essential for fetching files from the internet.

Using wget: wget is a command-line utility for downloading files from the web. It supports HTTP, HTTPS, and FTP protocols.

  • Installation:
    • Linux (Debian/Ubuntu):
      sudo apt-get install wget
      
    • MacOS (requires Homebrew):
      brew install wget
      
    • Conda environment:
      conda install anaconda::wget
      

Using curl: curl is a command-line tool for transferring data using various network protocols, including HTTP, HTTPS, and FTP.

  • Installation:
    • Linux (Debian/Ubuntu):
      sudo apt-get install curl
      
    • MacOS (requires Homebrew):
      brew install curl
      
    • Conda environment:
      conda install conda-forge::curl
      

We recommend creating two conda environments that together include most of the tools we will use in this workshop. To use them, first install conda (described at the bottom of this page), then download the YML files and create the environments from them.

conda create -n genomad -c conda-forge -c bioconda genomad
conda activate genomad
wget -c "https://raw.githubusercontent.com/VirJenDB/2024-07-12-leipzig-viromics-workshop/415a236fcbeb0823bbe9b84a936c945e182e4615/rawfiles/workshop-day5.yml" -O workshop-day5.yml
conda env create -f workshop-day5.yml

Ensure that you have one of these tools installed before proceeding with the download.

The list of curl commands is stored in the PRJEB47625_fastq_download.sh bash script file.

# Create a new "PRJEB47625" directory within the "workshop_day5" directory
mkdir -p workshop_day5/PRJEB47625

# Download "PRJEB47625_fastq_download.sh" bash script file, if you are using wget  
# wget -O workshop_day5/PRJEB47625/PRJEB47625_fastq_download.sh https://raw.githubusercontent.com/VirJenDB/2024-07-12-leipzig-viromics-workshop/1e984f29c4c7e4559493ae26453c6d9122763353/rawfiles/dataset/PRJEB47625_fastq_download_wget.sh

curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR679/001/ERR6797441/ERR6797441_1.fastq.gz -o workshop_day5/PRJEB47625/ERR6797441_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR679/001/ERR6797441/ERR6797441_2.fastq.gz -o workshop_day5/PRJEB47625/ERR6797441_2.fastq.gz

wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR679/001/ERR6797441/ERR6797441_1.fastq.gz -O workshop_day5/PRJEB47625/ERR6797441_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR679/001/ERR6797441/ERR6797441_2.fastq.gz -O workshop_day5/PRJEB47625/ERR6797441_2.fastq.gz

# Download "PRJEB47625_fastq_download.sh" bash script file, if you are using curl
#curl -L https://raw.githubusercontent.com/VirJenDB/2024-07-12-leipzig-viromics-workshop/6db885443ca210ba51c07345717a0bed9abf707a/rawfiles/dataset/PRJEB47625_fastq_download.sh -o workshop_day5/PRJEB47625/PRJEB47625_fastq_download.sh

# Give the required permission to the script to be executed
chmod +x workshop_day5/PRJEB47625/PRJEB47625_fastq_download.sh

# Execute the script that will download 39 Datasets in the current directory 
./workshop_day5/PRJEB47625/PRJEB47625_fastq_download.sh

Step 2: Perform quality control using Fastp

Fastp is a fast all-in-one preprocessing tool for FASTQ files. Install the tool from its download page, or install it via conda: conda install bioconda::fastp


Explanation of Main Fastp Options

Option Description
-i Input file for read 1 (can be gzipped).
-I Input file for read 2 (paired-end, can be gzipped).
-o Output file for read 1 (can be gzipped).
-O Output file for read 2 (paired-end, can be gzipped).
--html Generate an HTML report.
--json Generate a JSON report.
--thread Number of threads to use.
--length_required Minimum length of reads to keep.

Usage:

# Enter the workshop day 5 folder and create output directories
cd workshop_day5 && mkdir fastp_report fastp_output
# Perform quality control on the input FASTQ file
fastp -i PRJEB47625/ERR6797441_1.fastq.gz -I PRJEB47625/ERR6797441_2.fastq.gz -o fastp_output/ERR6797441_1.fastq.gz -O fastp_output/ERR6797441_2.fastq.gz --html fastp_report/report.html --json fastp_report/report.json

Fastp not only performs quality filtering but also removes adapters and low-quality reads, producing a cleaned dataset ready for downstream analysis.

Check the outputs:

# If the FastP run finished successfully, there should be two report files in the fastp_report folder and two FASTQ files within the "fastp_output" folder. FASTQ files will be used in the next step.
ls fastp_output/
ls fastp_report/ 

# To see FASTQ content, use zcat as they are compressed files
zcat fastp_output/ERR6797441_1.fastq.gz | head -n 10
 
# To see the report, either open fastp_report/report.html in a web browser or use head to read the first 26 lines of the JSON file
head -n 26 fastp_report/report.json

Step 3: Assess the quality of the sequencing data using FastQC

FastQC provides a simple way to perform quality control checks on raw sequence data. Check the download page to download and install it, or install it in a conda environment with conda install bioconda::fastqc.

Usage:

# create a directory for FastQC output
mkdir fastqc_report 
# Run FastQC on the input FASTQ file
fastqc PRJEB47625/ERR6797441_1.fastq.gz -o fastqc_report

FastQC generates a detailed report on the quality of the sequencing data, including information on read length distribution, GC content, and the presence of adapters, which helps in identifying any potential issues before further analysis.

Let’s take a look at the statistics generated by FastQC:

unzip fastqc_report/ERR6797441_1_fastqc.zip -d fastqc_report
head fastqc_report/ERR6797441_1_fastqc/fastqc_data.txt -n 20

Step 4: Create small files for use in the next steps using Seqtk

Seqtk is a fast and lightweight tool for processing sequences in the FASTA/FASTQ format. Check the GitHub page of Seqtk to install the tool. Alternatively, use conda install bioconda::seqtk to install Seqtk as a conda package.

Usage:

# Randomly subsample 1000 reads from each FASTQ file (the same seed, -s100, keeps the pairs in sync)
seqtk sample -s100 fastp_output/ERR6797441_1.fastq.gz 1000 > read1.fastq
seqtk sample -s100 fastp_output/ERR6797441_2.fastq.gz 1000 > read2.fastq

In this step, we use Seqtk to create a smaller subset of the data for initial exploration, which helps in quickly assessing the quality and content of the dataset without processing the entire file.
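Since each FASTQ record spans exactly four lines, you can sanity-check a subset by dividing its line count by four. Shown here on a tiny hypothetical file; for the real output you would run the same arithmetic on read1.fastq:

```shell
# Two hypothetical FASTQ records (8 lines in total)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > mini.fastq

# reads = lines / 4
echo $(( $(wc -l < mini.fastq) / 4 ))
```

For the subsampled files above, the result should be 1000.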


Conda installation

Find the latest Miniconda installer link for your operating system on the Anaconda website.

# Download the installer using wget or curl
wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh

# If you install miniconda3 somewhere other than the home directory, make sure there is enough space in the new location
bash miniconda.sh -b -p ~/miniconda3
rm -f miniconda.sh
~/miniconda3/bin/conda init

Key Points

  • Seqtk

  • FastQC

  • Fastp


Quality control and taxonomic profiling

Overview

Teaching: 0 min
Exercises: 30 min
Questions
  • Quality control and taxonomic profiling of one virome

Objectives
  • Install Fastp and Kraken2 and use them

2. Taxonomic Profiling of One Virome

Tools: Kraken2

Perform taxonomic profiling using Kraken2

Kraken2 is a system for assigning taxonomic labels to short DNA sequences. nf-core/taxprofiler is a bioinformatics best-practice analysis pipeline for taxonomic classification and profiling of shotgun metagenomic data. It allows for in-parallel taxonomic identification of reads or taxonomic abundance estimation with multiple classification and profiling tools against multiple databases and produces standardized output tables.

Step 1: Installation

Method 1: Install Nextflow (>=22.10.1) following the official installation instructions.

Method 2 (recommended): Install Nextflow natively and let it run pipeline tools through Docker Engine. The Docker Engine installation procedure is described at the end of this page.

# Install Java as a dependency 
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 11.0.11.hs-adpt

curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/

If your user does not have sudo rights, use a local bin folder instead and add it to your PATH variable:

# Create ~/bin Directory if It Does Not Exist
mkdir -p ~/bin

# Move the Nextflow Executable to ~/bin
mv nextflow ~/bin/

# Add ~/bin to your PATH (append this line to ~/.bashrc to make it permanent)
export PATH=$HOME/bin:$PATH

# Then, reload your shell configuration:
source ~/.bashrc

Test the installation of Nextflow using nextflow -version

Method 3: Create a new conda environment and install nextflow:

conda create -n workshop_nextflow -c bioconda nextflow 
conda activate workshop_nextflow

For other installation methods, please check the nf-core website.

Step 2: Download taxprofiler pipeline

Download taxprofiler pipeline and test it on a minimal dataset with a single command:

In the following command, replace YOURPROFILE with the name of the software-management method you used for the installation. The profile can be one of docker, singularity, podman, shifter, charliecloud, or conda, and instructs the pipeline to use that tool for software management. For example, if you used conda, -profile test,conda is the correct option.

# Update Nextflow to the latest version, if needed
nextflow self-update

# Return to the workshop_day5 folder
cd workshop_day5

# Chain multiple config profiles in a comma-separated string
nextflow run nf-core/taxprofiler -profile test,YOURPROFILE --outdir nextflow

Step 3: Run the nf-core:taxprofiler pipeline

Execution of the pipeline requires multiple input files, including the samplesheet.csv and database.csv files.

# Create samplesheet.csv by listing the FASTQ files processed by fastp
echo -e "sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta\nERR6797441,run1,ILLUMINA,fastp_output/ERR6797441_1.fastq.gz,fastp_output/ERR6797441_2.fastq.gz," > samplesheet.csv

echo -e "tool,db_name,db_path\nkraken2,kraken2,kraken2_db/kraken2_db/" > database.csv


nf-core/taxprofiler Pipeline Parameters

Parameter Description
--input samplesheet.csv Specifies the CSV file with your sample information.
--databases database.csv Specifies the CSV file with the list of databases.
--outdir <OUTDIR> The directory where the output files will be saved.
--run_<TOOL1> Include any specific tools or modules you want to run. Replace <TOOL1> with the actual tool names (e.g., --run_blast).
--run_<TOOL2> Optionally include additional tools or modules.
-profile <profile> Specifies the execution profile (e.g., docker, singularity, conda, etc.).


# Run the nf-core:taxprofiler pipeline
nextflow run nf-core/taxprofiler --input samplesheet.csv --databases database.csv --outdir nextflow_output -profile docker

This pipeline automates the process, running Kraken2 and other tools as part of a streamlined workflow.

After the pipeline completes, the nextflow_output folder will contain the fastqc, multiqc, and pipeline_info folders.


Install Docker Engine on Ubuntu if needed (sudo privileges are required)

#Update the Package Index
sudo apt-get update

# Install Required Packages
sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release

# Add Docker’s Official GPG Key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

# Set Up the Repository
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Update the Package Index Again
sudo apt-get update

# Install the latest version of Docker Engine and containerd.io:
sudo apt-get install docker-ce docker-ce-cli containerd.io

#Verify Docker Installation
sudo docker run hello-world

Create a Kraken2 Database

# Get back to the workshop folder
cd workshop_day5

# Download and install Kraken2
wget https://github.com/DerrickWood/kraken2/archive/v2.1.2.tar.gz
tar -xvzf v2.1.2.tar.gz
cd kraken2-2.1.2
./install_kraken2.sh .
export PATH=$PATH:$PWD
cd .. && rm v2.1.2.tar.gz

# Create a directory for the database
mkdir -p kraken2_db
cd kraken2_db

# Download the taxonomy and the RefSeq viral library as a Kraken2 database
kraken2-build --download-taxonomy --db kraken2_db
kraken2-build --download-library viral --db kraken2_db

Install with conda directly

conda install bioconda::kraken2
kraken2-build --download-taxonomy --db kraken2_db
kraken2-build --download-library viral --db kraken2_db

Alternatively, download the pre-built Kraken2 viral database:

wget -c "https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20240605.tar.gz"
mkdir kraken_db
tar -xvzf k2_viral_20240605.tar.gz -C kraken_db

Key Points

  • Fastp and Kraken2, using nf-core/taxprofiler


Virome Assembly and Bin Evaluation

Overview

Teaching: 0 min
Exercises: 30 min
Questions
  • How to bin virus genomes from the assembly file?

  • How to evaluate binning results?

Objectives

3. Bin virus genomes

Tools: MaxBin2, CheckV

In this section, we start with the assembled contig file belonging to the same project to save time. Starting from a pre-assembled contig file streamlines the process and lets us concentrate on binning and quality evaluation.

Download the assembled file with wget or curl

wget -nc https://zenodo.org/records/10650983/files/illumina_sample_01_megahit.fa.gz?download=1 -O workshop_day5/illumina_sample_01_megahit.fa.gz

# or, with curl
curl -L https://zenodo.org/records/10650983/files/illumina_sample_01_megahit.fa.gz?download=1 -o workshop_day5/illumina_sample_01_megahit.fa.gz


Create abundance_counts file

First, we need an abundance_counts file, which MaxBin2 will use later in the binning step.

# Decompress the contig file, then index it; bwa index needs an uncompressed FASTA (conda install bioconda::bwa)
gunzip -kf PRJEB47625/illumina_sample_01_megahit.fa.gz
bwa index PRJEB47625/illumina_sample_01_megahit.fa

# Align reads to contigs
bwa mem PRJEB47625/illumina_sample_01_megahit.fa PRJEB47625/ERR6797441_1.fastq.gz PRJEB47625/ERR6797441_2.fastq.gz > aligned_reads.sam

# Convert SAM to BAM (conda install bioconda::samtools)
samtools view -bS aligned_reads.sam > aligned_reads.bam

# Sort BAM file
samtools sort aligned_reads.bam -o sorted_reads.bam

# Index BAM file
samtools index sorted_reads.bam

# Convert BAM to BED intervals (conda install bioconda::bedtools)
bedtools bamtobed -i sorted_reads.bam > intervals.bed

# Count reads mapped to each contig
bedtools coverage -a intervals.bed -b sorted_reads.bam > abundance_counts.txt
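Note that MaxBin2 expects a simple two-column (contig, abundance) table, while bedtools coverage emits one row per interval with several extra columns. A hedged sketch of the conversion, assuming the contig name sits in column 1 and the read count in column 7 (the count's position depends on how many columns intervals.bed has, so check your own output first); the input file here is fabricated for illustration:

```shell
# Fake bedtools-coverage-like output: 6 BED columns, then the overlap count
printf 'contig_1\t0\t100\tr1\t60\t+\t5\ncontig_1\t100\t200\tr2\t60\t+\t3\ncontig_2\t0\t150\tr3\t60\t+\t7\n' > abundance_example.txt

# Sum counts per contig into the two-column format MaxBin2's -abund option expects
awk -F'\t' '{counts[$1] += $7} END {for (c in counts) print c "\t" counts[c]}' abundance_example.txt > abundance_maxbin.txt
sort abundance_maxbin.txt
```

On the real data you would run the same awk command on abundance_counts.txt.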

Step 1: Bin virus genomes from the assembly file using MaxBin2

MaxBin2 is a tool designed to bin metagenomic contigs into individual genomes, including viral genomes. Follow the instructions on its website or use conda install bioconda::maxbin2 to install MaxBin2 via conda.

Usage:

cd workshop_day5
wget -c https://zenodo.org/records/10650983/files/illumina_sample_01_megahit.fa.gz -O PRJEB47625/illumina_sample_01_megahit.fa.gz
gunzip -kf PRJEB47625/illumina_sample_01_megahit.fa.gz

# Run MaxBin2 for binning (MaxBin2 expects an uncompressed contig FASTA)
run_MaxBin.pl -contig PRJEB47625/illumina_sample_01_megahit.fa -abund abundance_counts.txt -out bins_directory

MaxBin2 will generate bins of contigs, each representing a putative genome, including viral genomes.

Step 2: Evaluate bins using CheckV

CheckV is used to assess the quality of viral genomes from metagenomic assemblies. The conda installation command is conda install bioconda::checkv.

Usage:

# Run CheckV on one of the bins produced by MaxBin2 (CheckV takes a nucleotide FASTA file as input)
checkv end_to_end bins_directory.001.fasta checkv_output -t 4

CheckV evaluates the completeness and contamination of viral bins, providing quality metrics that help in refining and validating your viral genomes.
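CheckV's quality_summary.tsv output can be filtered to count high-quality genomes. The excerpt below is fabricated for illustration and keeps only two columns (contig_id and checkv_quality; CheckV's real file has more columns between them, so adjust the column index accordingly):

```shell
# Fabricated two-column excerpt mimicking CheckV's contig_id / checkv_quality columns
printf 'contig_id\tcheckv_quality\nbin.001\tHigh-quality\nbin.002\tLow-quality\nbin.003\tComplete\n' > quality_excerpt.tsv

# Count sequences rated High-quality or Complete
awk -F'\t' 'NR > 1 && ($2 == "High-quality" || $2 == "Complete")' quality_excerpt.tsv | wc -l
```

The same filter answers the episode's question of how many high-quality bins were found.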

Key Points

  • MaxBin2, CheckV


Virus identification using geNomad

Overview

Teaching: 0 min
Exercises: 30 min
Questions
  • How to use geNomad?

Objectives
  • Install geNomad

  • Run and interpret its result

4. Virus Identification and annotation by geNomad

Tools: geNomad

geNomad is a tool that identifies virus and plasmid genomes from nucleotide sequences. It provides state-of-the-art classification performance and can be used to quickly find mobile genetic elements from genomes, metagenomes, or metatranscriptomes.

Installation

using conda

conda create -n genomad -c conda-forge -c bioconda genomad
conda activate genomad
genomad download-database .

using docker

docker pull antoniopcamargo/genomad
docker run -ti --rm -v "$(pwd):/app" antoniopcamargo/genomad download-database .
docker run -ti --rm -v "$(pwd):/app" antoniopcamargo/genomad end-to-end PRJEB47625/illumina_sample_01_megahit.fa.gz output genomad_db

Pipeline Options

Option Description
end-to-end Executes the full pipeline
--cleanup Force geNomad to delete intermediate files
--splits 8 Split the database searches into eight chunks to reduce peak memory usage

Usage:

# Run the full geNomad pipeline (end-to-end command): it takes a nucleotide FASTA file (illumina_sample_01_megahit.fa.gz) and the database (genomad_db) as input and writes results to genomad_output
genomad end-to-end --cleanup --splits 8 PRJEB47625/illumina_sample_01_megahit.fa.gz genomad_output genomad_db

geNomad identifies viral sequences within the assembled contigs and provides annotations that are crucial for understanding the viral components of your virome.

Key Points

  • geNomad


Wrap Up

Overview

Teaching: 30 min
Exercises: 0 min
Questions
Objectives

Key Points