Introduction & Setup
Overview
Teaching: 30 min
Exercises: 0 min
Questions
Objectives
Overall view of the workshop
Getting to know each other
Setup
Key Points
Lecture on Viruses and viromics
Overview
Teaching: 30 min
Exercises: 0 min
Questions
Objectives
Key Points
Break and Questions
Overview
Teaching: 30 min
Exercises: 0 min
Questions
Objectives
Key Points
Lecture on RDM and Virus Infrastructure
Overview
Teaching: 30 min
Exercises: min
Questions
Objectives
Key Points
Lunch break
Overview
Teaching: 60 min
Exercises: 0 min
Questions
Objectives
Key Points
Project Explanation
Overview
Teaching: 30 min
Exercises: min
Questions
Objectives
Hands-on Workshop: Viromics
Description
Metagenomics
The emergence of Next Generation Sequencing (NGS) has facilitated the development of metagenomics. In metagenomic studies, DNA from all the organisms in a mixed sample is sequenced in a massively parallel way (or RNA in case of metatranscriptomics). The goal of these studies is usually to identify certain microbes in a sample, or to taxonomically or functionally characterize a microbial community. There are different ways to process and analyze metagenomes, such as the targeted amplification and sequencing of the 16S ribosomal RNA gene (amplicon sequencing, used for taxonomic profiling) or shotgun sequencing of the complete genomes in the sample.
After primary processing of the NGS data (which we will not perform in this exercise), a common approach is to compare the metagenomic sequencing reads to reference databases composed of genome sequences of known organisms. Sequence similarity indicates that the microbes in the sample are genomically related to the organisms in the database. By counting the sequencing reads that are related to certain taxa, or that encode certain functions, we can get an idea of the ecology and functioning of the sampled metagenome.
When the sample is composed mostly of viruses, we speak of metaviromics. Viruses are the most abundant biological entities on Earth, and the majority of them are yet to be discovered. This means that the fraction of viruses described in the databases is a small representation of the actual viral diversity. Because of this, a high percentage of the sequencing data in metaviromic studies shows no similarity to any sequence in the databases. We sometimes refer to this unknown, or at least uncharacterizable, fraction as viral dark matter. As additional viruses are discovered and described and we expand our view of the virosphere, we will increasingly be able to understand the role of viruses in microbial ecosystems.
In this hands-on portion of the workshop, we will go through key steps in viromics.
0. Setup and choosing a dataset
Conda enables us to create multiple separate environments on our computer, where different programs can be installed without affecting the global environment.
Why is this useful? Most of the time, tools rely on other programs to be able to run correctly - these programs are the tool’s dependencies. For example: you cannot run a tool that is coded in Python 3 on a machine that only has Python 2 installed (or no Python at all!).
So, why not just install everything into one big global environment? We will focus on two reasons: findability and compatibility. The issue with findability is that sometimes a tool will not “find” its dependency even though it is installed. To reuse the Python example: if you try to run a tool that requires Python 2, but your global environment’s default Python version is 3.7, the tool will not run properly, even if Python 2 is technically installed. You would then have to manually switch Python versions every time you run a different tool, which is tedious and messy. The issue with compatibility is that some tools will simply uninstall versions of programs or packages that are incompatible with them and reinstall the versions that work for them, thereby “breaking” another tool (this might or might not have happened during the preparation of this course ;) ). To summarize: keeping tools in separate conda environments can save you a lot of pain.
1. Exploring Viromes
Tools: Seqtk, FastQC
- Seqtk: Extract subsets of your data.
- FastQC: Generate summary reports.
Choose a dataset and explore it.
Describe the FASTQ and FASTA file formats:
- What information is contained in sequencing files?
- How many sequences are in your files? Print the first 10 and last 10 lines of each file in the terminal.
- Describe these lines.
- What is different between the paired-end files? What is the same?
- What is the GC content of your files?
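To make these questions concrete, here is a minimal sketch using only standard shell tools. The two-read FASTQ file is invented for illustration; real files are gzipped, so you would prepend `zcat`:

```shell
# Build a tiny two-read FASTQ file to practice on (reads are made up)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > demo.fastq

# Each FASTQ record spans 4 lines, so number of reads = lines / 4
echo $(( $(wc -l < demo.fastq) / 4 ))

# GC content: look only at sequence lines (every 4th line, offset 2)
awk 'NR % 4 == 2 {
    for (i = 1; i <= length($0); i++) {
        if (substr($0, i, 1) ~ /[GCgc]/) gc++
        tot++
    }
} END { printf "%.1f\n", 100 * gc / tot }' demo.fastq
```

For a real gzipped file, `zcat ERR6797441_1.fastq.gz | head -n 10` prints the first 10 lines, and `zcat ... | tail -n 10` the last 10.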
2. Quality Control and Taxonomic Profiling
Tools: Fastp, Kraken2 (using nf-core/taxprofiler)
- Fastp: Clean your data by removing low-quality reads and adapters, ensuring high-quality input for downstream analyses.
- Kraken2: Assign taxonomic labels to your sequences, providing insights into the diversity of your virome.
- nf-core/taxprofiler: An integrated pipeline that automates the taxonomic profiling process, improving efficiency and reproducibility.
Sequencing Quality Questions
Why do we need quality control?
- adapters and/or primers
- low-quality ends
What is the impact of including low-quality reads in downstream analyses?
What are the metrics for assessing sequencing quality for Illumina reads?
- Phred scores
- read quality distribution
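To connect the metrics above to the files themselves: a Phred score Q encodes the base-call error probability as p = 10^(-Q/10), and Illumina FASTQ stores Q as the ASCII code of the quality character minus 33 (Phred+33). A quick decoding of an invented quality string:

```shell
# Decode a made-up quality string "II5!" into Q scores and error rates.
# od prints each byte's decimal ASCII code; awk applies Phred+33 and p = 10^(-Q/10).
printf 'II5!' | od -An -tu1 | tr -s ' ' '\n' |
  awk 'NF { q = $1 - 33; printf "Q=%d p_err=%g\n", q, 10^(-q / 10) }'
```

So `I` (Q40) means roughly 1 error in 10,000 bases, while `!` (Q0) means the base call is no better than a guess.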
3. Binning and Evaluation
Tools: MaxBin2, CheckV
- MaxBin2: Organize contigs into bins representing individual genomes, facilitating genome-centric analyses.
- CheckV: Evaluate the quality of your viral bins to ensure completeness and minimize contamination, enhancing the reliability of your results.
Questions
- How many viral contigs are there in the assembly?
- How many bins were found by the tool?
- How many high-quality bins do you find?
- What was the completeness of the bins?
4. Virus Identification
Tool: geNomad
- geNomad: Identify and annotate viral sequences
How does geNomad work?
- Strengths vs. weaknesses
Taxonomic annotation of genomes with geNomad
- How many genomes are classified?
- Describe the output
Additional Considerations
- Data Management and Storage: Ensure you have adequate storage and a robust data management plan in place, as viromics data can be extensive. Here we use a tiny dataset to accelerate the preparation part.
- Computational Resources: Some of these tools require significant computational resources. Ensure you have access to high-performance computing facilities if necessary.
- Documentation and Reproducibility: Keep detailed records of your commands, parameters, and tool versions to ensure reproducibility.
- Visualization and Interpretation: Post-analysis, use visualization tools like Krona for taxonomic profiles or Bandage for assembly graphs to better interpret the results.
By following these steps and leveraging the described tools, you will be able to perform an accurate analysis of a viromics dataset.
Key Points
Exploring viromics resources and files
Overview
Teaching: min
Exercises: 30 min
Questions
What are the data sources in viromics
Objectives
Exploring viromics resources and files
1. Download viromes and generate a summary report
Tools: Seqtk, FastQC, Fastp
Step 1: Exploring Viromics Resources and Files
Sample 1 (BioProject PRJEB47625)
Sample from a project characterizing viral communities in the human faecal virome.
Project description: The raw sequencing reads were derived from the human faecal virome (BioProject PRJEB47625). Total VLP DNA isolated from faecal samples provided by three donors was subjected to whole-genome shotgun metagenomic sequencing on the Illumina HiSeq X Ten platform. In this study, we compared the use of PCR and PCR-free methods for sequence library construction to assess the impact of PCR amplification bias on the human faecal virome.
To download datasets from a BioProject, there are multiple tools, including Entrez Direct and the SRA Toolkit, that need to be installed on your system. Alternatively, we used the SRA Explorer online tool to find the list of FastQ files belonging to this BioProject on the SRA FTP server.
Reads: https://www.ebi.ac.uk/ena/browser/view/PRJEB47625
Assemblies: https://zenodo.org/records/10650983
Article: https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.001236
Prerequisites
To successfully download the required files, you need to have either wget or curl installed on your system. These tools are essential for fetching files from the internet.
Using wget: wget is a command-line utility for downloading files from the web. It supports HTTP, HTTPS, and FTP protocols.
- Installation:
  - Linux (Debian/Ubuntu): sudo apt-get install wget
  - macOS (requires Homebrew): brew install wget
  - Conda environment: conda install anaconda::wget
Using curl: curl is a command-line tool for transferring data using various network protocols, including HTTP, HTTPS, and FTP.
- Installation:
  - Linux (Debian/Ubuntu): sudo apt-get install curl
  - macOS (requires Homebrew): brew install curl
  - Conda environment: conda install conda-forge::curl
We recommend creating two conda environments that include most of the tools we will use in this workshop. To use them, first install conda (described at the bottom of the page), then download the YML files and create the environments from them.
conda create -n genomad -c conda-forge -c bioconda genomad
conda activate genomad
wget -c "https://github.com/VirJenDB/2024-07-12-leipzig-viromics-workshop/blob/415a236fcbeb0823bbe9b84a936c945e182e4615/rawfiles/workshop-day5.yml" -O workshop-day5.yml
conda env create -f workshop-day5.yml
Ensure that you have one of these tools installed before proceeding with the download.
The list of curl commands is stored in the PRJEB47625_fastq_download.sh bash script file.
# Create a new directory "PRJEB47625" within "workshop" directory
mkdir workshop_day5 && mkdir workshop_day5/PRJEB47625
# Download "PRJEB47625_fastq_download.sh" bash script file, if you are using wget
# wget -O workshop_day5/PRJEB47625/PRJEB47625_fastq_download.sh https://raw.githubusercontent.com/VirJenDB/2024-07-12-leipzig-viromics-workshop/1e984f29c4c7e4559493ae26453c6d9122763353/rawfiles/dataset/PRJEB47625_fastq_download_wget.sh
# Download the two paired FASTQ files directly, using curl ...
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR679/001/ERR6797441/ERR6797441_1.fastq.gz -o workshop_day5/PRJEB47625/ERR6797441_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR679/001/ERR6797441/ERR6797441_2.fastq.gz -o workshop_day5/PRJEB47625/ERR6797441_2.fastq.gz
# ... or using wget
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR679/001/ERR6797441/ERR6797441_1.fastq.gz -O workshop_day5/PRJEB47625/ERR6797441_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR679/001/ERR6797441/ERR6797441_2.fastq.gz -O workshop_day5/PRJEB47625/ERR6797441_2.fastq.gz
# Download "PRJEB47625_fastq_download.sh" bash script file, if you are using curl
#curl -L https://raw.githubusercontent.com/VirJenDB/2024-07-12-leipzig-viromics-workshop/6db885443ca210ba51c07345717a0bed9abf707a/rawfiles/dataset/PRJEB47625_fastq_download.sh -o workshop_day5/PRJEB47625/PRJEB47625_fastq_download.sh
# Give the required permission to the script to be executed
chmod +x workshop_day5/PRJEB47625/PRJEB47625_fastq_download.sh
# Execute the script, which will download 39 datasets into the current directory
./workshop_day5/PRJEB47625/PRJEB47625_fastq_download.sh
Step 2: Perform quality control using Fastp
Fastp is a fast all-in-one preprocessing tool for FASTQ files. Install the tool from its download page.
Alternatively, you can install Fastp via conda: conda install bioconda::fastp
Explanation of Main Fastp Options
Option | Description |
---|---|
-i | Input file for read 1 (can be gzipped). |
-I | Input file for read 2 (paired-end, can be gzipped). |
-o | Output file for read 1 (can be gzipped). |
-O | Output file for read 2 (paired-end, can be gzipped). |
--html | Generate an HTML report. |
--json | Generate a JSON report. |
--thread | Number of threads to use. |
--length_required | Minimum length of reads to keep. |
Usage:
# Enter the workshop day 5 folder and create the output directories
cd workshop_day5 && mkdir fastp_report fastp_output
# Perform quality control on the input FASTQ file
fastp -i PRJEB47625/ERR6797441_1.fastq.gz -I PRJEB47625/ERR6797441_2.fastq.gz -o fastp_output/ERR6797441_1.fastq.gz -O fastp_output/ERR6797441_2.fastq.gz --html fastp_report/report.html --json fastp_report/report.json
Fastp not only performs quality filtering but also removes adapters and low-quality reads, producing a cleaned dataset ready for downstream analysis.
Check the outputs:
# If the fastp run finished successfully, there should be two report files in the "fastp_report" folder and two FASTQ files in the "fastp_output" folder. The FASTQ files will be used in the next step.
ls fastp_output/
ls fastp_report/
# To see FASTQ content, use zcat as they are compressed files
zcat fastp_output/ERR6797441_1.fastq.gz | head -n 10
# To see the report, either open fastp_report/report.html in a web browser or use head to read the first 26 lines of the JSON file
head -n 26 fastp_report/report.json
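The JSON report is handy for scripted checks. The excerpt below mimics the nesting of fastp's summary section (all numbers are invented) and shows a quick grep to pull out the read counts before and after filtering:

```shell
# A tiny excerpt shaped like fastp's report.json (values are made up)
cat > report_excerpt.json <<'EOF'
{
  "summary": {
    "before_filtering": { "total_reads": 20000, "q30_rate": 0.91 },
    "after_filtering": { "total_reads": 19384, "q30_rate": 0.95 }
  }
}
EOF
# How many reads went in, and how many survived filtering?
grep -o '"total_reads": [0-9]*' report_excerpt.json
```

On a real report, the same grep against fastp_report/report.json gives you a one-line sanity check of how many reads were discarded.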
Step 3: Assess the quality of the sequencing data using FastQC
FastQC provides a simple way to perform quality control checks on raw sequence data. Check the download page to download and install it, or install it into the conda environment with conda install bioconda::fastqc.
Usage:
# create a directory for FastQC output
mkdir fastqc_report
# Run FastQC on the input FASTQ file
fastqc PRJEB47625/ERR6797441_1.fastq.gz -o fastqc_report
FastQC generates a detailed report on the quality of the sequencing data, including information on read length distribution, GC content, and the presence of adapters, which helps in identifying any potential issues before further analysis.
Let’s take a look at the statistics generated by FastQC:
unzip fastqc_report/ERR6797441_1_fastqc.zip -d fastqc_report
head fastqc_report/ERR6797441_1_fastqc/fastqc_data.txt -n 20
Step 4: Create small files for use in the next steps using Seqtk
Seqtk is a fast and lightweight tool for processing sequences in the FASTA/FASTQ format. Check the GitHub page of Seqtk to install the tool. Alternatively, run conda install bioconda::seqtk to install Seqtk as a conda package.
Usage:
# Randomly subsample 1000 reads from each FASTQ file; the same seed (-s100) keeps the mates paired
seqtk sample -s100 fastp_output/ERR6797441_1.fastq.gz 1000 > read1.fastq
seqtk sample -s100 fastp_output/ERR6797441_2.fastq.gz 1000 > read2.fastq
In this step, we use Seqtk to create a smaller subset of the data for initial exploration, which helps in quickly assessing the quality and content of the dataset without processing the entire file.
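Using the same seed for both files is what keeps the pairs synchronized. A pure-shell sketch of how you might verify pairing, on tiny invented paired files (the /1 and /2 suffixes mark the mates):

```shell
# Two tiny paired FASTQ files (reads invented)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTAA\n+\nIIII\n' > pair_1.fastq
printf '@r1/2\nCGTA\n+\nIIII\n@r2/2\nAATT\n+\nIIII\n' > pair_2.fastq
# Header lines (every 4th line, offset 1) must match once the mate suffix is cut
awk 'NR % 4 == 1' pair_1.fastq | cut -d/ -f1 > ids_1.txt
awk 'NR % 4 == 1' pair_2.fastq | cut -d/ -f1 > ids_2.txt
diff ids_1.txt ids_2.txt && echo "pairs in sync"
```

The same check applied to read1.fastq and read2.fastq confirms that the subsampled files are still properly paired.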
Conda installation
Find the latest Miniconda installer link for your operating system on the Anaconda website.
# Download the installer using curl or wget
wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# If you install miniconda3 somewhere other than the home directory, make sure there is enough space in the new location
bash miniconda.sh -p ~/miniconda3
rm -rf miniconda.sh
conda init
Key Points
Seqtk
FastQC
Fastp
Quality control and taxonomic profiling
Overview
Teaching: min
Exercises: 30 min
Questions
Quality control and taxonomic profiling of one virome
Objectives
Install Fastp and Kraken2, and use them
2. Taxonomic Profiling of One Virome
Tools: Kraken2
Perform taxonomic profiling using Kraken2
Kraken2 is a system for assigning taxonomic labels to short DNA sequences. nf-core/taxprofiler is a bioinformatics best-practice analysis pipeline for taxonomic classification and profiling of shotgun metagenomic data. It allows for in-parallel taxonomic identification of reads or taxonomic abundance estimation with multiple classification and profiling tools against multiple databases and produces standardized output tables.
Step 1: Installation
Method 1: Install Nextflow (>=22.10.1) from the Nextflow website.
Method 2 (recommended): use Nextflow with Docker Engine; the Docker Engine installation procedure is described at the end of the page.
# Install Java as a dependency
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 11.0.11.hs-adpt
curl -s https://get.nextflow.io | bash
mv nextflow /usr/local/bin/
If your user is not a sudoer, use your user's local bin folder instead and add it to the PATH variable:
# Create ~/bin Directory if It Does Not Exist
mkdir -p ~/bin
# Move the Nextflow Executable to ~/bin
mv nextflow ~/bin/
# Add ~/bin to Your PATH
export PATH=$HOME/bin:$PATH
# Then, reload your shell configuration:
source ~/.bashrc
Test the installation of Nextflow using nextflow -version
Method 3: Create a new conda environment and install nextflow:
conda create -n workshop_nextflow -c bioconda nextflow
conda activate workshop_nextflow
For other installation methods, please check nf-core website
Step 2: Download taxprofiler pipeline
Download taxprofiler pipeline and test it on a minimal dataset with a single command:
In the following command, replace YOURPROFILE with the name of the software-management method you used for installation. The profile name can be one of docker, singularity, podman, shifter, charliecloud, or conda, which instructs the pipeline to use that tool for software management. For example, if you used conda, `-profile test,conda` is the correct form.
# Update nextflow to make sure it is the latest version if it is needed
nextflow self-update
# return to workshop_day5 folder
cd workshop_day5
# Chain multiple config profiles in a comma-separated string
nextflow run nf-core/taxprofiler -profile test,YOURPROFILE --outdir nextflow
Step 3: Run the nf-core/taxprofiler pipeline
Execution of the pipeline requires multiple input files, including samplesheet.csv and database.csv.
# Create samplesheet.csv by listing the fastq files processed by fastp
echo -e "sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta\nERR6797441,run1,ILLUMINA,PRJEB47625/ERR6797441_1.fastq.gz,PRJEB47625/ERR6797441_2.fastq.gz," > samplesheet.csv
echo -e "tool,db_name,db_path\nkraken2,kraken2,kraken2_db/kraken2_db/" > database.csv
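Note that plain `echo` only interprets the `\n` escapes with the -e flag, and that behavior varies between shells; `printf` is more portable. As a sanity check, you can recreate the samplesheet with printf and confirm every row has the six expected columns:

```shell
# Recreate samplesheet.csv with printf (portable across shells)
printf 'sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta\n' > samplesheet.csv
printf 'ERR6797441,run1,ILLUMINA,PRJEB47625/ERR6797441_1.fastq.gz,PRJEB47625/ERR6797441_2.fastq.gz,\n' >> samplesheet.csv
# Every line must have exactly 6 comma-separated fields
awk -F, 'NF != 6 { bad = 1 } END { print (bad ? "malformed" : "ok") }' samplesheet.csv
```

A malformed samplesheet (a stray comma, a missing column) is one of the most common reasons the pipeline fails at startup, so this check is worth the few seconds.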
nf-core/taxprofiler Pipeline Parameters
Parameter | Description |
---|---|
--input samplesheet.csv | Specifies the CSV file with your sample information. |
--databases database.csv | Specifies the CSV file with the list of databases. |
--outdir <OUTDIR> | The directory where the output files will be saved. |
--run_<TOOL1> | Include any specific tools or modules you want to run. Replace <TOOL1> with the actual tool name (e.g., --run_kraken2). |
--run_<TOOL2> | Optionally include additional tools or modules. |
-profile <profile> | Specifies the execution profile (e.g., docker, singularity, conda). |
# Run the nf-core:taxprofiler pipeline
nextflow run nf-core/taxprofiler --input samplesheet.csv --databases database.csv --outdir nextflow_output -profile docker
This pipeline automates the process, running Kraken2 and other tools as part of a streamlined workflow.
After the pipeline completes, the nextflow_output folder will contain fastqc, multiqc, and pipeline_info folders.
Install Docker Engine on Ubuntu if it is needed (Sudo privilege is required)
#Update the Package Index
sudo apt-get update
# Install Required Packages
sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release
# Add Docker’s Official GPG Key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
# Set Up the Repository
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Update the Package Index Again
sudo apt-get update
# Install the latest version of Docker Engine and containerd.io:
sudo apt-get install docker-ce docker-ce-cli containerd.io
#Verify Docker Installation
sudo docker run hello-world
Create a Kraken2 Database
# Get back to the workshop folder
cd workshop_day5
# Download and install Kraken2
wget https://github.com/DerrickWood/kraken2/archive/v2.1.2.tar.gz
tar -xvzf v2.1.2.tar.gz
cd kraken2-2.1.2
./install_kraken2.sh .
export PATH=$PATH:$PWD
cd .. && rm v2.1.2.tar.gz
# Create a directory for the database
mkdir -p kraken2_db
cd kraken2_db
# Download the NCBI taxonomy and the RefSeq viral library, then build the database
kraken2-build --download-taxonomy --db kraken2_db
kraken2-build --download-library viral --db kraken2_db
kraken2-build --build --db kraken2_db
Alternatively, install Kraken2 directly with conda and build the database:
conda install bioconda::kraken2
kraken2-build --download-taxonomy --db kraken2_db
kraken2-build --download-library viral --db kraken2_db
kraken2-build --build --db kraken2_db
Alternatively, download the prebuilt Kraken2 viral database:
wget -c "https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20240605.tar.gz"
mkdir kraken_db
tar -xvzf k2_viral_20240605.tar.gz -C kraken_db
Key Points
Fastp and Kraken2 via nf-core/taxprofiler
Virome Assembly and Bin Evaluation
Overview
Teaching: min
Exercises: 30 min
Questions
How to bin virus genomes from the assembly file
How to evaluate binning results
Objectives
3. Bin virus genomes
Tools: MaxBin2, CheckV
In this section, we start from a pre-assembled contig file from the same project. This saves time and lets us focus on binning and quality evaluation.
Download the assembled file with wget or curl
# With wget ...
wget -nc "https://zenodo.org/records/10650983/files/illumina_sample_01_megahit.fa.gz?download=1" -O workshop_day5/illumina_sample_01_megahit.fa.gz
# ... or with curl
curl -L "https://zenodo.org/records/10650983/files/illumina_sample_01_megahit.fa.gz?download=1" -o workshop_day5/illumina_sample_01_megahit.fa.gz
Create abundance_counts file
First, we need an abundance_counts file, which MaxBin2 will use later.
# Index the contig file (conda install bioconda::bwa)
bwa index PRJEB47625/illumina_sample_01_megahit.fa.gz
# Align reads to contigs
bwa mem PRJEB47625/illumina_sample_01_megahit.fa.gz PRJEB47625/ERR6797441_1.fastq.gz PRJEB47625/ERR6797441_2.fastq.gz > aligned_reads.sam
# Convert SAM to BAM (conda install bioconda::samtools)
samtools view -bS aligned_reads.sam > aligned_reads.bam
# Sort BAM file
samtools sort aligned_reads.bam -o sorted_reads.bam
# Index BAM file
samtools index sorted_reads.bam
# Convert BAM to BED intervals (conda install bioconda::bedtools)
bedtools bamtobed -i sorted_reads.bam > intervals.bed
# Count reads mapped to each contig
bedtools coverage -a intervals.bed -b sorted_reads.bam > abundance_counts.txt
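MaxBin2's -abund option expects a two-column contig/abundance table, while bedtools coverage emits one row per interval with several extra columns. A sketch of collapsing it with awk, on a hypothetical two-contig excerpt (the column layout here is illustrative; check which column holds the read count in your actual output):

```shell
# Hypothetical excerpt of per-contig coverage output (tab-separated):
# contig  start  end  read_count  bases_covered  length  fraction
printf 'k141_1\t0\t500\t12\t480\t500\t0.96\n' > coverage_excerpt.txt
printf 'k141_2\t0\t800\t40\t800\t800\t1.00\n' >> coverage_excerpt.txt
# Sum read counts per contig into the contig<TAB>abundance table
awk -v OFS='\t' '{ sum[$1] += $4 } END { for (c in sum) print c, sum[c] }' \
    coverage_excerpt.txt | sort > maxbin_abundance.txt
cat maxbin_abundance.txt
```

The same awk one-liner, pointed at abundance_counts.txt with the correct count column, produces a file MaxBin2 can consume directly.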
Step 1: Bin virus genomes from the assembly file using MaxBin2
MaxBin2 is a tool designed to bin metagenomic contigs into individual genomes, including viral genomes. Follow the website instructions or run conda install bioconda::maxbin2 to install MaxBin2 via conda.
Usage:
cd workshop_day5
wget -c https://zenodo.org/records/10650983/files/illumina_sample_01_megahit.fa.gz -O PRJEB47625/illumina_sample_01_megahit.fa.gz
# MaxBin2 expects an uncompressed FASTA, so decompress the contigs first
gunzip -k PRJEB47625/illumina_sample_01_megahit.fa.gz
# Run MaxBin2 for binning
run_MaxBin.pl -contig PRJEB47625/illumina_sample_01_megahit.fa -abund abundance_counts.txt -out bins_directory
MaxBin2 will generate bins of contigs, each representing a putative genome, including viral genomes.
Step 2: Evaluate bins using CheckV
CheckV is used to assess the quality of viral genomes from metagenomic assemblies. The conda installation command is conda install bioconda::checkv.
Usage:
# CheckV scores one FASTA file at a time, so concatenate the MaxBin2 bins first
cat bins_directory.*.fasta > all_bins.fasta
# Run CheckV on the binned data (a CheckV database must be downloaded and configured)
checkv end_to_end all_bins.fasta checkv_output -t 4
CheckV evaluates the completeness and contamination of viral bins, providing quality metrics that help in refining and validating your viral genomes.
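CheckV's headline output is quality_summary.tsv, whose checkv_quality column you can tally directly to answer the questions above. A sketch on an invented three-bin excerpt (column names follow CheckV's layout; the values are made up):

```shell
# Invented excerpt with the column layout of CheckV's quality_summary.tsv
printf 'contig_id\tcontig_length\tcheckv_quality\tcompleteness\n' > quality_excerpt.tsv
printf 'bin001\t41500\tHigh-quality\t97.1\n' >> quality_excerpt.tsv
printf 'bin002\t12040\tMedium-quality\t63.0\n' >> quality_excerpt.tsv
printf 'bin003\t3300\tLow-quality\t21.5\n' >> quality_excerpt.tsv
# Tally bins per quality tier (skip the header line)
cut -f3 quality_excerpt.tsv | tail -n +2 | sort | uniq -c
```

Running the same cut/sort/uniq pipeline on checkv_output/quality_summary.tsv answers "how many high-quality bins?" in one line.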
Key Points
MaxBin2, CheckV
Virus identification using geNomad
Overview
Teaching: min
Exercises: 30 min
Questions
How to use geNomad?
Objectives
Install geNomad
Run and interpret its result
4. Virus Identification and annotation by geNomad
Tools: geNomad
geNomad is a tool that identifies virus and plasmid genomes from nucleotide sequences. It provides state-of-the-art classification performance and can be used to quickly find mobile genetic elements from genomes, metagenomes, or metatranscriptomes.
Installation
using conda
conda create -n genomad -c conda-forge -c bioconda genomad
conda activate genomad
genomad download-database .
using docker
docker pull antoniopcamargo/genomad
docker run -ti --rm -v "$(pwd):/app" antoniopcamargo/genomad download-database .
docker run -ti --rm -v "$(pwd):/app" antoniopcamargo/genomad end-to-end PRJEB47625/illumina_sample_01_megahit.fa.gz output genomad_db
Pipeline Options
Option | Description |
---|---|
end-to-end | Executes the full pipeline. |
--cleanup | Force geNomad to delete intermediate files. |
--splits 8 | Split the search into chunks so this example can run in a notebook. |
Usage:
# Run the full geNomad pipeline (end-to-end command), taking a nucleotide FASTA file (illumina_sample_01_megahit.fa.gz) and the database (genomad_db) as input, and producing output in genomad_output
genomad end-to-end --cleanup --splits 8 PRJEB47625/illumina_sample_01_megahit.fa.gz genomad_output genomad_db
geNomad identifies viral sequences within the assembled contigs and provides annotations that are crucial for understanding the viral components of your virome.
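geNomad's end-to-end run writes per-contig summary tables, including a virus summary with scores and taxonomy. A sketch of counting classified contigs on an invented excerpt (the column layout mirrors geNomad's summary table; the values are made up):

```shell
# Invented excerpt shaped like geNomad's virus summary table
printf 'seq_name\tlength\tvirus_score\ttaxonomy\n' > virus_excerpt.tsv
printf 'k141_10\t35000\t0.97\tViruses;Duplodnaviria;Caudoviricetes\n' >> virus_excerpt.tsv
printf 'k141_22\t8100\t0.88\tViruses;Monodnaviria\n' >> virus_excerpt.tsv
# How many contigs were classified as viral, and at which taxa?
tail -n +2 virus_excerpt.tsv | wc -l
cut -f4 virus_excerpt.tsv | tail -n +2
```

Applied to the real summary file under genomad_output, this answers the "how many genomes are classified?" question from the earlier section.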
Key Points
geNomad
Wrap Up
Overview
Teaching: 30 min
Exercises: 0 min
Questions
Objectives
Key Points