Quality control and taxonomic profiling
Overview
Teaching: min
Exercises: 30 min
Questions
Quality control and taxonomic profiling of one virome
Objectives
Install fastp and Kraken2, and use them for quality control and taxonomic profiling of a virome
2. Taxonomic Profiling of One Virome
Tools: Kraken2
Perform taxonomic profiling using Kraken2
Kraken2 is a system for assigning taxonomic labels to short DNA sequences. nf-core/taxprofiler is a bioinformatics best-practice analysis pipeline for taxonomic classification and profiling of shotgun metagenomic data. It allows for in-parallel taxonomic identification of reads or taxonomic abundance estimation with multiple classification and profiling tools against multiple databases and produces standardized output tables.
Step 1: Installation
Method 1: Install Nextflow (>=22.10.1) directly:
Method 2 (recommended): Install Nextflow and run the pipeline using the Docker Engine. The Docker Engine installation procedure is described at the end of this page.
# Install Java as a dependency
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 11.0.11.hs-adpt
curl -s https://get.nextflow.io | bash
# Move the executable to a directory on the PATH (requires sudo)
sudo mv nextflow /usr/local/bin/
If your user does not have sudo privileges, use a local bin folder instead and add it to the PATH variable:
# Create ~/bin Directory if It Does Not Exist
mkdir -p ~/bin
# Move the Nextflow Executable to ~/bin
mv nextflow ~/bin/
# Add ~/bin to Your PATH permanently
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc
# Then, reload your shell configuration:
source ~/.bashrc
Test the installation of Nextflow using nextflow -version
Method 3: Create a new conda environment and install nextflow:
conda create -n workshop_nextflow -c bioconda nextflow
conda activate workshop_nextflow
For other installation methods, please check the nf-core website.
Step 2: Download the taxprofiler pipeline
Download the taxprofiler pipeline and test it on a minimal dataset with a single command:
In the following command, replace YOURPROFILE with the name of the software-management method you used for the installation. The profile can be one of docker, singularity, podman, shifter, charliecloud, or conda, and instructs the pipeline to use the named tool for software management. For example, if you used conda, ` -profile test,conda ` is the correct option.
# Update Nextflow to the latest version if needed
nextflow self-update
# return to workshop_day5 folder
cd workshop_day5
# Chain multiple config profiles in a comma-separated string
nextflow run nf-core/taxprofiler -profile test,YOURPROFILE --outdir nextflow
Step 3: Run the nf-core/taxprofiler pipeline
Execution of the pipeline requires multiple input files, including a samplesheet.csv and a database.csv file.
# Create samplesheet.csv by listing the fastq files processed by fastp
echo -e "sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta\nERR6797441,run1,ILLUMINA,PRJEB47625/ERR6797441_1.fastq.gz,PRJEB47625/ERR6797441_2.fastq.gz," > samplesheet.csv
echo -e "tool,db_name,db_path\nkraken2,kraken2,kraken2_db/kraken2_db/" > database.csv
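As a quick sanity check, the two CSVs can also be written with printf (which handles the \n escapes more portably than echo -e) and validated by counting columns with awk. This is a minimal sketch; the file contents mirror the two echo commands above:

```shell
# Write the two input CSVs with printf (portable across shells, unlike echo -e)
printf 'sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta\n' > samplesheet.csv
printf 'ERR6797441,run1,ILLUMINA,PRJEB47625/ERR6797441_1.fastq.gz,PRJEB47625/ERR6797441_2.fastq.gz,\n' >> samplesheet.csv
printf 'tool,db_name,db_path\nkraken2,kraken2,kraken2_db/kraken2_db/\n' > database.csv

# Every row of samplesheet.csv should have 6 comma-separated fields
awk -F',' 'NF != 6 { bad++ } END { print (bad ? "samplesheet: malformed" : "samplesheet: OK") }' samplesheet.csv
# Every row of database.csv should have 3 fields
awk -F',' 'NF != 3 { bad++ } END { print (bad ? "database: malformed" : "database: OK") }' database.csv
```

A malformed row (for example, a stray literal "\n" left behind by echo without -e) would be flagged here before the pipeline fails with a cryptic samplesheet error.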
nf-core/taxprofiler Pipeline Parameters

Parameter | Description |
---|---|
--input samplesheet.csv | Specifies the CSV file with your sample information. |
--databases database.csv | Specifies the CSV file with the list of databases. |
--outdir <OUTDIR> | The directory where the output files will be saved. |
--run_<TOOL1> | Include any specific tools or modules you want to run. Replace <TOOL1> with the actual tool name (e.g., --run_kraken2 ). |
--run_<TOOL2> | Optionally include additional tools or modules. |
-profile <profile> | Specifies the execution profile (e.g., docker , singularity , conda ). |
# Run the nf-core/taxprofiler pipeline with Kraken2 enabled
nextflow run nf-core/taxprofiler --input samplesheet.csv --databases database.csv --run_kraken2 --outdir nextflow_output -profile docker
This pipeline automates the process, running Kraken2 and other tools as part of a streamlined workflow.
After the pipeline completes, the nextflow_output folder will contain fastqc, multiqc, and pipeline_info folders.
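The Kraken2 report files in the output folder can be inspected directly from the shell. The sketch below uses a tiny, made-up report (the taxa, counts, and the file name example.k2report are illustrative, not real results) to show how to pull out the most abundant species; Kraken2 report columns are percentage of reads, clade read count, direct read count, rank code, NCBI taxid, and name:

```shell
# Build a tiny, made-up Kraken2-style report for illustration
# (columns: percent, clade reads, direct reads, rank code, taxid, name)
printf '100.00\t1000\t0\tR\t1\troot\n'  > example.k2report
printf '95.00\t950\t0\tD\t10239\tViruses\n' >> example.k2report
printf '60.00\t600\t600\tS\t2697049\tSevere acute respiratory syndrome coronavirus 2\n' >> example.k2report
printf '20.00\t200\t200\tS\t10376\tHuman gammaherpesvirus 4\n' >> example.k2report

# Keep species-level rows (rank code S) and sort by clade read count (column 2), descending
awk -F'\t' '$4 == "S"' example.k2report | sort -t$'\t' -k2,2nr | head -n 5
```

The same one-liner applied to a real report from nextflow_output lists the top viral species detected in the sample.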
Install Docker Engine on Ubuntu if it is needed (sudo privileges are required)
# Update the Package Index
sudo apt-get update
# Install Required Packages
sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release
# Add Docker’s Official GPG Key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
# Set Up the Repository
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Update the Package Index Again
sudo apt-get update
# Install the latest version of Docker Engine and containerd.io:
sudo apt-get install docker-ce docker-ce-cli containerd.io
# Verify Docker Installation
sudo docker run hello-world
# Optional: allow running Docker without sudo; log out and back in afterwards
sudo usermod -aG docker $USER
Create a Kraken2 Database
# Get back to the workshop folder
cd workshop_day5
# Download and install Kraken2
wget https://github.com/DerrickWood/kraken2/archive/v2.1.2.tar.gz
tar -xvzf v2.1.2.tar.gz
cd kraken2-2.1.2
./install_kraken2.sh .
export PATH=$PATH:$PWD
cd .. && rm v2.1.2.tar.gz
# Create a directory for the database
mkdir -p kraken2_db
cd kraken2_db
# Download the NCBI taxonomy (required for building the database)
kraken2-build --download-taxonomy --db kraken2_db
# Download RefSeq viruses as a kraken2 library
kraken2-build --download-library viral --db kraken2_db
# Build the database
kraken2-build --build --db kraken2_db
Alternatively, install Kraken2 directly with conda and run the same build steps:
conda install bioconda::kraken2
kraken2-build --download-taxonomy --db kraken2_db
kraken2-build --download-library viral --db kraken2_db
kraken2-build --build --db kraken2_db
Alternatively, download a ready-made Kraken2 database for viruses:
wget -c "https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20240605.tar.gz"
mkdir -p kraken_db
tar -xvzf k2_viral_20240605.tar.gz -C kraken_db
# Note: if you use this pre-built database, set db_path in database.csv to kraken_db/
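With a database in place, Kraken2 can also be run directly, outside the pipeline, on the quality-controlled reads. This is a sketch: the FASTQ paths follow the samplesheet above, the .k2report/.k2out output names are illustrative, and the call is guarded so it only executes when kraken2 is actually on the PATH:

```shell
# Build the classification command (paired FASTQ paths as in samplesheet.csv)
cmd=(kraken2 --db kraken_db --threads 4 --paired \
     --report ERR6797441.k2report --output ERR6797441.k2out \
     PRJEB47625/ERR6797441_1.fastq.gz PRJEB47625/ERR6797441_2.fastq.gz)

if command -v kraken2 >/dev/null 2>&1; then
  # Run the classification; --report writes the per-taxon summary table
  "${cmd[@]}"
else
  # kraken2 not installed in this shell; just show the command that would run
  echo "${cmd[*]}"
fi
```

The --report file is the per-taxon summary (the same format nf-core/taxprofiler collects), while --output records the per-read classifications.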
Key Points
Quality control with fastp and taxonomic profiling with Kraken2 can be run together through the nf-core/taxprofiler pipeline.