Analysis workflow

Here we present a full pipeline example for running CoCoNet from raw reads to final bins. We assume that you have paired-end Illumina reads available in the data/ directory, with each sample named following the convention <sample>_R{1,2}.fastq.gz.


The first step consists of quality trimming and filtering your reads. Many tools are available for this step, such as fastp or Trimmomatic. Below is a trimming example with fastp. Note that the parameters need to be tuned to your specific case; FastQC and MultiQC can be very useful in that regard.

#!/usr/bin/env bash

# SAMPLES is an environment variable containing the sample names
# (e.g.: export SAMPLES="sample1 sample2")

for sample in $SAMPLES; do
    fastp \
    -i data/${sample}_R1.fastq.gz -I data/${sample}_R2.fastq.gz \
    -o data/${sample}-trimmed_R1.fastq.gz -O data/${sample}-trimmed_R2.fastq.gz
done
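Since FastQC and MultiQC were mentioned above, here is a minimal sketch of a QC pass over the trimmed reads. It assumes both tools are installed, and the qc/ output directory name is our own choice:

```shell
# Generate a per-file quality report for each trimmed read file
mkdir -p qc
fastqc --outdir qc data/*-trimmed_R*.fastq.gz

# Aggregate all FastQC reports into a single summary
multiqc --outdir qc qc
```

Comparing the MultiQC summary before and after trimming is a quick way to check whether your fastp parameters need further tuning.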


To assemble your reads, you have many choices. One of the most accurate for metagenomics is metaSPAdes. However, if you have many samples and/or high coverage, metaSPAdes will require a significant amount of time and memory; in that case, MEGAHIT can be a good alternative.

# combine all samples
cat data/*-trimmed_R1.fastq.gz > forward.fastq.gz
cat data/*-trimmed_R2.fastq.gz > reverse.fastq.gz

# run metaSPAdes
metaspades.py \
-1 forward.fastq.gz -2 reverse.fastq.gz \
-o assembly-outputs
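If metaSPAdes is too resource-hungry for your dataset, the equivalent co-assembly with MEGAHIT can be sketched as follows (assuming megahit is installed; the output directory name is our choice, and MEGAHIT requires that it does not already exist):

```shell
# MEGAHIT co-assembly of the pooled trimmed reads;
# the final contigs are written to megahit-assembly/final.contigs.fa
megahit \
    -1 forward.fastq.gz -2 reverse.fastq.gz \
    -o megahit-assembly
```

If you use MEGAHIT, substitute megahit-assembly/final.contigs.fa for assembly-outputs/scaffolds.fasta in the steps below.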


To compute contig coverage, we can align the trimmed reads against the assembled contigs with tools such as bwa or bowtie2:

mkdir coverage

# Build index
bwa index -p coverage/index assembly-outputs/scaffolds.fasta

for sample in $SAMPLES; do
    bwa mem coverage/index \
    data/${sample}-trimmed_R1.fastq.gz data/${sample}-trimmed_R2.fastq.gz \
    | samtools view -bh \
    | samtools sort -o coverage/${sample}.bam
    samtools index coverage/${sample}.bam coverage/${sample}.bai
done
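Before binning, it can be worth sanity-checking the alignments. A quick pass with samtools flagstat reports, among other things, the fraction of reads that mapped to the assembly:

```shell
# Report mapping statistics for each sample's sorted BAM
for sample in $SAMPLES; do
    echo "== ${sample} =="
    samtools flagstat coverage/${sample}.bam
done
```

A low mapping rate may indicate a poor assembly or overly aggressive trimming, and is easier to diagnose now than after binning.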


Finally, you can run CoCoNet to bin your contigs:

coconet run --fasta assembly-outputs/scaffolds.fasta --bam coverage/*.bam --output binning

The binning assignments are then available in the file binning/bins-*.csv.
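The bin assignments can be explored with standard shell tools. For example, to count how many contigs were assigned to each bin (the two-column contig,bin layout used in this mock file is an assumption for illustration; substitute your real binning/bins-*.csv):

```shell
# Hypothetical CSV mimicking the output layout: one line per contig,
# with the contig name followed by its bin id
cat > bins-example.csv <<'EOF'
contig_1,0
contig_2,0
contig_3,1
EOF

# Count contigs per bin: extract the bin column, sort, and tally
cut -d, -f2 bins-example.csv | sort | uniq -c
```

Bins containing a single short contig are often worth inspecting manually before downstream analysis.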