Analysis workflow

Here we present a full pipeline example for running CoCoNet from raw reads to final bins. We assume that you have paired-end Illumina reads available in the data/ directory, with each sample named following the convention <sample>_R{1,2}.fastq.gz.


The first step consists of quality trimming and filtering your reads. Many tools are available for this step, such as fastp or Trimmomatic. Below is a trimming example with fastp. Note that the parameters need to be tuned to your specific case; FastQC and MultiQC can be very useful in that regard.

#!/usr/bin/env bash

# SAMPLES is an environment variable containing the sample names
# (e.g.: export SAMPLES="sample1 sample2")

for sample in $SAMPLES; do
    fastp \
    -i data/${sample}_R1.fastq.gz -I data/${sample}_R2.fastq.gz \
    -o data/${sample}-trimmed_R1.fastq.gz -O data/${sample}-trimmed_R2.fastq.gz
done
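Since FastQC and MultiQC were mentioned above, here is a minimal sketch of a QC pass over the trimmed reads. It assumes both tools are installed, and the qc/ output directory name is our own choice:

```shell
# Generate a per-file quality report for each trimmed read file
mkdir -p qc
fastqc --outdir qc data/*-trimmed_R*.fastq.gz

# Aggregate all FastQC reports into a single summary
multiqc --outdir qc qc
```

Comparing the MultiQC summary before and after trimming is a quick way to check whether your fastp parameters need further tuning.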


To assemble your reads, you have many choices. One of the most accurate for metagenomics is metaSPAdes. However, if you have many samples and/or high coverage, metaSPAdes will require a significant amount of time and memory; in that case, MEGAHIT can be a good alternative.

# combine all samples
cat data/*-trimmed_R1.fastq.gz > forward.fastq.gz
cat data/*-trimmed_R2.fastq.gz > reverse.fastq.gz

# run metaSPAdes
metaspades.py \
-1 forward.fastq.gz -2 reverse.fastq.gz \
-o assembly-outputs
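If metaSPAdes is too resource-hungry for your dataset, the equivalent co-assembly with MEGAHIT can be sketched as follows (assuming megahit is installed; the output directory name is our choice, and MEGAHIT requires that it does not already exist):

```shell
# MEGAHIT co-assembly of the pooled trimmed reads;
# the final contigs are written to megahit-assembly/final.contigs.fa
megahit \
    -1 forward.fastq.gz -2 reverse.fastq.gz \
    -o megahit-assembly
```

If you use MEGAHIT, substitute megahit-assembly/final.contigs.fa for assembly-outputs/scaffolds.fasta in the steps below.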


To compute contig coverage, we can align the trimmed reads against the assembled contigs with tools such as bwa or bowtie2:

mkdir coverage

# Build index
bwa index -p coverage/index assembly-outputs/scaffolds.fasta

for sample in $SAMPLES; do
    bwa mem coverage/index \
    data/${sample}-trimmed_R1.fastq.gz data/${sample}-trimmed_R2.fastq.gz \
    | samtools view -bh \
    | samtools sort -o coverage/${sample}.bam
    samtools index coverage/${sample}.bam coverage/${sample}.bai
done
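Before binning, it can be worth sanity-checking the alignments. A quick pass with samtools flagstat reports, among other things, the fraction of reads that mapped to the assembly:

```shell
# Report mapping statistics for each sample's sorted BAM
for sample in $SAMPLES; do
    echo "== ${sample} =="
    samtools flagstat coverage/${sample}.bam
done
```

A low mapping rate may indicate a poor assembly or overly aggressive trimming, and is easier to diagnose now than after binning.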


Finally, you can run CoCoNet to bin your contigs:

coconet run --fasta assembly-outputs/scaffolds.fasta --bam coverage/*.bam --output binning

The binning assignments are then available in the file binning/bins-*.csv.
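The bin assignments can be explored with standard shell tools. For example, to count how many contigs were assigned to each bin (the two-column contig,bin layout used in this mock file is an assumption for illustration; substitute your real binning/bins-*.csv):

```shell
# Hypothetical CSV mimicking the output layout: one line per contig,
# with the contig name followed by its bin id
cat > bins-example.csv <<'EOF'
contig_1,0
contig_2,0
contig_3,1
EOF

# Count contigs per bin: extract the bin column, sort, and tally
cut -d, -f2 bins-example.csv | sort | uniq -c
```

Bins containing a single short contig are often worth inspecting manually before downstream analysis.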