Insect Genome Resequencing Variant Analysis

This comprehensive workflow provides methodologies adopted by the InsectBase database team for variant detection and analysis using resequencing data from insect genomes. The pipeline covers data preparation, read mapping, variant calling with bcftools, and variant annotation.

Overview

The resequencing variant analysis workflow is designed to identify and analyze genetic variants from insect genome resequencing data. This pipeline has been developed and tested using 131 insect species and provides a robust framework for variant discovery.

The workflow includes comprehensive steps from initial data preparation through final variant annotation, ensuring high-quality results suitable for population genetics, evolutionary studies, and functional genomics research.

Requirements

Software Dependencies

  • BWA (Burrows-Wheeler Aligner)
  • Samtools
  • BCFtools
  • snpEff

Additional Tools

  • Python 3
  • ETE3 (for phylogenetic tree manipulation)
  • bgzip

1. Data Preparation

1.1 Genome Data

This workflow was developed using 131 insect species. You can use a phylogenetic tree to select a subset of species for analysis.

Install ETE3 toolkit:

conda install -c etetoolkit ete3 ete_toolchain

Python script for phylogenetic tree manipulation:

from ete3 import Tree
from sys import argv

# Load the tree from the first command line argument
tree = Tree(argv[1])

# Load the list of taxa to keep from the second command line argument
subtree_taxa = [x.strip() for x in open(argv[2])]

# Prune the tree to keep only the specified taxa
tree.prune(subtree_taxa, preserve_branch_length=True)

# Write the subtree to a new file
tree.write(format=1, outfile="subtree.nwk")

Note: This script takes two arguments: the path to the original tree file and a file containing the list of taxa to keep (one taxon per line).

1.2 Resequencing Data

Resequencing data should be organized by species, with each species having its own directory containing paired-end FASTQ files for each sample.

Tip: Ensure your resequencing data is of high quality. Consider performing quality control steps such as adapter trimming and quality filtering before proceeding with the analysis.

2. Variant Detection with bcftools

2.1 Reference Genome Indexing

First, create a directory for each species and set up the necessary files:

# 00_index.sh
species=$1
mkdir $species
cd $species
cp ../01_mapping.sh .
cp ../03_sorted.sh .
cp ../04_bcftools.sh .

# Update species name in scripts
sed -i "s/species='species'/species='$species'/" 01_mapping.sh
sed -i "s/species='species'/species='$species'/" 05_clean.sh

# Link to the reference genome and create BWA index
ln -s ../../../02_genome/${species}.genome.fa genome.fa
bwa index -p genome genome.fa

# Copy the list of SRA accessions for this species
cp ../../01_data/${species}/sra.list .

Usage example:

./00_index.sh Zaprionus_indianus

Note: This script assumes that reference genomes are stored in a directory named "02_genome" with filenames in the format "species_name.genome.fa".

2.2 Resequencing Data Mapping

Map the resequencing reads to the reference genome using BWA-MEM:

# 01_mapping.sh
#! /bin/bash

trap "exec 1000>&-;exec 1000<&-;exit 0" 2
FIFO_FILE=$$.fifo
mkfifo $FIFO_FILE
exec 1000<>$FIFO_FILE

rm -rf $FIFO_FILE

# Maximum number of parallel processes
PROCESS_NUM=2
for ((idx=0; idx<$PROCESS_NUM; idx++));
do
        echo>&1000
done

species='species'

for x in `cat sra.list`;
do
        read -u1000
        {
                # Map reads using BWA-MEM and convert to BAM format
                bwa mem -t 10 -M -R '@RG\tID:${x}\tPL:Illumina\tSM:${x}' genome ../../01_data/${species}/${x}_1.clean.fastq.gz ../../01_data/${species}/${x}_2.clean.fastq.gz | samtools view -@ 10 -bS -o ${x}.bam

                # Generate mapping statistics
                samtools flagstat -@ 10  ${x}.bam > ${x}.flagstat

                echo >&1000
        } &
done

wait

# Collect mapping statistics and filter samples with high mapping rates
grep '0 mapped (' *flag* > all.stat
python ../02_90_score.py
rm *flagstat

exec 1000>&-

Quality filtering script (02_90_score.py):

import re
o = open('sra_90.list', 'w')

for x in open('all.stat'):
        sample = x.strip().split('.')[0]
        score = float(re.findall(r'0 mapped \((.*?)\%', x)[0])
        if score >= 90:
                print(sample, file=o)

Explanation: The mapping step aligns the resequencing reads to the reference genome. We use BWA-MEM with 10 threads (-t 10) and mark shorter split hits as secondary (-M). We also add read group information (-R) to identify the sample. Only samples with a mapping rate of at least 90% are kept for further analysis.

2.3 Data Sorting and Duplicate Marking

Sort the mapped reads, fix mate information, and mark duplicates:

# 03_sorted.sh
#! /bin/bash

trap "exec 1000>&-;exec 1000<&-;exit 0" 2
FIFO_FILE=$$.fifo
mkfifo $FIFO_FILE
exec 1000<>$FIFO_FILE

rm -rf $FIFO_FILE

# Maximum number of parallel processes
PROCESS_NUM=2
for ((idx=0; idx<$PROCESS_NUM; idx++));
do
        echo>&1000
done

for x in `cat sra_90.list`;
do
        read -u1000
        {
                # Sort by read name
                samtools sort -n -m 2G -@ 10 -o ${x}.sorted.bam ${x}.bam

                # Fix mate information and sort by coordinate
                samtools fixmate -m ${x}.sorted.bam - | samtools sort -m 2G -@ 10 -o ${x}.fix.bam

                # Mark duplicates and sort by coordinate
                samtools markdup -@ 10 -s ${x}.fix.bam - | samtools sort -m 2G -@ 10 -o  ${x}.fix.markdup.bam

                # Rename the final BAM file
                mv ${x}.fix.markdup.bam ${x}.final.bam

                # Index the final BAM file
                samtools index ${x}.final.bam -@ 10

                # Remove intermediate files
                rm ${x}.bam ${x}.fix.bam ${x}.sorted.bam

                echo >&1000
        } &
done
wait
exec 1000>&-

Explanation: This step performs several important operations:
1. Sort reads by name (-n) to group reads from the same fragment together
2. Fix mate information with fixmate to ensure proper pairing
3. Mark duplicate reads with markdup to identify PCR duplicates
4. Sort the final BAM file by coordinate for efficient variant calling
5. Index the final BAM file for random access

2.4 Variant Calling with bcftools

Call variants using bcftools:

# 04_bcftools.sh
#! /bin/bash

trap "exec 1000>&-;exec 1000<&-;exit 0" 2
FIFO_FILE=$$.fifo
mkfifo $FIFO_FILE
exec 1000<>$FIFO_FILE

rm -rf $FIFO_FILE

# Maximum number of parallel processes
PROCESS_NUM=8
for ((idx=0; idx<$PROCESS_NUM; idx++));
do
        echo>&1000
done

# Index the reference genome
samtools faidx genome.fa

for x in `cat sra_90.list`;
do
        read -u1000
        {
                # Call variants using bcftools
                bcftools mpileup -f genome.fa ${x}.final.bam | bcftools call -mv -Oz -o ${x}.vcf.gz

                # Compress the VCF file
                bgzip -f ${x}.final.vcf

                echo >&1000
        } &
done

wait
exec 1000>&-

Explanation: This script calls variants for each sample using bcftools. The process involves:
1. Generating a pileup of reads at each position using bcftools mpileup
2. Calling variants using bcftools call with the multiallelic caller (-m) and outputting only variant sites (-v)
3. Compressing the output VCF file with bgzip

2.5 Removing Redundant Data

Remove data for samples with low mapping rates:

import re
import os

lis1 = [x.strip() for x in open('sra.list')]
lis2 = [x.strip() for x in open('sra_90.list')]
for x in lis1:
    if x not in lis2:
        os.system('rm '+x+'.bam '+x+'.final.bam')

Explanation: This script compares the original list of samples (sra.list) with the filtered list of samples with high mapping rates (sra_90.list). It removes BAM files for samples that did not meet the 90% mapping rate threshold to save disk space.

2.6 VCF File Filtering

Filter VCF files based on quality and depth, and separate SNPs and indels:

# 06_analysis_1.sh
#! /bin/bash

trap "exec 1000>&-;exec 1000<&-;exit 0" 2
FIFO_FILE=$$.fifo
mkfifo $FIFO_FILE
exec 1000<>$FIFO_FILE

rm -rf $FIFO_FILE

# Maximum number of parallel processes
PROCESS_NUM=10
for ((idx=0; idx<$PROCESS_NUM; idx++));
do
        echo>&1000
done

mkdir -p analysis/snp analysis/indel

for x in `cat sra_90.list`;
do
        read -u1000
        {
                rm ${x}.final.vcf.gz
                gunzip ${x}.vcf.gz

                # Filter variants based on quality and depth
                bcftools filter -e 'QUAL<30 || DP<20' ${x}.vcf -o ${x}.f.vcf

                # Update sample name in the VCF file
                sed -i "s/\${x}/$x/g" ${x}.f.vcf

                # Compress VCF files
                bgzip -@ 10 ${x}.vcf
                bgzip -@ 10 ${x}.f.vcf

                # Separate SNPs and indels
                bcftools view --threads 10 -v snps ${x}.f.vcf.gz -O z -o analysis/snp/${x}.f.snp.vcf.gz
                bcftools view --threads 10 -v indels ${x}.f.vcf.gz -O z -o analysis/indel/${x}.f.indel.vcf.gz

                # Index the filtered VCF files
                bcftools index analysis/snp/${x}.f.snp.vcf.gz
                bcftools index analysis/indel/${x}.f.indel.vcf.gz

                echo >&1000
        } &
done
wait
exec 1000>&-

Explanation: This step filters variants based on quality (QUAL≥30) and depth (DP≥20), and separates SNPs and indels into different files. The filtered VCF files are compressed with bgzip and indexed for efficient access.

2.7 VCF File Merging and Annotation

Merge VCF files and annotate variants:

# 07_analysis_2.sh
#! /bin/bash

trap "exec 1000>&-;exec 1000<&-;exit 0" 2
FIFO_FILE=$$.fifo
mkfifo $FIFO_FILE
exec 1000<>$FIFO_FILE

rm -rf $FIFO_FILE

# Maximum number of parallel processes
PROCESS_NUM=10
for ((idx=0; idx<$PROCESS_NUM; idx++));
do
        echo>&1000
done

cd analysis

# Merge SNP files
bcftools merge --threads 10 snp/*.f.snp.vcf.gz -O z -o merged.snp.vcf.gz
bcftools index merged.snp.vcf.gz

# Merge indel files
bcftools merge --threads 10 indel/*.f.indel.vcf.gz -O z -o merged.indel.vcf.gz
bcftools index merged.indel.vcf.gz

# Filter merged VCF files
python ../../var_filter.py merged.snp.vcf.gz > filtered.snp.vcf
python ../../var_filter.py merged.indel.vcf.gz > filtered.indel.vcf

# Compress filtered files
bgzip -@ 10 filtered.snp.vcf
bgzip -@ 10 filtered.indel.vcf

# Index filtered files
bcftools index filtered.snp.vcf.gz
bcftools index filtered.indel.vcf.gz

exec 1000>&-

Variant filtering script (var_filter.py):

# var_filter.py
from sys import argv

for x in open(argv[1]):
    if not x.startswith('#'):
        x = x.strip()
        tmp = x.split('\t')
        samples = len(tmp) - 9
        alts = tmp.count('./.')
        if alts == samples - 1:
            print(x)
    else:
        print(x)

Explanation: This step annotates the filtered variants using snpEff. The var_filter.py script keeps only variants that are unique to a single sample (i.e., variants where all but one sample have missing genotypes). The annotated variants are saved in both VCF and text formats.

Conclusion

This workflow provides a comprehensive pipeline for variant detection and analysis using resequencing data from insect genomes. The pipeline includes read mapping, variant calling, filtering, and annotation steps.

The resulting annotated variants can be used for various downstream analyses, such as population genetics, evolutionary studies, or functional genomics research.

The methodologies adopted by the InsectBase database team ensure high-quality results and provide a robust framework for insect genome resequencing analysis.

© 2024 InsectBase. All rights reserved.