Help Center/Workflows/Genome Annotation

← Genome Quality Assessment Gene Function Annotation →

Genome Annotation Pipeline

A comprehensive guide to annotating insect genomes using state-of-the-art tools and methodologies.

This workflow represents current best practices for insect genome annotation as used by the InsectBase team.

Required Software

Genome Quality Assessment

BUSCO
compleasm - A faster and more accurate reimplementation of BUSCO

Repeat Annotation

RNA-seq Processing

Gene Prediction

Annotation Integration

Supporting Tools

Required Datasets

RNA-seq data
Available from NCBI SRA
Homology protein sequences
Available from Partitioned OrthoDB

Workflow Steps

Genome Quality Assessment

Assess the completeness of your genome assembly using BUSCO or compleasm.

BUSCO v5

busco --cpu 28 \
    -l /path/to/insecta_odb10 \
    -m genome --force -o busco \
    -i genome.fa \
    --offline

View results with: cat out/short_summary.specific.insecta_odb10.out.txt

compleasm (Faster Alternative)

compleasm.py run -t16 -l insecta -L /data/ -a genome.fa -o busco

Note: The organization of the lineage file downloaded by compleasm is different from that of BUSCO.

Repeat Annotation and Genome Masking

Identify and mask repetitive elements in the genome.

RepeatModeler v2 & RepeatMasker

Building reference repeat database:

# RepeatMasker
famdb.py -i Libraries/RepeatMaskerLib.h5 families -f embl -a -d Insecta > Insecta_ad.embl
util/buildRMLibFromEMBL.pl Insecta_ad.embl > Insecta_ad.fa

# RepeatModeler
mkdir 01_RepeatModeler
BuildDatabase -name GDB -engine ncbi ../genome.fa > BuildDatabase.log
RepeatModeler -engine ncbi -pa 28 -database GDB -LTRStruct > RepeatModele.log
cd ../

Running RepeatMasker for genome masking:

mkdir 02_RepeatMasker
cat 01_RepeatModeler/GDB-families.fa Insecta_ad.fa > repeat_db.fa
RepeatMasker -xsmall -gff -html -lib repeat_db.fa -pa 28 genome.fa > RepeatMasker.log

Output: genome.fa.masked

EDTA (Alternative Method)

EDTA.pl --genome female.fa --species others --sensitive 1 --anno 1 --evaluate 1 --threads 30

Gene Prediction

Predict genes using RNA-seq data, ab initio methods, and homology-based approaches.

RNA-seq Based Gene Prediction

Using HISAT2 & StringTie:

# Build genome index
hisat2-build -p 28 genome.fa genome

# Mapping to genome (single end)
hisat2 -p 28 -x genome --dta -U reads.fq | samtools sort -@ 28 > reads.bam

# Mapping to genome (paired end)
hisat2 -p 28 -x genome --dta -1 reads_1.fq -2 reads_2.fq | samtools sort -@ 28 > reads.bam

# GTF merging
samtools merge -@ 28 merged.bam `ls *bam`
stringtie -p 28 -o stringtie.gtf merged.bam

Output: stringtie.gtf

Using TransDecoder:

util/gtf_genome_to_cdna_fasta.pl stringtie.gtf genome.fa > transcripts.fasta
util/gtf_to_alignment_gff3.pl stringtie.gtf > transcripts.gff3
TransDecoder.LongOrfs -t transcripts.fasta

# homology search
blastp -query transdecoder_dir/longest_orfs.pep -db uniprot_sprot.fasta -max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 28 > blastp.outfmt6
hmmscan --cpu 28 --domtblout pfam.domtblout Pfam-A.hmm transdecoder_dir/longest_orfs.pep

TransDecoder.Predict -t transcripts.fasta --retain_pfam_hits pfam.domtblout --retain_blastp_hits blastp.outfmt6
util/cdna_alignment_orf_to_genome_orf.pl transcripts.fasta.transdecoder.gff3 transcripts.gff3 transcripts.fasta > transcripts.fasta.transdecoder.genome.gff3

mv transcripts.fasta.transdecoder.genome.gff3 transcript_alignments.gff3

Output: transcript_alignments.gff3

Ab initio Gene Prediction with BRAKER v3

braker.pl --genome=genome.fa \
    --species=Sfru \
    --prot_seq=Arthropoda.fa \
    --bam=merged.bam \
    --threads 30 \
    --gff3 --workingdir=out

python gff_rename.py braker.gff3 sfru > gene_predictions.gff3

Output: gene_predictions.gff3

Homology-based Gene Prediction with miniprot

miniprot -t28 -d genome.mpi genome.fa.masked
miniprot -It28 --gff genome.mpi protein.fasta > miniprot.gff3

Output: miniprot.gff3

EVidenceModeler (EVM)

Integrate evidence from different gene prediction methods.

Preparing Inputs

Create weights.txt file:

PROTEIN	miniprot	5
ABINITIO_PREDICTION	BRAKER3	10
OTHER_PREDICTION	transdecoder	10

Reformat miniprot GFF3:

python ~/software/EVidenceModeler-v2.1.0/EvmUtils/misc/miniprot_GFF_2_EVM_alignment_GFF3.py miniprot.gff3 > protein_alignments.gff3

Run EVidenceModeler

EVidenceModeler \
    --sample_id species \
    --genome genome.fa \
    --weights weights.txt  \
    --gene_predictions gff/gene_predictions.gff3 \
    --protein_alignments gff/protein_alignments.gff3 \
    --transcript_alignments gff/transcript_alignments.gff3 \
    --segmentSize 100000 --overlapSize 10000 --CPU 20

Output: species.evm.gff3

OGS Annotation

Finalize the Official Gene Set (OGS) annotation.

PASA Pipeline

util/gtf_genome_to_cdna_fasta.pl stringtie.gtf genome.fa > transcripts.fasta
bin/seqclean transcripts.fasta

# Transcripts alignments
Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome.fa -t transcripts.fasta.clean -T -u transcripts.fasta --ALIGNERS blat --CPU 16

# Load annotation
misc_utilities/pasa_gff3_validator.pl species.evm.gff3
scripts/Load_Current_Gene_Annotations.dbi -c alignAssembly.config -g genome.fa -P species.evm.gff3

# Update
Launch_PASA_pipeline.pl -c annotCompare.config -A -g genome.fa -t transcripts.fasta.clean

peaks2utr

peaks2utr -p 20 species.evm.gff3 merged.bam

Collect GFF, CDS, and Protein Sequences

# Rename GFF3
python gff_rename.py gene_structures_post_PASA_updates.gff3 Sfru > Sfru.gff3

# Collect sequences
gffread Sfru.gff3 -g genome.fa -x Sfru_cds.fa -y Sfru_pep.fa

# Collect no alt GFF, CDS, and protein sequences
python Collect_no_alt.py pep.fa cds.fa Sfru.gff3

Outputs: Sfru.gff3 (Sfru_no_alt.gff3), cds.fa (cds_no_alt.fa), pep.fa (pep_no_alt.fa)

Function Annotation

Annotate gene functions using online tools.

eggNOG-mapper

Use eggNOG-mapper for functional annotation of proteins.

PANNZER

Use PANNZER for additional functional annotation.

Additional Resources

InsectBase Genome Browser
Visualize genome annotations using our Genome Browser
BUSCO Quality Assessment
Learn more about genome quality assessment in our BUSCO workflow guide
Download Center
Access annotated genomes from our Download Center

← Genome Quality Assessment Gene Function Annotation →