Genome Annotation Pipeline
A comprehensive guide to annotating insect genomes using state-of-the-art tools and methodologies.
This workflow represents current best practices for insect genome annotation as used by the InsectBase team.
Required Software
Repeat Annotation
RNA-seq Processing
Annotation Integration
Supporting Tools
Required Datasets
- RNA-seq data
Available from NCBI SRA
- Homology protein sequences
Available from Partitioned OrthoDB
Workflow Steps
Genome Quality Assessment
Assess the completeness of your genome assembly using BUSCO or compleasm.
BUSCO v5
busco --cpu 28 \
-l /path/to/insecta_odb10 \
-m genome --force -o busco \
-i genome.fa \
--offline
View results with: cat out/short_summary.specific.insecta_odb10.out.txt
compleasm (Faster Alternative)
compleasm.py run -t16 -l insecta -L /data/ -a genome.fa -o busco
Note: The organization of the lineage file downloaded by compleasm is different from that of BUSCO.
Repeat Annotation and Genome Masking
Identify and mask repetitive elements in the genome.
RepeatModeler v2 & RepeatMasker
Building reference repeat database:
# RepeatMasker
famdb.py -i Libraries/RepeatMaskerLib.h5 families -f embl -a -d Insecta > Insecta_ad.embl
util/buildRMLibFromEMBL.pl Insecta_ad.embl > Insecta_ad.fa
# RepeatModeler
mkdir 01_RepeatModeler
BuildDatabase -name GDB -engine ncbi ../genome.fa > BuildDatabase.log
RepeatModeler -engine ncbi -pa 28 -database GDB -LTRStruct > RepeatModele.log
cd ../
Running RepeatMasker for genome masking:
mkdir 02_RepeatMasker
cat 01_RepeatModeler/GDB-families.fa Insecta_ad.fa > repeat_db.fa
RepeatMasker -xsmall -gff -html -lib repeat_db.fa -pa 28 genome.fa > RepeatMasker.log
Output: genome.fa.masked
EDTA (Alternative Method)
EDTA.pl --genome female.fa --species others --sensitive 1 --anno 1 --evaluate 1 --threads 30
Gene Prediction
Predict genes using RNA-seq data, ab initio methods, and homology-based approaches.
RNA-seq Based Gene Prediction
Using HISAT2 & StringTie:
# Build genome index
hisat2-build -p 28 genome.fa genome
# Mapping to genome (single end)
hisat2 -p 28 -x genome --dta -U reads.fq | samtools sort -@ 28 > reads.bam
# Mapping to genome (paired end)
hisat2 -p 28 -x genome --dta -1 reads_1.fq -2 reads_2.fq | samtools sort -@ 28 > reads.bam
# GTF merging
samtools merge -@ 28 merged.bam `ls *bam`
stringtie -p 28 -o stringtie.gtf merged.bam
Output: stringtie.gtf
Using TransDecoder:
util/gtf_genome_to_cdna_fasta.pl stringtie.gtf genome.fa > transcripts.fasta
util/gtf_to_alignment_gff3.pl stringtie.gtf > transcripts.gff3
TransDecoder.LongOrfs -t transcripts.fasta
# homology search
blastp -query transdecoder_dir/longest_orfs.pep -db uniprot_sprot.fasta -max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 28 > blastp.outfmt6
hmmscan --cpu 28 --domtblout pfam.domtblout Pfam-A.hmm transdecoder_dir/longest_orfs.pep
TransDecoder.Predict -t transcripts.fasta --retain_pfam_hits pfam.domtblout --retain_blastp_hits blastp.outfmt6
util/cdna_alignment_orf_to_genome_orf.pl transcripts.fasta.transdecoder.gff3 transcripts.gff3 transcripts.fasta > transcripts.fasta.transdecoder.genome.gff3
mv transcripts.fasta.transdecoder.genome.gff3 transcript_alignments.gff3
Output: transcript_alignments.gff3
Ab initio Gene Prediction with BRAKER v3
braker.pl --genome=genome.fa \
--species=Sfru \
--prot_seq=Arthropoda.fa \
--bam=merged.bam \
--threads 30 \
--gff3 --workingdir=out
python gff_rename.py braker.gff3 sfru > gene_predictions.gff3
Output: gene_predictions.gff3
Homology-based Gene Prediction with miniprot
miniprot -t28 -d genome.mpi genome.fa.masked
miniprot -It28 --gff genome.mpi protein.fasta > miniprot.gff3
Output: miniprot.gff3
EVidenceModeler (EVM)
Integrate evidence from different gene prediction methods.
Preparing Inputs
Create weights.txt file:
PROTEIN miniprot 5
ABINITIO_PREDICTION BRAKER3 10
OTHER_PREDICTION transdecoder 10
Reformat miniprot GFF3:
python ~/software/EVidenceModeler-v2.1.0/EvmUtils/misc/miniprot_GFF_2_EVM_alignment_GFF3.py miniprot.gff3 > protein_alignments.gff3
Run EVidenceModeler
EVidenceModeler \
--sample_id species \
--genome genome.fa \
--weights weights.txt \
--gene_predictions gff/gene_predictions.gff3 \
--protein_alignments gff/protein_alignments.gff3 \
--transcript_alignments gff/transcript_alignments.gff3 \
--segmentSize 100000 --overlapSize 10000 --CPU 20
Output: species.evm.gff3
OGS Annotation
Finalize the Official Gene Set (OGS) annotation.
PASA Pipeline
util/gtf_genome_to_cdna_fasta.pl stringtie.gtf genome.fa > transcripts.fasta
bin/seqclean transcripts.fasta
# Transcripts alignments
Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome.fa -t transcripts.fasta.clean -T -u transcripts.fasta --ALIGNERS blat --CPU 16
# Load annotation
misc_utilities/pasa_gff3_validator.pl species.evm.gff3
scripts/Load_Current_Gene_Annotations.dbi -c alignAssembly.config -g genome.fa -P species.evm.gff3
# Update
Launch_PASA_pipeline.pl -c annotCompare.config -A -g genome.fa -t transcripts.fasta.clean
peaks2utr
peaks2utr -p 20 species.evm.gff3 merged.bam
Collect GFF, CDS, and Protein Sequences
# Rename GFF3
python gff_rename.py gene_structures_post_PASA_updates.gff3 Sfru > Sfru.gff3
# Collect sequences
gffread Sfru.gff3 -g genome.fa -x Sfru_cds.fa -y Sfru_pep.fa
# Collect no alt GFF, CDS, and protein sequences
python Collect_no_alt.py pep.fa cds.fa Sfru.gff3
Outputs: Sfru.gff3 (Sfru_no_alt.gff3), cds.fa (cds_no_alt.fa), pep.fa (pep_no_alt.fa)
Function Annotation
Annotate gene functions using online tools.
eggNOG-mapper
Use eggNOG-mapper for functional annotation of proteins.
PANNZER
Use PANNZER for additional functional annotation.
Additional Resources
- InsectBase Genome Browser
Visualize genome annotations using our Genome Browser
- BUSCO Quality Assessment
Learn more about genome quality assessment in our BUSCO workflow guide
- Download Center
Access annotated genomes from our Download Center