Genome Annotation Pipeline

A comprehensive guide to annotating insect genomes using state-of-the-art tools and methodologies.

This workflow represents current best practices for insect genome annotation as used by the InsectBase team.

Required Software

Genome Quality Assessment

Repeat Annotation

RNA-seq Processing

Annotation Integration

Required Datasets

Workflow Steps

1

Genome Quality Assessment

Assess the completeness of your genome assembly using BUSCO or compleasm.

BUSCO v5

busco --cpu 28 \
    -l /path/to/insecta_odb10 \
    -m genome --force -o busco \
    -i genome.fa \
    --offline

View results with: cat out/short_summary.specific.insecta_odb10.out.txt

compleasm (Faster Alternative)

compleasm.py run -t16 -l insecta -L /data/ -a genome.fa -o busco

Note: The organization of the lineage file downloaded by compleasm is different from that of BUSCO.

2

Repeat Annotation and Genome Masking

Identify and mask repetitive elements in the genome.

RepeatModeler v2 & RepeatMasker

Building reference repeat database:

# RepeatMasker
famdb.py -i Libraries/RepeatMaskerLib.h5 families -f embl -a -d Insecta > Insecta_ad.embl
util/buildRMLibFromEMBL.pl Insecta_ad.embl > Insecta_ad.fa

# RepeatModeler
mkdir 01_RepeatModeler
BuildDatabase -name GDB -engine ncbi ../genome.fa > BuildDatabase.log
RepeatModeler -engine ncbi -pa 28 -database GDB -LTRStruct > RepeatModele.log
cd ../

Running RepeatMasker for genome masking:

mkdir 02_RepeatMasker
cat 01_RepeatModeler/GDB-families.fa Insecta_ad.fa > repeat_db.fa
RepeatMasker -xsmall -gff -html -lib repeat_db.fa -pa 28 genome.fa > RepeatMasker.log

Output: genome.fa.masked

EDTA (Alternative Method)

EDTA.pl --genome female.fa --species others --sensitive 1 --anno 1 --evaluate 1 --threads 30
3

Gene Prediction

Predict genes using RNA-seq data, ab initio methods, and homology-based approaches.

RNA-seq Based Gene Prediction

Using HISAT2 & StringTie:

# Build genome index
hisat2-build -p 28 genome.fa genome

# Mapping to genome (single end)
hisat2 -p 28 -x genome --dta -U reads.fq | samtools sort -@ 28 > reads.bam

# Mapping to genome (paired end)
hisat2 -p 28 -x genome --dta -1 reads_1.fq -2 reads_2.fq | samtools sort -@ 28 > reads.bam

# GTF merging
samtools merge -@ 28 merged.bam `ls *bam`
stringtie -p 28 -o stringtie.gtf merged.bam

Output: stringtie.gtf

Using TransDecoder:

util/gtf_genome_to_cdna_fasta.pl stringtie.gtf genome.fa > transcripts.fasta
util/gtf_to_alignment_gff3.pl stringtie.gtf > transcripts.gff3
TransDecoder.LongOrfs -t transcripts.fasta

# homology search
blastp -query transdecoder_dir/longest_orfs.pep -db uniprot_sprot.fasta -max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 28 > blastp.outfmt6
hmmscan --cpu 28 --domtblout pfam.domtblout Pfam-A.hmm transdecoder_dir/longest_orfs.pep

TransDecoder.Predict -t transcripts.fasta --retain_pfam_hits pfam.domtblout --retain_blastp_hits blastp.outfmt6
util/cdna_alignment_orf_to_genome_orf.pl transcripts.fasta.transdecoder.gff3 transcripts.gff3 transcripts.fasta > transcripts.fasta.transdecoder.genome.gff3

mv transcripts.fasta.transdecoder.genome.gff3 transcript_alignments.gff3

Output: transcript_alignments.gff3

Ab initio Gene Prediction with BRAKER v3

braker.pl --genome=genome.fa \
    --species=Sfru \
    --prot_seq=Arthropoda.fa \
    --bam=merged.bam \
    --threads 30 \
    --gff3 --workingdir=out

python gff_rename.py braker.gff3 sfru > gene_predictions.gff3

Output: gene_predictions.gff3

Homology-based Gene Prediction with miniprot

miniprot -t28 -d genome.mpi genome.fa.masked
miniprot -It28 --gff genome.mpi protein.fasta > miniprot.gff3

Output: miniprot.gff3

4

EVidenceModeler (EVM)

Integrate evidence from different gene prediction methods.

Preparing Inputs

Create weights.txt file:

PROTEIN	miniprot	5
ABINITIO_PREDICTION	BRAKER3	10
OTHER_PREDICTION	transdecoder	10

Reformat miniprot GFF3:

python ~/software/EVidenceModeler-v2.1.0/EvmUtils/misc/miniprot_GFF_2_EVM_alignment_GFF3.py miniprot.gff3 > protein_alignments.gff3

Run EVidenceModeler

EVidenceModeler \
    --sample_id species \
    --genome genome.fa \
    --weights weights.txt  \
    --gene_predictions gff/gene_predictions.gff3 \
    --protein_alignments gff/protein_alignments.gff3 \
    --transcript_alignments gff/transcript_alignments.gff3 \
    --segmentSize 100000 --overlapSize 10000 --CPU 20

Output: species.evm.gff3

5

OGS Annotation

Finalize the Official Gene Set (OGS) annotation.

PASA Pipeline

util/gtf_genome_to_cdna_fasta.pl stringtie.gtf genome.fa > transcripts.fasta
bin/seqclean transcripts.fasta

# Transcripts alignments
Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome.fa -t transcripts.fasta.clean -T -u transcripts.fasta --ALIGNERS blat --CPU 16

# Load annotation
misc_utilities/pasa_gff3_validator.pl species.evm.gff3
scripts/Load_Current_Gene_Annotations.dbi -c alignAssembly.config -g genome.fa -P species.evm.gff3

# Update
Launch_PASA_pipeline.pl -c annotCompare.config -A -g genome.fa -t transcripts.fasta.clean

peaks2utr

peaks2utr -p 20 species.evm.gff3 merged.bam

Collect GFF, CDS, and Protein Sequences

# Rename GFF3
python gff_rename.py gene_structures_post_PASA_updates.gff3 Sfru > Sfru.gff3

# Collect sequences
gffread Sfru.gff3 -g genome.fa -x Sfru_cds.fa -y Sfru_pep.fa

# Collect no alt GFF, CDS, and protein sequences
python Collect_no_alt.py pep.fa cds.fa Sfru.gff3

Outputs: Sfru.gff3 (Sfru_no_alt.gff3), cds.fa (cds_no_alt.fa), pep.fa (pep_no_alt.fa)

6

Function Annotation

Annotate gene functions using online tools.

eggNOG-mapper

Use eggNOG-mapper for functional annotation of proteins.

PANNZER

Use PANNZER for additional functional annotation.

Additional Resources