Gene Function Annotation Pipeline

A workflow for annotating gene functions using Diamond alignment and functional databases.

This workflow follows the methodology used by the InsectBase database team for gene function annotation.

Overview

The pipeline combines sequence similarity searches with functional database mappings to produce annotations that include:

  • Swiss-Prot functional descriptions - High-quality manually curated protein annotations
  • Gene Ontology (GO) terms - Standardized functional classifications
  • KEGG pathway mappings - Metabolic and signaling pathway associations
  • ORF name assignments - Standardized gene nomenclature
  • Protein and CDS sequences - Complete sequence information

Required Software

Sequence Alignment

  • DIAMOND - Fast protein sequence aligner

Data Processing

  • Python 3.x with standard libraries
  • Custom annotation scripts

Required Databases

  • Swiss-Prot protein database

    Available from UniProt - High-quality manually annotated protein sequences

  • UniProt ID mapping file (uniprot_map.id)

    Contains Swiss-Prot ID to description mappings

  • EggNOG functional annotation file (uniprot_final.tsv)

    Maps UniProt IDs to ORF names, KEGG, and GO annotations

Workflow Steps

1. Diamond Alignment Phase

Align genome annotation protein sequences with the Swiss-Prot protein database using Diamond.

Batch Alignment Script

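#!/bin/bash
# batch_diamond.sh
# Align each species' protein set against the Swiss-Prot Diamond database.
mkdir -p ./diamond_output_update
for i in $(cat update250714.id)
do
    echo "$i processing"
    # Strip "." characters from sequence lines in place (header lines starting with ">" are untouched)
    sed -i '/^[^>]/s/\.//g' "$i.anno.pep.fa"
    diamond blastp --db /data/software/uniprot_diamond \
        --query "./$i.anno.pep.fa" \
        --out "./diamond_output_update/$i.diamond.out" \
        --outfmt 6 --threads 80 \
        --max-target-seqs 1 --evalue 0.01
    echo "$i is completed"
done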

Example Diamond Output

Atri000002.1 sp|Q6ZNG9|KRBA2_HUMAN 33.5 176 115 2 42 216 139 313 1.10e-25 107
Atri000012.1 sp|Q0AMI4|RNH_MARMM 56.7 30 13 0 624 653 111 140 1.54e-04 46.2
Atri000015.1 sp|Q96T55|KCNKG_HUMAN 25.0 224 110 5 137 360 56 221 1.27e-11 69.3

Output format (tab-separated, --outfmt 6): query ID, subject ID, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score
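The column layout above can be read into named fields with a few lines of Python. This is a minimal sketch, assuming the file is tab-separated with exactly the twelve columns listed; the function name parse_diamond and the short column names (which follow the usual BLAST tabular field names) are illustrative and not part of the pipeline's scripts.

```python
from typing import Dict, Iterable, Iterator

# Names for the twelve --outfmt 6 columns, in the order listed above.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_diamond(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per alignment line (hypothetical helper)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        hit: Dict[str, object] = dict(zip(COLUMNS, fields))
        for key in INT_FIELDS:
            hit[key] = int(fields[COLUMNS.index(key)])
        for key in FLOAT_FIELDS:
            hit[key] = float(fields[COLUMNS.index(key)])
        yield hit


# Usage: the first example line above, rebuilt with tab separators.
example = "\t".join(["Atri000002.1", "sp|Q6ZNG9|KRBA2_HUMAN", "33.5", "176", "115",
                     "2", "42", "216", "139", "313", "1.10e-25", "107"])
for hit in parse_diamond([example]):
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

Because the alignment is run with --max-target-seqs 1, each query is expected to appear at most once, so a dict keyed on qseqid is usually enough for later lookups.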

2. Add Swiss-Prot Functional Descriptions

Extract functional descriptions from Swiss-Prot using ID-description mappings.

Swiss-Prot Mapping File Format

sp|Q197F2|008L_IIV3 Uncharacterized protein 008L OS=Invertebrate iridescent virus 3 OX=345201 GN=IIV3-008L PE=4 SV=1
sp|Q6GZW6|009L_FRG3G Putative helicase 009L OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-009L PE=4 SV=1

Run Swiss-Prot Annotation Script

python gene_table_swissprot.py \
    ./diamond_output/$i.diamond.out \
    /data/software/uniprot_map.id \
    ./gene_table_swissprot/$i.gene.swissprot.tab

Output: Tab-separated file with columns: Gene ID, ID, Accession, Description, OS, OX, GN, PE, SV
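The gene_table_swissprot.py script itself is not reproduced here. As an illustration of the parsing it implies, the sketch below splits one mapping-file line into the listed columns. It assumes the header layout shown in the mapping file example (sp|…|… followed by a description and OS=, OX=, GN=, PE=, SV= fields); the function name is hypothetical. The keys follow the pipeline's column labels, in which "ID" holds the UniProt accession and "Accession" holds the entry name, as in the EggNOG file example.

```python
import re
from typing import Dict

# "sp|<accession>|<entry name> <rest of header>"
HEADER_RE = re.compile(r"^sp\|([^|]+)\|(\S+)\s+(.*)$")
# KEY= markers that separate the description from the tagged fields.
FIELD_RE = re.compile(r"\b(OS|OX|GN|PE|SV)=")


def parse_swissprot_line(line: str) -> Dict[str, str]:
    """Split one mapping-file line into named columns (hypothetical helper)."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised Swiss-Prot header: {line!r}")
    uniprot_id, entry_name, rest = match.groups()
    # re.split with a capturing group returns
    # [description, key1, value1, key2, value2, ...]
    parts = FIELD_RE.split(rest)
    record = {"ID": uniprot_id, "Accession": entry_name,
              "Description": parts[0].strip(),
              "OS": "", "OX": "", "GN": "", "PE": "", "SV": ""}
    for key, value in zip(parts[1::2], parts[2::2]):
        record[key] = value.strip()
    return record


line = ("sp|Q197F2|008L_IIV3 Uncharacterized protein 008L "
        "OS=Invertebrate iridescent virus 3 OX=345201 GN=IIV3-008L PE=4 SV=1")
print(parse_swissprot_line(line))
```

Fields missing from a header (GN= is optional in UniProt headers) are left as empty strings rather than raising an error, so every row keeps the same number of columns.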

3. Add ORF Name, KEGG, and GO Information

Map UniProt IDs to ORF names, KEGG pathways, and Gene Ontology terms using the EggNOG database.

EggNOG File Format Example

ID Accession Description OS OX GN PE SV Gene_ORFName KEGG GO
Q6GZX4 001R_FRG3G Putative transcription factor 001R Frog virus 3 654924 FV3-001R 4 1 FV3-001R vg:2947773 GO:0046782
Q6GZX3 002L_FRG3G Uncharacterized protein 002L Frog virus 3 654924 FV3-002L 4 1 FV3-002L vg:2947774 GO:0033644; GO:0016020

Run EggNOG Annotation Script

python gene_table_eggnog.py \
    ./gene_table_swissprot/$i.gene.swissprot.tab \
    ./uniprot_final.tsv \
    ./gene_table_eggnog/$i.gene.eggnog.tab

Output: The Swiss-Prot table from the previous step with three additional columns: Gene_ORFName, KEGG, GO
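The mapping in this step amounts to a lookup join on the UniProt ID. The sketch below shows one way to do it and is not the gene_table_eggnog.py script. It assumes the EggNOG file is tab-separated with the header row shown in the format example, and that the Swiss-Prot table has no header row and carries the UniProt ID in its second column (Gene ID, ID, ...); both function names are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List

EXTRA_COLUMNS = ["Gene_ORFName", "KEGG", "GO"]


def load_annotations(handle) -> Dict[str, List[str]]:
    """Map UniProt ID -> [Gene_ORFName, KEGG, GO] from the EggNOG file."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["ID"]: [row.get(col) or "" for col in EXTRA_COLUMNS]
            for row in reader}


def add_annotations(rows: Iterable[List[str]],
                    annotations: Dict[str, List[str]]) -> Iterator[List[str]]:
    """Append the three extra columns; unmatched IDs get empty strings."""
    for row in rows:
        uniprot_id = row[1]  # assumption: second column holds the UniProt ID
        yield list(row) + annotations.get(uniprot_id, ["", "", ""])


# Usage with in-memory data standing in for uniprot_final.tsv.
eggnog = io.StringIO(
    "ID\tAccession\tDescription\tOS\tOX\tGN\tPE\tSV\tGene_ORFName\tKEGG\tGO\n"
    "Q6GZX4\t001R_FRG3G\tPutative transcription factor 001R\tFrog virus 3\t"
    "654924\tFV3-001R\t4\t1\tFV3-001R\tvg:2947773\tGO:0046782\n"
)
table = [["gene0001", "Q6GZX4", "001R_FRG3G"]]  # made-up gene ID for illustration
for row in add_annotations(table, load_annotations(eggnog)):
    print("\t".join(row))
```

Keeping unmatched genes in the output with empty columns, rather than dropping them, preserves a one-row-per-gene table for the sequence step that follows.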

4. Add CDS and Protein Sequences

Integrate protein and CDS sequence information into the gene annotation table.

Add Sequence Information

python gene_table_addseq.py \
    ./gene_table_eggnog/$i.gene.eggnog.tab \
    ./pep/$i.anno.pep.fa \
    ./cds/$i.cds.fa \
    ./gene_table_withseq/$i.withseq.tab

Output: Complete gene table with protein and CDS sequences added
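Adding sequences comes down to reading both FASTA files into dictionaries and appending the matching entries to each row. The following is a minimal sketch under stated assumptions, not the gene_table_addseq.py script: it assumes the first word of each FASTA header equals the gene ID in the table's first column, for both the protein and the CDS file. Function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record ID: sequence}, joining wrapped sequence lines."""
    sequences: Dict[str, str] = {}
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""  # ID is the first word of the header
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def add_sequences(rows: Iterable[List[str]], pep: Dict[str, str],
                  cds: Dict[str, str]) -> Iterator[List[str]]:
    """Append protein and CDS sequence columns; missing IDs get empty strings."""
    for row in rows:
        gene_id = row[0]  # assumption: first column holds the gene ID
        yield list(row) + [pep.get(gene_id, ""), cds.get(gene_id, "")]


# Usage with small in-memory records (made-up sequences).
pep = read_fasta([">geneA some description", "MKV", "LLA"])
cds = read_fasta([">geneA", "ATGAAAGTG", "CTGCTGGCG"])
for row in add_sequences([["geneA", "annotation"]], pep, cds):
    print("\t".join(row))
```

With real files, the same functions can be fed open file handles, for example read_fasta(open(path)) for each of the protein and CDS paths passed to the script.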

5. Complete Batch Processing

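Run the post-alignment annotation steps (Swiss-Prot descriptions, ORF name/KEGG/GO mapping, and sequences) for multiple species using a batch script. The Diamond alignment must already have been run for each species.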

Batch Processing Script

#!/bin/bash
# Expects Diamond results in ./diamond_output/ for every ID listed in species.id
# (adjust the path if the alignment step wrote its output elsewhere).
mkdir -p ./gene_table_swissprot ./gene_table_eggnog ./gene_table_withseq
for i in $(cat species.id)
do
    echo "$i processing"

    # Workflow step 2: Add Swiss-Prot descriptions
    python gene_table_swissprot.py \
        "./diamond_output/$i.diamond.out" \
        /data/software/uniprot_map.id \
        "./gene_table_swissprot/$i.gene.swissprot.tab"

    # Workflow step 3: Add ORF name, KEGG, and GO information
    python gene_table_eggnog.py \
        "./gene_table_swissprot/$i.gene.swissprot.tab" \
        ./uniprot_final.tsv \
        "./gene_table_eggnog/$i.gene.eggnog.tab"

    # Workflow step 4: Add sequences
    python gene_table_addseq.py \
        "./gene_table_eggnog/$i.gene.eggnog.tab" \
        "./pep/$i.anno.pep.fa" \
        "./cds/$i.cds.fa" \
        "./gene_table_withseq/$i.withseq.tab"

    echo "$i is completed"
done

Gene Family Classification

Genes are classified into functional families based on annotation keywords. The classification system has 11 categories:

Primary Categories

  • Signaling & Receptors
  • Metabolism & Energy
  • Transporters
  • Structural Components
  • Neural Function
  • Development & Morphogenesis

Secondary Categories

  • Digestion
  • Immunity
  • Gene Regulation
  • Sensory Perception
  • Other / Unclassified

Important Notes

  • Classification relies on annotations transferred by sequence similarity, so it can produce false positives
  • Further filtering based on specific domains or motifs is recommended
  • Detailed classification criteria are available in the gene_family.list file
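Keyword-based classification of this kind can be sketched as a case-insensitive substring match against each annotation description. The keywords below are hypothetical placeholders chosen only for illustration; the actual criteria are those in gene_family.list, whose format is not described here. The function name is also an assumption.

```python
from typing import Dict, List

# Hypothetical keyword lists; the real criteria live in gene_family.list.
CATEGORY_KEYWORDS: Dict[str, List[str]] = {
    "Transporters": ["transporter", "channel"],
    "Immunity": ["immune", "antimicrobial"],
    "Digestion": ["trypsin", "lipase", "amylase"],
}

DEFAULT_CATEGORY = "Other / Unclassified"


def classify(description: str,
             keywords: Dict[str, List[str]] = CATEGORY_KEYWORDS) -> str:
    """Return the first category with a keyword found in the description."""
    text = description.lower()
    for category, words in keywords.items():
        if any(word in text for word in words):
            return category
    return DEFAULT_CATEGORY


print(classify("Potassium channel subfamily K member 16"))
print(classify("Uncharacterized protein 008L"))
```

In this sketch the first matching category wins, so the order of the dictionary decides ties. Plain substring matching is also one source of the false positives noted above, which is why a follow-up domain or motif check is advisable.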

Output Files

Final Gene Table

Complete annotation table with functional descriptions, GO terms, KEGG pathways, and sequences

species.withseq.tab

Diamond Alignment Results

Raw alignment results from Diamond BLASTP search

species.diamond.out

Intermediate Annotation Files

Swiss-Prot and EggNOG annotation intermediate results

species.gene.swissprot.tab, species.gene.eggnog.tab