Gene Function Annotation Pipeline
A workflow for annotating gene functions using DIAMOND alignment and functional databases.
It follows the methodology used by the InsectBase database team for gene function annotation.
Overview
The gene function annotation pipeline combines sequence similarity searches with functional database mappings. The resulting annotations include:
- Swiss-Prot functional descriptions - High-quality manually curated protein annotations
- Gene Ontology (GO) terms - Standardized functional classifications
- KEGG pathway mappings - Metabolic and signaling pathway associations
- ORF name assignments - Standardized gene nomenclature
- Protein and CDS sequences - Complete sequence information
Required Software
Sequence Alignment
- DIAMOND - Fast protein sequence aligner
Data Processing
- Python 3.x with standard libraries
- Custom annotation scripts
Required Databases
- Swiss-Prot protein database
Available from UniProt - High-quality manually annotated protein sequences
- UniProt ID mapping file (uniprot_map.id)
Contains Swiss-Prot ID to description mappings
- EggNOG functional annotation file (uniprot_final.tsv)
Maps UniProt IDs to ORF names, KEGG, and GO annotations
Workflow Steps
Diamond Alignment Phase
Align the annotated protein sequences of each genome against the Swiss-Prot protein database using DIAMOND.
Batch Alignment Script
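#!/bin/bash
# batch_diamond.sh
# Run DIAMOND blastp against Swiss-Prot for every species ID listed in update250714.id.
for i in $(cat update250714.id)
do
    echo "$i processing"
    # Remove '.' characters from sequence lines (header lines starting with '>' are untouched).
    sed -i '/^[^>]/s/\.//g' "$i.anno.pep.fa"
    # Keep only the best target per query, with an e-value cutoff of 0.01.
    # Note: later steps read from ./diamond_output/, so adjust --out if needed.
    diamond blastp --db /data/software/uniprot_diamond \
        --query "./$i.anno.pep.fa" \
        --out "./diamond_output_update/$i.diamond.out" \
        --outfmt 6 --threads 80 \
        --max-target-seqs 1 --evalue 0.01
    echo "$i is completed"
done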
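The sed command in the batch script edits each protein FASTA file in place: it deletes every '.' from sequence lines and leaves header lines alone. The dots are probably stop-codon markers written by the annotation tool, although the source does not say so. As a minimal sketch, the same cleaning step could be written in Python like this (the function name `strip_dots` is hypothetical and not part of the pipeline):

```python
from typing import Iterable, Iterator


def strip_dots(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with '.' removed from sequence lines only.

    Mirrors: sed '/^[^>]/s/\\.//g'
    Header lines (starting with '>') are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.replace(".", "")


if __name__ == "__main__":
    example = [">Atri000002.1 gene.1\n", "MKT.LLV.\n"]
    for cleaned in strip_dots(example):
        print(cleaned, end="")
```

Unlike `sed -i`, this sketch does not modify the file in place; the caller decides where the cleaned lines are written.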
Example Diamond Output
Atri000002.1 sp|Q6ZNG9|KRBA2_HUMAN 33.5 176 115 2 42 216 139 313 1.10e-25 107
Atri000012.1 sp|Q0AMI4|RNH_MARMM 56.7 30 13 0 624 653 111 140 1.54e-04 46.2
Atri000015.1 sp|Q96T55|KCNKG_HUMAN 25.0 224 110 5 137 360 56 221 1.27e-11 69.3
Output format (tab-separated, the default columns of --outfmt 6): Query ID, Subject ID, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score
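To make the twelve columns concrete, here is a minimal sketch that parses one output line into named fields. The names `DiamondHit`, `parse_hit`, and `swissprot_accession` are illustrative assumptions, not part of the pipeline's scripts. Splitting on any whitespace works for tab-separated files as well, because none of the fields contain spaces.

```python
from typing import NamedTuple


class DiamondHit(NamedTuple):
    """One row of DIAMOND tabular output (--outfmt 6, default columns)."""
    query_id: str
    subject_id: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_hit(line: str) -> DiamondHit:
    """Convert one output line into a typed DiamondHit record."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    return DiamondHit(
        fields[0],
        fields[1],
        float(fields[2]),
        *map(int, fields[3:10]),  # length ... subject end (7 integer columns)
        float(fields[10]),
        float(fields[11]),
    )


def swissprot_accession(subject_id: str) -> str:
    """Return the accession from an 'sp|ACCESSION|ENTRY_NAME' identifier."""
    parts = subject_id.split("|")
    return parts[1] if len(parts) == 3 else subject_id


if __name__ == "__main__":
    hit = parse_hit(
        "Atri000002.1 sp|Q6ZNG9|KRBA2_HUMAN 33.5 176 115 2 42 216 139 313 1.10e-25 107"
    )
    print(hit.query_id, swissprot_accession(hit.subject_id), hit.evalue)
```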
Add Swiss-Prot Functional Descriptions
Extract functional descriptions from Swiss-Prot using ID-description mappings.
Swiss-Prot Mapping File Format
sp|Q197F2|008L_IIV3 Uncharacterized protein 008L OS=Invertebrate iridescent virus 3 OX=345201 GN=IIV3-008L PE=4 SV=1
sp|Q6GZW6|009L_FRG3G Putative helicase 009L OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-009L PE=4 SV=1
Run Swiss-Prot Annotation Script
python gene_table_swissprot.py \
./diamond_output/$i.diamond.out \
/data/software/uniprot_map.id \
./gene_table_swissprot/$i.gene.swissprot.tab
Output: Tab-separated file with columns: Gene ID, ID, Accession, Description, OS, OX, GN, PE, SV
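The source of gene_table_swissprot.py is not shown here, so the following is only a sketch of how one mapping line could be split into the listed columns. It assumes that "ID" holds the UniProt accession and "Accession" holds the entry name, matching the column naming in the EggNOG file example; `parse_mapping_line` is a hypothetical name.

```python
import re

# Split before each known KEY= tag so that multi-word values stay intact.
_FIELD_SPLIT = re.compile(r"\s+(?=(?:OS|OX|GN|PE|SV)=)")
_COLUMNS = ("ID", "Accession", "Description", "OS", "OX", "GN", "PE", "SV")


def parse_mapping_line(line: str) -> dict:
    """Parse 'sp|ACC|NAME description OS=... OX=... GN=... PE=... SV=...'."""
    full_id, rest = line.strip().split(None, 1)
    _db, accession, entry_name = full_id.split("|")
    parts = _FIELD_SPLIT.split(rest)
    record = dict.fromkeys(_COLUMNS, "")  # tags such as GN may be absent
    record["ID"] = accession
    record["Accession"] = entry_name
    record["Description"] = parts[0]
    for part in parts[1:]:
        key, _, value = part.partition("=")
        record[key] = value
    return record


if __name__ == "__main__":
    rec = parse_mapping_line(
        "sp|Q6GZW6|009L_FRG3G Putative helicase 009L "
        "OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-009L PE=4 SV=1"
    )
    print("\t".join(rec[c] for c in _COLUMNS))
```

A gene table row would then be the query gene ID from the DIAMOND output followed by these eight values.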
Add ORF Name, KEGG, and GO Information
Map UniProt IDs to ORF names, KEGG pathways, and Gene Ontology terms using the EggNOG database.
EggNOG File Format Example
ID Accession Description OS OX GN PE SV Gene_ORFName KEGG GO
Q6GZX4 001R_FRG3G Putative transcription factor 001R Frog virus 3 654924 FV3-001R 4 1 FV3-001R vg:2947773 GO:0046782
Q6GZX3 002L_FRG3G Uncharacterized protein 002L Frog virus 3 654924 FV3-002L 4 1 FV3-002L vg:2947774 GO:0033644; GO:0016020
Run EggNOG Annotation Script
python gene_table_eggnog.py \
./gene_table_swissprot/$i.gene.swissprot.tab \
./uniprot_final.tsv \
./gene_table_eggnog/$i.gene.eggnog.tab
Output: Previous table with three additional columns: Gene_ORFName, KEGG, GO
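Conceptually, this step is a lookup keyed on the UniProt ID. The sketch below illustrates that join under two assumptions: the functional annotation file has the header shown in the example above, and the UniProt ID is the second column of the Swiss-Prot gene table. The function names are hypothetical, and the actual gene_table_eggnog.py may work differently.

```python
import csv
from typing import Dict, Iterable, Iterator, List, Tuple

EXTRA_COLUMNS = ("Gene_ORFName", "KEGG", "GO")


def load_function_map(lines: Iterable[str]) -> Dict[str, Tuple[str, ...]]:
    """Index the annotation table by UniProt ID (the 'ID' header column)."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["ID"]: tuple(row[c] for c in EXTRA_COLUMNS) for row in reader}


def add_function_columns(
    gene_rows: Iterable[List[str]],
    function_map: Dict[str, Tuple[str, ...]],
    id_column: int = 1,  # assumption: UniProt ID is the second column
) -> Iterator[List[str]]:
    """Append Gene_ORFName, KEGG, and GO to each row; blank if no match."""
    empty = ("",) * len(EXTRA_COLUMNS)
    for row in gene_rows:
        yield list(row) + list(function_map.get(row[id_column], empty))


if __name__ == "__main__":
    table = [
        "ID\tAccession\tDescription\tOS\tOX\tGN\tPE\tSV\tGene_ORFName\tKEGG\tGO",
        "Q6GZX3\t002L_FRG3G\tUncharacterized protein 002L\tFrog virus 3\t654924"
        "\tFV3-002L\t4\t1\tFV3-002L\tvg:2947774\tGO:0033644; GO:0016020",
    ]
    fmap = load_function_map(table)
    for out in add_function_columns([["gene1", "Q6GZX3"]], fmap):
        print("\t".join(out))
```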
Add CDS and Protein Sequences
Integrate protein and CDS sequence information into the gene annotation table.
Add Sequence Information
python gene_table_addseq.py \
./gene_table_eggnog/$i.gene.eggnog.tab \
./pep/$i.anno.pep.fa \
./cds/$i.cds.fa \
./gene_table_withseq/$i.withseq.tab
Output: Complete gene table with protein and CDS sequences added
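As an illustration of this step, the sketch below reads FASTA records into a dictionary and appends the protein and CDS sequence to a table row. It assumes that the first word of each FASTA header equals the gene ID in the first column of the table, which the source does not state explicitly; `read_fasta` and `add_sequences` are hypothetical names.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence ID: sequence}, joining wrapped sequence lines."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # ID is the first word of the header
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def add_sequences(
    row: List[str], proteins: Dict[str, str], cds: Dict[str, str]
) -> List[str]:
    """Append protein and CDS sequences to a row; blank when an ID is missing."""
    gene_id = row[0]
    return list(row) + [proteins.get(gene_id, ""), cds.get(gene_id, "")]


if __name__ == "__main__":
    pep = read_fasta([">Atri000002.1 example", "MKT", "LLV"])
    cds = read_fasta([">Atri000002.1", "ATGAAAACC", "CTGCTGGTG"])
    print("\t".join(add_sequences(["Atri000002.1", "Q6ZNG9"], pep, cds)))
```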
Complete Batch Processing
Run the entire annotation pipeline for multiple species using a batch script.
Batch Processing Script
#!/bin/bash
for i in $(cat species.id)
do
echo "$i processing"
# Step 1: Add Swiss-Prot descriptions
python gene_table_swissprot.py \
./diamond_output/$i.diamond.out \
/data/software/uniprot_map.id \
./gene_table_swissprot/$i.gene.swissprot.tab
# Step 2: Add ORF names, KEGG, and GO information
python gene_table_eggnog.py \
./gene_table_swissprot/$i.gene.swissprot.tab \
./uniprot_final.tsv \
./gene_table_eggnog/$i.gene.eggnog.tab
# Step 3: Add sequences
python gene_table_addseq.py \
./gene_table_eggnog/$i.gene.eggnog.tab \
./pep/$i.anno.pep.fa \
./cds/$i.cds.fa \
./gene_table_withseq/$i.withseq.tab
echo "$i is completed"
done
Gene Family Classification
Genes are classified into functional families based on annotation keywords. The classification system includes 11 major categories:
Primary Categories
- Signaling & Receptors
- Metabolism & Energy
- Transporters
- Structural Components
- Neural Function
- Development & Morphogenesis
Secondary Categories
- Digestion
- Immunity
- Gene Regulation
- Sensory Perception
- Other / Unclassified
Important Notes
- Classification relies on annotations transferred by sequence similarity, so it may produce false positives
- Further filtering based on specific domains or motifs is recommended
- Detailed classification criteria are available in the gene_family.list file
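Keyword-based classification can be pictured as a simple substring match against the functional description. The sketch below is purely illustrative: the keywords are placeholders chosen for the example, not the criteria from gene_family.list, and only three of the eleven categories are shown. The names `ILLUSTRATIVE_KEYWORDS` and `classify` are hypothetical.

```python
from typing import Dict, Tuple

# Placeholder keywords for illustration only; the real criteria
# are defined in gene_family.list and are not reproduced here.
ILLUSTRATIVE_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "Transporters": ("transporter", "channel"),
    "Immunity": ("immun", "antimicrobial"),
    "Digestion": ("trypsin", "lipase"),
}

DEFAULT_FAMILY = "Other / Unclassified"


def classify(
    description: str,
    keyword_table: Dict[str, Tuple[str, ...]] = ILLUSTRATIVE_KEYWORDS,
) -> str:
    """Return the first family whose keyword occurs in the description."""
    text = description.lower()
    for family, keywords in keyword_table.items():
        if any(keyword in text for keyword in keywords):
            return family
    return DEFAULT_FAMILY


if __name__ == "__main__":
    print(classify("Potassium channel subfamily K member 16"))
    print(classify("Uncharacterized protein 008L"))
```

Because the first matching family wins, the order of the table matters when a description matches keywords from more than one category. A match on a keyword alone says nothing about domains or motifs, which is why the additional filtering mentioned above is recommended.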
Output Files
Final Gene Table
Complete annotation table with functional descriptions, GO terms, KEGG pathways, and sequences
species.withseq.tab
Diamond Alignment Results
Raw alignment results from Diamond BLASTP search
species.diamond.out
Intermediate Annotation Files
Swiss-Prot and EggNOG annotation intermediate results
species.gene.swissprot.tab, species.gene.eggnog.tab