HGT Rapid Identification Workflow
A quick identification method for potential Horizontal Gene Transfer (HGT) genes based on DIAMOND and UniProt.
This workflow is intended for preliminary screening and exploration, helping users quickly identify candidate HGT genes for further evolutionary and phylogenetic analyses.
For a comprehensive and phylogenetically robust HGT detection strategy, please refer to the methods described in Wang et al., Cell (2022).
1. Overview
This workflow identifies potential horizontal gene transfer (HGT) events from DIAMOND alignment results (for example, from diamond blastp) using taxonomic classification derived from UniProt and NCBI taxonomy data.
Core Identification Principle
If at least 80% of the DIAMOND hits for a gene originate from the same non-host taxonomic group (e.g., Fungi, Virus, Bacteria), the gene is considered a potential HGT candidate.
This approach provides a fast and reproducible framework for large-scale screening of candidate HGT genes in genomic or transcriptomic datasets.
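As a rough illustration of the rule above, the following Python sketch applies the threshold to hits that have already been labeled with a taxonomic category. The function name dominant_category, the "Other" label for hits outside the screened groups, and the inclusive 80% cutoff are assumptions made for this example; the actual script may behave differently.

```python
from collections import Counter

def dominant_category(hit_categories, threshold_pct=80, ignored=("Other",)):
    """Return the category covering at least threshold_pct of the hits, or None."""
    total = len(hit_categories)
    if total == 0:
        return None
    category, count = Counter(hit_categories).most_common(1)[0]
    # Integer arithmetic avoids floating-point surprises at exactly 80%.
    if category not in ignored and count * 100 >= threshold_pct * total:
        return category
    return None

# Hypothetical gene with 10 hits, 9 of them fungal.
print(dominant_category(["Fungi"] * 9 + ["Bacteria"]))  # Fungi
```

A gene whose hits are split 7 to 3 between two groups returns None and would not be reported.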
2. Required Input Files
You will need three input files for this workflow.
2.1 DIAMOND Alignment Results
- Format: Tab-separated (TSV) file generated by diamond blastp (or similar).
- Example command:
diamond blastp -f 6 qseqid sseqid bitscore evalue ...
- Required columns:
  - query_id (qseqid in the command above): your gene ID.
  - subject_id (sseqid in the command above): the UniProt protein ID of the hit.
- File naming example:
sample1.diamond.out
- Recommendation: Include at least 20 hits per gene for more reliable classification.
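The sketch below shows one way such a file could be read and checked against the 20-hit recommendation. It assumes that the first two columns are the query and subject IDs, as in the example command; the helper names read_diamond_hits and under_supported are hypothetical and not part of the script.

```python
from collections import defaultdict

def read_diamond_hits(lines):
    """Group subject IDs by query ID from tab-separated DIAMOND lines."""
    hits = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        hits[fields[0]].append(fields[1])
    return dict(hits)

def under_supported(hits, min_hits=20):
    """List query IDs with fewer than min_hits hits, sorted for stable output."""
    return sorted(q for q, subjects in hits.items() if len(subjects) < min_hits)
```

Both functions accept any iterable of lines, so an open file handle for sample1.diamond.out can be passed directly to read_diamond_hits.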
2.2 UniProt ID to Species Mapping File
This file maps each UniProt protein to its corresponding organism.
- Format: Two-column TSV
UniProt_ID <tab> Species_Name
- Example:
Q10126 <tab> Caenorhabditis elegans
Q9Y2K1 <tab> Homo sapiens
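A minimal loader for this mapping might look like the following. It assumes a real tab character separates the two columns, which matters because species names contain spaces; the function name load_id_to_species is hypothetical.

```python
def load_id_to_species(lines):
    """Map each UniProt ID to its species name from two-column TSV lines."""
    mapping = {}
    for line in lines:
        uniprot_id, sep, species = line.rstrip("\n").partition("\t")
        if not sep or not uniprot_id.strip():
            continue  # no tab separator or empty ID: skip the line
        mapping[uniprot_id.strip()] = species.strip()
    return mapping
```

Lines without a tab are skipped silently in this sketch; a stricter version could raise an error instead.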
2.3 Species Name to Taxonomic Lineage Mapping File
This file links each species to its NCBI taxonomic lineage.
- Format:
Species_Name <tab> tax_id1 tax_id2 ... tax_idN
- Example:
Homo sapiens <tab> 9606 9605 33208 2759 131567
Aspergillus fumigatus <tab> 451204 4751 33154 2759 131567
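The lineage file could be loaded along these lines. The sketch assumes a tab after the species name and whitespace (spaces or tabs) between the tax IDs; the function name load_lineages is hypothetical, and non-numeric tokens are simply dropped.

```python
def load_lineages(lines):
    """Map each species name to the set of tax IDs in its lineage."""
    lineages = {}
    for line in lines:
        name, sep, rest = line.rstrip("\n").partition("\t")
        if not sep or not name.strip():
            continue  # no tab separator or empty species name
        # split() with no argument handles both spaces and tabs between IDs.
        lineages[name.strip()] = {int(tok) for tok in rest.split() if tok.isdigit()}
    return lineages
```

Storing each lineage as a set makes the later membership checks for specific tax IDs cheap.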
3. HGT Classification Rules
The script internally determines the most likely HGT category based on NCBI taxonomy IDs.
Category | Classification Condition (tax_id) |
---|---|
Archaea | Contains 2157 |
Bacteria | Contains 2 |
Virus | Contains 10239 |
Viridiplantae | Contains 33090 |
Fungi | Contains 4751 |
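The table can be expressed as a small lookup, sketched below. "Contains" is interpreted here as exact membership of the tax ID in the lineage, not a substring match; the "Other" fallback label and the function name classify_lineage are assumptions for illustration.

```python
# (category, tax_id) pairs taken from the table above.
CATEGORY_TAX_IDS = (
    ("Archaea", 2157),
    ("Bacteria", 2),
    ("Virus", 10239),
    ("Viridiplantae", 33090),
    ("Fungi", 4751),
)

def classify_lineage(lineage_ids, default="Other"):
    """Return the first category whose tax ID appears in the lineage."""
    ids = set(lineage_ids)
    for category, tax_id in CATEGORY_TAX_IDS:
        if tax_id in ids:
            return category
    return default

# Lineage of Aspergillus fumigatus from the example in section 2.3.
print(classify_lineage([451204, 4751, 33154, 2759, 131567]))  # Fungi
```

Exact matching matters for Bacteria: a lineage containing 2759 or 2157 must not count as containing 2.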
4. Usage Example
Run the HGT classifier using the following command:
python hgt_classifier_80pct_named_output.py tax_lineage.txt uniprot_id2name.txt sample1.diamond.out > sample1.hgt.tsv
Output Format
Each line of the output file contains:
<File Prefix> <Gene ID> <Predicted HGT Category>
Example Output
sample1 Zvic000013.1 Fungi
sample1 Zvic000002.2 Virus
This file lists genes that meet the ≥80% classification threshold and are considered potential HGT candidates.
5. Output Interpretation
- Each line corresponds to one gene identified as a potential HGT event.
- A gene is reported only if:
- ≥80% of its DIAMOND hits fall under the same non-host taxonomic category.
- Taxonomic mapping is complete and valid.
- Genes that do not meet the threshold or have missing taxonomy are excluded.
- Additional diagnostic information (e.g., percentage distribution across taxa) is printed to stderr during execution for verification and debugging.
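To give a sense of what such a diagnostic line might contain, the sketch below formats a per-gene percentage distribution and writes it to stderr, keeping stdout free for results. The exact layout of the script's diagnostics is not documented here, so the format and the function name format_distribution are assumptions.

```python
import sys
from collections import Counter

def format_distribution(gene_id, hit_categories):
    """Summarize how a gene's hits are spread across categories."""
    counts = Counter(hit_categories)
    total = sum(counts.values())
    parts = [f"{cat}={100 * n / total:.1f}%" for cat, n in counts.most_common()]
    return f"{gene_id}\t{total} hits\t" + ", ".join(parts)

# Diagnostics go to stderr so that redirecting stdout captures only results.
print(format_distribution("Zvic000013.1", ["Fungi"] * 9 + ["Bacteria"]), file=sys.stderr)
```

Because results and diagnostics use separate streams, appending 2> sample1.log to the command saves the diagnostics without mixing them into sample1.hgt.tsv.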
6. Optional Improvements
You can expand this workflow to improve accuracy and performance:
- E-value or bitscore filtering: Add thresholds to include only high-confidence hits.
- Batch processing: Modify the script to process multiple DIAMOND result files at once.
- Host exclusion filter: Automatically remove hits from the host organism to reduce false positives.
- Visualization: Summarize classification results using R or Python to visualize HGT patterns.
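Two of these improvements, score filtering and host exclusion, can be sketched as small helpers. The column positions assume the order qseqid, sseqid, bitscore, evalue from the example command; the cutoff values and the function names are hypothetical and should be tuned to the dataset.

```python
def passes_thresholds(fields, max_evalue=1e-5, min_bitscore=50.0):
    """Check one split DIAMOND line (qseqid, sseqid, bitscore, evalue, ...)."""
    bitscore, evalue = float(fields[2]), float(fields[3])
    return evalue <= max_evalue and bitscore >= min_bitscore

def drop_host_hits(subject_ids, id_to_species, host_species):
    """Remove hits whose mapped species is the host; unmapped IDs are kept."""
    return [s for s in subject_ids if id_to_species.get(s) != host_species]
```

Filtering before the 80% calculation changes the denominator, so genes with few surviving hits deserve extra caution.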
7. Example Data Flow
Below is an example of the typical input/output organization for one dataset:
Input:
├── tax_lineage.txt
├── uniprot_id2name.txt
└── sample1.diamond.out
Command:
python hgt_classifier_80pct_named_output.py tax_lineage.txt uniprot_id2name.txt sample1.diamond.out > sample1.hgt.tsv
Output:
├── sample1.hgt.tsv # Filtered HGT classification results
└── stderr log # Diagnostic and summary information
8. Notes and Best Practices
- Ensure species names match exactly, including case, between the two mapping files.
- The UniProt and taxonomy mapping files must cover all species in your DIAMOND output.
- For large datasets, consider parallelization or chunked processing.
- Always validate a subset of identified HGT candidates using phylogenetic methods before reporting.
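A quick pre-flight check for the first two points could look like this. It assumes the mappings have already been loaded into dictionaries keyed by UniProt ID and by species name; the function name find_unmapped is hypothetical.

```python
def find_unmapped(subject_ids, id_to_species, species_to_lineage):
    """Return (IDs with no species, species with no lineage), both sorted."""
    missing_ids = set()
    missing_species = set()
    for subject in subject_ids:
        species = id_to_species.get(subject)
        if species is None:
            missing_ids.add(subject)
        elif species not in species_to_lineage:
            # Lookup is exact, so a case mismatch shows up here too.
            missing_species.add(species)
    return sorted(missing_ids), sorted(missing_species)
```

Running this before classification makes gaps in the mapping files visible, rather than leaving affected genes to drop out of the results unnoticed.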
9. Citation and License
If you use or adapt this workflow, please cite:
"HGT Rapid Identification Workflow (DIAMOND + UniProt)"
Developed for preliminary detection of horizontal gene transfer events using taxonomic inference.
License: MIT License
For full phylogenetic HGT validation and evolutionary analysis, refer to:
Wang et al., Cell, 2022, "Horizontal gene transfer is a hallmark of animal genome evolution"
10. Acknowledgments
This workflow combines data and tools from:
- DIAMOND — Ultrafast protein alignment tool
- UniProtKB — Comprehensive protein sequence database
- NCBI Taxonomy — Hierarchical organism classification
Developed to support rapid, reproducible, and interpretable HGT screening for comparative genomics and bioinformatics research.