HGT Rapid Identification Workflow

A quick identification method for potential Horizontal Gene Transfer (HGT) genes based on DIAMOND and UniProt.

This workflow is intended for preliminary screening and exploration, helping users quickly identify candidate HGT genes for further evolutionary and phylogenetic analyses.

For a comprehensive and phylogenetically robust HGT detection strategy, please refer to the methods described in: Wang et al., Cell (2022).

1. Overview

This workflow identifies potential horizontal gene transfer (HGT) events from DIAMOND alignment results by classifying each hit taxonomically with UniProt and NCBI Taxonomy data.

Core Identification Principle

If at least 80% of the DIAMOND hits for a gene originate from the same non-host taxonomic group (e.g., Fungi, Virus, Bacteria), the gene is considered a potential HGT candidate.

This approach provides a fast and reproducible framework for large-scale screening of candidate HGT genes in genomic or transcriptomic datasets.
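
The principle above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code of the classifier script: the function name call_hgt, the use of None for hits that fall into no non-host category, and the choice of "all hits of the gene" as the denominator are assumptions made for the example.

```python
from collections import Counter

def call_hgt(hit_categories, threshold=0.8):
    """Return the dominant category if it reaches the threshold, else None.

    hit_categories holds one label per DIAMOND hit of a single gene;
    None marks a hit that fell into no non-host category (assumption).
    """
    if not hit_categories:
        return None
    counts = Counter(c for c in hit_categories if c is not None)
    if not counts:
        return None
    category, n = counts.most_common(1)[0]
    # The denominator is all hits, so host-like hits dilute the fraction.
    return category if n / len(hit_categories) >= threshold else None

print(call_hgt(["Fungi"] * 9 + [None]))      # Fungi (9 of 10 hits)
print(call_hgt(["Fungi"] * 7 + [None] * 3))  # None  (7 of 10 hits)
```

Using all hits as the denominator is the conservative choice: a gene with many unclassified or host-like hits is less likely to be reported.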

2. Required Input Files

You will need three input files for this workflow.

2.1 DIAMOND Alignment Results

  • Format: Tab-separated (TSV) file generated by diamond blastp (or similar).
  • Example command:
    diamond blastp -f 6 qseqid sseqid bitscore evalue ...
  • Required columns (the first two fields in the command above):
    • query_id (qseqid): your gene ID.
    • subject_id (sseqid): the UniProt protein ID of the hit.
  • File naming example: sample1.diamond.out
  • Recommendation: Include at least 20 hits per gene for more reliable classification (the number of hits DIAMOND reports per query is controlled by its --max-target-seqs option).
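
As a rough illustration of how such a file can be read, the sketch below groups subject IDs by query ID and lists genes with fewer than 20 hits. It assumes the query ID is in the first column and the subject ID in the second, and that subject IDs are already plain UniProt accessions; the names read_hits and MIN_HITS are hypothetical and do not come from the classifier script.

```python
import csv
import io
from collections import defaultdict

MIN_HITS = 20  # recommended minimum number of hits per gene

def read_hits(handle):
    """Group subject IDs by query ID from tab-separated DIAMOND output."""
    hits = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        hits[row[0]].append(row[1])
    return hits

# Tiny in-memory stand-in for sample1.diamond.out
demo = io.StringIO(
    "geneA\tQ10126\t250\t1e-60\n"
    "geneA\tQ9Y2K1\t180\t1e-40\n"
    "geneB\tQ10126\t90\t1e-10\n"
)
hits = read_hits(demo)
low_coverage = sorted(g for g, subjects in hits.items() if len(subjects) < MIN_HITS)
print(low_coverage)
```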

2.2 UniProt ID to Species Mapping File

This file maps each UniProt protein to its corresponding organism.

  • Format: Two-column TSV
    UniProt_ID <tab> Species_Name
  • Example:
    Q10126	Caenorhabditis elegans
    Q9Y2K1	Homo sapiens

2.3 Species Name to Taxonomic Lineage Mapping File

This file links each species to its NCBI taxonomic lineage.

  • Format:
    Species_Name <tab> tax_id1 tax_id2 ... tax_idN
  • Example:
    Homo sapiens	9606 9605 33208 2759 131567
    Aspergillus fumigatus	451204 4751 33154 2759 131567
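
The two mapping files are meant to be chained: UniProt ID to species name, then species name to lineage. A minimal sketch of that lookup follows, assuming the formats shown above (a tab between the two columns, spaces between tax IDs). The function names are hypothetical, and the actual script may load these files differently.

```python
import io

def load_id2species(handle):
    """Read 'UniProt_ID <tab> Species_Name' lines into a dict."""
    mapping = {}
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping

def load_lineages(handle):
    """Read 'Species_Name <tab> tax_id1 tax_id2 ...' lines into a dict of sets."""
    lineages = {}
    for line in handle:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) == 2:
            lineages[parts[0]] = set(parts[1].split())
    return lineages

def lineage_of(uniprot_id, id2species, lineages):
    """Return the set of tax IDs for a hit, or None if either lookup fails."""
    return lineages.get(id2species.get(uniprot_id))

id2species = load_id2species(io.StringIO("Q9Y2K1\tHomo sapiens\n"))
lineages = load_lineages(io.StringIO("Homo sapiens\t9606 9605 33208 2759 131567\n"))
print(lineage_of("Q9Y2K1", id2species, lineages))
```

Returning None for an unmapped hit makes missing taxonomy explicit, so the caller can decide whether to skip the hit or the whole gene.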

3. HGT Classification Rules

The script internally determines the most likely HGT category based on NCBI taxonomy IDs.

Category         Classification Condition (tax_id)
Archaea          Contains 2157
Bacteria         Contains 2
Virus            Contains 10239
Viridiplantae    Contains 33090
Fungi            Contains 4751
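
These rules can be expressed as a lookup over the lineage of a hit. The sketch below is an assumption about how such a check might look, not the script's actual code; the name classify_lineage and the rule order are hypothetical. It compares whole tax IDs rather than substrings, so that Bacteria (2) is not matched by IDs such as 2759.

```python
# (category, tax_id) pairs taken from the rules listed above
CATEGORY_TAXIDS = [
    ("Archaea", "2157"),
    ("Bacteria", "2"),
    ("Virus", "10239"),
    ("Viridiplantae", "33090"),
    ("Fungi", "4751"),
]

def classify_lineage(lineage):
    """Return the first category whose tax ID is in the lineage, else None."""
    ids = set(lineage)  # whole-ID membership, never substring matching
    for name, taxid in CATEGORY_TAXIDS:
        if taxid in ids:
            return name
    return None

print(classify_lineage("451204 4751 33154 2759 131567".split()))  # Fungi
print(classify_lineage("9606 9605 33208 2759 131567".split()))    # None
```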

4. Usage Example

Run the HGT classifier using the following command:

python hgt_classifier_80pct_named_output.py tax_lineage.txt uniprot_id2name.txt sample1.diamond.out > sample1.hgt.tsv

Output Format

Each line of the output file contains three tab-separated fields:

<File Prefix>    <Gene ID>    <Predicted HGT Category>

Example Output

sample1	Zvic000013.1	Fungi
sample1	Zvic000002.2	Virus

This file lists genes that meet the ≥80% classification threshold and are considered potential HGT candidates.
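
For a quick check of a result file, the candidates can be tallied per category. The sketch below assumes the three fields are tab-separated, as in the example output; the function name tally_categories is hypothetical.

```python
import io
from collections import Counter

def tally_categories(handle):
    """Count reported genes per predicted HGT category."""
    counts = Counter()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:  # prefix, gene ID, category
            counts[fields[2]] += 1
    return counts

# In-memory stand-in for sample1.hgt.tsv
demo = io.StringIO("sample1\tZvic000013.1\tFungi\nsample1\tZvic000002.2\tVirus\n")
print(tally_categories(demo))
```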

5. Output Interpretation

  • Each line corresponds to one gene identified as a potential HGT event.
  • A gene is reported only if:
    • ≥80% of its DIAMOND hits fall under the same non-host taxonomic category.
    • Taxonomic mapping is complete and valid.
  • Genes that do not meet the threshold or have missing taxonomy are excluded.
  • Additional diagnostic information (e.g., percentage distribution across taxa) is printed to stderr during execution for verification and debugging.

6. Optional Improvements

You can expand this workflow to improve accuracy and performance:

  1. E-value or bitscore filtering

    Add thresholds to include only high-confidence hits.

  2. Batch processing

    Modify the script to process multiple DIAMOND result files at once.

  3. Host exclusion filter

    Automatically remove hits from the host organism to reduce false positives.

  4. Visualization

    Summarize classification results using R or Python to visualize HGT patterns.
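
As an illustration of the first and third improvements, the sketch below drops weak hits and hits from the host species before classification. It assumes the column order qseqid, sseqid, bitscore, evalue from the example command in section 2.1; the function name, the 1e-5 cutoff, and the host_species parameter are hypothetical choices for the example, not features of the current script.

```python
def filter_hits(rows, id2species, max_evalue=1e-5, host_species=()):
    """Keep (query, subject) pairs that pass the E-value and host filters.

    rows: iterable of lists of strings, one per DIAMOND hit.
    id2species: dict mapping UniProt ID to species name.
    """
    hosts = set(host_species)
    kept = []
    for row in rows:
        qseqid, sseqid, _bitscore, evalue = row[:4]
        if float(evalue) > max_evalue:
            continue  # low-confidence alignment
        if id2species.get(sseqid) in hosts:
            continue  # hit comes from the host organism
        kept.append((qseqid, sseqid))
    return kept

rows = [
    ["g1", "Q10126", "250", "1e-60"],
    ["g1", "Q9Y2K1", "180", "1e-40"],
    ["g1", "P00001", "40", "0.5"],
]
id2species = {"Q10126": "Caenorhabditis elegans", "Q9Y2K1": "Homo sapiens"}
print(filter_hits(rows, id2species, host_species=["Homo sapiens"]))
```

Filtering before the 80% rule changes the denominator of the fraction, so thresholds tuned on unfiltered hits may need to be revisited.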

7. Example Data Flow

Below is an example of the typical input/output organization for one dataset:

Input:
  ├── tax_lineage.txt
  ├── uniprot_id2name.txt
  └── sample1.diamond.out

Command:
  python hgt_classifier_80pct_named_output.py tax_lineage.txt uniprot_id2name.txt sample1.diamond.out > sample1.hgt.tsv

Output:
  ├── sample1.hgt.tsv          # Filtered HGT classification results
  └── stderr log               # Diagnostic and summary information

8. Notes and Best Practices

  • Species names must match exactly, including case, between the two mapping files.
  • The mapping files must cover every UniProt ID in your DIAMOND output and the species each ID maps to.
  • For large datasets, consider parallelization or chunked processing.
  • Always validate a subset of identified HGT candidates using phylogenetic methods before reporting.

9. Citation and License

If you use or adapt this workflow, please cite:

"HGT Rapid Identification Workflow (DIAMOND + UniProt)"
Developed for preliminary detection of horizontal gene transfer events using taxonomic inference.

License: MIT License

For full phylogenetic HGT validation and evolutionary analysis, refer to:

Wang et al., Cell, 2022, "Horizontal gene transfer is a hallmark of animal genome evolution"

10. Acknowledgments

This workflow combines data and tools from:

  • DIAMOND — Ultrafast protein alignment tool
  • UniProtKB — Comprehensive protein sequence database
  • NCBI Taxonomy — Hierarchical organism classification

Developed to support rapid, reproducible, and interpretable HGT screening for comparative genomics and bioinformatics research.