Insect Genome TE Annotation Using HiTE

A step-by-step guide to annotating transposable elements (TEs) in insect genome sequences using HiTE.

Introduction

HiTE (High-throughput TE annotation) is a tool designed to efficiently annotate transposable elements (TEs) in genome sequences. This tutorial provides a step-by-step guide to using HiTE for annotating insect genomes. The process includes the creation of a custom repeat library and genome annotation.

Prerequisites

Before running HiTE, ensure you have the following:

  • Genome file

    A genome sequence in FASTA format (e.g., genome.fa).

  • HiTE tool

    Install the HiTE tool from the HiTE GitHub repository.

  • Repeat library

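    A curated repeat library for insect genomes in FASTA format (e.g., Insecta_ad.fa), which HiTE uses as a trusted library during TE annotation. If you do not have one, step 2 below shows how to build it; this requires a RepeatMasker installation (for famdb.py and Libraries/RepeatMaskerLib.h5).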
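
Before starting a long run it can help to confirm that the genome file really is a readable FASTA file. The following is a minimal sketch, not part of HiTE; the function name check_fasta is illustrative, and genome.fa is the example file name used in this guide.

```shell
# Pre-flight check for a genome FASTA file (sketch; function name is illustrative).
# Prints the number of sequences, or fails if the file is missing or malformed.
check_fasta() (
    fasta=$1
    # The file must exist and be non-empty.
    [ -s "$fasta" ] || { echo "error: $fasta is missing or empty" >&2; exit 1; }
    # The first line of a FASTA file is a header starting with '>'.
    first=$(head -n 1 "$fasta")
    case $first in
        '>'*) ;;
        *) echo "error: $fasta does not start with a FASTA header" >&2; exit 1 ;;
    esac
    # Count header lines, i.e. sequences.
    grep -c '^>' "$fasta"
)

# Example usage:
# check_fasta genome.fa
```

The check only inspects the first line and the header count, so it will not catch problems deeper in the file, such as invalid characters in the sequence lines.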

Workflow Steps

1. Install HiTE

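Clone the HiTE repository and change into its directory. Cloning does not install HiTE's dependencies, so follow the repository README for that afterwards: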

git clone https://github.com/CSU-KangHu/HiTE.git
cd HiTE
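
The run command later in this guide calls a script named HiTE.py, but the entry-point script may be named differently depending on the HiTE version. The sketch below looks for a script in a checkout directory; the candidate names (HiTE.py, main.py) are assumptions rather than something this guide confirms, and the function name find_hite_script is illustrative.

```shell
# Locate the HiTE entry-point script inside a checkout (sketch).
# Assumption: the script is named HiTE.py (as used in this guide) or main.py.
find_hite_script() (
    dir=$1
    for name in HiTE.py main.py; do
        if [ -f "$dir/$name" ]; then
            echo "$dir/$name"
            exit 0
        fi
    done
    echo "error: no entry-point script found in $dir" >&2
    exit 1
)

# Example usage from inside the cloned directory:
# find_hite_script "$PWD"
```

If neither name is found, the README in the checkout is the place to look for the correct invocation.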

2. Creating the Custom Repeat Library (Insecta_ad.fa)

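If you don't have the custom repeat library (Insecta_ad.fa) ready, you can create it using the following steps. The relative paths in the commands (Libraries/ and util/) assume that you run them from the RepeatMasker installation directory.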

2.1. Export Insect Repeat Families from the RepeatMasker Library

Use the famdb.py script to export the insect repeat families from your RepeatMasker library (Libraries/RepeatMaskerLib.h5) in EMBL format:

famdb.py -i Libraries/RepeatMaskerLib.h5 families -f embl -a -d Insecta > Insecta_ad.embl
  • -i Libraries/RepeatMaskerLib.h5: Input RepeatMasker library file.
  • families: The subcommand that retrieves the repeat families associated with a taxon.
  • -f embl: Writes the output in EMBL format.
  • -a: Includes families assigned to ancestors of the given taxon.
  • -d: Includes families assigned to descendants of the given taxon.
  • Insecta: The taxon whose families are retrieved.
  • > Insecta_ad.embl: Redirects the output to the Insecta_ad.embl file.
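
A quick way to confirm that the export produced something is to count the records in the EMBL file. The sketch below relies only on the fact that each EMBL record starts with an "ID" line; the function name count_embl_records is illustrative and not part of RepeatMasker.

```shell
# Count the records in an EMBL-format file (sketch; function name is illustrative).
# Each EMBL record begins with a line starting with "ID".
count_embl_records() (
    embl=$1
    [ -s "$embl" ] || { echo "error: $embl is missing or empty" >&2; exit 1; }
    # grep -c exits non-zero when nothing matches, so keep the count regardless.
    n=$(grep -c '^ID ' "$embl" || true)
    [ "$n" -gt 0 ] || { echo "error: no ID lines found in $embl" >&2; exit 1; }
    echo "$n"
)

# Example usage:
# count_embl_records Insecta_ad.embl
```

A count of zero, or an empty file, usually means the export failed or the taxon name was not recognized, and is worth resolving before converting to FASTA.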

2.2. Convert EMBL File to FASTA Format

After generating the Insecta_ad.embl file, convert it into FASTA format using the buildRMLibFromEMBL.pl script:

util/buildRMLibFromEMBL.pl Insecta_ad.embl >Insecta_ad.fa

This converts the Insecta_ad.embl file into a FASTA format library (Insecta_ad.fa) that HiTE will use for the genome annotation process.
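
To see what the converted library contains, you can tally its sequences by TE class. This is a sketch under the assumption that the FASTA headers follow the RepeatMasker convention ">name#Class/Family"; headers without a "#" are grouped as "Unspecified". The function name tally_library_classes is illustrative.

```shell
# Tally TE classes in a RepeatMasker-style FASTA library (sketch).
# Assumption: headers look like ">name#Class/Family ...".
tally_library_classes() (
    lib=$1
    grep '^>' "$lib" | awk '
        {
            header = $1                      # first whitespace-delimited token
            n = index(header, "#")
            if (n == 0) {
                cls = "Unspecified"          # header carries no classification
            } else {
                cls = substr(header, n + 1)  # text after "#"
                sub(/\/.*/, "", cls)         # drop the "/Family" part
            }
            count[cls]++
        }
        END { for (c in count) print c "\t" count[c] }
    ' | sort
)

# Example usage:
# tally_library_classes Insecta_ad.fa
```

The output is one tab-separated line per class with the number of library sequences in that class.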

3. Running HiTE for Insect Genome Annotation

Now that you have your repeat library ready, you can annotate your insect genome using HiTE.

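Use the following command to run HiTE. The script path assumes HiTE was unpacked to ~/software/HiTE-master; if you cloned it as in step 1, substitute that directory, and confirm the entry-point script name against the README of your HiTE version: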

python ~/software/HiTE-master/HiTE.py --genome genome.fa --thread 40 --outdir out --domain 1 --intact_anno 1 --annotate 1 --plant 0 --curated_lib ~/database/repeat/Insecta_ad.fa

Explanation of the command

  • --genome genome.fa: Specifies the input genome file in FASTA format.
  • --thread 40: Sets the number of threads for parallel processing. Adjust based on your system's resources.
  • --outdir out: Specifies the output directory where the results will be saved (in this case, the out directory).
  • --domain 1: Enables prediction of conserved protein domains in the identified TEs.
  • --intact_anno 1: Enables annotation of intact (full-length) TEs.
  • --annotate 1: Annotates the genome using the TE library that HiTE generates.
  • --plant 0: Specifies that the annotation is for non-plant genomes (set to 0 for insect genomes).
  • --curated_lib ~/database/repeat/Insecta_ad.fa: Provides the path to the Insecta_ad.fa curated repeat library.
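
When the same run is repeated for several genomes, it can be convenient to assemble the command from variables and inspect it before executing. The sketch below only prints the command; it uses exactly the flags shown above, while the function name hite_command, the HITE_SCRIPT variable, and the default of 4 threads are illustrative assumptions.

```shell
# Assemble the HiTE command line from variables (sketch).
# HITE_SCRIPT and the function name are illustrative; the flags are those
# used in this guide. Paths containing spaces are not handled.
hite_command() (
    genome=$1
    outdir=$2
    curated_lib=$3
    threads=${4:-4}    # fall back to 4 threads if none is given
    script=${HITE_SCRIPT:-$HOME/software/HiTE-master/HiTE.py}
    cmd="python $script --genome $genome --thread $threads --outdir $outdir"
    cmd="$cmd --domain 1 --intact_anno 1 --annotate 1 --plant 0"
    cmd="$cmd --curated_lib $curated_lib"
    printf '%s\n' "$cmd"
)

# Example usage (prints the command for review):
# hite_command genome.fa out ~/database/repeat/Insecta_ad.fa 40
```

Printing first and executing second makes it easy to spot a wrong path or thread count before committing to a long run.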

4. Review the Output

After the process completes, the results will be saved in the out directory. The output will include:

  • Annotated genome sequences: With transposable elements (TEs) identified.
  • Detailed reports: About the transposable elements found, their types, and other characteristics.
  • Visualization files: For analyzing the distribution and density of TEs within the genome.
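
One way to get a first overview of the annotation is to count features by type. The sketch below assumes that the output directory contains a tab-separated, nine-column GFF annotation file; the exact file name depends on your HiTE version and is not specified here, so the path in the usage line is hypothetical. The function name count_gff_types is illustrative.

```shell
# Count annotated features per type (column 3) in a GFF file (sketch).
count_gff_types() (
    gff=$1
    awk -F '\t' '
        /^#/ || NF < 9 { next }     # skip comment lines and malformed rows
        { count[$3]++ }             # column 3 holds the feature type
        END { for (t in count) print t "\t" count[t] }
    ' "$gff" | sort
)

# Example usage (hypothetical file name):
# count_gff_types out/annotation.gff
```

Depending on how the annotation was produced, the classification of each element may be stored in the attributes column rather than in the type column, in which case the tally will show only a few generic types.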

5. Further Analysis

Once the HiTE annotation process is complete, you can proceed with further analyses such as:

  • TE Classification: Classifying transposable elements into families and subfamilies based on sequence similarity.
  • TE Insertion Site Analysis: Investigating the distribution and frequency of TE insertions across the genome.
  • Structural Analysis: Analyzing the effect of TEs on genome structure and gene function.
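
As a starting point for looking at how TEs are distributed, the annotated bases can be summed per sequence. This is a rough sketch that again assumes a nine-column GFF annotation file with a hypothetical name; overlapping features are counted more than once, so the totals are an upper bound rather than exact coverage. The function name te_bases_per_sequence is illustrative.

```shell
# Sum annotated bases per sequence from a GFF file (sketch).
# Overlapping features are counted more than once.
te_bases_per_sequence() (
    gff=$1
    awk -F '\t' '
        /^#/ || NF < 9 { next }         # skip comment lines and malformed rows
        { bases[$1] += $5 - $4 + 1 }    # GFF coordinates are 1-based and inclusive
        END { for (s in bases) print s "\t" bases[s] }
    ' "$gff" | sort
)

# Example usage (hypothetical file name):
# te_bases_per_sequence out/annotation.gff
```

Dividing each total by the corresponding sequence length gives an approximate TE fraction per chromosome or scaffold; merging overlapping intervals first would give exact coverage.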
