InsectBase UTR Extraction Workflow
A comprehensive guide to performing transcriptome annotation and UTR extraction for insect species using the PASA pipeline inside a Singularity container.
This workflow has been tested on Linux systems and aims to ensure reproducibility and portability.
Introduction
The purpose of this workflow is to:
- Build accurate transcript structures.
- Infer 5′ and 3′ UTR regions.
- Integrate ab initio predictions with transcript alignments.
- Produce clean, well-structured FASTA files ready for downstream analyses.
1. Input File Preparation
Before starting, make sure you have the following three input files for each species:
File Name | Description |
---|---|
genome.fa | Genome sequence in FASTA format |
merge.gtf | Merged transcript alignment file (contains UTR information) |
augustus.gff3 | ab initio gene prediction results (GFF3 format) |
Tip: Keep all files in the same working directory for easier path binding when running Singularity.
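A quick pre-flight check can catch a missing or empty input before any long-running step starts. The following is a minimal sketch, assuming the three file names from the table above; the function name `missing_inputs` is illustrative and not part of any tool used here.

```python
from pathlib import Path

# File names as listed in the input table above.
REQUIRED_INPUTS = ("genome.fa", "merge.gtf", "augustus.gff3")


def missing_inputs(workdir="."):
    """Return the required input files that are absent or empty in workdir."""
    base = Path(workdir)
    missing = []
    for name in REQUIRED_INPUTS:
        path = base / name
        # Treat zero-byte files as missing, since they usually indicate a failed copy.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_inputs()` in the working directory returns an empty list when all three files are present and non-empty.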
2. Environment Setup
You will need the following tools and dependencies:
Singularity Image Path:
/data/software/pasapipeline.v2.5.3.simg
Required Tools:
- gffread: Extracts transcript sequences from the GTF file.
- AGAT: Extracts UTR regions from annotations.
- Python (with Biopython): Used for sequence renaming and filtering.
Note: Make sure all dependencies are accessible in your $PATH or properly bound within the Singularity environment.
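One way to confirm that the host-side tools are reachable is to look them up on `$PATH` from Python. This is a sketch only: the executable names are taken from the commands in this guide, and the helper names `find_missing_tools` and `biopython_available` are hypothetical.

```python
import importlib.util
import shutil

# Executables invoked directly on the host in this workflow;
# adjust the names if your installation differs.
REQUIRED_TOOLS = ("singularity", "gffread", "agat_sp_extract_sequences.pl")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be located on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def biopython_available():
    """Report whether the Bio package (Biopython) can be imported."""
    return importlib.util.find_spec("Bio") is not None
```

An empty list from `find_missing_tools()` means every listed executable was found. Tools that live only inside the Singularity image (such as seqclean) are not covered by this check.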
3. Workflow Steps
Extract Transcript Sequences (FASTA)
Use gffread to extract transcript sequences from the merged GTF file:
gffread merge.gtf -g genome.fa -w transcripts.fasta
Output: transcripts.fasta (raw transcript sequences).
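gffread can only extract transcripts whose sequence names in merge.gtf also exist in genome.fa, so a mismatch between assembly versions often shows up at this step. The sketch below compares the two sets of names using plain text parsing; the function names are illustrative, and it assumes the FASTA ID is the first whitespace-delimited word of each header.

```python
def fasta_ids(fasta_path):
    """Collect sequence IDs (first word of each header line) from a FASTA file."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def gtf_seqids(gtf_path):
    """Collect the values of column 1 (sequence name) from a GTF or GFF file."""
    ids = set()
    with open(gtf_path) as handle:
        for line in handle:
            # Skip comments and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            ids.add(line.split("\t", 1)[0])
    return ids


def unmatched_seqids(gtf_path, fasta_path):
    """Return GTF sequence names that have no matching FASTA record."""
    return sorted(gtf_seqids(gtf_path) - fasta_ids(fasta_path))
```

For example, `unmatched_seqids("merge.gtf", "genome.fa")` returning an empty list suggests the two files use the same sequence names; it does not prove that the coordinates come from the same assembly.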
Clean Transcript Sequences Using Singularity
Run the PASA seqclean tool inside the Singularity image:
singularity exec -B $PWD:/mnt /data/software/pasapipeline.v2.5.3.simg \
/usr/local/src/PASApipeline/bin/seqclean /mnt/transcripts.fasta
Output: transcripts.fasta.clean (cleaned transcript sequences ready for PASA).
Create Configuration File
Create a file named alignAssembly.config with the following content:
DATABASE=/mnt/pasa.sqlite
TRANSDECODER=/usr/local/src/TransDecoder/TransDecoder.LongOrfs
ALIGNERS=gmap
GMAP_HOME=/usr/local/bin
PASAPATH=/usr/local/src/PASApipeline
MAX_INTRON_LENGTH=50000
Tip: Make sure the paths to TransDecoder, GMAP, and PASApipeline are correct in your environment.
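If you process many species, generating the configuration file from a script keeps the settings identical across runs. The sketch below writes the key=value lines shown above; the values are copied from this guide, and the helper names `render_config` and `write_config` are hypothetical.

```python
from pathlib import Path

# Settings copied verbatim from the alignAssembly.config listing above.
PASA_SETTINGS = {
    "DATABASE": "/mnt/pasa.sqlite",
    "TRANSDECODER": "/usr/local/src/TransDecoder/TransDecoder.LongOrfs",
    "ALIGNERS": "gmap",
    "GMAP_HOME": "/usr/local/bin",
    "PASAPATH": "/usr/local/src/PASApipeline",
    "MAX_INTRON_LENGTH": "50000",
}


def render_config(settings):
    """Render settings as KEY=VALUE lines, one per line."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


def write_config(workdir=".", name="alignAssembly.config", settings=PASA_SETTINGS):
    """Write the configuration file into workdir and return its path."""
    path = Path(workdir) / name
    path.write_text(render_config(settings))
    return path
```

Calling `write_config(name="annotCompare.config")` produces the duplicate used in the annotation comparison step.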
Run PASA for Transcript Structure and UTR/CDS Inference
singularity exec -B $PWD:/mnt /data/software/pasapipeline.v2.5.3.simg \
/usr/local/src/PASApipeline/Launch_PASA_pipeline.pl \
-c /mnt/alignAssembly.config -C -R -T \
-g /mnt/genome.fa \
-t /mnt/transcripts.fasta.clean \
-u /mnt/transcripts.fasta \
--ALIGNERS gmap \
--CPU 16
This step will:
- Build transcript assemblies.
- Infer CDS and UTR structures.
Output: A PASA database (pasa.sqlite) and intermediate annotation results.
Load Ab Initio Annotation (BRAKER Output)
singularity exec -B $PWD:/mnt /data/software/pasapipeline.v2.5.3.simg \
/usr/local/src/PASApipeline/scripts/Load_Current_Gene_Annotations.dbi \
-c /mnt/alignAssembly.config \
-g /mnt/genome.fa \
-P /mnt/augustus.gff3
This command loads the ab initio gene predictions (augustus.gff3, here produced by BRAKER) into the PASA database for integration.
Compare and Integrate Annotation Structures
Duplicate your configuration file and name the copy annotCompare.config. Then execute the following command:
singularity exec -B $PWD:/mnt /data/software/pasapipeline.v2.5.3.simg \
/usr/local/src/PASApipeline/Launch_PASA_pipeline.pl \
-c /mnt/annotCompare.config -A \
-g /mnt/genome.fa \
-t /mnt/transcripts.fasta.clean
Output:
pasa.sqlite.gene_structures_post_PASA_updates.*.gff3
This GFF3 file contains final integrated gene structures including UTR annotations.
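Because the output name contains a variable component (the `*`), repeated runs in the same directory may leave several matching files. The sketch below picks the most recently modified one; treating the newest file as the intended result is an assumption, and `latest_pasa_gff3` is an illustrative name.

```python
import glob
import os


def latest_pasa_gff3(workdir=".", prefix="pasa.sqlite"):
    """Return the newest gene_structures_post_PASA_updates GFF3 file in workdir."""
    pattern = os.path.join(
        glob.escape(workdir),
        f"{prefix}.gene_structures_post_PASA_updates.*.gff3",
    )
    matches = glob.glob(pattern)
    if not matches:
        raise FileNotFoundError(f"No file matches {pattern}")
    # Choose by modification time so that older runs are ignored.
    return max(matches, key=os.path.getmtime)
```

Passing the returned path explicitly to later commands avoids a shell wildcard expanding to more than one file.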
4. Extract UTR Sequences (5′ and 3′)
You can now extract the UTR sequences using AGAT.
Extract 5′ UTR
agat_sp_extract_sequences.pl \
--gff pasa.sqlite.gene_structures_post_PASA_updates.*.gff3 \
--fasta genome.fa \
--type five_prime_UTR \
--output five_utr.fa
Extract 3′ UTR
agat_sp_extract_sequences.pl \
--gff pasa.sqlite.gene_structures_post_PASA_updates.*.gff3 \
--fasta genome.fa \
--type three_prime_UTR \
--output three_utr.fa
Tip: Both outputs will contain UTR sequences in FASTA format for downstream analysis.
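If either output is empty, it is worth checking whether the integrated GFF3 contains the feature types requested with `--type`. The following sketch tallies column 3 of a GFF3 file; it relies only on the standard nine-column tab-separated layout, and `count_feature_types` is an illustrative name.

```python
from collections import Counter


def count_feature_types(gff3_path):
    """Count how often each feature type (column 3) occurs in a GFF3 file."""
    counts = Counter()
    with open(gff3_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            # Only well-formed feature lines have nine tab-separated columns.
            if len(fields) >= 9:
                counts[fields[2]] += 1
    return counts
```

Non-zero counts for `five_prime_UTR` and `three_prime_UTR` indicate that AGAT has features to extract.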
5. Rename and Filter UTR FASTA Files
Use the following Python script to clean and rename the UTR sequences and to discard any sequence shorter than 10 bp.
Save the script as rename_filter_utr.py:
# Requires Biopython (provides Bio.SeqIO).
from Bio import SeqIO
import sys


def process_utr_fasta(input_fasta, output_fasta, utr_type="UTR5P"):
    """Rename UTR records and drop sequences shorter than 10 bp."""
    assert utr_type in ("UTR5P", "UTR3P")
    with open(output_fasta, "w") as out_fh:
        for record in SeqIO.parse(input_fasta, "fasta"):
            seq = str(record.seq).upper()
            # Skip UTRs shorter than 10 bp.
            if len(seq) < 10:
                continue
            # Use the first token of the record ID as the gene identifier.
            gene_id = record.id.split()[0]
            new_id = f"{gene_id}_{utr_type}_1"
            out_fh.write(f">{new_id}\n")
            # Wrap the sequence at 60 characters per line.
            for i in range(0, len(seq), 60):
                out_fh.write(seq[i:i+60] + "\n")


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: python rename_filter_utr.py <input.fa> <output.fa> <UTR5P|UTR3P>")
        sys.exit(1)
    process_utr_fasta(sys.argv[1], sys.argv[2], sys.argv[3])
Example Usage
python rename_filter_utr.py five_utr.fa five_utr_named.fa UTR5P
python rename_filter_utr.py three_utr.fa three_utr_named.fa UTR3P
Output Files:
- five_utr_named.fa: Filtered and renamed 5′ UTR sequences.
- three_utr_named.fa: Filtered and renamed 3′ UTR sequences.
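The renaming script appends the fixed suffix `_1` to every record, so two input records that share the same ID would end up with identical headers. Whether that happens depends on how the extracted records are named in your data, so a quick check may be worthwhile. The sketch below is one possible check using only the standard library; `duplicate_headers` is an illustrative name.

```python
from collections import Counter


def duplicate_headers(fasta_path):
    """Return FASTA headers that occur more than once, with their counts."""
    counts = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                counts[line[1:].strip()] += 1
    return {header: n for header, n in counts.items() if n > 1}
```

An empty dictionary from `duplicate_headers("five_utr_named.fa")` means every header in the file is unique.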
6. Final Output Files
File Name | Description |
---|---|
five_utr_named.fa | Contains 5′ UTR sequences, formatted as >GeneID_UTR5P_1 |
three_utr_named.fa | Contains 3′ UTR sequences, formatted as >GeneID_UTR3P_1 |
pasa.sqlite.gene_structures_post_PASA_updates.*.gff3 | Final integrated annotation (includes UTR/CDS) |
7. Notes and Recommendations
- Always verify that your genome and GTF files are from the same assembly version.
- For large genomes, increase the CPU count and memory allocation for PASA.
- Keep backups of intermediate files (pasa.sqlite, .gff3) for troubleshooting or reuse.
- All steps can be wrapped in a shell script for reproducibility.
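As an alternative to a shell script, the container calls can be assembled in Python so that each step is logged and stops on failure. The sketch below only builds and optionally runs the `singularity exec` prefix used throughout this guide; the image path and bind target are taken from the commands above, while `singularity_cmd` and `run_step` are hypothetical helper names.

```python
import shlex
import subprocess

# Image path as given in the Environment Setup section.
IMAGE = "/data/software/pasapipeline.v2.5.3.simg"


def singularity_cmd(workdir, *args, image=IMAGE):
    """Build a 'singularity exec' argument list that binds workdir to /mnt."""
    return ["singularity", "exec", "-B", f"{workdir}:/mnt", image, *args]


def run_step(cmd, dry_run=False):
    """Print the command and, unless dry_run is set, run it and fail on error."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
```

For instance, `run_step(singularity_cmd("/path/to/workdir", "/usr/local/src/PASApipeline/bin/seqclean", "/mnt/transcripts.fasta"), dry_run=True)` prints the seqclean command without executing it; the working directory shown is a placeholder.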
Citation and Credits
If you use this workflow in your research, please cite:
- PASA Pipeline (v2.5.3): Program to Assemble Spliced Alignments
- AGAT Toolkit: Another GFF Analysis Toolkit