InsectBase UTR Extraction Workflow

A comprehensive guide to performing transcriptome annotation and UTR extraction for insect species using the PASA pipeline inside a Singularity container.

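This workflow has been tested on Linux systems. Running PASA from a Singularity image keeps the software environment fixed, which makes the results easier to reproduce and the workflow easier to move between machines.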

Introduction

The purpose of this workflow is to:

  • Build accurate transcript structures.
  • Infer 5′ and 3′ UTR regions.
  • Integrate ab initio predictions with transcript alignments.
  • Produce clean, well-structured FASTA files ready for downstream analyses.

1. Input File Preparation

Before starting, make sure you have the following three input files for each species:

File Name        Description
genome.fa        Genome sequence in FASTA format
merge.gtf        Merged transcript alignment file (contains UTR information)
augustus.gff3    Ab initio gene prediction results (GFF3 format)

Tip: Keep all files in the same working directory for easier path binding when running Singularity.
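A quick pre-flight check can confirm that the three input files are present before any long-running step starts. The following is a minimal sketch, assuming the file names from the table above; the function name missing_inputs is illustrative and not part of any tool used here.

```python
from pathlib import Path

# File names as listed in the input table; adjust if yours differ.
REQUIRED_INPUTS = ("genome.fa", "merge.gtf", "augustus.gff3")


def missing_inputs(workdir="."):
    """Return the required input files that are absent or empty in workdir."""
    base = Path(workdir)
    missing = []
    for name in REQUIRED_INPUTS:
        path = base / name
        # Treat zero-byte files as missing: they usually indicate a failed copy.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling missing_inputs() in the working directory returns an empty list when everything is in place, and the names of the problem files otherwise.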

2. Environment Setup

You will need the following tools and dependencies:

Singularity Image Path:

/data/software/pasapipeline.v2.5.3.simg

Required Tools:

  • gffread — Extract transcript sequences from GTF.
  • AGAT — Extract UTR regions from annotations.
  • Python — Used for sequence renaming and filtering.

Note: Make sure all dependencies are accessible in your $PATH or properly bound within the Singularity environment.
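One way to verify the note above is to look up each executable on $PATH from Python. This is a sketch only: the default tool list is taken from the commands used later in this guide, and the helper names find_missing_tools and has_biopython are hypothetical.

```python
import importlib.util
import shutil


def find_missing_tools(tools=("singularity", "gffread", "agat_sp_extract_sequences.pl")):
    """Return the executables from `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def has_biopython():
    """Report whether the Bio package (Biopython) can be imported."""
    return importlib.util.find_spec("Bio") is not None
```

An empty list from find_missing_tools() and True from has_biopython() suggest the host environment is ready; tools that live only inside the Singularity image will not be found this way.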

3. Workflow Steps

Step 1: Extract Transcript Sequences (FASTA)

Use gffread to extract transcript sequences from the merged GTF file:

gffread merge.gtf -g genome.fa -w transcripts.fasta

Output: transcripts.fasta — raw transcript sequences.
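If transcripts.fasta comes out empty or smaller than expected, a mismatch between the sequence names in merge.gtf and the headers of genome.fa (for example chr1 versus 1) is one likely cause. The sketch below compares the two sets of names; the function names are illustrative, and the parsing assumes a tab-delimited GTF with the sequence name in the first column.

```python
def fasta_seqids(path):
    """Collect the first token of every FASTA header line."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def gtf_seqids(path):
    """Collect the first column of every non-comment GTF line."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            ids.add(line.split("\t", 1)[0])
    return ids


def unmatched_seqids(gtf_path, fasta_path):
    """Return GTF sequence names that have no matching FASTA header."""
    return sorted(gtf_seqids(gtf_path) - fasta_seqids(fasta_path))
```

An empty list from unmatched_seqids("merge.gtf", "genome.fa") means every annotated sequence name exists in the genome file.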

Step 2: Clean Transcript Sequences Using Singularity

Run the PASA seqclean tool inside the Singularity image:

singularity exec -B $PWD:/mnt /data/software/pasapipeline.v2.5.3.simg \
  /usr/local/src/PASApipeline/bin/seqclean /mnt/transcripts.fasta

Output: transcripts.fasta.clean — cleaned transcript sequences ready for PASA.

Step 3: Create Configuration File

Create a file named alignAssembly.config with the following content:

DATABASE=/mnt/pasa.sqlite
TRANSDECODER=/usr/local/src/TransDecoder/TransDecoder.LongOrfs
ALIGNERS=gmap
GMAP_HOME=/usr/local/bin
PASAPATH=/usr/local/src/PASApipeline
MAX_INTRON_LENGTH=50000

Tip: Make sure paths to TransDecoder, GMAP, and PASApipeline are correct in your environment.

Step 4: Run PASA for Transcript Structure and UTR/CDS Inference

singularity exec -B $PWD:/mnt /data/software/pasapipeline.v2.5.3.simg \
  /usr/local/src/PASApipeline/Launch_PASA_pipeline.pl \
  -c /mnt/alignAssembly.config -C -R -T \
  -g /mnt/genome.fa \
  -t /mnt/transcripts.fasta.clean \
  -u /mnt/transcripts.fasta \
  --ALIGNERS gmap \
  --CPU 16

This step will:

  • Build transcript assemblies.
  • Infer CDS and UTR structures.

Output: A PASA database (pasa.sqlite) and intermediate annotation results.

Step 5: Load Ab Initio Annotation (BRAKER Output)

singularity exec -B $PWD:/mnt /data/software/pasapipeline.v2.5.3.simg \
  /usr/local/src/PASApipeline/scripts/Load_Current_Gene_Annotations.dbi \
  -c /mnt/alignAssembly.config \
  -g /mnt/genome.fa \
  -P /mnt/augustus.gff3

This command loads the BRAKER ab initio gene predictions (augustus.gff3) into the PASA database so that they can be compared with the transcript assemblies in the next step.

Step 6: Compare and Integrate Annotation Structures

Duplicate your configuration file and rename it as annotCompare.config. Then execute the following command:

singularity exec -B $PWD:/mnt /data/software/pasapipeline.v2.5.3.simg \
  /usr/local/src/PASApipeline/Launch_PASA_pipeline.pl \
  -c /mnt/annotCompare.config -A \
  -g /mnt/genome.fa \
  -t /mnt/transcripts.fasta.clean

Output:

pasa.sqlite.gene_structures_post_PASA_updates.*.gff3

This GFF3 file contains final integrated gene structures including UTR annotations.
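Because the output name contains a wildcard, repeated runs can leave several matching files in the working directory. The sketch below picks the most recently modified one; the function name latest_pasa_gff3 and the choice of "newest wins" are assumptions made for illustration, not PASA behaviour.

```python
import glob
import os


def latest_pasa_gff3(workdir=".", db_name="pasa.sqlite"):
    """Return the newest *.gene_structures_post_PASA_updates.*.gff3 file."""
    pattern = os.path.join(
        workdir, f"{db_name}.gene_structures_post_PASA_updates.*.gff3"
    )
    matches = glob.glob(pattern)
    if not matches:
        raise FileNotFoundError(f"No file matches {pattern}")
    # Select by modification time so that the latest run is used.
    return max(matches, key=os.path.getmtime)
```

Passing the returned path explicitly to later commands avoids the shell expanding the wildcard to more than one file.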

4. Extract UTR Sequences (5′ and 3′)

You can now extract the UTR sequences using AGAT.

Extract 5′ UTR

agat_sp_extract_sequences.pl \
  --gff pasa.sqlite.gene_structures_post_PASA_updates.*.gff3 \
  --fasta genome.fa \
  --type five_prime_UTR \
  --output five_utr.fa

Extract 3′ UTR

agat_sp_extract_sequences.pl \
  --gff pasa.sqlite.gene_structures_post_PASA_updates.*.gff3 \
  --fasta genome.fa \
  --type three_prime_UTR \
  --output three_utr.fa

Tip: Both outputs will contain UTR sequences in FASTA format for downstream analysis.
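Before moving on, it can help to look at how many UTR records were extracted and how long they are. The following sketch parses a FASTA file with plain Python and reports a few summary values; the function names and the default threshold of 10 bp are illustrative choices.

```python
import statistics


def read_fasta_lengths(path):
    """Return a list of (record_id, sequence_length) pairs from a FASTA file."""
    records = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                fields = line[1:].split()
                # Keep only the first whitespace-delimited token as the ID.
                records.append([fields[0] if fields else "", 0])
            elif line and records:
                records[-1][1] += len(line)
    return [(rec_id, length) for rec_id, length in records]


def summarize_lengths(records, min_len=10):
    """Summarize record count, short records, median and maximum length."""
    lengths = [length for _, length in records]
    return {
        "count": len(lengths),
        "below_min": sum(1 for n in lengths if n < min_len),
        "median": statistics.median(lengths) if lengths else 0,
        "max": max(lengths, default=0),
    }
```

For example, summarize_lengths(read_fasta_lengths("five_utr.fa")) gives a rough idea of how many sequences fall under a chosen length cutoff.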

5. Rename and Filter UTR FASTA Files

Use the following Python script to convert the UTR sequences to uppercase, discard sequences shorter than 10 bp, and rename the remaining records.

Save the script as rename_filter_utr.py:

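from Bio import SeqIO
import sys

VALID_UTR_TYPES = ("UTR5P", "UTR3P")
MIN_LENGTH = 10   # sequences shorter than this (in bp) are dropped
LINE_WIDTH = 60   # bases per FASTA line

def process_utr_fasta(input_fasta, output_fasta, utr_type="UTR5P"):
    """Uppercase, length-filter, and rename the records of a UTR FASTA file."""
    # Raise explicitly instead of using assert, which is skipped under python -O.
    if utr_type not in VALID_UTR_TYPES:
        raise ValueError(f"utr_type must be one of {VALID_UTR_TYPES}, got {utr_type!r}")
    with open(output_fasta, "w") as out_fh:
        for record in SeqIO.parse(input_fasta, "fasta"):
            seq = str(record.seq).upper()
            if len(seq) < MIN_LENGTH:
                continue
            # record.id is the first whitespace-delimited token of the header.
            new_id = f"{record.id}_{utr_type}_1"
            out_fh.write(f">{new_id}\n")
            # Wrap the sequence at LINE_WIDTH bases per line.
            for i in range(0, len(seq), LINE_WIDTH):
                out_fh.write(seq[i:i + LINE_WIDTH] + "\n")

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: python rename_filter_utr.py <input.fa> <output.fa> <UTR5P|UTR3P>")
        sys.exit(1)
    process_utr_fasta(sys.argv[1], sys.argv[2], sys.argv[3])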

Example Usage

python rename_filter_utr.py five_utr.fa five_utr_named.fa UTR5P
python rename_filter_utr.py three_utr.fa three_utr_named.fa UTR3P

Output Files:

  • five_utr_named.fa — Filtered and renamed 5′ UTR sequences.
  • three_utr_named.fa — Filtered and renamed 3′ UTR sequences.
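The renaming script appends a fixed _1 suffix, so two input records that share an identifier would end up with identical headers. The sketch below reports any repeated header in a FASTA file; the function name duplicate_headers is illustrative.

```python
from collections import Counter


def duplicate_headers(path):
    """Return {header_id: count} for header IDs that occur more than once."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                counts[fields[0] if fields else ""] += 1
    return {rec_id: n for rec_id, n in counts.items() if n > 1}
```

An empty dictionary from duplicate_headers("five_utr_named.fa") indicates that every header is unique; many downstream tools assume this.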

6. Final Output Files

File Name                                               Description
five_utr_named.fa                                       5′ UTR sequences, headers formatted as >GeneID_UTR5P_1
three_utr_named.fa                                      3′ UTR sequences, headers formatted as >GeneID_UTR3P_1
pasa.sqlite.gene_structures_post_PASA_updates.*.gff3    Final integrated annotation (includes UTR/CDS)

7. Notes and Recommendations

  • Always verify that your genome and GTF files are from the same assembly version.
  • For large genomes, increase the CPU count and memory allocation for PASA.
  • Keep backups of intermediate files (pasa.sqlite, .gff3) for troubleshooting or reuse.
  • All steps can be wrapped in a shell script for reproducibility.
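As one possible way to wrap the steps, the sketch below assembles the commands from Steps 1, 2, 4, 5, and 6 as argument lists and either prints or runs them. It is written in Python rather than shell to match the renaming script. The image path and PASA paths are copied from this guide; the function names and the dry_run switch are assumptions for illustration, and the configuration files from Step 3 and Step 6 must already exist.

```python
import os
import shlex
import subprocess

IMAGE = "/data/software/pasapipeline.v2.5.3.simg"
PASA_HOME = "/usr/local/src/PASApipeline"


def singularity_cmd(*args, workdir=None):
    """Prefix a command with `singularity exec`, binding workdir to /mnt."""
    workdir = workdir or os.getcwd()
    return ["singularity", "exec", "-B", f"{workdir}:/mnt", IMAGE, *args]


def build_commands(workdir=None, cpu=16):
    """Return the workflow commands in execution order."""
    launch = f"{PASA_HOME}/Launch_PASA_pipeline.pl"
    return [
        ["gffread", "merge.gtf", "-g", "genome.fa", "-w", "transcripts.fasta"],
        singularity_cmd(f"{PASA_HOME}/bin/seqclean", "/mnt/transcripts.fasta",
                        workdir=workdir),
        singularity_cmd(launch, "-c", "/mnt/alignAssembly.config",
                        "-C", "-R", "-T", "-g", "/mnt/genome.fa",
                        "-t", "/mnt/transcripts.fasta.clean",
                        "-u", "/mnt/transcripts.fasta",
                        "--ALIGNERS", "gmap", "--CPU", str(cpu),
                        workdir=workdir),
        singularity_cmd(f"{PASA_HOME}/scripts/Load_Current_Gene_Annotations.dbi",
                        "-c", "/mnt/alignAssembly.config",
                        "-g", "/mnt/genome.fa", "-P", "/mnt/augustus.gff3",
                        workdir=workdir),
        singularity_cmd(launch, "-c", "/mnt/annotCompare.config", "-A",
                        "-g", "/mnt/genome.fa",
                        "-t", "/mnt/transcripts.fasta.clean",
                        workdir=workdir),
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it as well unless dry_run is True."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the workflow at the first failing step.
            subprocess.run(cmd, check=True)
```

Calling run_all(build_commands()) only prints the commands, which makes it easy to review them before running with dry_run=False.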

Citation and Credits

If you use this workflow in your research, please cite:

  • PASA Pipeline (v2.5.3): Program to Assemble Spliced Alignments
  • AGAT Toolkit: Another GFF Analysis Toolkit