Transcriptome Data Processing and Integration

A guide to processing and integrating transcriptome data, following the methodology adopted by the InsectBase database team.

1. Overview

A total of 61,421 transcriptomes from 4,150 species were retrieved from the NCBI Sequence Read Archive (SRA). After extensive manual curation (checking metadata completeness, untreated experimental conditions, and valid reference genomes), a manually validated dataset of 18,999 transcriptomes from 397 species (including 106 newly added species) was integrated into the final release.

All raw reads were processed using the following pipeline:

| Step | Tool | Version |
|---|---|---|
| Quality filtering and adapter trimming | fastp | 0.23.4 |
| Read mapping to reference genomes | HISAT2 | 2.2.1 |
| Transcript assembly | StringTie2 | 2.2.1 |

Unless otherwise specified, default parameters were used for each software package. All commands, configuration files, and metadata schemas are archived in the project's public GitHub repository for full reproducibility.

2. Metadata Standards

Each transcriptome was accepted only if it satisfied the following criteria:

  1. Complete metadata: period (developmental stage) and tissue descriptors present.
  2. Untreated condition: only unmanipulated control samples were included.
  3. Reliable reference genome: species-level genome available and traceable.
  4. Accessible technical metadata: sequencing platform, read length, and layout documented.
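
The four criteria can be read as a simple filter over sample records. The sketch below is illustrative only: the dictionary keys (`period`, `tissue`, `treatment`, `genome_accession`, `platform`, `read_length`, `layout`) and the `"control"` label are assumptions made for this example and are not taken from the InsectBase schema.

```python
# Hypothetical field names; the real InsectBase schema may differ.
DESCRIPTOR_FIELDS = ("period", "tissue")
TECHNICAL_FIELDS = ("platform", "read_length", "layout")


def _present(record, field):
    """True when the field exists and holds a non-blank value."""
    value = record.get(field)
    return value is not None and bool(str(value).strip())


def failed_criteria(record):
    """Return the names of the acceptance criteria that a record fails."""
    failures = []
    if not all(_present(record, f) for f in DESCRIPTOR_FIELDS):
        failures.append("complete_metadata")
    # Assumption: curators label untreated samples as "control".
    if str(record.get("treatment") or "").strip().lower() != "control":
        failures.append("untreated_condition")
    if not _present(record, "genome_accession"):
        failures.append("reference_genome")
    if not all(_present(record, f) for f in TECHNICAL_FIELDS):
        failures.append("technical_metadata")
    return failures
```

A record is accepted when `failed_criteria(record)` returns an empty list; returning the failed criteria rather than a bare boolean keeps the reason for rejection available for the curation log.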

Core metadata fields:

| Field Name | Description | Example (Blattella germanica) |
|---|---|---|
| ScientificName | Scientific name of the organism | Blattella germanica |
| Taxid | NCBI Taxonomy ID | 6973 |
| Order | Taxonomic order of the species | Blattodea |
| Family | Taxonomic family | Ectobiidae |
| Genus | Taxonomic genus | Blattella |
| Run | NCBI SRA Run accession number | SRR26627362 |
| LibraryStrategy | Sequencing strategy used in the experiment | RNA-Seq |
| LibrarySource | Source type of the sequencing library | Transcriptomic |
| LibraryLayout | Sequencing layout (single-end or paired-end) | Paired |
| Platform | Sequencing platform used for data generation | Illumina |

3. Controlled Vocabulary (CV) Curation

To eliminate inconsistencies, all Tissue and Period descriptors were standardized using a Controlled Vocabulary (CV) and Synonym Mapping.

Normalization Rules

  1. Handling Missing or Incomplete Metadata

    Missing information was classified into three categories to preserve interpretability:

    • Not available: metadata completely absent in the original record;
    • Not applicable: attribute biologically irrelevant for the sample (e.g., "Sex" for certain species);
    • Not determined: information present but uncertain or inconsistent.

    Samples lacking essential descriptors were flagged with Is_curated = FALSE, allowing users to easily distinguish uncurated entries in the database interface.

  2. Primary and Secondary Curation Standards

    A two-tier structure was implemented to balance generalization and information retention:

    • Primary metadata (Curated_*) provides standardized categories, reducing heterogeneity across studies. For example, all gut-derived samples are grouped under "Digestive" in Curated_Tissue.
    • Secondary metadata (Curated_*_second) retains detailed contextual information from original submissions, such as "Midgut", "Hindgut", or "Posterior midgut".

    This dual annotation system enables both cross-species comparisons and fine-grained biological interpretation.

  3. Quality Control and Ongoing Monitoring

    Automated validation checks are applied during data import to verify taxonomy alignment (via NCBI Taxonomy), controlled vocabulary compliance, and completeness of required fields. Curators review flagged entries quarterly to keep terminology consistent, traceable, and accurate as new transcriptomes are added to the database.
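
Rules 1 and 2 can be combined in a small lookup. The following is a minimal sketch, not the InsectBase implementation: the synonym table holds only the gut examples mentioned in the text, and treating an unmapped term as "Not determined" is an assumption made for illustration.

```python
# Illustrative synonym table: normalized raw term -> (primary, secondary).
# Only the "Digestive" examples come from the text; a real table is far larger.
TISSUE_SYNONYMS = {
    "midgut": ("Digestive", "Midgut"),
    "hindgut": ("Digestive", "Hindgut"),
    "posterior midgut": ("Digestive", "Posterior midgut"),
}

NOT_AVAILABLE = "Not available"
NOT_DETERMINED = "Not determined"


def _result(primary, secondary, is_curated):
    return {
        "Curated_Tissue": primary,
        "Curated_Tissue_second": secondary,
        "Is_curated": is_curated,
    }


def curate_tissue(raw):
    """Map a raw tissue string to primary/secondary terms plus a curation flag."""
    if raw is None or not raw.strip():
        # Metadata completely absent in the original record.
        return _result(NOT_AVAILABLE, NOT_AVAILABLE, False)
    key = " ".join(raw.lower().split())  # normalize case and whitespace
    if key not in TISSUE_SYNONYMS:
        # Assumption: a term that cannot be resolved is "Not determined".
        return _result(NOT_DETERMINED, NOT_DETERMINED, False)
    primary, secondary = TISSUE_SYNONYMS[key]
    return _result(primary, secondary, True)
```

For example, `curate_tissue("Posterior  MIDGUT")` yields the primary term "Digestive" and the secondary term "Posterior midgut", so the sample can be grouped with other gut samples without losing the submitter's detail.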

Controlled Vocabulary Categories

The following standardized categories were established for primary metadata fields:

Curated_Sex Categories

  • Male
  • Female
  • Mixed
  • Asexual
  • Intersex
  • Neuter
  • Sterile female

Curated_Age Categories

  • Adult
  • Egg
  • Nymph
  • Larva and Adult
  • Larva
  • Mixed
  • Pupa
  • Cell
  • Egg and Pupa
  • Larva and Pupa

Curated_Tissue Categories

  • Digestive
  • Egg
  • Fat body
  • Hemolymph
  • Excretory
  • Gland
  • Whole body
  • Head and Thorax
  • Ovary
  • Mixed
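
A controlled-vocabulary compliance check over these categories might look like the sketch below. It uses only the categories listed above; accepting the three missing-value labels in every field is an assumption, and the function name is hypothetical.

```python
CONTROLLED_VOCABULARY = {
    "Curated_Sex": {
        "Male", "Female", "Mixed", "Asexual", "Intersex", "Neuter",
        "Sterile female",
    },
    "Curated_Age": {
        "Adult", "Egg", "Nymph", "Larva and Adult", "Larva", "Mixed",
        "Pupa", "Cell", "Egg and Pupa", "Larva and Pupa",
    },
    "Curated_Tissue": {
        "Digestive", "Egg", "Fat body", "Hemolymph", "Excretory", "Gland",
        "Whole body", "Head and Thorax", "Ovary", "Mixed",
    },
}

# Assumption: the missing-value labels are valid in any primary field.
MISSING_LABELS = {"Not available", "Not applicable", "Not determined"}


def vocabulary_violations(record):
    """Return {field: value} for each primary field outside the vocabulary."""
    violations = {}
    for field, allowed in CONTROLLED_VOCABULARY.items():
        value = record.get(field)  # None when the field is missing entirely
        if value not in allowed and value not in MISSING_LABELS:
            violations[field] = value
    return violations
```

The comparison is deliberately case-sensitive, so a value such as "female" is reported rather than silently accepted; normalization belongs in the synonym-mapping step, not in the check.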

4. Data Acquisition

Transcriptome reads were downloaded via the NCBI SRA Toolkit or the ENA API.

Example (SRA Toolkit):

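prefetch SRR16388593
fasterq-dump SRR16388593 -e 8 -O ./fastq/
# fasterq-dump writes uncompressed FASTQ; compress to match the .fastq.gz inputs used in preprocessing
gzip fastq/SRR16388593_*.fastq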

Directory Structure:

project/
 ├── metadata/
 ├── fastq/
 ├── qc/
 ├── asm/
 ├── bam/
 ├── expr/
 └── logs/
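
The layout above can be created in one idempotent step. This is a small sketch; the function name `create_layout` is hypothetical.

```python
from pathlib import Path

SUBDIRS = ("metadata", "fastq", "qc", "asm", "bam", "expr", "logs")


def create_layout(root):
    """Create the project tree under `root`; existing directories are kept."""
    root = Path(root)
    created = []
    for name in SUBDIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_layout("project")` a second time is harmless, which makes it safe to run at the start of every pipeline invocation.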

5. Preprocessing (fastp v0.23.4)

Raw reads were quality filtered and adapter-trimmed using fastp. Default settings were applied except where explicitly stated.

Command example:

fastp \
  -i fastq/SRR16388593_1.fastq.gz \
  -I fastq/SRR16388593_2.fastq.gz \
  -o fastq/SRR16388593.clean_1.fq.gz \
  -O fastq/SRR16388593.clean_2.fq.gz \
  -w 8 \
  -j qc/SRR16388593.fastp.json \
  -h qc/SRR16388593.fastp.html
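
When many runs are processed, the same command can be generated per accession. The sketch below simply mirrors the flags and file-naming pattern of the example above; the helper name `fastp_command` is hypothetical.

```python
def fastp_command(run, threads=8):
    """Build the fastp argument list for one paired-end run accession."""
    return [
        "fastp",
        "-i", f"fastq/{run}_1.fastq.gz",
        "-I", f"fastq/{run}_2.fastq.gz",
        "-o", f"fastq/{run}.clean_1.fq.gz",
        "-O", f"fastq/{run}.clean_2.fq.gz",
        "-w", str(threads),
        "-j", f"qc/{run}.fastp.json",
        "-h", f"qc/{run}.fastp.html",
    ]
```

The list can be passed to `subprocess.run(fastp_command("SRR16388593"), check=True)`; building an argument list instead of a shell string avoids quoting problems with unusual file names.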

Quality thresholds:

  • ≥80% of reads retained after filtering
  • ≥85% of bases at Q30 or above
  • Samples with extreme adapter contamination or duplication are flagged for review
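
The two numeric thresholds can be checked directly against the JSON report that fastp writes with `-j`. This is a sketch under one stated assumption: the Q30 rate is evaluated on post-filtering reads, which the text does not specify. The function names are hypothetical.

```python
import json

MIN_RETAINED = 0.80  # fraction of reads kept after filtering
MIN_Q30 = 0.85       # fraction of bases at Q30 or above


def check_fastp_report(report):
    """Evaluate a parsed fastp JSON report against the two thresholds."""
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    total = before["total_reads"]
    retained = after["total_reads"] / total if total else 0.0
    # Assumption: Q30 is judged on the filtered reads.
    q30 = after["q30_rate"]
    return {
        "retained_fraction": retained,
        "q30_rate": q30,
        "passed": retained >= MIN_RETAINED and q30 >= MIN_Q30,
    }


def check_fastp_file(path):
    """Load a fastp JSON report from disk and evaluate it."""
    with open(path) as handle:
        return check_fastp_report(json.load(handle))
```

For the example sample, `check_fastp_file("qc/SRR16388593.fastp.json")` returns the retained fraction, the Q30 rate, and a pass flag. Adapter contamination and duplication are left to manual review, as in the text.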

6. Reference Genome Indexing (HISAT2)

Reference genomes were selected to match the genome versions displayed on the InsectBase website. An index was built once per genome version:

hisat2-build -p 8 ref/Aalb_v2.fa index/Aalb_v2

7. Read Alignment (HISAT2 v2.2.1)

Paired-end reads were aligned to reference genomes using HISAT2. Strand specificity was inferred with infer_experiment.py (RSeQC) where applicable.

Example:

# Step 1: align reads and write a coordinate-sorted BAM
hisat2 -p 8 -x index/Aalb_v2 \
  -1 fastq/SRR16388593.clean_1.fq.gz \
  -2 fastq/SRR16388593.clean_2.fq.gz \
| samtools sort -@ 8 -o bam/SRR16388593.sorted.bam -
# Step 2: index the sorted BAM
samtools index bam/SRR16388593.sorted.bam

Recommended metrics:

  • Overall alignment rate ≥70%
  • Unique alignment rate ≥50%
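
HISAT2 prints its alignment summary to standard error, so these metrics can be checked automatically if that text is saved (for instance by redirecting stderr to a file under logs/, which is an assumption here). The sketch below parses the default paired-end summary; equating the "unique alignment rate" with the percentage of pairs that aligned concordantly exactly once is also an assumption, since the text does not define it.

```python
import re

OVERALL_RE = re.compile(r"([\d.]+)% overall alignment rate")
UNIQUE_RE = re.compile(r"\(([\d.]+)%\) aligned concordantly exactly 1 time")


def check_hisat2_summary(text, min_overall=70.0, min_unique=50.0):
    """Parse a paired-end HISAT2 summary and compare it with the thresholds."""
    overall = OVERALL_RE.search(text)
    unique = UNIQUE_RE.search(text)
    if overall is None or unique is None:
        raise ValueError("not a paired-end HISAT2 alignment summary")
    overall_rate = float(overall.group(1))
    # Assumption: "unique" means pairs aligned concordantly exactly once.
    unique_rate = float(unique.group(1))
    return {
        "overall_rate": overall_rate,
        "unique_rate": unique_rate,
        "passed": overall_rate >= min_overall and unique_rate >= min_unique,
    }
```

Raising an error on unrecognized input, rather than returning a failed result, keeps a missing or truncated log from being mistaken for a poorly aligned sample.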

8. Transcript Assembly (StringTie2 v2.2.1)

Transcript assembly and quantification were performed per sample using StringTie2.

Example:

stringtie bam/SRR16388593.sorted.bam -p 8 \
  -G ref/Aalb_v2.gtf \
  -o asm/SRR16388593.gtf

Merged assemblies were optionally generated for novel transcript discovery; asm/list.txt lists the per-sample GTF files to merge, one path per line:

stringtie --merge -p 8 -G ref/Aalb_v2.gtf -o asm/merged.gtf asm/list.txt
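
The list file passed to `stringtie --merge` is a plain text file with one GTF path per line. A minimal sketch for generating it from the per-sample assemblies follows; the function name is hypothetical, and excluding a previous `merged.gtf` from the list is an assumption based on the output name used above.

```python
from pathlib import Path


def write_merge_list(asm_dir, list_name="list.txt", merged_name="merged.gtf"):
    """Write the per-sample GTF paths in `asm_dir` to a list file, one per line."""
    asm_dir = Path(asm_dir)
    # Skip an earlier merged assembly so it is not merged into itself.
    gtfs = sorted(p for p in asm_dir.glob("*.gtf") if p.name != merged_name)
    list_path = asm_dir / list_name
    list_path.write_text("".join(f"{p.as_posix()}\n" for p in gtfs))
    return list_path
```

Running `write_merge_list("asm")` from the project root produces `asm/list.txt` with paths such as `asm/SRR16388593.gtf`, matching the merge command above.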

9. Data Preservation and Accessibility

  • Primary transcriptome assemblies are archived in certified data repositories to ensure long-term accessibility and DOI-based citation.
  • Derived datasets (e.g., expression matrices, merged GTFs) are versioned and accompanied by metadata and workflow logs.

10. References