Transcriptome Data Processing and Integration

A comprehensive guide to processing and integrating transcriptome data using the methodology adopted by the InsectBase database team.

1. Overview

A total of 61,421 transcriptomes from 4,150 species were retrieved from the NCBI Sequence Read Archive (SRA). After extensive manual curation (checking metadata completeness, untreated experimental conditions, and valid reference genomes), an additional, manually validated dataset of 18,999 transcripts from 397 species (including 106 newly added species) was integrated into the final release.

All raw reads were processed using the following pipeline:

Step	Tool	Version
Quality filtering and adapter trimming	fastp	0.23.4
Read mapping to reference genomes	HISAT2	2.2.1
Transcript assembly	StringTie2	2.2.1

Unless otherwise specified, default parameters were used for each software package. All commands, configuration files, and metadata schemas are archived in the project's public GitHub repository for full reproducibility.

2. Metadata Standards

Each transcriptome was accepted only if it satisfied the following criteria:

Complete metadata: period and tissue descriptors present.
Untreated condition: only unmanipulated control samples were included.
Reliable reference genome: species-level genome available and traceable.
Accessible technical metadata: sequencing platform, read length, and layout documented.

Core metadata fields:

Field Name	Description	Example (Blattella germanica)
ScientificName	Scientific name of the organism	Blattella germanica
Taxid	NCBI Taxonomy ID	6973
Order	Taxonomic order of the species	Blattodea
Family	Taxonomic family	Ectobiidae
Genus	Taxonomic genus	Blattella
Run	NCBI SRA Run accession number	SRR26627362
LibraryStrategy	Sequencing strategy used in the experiment	RNA-Seq
LibrarySource	Source type of the sequencing library	Transcriptomic
LibraryLayout	Sequencing layout (single-end or paired-end)	Paired
Platform	Sequencing platform used for data generation	Illumina
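
The completeness criterion can be checked mechanically before curation. The following is a minimal sketch, assuming the metadata are stored as a tab-separated file with a header row that uses the field names above. The function name check_required_fields and the file layout are illustrative assumptions, not part of the InsectBase pipeline.

```shell
# Hypothetical helper: report rows of a TSV whose required columns are empty.
# Usage: check_required_fields metadata.tsv Field1 Field2 ...
# Exit status is 0 when every required field is present and non-empty.
check_required_fields() {
  tsv=$1
  shift
  awk -F '\t' -v required="$*" '
    BEGIN { bad = 0 }
    NR == 1 {
      # Map header names to column numbers.
      for (i = 1; i <= NF; i++) col[$i] = i
      n = split(required, req, " ")
      for (j = 1; j <= n; j++) {
        if (!(req[j] in col)) {
          printf "missing column: %s\n", req[j]
          bad = 1
        }
      }
      next
    }
    {
      # Data rows are numbered from 1, excluding the header.
      for (j = 1; j <= n; j++) {
        if (!(req[j] in col)) continue
        c = col[req[j]]
        if ($c == "") {
          printf "row %d: empty %s\n", NR - 1, req[j]
          bad = 1
        }
      }
    }
    END { exit bad }
  ' "$tsv"
}
```

For example, check_required_fields metadata/runs.tsv ScientificName Taxid Run Platform would list every row in a (hypothetical) metadata/runs.tsv that lacks one of those four fields.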

3. Controlled Vocabulary (CV) Curation

To eliminate inconsistencies, all Tissue and Period descriptors were standardized using a Controlled Vocabulary (CV) and Synonym Mapping.

Normalization Rules

Handling Missing or Incomplete Metadata
Missing information was classified into three categories to preserve interpretability:
- Not available: metadata completely absent in the original record;
- Not applicable: attribute biologically irrelevant for the sample (e.g., "Sex" for some specific species);
- Not determined: information present but uncertain or inconsistent.
Samples lacking essential descriptors were flagged with Is_curated = FALSE, allowing users to easily distinguish uncurated entries in the database interface.

Primary and Secondary Curation Standards
A two-tier structure was implemented to balance generalization and information retention:
- Primary metadata (Curated_*) provides standardized categories, reducing heterogeneity across studies. For example, all gut-derived samples are grouped under "Digestive" in Curated_Tissue.
- Secondary metadata (Curated_*_second) retains detailed contextual information from original submissions, such as "Midgut", "Hindgut", or "Posterior midgut".
This dual annotation system enables both cross-species comparisons and fine-grained biological interpretation.
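
The two-tier annotation can be pictured as a lookup that assigns a primary category while passing the submitted term through unchanged. This is a minimal sketch: only the grouping of gut-derived terms under "Digestive" comes from the text above, while the function name curate_tissue, the case-insensitive matching, and the "Not determined" fallback for unmapped terms are assumptions made for illustration.

```shell
# Hypothetical synonym map: detailed tissue term -> primary category.
# Prints Curated_Tissue and Curated_Tissue_second as tab-separated columns.
curate_tissue() {
  raw=$1
  # Lowercase the term so that matching is case-insensitive.
  key=$(printf '%s' "$raw" | tr '[:upper:]' '[:lower:]')
  case $key in
    gut|midgut|hindgut|"posterior midgut") primary=Digestive ;;
    *) primary="Not determined" ;;  # assumed fallback for unmapped terms
  esac
  # The secondary column keeps the original wording of the submission.
  printf '%s\t%s\n' "$primary" "$raw"
}
```

Calling curate_tissue "Posterior midgut" yields "Digestive" in the first column and "Posterior midgut" in the second, so the detailed term stays available after grouping.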

Quality Control and Ongoing Monitoring
Automated validation checks were applied during data import, verifying taxonomy alignment (via NCBI Taxonomy), controlled vocabulary compliance, and completeness of required fields. Curators review flagged entries quarterly, ensuring consistent terminology, traceability, and accuracy as new transcriptomes are added to the database.

Controlled Vocabulary Categories

The following standardized categories were established for primary metadata fields:

Curated_Sex Categories

Male
Female
Mixed
Asexual
Intersex
Neuter
Sterile female
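
A controlled-vocabulary compliance check for this field reduces to a membership test against the list above. The sketch below is illustrative only: the function name is_valid_curated_sex is hypothetical, matching is exact and case-sensitive by assumption, and the three missing-value labels ("Not available", "Not applicable", "Not determined") are deliberately not handled here.

```shell
# Return success (0) if the value is one of the Curated_Sex categories
# listed above, and failure (1) otherwise.
is_valid_curated_sex() {
  case $1 in
    Male|Female|Mixed|Asexual|Intersex|Neuter|"Sterile female") return 0 ;;
    *) return 1 ;;
  esac
}
```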

4. Data Acquisition

Transcriptome reads were downloaded via the NCBI SRA Toolkit or the ENA API.

Example (SRA Toolkit):

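prefetch SRR16388593
fasterq-dump SRR16388593 -e 8 -O ./fastq/
# fasterq-dump writes uncompressed FASTQ; compress to match the .fastq.gz inputs used in preprocessing
gzip fastq/SRR16388593_1.fastq fastq/SRR16388593_2.fastq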

Directory Structure:

project/
 ├── metadata/
 ├── fastq/
 ├── qc/
 ├── asm/
 ├── bam/
 ├── expr/
 └── logs/
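
The layout above can be created in one step. A minimal sketch follows, in which the function name init_project_dirs and the default root "project" are illustrative choices.

```shell
# Create the project skeleton shown above under a chosen root directory.
# Usage: init_project_dirs [root]   (root defaults to "project")
init_project_dirs() {
  root=${1:-project}
  for d in metadata fastq qc asm bam expr logs; do
    # -p creates missing parents and is a no-op if the directory exists.
    mkdir -p "$root/$d"
  done
}
```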

5. Preprocessing (fastp v0.23.4)

Raw reads were quality filtered and adapter-trimmed using fastp. Default settings were applied except where explicitly stated.

Command example:

fastp \
  -i fastq/SRR16388593_1.fastq.gz \
  -I fastq/SRR16388593_2.fastq.gz \
  -o fastq/SRR16388593.clean_1.fq.gz \
  -O fastq/SRR16388593.clean_2.fq.gz \
  -w 8 \
  -j qc/SRR16388593.fastp.json \
  -h qc/SRR16388593.fastp.html

Quality thresholds:

≥80% reads retained post-filtering
Q30 ≥85%
Samples with extreme adapter contamination or duplication flagged for review
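
The two numeric thresholds can be applied to values taken from the fastp report. The sketch below assumes the inputs are the read counts before and after filtering and the Q30 rate as a fraction between 0 and 1. The function name fastp_qc_status is hypothetical, and using the post-filtering Q30 rate is an assumption, since the text does not say which stage the Q30 threshold refers to.

```shell
# Hypothetical helper: apply the thresholds above to three numbers.
# Arguments: reads before filtering, reads after filtering, Q30 rate (0-1).
# Prints PASS when >=80% of reads are retained and Q30 >= 85%, else FAIL.
fastp_qc_status() {
  awk -v before="$1" -v after="$2" -v q30="$3" 'BEGIN {
    if (before + 0 <= 0) {
      # Guard against division by zero on empty or malformed input.
      print "FAIL"
      exit
    }
    retained = (after + 0) / (before + 0)
    if (retained >= 0.80 && q30 + 0 >= 0.85) {
      print "PASS"
    } else {
      print "FAIL"
    }
  }'
}
```

In the fastp JSON report these values are found under summary.before_filtering.total_reads, summary.after_filtering.total_reads, and summary.after_filtering.q30_rate, and could be extracted with a JSON tool such as jq before being passed to the function.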

6. Reference Genome Indexing (HISAT2)

Reference genomes were selected to match the genome version displayed for each species on the website. An index was built once per genome version:

hisat2-build -p 8 ref/Aalb_v2.fa index/Aalb_v2

7. Read Alignment (HISAT2 v2.2.1)

Paired-end reads were aligned to reference genomes using HISAT2. Strand specificity was inferred using infer_experiment.py (from the RSeQC package) where applicable.

Example:

# Step 1: align reads and write a coordinate-sorted BAM
hisat2 -p 8 -x index/Aalb_v2 \
  -1 fastq/SRR16388593.clean_1.fq.gz \
  -2 fastq/SRR16388593.clean_2.fq.gz \
| samtools sort -@ 8 -o bam/SRR16388593.sorted.bam -
# Step 2: index the sorted BAM
samtools index bam/SRR16388593.sorted.bam
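
For paired-end data, infer_experiment.py reports the fraction of reads explained by "1++,1--,2+-,2-+" and by "1+-,1-+,2++,2--". One way to turn those two fractions into a value for the HISAT2 --rna-strandness option is sketched below. The function name strandness_from_fractions and the 0.8 cutoff are assumptions for illustration, and the text above does not state how the inferred strandness was passed to the aligner.

```shell
# Arguments: fraction for "1++,1--,2+-,2-+", fraction for "1+-,1-+,2++,2--".
# Prints FR or RF (valid values for hisat2 --rna-strandness with paired-end
# reads), or "unstranded" when neither orientation dominates.
# The 0.8 cutoff is an arbitrary illustrative choice.
strandness_from_fractions() {
  awk -v fwd="$1" -v rev="$2" 'BEGIN {
    if (fwd + 0 >= 0.8) {
      print "FR"
    } else if (rev + 0 >= 0.8) {
      print "RF"
    } else {
      print "unstranded"
    }
  }'
}
```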

Recommended metrics:

Overall alignment rate ≥70%
Unique alignment rate ≥50%
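
HISAT2 prints its alignment summary to standard error; it can be kept by redirecting stderr to a file (for example under logs/) or by using the --summary-file option. The sketch below reads such a saved summary and applies the two thresholds. The function name hisat2_rate_status is hypothetical, and treating "aligned concordantly exactly 1 time" as the unique alignment rate is an assumption, since the text does not define how that rate was computed.

```shell
# Hypothetical helper: read a saved HISAT2 paired-end summary and apply
# the thresholds above. "Unique" is approximated by the percentage of
# pairs that aligned concordantly exactly once.
hisat2_rate_status() {
  awk '
    /overall alignment rate/ {
      # Line looks like: "96.70% overall alignment rate"
      sub(/%/, "", $1)
      overall = $1 + 0
    }
    /aligned concordantly exactly 1 time/ {
      # Line looks like: "8823 (88.23%) aligned concordantly exactly 1 time"
      pct = $2
      gsub(/[()%]/, "", pct)
      unique = pct + 0
    }
    END {
      status = (overall >= 70 && unique >= 50) ? "PASS" : "FAIL"
      printf "%s\toverall=%.2f\tunique=%.2f\n", status, overall, unique
    }
  ' "$1"
}
```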

8. Transcript Assembly (StringTie2 v2.2.1)

Transcript assembly and quantification were performed per sample using StringTie2.

Example:

stringtie bam/SRR16388593.sorted.bam -p 8 \
  -G ref/Aalb_v2.gtf \
  -o asm/SRR16388593.gtf

Merged assemblies were optionally generated for novel transcript discovery:

stringtie --merge -p 8 -G ref/Aalb_v2.gtf -o asm/merged.gtf asm/list.txt
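
The merge command reads asm/list.txt, a plain-text file with one per-sample GTF path per line. A minimal sketch for generating it is shown below; the function name make_merge_list is hypothetical, and skipping an existing merged.gtf is a precaution assumed here so that a previous merge result is not fed back in.

```shell
# Write one per-sample GTF path per line to <asm_dir>/list.txt,
# skipping any merged.gtf left over from an earlier merge.
# Usage: make_merge_list [asm_dir]   (asm_dir defaults to "asm")
make_merge_list() {
  asm_dir=${1:-asm}
  for gtf in "$asm_dir"/*.gtf; do
    # If no GTF exists the glob stays unexpanded; skip that literal pattern.
    [ -e "$gtf" ] || continue
    case $gtf in
      */merged.gtf) continue ;;
    esac
    printf '%s\n' "$gtf"
  done > "$asm_dir/list.txt"
}
```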

9. Data Preservation and Accessibility

Primary transcriptome assemblies are archived in certified data repositories to ensure long-term accessibility and DOI-based citation.
Derived datasets (e.g., expression matrices, merged GTFs) are versioned and accompanied by metadata and workflow logs.

10. References

Chen et al. fastp: an ultra-fast all-in-one FASTQ preprocessor.
Kim et al. HISAT2: graph-based alignment of next generation sequencing reads.
Kovaka et al. StringTie2: transcriptome assembly from long and short reads.
NCBI SRA Toolkit: https://github.com/ncbi/sra-tools
Data deposition guidelines: https://academic.oup.com/nar/pages/data_deposition_and_standardization
