Transcriptome Data Processing and Integration
A comprehensive guide to processing and integrating transcriptome data, following the workflow adopted by the InsectBase database team.
1. Overview
A total of 61,421 transcriptomes from 4,150 species were retrieved from the NCBI Sequence Read Archive (SRA). After extensive manual curation (ensuring metadata completeness, untreated experimental conditions, and valid reference genomes), an additional, manually validated dataset of 18,999 transcriptomes from 397 species (including 106 newly added species) was integrated into the final release.
All raw reads were processed using the following pipeline:
Step | Tool | Version |
---|---|---|
Quality filtering and adapter trimming | fastp | 0.23.4 |
Read mapping to reference genomes | HISAT2 | 2.2.1 |
Transcript assembly | StringTie2 | 2.2.1 |
Unless otherwise specified, default parameters were used for each software package. All commands, configuration files, and metadata schemas are archived in the project's public GitHub repository for full reproducibility.
2. Metadata Standards
Each transcriptome was accepted only if it satisfied the following criteria:
- Complete metadata: period (developmental stage) and tissue descriptors present.
- Untreated condition: only unmanipulated control samples were included.
- Reliable reference genome: species-level genome available and traceable.
- Accessible technical metadata: sequencing platform, read length, and layout documented.
Core metadata fields:
Field Name | Description | Example (Blattella germanica) |
---|---|---|
ScientificName | Scientific name of the organism | Blattella germanica |
Taxid | NCBI Taxonomy ID | 6973 |
Order | Taxonomic order of the species | Blattodea |
Family | Taxonomic family | Ectobiidae |
Genus | Taxonomic genus | Blattella |
Run | NCBI SRA Run accession number | SRR26627362 |
LibraryStrategy | Sequencing strategy used in the experiment | RNA-Seq |
LibrarySource | Source type of the sequencing library | Transcriptomic |
LibraryLayout | Sequencing layout (single-end or paired-end) | Paired |
Platform | Sequencing platform used for data generation | Illumina |
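The completeness requirement can be checked mechanically. The following is a minimal sketch, assuming each record is a plain dictionary keyed by the field names in the table above; the function name missing_core_fields is hypothetical and is not taken from the InsectBase code base.

```python
# Core field names copied from the table above.
CORE_FIELDS = [
    "ScientificName", "Taxid", "Order", "Family", "Genus", "Run",
    "LibraryStrategy", "LibrarySource", "LibraryLayout", "Platform",
]


def missing_core_fields(record):
    """Return the core fields that are absent or blank in a metadata record."""
    missing = []
    for field in CORE_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Example record built from the Blattella germanica column of the table.
example = {
    "ScientificName": "Blattella germanica", "Taxid": "6973",
    "Order": "Blattodea", "Family": "Ectobiidae", "Genus": "Blattella",
    "Run": "SRR26627362", "LibraryStrategy": "RNA-Seq",
    "LibrarySource": "Transcriptomic", "LibraryLayout": "Paired",
    "Platform": "Illumina",
}
print(missing_core_fields(example))
```

A record with an empty list of missing fields satisfies only the completeness criterion; the untreated-condition and reference-genome criteria still require curator judgment.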
3. Controlled Vocabulary (CV) Curation
To eliminate inconsistencies, all Tissue and Period descriptors were standardized using a Controlled Vocabulary (CV) and Synonym Mapping.
Normalization Rules
- Handling Missing or Incomplete Metadata
  Missing information was classified into three categories to preserve interpretability:
  - Not available: metadata completely absent in the original record;
  - Not applicable: attribute biologically irrelevant for the sample (e.g., "Sex" for some specific species);
  - Not determined: information present but uncertain or inconsistent.
  Samples lacking essential descriptors were flagged with Is_curated = FALSE, allowing users to easily distinguish uncurated entries in the database interface.
- Primary and Secondary Curation Standards
  A two-tier structure was implemented to balance generalization and information retention:
  - Primary metadata (Curated_*) provides standardized categories, reducing heterogeneity across studies. For example, all gut-derived samples are grouped under "Digestive" in Curated_Tissue.
  - Secondary metadata (Curated_*_second) retains detailed contextual information from original submissions, such as "Midgut", "Hindgut", or "Posterior midgut".
  This dual annotation system enables both cross-species comparisons and fine-grained biological interpretation.
- Quality Control and Ongoing Monitoring
Automated validation checks were applied during data import, verifying taxonomy alignment (via NCBI Taxonomy), controlled vocabulary compliance, and completeness of required fields. Curators review flagged entries quarterly, ensuring consistent terminology, traceability, and accuracy as new transcriptomes are added to the database.
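The normalization rules above could be sketched roughly as follows. This is an illustration under stated assumptions, not the curators' actual implementation: the placeholder spellings ("na", "n/a"), the choice of Tissue and Period as the essential descriptors, the pass-through of unmapped terms, and all function names are hypothetical. Only the three missing-data labels and the gut examples come from the text.

```python
# The three labels come from the text; the placeholder spellings are assumed.
PLACEHOLDERS = {
    "": "Not available",
    "na": "Not available",
    "n/a": "Not available",
    "not available": "Not available",
    "not applicable": "Not applicable",
    "not determined": "Not determined",
}

# Secondary (detailed) term -> primary (standardized) category.
# Only the gut examples named in the text are listed here.
TISSUE_PRIMARY = {
    "midgut": "Digestive",
    "hindgut": "Digestive",
    "posterior midgut": "Digestive",
}


def normalize_missing(value):
    """Map empty or placeholder values to one of the three missing-data labels."""
    text = "" if value is None else str(value).strip()
    return PLACEHOLDERS.get(text.lower(), text)


def curate_tissue(raw):
    """Return (Curated_Tissue, Curated_Tissue_second) for a raw tissue string."""
    detail = normalize_missing(raw)
    # Assumption: terms without a synonym entry pass through unchanged.
    primary = TISSUE_PRIMARY.get(detail.lower(), detail)
    return primary, detail


def is_curated(record, essential=("Tissue", "Period")):
    """False when any essential descriptor is absent from the record."""
    return all(
        normalize_missing(record.get(key)) != "Not available"
        for key in essential
    )


print(curate_tissue("Posterior midgut"))
print(is_curated({"Tissue": "Midgut", "Period": ""}))
```

Keeping the original term as the secondary value means the primary grouping can be revised later without going back to the source records.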
Controlled Vocabulary Categories
The following standardized categories were established for primary metadata fields:
Curated_Sex Categories
- Male
- Female
- Mixed
- Asexual
- Intersex
- Neuter
- Sterile female
Curated_Age Categories
- Adult
- Egg
- Nymph
- Larva and Adult
- Larva
- Mixed
- Pupa
- Cell
- Egg and Pupa
- Larva and Pupa
Curated_Tissue Categories
- Digestive
- Egg
- Fat body
- Hemolymph
- Excretory
- Gland
- Whole body
- Head and Thorax
- Ovary
- Mixed
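A controlled vocabulary compliance check against the categories listed above might look like the sketch below. The category sets are copied from the lists; treating the three missing-data labels as acceptable values in curated fields is an assumption, and the function name cv_violations is hypothetical.

```python
CONTROLLED_VOCABULARY = {
    "Curated_Sex": {
        "Male", "Female", "Mixed", "Asexual", "Intersex", "Neuter",
        "Sterile female",
    },
    "Curated_Age": {
        "Adult", "Egg", "Nymph", "Larva and Adult", "Larva", "Mixed",
        "Pupa", "Cell", "Egg and Pupa", "Larva and Pupa",
    },
    "Curated_Tissue": {
        "Digestive", "Egg", "Fat body", "Hemolymph", "Excretory", "Gland",
        "Whole body", "Head and Thorax", "Ovary", "Mixed",
    },
}

# Assumption: the missing-data labels are also valid in curated fields.
MISSING_LABELS = {"Not available", "Not applicable", "Not determined"}


def cv_violations(record):
    """Return {field: value} for curated fields outside the vocabulary."""
    violations = {}
    for field, allowed in CONTROLLED_VOCABULARY.items():
        value = record.get(field)
        if value not in allowed and value not in MISSING_LABELS:
            violations[field] = value
    return violations


record = {
    "Curated_Sex": "Female",
    "Curated_Age": "Adult",
    "Curated_Tissue": "Midgut",  # a secondary term, not a primary category
}
print(cv_violations(record))
```

Entries that return a non-empty result would be candidates for the curator review described earlier.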
4. Data Acquisition
Transcriptome reads were downloaded via the NCBI SRA Toolkit or the ENA API.
Example (SRA Toolkit):
prefetch SRR16388593
fasterq-dump SRR16388593 -e 8 -O ./fastq/
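For the ENA route, the text does not say which endpoint was used. One plausible option is the ENA Portal API file report, which lists FASTQ locations for a run. The sketch below only builds the request URL and performs no network access; the choice of endpoint and of the fastq_ftp and fastq_md5 fields is an assumption.

```python
from urllib.parse import urlencode

# ENA Portal API file report endpoint (assumed; not named in the text).
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_filereport_url(run_accession, fields=("fastq_ftp", "fastq_md5")):
    """Build a file report URL listing FASTQ locations for one run."""
    query = urlencode({
        "accession": run_accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


print(ena_filereport_url("SRR16388593"))
```

The returned table can then be passed to any download tool, and the MD5 column used to verify the transferred files.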
Directory Structure:
project/
├── metadata/
├── fastq/
├── qc/
├── asm/
├── bam/
├── expr/
└── logs/
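The layout above can be created in one step. This is a small sketch using only the directory names shown; the function name create_layout is hypothetical. The ref/ and index/ directories that appear in later commands are not part of the listed layout and are left out here.

```python
from pathlib import Path

# Sub-directory names copied from the layout above.
SUBDIRS = ["metadata", "fastq", "qc", "asm", "bam", "expr", "logs"]


def create_layout(root):
    """Create the project sub-directories under root and return their paths."""
    root = Path(root)
    paths = [root / name for name in SUBDIRS]
    for path in paths:
        # exist_ok makes the call safe to repeat on an existing project.
        path.mkdir(parents=True, exist_ok=True)
    return paths


if __name__ == "__main__":
    for path in create_layout("project"):
        print(path)
```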
5. Preprocessing (fastp v0.23.4)
Raw reads were quality filtered and adapter-trimmed using fastp. Default settings were applied except where explicitly stated.
Command example:
fastp \
-i fastq/SRR16388593_1.fastq.gz \
-I fastq/SRR16388593_2.fastq.gz \
-o fastq/SRR16388593.clean_1.fq.gz \
-O fastq/SRR16388593.clean_2.fq.gz \
-w 8 \
-j qc/SRR16388593.fastp.json \
-h qc/SRR16388593.fastp.html
Quality thresholds:
- ≥80% reads retained post-filtering
- Q30 ≥85%
- Samples with extreme adapter contamination or duplication flagged for review
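The first two thresholds can be evaluated from the JSON report written by the -j option. The sketch below makes two assumptions the text does not state: that the retained fraction is the ratio of total reads after and before filtering, and that Q30 refers to the post-filtering rate. The function names are hypothetical; the keys read from the report (summary, before_filtering, after_filtering, total_reads, q30_rate) are fields of the fastp JSON output.

```python
import json


def fastp_qc(report, min_retained=0.80, min_q30=0.85):
    """Evaluate retained-read fraction and Q30 rate from a parsed fastp report."""
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    total_before = before["total_reads"]
    retained = after["total_reads"] / total_before if total_before else 0.0
    q30 = after["q30_rate"]  # assumption: post-filtering Q30 is the one checked
    return {
        "retained": retained,
        "q30": q30,
        "passed": retained >= min_retained and q30 >= min_q30,
    }


def fastp_qc_from_file(path):
    """Load a fastp JSON report (e.g. qc/SRR16388593.fastp.json) and evaluate it."""
    with open(path) as handle:
        return fastp_qc(json.load(handle))


# Made-up numbers, for illustration only.
demo = {
    "summary": {
        "before_filtering": {"total_reads": 1000, "q30_rate": 0.88},
        "after_filtering": {"total_reads": 900, "q30_rate": 0.91},
    }
}
print(fastp_qc(demo)["passed"])
```

Adapter contamination and duplication are not scored here, since the text leaves the cut-off for "extreme" to reviewer judgment.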
6. Reference Genome Indexing (HISAT2)
Reference genomes were selected to match the genome version displayed on the website for each species. Indexes were built once per genome version:
hisat2-build -p 8 ref/Aalb_v2.fa index/Aalb_v2
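To build each index only once, a wrapper can test for existing index files before calling hisat2-build. The sketch below assumes a standard (small) index, which consists of eight files named from prefix.1.ht2 to prefix.8.ht2; large indexes with the .ht2l suffix are not handled. The function names are hypothetical, and nothing is executed here: the command is only assembled.

```python
from pathlib import Path


def hisat2_index_exists(prefix):
    """True if all eight small-index files <prefix>.1.ht2 .. <prefix>.8.ht2 exist."""
    return all(Path(f"{prefix}.{i}.ht2").is_file() for i in range(1, 9))


def build_command(fasta, prefix, threads=8):
    """Return the hisat2-build command as an argument list."""
    return ["hisat2-build", "-p", str(threads), str(fasta), str(prefix)]


if __name__ == "__main__":
    if hisat2_index_exists("index/Aalb_v2"):
        print("index already present, skipping build")
    else:
        print(" ".join(build_command("ref/Aalb_v2.fa", "index/Aalb_v2")))
```

The argument list can be passed to subprocess.run when the build should actually be launched.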
7. Read Alignment (HISAT2 v2.2.1)
Paired-end reads were aligned to reference genomes using HISAT2. Strand specificity was inferred using infer_experiment.py where applicable.
Example:
# Step 1: align reads and write a coordinate-sorted BAM file
hisat2 -p 8 -x index/Aalb_v2 \
-1 fastq/SRR16388593.clean_1.fq.gz \
-2 fastq/SRR16388593.clean_2.fq.gz \
| samtools sort -@ 8 -o bam/SRR16388593.sorted.bam
# Step 2: index the sorted BAM file
samtools index bam/SRR16388593.sorted.bam
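Where strand specificity is inferred with infer_experiment.py (part of RSeQC), its paired-end report can be translated into a value for the HISAT2 --rna-strandness option. The sketch below is one way to do this; the 0.8 decision threshold and the function name are assumptions, and the sample fractions are made up.

```python
import re


def strandness_from_rseqc(text, threshold=0.8):
    """Return 'FR', 'RF', or None (unstranded or undetermined) for paired-end data."""
    fractions = dict(re.findall(r'explained by "([^"]+)": ([0-9.]+)', text))
    # Read 1 on the same strand as the gene -> HISAT2 'FR'.
    forward = float(fractions.get("1++,1--,2+-,2-+", 0))
    # Read 1 on the opposite strand (e.g. dUTP libraries) -> HISAT2 'RF'.
    reverse = float(fractions.get("1+-,1-+,2++,2--", 0))
    if forward >= threshold:
        return "FR"
    if reverse >= threshold:
        return "RF"
    return None


# Made-up report in the layout printed by infer_experiment.py.
SAMPLE_REPORT = """\
This is PairEnd Data
Fraction of reads failed to determine: 0.0172
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0119
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9709
"""
print(strandness_from_rseqc(SAMPLE_REPORT))
```

A result of None would mean running HISAT2 without --rna-strandness, which matches the default behaviour of the command shown above.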
Recommended metrics:
- Overall alignment rate ≥70%
- Unique alignment rate ≥50%
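Both metrics can be read from the alignment summary that HISAT2 prints to standard error (or writes to a file with --summary-file). The text does not define "unique alignment rate"; the sketch below assumes it is the percentage of pairs that aligned concordantly exactly once. The function names are hypothetical and the sample summary is abbreviated, with made-up numbers.

```python
import re


def hisat2_rates(summary_text):
    """Extract overall and (assumed) unique alignment rates, in percent."""
    overall = re.search(r"([\d.]+)% overall alignment rate", summary_text)
    unique = re.search(
        r"\(([\d.]+)%\) aligned concordantly exactly 1 time", summary_text
    )
    return {
        "overall": float(overall.group(1)) if overall else None,
        "unique": float(unique.group(1)) if unique else None,
    }


def meets_recommendation(rates, min_overall=70.0, min_unique=50.0):
    """Apply the recommended thresholds; missing values count as failure."""
    if rates["overall"] is None or rates["unique"] is None:
        return False
    return rates["overall"] >= min_overall and rates["unique"] >= min_unique


# Abbreviated summary in the default HISAT2 layout (made-up numbers).
SAMPLE_SUMMARY = """\
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
96.70% overall alignment rate
"""
rates = hisat2_rates(SAMPLE_SUMMARY)
print(rates, meets_recommendation(rates))
```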
8. Transcript Assembly (StringTie2 v2.2.1)
Transcript assembly and quantification were performed per sample using StringTie2.
Example:
stringtie bam/SRR16388593.sorted.bam -p 8 \
-G ref/Aalb_v2.gtf \
-o asm/SRR16388593.gtf
Merged assemblies were optionally generated for novel transcript discovery:
stringtie --merge -p 8 -G ref/Aalb_v2.gtf -o asm/merged.gtf asm/list.txt
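The merge command reads its input from asm/list.txt, a text file with one GTF path per line. How that file was produced is not described; a minimal sketch, assuming every per-sample GTF in asm/ should be merged and that a previous merged.gtf should be skipped, could be:

```python
from pathlib import Path


def write_merge_list(asm_dir, list_name="list.txt", exclude=("merged.gtf",)):
    """Write one per-sample GTF path per line and return the listed paths."""
    asm_dir = Path(asm_dir)
    gtfs = sorted(
        path for path in asm_dir.glob("*.gtf") if path.name not in exclude
    )
    list_path = asm_dir / list_name
    list_path.write_text("".join(f"{path}\n" for path in gtfs))
    return gtfs


if __name__ == "__main__":
    # Run from the project root so the paths match the stringtie command above.
    for gtf in write_merge_list("asm"):
        print(gtf)
```

Sorting the paths keeps the list stable between runs, which makes it easier to compare merged assemblies across releases.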
9. Data Preservation and Accessibility
- Primary transcriptome assemblies are archived in certified data repositories to ensure long-term accessibility and DOI-based citation.
- Derived datasets (e.g., expression matrices, merged GTFs) are versioned and accompanied by metadata and workflow logs.
10. References
- Chen et al. fastp: an ultra-fast all-in-one FASTQ preprocessor.
- Kim et al. HISAT2: graph-based alignment of next generation sequencing reads.
- Kovaka et al. StringTie2: transcriptome assembly from long and short reads.
- NCBI SRA Toolkit: https://github.com/ncbi/sra-tools
- Data deposition guidelines: https://academic.oup.com/nar/pages/data_deposition_and_standardization