OMArk Genome Annotation Quality Assessment
A comprehensive guide to evaluating the quality of proteomes (protein-coding gene repertoires) using OMArk.
OMArk provides metrics on proteome completeness, characterizes the consistency of all protein-coding genes with respect to their homologs, and identifies contamination from other species.
Introduction
OMArk is a software tool used for evaluating the quality of proteomes (protein-coding gene repertoires). It relies on the OMA orthology database to exploit orthology relationships and utilizes the OMAmer software for fast placement of proteins into gene families.
For more information, refer to the publication: "Quality assessment of gene repertoire annotations with OMArk" (Nature Biotechnology).
Workflow Steps
Run OMArk with Existing Protein Files
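First, place the proteins into gene families with OMAmer, then run OMArk on the OMAmer output:
# Create the output directory used by both commands
mkdir -p ./omark
omamer search --db ~/sda/database/LUCA.h5 --query Onip.aa --out ./omark/Onip.omamer
omark -f ./omark/Onip.omamer -d ~/sda/database/LUCA.h5 -o ./omark/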
Extract IDs to Remove Based on OMArk Results
Use the following awk command to extract the IDs of proteins that should be removed because they are flagged as inconsistent or as contamination:
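awk '/^>(Inconsistent|Contamination)_(Full|Partial|Fragment)/{flag=1;next} /^>/{flag=0} flag' omark.ump > to_remove.id
This writes the IDs listed under the Inconsistent and Contamination categories to to_remove.id, one per line, without the category header lines. Point the command at the .ump file that OMArk wrote to its output directory (./omark/ in the example above).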
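To see what the extraction produces, the sketch below runs a variant of the command that writes only the IDs (no category header lines) on a small mock file. The header names and the one-ID-per-line layout of the mock .ump file are assumptions inferred from the pattern in the command above, and the protein IDs are made up; compare against a real OMArk output before relying on it.

```shell
# Toy demonstration. ASSUMPTION: the .ump file consists of category headers
# starting with ">" followed by one protein ID per line. IDs are hypothetical.
mkdir -p toy_omark
cat > toy_omark/toy.ump <<'EOF'
>Consistent_Full
prot001
prot002
>Inconsistent_Full
prot003
>Contamination_Partial
prot004
>Unknown
prot005
EOF

# Collect only the IDs listed under Inconsistent_* and Contamination_* headers;
# the header lines themselves are skipped.
awk '/^>(Inconsistent|Contamination)_(Full|Partial|Fragment)/{flag=1;next} /^>/{flag=0} flag' \
    toy_omark/toy.ump > toy_omark/to_remove.id

cat toy_omark/to_remove.id
```

With this mock input, only prot003 and prot004 end up in the ID list.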
Remove Proteins or GFF Based on Extracted IDs
Once you have the IDs to remove, use them to filter your original protein or GFF files:
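# Remove flagged proteins from the FASTA file. Each flagged header is dropped together with
# its sequence lines, and IDs are matched exactly against the first word of the header.
awk 'FILENAME==ARGV[1]{if($1!="")drop[$1];next} /^>/{keep=!(substr($1,2) in drop)} keep' to_remove.id Onip.aa > Onip_clean.aa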
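# Alternatively, filter the GFF file. Feature IDs are in the attributes column (column 9);
# column 1 holds the sequence/scaffold name. This assumes the IDs in to_remove.id match the
# ID= or Parent= values of the transcript-level features; adjust to your annotation's naming.
awk -F'\t' 'FILENAME==ARGV[1]{if($1!="")drop[$1];next} /^#/{print;next} {id=par=""; n=split($9,f,";"); for(i=1;i<=n;i++){if(f[i]~/^ID=/)id=substr(f[i],4); if(f[i]~/^Parent=/)par=substr(f[i],8)} if(!(id in drop)&&!(par in drop))print}' to_remove.id Onip.gff > Onip_clean.gff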
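Filtering a FASTA file with line-based tools needs some care: each record spans a header plus one or more sequence lines, and substring matching can hit unrelated IDs (prot1 versus prot10). The following is a minimal sketch on toy data, with hypothetical IDs and sequences, of an awk filter that removes whole records by exact ID and then compares sequence counts as a sanity check.

```shell
# Toy demonstration with hypothetical IDs and sequences.
mkdir -p toy_filter
cat > toy_filter/toy.aa <<'EOF'
>prot1 some description
MKTAYIAK
QRQISFVK
>prot10
MSHFSRQL
>prot2
MADEEKLP
EOF
printf 'prot1\n' > toy_filter/to_remove.id

# Drop each flagged header together with its sequence lines. IDs are compared
# exactly with the first word of the header, so prot1 does not remove prot10.
awk 'FILENAME==ARGV[1]{if($1!="")drop[$1];next}
     /^>/{keep=!(substr($1,2) in drop)}
     keep' toy_filter/to_remove.id toy_filter/toy.aa > toy_filter/toy_clean.aa

# Sanity check: sequences before, IDs flagged, sequences after.
before=$(grep -c '^>' toy_filter/toy.aa)
flagged=$(grep -c . toy_filter/to_remove.id)
after=$(grep -c '^>' toy_filter/toy_clean.aa)
echo "before=$before flagged=$flagged after=$after"
```

If the count after filtering does not equal the count before minus the number of flagged IDs, some IDs in the list probably do not match the FASTA headers exactly.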
Important Notes
Treat OMArk results as guidance, not as ground truth. The assessment depends on the gene families in the existing evolutionary database, LUCA.h5, so lineages that are poorly represented there, particularly less-studied insect lineages, may be assessed with bias. The classification of sequences as Contamination or Inconsistent is also not always accurate. Review the flagged sequences for your own study insect before removing them.
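One way to review flagged proteins before discarding them is to write them to a separate FASTA file for inspection. The sketch below does this on toy data; the file names, IDs, and sequences are hypothetical, and the ID list is assumed to hold one exact protein ID per line.

```shell
# Toy demonstration with hypothetical IDs and sequences.
mkdir -p toy_review
cat > toy_review/toy.aa <<'EOF'
>prot1
MKTAYIAK
>prot2
MADEEKLP
>prot3
MSHFSRQL
EOF
printf 'prot2\nprot3\n' > toy_review/to_remove.id

# Keep only the flagged records (header plus sequence lines) for inspection.
awk 'FILENAME==ARGV[1]{if($1!="")want[$1];next}
     /^>/{keep=(substr($1,2) in want)}
     keep' toy_review/to_remove.id toy_review/toy.aa > toy_review/flagged.aa

cat toy_review/flagged.aa
```

The resulting file can then be examined sequence by sequence, for example by comparing it against related species, before deciding what to remove.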