OMArk Genome Annotation Quality Assessment

A comprehensive guide to evaluating the quality of proteomes (protein-coding gene repertoires) using OMArk.

OMArk provides metrics on proteome completeness, characterizes the consistency of all protein-coding genes with respect to their homologs, and identifies contamination from other species.

Introduction

OMArk draws on the OMA orthology database for orthology relationships and uses the OMAmer software to place proteins into gene families quickly. From these placements it derives the completeness, consistency, and contamination metrics described above.

For more information, refer to the publication "Quality assessment of gene repertoire annotations with OMArk" (Nature Biotechnology).

Workflow Steps

1. Run OMArk with Existing Protein Files

First, place your protein sequences into gene families with OMAmer, then run OMArk on the resulting file:

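# Create the output directory, then place proteins into gene families with OMAmer
mkdir -p ./omark
omamer search --db ~/sda/database/LUCA.h5 --query Onip.aa --out ./omark/Onip.omamer

# Assess the OMAmer placements with OMArk
omark -f ./omark/Onip.omamer -d ~/sda/database/LUCA.h5 -o ./omark/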
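
The two commands above can be wrapped in a small function that checks its preconditions before starting a long run. This is only a sketch: the function name `run_omark` and the variables `DB`, `QUERY`, and `OUTDIR` are hypothetical, and the tool flags are the same ones used in the commands above.

```shell
# Defaults mirror the paths used in this guide; override them via the environment.
DB="${DB:-$HOME/sda/database/LUCA.h5}"
QUERY="${QUERY:-Onip.aa}"
OUTDIR="${OUTDIR:-./omark}"

run_omark() {
    # Fail early with a clear message instead of leaving a half-finished run.
    for tool in omamer omark; do
        command -v "$tool" >/dev/null 2>&1 || { echo "missing tool: $tool" >&2; return 1; }
    done
    [ -s "$QUERY" ] || { echo "missing or empty query: $QUERY" >&2; return 1; }
    [ -s "$DB" ] || { echo "missing database: $DB" >&2; return 1; }

    mkdir -p "$OUTDIR" || return 1

    # Name the OMAmer output after the query file, e.g. Onip.aa -> Onip.omamer
    prefix="$(basename "${QUERY%.*}")"

    omamer search --db "$DB" --query "$QUERY" --out "$OUTDIR/$prefix.omamer" || return 1
    omark -f "$OUTDIR/$prefix.omamer" -d "$DB" -o "$OUTDIR/"
}
```

Calling `run_omark` with no arguments runs both steps with the defaults; `QUERY=other.aa run_omark` would process a different proteome.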

2. Extract IDs to Remove Based on OMArk Results

Use the following awk command to extract IDs of proteins that need to be removed due to inconsistencies or contamination:

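# Print the IDs listed under Inconsistent_* and Contamination_* headers, skipping the header lines and blank lines
awk '/^>/{flag=($0 ~ /^>(Inconsistent|Contamination)_(Full|Partial|Fragment)/);next} flag && NF' omark.ump > to_remove.id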

The resulting to_remove.id file lists the proteins that OMArk placed in the Inconsistent and Contamination categories.
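
To make the extraction step concrete, the sketch below runs a similar filter on a small made-up file and tallies the flagged IDs per category. The layout of `toy.ump` (FASTA-like category headers followed by one protein ID per line) is an assumption inferred from the awk pattern above, and all IDs are invented; check the layout of your own OMArk output before relying on it.

```shell
# Toy input; the layout and IDs are hypothetical.
cat > toy.ump <<'EOF'
>Consistent_Full
geneA
geneB
>Inconsistent_Partial
geneC
>Contamination_Full
geneD
geneE
>Unknown
geneF
EOF

# Collect IDs listed under Inconsistent_* or Contamination_* headers,
# skipping the header lines themselves and any blank lines.
awk '/^>/ { flag = ($0 ~ /^>(Inconsistent|Contamination)_(Full|Partial|Fragment)/); next }
     flag && NF { print $1 }' toy.ump > toy_to_remove.id

# Tally flagged IDs per category as a quick plausibility check.
awk '/^>/ { grp = substr($1, 2); next }
     NF && grp ~ /^(Inconsistent|Contamination)_/ { n[grp]++ }
     END { for (g in n) print g, n[g] }' toy.ump | sort > toy_counts.txt

cat toy_to_remove.id
cat toy_counts.txt
```

If one category dominates the tally, that is a hint to look at those sequences more closely before deleting anything.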

3. Remove Proteins or GFF Entries Based on Extracted IDs

Once you have the IDs to remove, use them to filter your original protein or GFF files:

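# Remove whole records (header plus sequence lines) from the FASTA file;
# IDs are matched exactly against the first word of each header
awk 'NR==FNR{a[$1];next} /^>/{id=substr($1,2);keep=!(id in a)} keep' to_remove.id Onip.aa > Onip_clean.aa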

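# Alternatively, filter the GFF file: drop features whose ID or Parent attribute (column 9) is listed
# (assumes the protein IDs match transcript-level IDs in the GFF; parent gene features are kept)
awk 'NR==FNR{a[$1];next} /^#/{print;next} {split($0,c,"\t");n=split(c[9],t,";");hit=0;for(i=1;i<=n;i++)if(t[i]~/^(ID|Parent)=/){v=t[i];sub(/^[^=]*=/,"",v);if(v in a)hit=1} if(!hit)print}' to_remove.id Onip.gff > Onip_clean.gff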
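
In GFF3, column 1 holds the sequence (contig or chromosome) name, while feature identifiers live in the ID and Parent attributes of column 9. The sketch below filters a toy GFF by those attributes. All IDs and coordinates are invented, and it assumes the IDs in the removal list correspond to transcript-level IDs, which may not hold for every annotation.

```shell
# Toy GFF3 with two genes; printf is used so the tab separators are explicit.
{
  printf '##gff-version 3\n'
  printf 'chr1\ttoy\tgene\t1\t900\t.\t+\t.\tID=g1\n'
  printf 'chr1\ttoy\tmRNA\t1\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n'
  printf 'chr1\ttoy\tCDS\t1\t900\t.\t+\t0\tID=g1.t1.cds;Parent=g1.t1\n'
  printf 'chr1\ttoy\tgene\t2000\t2600\t.\t-\t.\tID=g2\n'
  printf 'chr1\ttoy\tmRNA\t2000\t2600\t.\t-\t.\tID=g2.t1;Parent=g2\n'
  printf 'chr1\ttoy\tCDS\t2000\t2600\t.\t-\t0\tID=g2.t1.cds;Parent=g2.t1\n'
} > toy.gff
printf 'g2.t1\n' > toy_remove.id

# Drop every feature whose ID or Parent attribute is in the removal list.
awk 'NR == FNR { drop[$1]; next }
     /^#/ { print; next }
     {
       split($0, col, "\t"); n = split(col[9], attr, ";"); hit = 0
       for (i = 1; i <= n; i++)
         if (attr[i] ~ /^(ID|Parent)=/) {
           v = attr[i]; sub(/^[^=]*=/, "", v)
           if (v in drop) hit = 1
         }
       if (!hit) print
     }' toy_remove.id toy.gff > toy_clean.gff

# Show feature type and attributes of what remains.
grep -v '^#' toy_clean.gff | cut -f3,9
```

Note that the gene feature `g2` survives because only its transcript was listed; if a gene loses all of its transcripts this way, the orphaned gene line needs to be removed in a further pass.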

Important Notes

Treat OMArk results as a guide rather than a definitive verdict. The assessment depends on the existing evolutionary database, LUCA.h5, which may introduce biases, particularly for less-studied insect lineages. Proteins labeled Contamination or Inconsistent are not always true errors, so inspect the flagged sequences in the context of your own study insect before removing them.
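
One way to review flagged proteins before deleting them is to pull them into a separate FASTA file and look at them directly. The sketch below does this on toy data; the file names, IDs, and sequences are invented, and sequence length is just one simple property to inspect.

```shell
# Toy proteome and list of flagged IDs (hypothetical).
cat > review_toy.aa <<'EOF'
>geneA
MKTAYIAK
>geneC putative transposase
MSHHWGYG
KHNGPEHW
>geneD
MADEEKLP
EOF
printf 'geneC\ngeneD\n' > review_toy.id

# Keep only the listed records (the inverse of a removal filter).
awk 'NR == FNR { want[$1]; next }
     /^>/ { id = substr($1, 2); keep = (id in want) }
     keep' review_toy.id review_toy.aa > review_flagged.aa

# Report the length of each flagged protein, summing multi-line sequences.
awk '/^>/ { if (id != "") print id, len; id = substr($1, 2); len = 0; next }
     { len += length($0) }
     END { if (id != "") print id, len }' review_flagged.aa > review_lengths.txt

cat review_lengths.txt
```

The extracted file can then be searched against a database of your choice or compared with homologs from related species to decide whether each flag is justified.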

References