1 SAAGA: Summarise, Annotate & Assess Genome Annotations

SAAGA is a tool for summarising, annotating and assessing genome annotations, with a particular focus on annotation generated by GeMoMa. The core of SAAGA is reciprocal MMeqs searches of the annotation and reference proteomes. These are used to identify the best hits for protein product identification and to assess annotations based on query and hit coverage. SAAGA will also generate annotation summary statistics, and extract the longest protein from each gene for a representative non-redundant proteome (e.g. for BUSCO analysis).

Please note that SAAGA is still in development and documentation is currently a bit sparse.

The different run modes are set using a set of mode=T/F flags (or simply adding the run mode to the command):

  • assess = Assess annotation using reference annotation (e.g. a reference organism proteome)
  • annotate = Rename annotation using reference annotation (could be Swissprot)
  • longest = Extract the longest protein per gene
  • mmseq = Run the mmseq2 steps in preparation for further analysis
  • summarise = Summarise annotation from a GFF file

See https://slimsuite.github.io/saaga/ for details of each mode. General SLiMSuite run documentation can be found at https://github.com/slimsuite/SLiMSuite.

SAAGA is available as part of SLiMSuite, or via a standalone GitHub repo at https://github.com/slimsuite/saaga.


2 Running SAAGA

SAAGA is written in Python 2.x and can be run directly from the commandline:

python $CODEPATH/diploidocus.py [OPTIONS]

If running as part of SLiMSuite, $CODEPATH will be the SLiMSuite tools/ directory. If running from the standalone SAAGA git repo, $CODEPATH will be the path the to code/ directory. Please see details in the SAAGA git repo for running on example data.

For assess, annotate and mmseq modes, MMseqs2 must be installed and either added to the environment $PATH.

2.1 Commandline options

A list of commandline options can be generated at run-time using the -h or help flags. Please see the general SLiMSuite documentation for details of how to use commandline options, including setting default values with INI files.

### ~ Input/Output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE      : Protein annotation file to assess [annotation.faa]
gffin=FILE      : Protein annotation GFF file [annotation.gff]
cdsin=FILE      : Optional transcript annotation file for renaming and/or longest isoform extraction [annotation.fna]
refprot=FILE    : Reference proteome for mapping data onto [refproteome.fasta]
refdb=FILE      : Reference proteome MMseq2 database (over-rule mmseqdb path) []
mmseqdb=PATH    : Directory in which to find/create mmseqs2 databases [./mmseqdb/]
mmsearch=PATH   : Directory in which to find/create mmseqs2 databases [./mmsearch/]
basefile=X      : Prefix for output files [$SEQBASE.$REFBASE]
gffgene=X       : Label for GFF gene feature type ['gene']
gffcds=X        : Label for GFF CDS feature type ['CDS']
gffmrna=X       : Label for GFF mRNA feature type ['mRNA']
### ~ Run mode options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
annotate=T/F    : Rename annotation using reference annotation (could be Swissprot) [False]
assess=T/F      : Assess annotation using reference annotation [False]
longest=T/F     : Extract longest protein per gene into *.longest.faa [False]
mmseqs=T/F      : Run the mmseq2 steps in preparation for further analysis [True]
summarise=T/F   : Summarise annotation from GFF file [True]
dochtml=T/F     : Generate HTML SAAGA documentation (*.docs.html) instead of main run [False]
### ~ Search and filter options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
tophits=INT     : Restrict mmseqs hits to the top X hits [250]
minglobid=PERC  : Minimum global query percentage identity for a hit to be kept [40.0]
### ~ Precomputed MMSeq2 options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
mmqrymap=TSV    : Tab-delimited output for query versus reference search (see docs) [$SEQBASE.$REFBASE.mmseq.tsv]
mmhitmap=TSV    : Tab-delimited output for reference versus query search (see docs) [$REFBASE.$SEQBASE.mmseq.tsv]
### ~ Batch Run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
batchseq=FILELIST   : List of seqin=FILE annotation proteomes for comparison
batchref=FILELIST   : List of refprot=FILE reference proteomes for comparison
### ~ System options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X         : Number of parallel sequences to process at once [0]
killforks=X     : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X     : Sleep time (seconds) between cycles of forking out more process [0]
tmpdir=PATH     : Temporary directory path for running mmseqs2 [./tmp/]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

3 SAAGA Overview

SAAGA is based on MMseqs mapping of a predicted annotation proteome against a reference proteome. For assessment, this should be a high quality annotated proteome. For annotation, this should be a set of curated proteins such as SwissProt.

3.1 Setup

Unless mmseqs is the only active run mode, the seqin=FILE predicted annotation proteome is loaded and summarised. If cdsin=FILE is given and longest, annotate or summarise are active, the predicted annotation transcriptome will also be summarised. In assess mode, the reference proteome will also be loaded and summarised. If the reference proteome has more than 100,000 proteins, the option to exit will be given.

Unless mmseqs is the only active run mode, the gffin=FILE GFF file will also be loaded and processed. Predicted genes, transcripts and exons are extracted based on the feature type. By default, SAAGA expects GeMoMa annotation with gene, mRNA and CDS feature types for genes, transcripts and exons, respectively. If mRNA features are not found, prediction features will be parsed as transcripts. These can be over-ridden with gffgene=X, gffmrna=X and gffcds=X.

Protein names in the seqin=FILE are mapped onto the ID=X identifiers for transcripts in the GFF file. Transcript Parent=X identifiers should map onto gene ID=X identifiers, and CDS Parent=X identifiers should map onto transcript ID=X identifiers. If mapping is incomplete, the user will be warned and given the option to exit.

If seqin=FILE protein names and transcript IDs do not match, https://github.com/gpertea/gffread can be used to generate the protein fasta file:

gffread -y $PROTEIN_FASTA -g $GENOME $GFF

3.1.1 MMseqs Searches

Unless the MMseq2 runs are already generated (or force=T), mmseqs createdb will be run on the annotation and reference proteomes. Following this, mmseqs search will be run using the tmpdir=PATH temporary directory (default, ./tmp/). If assess or annotate, the reciprocal search of reference proteome versus predicted proteome is also executed. Searches are reduced to the top X hits (tophits=INT, default 250) using mmseqs filterdb and then tabulated mmseqs convertalis.

For the annotation versus reference search, the following fields are output:

query,target,evalue,raw,alnlen,pident,qlen,tlen,qstart,qend,qcov,tstart,tend,tcov,theader

For the reference versus annotation search, the following fields are output:

query,target,evalue,raw,alnlen,pident,qlen,tlen,qstart,qend,qcov,tstart,tend,tcov

Unless mmseqs is the only active run mode, SAAGA will exit at this point. Otherwise, if multiple hits for the same query-hit pair exist, they are ranked by the raw score (big to small).

3.1.2 Data integration

After loading the input data and running MMseq2, the data is integrated and tidied.

The exon (CDS) table is indexed on parent, start and end, and a new exonlen field added (end-start+1). This table is then collapsed by parent transcript id, keeping the smallest start position, biggest end, count of exons and summed exonlen for each transcript.

For transcript/protein annotation, a copy of the annotation versus reference mmseqs output is made, reduced to the fields:

query,target,raw,tcov,pident,theader

The gene and transcripts tables are indexed on id for mapping onto other data.


4 SAAGA Outputs

In addition to outputs generated by mmseqs, the main SAAGA outputs are:

  • *.log = the main SAAGA log file containing
  • *.sys.log = logging of mmseqs runs.
  • *.gene.tdt = summary information per annotated gene
  • *.proteins.tdt = summary information per annotated protein
  • *.refprot.tdt = summary information per reference protein (Assessment mode)
  • *.stats.tdt= summary statistics for full annotation

The fields for the main tables are given below. Details will be added in a future release. Please contact the author in the meantime if anything is not clear.

4.1 Gene table [*.gene.tdt]

  • locus = Sequence name from assembly file
  • source = GFF source
  • start = Start position
  • end = End position
  • strand = Strand
  • geneinfo = Information parsed from GFF
  • name = Annotation gene name
  • geneid = Annotation gene ID (should map onto transcript Parent identifiers)
  • isoforms = Number of transcripts
  • maxprotlen = Maximum protein length
  • longest = Transcript ID for longest protein
  • isoinfo = Parsed GFF information for longest protein isoform

4.2 Protein table [*.proteins.tdt]

  • protname = Protein sequence name
  • protdesc = Protein sequence description
  • accnum = Protein sequence accession number. Should map to Transcript ID.
  • protlen = Protein length (aa)
  • exons = No. exons
  • exonlen = Summed length of exons
  • geneid = Parent Gene ID (geneid)
  • locus = Location sequence name from assembly
  • start = Start position (transcript)
  • end = End position (transcript)
  • strand = Strand
  • attributes = Parsed GFF attributes
  • bestref = Best reference protein from mmseq search
  • protcov = Coverage of protein by bestref hit (0-1)
  • refcov = Coverage of bestref protein by mmseq hit (0-1)
  • protratio = Annotated protein length / bestref protein length
  • lendiff = Annotated protein length - bestref protein length
  • alnlen = Length of alignment
  • pident = Percentage identity of hit
  • globid = Global percentage identity of annotated protein
  • hitnum = Number of hits in mmseq search
  • rbh = Whether a reciprocal best hit (1/0)
  • f1 = F1 score = 2 x Pr x Recall / (Pr + Recall) = 2 x protcov x refcov / (protcov + refcov)

4.3 Reference protein table [*.refprot.tdt]

  • refprot = Reference protein name
  • bestprot = Top hit annotated protein
  • alnlen = Length of alignment
  • pident = Percentage identity of hit
  • reflen = Length of reference protein
  • refcov = Coverage of reference protein
  • protcov = Coverage of bestprot
  • refdesc = Description of reference protein
  • f1 = F1 score = 2 x Pr x Recall / (Pr + Recall) = 2 x refcov x protcov / (refcov + protcov)

4.4 Summary statistics table

  • seqin = Input annotation proteome
  • refdb = Reference database
  • genes = Number of genes
  • isoforms = Number of transcripts
  • exons = Mean exons per gene
  • exonlen = Mean combined exon length
  • protlen = Mean protein length
  • completeness = Summed coverage of reference proteome (%)
  • purity = Summed reference coverage of annotated proteome (%)
  • homology = Percentage of genes with any hit in reference
  • orthology = Percentage of genes with reciprocal best hits
  • protratio_mean = Mean protein length ratio (only proteins with hits)
  • protratio_median = Median protein length ratio (only proteins with hits)
  • protratio_sd = StdDev of protein length ratio (only proteins with hits)
  • duplicity = Mean number of annotated genes sharing the same reference protein bestref
  • compression = Number of unique bestprot annotated genes / number of reference proteins with hit
  • multiplicity = Total no. Qry Genes / Total no Ref proteins
  • f1 = Combined F1 across all query genes
  • f1_hits = Combined F1 across all query genes with hits
  • mean_f1 = Mean F1 across all query genes

5 SAAGA run modes

Details for the main SAAGA run modes will be added below.

NOTE: SAAGA is under development and documentation might be a bit sparse. Please contact the author or post an issue on GitHub if you have any questions.


5.1 Annotation assessment mode [assess=T]

This mode compares the predicted protein sequences from an annotation to a reference proteome and asseses its quality and completeness.

Details to follow.

NOTE: SAAGA is still under development. MMseqs2 stringency settings have not yet been optimised for performance.


5.2 Annotation annotation mode [annotate=T]

Based on MAKER2 renaming, this mode will use the top hit to reference proteins (e.g. SwissProt) to add descriptions to predicted gene and proteins.

Details to follow.


5.3 Longest protein mode [longest=T]

This will extract the longest protein per gene, e.g. for reduced Duplicated ratings in BUSCO completeness estimates.

Details to follow.


5.4 Annotation summarise mode [summarise=T]

This mode will summarise the annotation from a GFF file. This is also executed as part of the assess mode.

Details to follow.


5.5 MMSeq2 preparation mode [mmseq=T]

This run the mmseq2 steps in preparation for further analysis. It is primarily for debugging or when runs need to be split over multiple systems.

Details to follow.



© 2020 Richard Edwards |