SAAGA is a tool for summarising, annotating and assessing genome annotations, with a particular focus on annotation generated by GeMoMa. The core of SAAGA is reciprocal MMeqs searches of the annotation and reference proteomes. These are used to identify the best hits for protein product identification and to assess annotations based on query and hit coverage. SAAGA will also generate annotation summary statistics, and extract the longest protein from each gene for a representative non-redundant proteome (e.g. for BUSCO analysis).
Please note that SAAGA is still in development and documentation is currently a bit sparse.
The different run modes are set using a set of mode=T/F
flags (or simply adding the run mode to the command):
assess
= Assess annotation using reference annotation (e.g. a reference organism proteome)annotate
= Rename annotation using reference annotation (could be Swissprot)longest
= Extract the longest protein per genemmseq
= Run the mmseq2 steps in preparation for further analysissummarise
= Summarise annotation from a GFF fileSee https://slimsuite.github.io/saaga/ for details of each mode. General SLiMSuite run documentation can be found at https://github.com/slimsuite/SLiMSuite.
SAAGA is available as part of SLiMSuite, or via a standalone GitHub repo at https://github.com/slimsuite/saaga.
SAAGA is written in Python 2.x and can be run directly from the commandline:
python $CODEPATH/diploidocus.py [OPTIONS]
If running as part of SLiMSuite, $CODEPATH
will be the SLiMSuite tools/
directory. If running from the standalone SAAGA git repo, $CODEPATH
will be the path the to code/
directory. Please see details in the SAAGA git repo for running on example data.
For assess
, annotate
and mmseq
modes, MMseqs2 must be installed and either added to the environment $PATH
.
A list of commandline options can be generated at run-time using the -h
or help
flags. Please see the general SLiMSuite documentation for details of how to use commandline options, including setting default values with INI files.
### ~ Input/Output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE : Protein annotation file to assess [annotation.faa]
gffin=FILE : Protein annotation GFF file [annotation.gff]
cdsin=FILE : Optional transcript annotation file for renaming and/or longest isoform extraction [annotation.fna]
refprot=FILE : Reference proteome for mapping data onto [refproteome.fasta]
refdb=FILE : Reference proteome MMseq2 database (over-rule mmseqdb path) []
mmseqdb=PATH : Directory in which to find/create mmseqs2 databases [./mmseqdb/]
mmsearch=PATH : Directory in which to find/create mmseqs2 databases [./mmsearch/]
basefile=X : Prefix for output files [$SEQBASE.$REFBASE]
gffgene=X : Label for GFF gene feature type ['gene']
gffcds=X : Label for GFF CDS feature type ['CDS']
gffmrna=X : Label for GFF mRNA feature type ['mRNA']
### ~ Run mode options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
annotate=T/F : Rename annotation using reference annotation (could be Swissprot) [False]
assess=T/F : Assess annotation using reference annotation [False]
longest=T/F : Extract longest protein per gene into *.longest.faa [False]
mmseqs=T/F : Run the mmseq2 steps in preparation for further analysis [True]
summarise=T/F : Summarise annotation from GFF file [True]
dochtml=T/F : Generate HTML SAAGA documentation (*.docs.html) instead of main run [False]
### ~ Search and filter options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
tophits=INT : Restrict mmseqs hits to the top X hits [250]
minglobid=PERC : Minimum global query percentage identity for a hit to be kept [40.0]
### ~ Precomputed MMSeq2 options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
mmqrymap=TSV : Tab-delimited output for query versus reference search (see docs) [$SEQBASE.$REFBASE.mmseq.tsv]
mmhitmap=TSV : Tab-delimited output for reference versus query search (see docs) [$REFBASE.$SEQBASE.mmseq.tsv]
### ~ Batch Run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
batchseq=FILELIST : List of seqin=FILE annotation proteomes for comparison
batchref=FILELIST : List of refprot=FILE reference proteomes for comparison
### ~ System options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X : Sleep time (seconds) between cycles of forking out more process [0]
tmpdir=PATH : Temporary directory path for running mmseqs2 [./tmp/]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
SAAGA is based on MMseqs mapping of a predicted annotation proteome against a reference proteome. For assessment, this should be a high quality annotated proteome. For annotation, this should be a set of curated proteins such as SwissProt.
Unless mmseqs
is the only active run mode, the seqin=FILE
predicted annotation proteome is loaded and summarised. If cdsin=FILE
is given and longest
, annotate
or summarise
are active, the predicted annotation transcriptome will also be summarised. In assess
mode, the reference proteome will also be loaded and summarised. If the reference proteome has more than 100,000 proteins, the option to exit will be given.
Unless mmseqs
is the only active run mode, the gffin=FILE
GFF file will also be loaded and processed. Predicted genes, transcripts and exons are extracted based on the feature type. By default, SAAGA expects GeMoMa annotation with gene
, mRNA
and CDS
feature types for genes, transcripts and exons, respectively. If mRNA
features are not found, prediction
features will be parsed as transcripts. These can be over-ridden with gffgene=X
, gffmrna=X
and gffcds=X
.
Protein names in the seqin=FILE
are mapped onto the ID=X
identifiers for transcripts in the GFF file. Transcript Parent=X
identifiers should map onto gene ID=X
identifiers, and CDS Parent=X
identifiers should map onto transcript ID=X
identifiers. If mapping is incomplete, the user will be warned and given the option to exit.
If seqin=FILE
protein names and transcript IDs do not match, https://github.com/gpertea/gffread can be used to generate the protein fasta file:
gffread -y $PROTEIN_FASTA -g $GENOME $GFF
Unless the MMseq2 runs are already generated (or force=T
), mmseqs createdb
will be run on the annotation and reference proteomes. Following this, mmseqs search
will be run using the tmpdir=PATH
temporary directory (default, ./tmp/
). If assess
or annotate
, the reciprocal search of reference proteome versus predicted proteome is also executed. Searches are reduced to the top X hits (tophits=INT
, default 250) using mmseqs filterdb
and then tabulated mmseqs convertalis
.
For the annotation versus reference search, the following fields are output:
query,target,evalue,raw,alnlen,pident,qlen,tlen,qstart,qend,qcov,tstart,tend,tcov,theader
For the reference versus annotation search, the following fields are output:
query,target,evalue,raw,alnlen,pident,qlen,tlen,qstart,qend,qcov,tstart,tend,tcov
Unless mmseqs
is the only active run mode, SAAGA will exit at this point. Otherwise, if multiple hits for the same query-hit pair exist, they are ranked by the raw
score (big to small).
After loading the input data and running MMseq2, the data is integrated and tidied.
The exon (CDS) table is indexed on parent
, start
and end
, and a new exonlen
field added (end
-start
+1). This table is then collapsed by parent transcript id
, keeping the smallest start
position, biggest end
, count of exons
and summed exonlen
for each transcript.
For transcript/protein annotation, a copy of the annotation versus reference mmseqs output is made, reduced to the fields:
query,target,raw,tcov,pident,theader
The gene and transcripts tables are indexed on id
for mapping onto other data.
In addition to outputs generated by mmseqs, the main SAAGA outputs are:
*.log
= the main SAAGA log file containing*.sys.log
= logging of mmseqs runs.*.gene.tdt
= summary information per annotated gene*.proteins.tdt
= summary information per annotated protein*.refprot.tdt
= summary information per reference protein (Assessment mode)*.stats.tdt
= summary statistics for full annotationThe fields for the main tables are given below. Details will be added in a future release. Please contact the author in the meantime if anything is not clear.
locus
= Sequence name from assembly filesource
= GFF sourcestart
= Start positionend
= End positionstrand
= Strandgeneinfo
= Information parsed from GFFname
= Annotation gene namegeneid
= Annotation gene ID (should map onto transcript Parent identifiers)isoforms
= Number of transcriptsmaxprotlen
= Maximum protein lengthlongest
= Transcript ID for longest proteinisoinfo
= Parsed GFF information for longest protein isoformprotname
= Protein sequence nameprotdesc
= Protein sequence descriptionaccnum
= Protein sequence accession number. Should map to Transcript ID.protlen
= Protein length (aa)exons
= No. exonsexonlen
= Summed length of exonsgeneid
= Parent Gene ID (geneid
)locus
= Location sequence name from assemblystart
= Start position (transcript)end
= End position (transcript)strand
= Strandattributes
= Parsed GFF attributesbestref
= Best reference protein from mmseq searchprotcov
= Coverage of protein by bestref
hit (0-1)refcov
= Coverage of bestref
protein by mmseq hit (0-1)protratio
= Annotated protein length / bestref
protein lengthlendiff
= Annotated protein length - bestref
protein lengthalnlen
= Length of alignmentpident
= Percentage identity of hitglobid
= Global percentage identity of annotated proteinhitnum
= Number of hits in mmseq searchrbh
= Whether a reciprocal best hit (1/0)f1
= F1 score = 2 x Pr x Recall / (Pr + Recall) = 2 x protcov x refcov / (protcov + refcov)refprot
= Reference protein namebestprot
= Top hit annotated proteinalnlen
= Length of alignmentpident
= Percentage identity of hitreflen
= Length of reference proteinrefcov
= Coverage of reference proteinprotcov
= Coverage of bestprot
refdesc
= Description of reference proteinf1
= F1 score = 2 x Pr x Recall / (Pr + Recall) = 2 x refcov x protcov / (refcov + protcov)seqin
= Input annotation proteomerefdb
= Reference databasegenes
= Number of genesisoforms
= Number of transcriptsexons
= Mean exons per geneexonlen
= Mean combined exon lengthprotlen
= Mean protein lengthcompleteness
= Summed coverage of reference proteome (%)purity
= Summed reference coverage of annotated proteome (%)homology
= Percentage of genes with any hit in referenceorthology
= Percentage of genes with reciprocal best hitsprotratio_mean
= Mean protein length ratio (only proteins with hits)protratio_median
= Median protein length ratio (only proteins with hits)protratio_sd
= StdDev of protein length ratio (only proteins with hits)duplicity
= Mean number of annotated genes sharing the same reference protein bestref
compression
= Number of unique bestprot
annotated genes / number of reference proteins with hitmultiplicity
= Total no. Qry Genes / Total no Ref proteinsf1
= Combined F1 across all query genesf1_hits
= Combined F1 across all query genes with hitsmean_f1
= Mean F1 across all query genesDetails for the main SAAGA run modes will be added below.
NOTE: SAAGA is under development and documentation might be a bit sparse. Please contact the author or post an issue on GitHub if you have any questions.
This mode compares the predicted protein sequences from an annotation to a reference proteome and asseses its quality and completeness.
Details to follow.
NOTE: SAAGA is still under development. MMseqs2 stringency settings have not yet been optimised for performance.
Based on MAKER2 renaming, this mode will use the top hit to reference proteins (e.g. SwissProt) to add descriptions to predicted gene and proteins.
Details to follow.
This will extract the longest protein per gene, e.g. for reduced Duplicated
ratings in BUSCO completeness estimates.
Details to follow.
This mode will summarise the annotation from a GFF file. This is also executed as part of the assess
mode.
Details to follow.
This run the mmseq2 steps in preparation for further analysis. It is primarily for debugging or when runs need to be split over multiple systems.
Details to follow.
© 2020 Richard Edwards | richard.edwards@unsw.edu.au