SAAGA can execute one or more different run modes that will generate
different subsets of the main outputs. At its core, SAAGA (v0.7.x+) has
four different underlying motivations:
Annotation assessment
mode [assess=T]
This mode compares the predicted protein sequences from an annotation
to a reference proteome and asseses its quality and completeness. SAAGA
performs a reciprocal MMseqs2 search against refdb=FILE
,
which should be a high-quality non-redundant reference proteome. The
best hits are identified and then used to calculate coverage ratios
between query and hit proteins as a means of annotation assessment. In
general, metrics will be closer to 1 (or 100%) for complete annotations
and assemblies without duplications. Although the maximum achievable
value for these metrics will generally be unknown, comparative values
can be used to assess improvement in assembly and/or annotation, akin to
BUSCO scores.
The main assessment statistics generated by SAAGA are:
Protein length ratio. (protratio
)
The length ratio of the annotated proteins versus its top reference hit.
Long-read genome assemblies are prone to indel errors, whereas
short-read assemblies tend to be fragmented. In each case, protein
annotations can be truncated or fragmented. SAAGA uses a protein length
ratio to assess the extent of this problem. Ideally, annotated protein
will, on average, be approximately the same length as orthologous
proteins in a high-quality reference proteome. If the protein length
ratio is heavily skewed below 1.0, this will indicate a problem with
truncated and/or fragmented protein annotations. (This can also be cause
by incorrect annotation settings that miss long introns, for example.)
proratio
captures the best value per gene, which is
reported as a mean, median and std dev for the whole
annotation.
F1 score. (f1
, f1_mean
and f1_hits
). F1 extends the protein length ratio to an
annotation consistency metric calculated using the formula: (2 X PROTCOV
X REFCOV) / (PROTCOV + REFCOV) where PROTCOV is the proportion of the
annotated protein covered by its best reference protein hit, and REFCOV
is the proportion of the best reference protein hit covered by the
annotated protein. For the proteome, f1
is the sum of the
proteome coverage, whereas f1_mean
is the mean
f1
per gene and f1_hit
is the mean for the
subset of proteins with a reference hit. (The former evaluates
annotation completeness, whereas the latter evaluates the general
quality of the indiviudal annotations.) As with protratio
,
f1
values close to 1 mean that the query protein closely
matches the length of the hit protein, indicating high fidelity of the
gene prediction model and underlying assembly.
Completeness. (completeness
) The
summed percentage coverage of reference proteome by the annotated
proteome. This is checking for “missing” proteins. Unless
refdb=FILE
represents the same species, (as with other
genome completeness metrics) it is unlikely that the theoretical maximum
is 100%. Nevertheless, assembly and/or annotation errors should be more
likely to reduced completeness, making it a useful comparative
statistic.
Purity. (purity
) The summed
percentage reference coverage of the annotated proteome, i.e. the
reciprocal of completeness. This is checking for “extra” proteins, which
may be indicative of either contamination (check with
taxonomy
mode) or false positive gene predictions. Note
that an incomplete or divergent reference proteome will also result in
low purity. As with completeness, it is unlikely that the theoretical
maximum is 100%, but it should be a useful comparative
statistic.
Homology. (homology
) The percentage
of annotated genes with any hit in reference. As with
purity
, this statistic gives an indication of contamination
or false predictions, but without requiring good coverage of individudal
genes.
Orthology. (orthology
) The
percentage of annotated genes with reciprocal best hits in the reference
proteome. This should increase as assembly/annotation redundancy and
duplication is decreased. As with completeness
and
purity
, it is very unlikely to reach 100% due to
lineage-specific duplications and deletions.
Duplicity. (duplicity
) The mean
number of annotated genes sharing the same best reference hit. This is
somewhat analogous to the “Duplicated” part of the BUSCO score, but does
not enforce an minimum coverage cutoffs. It can be useful for assessing
the success of purging haplotigs, for example.
Compression. (compression
) The
number of unique annotated genes that were the top hit for reference
proteins, divided by the total number of reference proteins with a hit.
This is the inverse of duplicity
and big deviations from
100% can indicate either redundancy in the reference proteome, or
missing members of gene families.
Multiplicity. (multiplicity
) The
ratio of total number of annotated genes to reference proteins. This
gives a broad ballpark indication of the completeness and stringency of
the genome annotation. Big deviations from 1 need some explanation,
whether that is genome/annotation incompleteness (under 1) or an excess
of low quality annotations and/or duplications (over 1).
NOTE: MMseqs2 stringency settings have not yet been
optimised for performance. Results of assess
mode should be
used primarily for comparisons between annotation, rather than treated
as an absolute truth in terms of completeness etc.
NOTE: To use SAAGA for assembly assessment (rather
than annotation assessment), the rapid homology-based gene
prediction program GEMOMA is recommended to generate a draft
annotation.
Annotation annotation
mode [annotate=T]
Based on MAKER2 renaming, this mode will use the top hit to reference
proteins (e.g. SwissProt) to add descriptions to predicted gene and
proteins. Currently, annotate
mode is explicitly designed
to work with Uniprot format sequences, and will parse the description,
organism (OS=
) and gene (GN=
where available)
for the top reference hit. Proteins will then be renamed:
Similar to $GENE: $DESCRIPTION [$ORGANISM]
or:
Similar to $DESCRIPTION [$ORGANISM]
If no OS=
tag is found, renaming will be simpler:
Similar to $DESCRIPTION
In each case, the description is appended with coverage and homology
details, in the form (X% cov @Y%id)
, as generated from the
refcov
and pident
statistics from MMseqs2.
Protein sequences with updated descriptions will be output to
*.renamed.faa
. If cdsin=FILE
was provided, the
corresponding transcripts will be output to
*.renamed.fna
.
Summarise taxonomic
assignments for contamination assessments [taxonomy=T]
Taxonomy mode combines the MMseqs2 easy-taxonomy
with
GFF parsing to perform taxonomic analysis of the input proteome and any
subsets given by taxsubsets=LIST
. Taxonomic assignments are
mapped onto genes as well as assembly scaffolds and (if
assembly=FILE
is given) contigs.
The first step is to run MMseqs2:
mmseqs easy-taxonomy $PROTEOME $TAXDB $TAXBASE $TMPDIR
Where $PROTEOME
is the proteome provided with
seqin=FILE
, $TAXDB
is a MMseqs2 taxonomic
database (see below for creation), provided with
taxdb=FILE
, $TAXBASE
is the
easy-taxonomy
output prefix, and $TMPDIR
is
the temporary directory (default tmp
). If pre-existing
results exist ($TAXBASE._report
and
$TAXBASE_lca.tsv
) then these will be loaded, unless
force=T
is set. If MMseqs2 is not installed, pre-computed
results must be provided. In principle, report
and
lca.tsv
files generate by other tools should work as long
as the format is the same.
The core of taxonomy mode is the MMSeqs2 “Lowest Common Ancestor”
(LCA) assignment, in which each sequence is associated with the lowest
unabmigious taxonomic rank possible. Where amibiguity exists, a sequence
will be assigned to a higher level. Higher levels also receive all the
taxonomic assignments of their daughter taxa, and so the sequence count
for any given taxonomic group will always be equal or greater than its
lower subdivisions. Conceptually, SAAGA separates out the counts into
taxnum
, which are counts at that level or below, and
taxpure
, which are the numbers assigned specifically to
that level. (i.e. taxnum
will be the sum of
taxpure
for that taxonomic group and all lower divisions.)
See the MMseqs2 documentation for more details.
Taxonomy
overview
SAAGA will first read in the *_report
file to build its
internal taxonomy tree for the samples. By default, mmseqs will report
all possible taxonomic levels, and SAAGA will retain the following:
species, species subgroup, species group, subgenus, genus, subtribe, tribe, subfamily, family, superfamily, parvorder, infraorder, suborder, order, superorder, infraclass, subclass, class, superclass, subphylum, phylum, superphylum, subkingdom, kingdom, superkingdom
This can be reduced further by specifying a subset of taxonomic
levels of interest with taxlevels=LIST
. Any missing levels,
along with “no rank” or “clade” taxa (except unclassified
,
root
, and cellular organisms
), will be mapped
to the next highest taxonomic level. Any MMseqs2 assignments to that
level will be transferred to the higher level. Any taxa failing to meet
the mintaxnum=INT
threshold (default=2) will also be mapped
onto higher levels.
Next, the *_lca.tsv
file is read and mapped onto the
gffin=FILE
GFF file to assign proteins to genes and
sequences. The lowest-level hit for each gene will be kept, remapping to
taxlevels
as required. These collated ratings will be
output to *.lca_genes.tsv
and *.lca_genes.gff
Gene ratings are then summed for each assembly sequence, and the
dominant classification for each taxonomic level established for (a)
each sequence, and (b) the whole dataset. Full collated ratings will be
output to *.taxolotl_report.tsv
. Ratings per sequence are
output to *.taxbyseq.tsv
. Dominant taxa are reported in the
log file as #BEST
entries.
To flag contamination, each sequence is assessed against the dominant
taxonomic rating at each taxonomic level. The percentage of genes
matching each dominant rating is reported for each sequence in
*.taxolotl.tsv
along with the number of genes with a rating
at that level, separated with a |
. This will exclude any
genes without ratings at that taxonomic level. A
:consensus:
entry will also report the overall values for
the whole assembly.
Any sequences that have a dominant taxonomic label deviating from the
overall consensus at any ranking levels set by
taxwarnrank=X
(default family) or above will raise a
contamination warning and be output in the log file with a
#BADTAX
rating. These sequences will have their dominant
taxon and it’s precentage appended to the consensus percentage, also
separated by |
. For example,
25.00|20|Chordata|50.00
would indicate that 25% of the 20
genes with ratings at that level matched the consensus, whilst the
dominant classification was Chordata
with 50% of 20 rated
genes assigned to this category. Such sequences will also have
badtax
rating in the rating
field of
*.taxolotl.tsv
. Sequences matching the dominant taxa will
have a goodtax
rating, whilst sequences without any genes
mapped onto taxa by MMseqs2 will be rated notax
.
Good, Bad and missing sequence counts will be summarised in the log
file in #BEST
, BADTAX
, and #NOTAX
entries. Sequence subsets are output to *.id
and
*.fasta
files, and summarised along with the full assembly
in *.seqsummary.tsv
. (Any ratings without sequences will
not be output/summarised.) If assembly=FILE
is provided,
sequences without genes will also be summarised. Taxonomy ratings for
these subsets are also output to
*.$RATING.taxolotl_report.tsv
files. Any sequence subsets
provided by taxsubsets=LIST
(see below) will also be
summarised in *.$SUBSET.taxolotl_report.tsv
files. It is
recommended that all the MMseqs2 _report
file is loaded
with all the *.taxolotl_report.tsv
for visualisation with
Pavian (Breitwieser
FP and Salzberg SL (2020) Bioinformatics
36(4):1303-1304) through its Shiny App.
Finally, if assembly=FILE
is provided (unless
taxbycontig=F
), contigs will be extracted by splitting
scaffolds on mingap=INT
(default 10) consecutive
N
s. Genes will be remapped onto contigs as with sequences,
and taxonomic ratings output to *.taxbyctg.tsv
and
*.ctgtaxolotl.tsv
. These are the contig equivalents of
*.taxbyseq.tsv
and *.taxolotl.tsv
. Contigs
without taxonomic ratings will be listed in the log file with
#BADTAX
entries, unless already reported as an assembly
sequence.
Main taxonomy
outputs
Outputs will be given a file prefix set by taxbase=X
. By
default, this will be $SEQBASE.$TAXADB
, where
$SEQBASE
is the basename of seqin=FILE
and
$TAXADB
is the taxonomy database set by
taxdb=FILE
.
The main mmseqs easy-taxonomy
output will generate:
*_lca.tsv
= best assignments per protein sequence
(protein, taxid, rank, taxname): required.
*_report
= text summary of overall taxonomy that can be
loaded by Pavian etc.: required.
*_tophit_aln
= top database hits for each protein (not
currently used): not required.
*_tophit_report
= taxonomic classification of the top
hit proteins: not required.
In addition, Taxolotl will output:
*.taxbyseq.tsv
= Rating counts for each taxonomic group
by assembly sequence (scaffold).
*.taxolotl_report.tsv
= Collated Kraken-style report
file.
*.lca_genes.tsv
= Best assignments (lowest taxonomic
level) for each gene.
*.lca_genes.gff
= GFF file with Taxolotly ratings for
each gene.
*.taxolotl.tsv
= Tab separated file with consensus
taxonomic assignment at each taxonomic rank, and ratings per
sequence.
*.$SUBSET.id
= Sequence identifiers for assembly
subsets based on Taxolotl ratings.
*.$SUBSET.fasta
= Fasta files of assembly subsets based
on Taxolotl ratings.
*.seqsummary.tsv
= Summary statistics for assembly
subset fasta files.
*.taxbyctg.tsv
= Rating counts for each taxonomic group
by assembly contig.
*.ctgtaxolotl.tsv
= Taxolotl ratings by assembly
contig.
Taxonomy by
sequence output
If taxbyseq=T
then an addition
*.taxbyseq.tsv
file will be produced, with the following
fields:
seqname
= assembly sequence name
genenum
= number of genes parsed for that sequence
protnum
= number of proteins parsed for that
sequence
rank
= taxonomic rank of rating
genetax
= number of genes with assignment at that
level
taxid
= taxonomic label identifier number
taxname
= taxonomic label name at that rank
taxperc
= percentage assignment to this rank or
lower
taxnum
= number of genes assigned to this rank or
lower
taxpure
= number of genes assigned to this rank
specifically
Sequence subset
analysis
In addition to the main output for the whole proteome, any subsets
given by taxsubsets=LIST
will have their own
*.taxolotl_report.tsv
file, which can be visualised with
Pavian. These must be lists of IDs that match the assembly sequence
names in the GFF file. Subsets will be named after the subset file
prefix, e.g. assembly.suspect.id
would generate
*.assembly.suspect.taxolotl_report.tsv
.
Generating a
taxonomic database
Please see the MMseqs2 documentation for generating a taxonomic
database. To date, Taxolotl has been tested with taxonomy databases
generated from NCBI nr, using BLAST+ and MMSeqs2 and the NCBI taxonomy
dump (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz):
blastdbcmd -db $NCBIPATH/nr -entry all > ncbinr.faa
blastdbcmd -db $NCBIPATH/nr -entry all -outfmt "%a %T" > ncbinr.faa.taxidmapping
mmseqs createdb ncbinr.faa ncbinr.faaDB
mmseqs createtaxdb ncbinr.faaDB tmp --ncbi-tax-dump taxonomy/ --tax-mapping-file ncbinr.faa.taxidmapping
mmseqs createindex ncbinr.faaDB tmp
If the assembly is already in RefSeq, it is recommended that the taxa
of the assembly is removed before running Taxolotl, e.g.:
mmseqs filtertaxseqdb ncbinr.faaDB seqTaxNoQueryDB --taxon-list '!178133,!38626'
If getting an error that the *.dmp
files are missing,
these can be added with soft links from the taxonomy/
directory containing the NCBI taxonomy dump.
Simple ORF
mode
If no proteins are given, ORFs will be generated by
SeqSuite
with default settings
minorf=100 rftran=6 terminorf=50 orfgaps=F
, i.e. ORFs of
100+ amino acids from all six reading frames, or 50+ amino acids if
truncated at the end of a sequence. ORFs will not span assembly gaps,
and any ambiguous (X
) translations will be replaced with
stop codons (*
), unless orfgaps=T
is set. Note
that, due to introns, it is expected that these ORFs will often
represent partial coding sequences, and many will be random junk
translations.
The idea of ORF mode is to provide a quick, crude impression of the
taxonomic profile. However, for large assemblies it can be very slow to
process.
In ORF mode, each ORF is assumed to represent a different gene,
although this may not be the case. Currently, SeqSuite
will
not generate a GFF file for the ORFs. As a result, the
taxbycontig
output is not available.
© 2021 Richard Edwards | richard.edwards@unsw.edu.au