BUSCOMP is designed to overcome some of the non-deterministic limitations of BUSCO to:
For each BUSCO gene, BUSCOMP will extract the best “Single Complete”
sequence from those available, using the full_table_*.tsv
results table and single_copy_busco_sequences/
directory of
hit sequences. BUSCOMP ranks all the hits across all assemblies by Score
and keeps the top-ranking hits. Ties are then resolved by Length,
keeping the longest sequence. Ties for Score and Length will keep an
arbitrary entry as the winner. Single complete hits are given preference
over Duplicate hits, even if they have a lower score, because only
Single hit have their sequences saved by BUSCO in the
single_copy_busco_sequences/
directory. This set of
predicted gene sequences forms the “BUSCOMPSeq” gene set.
BUSCOMP uses minimap2 to map BUSCOSeq predicted CDS sequences onto genome/transcriptome assemblies, including those not included in the original BUSCO compilation. This way, the compiled set of species-specific BUSCO sequences can also be used to generate a quick-and-dirty assessment of completeness for a new genome assembly. Hits are converted into percentage coverage stats, which are then used to reclassify the BUSCO gene on the basis of coverage and identity. BUSCOMP ratings are designed to mimic the original BUSCO ratings but have different definitions. In addition, two extra classes of low quality hit have been added: “Partial” and “Ghost”.
NOTE: Duplicated
genes are explicitly
those with hits on two different contigs/scaffolds. BUSCOMP is
focused on identifying the coding potential of an assembly rather than a
detailed completeness assessment. Duplicated genes on the same
contig/scaffold will be marked as Complete
.
In addition to individual assembly stats, BUSCO and BUSCOMP ratings
are compiled across user-defined groups of assemblies with various
outputs to give insight into how different assemblies complement each
other. Ratings are also combined with traditional genome assembly
statistics (NG50 and LG50) based on a given genomesize=X
to
help identify the “best” assemblies. Details of settings, key results,
tables and plots are output to an HTML report using Rmarkdown.
NOTE: For HTML output, R must be installed and a pandoc environment variable must be set, e.g.
export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc
NOTE: BUSCOMPSeq sequences can be provided with
buscofas=FILE
in place of compilation. This option has not
been tested and might give some unexpected behaviours, as some of the
quoted figures will still be based on the calculated BUSCOMPSeq data.
Please report any unexpected behaviour.
When using BUSCOMP, please cite the Starling genome paper:
Stuart KC, Edwards RJ et al. (2022). Molecular Ecology 22(8):3141-3160. (doi: 10.1111/1755-0998.13679)
BUSCOMP is written in Python 2.x and can be run directly from the commandline:
python $CODEPATH/buscomp.py [OPTIONS]
If running as part of SLiMSuite,
$CODEPATH
will be the SLiMSuite tools/
directory. If running from the standalone BUSCOMP git repo,
$CODEPATH
will be the path the to code/
directory. Please see details in the BUSCOMP git repo for
running on example data.
For BUSCOMPSeq analysis, minimap2 must be installed
and either added to the environment $PATH
or given to
BUSCOMP with the minimap2=PROG
setting.
A full list of options and the workflow is provided below. The easiest/standard way to run BUSCOMP is to:
fasta/
busco/
, with run_
prefixes
(-o setting
)basefile=$PREFIX
setting to get
$PREFIX.*
outputs, e.g.# Generate a name-matched BUSCO run per genome
cd busco/
for GENOME in ../fasta/*.fasta; do
GENBASE=$( basename ${GENOME/.fasta/} )
busco -o run_$GENBASE -i $GENOME -l $LINEAGE --cpu $PPN -m genome
done
cd ../
# Run BUSCOMP
python $CODEPATH/buscomp.py runs="./busco/run_*" fastadir=./fasta/ basefile=$PREFIX
If you want to run the full BUSCOMP comparison, make sure minimap2 is
installed and give the number of threads to use with
forks=$THREADNUM
. If you only want the BUSCO compilation,
add buscompseq=F
.
A list of commandline options can be generated at run-time using the
-h
or help
flags. Please see the general SLiMSuite
documentation for details of how to use commandline options,
including setting default values with INI files. If
anything is not clear, or for questions how to perform a particular
analysis, please post a question in the GitHub Issues.
### ~ Input/Output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
runs=DIRLIST : List of BUSCO run directories (wildcards allowed) [run_*]
fastadir=DIRLIST: List of directories containing genome fasta files (wildcards allowed) [./]
fastaext=LIST : List of accepted fasta file extensions that will be checked for in fastadir [fasta,fas,fsa,fna,fa]
genomes=FILE : File of Prefix and Genome fields for generate user-friendly output [*.genomes.tdt if found]
restrict=T/F : Restrict analysis to genomes with a loaded alias [False]
runsort=X : Output sorting order for genomes and groups (or "Genome","Prefix","Complete","Group") [Group]
stripnum=T/F : Whether to strip numbers ("XX_*") at start of Genome alias in output [True]
groups=FILE : File of Genome and Group fields to define Groups for compilation [*.groups.tdt]
buscofas=FASFILE: Fasta file of BUSCO DNA sequences. Will combine and make (NR) if not given [None]
metaeukfna=T/F : Perform v5 metaeuk nucleotide busco sequences extraction if missing [True]
buscomp=T/F : Whether to run BUSCO compilation across full results tables [True]
dupbest=T/F : Whether to rate "Duplicated" above "Complete" when compiling "best" BUSCOs across Groups [False]
buscompseq=T/F : Whether to run full BUSCO comparison using buscofas and minimap2 [True]
phylofas=T/F : Generate output of compiled and renamed files for BUSCO-based phylogenomics [False]
ratefas=FILELIST: Additional fasta files of assemblies to rate with BUSCOMPSeq (No BUSCO run) (wildcards allowed) []
rmdreport=T/F : Generate Rmarkdown report and knit into HTML [True]
fullreport=T/F : Generate full Rmarkdown report including detailed tables of all ratings [True]
missing=T/F : Generate summary tables for sets of Missing genes for each assembly/group [True]
dochtml=T/F : Generate HTML BUSCOMP documentation (*.docs.html) instead of main run [False]
summarise=T/F : Include summaries of genomes in main `*.genomes.tdt` output [True]
### ~ Mapping/Classification options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
minimap2=PROG : Full path to run minimap2 [minimap2]
endextend=X : Extend minimap2 hits to end of sequence if query region with X bp of end [0]
minlocid=INT : Minimum percentage identity for aligned chunk to be kept (local %identity) [0]
minloclen=INT : Minimum length for aligned chunk to be kept (local hit length in bp) [1]
uniquehit=T/F : Option to use *.hitunique.tdt table of unique coverage for GABLAM coverage stats [True]
mmsecnum=INT : Max. number of secondary alignments to keep (minimap2 -N) [3]
mmpcut=NUM : Minimap2 Minimal secondary-to-primary score ratio to output secondary mappings (minimap2 -p) [0]
mapopt=CDICT : Dictionary of additional minimap2 options to apply (caution: over-rides conflicting settings) []
### ~ Processing options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X : Sleep time (seconds) between cycles of forking out more process [0]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
STEP 1 is to set up the input data. BUSCOMP will first identify and
check the run directories (given with the runs=DIRLIST
command). This command can either be a list of directories, or an
expandable list with wildcard. A list of individual directories can be
provided as a file if desired.
NOTE: runs=DIRLIST
wants the actual
BUSCO run directories (run_$PREFIX[.busco]/
), not the
parent directory. These directories can optionally have a
.busco
suffix, which will be trimmed off. By default,
BUSCOMP will look for ./run*
.
From these, the run list will be extracted, consisting of a number of
genome $PREFIX
values. It is expected that the
$PREFIX
for each run will match the input genome file,
found in a directory set by fastadir=DIRLIST
. This file can
also match the Genome
alias for that genome (see below). If
you have not used consistent naming across files, please check data has
been loaded and mapped correctly during the Review phase (below).
The presence of a full_$PREFIX[.busco].tsv
table and
single_copy_busco_sequences/
directory within each run
directory will also be checked. If buscocompseq=F
then all
of the full_*.tsv
tables in the runs
directories will be loaded for compilation, but no sequence searching
will be performed. The presence of sequences available for compilation
will be stored in the Sequences
field of
*.genomes.tdt
.
BUSCO v4.x: From v0.9.1
, BUSCOMP should
recognise BUSCO v4 output. Due to the reorganisation and altered naming
strategy of version 4, the naming of results files is more strict. The
main results directory (set by -o
when you run BUSCO)
should match the genome assembly prefix with an optional
.busco
suffix. (A leading run_
is also
permitted and will be ignored. For example, if
assembly.fasta
was analysed with BUSCO v4, BUSCOMP should
be able to parse results generated using -o assembly
,
-o assembly.busco
, -o run_assembly
, or
-o run_assembly.busco
. If none of these settings were used,
the results directory can be manually renamed. For additional sorting, a
XX_ numerical prefix can be used (see below),
e.g. run_01_assembly
will still look for
assembly.*
in the fastadir
path.
BUSCO v5.x From v0.11.0
, BUSCOMP should
recognised BUSCO v5 output. This will be treated like v4 output.
Additional assemblies can rated using the BUSCOMPSeq analysis (see
below) without having BUSCO data analysed. These are given with the
ratefas=LIST
command (wildcards allowed), and will be added
to the main Genomes table after BUSCO compilation has been performed.
The default prefix for these files will be their basename (filename,
minus path and extension). This can be linked to an alias and/or have a
prefix number stripped, as with the run directories. When combined with
buscofas=FASFILE
, this enables the BUSCOMPSeq analysis to
be performed in the absence of any BUSCO results.
Each genome analysed has a “Prefix” that can be used to help sorting
(if runsort=prefix
) and identify relevant BUSCO files where
the genome (Fasta) name and BUSCO run name do not match. It is generally
assumed that this Prefix will match the prefix of the original assembly
file on which BUSCO was run. This Prefix is set once as the data is
loaded and is never changed. (Sorting and visualisations can be made
more informative altered using the Genome (a.k.a. Alias) and Description
fields.)
The assembly Prefix will be set as:
full_table_*.tsv
file is found, the core
(indicated with *
) will be used as the prefix.Genome Aliases will be parsed initially from the loaded data:
Genome
alias, minus the leading “run_
”.full_table_*.tsv
core used for the Prefix
will
also be used for the Genome
alias.Genome
alias.If Genome
contains a trailing .busco
, this
will be stripped. If Genome
starts with preceding numbers
(XX_
), these will be kept for sorting but stripped for
outputs and fasta matching (unless stripnum=F
).
NOTE: If the assembly fasta file does not match the
name used for the BUSCO run file, then the Genome
alias
must be set to the basefile
of the fasta file by
appropriate naming of the run_
directory, i.e. the
run_*
directory should contain a single BUSCO run and the
directory name should match a *.fasta
file (or other
accepted fastaext=LIST
extension) in one of the directories
given by fastadir=DIRLIST
.
An optional alias table can be provided (genomes=FILE
)
that contains Prefix
and Genome
fields (others
allowed) and can give alternative names for each genome to be used in
other outputs. If running interactively, there will also be an option to
check/update these aliases manually. Genome aliases are output as part
of the main *.genomes.tdt
output, which can also be used as
subsequent genomes=FILE
input (the default, if found).
This table can also have a Description
field, which will
be used to update descriptions for output if found. (Empty descriptions
will not be mapped.) Otherwise, descriptions can be parsed from a
description_$PREFIX.txt
file (exactly matching the
$PREFIX
parsed from a full_table_*.tsv
file)
if found. Alternatively, if the run directory has a run_
prefix AND there is only one full_table
, or it is a BUSCO
v4 output directory, a description will be parsed from
description.txt
. In each case, only the first line will be
used. If neither condition is met, the name of the genome fasta file
will be used. Failing that, Genome
will be used. (Note that
for BUSCO v4, description.txt
must be in the same directory
as full_table.tsv
.
NOTE: The optional Alias table can be used
to change a Genome
name to something other than its Fasta
file - this mapping is performed after fasta files have been
identified.
NOTE: Prefix
and Genome
fields must be unique. A common Prefix
and
Genome
value is permitted but these will be assumed to
refer to the same assembly. (In other words, do not run BUSCO
on two assemblies and give each BUSCO run the name of the other
assembly!)
NOTE: Some Genome
names may cause
conflicts with running minimap2 and/or the Rmarkdown output.
Whitespace is not permitted in Genome
names (though can be in R labels) and will be stripped. It is
recommended to keep Genome
names simple and only containing
standard characters (letters, numbers, dots and underscores).
BUSCOMP compilation will take a set of genomes and identify the best
rating for each BUSCO across the set. By default, this is done for all
input genomes and assigned to a “BUSCOMP” rating. Ratings are compiled
according to the ratings=LIST
hierarchy, which by default
is: Complete
, Duplicated
,
Fragmented
, Missing
. The “best” rating (first
in the hierarchy) will be used for that group, even if it is found in
only a single assembly in the group. If dupbest=T
, the
hierarchy is Duplicated
, Complete
,
Fragmented
, Missing
.
Additional subsets of genomes can also be grouped together for
compilation, using the groups=FILE
input. This must have
Genome
and Group
fields. Once generated, a
Group becomes a “Genome”. Groups can therefore include other Groups in
them, provided there is no circular membership. Genome mapping should
also recognise Prefix values. The “BUSCOMP” group with all
genomes will be added unless a conflicting “BUSCOMP” group is provided,
or buscomp=F
.
NOTE: Genome groups are generated mapping onto the
Genome
(or Prefix
) values after any
alias mapping but before any manual editing of
Genome
aliases.
When running interactively (i>=0
), Group review and
editing can be accessed from the main Genome menu. This will show the
genomes currently in each group, and provides the following menu
options:
<A>dd
= Create a new Group. First, a group name
will need to be chosen, then the <E>dit
menu will be
triggered.<D>elete
= Remove the Group. Note that this will
not remove Group members from any “parent” groups that contained the
deleted group when they were loaded.<R>ename
= Change the Group name.<E>dit
= Edit group membership. This will give
options to add/remove each assembly in the analysis.E<x>it menu
= Return to main Genome Menu.<Q>uit
= Quit BUSCOMP.Once group edits are complete, group data will be saved to the
*.groups.tdt
file. If runsort=group
then
genome sorting will also be performed following group edits.
The final setup step is to load any genomes for which Fasta files
have been provided. Unless summarise=F
, summary statistics
will be calculated for each genome, enabling assessements of genome
completeness and quality in combination with the BUSCO and BUSCOMP
results. Summary statistics are calculated by rje_seqlist
.
The following statistics are calculated for each genome:
SeqNum
+GapCount
).genomesize=X
. If no genome size is given, it will be
relative to the biggest assembly.genomesize=X
. If no genome size is given, it will be
relative to the biggest assembly.N
) nucleotides in the assembly.N
) regions above mingap=X (default 10bp).NOTE: NG50Length
and
LG50Count
statistics use genomesize=X
or the
biggest assembly loaded. If BUSCOMP has been run more than once on the
same data (e.g. to update descriptions or sorting), previously
calculates statistics will be read from the genomes=FILE
table, if loaded. This may cause some inconsistencies between reported
NG50 and LG50 values and the given/calculated genomesize=X
.
If partial results are reloaded following a change in
genomesize=X
, this will give rise to inconsistencies
between assemblies. When re-running, please make sure that a consistent
genome size is used. If in doubt, run with force=T
and
force regeneration of statistics.
Once all the genomes are loaded, BUSCOMP compiles the BUSCO results. This is done for three purposes:
Complete
BUSCO hits for
BUSCOMPSeq analysis (below).Loaded assemblies with an identified full results Table
will be processed and added to the *.full.tdt
table. The
best rating per BUSCO gene will then be used for summary BUSCO counts
for each genome. Finally, each Group will have its combined BUSCO rating
calculated. This is performed by choosing the “best” rating for each
BUSCO across the Group’s members, where “best” is determined by the
dupbest=T/F
setting:
dupbest=F
), the rating hierarchy is:
‘Complete’, ‘Duplicated’, ‘Fragmented’, ‘Missing’.dupbest=T
, the rating hierarchy is: ‘Duplicated’,
‘Complete’, ‘Fragmented’, ‘Missing’.Where ratings are defined (quoting from the BUSCO v3 User Guide as:
Complete
: Single-copy hits where “BUSCO matches have
scored within the expected range of scores and within the expected range
of length alignments to the BUSCO profile.”Duplicated
: As Complete
but 2+
copies.Fragmented
: “BUSCO matches … within the range of scores
but not within the range of length alignments to the BUSCO
profile.”Missing
: “Either no significant matches at all, or the
BUSCO matches scored below the range of scores for the BUSCO
profile.”Total BUSCO counts (N
) and summed Ratings are added to
the main *.genomes.tdt
table for each assembly and group. A
*.busco.tdt
file is also generated that has the rating for
each BUSCO gene (rows) in each assembly/group (columns).
If output is being sorted using runsort=Complete
or
runsort=Missing
, data will be sorted at this point. Raw
BUSCO results and compiled numbers will then be output to the log
file.
The following tables will then be saved (see Output files, below, for details):
*.genomes.tdt
= summary of loaded data, genome
statistics and BUSCO ratings.*.full.tdt
= full BUSCO results across all
assemblies.*.busco.tdt
= compiled BUSCO results showing the rating
per genome/group for each BUSCO gene.The final step of the BUSCOMP BUSCO compilation is to extract the
best Complete BUSCO sequences from those available. For all assemblies
with BUSCO results and a single_copy_busco_sequences/
,
BUSCOMP ranks all the hits across all assemblies by Score
and keeps the top-ranking hits. Ties are then resolved by
Length
, keeping the longest sequence. Ties for
Score
and Length
will keep an arbitrary entry
as the winner. Single
complete hits are given preference
over Duplicate
hits, even if they have a lower score,
because only Single
hit have their sequences saved by BUSCO
in the single_copy_busco_sequences/
directory.
If buscofas=FASFILE
has been given, this will be used
for BUSCOMPSeq searches. Otherwise, the best BUSCO seqences identified
for each BUSCO gene will be saved as *.buscomp.fasta
for
BUSCOMPSeq analysis. The exception is if buscofas=FASFILE
is pointing to the *.buscomp.fasta
and
force=T
, or if the buscofas=FILE
fasta file
cannot be found.
NOTE: The buscofas=FASFILE
option has
not been tested and might give some unexpected behaviours, as some of
the quoted figures will still be based on the calculated BUSCOMPSeq
data.
Optionally, BUSCOMP can compile fasta files for phylogenomic analysis
(phylofas=T
). These will be saved in the
phylogenomics/
directory as a series of nucleotide
*.fna
files in phylofna
and protein
*.faa
files in phylofaa
. The key organisation
of these directories is a single file per BUSCO ID that contains
sequences for every genome with a single Complete rating. Files will be
renamed from >$SEQNAME:$START-$END
to
>$PREFIX $SEQNAME:$START-$END:$BUSCOID
where
$PREFIX
is the assembly Prefix (see above).
NOTE: for HAQESAC-based phylogenomics, you might want the prefixes to have the format eog_\(SPCODE__\)ID, though. it should also run OK with unkspec=T allowvar=T gnspacc=F.
Once the gene sequences for BUSCOMPSeq have been established, the
next step is to search for them in the assembly fasta files using a more
deterministic (and faster) approach. For this, minimap2 has been chosen.
BUSCOMP will perform a minimap2 search on each assembly fasta file. (Use
minimap2=FILE
to tell BUSCOMP where to find minimap2 if the
path is not an environment variable.)
Minimap2 will run with -x splice -uf -C5
and hits will
be filtered by length, keeping those that meet the
minloclen=X
cutoff (default 20 bp). A percentage identity
cutoff can also be applied with minlocid=X
(default 0%).
Hits are also reduced to unique coverage by starting with the local
alignment with the largest number of identical positions, and then
trimming any other local alignments that overlap with the same part(s)
of the query (BUSCO) sequence. This is repeated until all local
alignments are non-overlapping (and still meet the identity and length
criteria). Local hits are then compiled into global alignment (GABLAM)
statistics of coverage and identity.
By default, Minimap2 searches will keep the Top 3
secondary alignments for each sequence (the minimap2 -N
commandline parameter). In limited tests, this appears to give a good
trade-off between search speed and the identification of
Duplicated
BUSCO genes. This default can be adjusted with
mmsecnum=INT
to make runs faster or more sensitive, as
required. This can be further modulated using mmpcut=NUM
,
which sets the Minimap2 minimal secondary-to-primary score ratio to
output secondary mappings (the minimap2 -p
commandline
parameter).
Minimap2 search results are saved in a *.minimap/
directory, with the prefix $BASEFILE.$CUTOFFS
, where
$BASEFILE
is the fasta file basename for assembly, and
$CUTOFFS
takes the form NXLXXIDXX
, where
NX
is the max number of secondary hits
(mmpcut=NUM
), LXX
is the length cutoff
(minloclen=X
) and IDXX
is the identity cutoff
(minlocid=X
) - L20ID0
by default. Unless
uniquehit=F
(see below), $CUTOFFS
will have a
U
suffix, e.g. L20ID0U
. The Minimap2
*.paf
file will only have the .NX
suffix. A
*.NX.paf.cmd
file with the actual Minimap2 command used
will also be saved, so the mmpcut=NUM
setting and any other
Minimap2 options set using mapopt=CDICT
can be checked if
required.
Existing files will be loaded and reused unless
force=T
.
Minimap2 hits are first converted into BLAST-style “local” hits,
which are then converted into GABLAM-style summary
statistics. For the *.hitsum.tdt
table, local hits are
reduced to unique coverage of each Query, such that each part of the
Query is covered by a single hit. Local hits are selected in order of
the number of identical positions between Query and Hit, followed by
Length in the case of a tie. Overlapping regions from other local hits
are trimmed, and the process repeated with the next-best local hit,
ranked on trimmed stats where necessary. For the
*.gablam.tdt
output, each Query-Hit pair is considered
independently and local hits are reduced to (a) unique coverage of the
Query for Qry_*
output, and (b) unique coverage of the Hit
for Hit_*
output. (Note that for each Query-Hit pair, only
the local hits between that pair are considered for “unique” local hit
reduction - there may be overlapping alignments to different
queries/hits.) Unless uniquehit=F
, hits will first be
reduced to be non-overlapping across all queries before being
coverted to Query-Hit pair GABLAM coverage stats.
NOTE: Re-using results does NOT robustly check
whether the BUSCOMPSeq data has changed, so this directory should be
deleted if re-running with different sequences. BUSCOMP will save the
name of the BUSCO file along with the md5sum of its contents to
*.NX.input.md5
, which will be checked if present and
re-using data. Some basic checks should also be performed during the
following results compilation stage, but please note that BUSCO IDs are
used for sequence identifiers and so changes in BUSCO hit sequences will
not be identified if files have been inappropriatley copied/moved
between runs etc. - please look out for unexpected behaviour outside a
“clean” run.
NOTE: Minimap2 only works with high sequence
identity. BUSCOMP is designed to be used on multiple assemblies from
the same species. If using cross-species BUSCO analysis, odd
results might be obtained, biased by the evolutionary distance from the
species with the most BUSCOMP sequences. Under these circumstances, it
might be better to swap out minimap2 for BLAST+. This can be achieved by
independently running GABLAM using the BUSCOMPSeq fasta file as queries
and each assembly as the search database, then replacing the
*.gablam.tdt
and *.hitsum.tdt
files before
re-running BUSCOMP. Please contact the author if you want help with
this.
Minimap2 searches of the compiled BUSCOMP sequences are then compiled in similar vein to the original BUSCO results. The primary difference is that the search results need to first be converted into BUSCO-style ratings. This rating is explicitly more “quick and dirty” than the original BUSCO ratings, and should be considered complementary rather than superior. It aims to provide a quick, consistent assessment, but does have fewer checks and balances as a result.
BUSCOMP uses results from both the *.hitsum.tdt
table
(overall coverage in assembly) and the *.gablam.tdt
table
(coverage in a single contig/scaffold) to rate sequences as:
Complete
ratings, these might include part of closely related paralogues.)When compiling the results for all BUSCO genes, Single/Duplicated Complete hits will also be rated as Identical if they have 100% coverage and 100% identity in at least one contig/scaffold.
Once all the individual assemblies have been rated for the full set of assemblies, results are compiled across Groups as described for the original BUSCO results (above). Because no genes receive an individual “Identical” rating, Groups will not have a summary count for Identical hits.
Individual gene ratings for each genome and group are output to
*.LnnIDxx.buscomp.tdt
, where LnnIDxx
records
the minloclen=nn
and minlocid=xx
settings.
Compiled ratings are output to *.LnnIDxx.ratings.tdt
.
The final step of the BUSCOMP compilation is to summarise the findings in the log file, akin to the BUSCO summary file. This will first generate a one line summary of the percentages, along with the original number of complete BUSCOs and the number of BUSCOMP sequences contributed by that assembly (i.e. the number with the best score of all the BUSCO searches.) This is followed by a more detailed breakdown of each category. For example:
#BUSCO 00:00:41 Results: C:89.2%[S:87.8%,D:1.4%,I:23.3%],F:5.8%,P:3.8%,G:1.6%,M:0.9%,n:3736 - canetoad_v2.2 (3194 Complete BUSCOs; 102 BUSCOMP Seqs)
#INFO 00:00:41 870 Identical BUSCOs (I) [100% complete and identical]
#INFO 00:00:41 3333 Complete BUSCOs (C) [95%+ coverage in a single contig/scaffold]
#INFO 00:00:41 3281 Complete and single-copy BUSCOs (S) [1 Complete hit]
#INFO 00:00:41 52 Complete and duplicated BUSCOs (D) [2+ Complete hits]
#INFO 00:00:41 217 Fragmented BUSCOs (F) [95%+ coverage spread over 2+ contigs/scaffolds]
#INFO 00:00:41 143 Partial BUSCOs (P) [40-95% coverage]
#INFO 00:00:41 61 Ghost BUSCOs (G) [<40% coverage]
#INFO 00:00:41 34 Missing BUSCOs (M) [No hits]
#INFO 00:00:41 3736 Total BUSCO gene hits searched
There is a risk that performing a low stringency search will identify
homologues or pseudogenes of the desired BUSCO gene in error. If there
is a second copy of a gene in the genome that is detectable by the
search then we would expect the same genes that go from
Missing
to Complete
in some genomes to go from
Single
to Duplicated
in others.
To test this, data is reduced for each pair of genomes to BUSCO-BUSCOMP rating pairs of:
Single
-Single
Single
-Duplicated
Missing
-Missing
Missing
-Single
This is then converted in to G
ain ratings
(Single
-Duplicated
&
Missing
-Single
) or N
o Gain
ratings (Single
-Single
&
Missing
-Missing
). The
Single
-Duplicated
shift in one genome is then
used to set the expected Missing
-Single
shift
in the other, and assess the probability of observing the
Missing
-Single
shift using a cumulative
binomial distribution, where:
k
is the number of observed GG
pairs
(Single
-Duplicated
and
Missing
-Single
)n
is the number of
Missing
-Single
G
ains in the focal
genome (NG
+GG
)p
is the proportion of
Single
-Duplicated
G
ains in the
background genome (GN
+GG
/
(GN
+GG
+NN
+NG
))pB
is the probability of observing k+
Missing
-Single
gains, given p
and
n
This is output to *.gain.tdt
, where each row is a Genome
and each field gives the probability of the row genome’s
Missing
-Single
gains, given the column
genome’s Single
-Duplicated
gains.
Contents of the main output files are given below. In addition to the
tab-delimited text output, a summary HTML (and source RMarkdown) file
will be generated, assuming R is installed. If buscompseq=T
then a fasta file of the “best” (top-scoring) Single Complete BUSCO
genes will also be generate (*.buscomp.fasta
), unless it is
provided using buscofas=FASFILE
.
Unless buscompseq=F
or buscofas=FASFILE
is
provided, nucleotide and protein sequences for the “best” BUSCO gene
hits will be output to *.buscomp.fasta
(nucleotide) and
*.buscomp.faa
(protein). The *.buscomp.faa
protein file is not used by BUSCOMP but is provided as a useful set of
(near-)complete protein sequences. Each fasta file is made from
concatenating individual fasta files from the
single_copy_busco_sequences/
directories. As such, they
will be named as with the standard BUSCO output:
>$BUSCOID $FASTAFILE:$CONTIG:$START-$END
NOTE: The $FASTAFILE
path given in this
file will be the one from the original BUSCO run, not the path to the
genome file used by BUSCOMP if it has subsequently been
moved/renamed.
The *.genomes.tdt
table is the main summary table for
the input genomes, their BUSCO rating summaries and genome
statistics.
#
= Sorting order for outputDirectory
= Directory for this BUSCO runPrefix
= Prefix identified from BUSCO directory (or
full table), corresponding to BUSCO output filesGenome
= Optional alternative name (or “alias”) to be
used in outputsFasta
= path to genome file, if foundsummarise=T
. (See above.)Sequences
(True/False) = Whether
single_copy_busco_sequences/
directory foundTable
= Path to full_*.tsv
tableTable
is
True
) (See above.)The *.groups.tdt
table has the mappings between
Genome
and Group
identifiers to make the
groups for combined ratings (see above).
#
= Arbitrary unique key for tableGenome
= Genome alias or PrefixGroup
= Name for grouped ratingThe *.full.tdt
table is a compiled version of the all
the individual BUSCO full results tables.
Genome
= Genome alias#
= Arbitrary unique key element for table (adjusting
for duplicates) - BuscoID
= BUSCO gene identifierStatus
= BUSCO ratingContig
= Contig containing BUSCO hitStart
= Start positionEnd
= End positionScore
= BUSCO scoreLength
= Length of BUSCO hitThe *.busco.tdt
table has the compiled ratings for
individual BUSCO genes in each genome/group
BuscoID
= BUSCO EOG gene identifierThe *.buscoseq.tdt
table has the set of best BUSCO genes
used for the compiled BUSCOMPSeq sequence set. In addition to the BUSCO
stats for all the BUSCOMP sequences, BUSCOMP ratings for each Genome are
output in this file.
Genome
= Genome aliasBuscoID
= BUSCO gene identifierStatus
= BUSCO ratingContig
= Contig containing BUSCO hitStart
= Start positionEnd
= End positionScore
= BUSCO scoreLength
= Length of BUSCO hitThe *.LnnIDxx.buscomp.tdt
table is the same as the
*.busco.tdt
but with revised ratings based on BUSCOMP
analysis.
BuscoID
= BUSCO EOG gene identifierThe *.LnnIDxx.ratings.tdt
table has the compiled BUCOMP
summary ratings per genome/group.
Genome
= Genome alias or group nameN
= Number of BUSCOMP genesIdentical
= 100% coverage and 100% identity in at least
one contig/scaffold. These will also be rated as Complete
or Duplicated
. Groups do not have Identical
ratings.Complete
= 95%+ Coverage in a single contig/scaffold.
(Note: accuracy/identity is not considered.)Duplicated
= 95%+ Coverage in 2+
contigs/scaffolds.Fragmented
= 95%+ combined coverage but not in any
single contig/scaffold.Partial
= 40-95% combined coverage.Ghost
= Hits meeting local cutoff but <40% combined
coverage.Missing
= No hits meeting local cutoff.The *.LnnIDxx.changes.tdt
table has counts of change
ratings (BUSCO to BUSCOMP) for each genome, in addition to a Total count
across all genomes.
BUSCO
= BUSCO rating.BUSCOMP
= BUSCOMP rating.Total
= Summed count over all genomes.The *.LnnIDxx.changes.full.tdt
table is the complete set
of ratings changes (BUSCO to BUSCOMP) for each gene and genome.
BuscoID
= BUSCO EOG gene identifierComplete, Duplicated, Fragmented, Partial, Ghost, Missing
The *.LnnIDxx.unique.tdt
table has Genome
and Group
identifiers for each BuscoID
where
that gene is only Complete
(or Duplicated
) in
that genome/group.
BuscoID
= BUSCO EOG gene identifierBUSCO
= Genome/Group that uniquely has a BUSCO
Complete
rating.BUSCOMP
= Genome/Group that uniquely has a BUSCOMP
Complete
rating.The *.LnnIDxx.rdata.tdt
table has the BUSCOMP summary
ratings (converted into percentage values) and sequence statistics for
each Genome
/Group
, in addition to:
BUSCO
= BUSCO Complete
(Single
and Duplicated
) percentageNoBUSCO
= BUSCO Missing
percentageUniqBUSCO
= Number of unique BUSCO
Complete
genesUniqBUSCOMP
= Number of unique BUSCOMP
Complete
genescol
= colour field for R plotting.pch
= point type for R plotting.label
= Genome/Group label for R plotting.plot
= TRUE/FALSE, whether to plot the Genome/Group
statistics.best
= any categories for which this Genome is rated
“best”.Minimap2 will be run for each genome with a Fasta
file,
generating:
*.paf
*.paf.cmd
*.input.md5
*.L20ID80.local.tdt
*.L20ID80.hitunique.tdt
*.L20ID80.qryunique.tdt
*.L20ID80.hitsum.tdt
*.L20ID80.gablam.tdt
Output details will be added here.
The final step in BUSCOMP analysis is to generate a summary report.
This is produced as an RMarkdown document (*.Rmd
) and
(assuming the path to pandoc is set) is then “knitted” to make an HTML
file (*.html
). The RMarkdown file is retained to make it
easy for the user to modify the content and convert the output into a
report of their own. HTML output can be switched off with
rmdreport=F
. Unless fullreport=F
, a larger
report with full BUSCO results tables and comparative plots of Missing
BUSCO genes will be generated (`*.full.html’).
Prior to generation of the document, results are first compiled into an overview stats file with additional plot attributes, before the “best” assemblies are identified under various criteria. Finally, there is an option to modify some of the plotting attributes before the report is generated.
The main BUSCOMP *.ratings.tdt
output is combined with
key genome statistics and BUSCO ratings from the
*.genomes.tdt
table. BUSCOMP ratings are converted into
percentage values. BUSCO ratings are converted into percentage values
and reduced to BUSCO
(Single and Duplicated Complete
BUSCOs) and NoBUSCO
(Missing BUSCOs). (Full BUSCO results
are plotted directly from the *.genomes.tdt
.)
After results are compiled, additional plotting fields are generated:
col
= Plotting colour. (Default, “red”. Genomes with 1+
“best” ratings, “blue”)pch
= Point type. (Default, 21 [circle]. Genomes with
“best” rating will be 22 [square] for best BUSCO(MP) ratings, 24
[triangle] for best contiguity ratings, or 23 [diamond] for best in both
categories.label
= Additional text field to be used for labels. In
interactive mode, the option will be given to leave this blank for
unlabelled plots.plot
= Whether or not to include a genome in the plot.
(Default, TRUE)In interactive mode, the option will be provided to edit plotting
attributes for each assembly, following the calculation of “best”
assemblies (below). Compiled data are then saved to
*.LnnIDxx.rdata.tdt
for summary plots and tables in the
RMarkdown output.
There are many ways of assessing genome assembly quality, but they can be broadly broken down into four aspects:
Coverage. How much of the genome is included in the assembly.
Contiguity. How many fragments is the genome in, and how big are they.
Accuracy. How accurate is the assembly in terms of sequence errors.
Redundancy. How much of the genome has been included multiple times.
Standard reference-free genome statistics (e.g. number, length and
gappiness of scaffolds), can only give limited insights. Whilst
assemblies smaller than the genome size are clearly missing bits,
assembly size could be artificially inflated by redundant regions,
making it impossible to assess Coverage. Scaffold sizes can give an
indicator of Contiguity, but are also prone to biases introduced by
filtering of small scaffolds, for example. If an estimated Genome Size
can be provided, the most useful measures in this respect are
NG50
and LG50
. These statistics sort scaffolds
by length (big to small) and identify the contig/scaffold size at which
half the genome (not the assembly) is covered by
contigs/scaffolds of that size or bigger. This is the NG50
value (called NG50Length
in BUSCOMP), with
LG50
(LG50Count
) being the number of
contigs/scaffolds needed to cover half the genome. (N50
and
L50
are the same statistics only for the assembly.) These
statistics can still be mislead by redundancy in the assembly. None of
theses statistics speak to sequence accuracy.
The power of BUSCO is that it speaks to all four aspects of quality.
Complete
and Fragmented
genes give an
indication of Coverage, Continuity and Accuracy; Duplicated
genes give an indication of Redundancy; Missing
genes give
an indication of Coverage and Accuracy. However, the weakness is that
these different aspects cannot easily be disentangled. This makes
side-by-side comparisons of different assemblies challenging, as it is
not always clear what a difference is indicating.
BUSCOMP is designed on the principle that Coverage and Contiguity are the two most important aspects of assembly quality. Accuracy can, in principle, be improved by additional error-correction steps (including manual curation). Suspected Redundancy can likewise be identified a flagged. Coverage and Contiguity, in contrast, can only be improved by further assembly - either trying again from scratch, or employing a scaffolding or gap-filling alogrithm.
BUSCOMP has therefore identified seven statistics that can be used to rank assemblies on inferred Completeness or Contiguity:
Complete
= Maximum number of Complete BUSCOMP
ratings.BUSCO
= Maximum number of Complete BUSCO ratings.Missing
= Smallest number of Missing BUSCOMP
ratings.NoBUSCO
= Smallest number of Missing BUSCO
ratings.MaxLength
= Maximum length of contigs/scaffolds.NG50Length
= Half the genome lies on contigs/scaffolds
at least this size. (See above.)LG50Count
= Half the genome can be covered by this many
contigs/scaffolds. (See above.)Individual assemblies are rated as “best” under all seven criteria, with ties allowed. Each assembly can be best under multiple criteria and each criterion can have several best assemblies. BUSCOMP will report all such combinations.
The BUSCOMP HTML reports (full and summary) consist of the following sections:
Details of the BUSCOMP run, including the version, directory and commands.
This section contains summary details of the genome assemblies analysed, including the loaded data, the summary statistics for the assemblies, and coverage/contiguity assessment plots:
The compilation of BUSCO ratings, including details of the BUSCOMP Groups is given in this section. In addition to the summary ratings for each genome/group, the full report will have a table of all the individual gene ratings:
Next, BUSCOMP ratings using the compiled BUSCOSeq dataset are reported. In addition to the summary ratings for each genome/group, the full report will have a table of all the individual gene ratings:
The final report section features direct comparisons of the BUSCO and
BUSCOMP ratings, reporting changes in ratings between BUSCO and BUSCOMP.
Unique Complete genes are also identified: those rated as
Complete
in only a single genome, or multiple genomes in a
single Group. The full report also has a series of summary ratings plots
for subsets of BUSCO genes that are rated as Missing
in a
genome or group.
Details of the BUSCOMP run in terms of when, where and how
(e.g. commandline options) are found in the Appendix. Any errors or
warnings generated during the BUSCO run are reported here. Check the
*.log
file generated for details.
© 2019 Richard Edwards | richard.edwards@unsw.edu.au