BUSCOMP is designed to overcome some of the non-deterministic limitations of BUSCO to:
For each BUSCO gene, BUSCOMP will extract the best “Single Complete” sequence from those available, using the full_table_*.tsv
results table and single_copy_busco_sequences/
directory of hit sequences. BUSCOMP ranks all the hits across all assemblies by Score and keeps the top-ranking hits. Ties are then resolved by Length, keeping the longest sequence. Ties for Score and Length will keep an arbitrary entry as the winner. Single complete hits are given preference over Duplicate hits, even if they have a lower score, because only Single hit have their sequences saved by BUSCO in the single_copy_busco_sequences/
directory. This set of predicted gene sequences forms the “BUSCOMPSeq” gene set.
BUSCOMP uses minimap2 to map BUSCOSeq predicted CDS sequences onto genome/transcriptome assemblies, including those not included in the original BUSCO compilation. This way, the compiled set of species-specific BUSCO sequences can also be used to generate a quick-and-dirty assessment of completeness for a new genome assembly. Hits are converted into percentage coverage stats, which are then used to reclassify the BUSCO gene on the basis of coverage and identity. BUSCOMP ratings are designed to mimic the original BUSCO ratings but have different definitions. In addition, two extra classes of low quality hit have been added: “Partial” and “Ghost”.
NOTE: Duplicated
genes are explicitly those with hits on two different contigs/scaffolds. BUSCOMP is focused on identifying the coding potential of an assembly rather than a detailed completeness assessment. Duplicated genes on the same contig/scaffold will be marked as Complete
.
In addition to individual assembly stats, BUSCO and BUSCOMP ratings are compiled across user-defined groups of assemblies with various outputs to give insight into how different assemblies complement each other. Ratings are also combined with traditional genome assembly statistics (NG50 and LG50) based on a given genomesize=X
to help identify the “best” assemblies. Details of settings, key results, tables and plots are output to an HTML report using Rmarkdown.
NOTE: For HTML output, R must be installed and a pandoc environment variable must be set, e.g.
export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc
NOTE: BUSCOMPSeq sequences can be provided with buscofas=FILE
in place of compilation. This option has not been tested and might give some unexpected behaviours, as some of the quoted figures will still be based on the calculated BUSCOMPSeq data. Please report any unexpected behaviour.
BUSCOMP is still under review as part of the Starling genome paper. For now, please cite the preprint:
Stuart KC, Edwards RJ, Cheng Y, Warren WC, Burt DW, Sherwin WB, Hofmeister NR, Werner SJ, Ball GF, Bateson M, Brandley MC, Buchanan KL, Cassey P, Clayton DF, De Meyer T, Meddle SL & Rollins LA (preprint): Transcript- and annotation-guided genome assembly of the European starling. bioRxiv 2021.04.07.438753; doi: 10.1101/2021.04.07.438753. [*Joint first authors]
BUSCOMP is written in Python 2.x and can be run directly from the commandline:
python $CODEPATH/buscomp.py [OPTIONS]
If running as part of SLiMSuite, $CODEPATH
will be the SLiMSuite tools/
directory. If running from the standalone BUSCOMP git repo, $CODEPATH
will be the path the to code/
directory. Please see details in the BUSCOMP git repo for running on example data.
For BUSCOMPSeq analysis, minimap2 must be installed and either added to the environment $PATH
or given to BUSCOMP with the minimap2=PROG
setting.
A full list of options and the workflow is provided below. The easiest/standard way to run BUSCOMP is to:
fasta/
busco/
, with run_
prefixes (-o setting
)basefile=$PREFIX
setting to get $PREFIX.*
outputs, e.g.# Generate a name-matched BUSCO run per genome
cd busco/
for GENOME in ../fasta/*.fasta; do
GENBASE=$( basename ${GENOME/.fasta/} )
busco -o run_$GENBASE -i $GENOME -l $LINEAGE --cpu $PPN -m genome
done
cd ../
# Run BUSCOMP
python $CODEPATH/buscomp.py runs="./busco/run_*" fastadir=./fasta/ basefile=$PREFIX
If you want to run the full BUSCOMP comparison, make sure minimap2 is installed and give the number of threads to use with forks=$THREADNUM
. If you only want the BUSCO compilation, add buscompseq=F
.
A list of commandline options can be generated at run-time using the -h
or help
flags. Please see the general SLiMSuite documentation for details of how to use commandline options, including setting default values with INI files. If anything is not clear, or for questions how to perform a particular analysis, please post a question in the GitHub Issues.
### ~ Input/Output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
runs=DIRLIST : List of BUSCO run directories (wildcards allowed) [run_*]
fastadir=DIRLIST: List of directories containing genome fasta files (wildcards allowed) [./]
fastaext=LIST : List of accepted fasta file extensions that will be checked for in fastadir [fasta,fas,fsa,fna,fa]
genomes=FILE : File of Prefix and Genome fields for generate user-friendly output [*.genomes.tdt if found]
restrict=T/F : Restrict analysis to genomes with a loaded alias [False]
runsort=X : Output sorting order for genomes and groups (or "Genome","Prefix","Complete","Group") [Group]
stripnum=T/F : Whether to strip numbers ("XX_*") at start of Genome alias in output [True]
groups=FILE : File of Genome and Group fields to define Groups for compilation [*.groups.tdt]
buscofas=FASFILE: Fasta file of BUSCO DNA sequences. Will combine and make (NR) if not given [None]
metaeukfna=T/F : Perform v5 metaeuk nucleotide busco sequences extraction if missing [True]
buscomp=T/F : Whether to run BUSCO compilation across full results tables [True]
dupbest=T/F : Whether to rate "Duplicated" above "Complete" when compiling "best" BUSCOs across Groups [False]
buscompseq=T/F : Whether to run full BUSCO comparison using buscofas and minimap2 [True]
phylofas=T/F : Generate output of compiled and renamed files for BUSCO-based phylogenomics [False]
ratefas=FILELIST: Additional fasta files of assemblies to rate with BUSCOMPSeq (No BUSCO run) (wildcards allowed) []
rmdreport=T/F : Generate Rmarkdown report and knit into HTML [True]
fullreport=T/F : Generate full Rmarkdown report including detailed tables of all ratings [True]
missing=T/F : Generate summary tables for sets of Missing genes for each assembly/group [True]
dochtml=T/F : Generate HTML BUSCOMP documentation (*.docs.html) instead of main run [False]
summarise=T/F : Include summaries of genomes in main `*.genomes.tdt` output [True]
### ~ Mapping/Classification options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
minimap2=PROG : Full path to run minimap2 [minimap2]
endextend=X : Extend minimap2 hits to end of sequence if query region with X bp of end [0]
minlocid=INT : Minimum percentage identity for aligned chunk to be kept (local %identity) [0]
minloclen=INT : Minimum length for aligned chunk to be kept (local hit length in bp) [1]
uniquehit=T/F : Option to use *.hitunique.tdt table of unique coverage for GABLAM coverage stats [True]
mmsecnum=INT : Max. number of secondary alignments to keep (minimap2 -N) [3]
mmpcut=NUM : Minimap2 Minimal secondary-to-primary score ratio to output secondary mappings (minimap2 -p) [0]
mapopt=CDICT : Dictionary of additional minimap2 options to apply (caution: over-rides conflicting settings) []
### ~ Processing options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X : Sleep time (seconds) between cycles of forking out more process [0]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
STEP 1 is to set up the input data. BUSCOMP will first identify and check the run directories (given with the runs=DIRLIST
command). This command can either be a list of directories, or an expandable list with wildcard. A list of individual directories can be provided as a file if desired.
NOTE: runs=DIRLIST
wants the actual BUSCO run directories (run_$PREFIX[.busco]/
), not the parent directory. These directories can optionally have a .busco
suffix, which will be trimmed off. By default, BUSCOMP will look for ./run*
.
From these, the run list will be extracted, consisting of a number of genome $PREFIX
values. It is expected that the $PREFIX
for each run will match the input genome file, found in a directory set by fastadir=DIRLIST
. This file can also match the Genome
alias for that genome (see below). If you have not used consistent naming across files, please check data has been loaded and mapped correctly during the Review phase (below).
The presence of a full_$PREFIX[.busco].tsv
table and single_copy_busco_sequences/
directory within each run directory will also be checked. If buscocompseq=F
then all of the full_*.tsv
tables in the runs
directories will be loaded for compilation, but no sequence searching will be performed. The presence of sequences available for compilation will be stored in the Sequences
field of *.genomes.tdt
.
BUSCO v4.x: From v0.9.1
, BUSCOMP should recognise BUSCO v4 output. Due to the reorganisation and altered naming strategy of version 4, the naming of results files is more strict. The main results directory (set by -o
when you run BUSCO) should match the genome assembly prefix with an optional .busco
suffix. (A leading run_
is also permitted and will be ignored. For example, if assembly.fasta
was analysed with BUSCO v4, BUSCOMP should be able to parse results generated using -o assembly
, -o assembly.busco
, -o run_assembly
, or -o run_assembly.busco
. If none of these settings were used, the results directory can be manually renamed. For additional sorting, a XX_ numerical prefix can be used (see below), e.g. run_01_assembly
will still look for assembly.*
in the fastadir
path.
BUSCO v5.x From v0.11.0
, BUSCOMP should recognised BUSCO v5 output. This will be treated like v4 output.
Additional assemblies can rated using the BUSCOMPSeq analysis (see below) without having BUSCO data analysed. These are given with the ratefas=LIST
command (wildcards allowed), and will be added to the main Genomes table after BUSCO compilation has been performed. The default prefix for these files will be their basename (filename, minus path and extension). This can be linked to an alias and/or have a prefix number stripped, as with the run directories. When combined with buscofas=FASFILE
, this enables the BUSCOMPSeq analysis to be performed in the absence of any BUSCO results.
Each genome analysed has a “Prefix” that can be used to help sorting (if runsort=prefix
) and identify relevant BUSCO files where the genome (Fasta) name and BUSCO run name do not match. It is generally assumed that this Prefix will match the prefix of the original assembly file on which BUSCO was run. This Prefix is set once as the data is loaded and is never changed. (Sorting and visualisations can be made more informative altered using the Genome (a.k.a. Alias) and Description fields.)
The assembly Prefix will be set as:
full_table_*.tsv
file is found, the core (indicated with *
) will be used as the prefix.Genome Aliases will be parsed initially from the loaded data:
Genome
alias, minus the leading “run_
”.full_table_*.tsv
core used for the Prefix
will also be used for the Genome
alias.Genome
alias.If Genome
contains a trailing .busco
, this will be stripped. If Genome
starts with preceding numbers (XX_
), these will be kept for sorting but stripped for outputs and fasta matching (unless stripnum=F
).
NOTE: If the assembly fasta file does not match the name used for the BUSCO run file, then the Genome
alias must be set to the basefile
of the fasta file by appropriate naming of the run_
directory, i.e. the run_*
directory should contain a single BUSCO run and the directory name should match a *.fasta
file (or other accepted fastaext=LIST
extension) in one of the directories given by fastadir=DIRLIST
.
An optional alias table can be provided (genomes=FILE
) that contains Prefix
and Genome
fields (others allowed) and can give alternative names for each genome to be used in other outputs. If running interactively, there will also be an option to check/update these aliases manually. Genome aliases are output as part of the main *.genomes.tdt
output, which can also be used as subsequent genomes=FILE
input (the default, if found).
This table can also have a Description
field, which will be used to update descriptions for output if found. (Empty descriptions will not be mapped.) Otherwise, descriptions can be parsed from a description_$PREFIX.txt
file (exactly matching the $PREFIX
parsed from a full_table_*.tsv
file) if found. Alternatively, if the run directory has a run_
prefix AND there is only one full_table
, or it is a BUSCO v4 output directory, a description will be parsed from description.txt
. In each case, only the first line will be used. If neither condition is met, the name of the genome fasta file will be used. Failing that, Genome
will be used. (Note that for BUSCO v4, description.txt
must be in the same directory as full_table.tsv
.
NOTE: The optional Alias table can be used to change a Genome
name to something other than its Fasta file - this mapping is performed after fasta files have been identified.
NOTE: Prefix
and Genome
fields must be unique. A common Prefix
and Genome
value is permitted but these will be assumed to refer to the same assembly. (In other words, do not run BUSCO on two assemblies and give each BUSCO run the name of the other assembly!)
NOTE: Some Genome
names may cause conflicts with running minimap2 and/or the Rmarkdown output. Whitespace is not permitted in Genome
names (though can be in R labels) and will be stripped. It is recommended to keep Genome
names simple and only containing standard characters (letters, numbers, dots and underscores).
BUSCOMP compilation will take a set of genomes and identify the best rating for each BUSCO across the set. By default, this is done for all input genomes and assigned to a “BUSCOMP” rating. Ratings are compiled according to the ratings=LIST
hierarchy, which by default is: Complete
, Duplicated
, Fragmented
, Missing
. The “best” rating (first in the hierarchy) will be used for that group, even if it is found in only a single assembly in the group. If dupbest=T
, the hierarchy is Duplicated
, Complete
, Fragmented
, Missing
.
Additional subsets of genomes can also be grouped together for compilation, using the groups=FILE
input. This must have Genome
and Group
fields. Once generated, a Group becomes a “Genome”. Groups can therefore include other Groups in them, provided there is no circular membership. Genome mapping should also recognise Prefix values. The “BUSCOMP” group with all genomes will be added unless a conflicting “BUSCOMP” group is provided, or buscomp=F
.
NOTE: Genome groups are generated mapping onto the Genome
(or Prefix
) values after any alias mapping but before any manual editing of Genome
aliases.
When running interactively (i>=0
), Group review and editing can be accessed from the main Genome menu. This will show the genomes currently in each group, and provides the following menu options:
<A>dd
= Create a new Group. First, a group name will need to be chosen, then the <E>dit
menu will be triggered.<D>elete
= Remove the Group. Note that this will not remove Group members from any “parent” groups that contained the deleted group when they were loaded.<R>ename
= Change the Group name.<E>dit
= Edit group membership. This will give options to add/remove each assembly in the analysis.E<x>it menu
= Return to main Genome Menu.<Q>uit
= Quit BUSCOMP.Once group edits are complete, group data will be saved to the *.groups.tdt
file. If runsort=group
then genome sorting will also be performed following group edits.
The final setup step is to load any genomes for which Fasta files have been provided. Unless summarise=F
, summary statistics will be calculated for each genome, enabling assessements of genome completeness and quality in combination with the BUSCO and BUSCOMP results. Summary statistics are calculated by rje_seqlist
. The following statistics are calculated for each genome:
SeqNum
+GapCount
).genomesize=X
. If no genome size is given, it will be relative to the biggest assembly.genomesize=X
. If no genome size is given, it will be relative to the biggest assembly.N
) nucleotides in the assembly.N
) regions above mingap=X (default 10bp).NOTE: NG50Length
and LG50Count
statistics use genomesize=X
or the biggest assembly loaded. If BUSCOMP has been run more than once on the same data (e.g. to update descriptions or sorting), previously calculates statistics will be read from the genomes=FILE
table, if loaded. This may cause some inconsistencies between reported NG50 and LG50 values and the given/calculated genomesize=X
. If partial results are reloaded following a change in genomesize=X
, this will give rise to inconsistencies between assemblies. When re-running, please make sure that a consistent genome size is used. If in doubt, run with force=T
and force regeneration of statistics.
Once all the genomes are loaded, BUSCOMP compiles the BUSCO results. This is done for three purposes:
Complete
BUSCO hits for BUSCOMPSeq analysis (below).Loaded assemblies with an identified full results Table
will be processed and added to the *.full.tdt
table. The best rating per BUSCO gene will then be used for summary BUSCO counts for each genome. Finally, each Group will have its combined BUSCO rating calculated. This is performed by choosing the “best” rating for each BUSCO across the Group’s members, where “best” is determined by the dupbest=T/F
setting:
dupbest=F
), the rating hierarchy is: ‘Complete’, ‘Duplicated’, ‘Fragmented’, ‘Missing’.dupbest=T
, the rating hierarchy is: ‘Duplicated’, ‘Complete’, ‘Fragmented’, ‘Missing’.Where ratings are defined (quoting from the BUSCO v3 User Guide as:
Complete
: Single-copy hits where “BUSCO matches have scored within the expected range of scores and within the expected range of length alignments to the BUSCO profile.”Duplicated
: As Complete
but 2+ copies.Fragmented
: “BUSCO matches … within the range of scores but not within the range of length alignments to the BUSCO profile.”Missing
: “Either no significant matches at all, or the BUSCO matches scored below the range of scores for the BUSCO profile.”Total BUSCO counts (N
) and summed Ratings are added to the main *.genomes.tdt
table for each assembly and group. A *.busco.tdt
file is also generated that has the rating for each BUSCO gene (rows) in each assembly/group (columns).
If output is being sorted using runsort=Complete
or runsort=Missing
, data will be sorted at this point. Raw BUSCO results and compiled numbers will then be output to the log file.
The following tables will then be saved (see Output files, below, for details):
*.genomes.tdt
= summary of loaded data, genome statistics and BUSCO ratings.*.full.tdt
= full BUSCO results across all assemblies.*.busco.tdt
= compiled BUSCO results showing the rating per genome/group for each BUSCO gene.The final step of the BUSCOMP BUSCO compilation is to extract the best Complete BUSCO sequences from those available. For all assemblies with BUSCO results and a single_copy_busco_sequences/
, BUSCOMP ranks all the hits across all assemblies by Score
and keeps the top-ranking hits. Ties are then resolved by Length
, keeping the longest sequence. Ties for Score
and Length
will keep an arbitrary entry as the winner. Single
complete hits are given preference over Duplicate
hits, even if they have a lower score, because only Single
hit have their sequences saved by BUSCO in the single_copy_busco_sequences/
directory.
If buscofas=FASFILE
has been given, this will be used for BUSCOMPSeq searches. Otherwise, the best BUSCO seqences identified for each BUSCO gene will be saved as *.buscomp.fasta
for BUSCOMPSeq analysis. The exception is if buscofas=FASFILE
is pointing to the *.buscomp.fasta
and force=T
, or if the buscofas=FILE
fasta file cannot be found.
NOTE: The buscofas=FASFILE
option has not been tested and might give some unexpected behaviours, as some of the quoted figures will still be based on the calculated BUSCOMPSeq data.
Optionally, BUSCOMP can compile fasta files for phylogenomic analysis (phylofas=T
). These will be saved in the phylogenomics/
directory as a series of nucleotide *.fna
files in phylofna
and protein *.faa
files in phylofaa
. The key organisation of these directories is a single file per BUSCO ID that contains sequences for every genome with a single Complete rating. Files will be renamed from >$SEQNAME:$START-$END
to >$PREFIX $SEQNAME:$START-$END:$BUSCOID
where $PREFIX
is the assembly Prefix (see above).
NOTE: for HAQESAC-based phylogenomics, you might want the prefixes to have the format eog_\(SPCODE__\)ID, though. it should also run OK with unkspec=T allowvar=T gnspacc=F.
Once the gene sequences for BUSCOMPSeq have been established, the next step is to search for them in the assembly fasta files using a more deterministic (and faster) approach. For this, minimap2 has been chosen. BUSCOMP will perform a minimap2 search on each assembly fasta file. (Use minimap2=FILE
to tell BUSCOMP where to find minimap2 if the path is not an environment variable.)
Minimap2 will run with -x splice -uf -C5
and hits will be filtered by length, keeping those that meet the minloclen=X
cutoff (default 20 bp). A percentage identity cutoff can also be applied with minlocid=X
(default 0%). Hits are also reduced to unique coverage by starting with the local alignment with the largest number of identical positions, and then trimming any other local alignments that overlap with the same part(s) of the query (BUSCO) sequence. This is repeated until all local alignments are non-overlapping (and still meet the identity and length criteria). Local hits are then compiled into global alignment (GABLAM) statistics of coverage and identity.
By default, Minimap2 searches will keep the Top 3 secondary alignments for each sequence (the minimap2 -N
commandline parameter). In limited tests, this appears to give a good trade-off between search speed and the identification of Duplicated
BUSCO genes. This default can be adjusted with mmsecnum=INT
to make runs faster or more sensitive, as required. This can be further modulated using mmpcut=NUM
, which sets the Minimap2 minimal secondary-to-primary score ratio to output secondary mappings (the minimap2 -p
commandline parameter).
Minimap2 search results are saved in a *.minimap/
directory, with the prefix $BASEFILE.$CUTOFFS
, where $BASEFILE
is the fasta file basename for assembly, and $CUTOFFS
takes the form NXLXXIDXX
, where NX
is the max number of secondary hits (mmpcut=NUM
), LXX
is the length cutoff (minloclen=X
) and IDXX
is the identity cutoff (minlocid=X
) - L20ID0
by default. Unless uniquehit=F
(see below), $CUTOFFS
will have a U
suffix, e.g. L20ID0U
. The Minimap2 *.paf
file will only have the .NX
suffix. A *.NX.paf.cmd
file with the actual Minimap2 command used will also be saved, so the mmpcut=NUM
setting and any other Minimap2 options set using mapopt=CDICT
can be checked if required.
Existing files will be loaded and reused unless force=T
.
Minimap2 hits are first converted into BLAST-style “local” hits, which are then converted into GABLAM-style summary statistics. For the *.hitsum.tdt
table, local hits are reduced to unique coverage of each Query, such that each part of the Query is covered by a single hit. Local hits are selected in order of the number of identical positions between Query and Hit, followed by Length in the case of a tie. Overlapping regions from other local hits are trimmed, and the process repeated with the next-best local hit, ranked on trimmed stats where necessary. For the *.gablam.tdt
output, each Query-Hit pair is considered independently and local hits are reduced to (a) unique coverage of the Query for Qry_*
output, and (b) unique coverage of the Hit for Hit_*
output. (Note that for each Query-Hit pair, only the local hits between that pair are considered for “unique” local hit reduction - there may be overlapping alignments to different queries/hits.) Unless uniquehit=F
, hits will first be reduced to be non-overlapping across all queries before being coverted to Query-Hit pair GABLAM coverage stats.
NOTE: Re-using results does NOT robustly check whether the BUSCOMPSeq data has changed, so this directory should be deleted if re-running with different sequences. BUSCOMP will save the name of the BUSCO file along with the md5sum of its contents to *.NX.input.md5
, which will be checked if present and re-using data. Some basic checks should also be performed during the following results compilation stage, but please note that BUSCO IDs are used for sequence identifiers and so changes in BUSCO hit sequences will not be identified if files have been inappropriatley copied/moved between runs etc. - please look out for unexpected behaviour outside a “clean” run.
NOTE: Minimap2 only works with high sequence identity. BUSCOMP is designed to be used on multiple assemblies from the same species. If using cross-species BUSCO analysis, odd results might be obtained, biased by the evolutionary distance from the species with the most BUSCOMP sequences. Under these circumstances, it might be better to swap out minimap2 for BLAST+. This can be achieved by independently running GABLAM using the BUSCOMPSeq fasta file as queries and each assembly as the search database, then replacing the *.gablam.tdt
and *.hitsum.tdt
files before re-running BUSCOMP. Please contact the author if you want help with this.
Minimap2 searches of the compiled BUSCOMP sequences are then compiled in similar vein to the original BUSCO results. The primary difference is that the search results need to first be converted into BUSCO-style ratings. This rating is explicitly more “quick and dirty” than the original BUSCO ratings, and should be considered complementary rather than superior. It aims to provide a quick, consistent assessment, but does have fewer checks and balances as a result.
BUSCOMP uses results from both the *.hitsum.tdt
table (overall coverage in assembly) and the *.gablam.tdt
table (coverage in a single contig/scaffold) to rate sequences as:
Complete
ratings, these might include part of closely related paralogues.)When compiling the results for all BUSCO genes, Single/Duplicated Complete hits will also be rated as Identical if they have 100% coverage and 100% identity in at least one contig/scaffold.
Once all the individual assemblies have been rated for the full set of assemblies, results are compiled across Groups as described for the original BUSCO results (above). Because no genes receive an individual “Identical” rating, Groups will not have a summary count for Identical hits.
Individual gene ratings for each genome and group are output to *.LnnIDxx.buscomp.tdt
, where LnnIDxx
records the minloclen=nn
and minlocid=xx
settings. Compiled ratings are output to *.LnnIDxx.ratings.tdt
.
The final step of the BUSCOMP compilation is to summarise the findings in the log file, akin to the BUSCO summary file. This will first generate a one line summary of the percentages, along with the original number of complete BUSCOs and the number of BUSCOMP sequences contributed by that assembly (i.e. the number with the best score of all the BUSCO searches.) This is followed by a more detailed breakdown of each category. For example:
#BUSCO 00:00:41 Results: C:89.2%[S:87.8%,D:1.4%,I:23.3%],F:5.8%,P:3.8%,G:1.6%,M:0.9%,n:3736 - canetoad_v2.2 (3194 Complete BUSCOs; 102 BUSCOMP Seqs)
#INFO 00:00:41 870 Identical BUSCOs (I) [100% complete and identical]
#INFO 00:00:41 3333 Complete BUSCOs (C) [95%+ coverage in a single contig/scaffold]
#INFO 00:00:41 3281 Complete and single-copy BUSCOs (S) [1 Complete hit]
#INFO 00:00:41 52 Complete and duplicated BUSCOs (D) [2+ Complete hits]
#INFO 00:00:41 217 Fragmented BUSCOs (F) [95%+ coverage spread over 2+ contigs/scaffolds]
#INFO 00:00:41 143 Partial BUSCOs (P) [40-95% coverage]
#INFO 00:00:41 61 Ghost BUSCOs (G) [<40% coverage]
#INFO 00:00:41 34 Missing BUSCOs (M) [No hits]
#INFO 00:00:41 3736 Total BUSCO gene hits searched
There is a risk that performing a low stringency search will identify homologues or pseudogenes of the desired BUSCO gene in error. If there is a second copy of a gene in the genome that is detectable by the search then we would expect the same genes that go from Missing
to Complete
in some genomes to go from Single
to Duplicated
in others.
To test this, data is reduced for each pair of genomes to BUSCO-BUSCOMP rating pairs of:
Single
-Single
Single
-Duplicated
Missing
-Missing
Missing
-Single
This is then converted in to G
ain ratings (Single
-Duplicated
& Missing
-Single
) or N
o Gain ratings (Single
-Single
& Missing
-Missing
). The Single
-Duplicated
shift in one genome is then used to set the expected Missing
-Single
shift in the other, and assess the probability of observing the Missing
-Single
shift using a cumulative binomial distribution, where:
k
is the number of observed GG
pairs (Single
-Duplicated
and Missing
-Single
)n
is the number of Missing
-Single
G
ains in the focal genome (NG
+GG
)p
is the proportion of Single
-Duplicated
G
ains in the background genome (GN
+GG
/ (GN
+GG
+NN
+NG
))pB
is the probability of observing k+
Missing
-Single
gains, given p
and n
This is output to *.gain.tdt
, where each row is a Genome and each field gives the probability of the row genome’s Missing
-Single
gains, given the column genome’s Single
-Duplicated
gains.
Contents of the main output files are given below. In addition to the tab-delimited text output, a summary HTML (and source RMarkdown) file will be generated, assuming R is installed. If buscompseq=T
then a fasta file of the “best” (top-scoring) Single Complete BUSCO genes will also be generate (*.buscomp.fasta
), unless it is provided using buscofas=FASFILE
.
Unless buscompseq=F
or buscofas=FASFILE
is provided, nucleotide and protein sequences for the “best” BUSCO gene hits will be output to *.buscomp.fasta
(nucleotide) and *.buscomp.faa
(protein). The *.buscomp.faa
protein file is not used by BUSCOMP but is provided as a useful set of (near-)complete protein sequences. Each fasta file is made from concatenating individual fasta files from the single_copy_busco_sequences/
directories. As such, they will be named as with the standard BUSCO output:
>$BUSCOID $FASTAFILE:$CONTIG:$START-$END
NOTE: The $FASTAFILE
path given in this file will be the one from the original BUSCO run, not the path to the genome file used by BUSCOMP if it has subsequently been moved/renamed.
The *.genomes.tdt
table is the main summary table for the input genomes, their BUSCO rating summaries and genome statistics.
#
= Sorting order for outputDirectory
= Directory for this BUSCO runPrefix
= Prefix identified from BUSCO directory (or full table), corresponding to BUSCO output filesGenome
= Optional alternative name (or “alias”) to be used in outputsFasta
= path to genome file, if foundsummarise=T
. (See above.)Sequences
(True/False) = Whether single_copy_busco_sequences/
directory foundTable
= Path to full_*.tsv
tableTable
is True
) (See above.)The *.groups.tdt
table has the mappings between Genome
and Group
identifiers to make the groups for combined ratings (see above).
#
= Arbitrary unique key for tableGenome
= Genome alias or PrefixGroup
= Name for grouped ratingThe *.full.tdt
table is a compiled version of the all the individual BUSCO full results tables.
Genome
= Genome alias#
= Arbitrary unique key element for table (adjusting for duplicates) - BuscoID
= BUSCO gene identifierStatus
= BUSCO ratingContig
= Contig containing BUSCO hitStart
= Start positionEnd
= End positionScore
= BUSCO scoreLength
= Length of BUSCO hitThe *.busco.tdt
table has the compiled ratings for individual BUSCO genes in each genome/group
BuscoID
= BUSCO EOG gene identifierThe *.buscoseq.tdt
table has the set of best BUSCO genes used for the compiled BUSCOMPSeq sequence set. In addition to the BUSCO stats for all the BUSCOMP sequences, BUSCOMP ratings for each Genome are output in this file.
Genome
= Genome aliasBuscoID
= BUSCO gene identifierStatus
= BUSCO ratingContig
= Contig containing BUSCO hitStart
= Start positionEnd
= End positionScore
= BUSCO scoreLength
= Length of BUSCO hitThe *.LnnIDxx.buscomp.tdt
table is the same as the *.busco.tdt
but with revised ratings based on BUSCOMP analysis.
BuscoID
= BUSCO EOG gene identifierThe *.LnnIDxx.ratings.tdt
table has the compiled BUCOMP summary ratings per genome/group.
Genome
= Genome alias or group nameN
= Number of BUSCOMP genesIdentical
= 100% coverage and 100% identity in at least one contig/scaffold. These will also be rated as Complete
or Duplicated
. Groups do not have Identical
ratings.Complete
= 95%+ Coverage in a single contig/scaffold. (Note: accuracy/identity is not considered.)Duplicated
= 95%+ Coverage in 2+ contigs/scaffolds.Fragmented
= 95%+ combined coverage but not in any single contig/scaffold.Partial
= 40-95% combined coverage.Ghost
= Hits meeting local cutoff but <40% combined coverage.Missing
= No hits meeting local cutoff.The *.LnnIDxx.changes.tdt
table has counts of change ratings (BUSCO to BUSCOMP) for each genome, in addition to a Total count across all genomes.
BUSCO
= BUSCO rating.BUSCOMP
= BUSCOMP rating.Total
= Summed count over all genomes.The *.LnnIDxx.changes.full.tdt
table is the complete set of ratings changes (BUSCO to BUSCOMP) for each gene and genome.
BuscoID
= BUSCO EOG gene identifierComplete, Duplicated, Fragmented, Partial, Ghost, Missing
The *.LnnIDxx.unique.tdt
table has Genome
and Group
identifiers for each BuscoID
where that gene is only Complete
(or Duplicated
) in that genome/group.
BuscoID
= BUSCO EOG gene identifierBUSCO
= Genome/Group that uniquely has a BUSCO Complete
rating.BUSCOMP
= Genome/Group that uniquely has a BUSCOMP Complete
rating.The *.LnnIDxx.rdata.tdt
table has the BUSCOMP summary ratings (converted into percentage values) and sequence statistics for each Genome
/Group
, in addition to:
BUSCO
= BUSCO Complete
(Single
and Duplicated
) percentageNoBUSCO
= BUSCO Missing
percentageUniqBUSCO
= Number of unique BUSCO Complete
genesUniqBUSCOMP
= Number of unique BUSCOMP Complete
genescol
= colour field for R plotting.pch
= point type for R plotting.label
= Genome/Group label for R plotting.plot
= TRUE/FALSE, whether to plot the Genome/Group statistics.best
= any categories for which this Genome is rated “best”.Minimap2 will be run for each genome with a Fasta
file, generating:
*.paf
*.paf.cmd
*.input.md5
*.L20ID80.local.tdt
*.L20ID80.hitunique.tdt
*.L20ID80.qryunique.tdt
*.L20ID80.hitsum.tdt
*.L20ID80.gablam.tdt
Output details will be added here.
The final step in BUSCOMP analysis is to generate a summary report. This is produced as an RMarkdown document (*.Rmd
) and (assuming the path to pandoc is set) is then “knitted” to make an HTML file (*.html
). The RMarkdown file is retained to make it easy for the user to modify the content and convert the output into a report of their own. HTML output can be switched off with rmdreport=F
. Unless fullreport=F
, a larger report with full BUSCO results tables and comparative plots of Missing BUSCO genes will be generated (`*.full.html’).
Prior to generation of the document, results are first compiled into an overview stats file with additional plot attributes, before the “best” assemblies are identified under various criteria. Finally, there is an option to modify some of the plotting attributes before the report is generated.
The main BUSCOMP *.ratings.tdt
output is combined with key genome statistics and BUSCO ratings from the *.genomes.tdt
table. BUSCOMP ratings are converted into percentage values. BUSCO ratings are converted into percentage values and reduced to BUSCO
(Single and Duplicated Complete BUSCOs) and NoBUSCO
(Missing BUSCOs). (Full BUSCO results are plotted directly from the *.genomes.tdt
.)
After results are compiled, additional plotting fields are generated:
col
= Plotting colour. (Default, “red”. Genomes with 1+ “best” ratings, “blue”)pch
= Point type. (Default, 21 [circle]. Genomes with “best” rating will be 22 [square] for best BUSCO(MP) ratings, 24 [triangle] for best contiguity ratings, or 23 [diamond] for best in both categories.label
= Additional text field to be used for labels. In interactive mode, the option will be given to leave this blank for unlabelled plots.plot
= Whether or not to include a genome in the plot. (Default, TRUE)In interactive mode, the option will be provided to edit plotting attributes for each assembly, following the calculation of “best” assemblies (below). Compiled data are then saved to *.LnnIDxx.rdata.tdt
for summary plots and tables in the RMarkdown output.
There are many ways of assessing genome assembly quality, but they can be broadly broken down into four aspects:
Coverage. How much of the genome is included in the assembly.
Contiguity. How many fragments is the genome in, and how big are they.
Accuracy. How accurate is the assembly in terms of sequence errors.
Redundancy. How much of the genome has been included multiple times.
Standard reference-free genome statistics (e.g. number, length and gappiness of scaffolds), can only give limited insights. Whilst assemblies smaller than the genome size are clearly missing bits, assembly size could be artificially inflated by redundant regions, making it impossible to assess Coverage. Scaffold sizes can give an indicator of Contiguity, but are also prone to biases introduced by filtering of small scaffolds, for example. If an estimated Genome Size can be provided, the most useful measures in this respect are NG50
and LG50
. These statistics sort scaffolds by length (big to small) and identify the contig/scaffold size at which half the genome (not the assembly) is covered by contigs/scaffolds of that size or bigger. This is the NG50
value (called NG50Length
in BUSCOMP), with LG50
(LG50Count
) being the number of contigs/scaffolds needed to cover half the genome. (N50
and L50
are the same statistics only for the assembly.) These statistics can still be mislead by redundancy in the assembly. None of theses statistics speak to sequence accuracy.
The power of BUSCO is that it speaks to all four aspects of quality. Complete
and Fragmented
genes give an indication of Coverage, Continuity and Accuracy; Duplicated
genes give an indication of Redundancy; Missing
genes give an indication of Coverage and Accuracy. However, the weakness is that these different aspects cannot easily be disentangled. This makes side-by-side comparisons of different assemblies challenging, as it is not always clear what a difference is indicating.
BUSCOMP is designed on the principle that Coverage and Contiguity are the two most important aspects of assembly quality. Accuracy can, in principle, be improved by additional error-correction steps (including manual curation). Suspected Redundancy can likewise be identified a flagged. Coverage and Contiguity, in contrast, can only be improved by further assembly - either trying again from scratch, or employing a scaffolding or gap-filling alogrithm.
BUSCOMP has therefore identified seven statistics that can be used to rank assemblies on inferred Completeness or Contiguity:
Complete
= Maximum number of Complete BUSCOMP ratings.BUSCO
= Maximum number of Complete BUSCO ratings.Missing
= Smallest number of Missing BUSCOMP ratings.NoBUSCO
= Smallest number of Missing BUSCO ratings.MaxLength
= Maximum length of contigs/scaffolds.NG50Length
= Half the genome lies on contigs/scaffolds at least this size. (See above.)LG50Count
= Half the genome can be covered by this many contigs/scaffolds. (See above.)Individual assemblies are rated as “best” under all seven criteria, with ties allowed. Each assembly can be best under multiple criteria and each criterion can have several best assemblies. BUSCOMP will report all such combinations.
The BUSCOMP HTML reports (full and summary) consist of the following sections:
Details of the BUSCOMP run, including the version, directory and commands.
This section contains summary details of the genome assemblies analysed, including the loaded data, the summary statistics for the assemblies, and coverage/contiguity assessment plots:
The compilation of BUSCO ratings, including details of the BUSCOMP Groups is given in this section. In addition to the summary ratings for each genome/group, the full report will have a table of all the individual gene ratings:
Next, BUSCOMP ratings using the compiled BUSCOSeq dataset are reported. In addition to the summary ratings for each genome/group, the full report will have a table of all the individual gene ratings:
The final report section features direct comparisons of the BUSCO and BUSCOMP ratings, reporting changes in ratings between BUSCO and BUSCOMP. Unique Complete genes are also identified: those rated as Complete
in only a single genome, or multiple genomes in a single Group. The full report also has a series of summary ratings plots for subsets of BUSCO genes that are rated as Missing
in a genome or group.
Details of the BUSCOMP run in terms of when, where and how (e.g. commandline options) are found in the Appendix. Any errors or warnings generated during the BUSCO run are reported here. Check the *.log
file generated for details.
© 2019 Richard Edwards | richard.edwards@unsw.edu.au