Running Diploidocus
Diploidocus is written in Python 2.x and can be run directly from the commandline:
python $CODEPATH/diploidocus.py [OPTIONS]
If running as part of SLiMSuite, $CODEPATH
will be the SLiMSuite tools/
directory. If running from the standalone Diploidocus git repo, $CODEPATH
will be the path the to code/
directory. Please see details in the Diploidocus git repo for running on example data.
Dependencies
For full functionality, Diploidocus needs a number of additional programs installed on the system:
- Python 2.x and Python 3.x
- KAT
- BedTools
- R
- purge_haplotigs
- bbmap
- BLAST+
- Minimap2 (added to environment
$PATH
or given with the minimap2=PROG
setting)
- Samtools
To generate documentation with dochtml
, R will need to be installed and a pandoc environment variable must be set, e.g.
export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc
For Diploidocus documentation, run with dochtml=T
and read the *.docs.html
file generated.
Commandline options
A list of commandline options can be generated at run-time using the -h
or help
flags. Please see the general SLiMSuite documentation for details of how to use commandline options, including setting default values with INI files.
### ~ Main Diploidocus run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE : Input sequence assembly [None]
runmode=X : Diploidocus run mode (diploidocus/dipcycle/gensize/purgehap/telomeres/vecscreen/regcheck/regcnv/deptrim/sortnr/diphap/diphapnr/insilico) [diploidocus]
basefile=FILE : Root of output file names [diploidocus or $SEQIN basefile]
summarise=T/F : Whether to generate and output summary statistics sequence data before and after processing [True]
genomesize=INT : Haploid genome size (bp) [0]
scdepth=NUM : Single copy ("diploid") read depth. If zero, will use SC BUSCO mode [0]
bam=FILE : BAM file of long reads mapped onto assembly [$BASEFILE.bam]
bamcsi=T/F : Use CSI indexing for BAM files, not BAI (needed for v long scaffolds) [False]
reads=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
dochtml=T/F : Generate HTML Diploidocus documentation (*.docs.html) instead of main run [False]
tmpdir=PATH : Path for temporary output files during forking (not all modes) [./tmpdir/]
### ~ Genome size prediction options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
busco=TSVFILE : BUSCO full table [full_table_$BASEFILE.busco.tsv]
readbp=INT : Total combined read length for depth calculations (over-rides reads=FILELIST) []
quickdepth=T/F : Whether to use samtools depth in place of mpileup (quicker but underestimates?) [False]
depdensity=T/F : Whether to use the BUSCO depth density profile in place of modal depth [True]
mapadjust=T/F : Whether to adjust predicted genome size based on read length:mapping ratio [False]
### ~ DiploidocusHocusPocus and Purge haplotigs options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
kmerreads=FILELIST : File of high quality reads for KAT kmer analysis []
10xtrim=T/F : Whether to trim 16bp 10x barcodes from Read 1 of Kmer Reads data for KAT analysis [False]
minmedian=INT : Minimum median depth coverage to avoid low coverage filter [3]
minlen=INT : Minimum scaffold length to avoid low quality filter [500]
purgehap=X : Purge_haplotigs method (purgehap/diploidocus) [purgehap]
phlow=INT : Low depth cutoff for purge_haplotigs (-l X). Will use SCDepth/4 if zero. [0]
phmid=INT : Middle depth for purge_haplotigs (-m X). Will derive from SCDepth if zero. [0]
phhigh=INT : High depth cutoff for purge_haplotigs (-h X). Will use SCDepth x 2 if zero. [0]
zeroadjust=T/F : Add zero coverage bases to purge_haplotigs LowPerc and adjust total [True]
includegaps=T/F : Whether to include gaps in the zero coverage bases for adjustment (see docs) [False]
mingap=INT : Minimum length of a stretch of N bases to count as a gap for exclusion [10]
purgemode=X : Rules used for purgehap analysis (simple/complex/nala) [complex]
diploidify=T/F : Whether to generate alternative diploid output with duplicated diploid contigs and no hpurge [False]
pretrim=T/F : Run vectrim/vecmask and deptrim trimming prior to diploidocus run [False]
maxcycle=INT : Restrict run to maximum of INT cycles (0=No limit) [0]
### ~ Depth Trim options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
deptrim=INT : Trim termini with <X depth [1]
mintrim=INT : Min length of terminal depth trimming [1000]
### ~ Telomere options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
telofwd=X : Regex for 5' telomere sequence search [C{2,4}T{1,2}A{1,3}]
telorev=X : Regex for 5' telomere sequence search [T{1,3}A{1,2}G{2,4}]
telosize=INT : Size of terminal regions (bp) to scan for telomeric repeats [50]
teloperc=PERC : Percentage of telomeric region matching telomeric repeat to call as telomere [50]
telonull=T/F : Whether to output sequences without telomeres to telomere table [False]
### ~ VecScreen options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
screendb=FILE : File of vectors/contaminants to screen out using blastn and VecScreen rules []
screenmode=X : Action to take following vecscreen searching (report/purge) [report]
minvechit=INT : Minimum length for a screendb match [50]
minidhit=INT : Minimum length for an identical screendb match (based on NCBI filtering) [27]
efdr=NUM : Expected FDR threshold for VecScreen queries (0 is no filter) [1.0]
vecpurge=PERC : Remove any scaffolds with >= PERC % vector coverage [50.0]
vectrim=INT : Trim any vector hits (after any vecpurge) within INT bp of the nearest end of a scaffold [1000]
vecmask=INT : Mask any vector hits of INT bp or greater (after vecpurge and vectrim) [900]
maskmode=X : Whether to perform full (all bases) or partial (every other base) masking of hits [partial]
keepnames=T/F : Whether to keep names unchanged for edited sequences or append 'X' [False]
veccheck=T/F : Check coverage of filtered contaminant hits using reads=FILELIST data [False]
### ~ Region checking options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
regcheck=FILE : File of SeqName, Start, End positions for read coverage checking [None]
checkfields=LIST: Fields in checkpos file to give Locus, Start and End for checking (unless GFF) [Hit,SbjStart,SbjEnd]
checkflanks=LIST: List of lengths flanking check regions that must also be spanned by reads [0,100,1000,5000]
spanid=X : Generate sets of read IDs that span veccheck/regcheck regions, grouped by values of field X []
regcnv=T/F : Whether to calculate mean depth and predicted CNV of regcheck regions based on SCdepth [True]
gfftype=LIST : Optional feature types to use if performing regcheck on GFF file (e.g. gene) ['gene']
mingapspan=INT : Minimum number of reads spanning a gap in order to re-assemble [2]
minlocid=PERC : Minimum percentage identity for aligned chunk to be kept (local %identity) [0]
minloclen=INT : Minimum length for aligned chunk to be kept (local hit length in bp) [500]
### ~ SortNR filtering/output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
checkcov=PERC : Percentage coverage for double-checking partial exact matches [95]
seqout=FILE : Output sequence assembly [$BASEFILE.nr.fasta]
### ~ In silico diploid input/output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
rqfilter=X : Minimum RQ for output subreads [0]
lenfilter=X : Min read length for filtered subreads [500]
parent1=FOFN : File of file names for subreads fasta files on Parent 1. []
parent2=FOFN : File of file names for subreads fasta files on Parent 2. []
See also SMRTSCAPE `summarise=T` options if `*.unique.tdt`/`*.smrt.tdt` have not been pre-generated with SMRTSCAPE.
### ~ Advanced/Developmental options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
memperthread=INT: Number of Gb per thread to allocate to samtools etc. [6]
useqsub=T/F : Whether to use qsub to queue up system calls [False]
qsubvmem=INT : Memory setting (Gb) when queuing with qsub [126]
qsubwall=INT : Walltime setting (hours) when queuing with qsub [12]
modules=LIST : List of modules that needs to be loaded for running with qsub []
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
Diploidocus run modes
Details for the main Diploidocus run modes are given below.
NOTE: Diploidocus is under development and documentation might be a bit sparse. Please contact the author or post an issue on GitHub if you have any questions.
Cycled Diploidocus filtering [runmode=dipcycle]
Diploidocus can be automated to cycle through repeated rounds of the main purge_haplotigs/filtering method until no further scaffolds are removed. Each cycle will run Diploidocus with a numerical suffix, e.g. $BASEFILE.1.*
, using the *.diploidocus.fasta
output from the previous cycle as input. Cycling will continue until no further scaffolds are filtered into either *.quarantine.fasta
or *.junk.fasta
.
Output for each cycle will be initially generated in the run directory but then moved to a dipcycle_$BASEFILE
directory upon completion.
Final outputs from the final cycle will then be compiled under the original $BASEFILE
prefix:
$BASEFILE.diploidocus.tdt
= Final ratings for the input scaffolds. This is like the single cycle output with an additional Cycle
field containing the last cycle this scaffold was processed in.
$BASEFILE.ratings.tdt
= Reduced final ratings output for the input scaffolds (SeqName
,SeqLen
,ScreenPerc
,Class
,Rating
,Set
,Cycle
).
$BASEFILE.diploidocus.fasta
= the scaffolds kept from the final Diploidocus cycle
$BASEFILE.repeats.fasta
= the repeat scaffolds excluded from core
$BASEFILE.core.fasta
= the same set of scaffolds, minus repeats
$BASEFILE.quarantine.fasta
= concatenated purged scaffolds from all Diploidocus cycles.
$BASEFILE.junk.fasta
= concatenated low coverage and low quality scaffolds, removed as junk, from all cycles.
NOTE: Contents for these four *.fasta
files are summarised in the main log. Individual purge cycles have their own log files in the dipcycle_$BASEFILE
directory.
See runmode=diploidocus
documentation for more details.
Re-starting aborted runs
If Diploidocus is killed before completion, files will not have been moved to the dipcycle_$BASEFILE
directory. Any cycles that have generated a *.diploidocus.fasta
file will be skipped until an incomplete cycle is found. In the rare case that Diploidocus was killed whilst generating fasta output, and *.diploidocus.fasta
is thus incomplete, further cycles will be messed up. In this case, deleted the relevant *.diploidocus.fasta
file before re-running.
If Diploidocus is run with runmode=dipcycle
and the dipcycle_$BASEFILE
directory is found, it will abort. Rename or delete the directory to re-run Diploidocus with the same $BASEFILE
setting, or change basefile=X
.
Main Diploidocus filtering [runmode=diploidocus]
Diploidocus builds on the PurgeHaplotigs classifications to use the depth bins, KAT assembly kmer frequencies and BUSCO results to reclassify scaffolds and partition them into:
*.diploidocus.fasta
= the scaffolds kept for the next round of PurgeHap
*.core.fasta
= the same set of scaffolds, minus repeats
*.repeats.fasta
= the repeat scaffolds excluded from core
*.junk.fasta
= low coverage scaffolds, removed as junk
*.quarantine.fasta
= putative haplotigs, removed from the assembly but retained for reference.
Unless summarise=F
, summary statistics for sequence data before and after processing will be included in the run log.
In addition, two key output tables are generated:
*.diploidocus.tdt
= full compiled data and classification. (See field descriptions, below.)
*.rating.tdt
= reduced data and classification for easier extraction of sequence names with grep
etc.
All output files are named with the basefile=X
prefix. This is the prefix of the assembly given with seqin=FILE
if no basefile=X
setting is provided.
Dependencies
Diploidocus needs the following programs installed for full functionality:
bbmap
blast+
kat
minimap2
purge_haplotigs
samtools
Data compilation
Once all the different elements have been run, Diploidocus compiles results per input sequence, extracting a subset of fields of interest and renaming fields where appropriate. The following results files are compiled, joined by input sequence name (SeqName
):
*.depth.tdt
: bbmap pileup.sh read mapping statistics
SeqName
= Scaffold sequence name
SeqLen
= Scaffold sequence length (bp)
Median_fold
= Median long read sequence depth for sequence
Avg_fold
= Mean long read sequence depth for sequence
Covered_percent
= Percentage of sequence covered by long reads
Covered_bases
= Number of bp covered by long reads
Plus_reads
= Number of reads mapped to positive strand
Minus_reads
= Number of reads mapped to negative strand
Read_GC
= GC content of reads mapping to sequence
*.purge.coverage_stats.csv
: purge_haplotigs coverage stats
LowPerc
= Proportion of bases with read depth below low coverage cutoff
HapPerc
= Proportion of bases with read depth between low coverage and mid coverage cutoffs
DipPerc
= Proportion of bases with read depth between mid coverage and high coverage cutoffs
HighPerc
= Proportion of bases with read depth above high coverage
*.purge.reassignments.tsv
: purge_haplotigs best sequence hits and reassignments
TopHit
= scaffold with biggest sequence coverage match
SecHit
= scaffold with second biggest sequence coverage match
TopHitCov
= percentage sequence coverage by TopHit
MaxHitCov
= combined percentage sequence coverage by all sequence hits
PurgeHap
= purge_haplotigs reassignment rating
TopNum
= number of sequences for which this sequence is the TopHit
SecNum
= number of sequences for which this sequence is the SecHit
*.selfkat-stats.tsv
: assembly kmer coverage stats
SelfMedK
= Median assembly kmer frequency for sequence
SelfAvgK
= Mean assembly kmer frequency for sequence
*.kat-stats.tsv
: short read kmer coverage stats
MedK
= Median short read kmer frequency for sequence
AvgK
= Mean short read kmer frequency for sequence
SeqGC
= GC content for sequence
KPerc
= Percentage of non-gap kmers in sequence that are found in short read data
- NOTE: In the absence of short read data, these values will be set to -1
*.screencov.tdt
: Contaminant coverage from vecscreen mode
ScreenPerc
= Total combined sequence coverage by contaminants
*.telomeres.tdt
: Telomere prediction results
Tel5
= Whether a 5’ telomere is predicted
Tel3
= Whether a 5’ telomere is predicted
Tel5Len
= Length in window chunks of 5’ telomere
Tel3Len
= Length in window chunks of 3’ telomere
TelPerc
= Percentage of sequence predicted to telomeres. (Crude calculation.)
full_table_*.busco.tsv
Complete
= Number of BUSCO Complete genes in sequence
Duplicated
= Number of BUSCO Duplicated genes in sequence
Fragmented
= Number of BUSCO Fragmented genes in sequence
$SEQIN
: input assembly
N_bases
= total count of all unresolved (N
) bases
Gap_bases
= total count of all gap (N
) bases as determined by mingap=INT
SeqDesc
= sequence description
Diploidocus classification
Diploidocus classification adds the following fields to the compiled table:
Class
= Six-part (PURITY|DEPTH|HOM|TOP|MEDK|BUSCO
) Diploidocus classifcation (see below)
TopClass
= Class of TopHit
(if any)
SecClass
= Class of SecHit
(if any)
Classification is based on a combination of commandline parameters and hard-coded cutoffs:
minmedian=INT
: Minimum median depth coverage to avoid low coverage filter [3]
pureperc=80
: Depth class percentage to classify as PURE
rephitcov=250
: Min MaxHitCov
value to classifty as REPEAT
Each sequence is given classification in six different criteria (PURITY|DEPTH|HOM|TOP|MEDK|BUSCO
) with two possible suffixes:
PURITY
: Purity of dominant read depth class
PURE
= At least 80% of sequence in that depth bin
GOOD
= At least 50% of sequence in that depth bin
WEAK
= Under 50% of sequence in that depth bin
DEPTH
: Dominant read depth class
LOWX
= Median read depth (Median_fold
) fails to meet minmedian=INT
criterion (default=3)
LOW
= LowPerc
read depth bin has highest percentage coverage (ties assigned to other class)
HAP
= HapPerc
read depth bin has highest percentage coverage (non-DIP
ties addigned to HAP
)
DIP
= DipPerc
read depth bin has highest percentage coverage (ties assigned to DIP
)
EXS
= HighPerc
read depth bin has highest percentage coverage
HOM
: Homology/repeat status based on purge_haplotigs hits
UNIQ
= No TopHit
PART
= Partial (<50%) coverage of TopHit
HAPL
= 50%+ TopHit
coverage but no SecHit
HOMO
= TopHit
and SecHit
but combined MaxHitCov
< 250%
REPT
= TopHit
and SecHit
and 250%+ combined MaxHitCov
TOP
: Top/Sec Hit status for sequence
TOP
= TopHit
for at least one other sequence
SEC
= Not a TopHit
but SecHit
at least one other sequence
NON
= Neither a TopHit
nor SecHit
for any other sequence
MEDK
: Assembly redundancy based on KAT assembly kmers
PRI
= Over 50% unique kmers (SelfMedK
= 1)
ALT
= Median kmer frequency of two (SelfMedK
= 2)
REP
= Median kmer frequency exceeds two (SelfMedK
> 2)
BUSCO
: Dominant BUSCO class (can help decide about risk)
COMP
= 1+ Complete BUSCO genes and more Complete than Duplicated
DUPL
= 1+ Duplicated BUSCO genes and more Duplicated than Complete
FRAG
= 1+ Fragmented BUSCO genes and no Complete or Duplicated
NONE
= No Complete, Duplicated or Fragmented BUSCO genes
+TEL
: If any telomeres are detected, +TEL
is added
+VEC
: If any contamination is detected, +VEC
is added
Purge modes [purgemode=X]
Diploidocus rating based on purgemode=X
adds the following fields to the compiled table:
Rating
= Main Diploidocus rating of sequence (see below)
TopRating
= Rating of TopHit
(if any)
SecRating
= Rating of SecHit
(if any)
Set
= Diploidocus output set (keep
/repeat
/quarantine
/junk
)
There are three basic purgemode
implementations:
purgemode=complex
[Default] uses all of the Diploidocus classification data to partition sequences into classes and then partition those classes into the main output sets.
purgemode=simple
uses a simplified set of Diploidocus classification rules data to partition sequences into classes before partitioning those classes into the main output sets. This is conceptually easier than the complex
mode but unlikely to produce as good results.
purgemode=crude
uses a crude set of Diploidocus classification rules data to partition sequences into classes before partitioning those classes into the main output sets. This is more complicated than simple
but still conceptually easier than the complex
mode but unlikely to produce as good results.
purgemode=nala
uses a simplified and conservative set of filtering rules used for the Nala German Shepherd Dog assembly.
Details for these modes can be found below.
Ratings are based on a combination of commandline parameters and hard-coded cutoffs:
minlen=INT
: Minimum scaffold length to avoid low quality filter [500]
mindipcov=20
: Minimim DipPerc coverage to assign a “KEEP” rating (purgemode=crude
)
artefactcov=80
: Min High/Low coverage to assign as artefacts (purgemode=crude
, purgemode=nala
)
hpurgecov=80
: Min TopHitCov to assign HPURGE
rating.
covwarning=95
: Read coverage threshold for warnings
Complex rating
Diploidocus performs a hierarchical rating of scaffolds, based on their classifications and compiled data:
- Initial low quality filter:
CONTAMINATION
= 50%+ identified contamination (ScreenPerc
)
LOWCOV
= Poor median read coverage (Median_fold < minmedian=INT
)
LOWQUAL
= Scaffolds below the sequence length set by minlen=INT
LOWCOV
= Low read depth and 50%+ sequence without short read kmers (PURE|LOW|...
or GOOD|LOW|...
, and MedK=0
)
LOWCOV
= Low read depth and high frequency assembly kmers (...|LOW|...|REP|...
)
- High quality scaffolds to keep:
QUALITY
= Highest quality scaffolds: pure duploid, complete BUSCOs, no Duplicated BUSCOs (PURE|DIP|UNIQ|...|PRI|COMP
& Duplicated
= 0)
FINAL
= As Quality but PRI|FRAG
and PRI|NONE
also allowed (PURE|DIP|UNIQ|...|PRI|...
& Duplicated
= 0)
CORE
= Predominantly Diploid with <50% covered by TopHit
and SelfMedK
=1 ([PURE/GOOD]|DIP|[UNIQ/PART]|...|PRI|...
)
COREHAP
= Predominantly haploid but <50% covered by TopHit
and 1+ Complete BUSCOs (...|HAP|[UNIQ/PART]|...
& Complete
> 0)
PRIMARY
= Putative primary scaffold but with possible alternative scaffolds still in assembly and/or low quality regions (...|DIP|[UNIQ/PART/HAPL/HOMO]|...|[PRI/ALT]|...
)
- Repetitive scaffolds to keep:
PRIRPT
= Putative primary scaffold but >50% repeated (...|DIP|[UNIQ/PART]|...|REP|...
or [PURE/GOOD]|DIP|...|REP|...
)
COLLAPSED
= High coverage scaffolds representing putative collapsed repeats. (Check for other contamination.) (...|EXS|[UNIQ/PART]|...
or [GOOD/PURE]|EXS|...|TOP|...
, or WEAK|EXS|...
and less of the sequence in the Diploid coverage bin than below Diploid coverage.
REPEAT
= Predominantly Diploid scaffolds that have major signs of redundancy, probably due to presence of alternative contigs (...|DIP|REPT|...
or WEAK|DIP|[HAPL/HOMO]|...|REP|...
)
- Haploid coverage scaffolds to keep (NOTE: any
HAPLOTIG
rating will be converted to HAPRPT
if ...|REP|...
)
HAPLOID
= Predominantly haploid coverage but enough unique sequence to keep for now? (Option to quarantine?) Might be very heterozygous Alternative haplotigs (...|HAP|[UNIQ/PART]|PRI|...
)
HAPLOTIG
= Predominantly haploid coverage but enough unique sequence to keep for now - possible Alternative haplotig (...|HAP|[UNIQ/PART]|ALT|...
or ...|HAP|...|TOP|[PRI/ALT]|...
)
HAPRPT
= Low quality scaffold that is probably most repeats, but not bad enough to dump outright (...|HAP|[UNIQ/PART]|...|REP|...
or ...|HAP|...|TOP|...|REP|...
)
HAPLOTIG
= Low/haploid coverage scaffold with insufficient coverage of a Diploid scaffold to purge (...|[LOW/HAP]|[HAPL/HOMO/REPT]|...
and |DIP|
in TopHit
rating and TopHitCov
< hpurgecov
(80%))
HAPLOTIG
= Low/haploid coverage scaffold with insufficient coverage of another scaffold to purge ([PURE/GOOD]|HAP|HOMO|...' and
TopHitCov<
hpurgecov` (80%))
HAPRPT
= Other haploid coverage scaffold with insufficient coverage of another scaffold to purge (...|HAP|...
and TopHitCov
< hpurgecov
(80%))
HAPRPT
= Low coverage and collapsed repeats with insufficient coverage of another scaffold to purge (...|EXS|...
and less of the sequence in the Diploid coverage bin than below Diploid coverage and TopHitCov
< hpurgecov
(80%))
REPEAT
= High coverage scaffold with insufficient coverage of another scaffold to purge ([GOOD/PURE]|EXS|[HAPL/HOMO/REPT]|[SEC/NON]|...
and TopHitCov
< hpurgecov
(80%))
- Scaffolds to quarantine for removal
RPURGE
= Messy scaffolds that are largely repeats and are sufficiently redundant/low quality to purge (WEAK|EXS
and less of the sequence in the Diploid coverage bin than below Diploid coverage, or [GOOD/PURE]|EXS|[HAPL/HOMO/REPT]|[SEC/NON]
, and TopHitCov
>= hpurgecov
(80%))
- Additional low coverage scaffolds
LOWCOV
= Low depth of coverage and strong matches to other scaffolds, and not TopHit (...|LOW|[HAPL/HOMO/REPT]|[SEC/NON]
)
- Further scaffolds to quarantine for removal
HPURGE
= Clear candidate haplotig to purge (...|LOW|[HAPL/HOMO/REPT]|...
and |DIP|
in TopHit
rating, or ...|HAP|...
, and TopHitCov
>= hpurgecov
(80%))
- Final low quality filter
LOWQUAL
= Unconvincing scaffolds that do not fall into a clear class but are not bad enough to dump outright as junk (...|LOW|...
)
Any scaffolds that escape the above rules will be UNRATED
. (There should not be any!)
Simple rating
The simplified rating system is based much more on the dominant read depth class. A subset core scaffolds are identified and the rest are classified on the basis of dominant read depth and REP
or REPT
status.
LOWCOV
= Poor median read coverage (Median_fold < minmedian=INT
)
CONTAMINATION
= 50%+ identified contamination (ScreenPerc
)
LOWQUAL
= Scaffolds below the sequence length set by minlen=INT
LOWQUAL
= Low quality artefacts with less than 50% of the scaffold covered by short-read kmers (MedK=0
)
- Keep Scaffolds with Diploid+ depth (
[DIP/EXS]
), or limited homology to other scaffolds ([UNIQ/PART]
) or Top hits for other scaffolds (TOP
) or 1+ Complete
BUSCO genes. These are broken down into:
PRIMARY
= Diploid depth or no TopHit
or dominant BUSCO rating is Complete
, and not repetitive
PRIRPT
= As PRIMARY
but REPT
or REP
classifications
COLLAPSED
= Dominant high depth (EXS
)
REPEAT
= Remaining Scaffolds to keep that have REPT
or REP
classifications
HAPLOID
= Remaining Scaffolds to keep without REPT
or REP
classifications
HPURGE
= Dominant haploid read depth without REPT
or REP
classifications
RPURGE
= Dominant haploid read depth, plus REPT
or REP
classifications
LOWCOV
= Any remaining scaffolds (dominant LOW
read depth)
Crude rating
The crude
purge mode is a less nuanced version of the main Diploidocus rating.
LOWCOV
= Poor median read coverage (Median_fold < minmedian=INT
)
CONTAMINATION
= 50%+ identified contamination (ScreenPerc
)
LOWQUAL
= Scaffolds below the sequence length set by minlen=INT
HPURGE
= Low+Haploid coverage meet artefactcov
cutoff (>=80%) and TopHitCov
meets hpurgecov
cutoff (>=80%)
HAPLOTIG
= Haploid coverage exceeds 50%, TopHitCov
meets hpurgecov
cutoff (>=80%), and SelfMedK=2
- Scaffolds with a Diploid depth coverage otherwise meeting the
mindipcov
cutoff (>=20%) are kept:
PRIMARY
= Median assembly kmer frequency of 1
REPEAT
= MaxHitCov
meets the rephitcov
cutoff (>=250%)
KEEP
= Remaining diploid scaffolds.
LOWCOV
= Any other scaffolds with 80%+ low coverage bases are filtered as Low Coverage.
COLLAPSED
= Any other scaffolds with 80%+ high coverage bases are flagged as COLLAPSED.
HAPLOTIG
= Scaffolds that are 50%+ Haploid Depth if mostly present 2+ times (median assembly kmer frequency = 2)
HAPLOID
= Scaffolds that are 50%+ Haploid Depth if mostly present once (median assembly kmer frequency = 1)
ARTEFACT
= Scaffolds <50% Haploid Depth but 80%+ Low or Haploif depth
HAPLOTIG
= Scaffolds with 80%+ hitting other scaffolds but does not meet the rephitcov
cutoff (<250%)
HAPRPT
= Scaffolds with 80%+ hitting other scaffolds and meets the rephitcov
cutoff (>=250%)
ARTEFACT
= Remaining scaffolds with 80%+ bases in the low/haploid coverage bins
Nala rating
The purgemode=nala
rating scheme was used for the Nala German Shepherd Dog genome assembly, and features a simplified set of ratings:
CONTAMINATION
= 50%+ identified contamination (ScreenPerc
)
LOWCOV
= Poor median read coverage (Median_fold < minmedian=INT
)
LOWQUAL
= Scaffolds below the sequence length set by minlen=INT
HPURGE
= Any scaffold with 80%+ bases in the low/haploid coverage bins (haplotigs or assembly artefacts).
PRIMARY
= Scaffolds with 20%+ diploid coverage are marked as retention as probable diploids.
COLLAPSED
= Scaffolds with <20% diploid coverage and 50%+ high coverage are marked as probable collapsed repeats.
JUNK
/HAPLOTIG
/KEEP
= Remaining Scaffolds are given the PurgeHaplotigs rating (over 80% low/high coverage will be filtered as a probable artefact)
Final sequence sets for output
The purgemode
ratings are then converted into subsets of scaffolds for output:
keep
= Scaffolds that form part of the *.core.fasta
and *.diploidocus.fasta
outputs:
- [‘QUALITY’,‘FINAL’,‘CORE’,‘COREHAP’,‘PRIMARY’,‘PRIRPT’,‘HAPLOID’,‘HAPLOTIG’,‘KEEP’]
repeat
= Scaffolds excluded from the *.core.fasta
but part of the *.diploidocus.fasta
output:
- [‘COLLAPSED’, ‘REPEAT’, ‘HAPRPT’]
quarantine
= Scaffolds removed as probably Haplotigs and/or excess copies of repeats but saved to *.quarantine.fasta
in case additional checks are required:
junk
= Scaffolds removed due to low coverage, poor quality, or contamination. These are partitioned into *.junk.fasta
:
- [‘LOWCOV’, ‘LOWQUAL’, ‘CONTAMINATION’,‘JUNK’]
The final ratings for each TopHit
and SecHit
are also added to the main Diploidocus table as TopHitRating
and SecHitRating
fields. This might flag up additional scaffolds (e.g. with HAPLOTIG
ratings) that the user decides to remove despite failing to meet the automated criteria.
NOTE: A cutdown output of Scaffold classifications and ratings is saved as *.ratings.tdt
. This can be used to easily extract a subset of sequence names that can be filtered from the final output using SeqSuite
, e.g.
python $SLIMSUITE/tools/seqsuite.py seqin=$PREFIX badseq=$(grep HAPLOTIG | awk '{print $1;}' ORS=",") seqout=$PREFIX.filtered.fasta
Rating warnings
Once rating is complete, the following warnings are generated in the log file:
- All
CONTAMINATION
scaffolds
- Scaffolds not classed as
junk
but:
- <50% covered by short read kmers
- any vector contamination flagged
- long read coverage < 95%
- Any
COREHAP/HAPLOID/HAPLOTIG
scaffolds with Duplicated
BUSCO genes (more evidence for redundancy)
- Any purged (
junk
or quarantine
) scaffolds that are the PurgeHaplotigs TopHit
for 1+ other scaffolds
Full Diploidocus outputs
Temporary files generated during forking will be generated in a subdirectory set by tmpdir=PATH
(Default: ./tmpdir/
).
Details of full outputs will be added. Please see other documentation sections for specific outputs, or contact the author if anything is unclear.
Special output mode: diploidify
Diploidocus features a special diploidify=T
mode, which aims to generate fully diploid output. This is achieved by adding HPURGE
scaffolds to the keep
set rather than the quarantine
set. In addition, diploid scaffolds are duplicated. For this purpose, diploid scaffolds are defined as those in the keep
or repeat
sets that meet are:
- Classified as
...|DUP|...
- or Classified as
WEAK|...|NON|...
, i.e. are neither a TopHit
nor SecHit
for any other sequence
Diploid scaffolds are saved with an X2
suffix on their sequence name.
DepthSizer Genome size prediction [runmode=gensize]
NOTE: This mode is now primarily documented and updated through DepthSizer.
The main inputs for Diploidocus genome size prediction are:
seqin=FILE
: Input sequence assembly to tidy [Required].
reads=FILELIST
: List of fasta/fastq files containing long reads. Wildcard allowed. Can be gzipped. For a single run (not cycling), a BAM file can be supplied instead with bam=FILE
. (This will be preferentially used if found, and defaults to $BASEFILE.bam
.) Read types (pb/ont/hifi) for each file are set with readtype=LIST
, which will be cycled if shorter (default=ont
). Optionally, the pre-calculated total read length can be provided with readbp=INT
and/or the pre-calculated (haploid) genome size can be provided with genomesize=INT
.
busco=TSVFILE
: BUSCO full table [full_table_$BASEFILE.busco.tsv
] used for calculating single copy (“diploid”) read depth. This can be over-ridden by setting scdepth=NUM
.
quickdepth=T/F
: Whether to use samtools depth in place of mpileup (quicker but underestimates?) [Default: False
]
This works on the principle that Complete
BUSCO genes should represent predominantly single copy (diploid read depth) regions along with some ooor quality and/or repeat regions. Assembly artefacts and collapsed repeats etc. are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the region is actually diploid coverage, the modal read depth is expected to represent the actual single copy read depth.
Diploidocus uses samtools mpileup
(or samtools depth
if quickdepth=T
) to calculate the per-base read depth and extracts the modal read depth for each BUSCO gene along with the overall modal read depth for all gene regions. Genome size is then estimated based on a crude calculation using the total combined sequencing length. This will be caculated from reads=FILELIST
unless provided with readbp=INT
.
BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE
(default full_table_$BASEFILE.busco.tsv
). This can be replaced with any table with the fields: [‘BuscoID’,‘Status’,‘Contig’,‘Start’,‘End’,‘Score’,‘Length’]. Entries are reduced to those with Status
= Complete
and the Contig
, Start
and End
fields are used to define the regions that should be predominantly single copy.
NOTE: There is currently no adjustment for contamination, read mapping or imbalanced insertion:deletion ratios etc. As a consequence, the current genome size prediction appears to be an over-estimate.
Running Purge_haplotigs using BUSCO-guided cutoffs [runmode=purgehap]
Diploidocus can automate the running of purge_haplotigs is automated, using the single copy (diploid) read depth to set the purge_haplotigs thresholds. If scdepth=NUM
is not set, this will be calculated using BUSCO single copy genes (see Genome size prediction [runmode=gensize] for details).
The purge_haplotigs
mid cutoff (-m X
) will be set at the halfway point between the diploid depth (SCDepth) and the haploid depth (SCDepth/2). The low depth (-l X
) and high depth (-h X
) cutoffs will then be set to SCDepth/4 and SCDepth*2. This can be over-ridden with:
phlow=INT
: Low depth cutoff for purge_haplotigs (-l X
).
phmid=INT
: Middle depth for purge_haplotigs (-m X
).
phhigh=INT
: High depth cutoff for purge_haplotigs (-h X
).
Output from this run is:
*.purge.coverage_stats.csv
: purge_haplotigs coverage stats
*.purge.reassignments.tsv
: purge_haplotigs best sequence hits and reassignments
Additional purge_haplotigs files and output will be saved in a subdirectory:
This is to stop multiple purge_haplotigs runs from interfering with each other.
Telomere finding [runmode=telomere]
Diploidocus has implemented a regex-based search for Telomeres, based on the code at https://github.com/JanaSperschneider/FindTelomeres. This looks for a canonical telomere motif of TTAGGG/CCCTAA, allowing for some variation. For each sequence, Diploidocus trims off any trailing Ns and then searches for telomere-like sequences at sequence ends. For each sequence, the presence/absence and length of trimming are reported for the 5’ end (tel5
and trim5
) and 3’ end (tel3
and trim3
), along with the total percentage telomeric sequence (TelPerc
).
By default, Diploidocus searches for a forward telomere regex sequence of C{2,4}T{1,2}A{1,3}
at the 5’ end, and a reverse sequence at the 3’ end of T{1,3}A{1,2}G{2,4}
. These can be set with telofwd=X
and telorev=X
. Telomeres are marked if at least 50% (teloperc=PERC
) of the terminal 50 bp (telosize=INT
) matches the appropriate regex. If either end contains a telomere, the total percentage of the sequence matching either regex is calculated as TelPerc
. Note that this number neither restricts matches to the termini, not includes sequences within predicted telomeres that do not match the regex.
By default, only sequences with telomeres are output to the *.telomeres.tdt
output, but switching telonull=T
will output all sequences.
Vector/contamination screening [runmode=vecscreen]
This mode screens scaffolds for possible contaminants, given by screendb=FILE
.
VecScreen overview
First, a blastn
search of the screendb=FILE
sequences is performed, using the NCBI VecScreen search and match strategy (see below). These parameters can be modified using blastopt=X
and/or optionfile=FILE
, which will be appended to the run command. Alternatively, the $BASEFILE.vecscreen.blast
file produced can be pre-made with different parameters (if force=F
).
Results are then parsed into a local hits table and rated according to the strength of a match. This is performed iteratively, re-assigning internal matches as “proximal” if they are withing 25 bases of another match (of any strength) and re-rating the strength of the match, until no ratings changes occur. Two additional fields are added to the local hits table during this process: MatchPos
and MatchStr
. Once all assignments have been made, segments of the assembly between two matches and/or sequence ends are added to the table as Suspect
.
MatchPos
will have a value of:
Terminal
= within 25 bp of either end of the sequence.
Proximal
= within 25 bp of a vecsreen match (Weak
, Moderate
or Strong
).
Internal
= over 25 bp from a sequence end or vecsreen match.
Suspect
= Segments added as Suspect
.
MatchStr
will have a value of:
Strong
= Terminal
/Proximal
match with Score >= 24, or Internal
match with Score >= 30.
Moderate
= Terminal
/Proximal
match with Score 19 to 23, or Internal
match with Score 25 to 29.
Weak
= Terminal
/Proximal
match with Score 16 to 18, or Internal
match with Score 23 to 24.
Suspect
= Any segment of fewer than 50 bases between two vector matches or between a match and an end.
These parameters seem to be designed for small (Illumina?) assemblies, and give excessive false positives for larger genomes. Diploidocus therefore features two additional contaminant identification filters:
eFDR
= The “Expected False Discovery Rateis calculated for each contaminant sequence. This is simply the
Expectvalue for that hit, divided by the total number of hits with that
Expectvalue (or lower). If
efdr=Xis set, any hits with an
eFDRvalue exceeding the threshold will be removed. By default, this is set to
1.0, i.e. any contaminants with fewer assembly hits than their
Expect` value will be dropped.
Minimum hit length
= Short hits are inevitable in large assemblies and unlikely to be real for long read assemblies (e.g. without cloning etc.). An additional minvechit=X
setting will remove any really short hits. This is set by default to minvechit=50
, meaning that a hit of at least 50 bp is required.
Finally, the percentage coverage per scaffold is calculated from the filtered hits. This is performed first for each contaminant individually, before being collapsed into total contamination coverage per query.
Results are output into two main delimited results files:
*.vecscreen.tdt
= Contaminant local hit table with VecScreen ratings and eFDR calculation. (Unfiltered)
*.screencov.tdt
= Query coverage per contaminant
Query
= Contaminant name. For total coverage, this will be TOTAL
.
Hit
= Scaffold name.
HitLen
= Length of scaffold (bp).
Coverage
= Covered length of scaffold (bp).
CovPC
= Percentage coverage of scaffold (0-100).
VecScreen BLAST+ parameters (from NCBI website)
The VecScreen BLAST+ parameters are pre-set using blastn options: -task blastn -reward 1 -penalty -5 -gapopen 3 -gapextend 3 -dust yes -soft_masking true -evalue 700 -searchsp 1750000000000
VecScreen Match Categories Vector contamination usually occurs at the beginning or end of a sequence; therefore, different criteria are applied for terminal and internal matches. VecScreen considers a match to be terminal if it starts within 25 bases of the beginning of the query sequence or stops within 25 bases of the end of the sequence. Matches that start or stop within 25 bases of another match are also treated like terminal matches. Matches are categorized according to the expected frequency of an alignment with the same score occurring between random sequences.
Strong Match to Vector (Expect 1 random match in 1,000,000 queries of length 350 kb.) Terminal match with Score >= 24. Internal match with Score >= 30.
Moderate Match to Vector (Expect 1 random match in 1,000 queries of length 350 kb.) Terminal match with Score 19 to 23. Internal match with Score 25 to 29.
Weak Match to Vector (Expect 1 random match in 40 queries of length 350 kb.) Terminal match with Score 16 to 18. Internal match with Score 23 to 24.
Segment of Suspect Origin Any segment of fewer than 50 bases between two vector matches or between a match and an end.
Vector coverage checking
If veccheck=T
then long reads given by reads=FILELIST readtype=LIST
will be mapped onto the assembly using minimap2 and complete coverage of each hit reported. This will report the number of reads that span the entrie contaminant hit, plus flanking distances each side set by checkflanks=LIST
(default: 0,100,1000,5000). If the flanking distance (bp) or the end of the sequence is reached before the end of the read in both directions, that read will be added to the coverage count. Because this can inflate apparent flanking coverage for hits near the end of a sequence, the distance from each end of the sequence is also returned as MaxFlank5
and MaxFlank3
.
Depth trimming [runmode=deptrim]
Depth trimming (deptrim
) mode trims sequence termini of at least mintrim=INT
bp with less than deptrim=INT
read depth. First, samtools mpileup
or depth
(depmethod=X
) is used to find the first and last positions that exceed deptrim=INT
. If no positions meet this criterio, the entire sequence will removed. Otherwise, either terminus that exceeds mintrim=INT
base pairs of insufficent read depth are trimmed.
Assembly region read-spanning and copy number analysis [runmode=regcheck/regcnv]
Region checking, whether for read spanning analysis (runmode=regcheck
) or copy number analysis (runmode=regcnv
or runmode=regcheck regcnv=T
), analyses regions extracted from a delimited file given by: regcheck=FILE
. This can be a GFF file, in which case gfftype=LIST
will restrict analysis to specific feature types, or regions can be defined by checkfields=LIST
, which defines the locus, start and end positions. These fields default to SeqName
, Start
and End
fields. If these fields cannot be found, the first three fields of the regcheck=FILE
file will be used.
Region read-spanning analysis [runmode=regcheck]
Long read data, given with the reads=FILELIST
and readtype=LIST
options, are mapped onto the assembly (seqin=FILE
) using minimap2 to generate a PAF file. This is then parsed and reads spanning each feature based on their positions and the target start and end positions in the PAF file. In addition to absolute spanning of regions, reads spanning a region +/- distances set by checkflanks=LIST
will also be calculated. If the end of a sequence is reached before the end of the read, this will also count as flanking. Such regions can be identified using the MaxFlank5
and MaxFlank3
fields, which identify the maximum distance 5’ and 3’ that a read can span due to sequence length constraints.
If regcnv=T
then the region copy number analysis (below) will also be performed.
NOTE: Depth calculations are performed in parallel in the directory set with tmpdir=PATH
. The number of parallel calculations is set by forks=INT
.
Region copy number analysis [runmode=regcnv]
Copy number analysis uses the same single copy depth profile analysis as the DepthSizer (Diploidocus runmode=gensize
) genome size prediction. In short, the modal read depth of BUSCO single copy Complete
genes is calculated using samtools mpileup
(or samtools depth
if quickdepth=T
) and used to defined “single copy read depth”. BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE
(default full_table_$BASEFILE.busco.tsv
). This can be replaced with any table with the fields: [‘BuscoID’,‘Status’,‘Contig’,‘Start’,‘End’,‘Score’,‘Length’]. Entries are reduced to those with Status
= Complete
and the Contig
, Start
and End
fields are used to define the regions that should be predominantly single copy.
Single-copy read depth can also be set using scdepth=NUM
to re-used or over-ride previously calculated values.
Long read data, given with the reads=FILELIST
and readtype=LIST
options, are then mapped onto the assembly (seqin=FILE
) using minimap2. This can be skipped by providing a BAM file with bam=FILE
. For each regcheck feature, the same samtools depth calculation is performed as for the BUSCO data. The mean depth across the region is then divided by the single copy depth to estimate total copy number across the region. Note that unlike the single copy depth estimation itself, this might be biased by repeat sequences etc. that have a different depth profile to the main region. One way to control for this might be to restrict analysis to a subset of reads that meet a certain minimum length cutoff, e.g. 10kb.
Query-based CNV analysis. If the regcheck=FILE
file has additional Qry
, QryLen
, QryStart
and QryEnd
fields, the copy number analysis will have an additional query-focus. In this case, each region mapping onto a specific query is summed up, adjusted for the proportion of the query covered by that region. For example, 3.5X mean depth of a 100% length copy and 3.0X coverage of a 50% length copy would sum to (3.5x1.0 + 3x0.5 = 5 total copies). If these fields are not present, each region will be analysed independently.
NOTE: Depth calculations are performed in parallel in the directory set with tmpdir=PATH
. The number of parallel calculations is set by forks=INT
.
GapSpanner functions [runmode=gapspan/gapass/gapfill]
NOTE: These modes are now primarily documented and updated through GapSpanner.
Assembly gap read-spanning analysis [runmode=gapspan]
This mode first identifies all the gaps in an assembly (seqin=FILE
) (using SeqList gapstats
or $SEQBASE.gaps.tdt
if pre- existing) and then runs the read spanning analysis (runmode=regcheck
) with regcnv=F
. Long read data, given with the reads=FILELIST
and readtype=LIST
options, are mapped onto the assembly using minimap2 to generate a PAF file. This is then parsed and reads spanning each gap are identified based on their positions and the target start and end positions in the PAF file. In addition to absolute spanning of regions, reads spanning a region +/- distances set by checkflanks=LIST
will also be calculated. If the end of a sequence is reached before the end of the read, this will also count as flanking. Such regions can be identified using the MaxFlank5
and MaxFlank3
fields, which identify the maximum distance 5’ and 3’ that a read can span due to sequence length constraints.
Spanning spanid
output is also generated for each gap and saved in $BASEFILE_spanid
. Each gap will be named: seqname.start-end
.
Assembly gap re-assembly [runmode=gapass]
In addition to the gapspan
analysis, reads identified as spanning each gap are extracted and assembled using flye
in a $BASEFILE__gapassemble/
output directory. Only gaps with at least mingapspan=INT
(default 2) reads are re-assembled.
Re-assembled gap-filling [runmode=gapfill]
In addition to the gapspan
and gapass
outputs, re-assembled gap regions are compiled into a single file and then mapped back on the original assembly using Minimap2, with tabulated hit output into $BASEFILE__gapfill/
. Local hits are reduced to unique coverage of the assembly sequences. Gaps are filled if one of the two conditions are met:
- A single local alignment spans an entire gap.
- A pair of concordant local alignments from the same re-assembly contig (same orientation) flank an entire gap.
In the case of a single spanning local alignment, the entire assembly region is replaced by the corresponding re-assembly contig region. For a pair of hits, the region between the two hits is replaced.
Sorted non-redundant assembly cleanup [runmode=sortnr]
The sorted non-redundant assembly cleanup mode (runmode=sortnr
) screens out any sequences that are 100% gap, then removes any sequences that are 100% redundant with other sequences in the input. This includes full and partial matches, i.e. if sequence X is wholly contained within sequence Y then X will be removed.
First, sequences are loaded from the file given with seqin=FILE
and any rje_seqlist filters and sequence sorting are applied to the input. Sequences that are 100% Ns are removed and any gaps exceeding 10 nt are reduced to 10 N
s (NNNNNNNNNN
) to prevent minimap2 from splitting sequences on long gaps. These gap-reduced sequences are output to $BASEFILE.tmp.fasta
and used for an all-by-all minimap2 search.
By default, minimap2 is run with the options to generate a $BASEFILE.tmp.paf
file:
--cs -p 0.0001 -t 4 -x asm20 -N 250
To modify minimap2 search settings, please see the rje_paf documentation.
NOTE: These run options can probably be made more stringent to speed up minimap2 without loss of function. Future releases may alter defaults accordingly.
Minimap2 output is parsed to identify scaffold-scaffold matches. Self-hits are ignored. The minimum (gap-reduced) sequence length is used as a rapid parsing filter: any minimap2 matches that are less than 95% of the query sequence (Length
+nn
fields) or less that 100% identity (Identity
+nn
)/(Length
+nn
) are filtered during parsing.
NOTE: Future releases may feature an option to reduce the global percentage identity cut-off. Please contact the author if you wish to see this implemented.
Minimap2 hits are then processed reverse-sorted by reference sequence size (e.g. scaffold length). Any hits where either sequence has already been filtered are skipped. Otherwise, if the match (as determined by the length of :
regions in the CS string) matches the query length, the Query sequence will be flagged for remove as “identical to” or “contained within” the Hit. (Mutually partial overlapping exact matches are NOT filtered.) Filtered IDs and their matches are output to $BASEFILE.redundant.txt
.
Once all sequences have been filtered, the remaining sequences are output to: $BASEFILE.nr.fasta
.
NOTE: By default, sequences are output in the same order as in the input. To output in reverse size order, add the sortseq=invsize
command to the Diploidocus run command.
Finally, the input and output files are summarised (unless summarise=F
) and statistics output to: $BASEFILE.summarise.tdt
.
Temporary gap-reduced and minimap2 PAF files are deleted unless running in debug
or dev
modes.
Pseudodiploid to primary and alternative haploptigs [runmode=diphap(nr)]
This protocol is based on 10x assemblies made for multiple organisms with supernova v2.0.0 and supernova v2.1.1. In each case, some redundancy in output was discovered (a) within pseudohap output, and (b) in terms of fully homozygous (identical) scaffolds shared by both haplotigs. It was also not entirely clear on what basis a particular haplotype was assigned to pseudohap1 or pseudohap2.
The general workflow therefore sought to remove redundancy, generate a set of primary scaffolds based on scaffold length, and generate a non-redundant set of alternative scaffolds where heterozygosity exists. If diphapnr
mode is used, the full workflow is implemented by first running the sortnr
workflow described above. In the reduced diphap
mode, redundancy is not removed first.
Sequences are loaded and matching haplotigs identified based on their names. Sequence names MUST end HAP(\d+)
, where (\d+)
indicates an integer that matches up haplotigs (as produced by supernova pseudohap2 output, for example). This is not a pipeline to identify haplotig pairs, it is purely for splitting identified haplotigs into primary and alternative assemblies.
Processing itself is quite simple. Haplotig pairs are identified based on matching HAP(\d+)
numbers. Where a single haplotig is found, it is assigned as diploid
, under the assumption that the two haplotigs were identical and one was removed. (It is possible that only one parent had this scaffold, e.g. sex chromosomes, so some post- processing of descriptions may be required.) If two haplotigs with the same number are identified, the longest is assigned to haploidA
and the shorter haploidB
.
The Primary Assemmbly is then compiled from all haploidA
and diploid
sequences. These are given pri
prefixes and output to $BASEFILE.pri.fasta
. The Alternative comprised of all haploidB
sequences is output to $BASEFILE.alt.fasta
. If redundancy has been removed, this will likely be a subset of the full assembly. The combined set of all primary and alternative sequences is output to $BASEFILE.dipnr.fasta
.
NOTE: By default, sequences are output in the same order as in the input. To output in reverse size order, add the sortseq=invsize
command to the Diploidocus run command.
Finally, the input and output files are summarised (unless summarise=F
) and statistics output to $BASEFILE.summarise.tdt
:
$BASEFILE.dipnr.fasta
= Combined pseudodiploid with haploidA
, haploidB
and diploid
annotation.
$BASEFILE.pri.fasta
= Primary assembly with haploidA
and diploid
sequences.
$BASEFILE.alt.fasta
= Alternative assembly with haploidB
sequences.
In silico diploid generator [runmode=insilico]
This module generates balanced “in silico diploid” PacBio subread data from two sequenced haploid parents. Each parent must first be run through SMRTSCAPE to generate subread summary data. (This will be performed if missing. Each parent needs a *.fofn
file of subread file names, *.unique.tdt
unique subreads table and *.smrt.tdt
SMRT cell identifier table.)
A new set of subreads is then generated from the combined set of parent subreads. This is done by first ranking the unique subreads from each parent by length. First, the longest subread from each parent are compared and the shortest selected to be the first subread of the diploid. (The shortest is taken to minimise length differences between the two parents.) Next, the longest subread from the next parent that is no longer than the previous subread is added. This cycles, picking a read from the the parent with fewest cumulative bases each cycle. The longest subread that is no longer than the previous subread is selected. This continues until one parent runs out of subreads. Additional subreads will be added from the other parent if they reduce the difference in cumulative output for each parent, or until lenfilter=X
is reached.
Final output will be a *.LXXXRQXX.fasta
file in which each parent has a similar total sequence content and for which the subread length distributions should also be similar. This is to overcome biases in resulting diploid assemblies, where one parent has higher quality data than the other.
NOTE: If performing downstream filtering by Read Quality (RQ), this might reintroduce a bias if one parent has much higher RQ values than the other. The rqfilter=X
setting can therefore be used to restrict output to reads with a minimum RQ value. By default this is 0.84. If you do not get enough sequence output, this setting may need to be relaxed. Similarly, only sequences above lenfilter=X
in length will be output. These are the figures given in the LXXXRQXX
part of the output file, e.g. defaults of RQ>=0.84 and Len>=500 generates *.L500RQ84.fas
.
© 2019 Richard Edwards | richard.edwards@unsw.edu.au