1 Diploidocus: Diploid genome assembly analysis tools

Diploidocus logo Diploidocus is a sequence analysis toolkit for a number of different analyses related to diploid genome assembly. The main suite of analyses combines long read depth profiles, short read kmer analysis, assembly kmer analysis, BUSCO gene prediction and contaminant screening for a number of assembly tasks including contamination identification, haplotig identification/removal and low quality contig/scaffold trimming/filtering.

In addition, Diploidocus will use mapped long reads and BUSCO single copy read depths for genome size prediction (gensize), and coverage (regcheck) or copy number estimation (regcnv) for user-defined regions. Diploidocus also has functions for removing redundancy (sortnr), generating a non-redundant pseudo-diploid assembly with primary and secondary scaffolds from 10x pseudohap output (diphap), and creating an in silico diploid set of long reads from two haploid parents (for testing phasing etc.) (insilico).

Please note that Diploidocus is still in development and documentation is currently a bit sparse.

The different run modes are set using runmode=X:

  • diploidocus default run mode will run gensize, telomeres, vecscreen, deptrim and purgehap analysis
  • dipcycle runs iterative cycles of diploidocus mode until convergence (no more filtering) is reached
  • gensize uses BUSCO results, a BAM file and read file(s) to predict the genome size of the organism
  • purgehap filters scaffolds based on post-processing of purge_haplotigs
  • telomeres performs a regex telomere search based on method of https://github.com/JanaSperschneider/FindTelomeres
  • vecscreen searches for contaminants and flags/masks/removes identified scaffolds
  • deptrim trims sequence termini of at least mintrim=INT bp with less than deptrim=INT read depth
  • regcheck checks reads spanning given regions and also calculates mean depth and estimated copy number (if regcnv=T)
  • regcnv calculates mean depth and estimated copy number for regcheck regions from BAM file (no read spanning analysis)
  • gapspan is a specialised regcheck analysis that checks for reads spanning genome assembly gaps
  • gapass extends the gapspan mode to extract the reads spanning a gap and re-assemble them using flye
  • gapfill extends the gapass mode to map re-assembled gaps back onto the assembly and replace gaps spanned by the new contigs * sortnr performs an all-by-all mapping with minimap2 and then removes redundancy
  • diphap splits a pseudodiploid assembly into primary and alternative scaffolds
  • diphapnr runs sortnr followed by diphap
  • insilico generates balanced diploid combined reads from two sequenced haploid parents
  • summarise just runs the seqin summarise code and then stops.

See https://slimsuite.github.io/diploidocus/ for details of each mode. General SLiMSuite run documentation can be found at https://github.com/slimsuite/SLiMSuite.

Diploidocus is available as part of SLiMSuite, or via a standalone GitHub repo at https://github.com/slimsuite/diploidocus.

1.1 Citing Diploidocus

The main Diploidocus tidy mode has been published as part of the Waratah genome paper:

Chen SH, Rossetto M, van der Merwe M, Lu-Irving P, Yap JS, Sauquet H, Bourke G, Amos TG, Bragg JG & Edwards RJ (2022). Chromosome-level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long-reads, linked-reads and Hi-C. Molecular Ecology Resources doi: 10.1111/1755-0998.13574

Note that the genome size prediction and copy number estimation modes are now available through DepthSizer and DepthKopy, which should cite the same article. Please contact the author if you have trouble getting the full text version, or read the bioRxiv preprint version:

Chromosome-level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long-reads, linked-reads and Hi-C. bioRxiv 2021.06.02.444084; doi: 10.1101/2021.06.02.444084.

If using the simplified Nala version of the tidy algorithm, please cite: Edwards RJ et al. (2021), BMC Genomics [PMID: 33726677].

If using the 10x genomics non-redundancy pipeline, cite the Starling genome paper:

Stuart KC, Edwards RJ, Cheng Y, Warren WC, Burt DW, Sherwin WB, Hofmeister NR, Werner SJ, Ball GF, Bateson M, Brandley MC, Buchanan KL, Cassey P, Clayton DF, De Meyer T, Meddle SL & Rollins LA (2022): Transcript- and annotation-guided genome assembly of the European starling. Molecular Ecology 22(8):3141-3160. doi: 10.1111/1755-0998.13679) [*Joint first authors]


2 Running Diploidocus

Diploidocus is written in Python 2.x and can be run directly from the commandline:

python $CODEPATH/diploidocus.py [OPTIONS]

If running as part of SLiMSuite, $CODEPATH will be the SLiMSuite tools/ directory. If running from the standalone Diploidocus git repo, $CODEPATH will be the path the to code/ directory. Please see details in the Diploidocus git repo for running on example data.

2.1 Dependencies

For full functionality, Diploidocus needs a number of additional programs installed on the system:

  • Python 2.x and Python 3.x
  • KAT
  • BedTools
  • R
  • purge_haplotigs
  • bbmap
  • BLAST+
  • Minimap2 (added to environment $PATH or given with the minimap2=PROG setting)
  • Samtools

To generate documentation with dochtml, R will need to be installed and a pandoc environment variable must be set, e.g.

export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc

For Diploidocus documentation, run with dochtml=T and read the *.docs.html file generated.

2.2 Commandline options

A list of commandline options can be generated at run-time using the -h or help flags. Please see the general SLiMSuite documentation for details of how to use commandline options, including setting default values with INI files.

### ~ Main Diploidocus run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE      : Input sequence assembly [None]
runmode=X       : Diploidocus run mode (diploidocus/dipcycle/gensize/purgehap/telomeres/vecscreen/regcheck/regcnv/deptrim/sortnr/diphap/diphapnr/insilico) [diploidocus]
basefile=FILE   : Root of output file names [diploidocus or $SEQIN basefile]
summarise=T/F   : Whether to generate and output summary statistics sequence data before and after processing [True]
genomesize=INT  : Haploid genome size (bp) [0]
scdepth=NUM     : Single copy ("diploid") read depth. If zero, will use SC BUSCO mode [0]
bam=FILE        : BAM file of long reads mapped onto assembly [$BASEFILE.bam]
bamcsi=T/F      : Use CSI indexing for BAM files, not BAI (needed for v long scaffolds) [False]
reads=FILELIST  : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST   : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
dochtml=T/F     : Generate HTML Diploidocus documentation (*.docs.html) instead of main run [False]
tmpdir=PATH     : Path for temporary output files during forking (not all modes) [./tmpdir/]
### ~ Genome size prediction options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
busco=TSVFILE   : BUSCO full table [full_table_$BASEFILE.busco.tsv]
readbp=INT      : Total combined read length for depth calculations (over-rides reads=FILELIST) []
quickdepth=T/F  : Whether to use samtools depth in place of mpileup (quicker but underestimates?) [False]
depdensity=T/F  : Whether to use the BUSCO depth density profile in place of modal depth [True]
mapadjust=T/F   : Whether to adjust predicted genome size based on read length:mapping ratio [False]
### ~ DiploidocusHocusPocus and Purge haplotigs options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
kmerreads=FILELIST : File of high quality reads for KAT kmer analysis []
10xtrim=T/F     : Whether to trim 16bp 10x barcodes from Read 1 of Kmer Reads data for KAT analysis [False]
minmedian=INT   : Minimum median depth coverage to avoid low coverage filter [3]
minlen=INT      : Minimum scaffold length to avoid low quality filter [500]
purgehap=X      : Purge_haplotigs method (purgehap/diploidocus) [purgehap]
phlow=INT       : Low depth cutoff for purge_haplotigs (-l X). Will use SCDepth/4 if zero. [0]
phmid=INT       : Middle depth for purge_haplotigs (-m X). Will derive from SCDepth if zero. [0]
phhigh=INT      : High depth cutoff for purge_haplotigs (-h X). Will use SCDepth x 2 if zero. [0]
zeroadjust=T/F  : Add zero coverage bases to purge_haplotigs LowPerc and adjust total [True]
includegaps=T/F : Whether to include gaps in the zero coverage bases for adjustment (see docs) [False]
mingap=INT      : Minimum length of a stretch of N bases to count as a gap for exclusion [10]
purgemode=X     : Rules used for purgehap analysis (simple/complex/nala) [complex]
diploidify=T/F  : Whether to generate alternative diploid output with duplicated diploid contigs and no hpurge [False]
pretrim=T/F     : Run vectrim/vecmask and deptrim trimming prior to diploidocus run [False]
maxcycle=INT    : Restrict run to maximum of INT cycles (0=No limit) [0]
### ~ Depth Trim options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
deptrim=INT     : Trim termini with <X depth [1]
mintrim=INT     : Min length of terminal depth trimming [1000]
### ~ Telomere options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
telofwd=X       : Regex for 5' telomere sequence search [C{2,4}T{1,2}A{1,3}]
telorev=X       : Regex for 5' telomere sequence search [T{1,3}A{1,2}G{2,4}]
telosize=INT    : Size of terminal regions (bp) to scan for telomeric repeats [50]
teloperc=PERC   : Percentage of telomeric region matching telomeric repeat to call as telomere [50]
telonull=T/F    : Whether to output sequences without telomeres to telomere table [False]
### ~ VecScreen options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
screendb=FILE   : File of vectors/contaminants to screen out using blastn and VecScreen rules []
screenmode=X    : Action to take following vecscreen searching (report/purge) [report]
minvechit=INT   : Minimum length for a screendb match [50]
minidhit=INT    : Minimum length for an identical screendb match (based on NCBI filtering) [27]
efdr=NUM        : Expected FDR threshold for VecScreen queries (0 is no filter) [1.0]
vecpurge=PERC   : Remove any scaffolds with >= PERC % vector coverage [50.0]
vectrim=INT     : Trim any vector hits (after any vecpurge) within INT bp of the nearest end of a scaffold [1000]
vecmask=INT     : Mask any vector hits of INT bp or greater (after vecpurge and vectrim) [900]
maskmode=X      : Whether to perform full (all bases) or partial (every other base) masking of hits [partial]
keepnames=T/F   : Whether to keep names unchanged for edited sequences or append 'X' [False]
veccheck=T/F    : Check coverage of filtered contaminant hits using reads=FILELIST data [False]
### ~ Region checking options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
regcheck=FILE   : File of SeqName, Start, End positions for read coverage checking [None]
checkfields=LIST: Fields in checkpos file to give Locus, Start and End for checking (unless GFF) [Hit,SbjStart,SbjEnd]
checkflanks=LIST: List of lengths flanking check regions that must also be spanned by reads [0,100,1000,5000]
spanid=X        : Generate sets of read IDs that span veccheck/regcheck regions, grouped by values of field X []
regcnv=T/F      : Whether to calculate mean depth and predicted CNV of regcheck regions based on SCdepth [True]
gfftype=LIST    : Optional feature types to use if performing regcheck on GFF file (e.g. gene) ['gene']
mingapspan=INT  : Minimum number of reads spanning a gap in order to re-assemble [2]
minlocid=PERC   : Minimum percentage identity for aligned chunk to be kept (local %identity) [0]
minloclen=INT   : Minimum length for aligned chunk to be kept (local hit length in bp) [500]
### ~ SortNR filtering/output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
checkcov=PERC   : Percentage coverage for double-checking partial exact matches [95]
seqout=FILE     : Output sequence assembly [$BASEFILE.nr.fasta]
### ~ In silico diploid input/output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
rqfilter=X      : Minimum RQ for output subreads [0]
lenfilter=X     : Min read length for filtered subreads [500]
parent1=FOFN    : File of file names for subreads fasta files on Parent 1. []
parent2=FOFN    : File of file names for subreads fasta files on Parent 2. []
See also SMRTSCAPE `summarise=T` options if `*.unique.tdt`/`*.smrt.tdt` have not been pre-generated with SMRTSCAPE.
### ~ Advanced/Developmental options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
memperthread=INT: Number of Gb per thread to allocate to samtools etc. [6]
useqsub=T/F     : Whether to use qsub to queue up system calls [False]
qsubvmem=INT    : Memory setting (Gb) when queuing with qsub [126]
qsubwall=INT    : Walltime setting (hours) when queuing with qsub [12]
modules=LIST    : List of modules that needs to be loaded for running with qsub []
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

3 Diploidocus run modes

Details for the main Diploidocus run modes are given below.

NOTE: Diploidocus is under development and documentation might be a bit sparse. Please contact the author or post an issue on GitHub if you have any questions.


3.1 Cycled Diploidocus filtering [runmode=dipcycle]

Diploidocus can be automated to cycle through repeated rounds of the main purge_haplotigs/filtering method until no further scaffolds are removed. Each cycle will run Diploidocus with a numerical suffix, e.g. $BASEFILE.1.*, using the *.diploidocus.fasta output from the previous cycle as input. Cycling will continue until no further scaffolds are filtered into either *.quarantine.fasta or *.junk.fasta.

Output for each cycle will be initially generated in the run directory but then moved to a dipcycle_$BASEFILE directory upon completion.

Final outputs from the final cycle will then be compiled under the original $BASEFILE prefix:

  • $BASEFILE.diploidocus.tdt = Final ratings for the input scaffolds. This is like the single cycle output with an additional Cycle field containing the last cycle this scaffold was processed in.
  • $BASEFILE.ratings.tdt = Reduced final ratings output for the input scaffolds (SeqName,SeqLen,ScreenPerc,Class,Rating,Set,Cycle).
  • $BASEFILE.diploidocus.fasta = the scaffolds kept from the final Diploidocus cycle
  • $BASEFILE.repeats.fasta = the repeat scaffolds excluded from core
  • $BASEFILE.core.fasta = the same set of scaffolds, minus repeats
  • $BASEFILE.quarantine.fasta = concatenated purged scaffolds from all Diploidocus cycles.
  • $BASEFILE.junk.fasta = concatenated low coverage and low quality scaffolds, removed as junk, from all cycles.

NOTE: Contents for these four *.fasta files are summarised in the main log. Individual purge cycles have their own log files in the dipcycle_$BASEFILE directory.

See runmode=diploidocus documentation for more details.

3.1.1 Re-starting aborted runs

If Diploidocus is killed before completion, files will not have been moved to the dipcycle_$BASEFILE directory. Any cycles that have generated a *.diploidocus.fasta file will be skipped until an incomplete cycle is found. In the rare case that Diploidocus was killed whilst generating fasta output, and *.diploidocus.fasta is thus incomplete, further cycles will be messed up. In this case, deleted the relevant *.diploidocus.fasta file before re-running.

If Diploidocus is run with runmode=dipcycle and the dipcycle_$BASEFILE directory is found, it will abort. Rename or delete the directory to re-run Diploidocus with the same $BASEFILE setting, or change basefile=X.


3.2 Main Diploidocus filtering [runmode=diploidocus]

Diploidocus builds on the PurgeHaplotigs classifications to use the depth bins, KAT assembly kmer frequencies and BUSCO results to reclassify scaffolds and partition them into:

  • *.diploidocus.fasta = the scaffolds kept for the next round of PurgeHap
  • *.core.fasta = the same set of scaffolds, minus repeats
  • *.repeats.fasta = the repeat scaffolds excluded from core
  • *.junk.fasta = low coverage scaffolds, removed as junk
  • *.quarantine.fasta = putative haplotigs, removed from the assembly but retained for reference.

Unless summarise=F, summary statistics for sequence data before and after processing will be included in the run log.

In addition, two key output tables are generated:

  • *.diploidocus.tdt = full compiled data and classification. (See field descriptions, below.)
  • *.rating.tdt = reduced data and classification for easier extraction of sequence names with grep etc.

All output files are named with the basefile=X prefix. This is the prefix of the assembly given with seqin=FILE if no basefile=X setting is provided.

3.2.1 Dependencies

Diploidocus needs the following programs installed for full functionality:

  • bbmap
  • blast+
  • kat
  • minimap2
  • purge_haplotigs
  • samtools

3.2.2 Input and assembly processing

The main inputs for Diploidocus rating and filtering are:

  • seqin=FILE : Input sequence assembly to tidy [Required].
  • screendb=FILE : File of vectors/contaminants to screen out using blastn and VecScreen rules [Optional].
  • reads=FILELIST: List of fasta/fastq files containing long reads. Wildcard allowed. Can be gzipped. For a single run (not cycling), a BAM file can be supplied instead with bam=FILE. (This will be preferentially used if found, and defaults to $BASEFILE.bam.) Read types (pb/ont) for each file are set with readtype=LIST, which will be cycled if shorter (default=ont). Optionally, the pre-calculated total read length can be provided with readbp=INT and/or the pre-calculated (haploid) genome size can be provided with genomesize=INT.
  • busco=TSVFILE : BUSCO full table [full_table_$BASEFILE.busco.tsv] used for calculating single copy (“diploid”) read depth. This can be over-ridden by setting scdepth=NUM.
  • kmerreads=FILELIST : File of high quality (i.e. short or error-corrected) reads for KAT kmer analysis [Optional]

If a BAM file is not provided/found, Diploidocus will use minimap2 to generate a BAM file of reads=FILELIST data mapped onto the seqin=FILE assembly. Each read file is mapped separately (--secondary=no -L -ax map-ont or --secondary=no -L -ax map-pb) and converted into a sorted BAM files, before merging the BAM files with samtools and indexing the combined file.

Diploidocus will re-use files where they already exist, providing the downstream files are newer than the upstream files. (If files have been copied and lost their datestamp information, switching ignoredate=T will re-use files regardless.) Setting force=T should force regeneration of files even if they exist.

3.2.2.1 Purge Haplotigs

Diploidocus uses output from purge_haplotigs as on of the core components for its classification system. Running purge_haplotigs is automated, using the single copy (diploid) read depth to set the purge_haplotigs thresholds. If scdepth=NUM is not set, this will be calculated using BUSCO single copy genes (see Genome size prediction [runmode=gensize] for details). The purge_haplotigs mid cutoff (-m X) will be set at the halfway point between the diploid depth (SCDepth) and the haploid depth (SCDepth/2). The low depth (-l X) and high depth (-h X) cutoffs will then be set to SCDepth/4 and SCDepth*2. This can be over-ridden with:

  • phlow=INT : Low depth cutoff for purge_haplotigs (-l X).
  • phmid=INT : Middle depth for purge_haplotigs (-m X).
  • phhigh=INT : High depth cutoff for purge_haplotigs (-h X).

Diploidocus adjusts the purge_haplotigs binning proportions, which do not use the zero coverage bases, adding the zero coverage bases to the low coverage bin. This adjustment can be with zeroadjust=F. By default, gaps are excluded from this adjustment, defined as a run of 10+ N bases. Setting includegaps=T will include these regions in the adjustment, whilst changing the mingap=INT setting will redefine gaps.

3.2.2.2 KAT analysis

Next, the kat sect function of KAT is used to calculate kmer frequencies for (a) raw data using the kmerreads=FILELIST file of high quality (i.e. short or error-corrected) reads, and (b) the assembly itself. This generates four outputs:

  • *.kat-stats.tsv = summary kmerreads kmer coverage per sequence
  • *.kat-counts.cvg = kmerreads kmer coverage per base for each input sequence
  • *.selfkat-stats.tsv = summary assembly kmer coverage per sequence
  • *.selfkat-counts.cvg = assembly kmer coverage per base for each input sequence

If the kmerreads=FILELIST files are 10x chromium reads, the first 16bp of read 1 (the 10x barcode) can be trimmed by setting 10xtrim=T. (This sets --5ptrim 16 in kat for the first read file.)

3.2.2.3 BBMap depth statistics

Read depth statistics for the BAM file are calculated per input sequence using the pileup.sh program of bbmap. This generates two files:

  • *.depth.tdt = read depth stats per sequence
  • *.depths.stats = overall read depth summary, generated by pileup.sh stdout and stderr

3.2.2.4 Vector/contaminant coverage

The runmode=vecscreen contaminant screening using vector/contaminants from screendb=FILE is run on the assembly - see VecScreen mode documentation for details.

3.2.2.5 Telomeres

The runmode=telomeres telomere identifcation screen is run on the assembly - see Telomeres mode documentation for details.

3.2.2.6 BUSCO gene prediction

The full BUSCO results table (busco=FILE) is used to generate counts of Complete, Duplicate and Fragmented BUSCO genes for each input sequence. These are converted into a single BUSCO rating of the dominant (i.e. most abundant) BUSCO class during classification.

3.2.3 Data compilation

Once all the different elements have been run, Diploidocus compiles results per input sequence, extracting a subset of fields of interest and renaming fields where appropriate. The following results files are compiled, joined by input sequence name (SeqName):

  • *.depth.tdt: bbmap pileup.sh read mapping statistics
    • SeqName = Scaffold sequence name
    • SeqLen = Scaffold sequence length (bp)
    • Median_fold = Median long read sequence depth for sequence
    • Avg_fold = Mean long read sequence depth for sequence
    • Covered_percent = Percentage of sequence covered by long reads
    • Covered_bases = Number of bp covered by long reads
    • Plus_reads = Number of reads mapped to positive strand
    • Minus_reads = Number of reads mapped to negative strand
    • Read_GC = GC content of reads mapping to sequence
  • *.purge.coverage_stats.csv: purge_haplotigs coverage stats
    • LowPerc = Proportion of bases with read depth below low coverage cutoff
    • HapPerc = Proportion of bases with read depth between low coverage and mid coverage cutoffs
    • DipPerc = Proportion of bases with read depth between mid coverage and high coverage cutoffs
    • HighPerc = Proportion of bases with read depth above high coverage
  • *.purge.reassignments.tsv: purge_haplotigs best sequence hits and reassignments
    • TopHit = scaffold with biggest sequence coverage match
    • SecHit = scaffold with second biggest sequence coverage match
    • TopHitCov = percentage sequence coverage by TopHit
    • MaxHitCov = combined percentage sequence coverage by all sequence hits
    • PurgeHap = purge_haplotigs reassignment rating
    • TopNum = number of sequences for which this sequence is the TopHit
    • SecNum = number of sequences for which this sequence is the SecHit
  • *.selfkat-stats.tsv: assembly kmer coverage stats
    • SelfMedK = Median assembly kmer frequency for sequence
    • SelfAvgK = Mean assembly kmer frequency for sequence
  • *.kat-stats.tsv: short read kmer coverage stats
    • MedK = Median short read kmer frequency for sequence
    • AvgK = Mean short read kmer frequency for sequence
    • SeqGC = GC content for sequence
    • KPerc = Percentage of non-gap kmers in sequence that are found in short read data
    • NOTE: In the absence of short read data, these values will be set to -1
  • *.screencov.tdt: Contaminant coverage from vecscreen mode
    • ScreenPerc = Total combined sequence coverage by contaminants
  • *.telomeres.tdt: Telomere prediction results
    • Tel5 = Whether a 5’ telomere is predicted
    • Tel3 = Whether a 5’ telomere is predicted
    • Tel5Len = Length in window chunks of 5’ telomere
    • Tel3Len = Length in window chunks of 3’ telomere
    • TelPerc = Percentage of sequence predicted to telomeres. (Crude calculation.)
  • full_table_*.busco.tsv
    • Complete = Number of BUSCO Complete genes in sequence
    • Duplicated = Number of BUSCO Duplicated genes in sequence
    • Fragmented = Number of BUSCO Fragmented genes in sequence
  • $SEQIN: input assembly
    • N_bases = total count of all unresolved (N) bases
    • Gap_bases = total count of all gap (N) bases as determined by mingap=INT
    • SeqDesc = sequence description

3.2.4 Diploidocus classification

Diploidocus classification adds the following fields to the compiled table:

  • Class = Six-part (PURITY|DEPTH|HOM|TOP|MEDK|BUSCO) Diploidocus classifcation (see below)
  • TopClass = Class of TopHit (if any)
  • SecClass = Class of SecHit (if any)

Classification is based on a combination of commandline parameters and hard-coded cutoffs:

  • minmedian=INT : Minimum median depth coverage to avoid low coverage filter [3]
  • pureperc=80 : Depth class percentage to classify as PURE
  • rephitcov=250 : Min MaxHitCov value to classifty as REPEAT

Each sequence is given classification in six different criteria (PURITY|DEPTH|HOM|TOP|MEDK|BUSCO) with two possible suffixes:

  1. PURITY: Purity of dominant read depth class
    • PURE = At least 80% of sequence in that depth bin
    • GOOD = At least 50% of sequence in that depth bin
    • WEAK = Under 50% of sequence in that depth bin
  2. DEPTH: Dominant read depth class
    • LOWX = Median read depth (Median_fold) fails to meet minmedian=INT criterion (default=3)
    • LOW = LowPerc read depth bin has highest percentage coverage (ties assigned to other class)
    • HAP = HapPerc read depth bin has highest percentage coverage (non-DIP ties addigned to HAP)
    • DIP = DipPerc read depth bin has highest percentage coverage (ties assigned to DIP)
    • EXS = HighPerc read depth bin has highest percentage coverage
  3. HOM: Homology/repeat status based on purge_haplotigs hits
    • UNIQ = No TopHit
    • PART = Partial (<50%) coverage of TopHit
    • HAPL = 50%+ TopHit coverage but no SecHit
    • HOMO = TopHit and SecHit but combined MaxHitCov < 250%
    • REPT = TopHit and SecHit and 250%+ combined MaxHitCov
  4. TOP: Top/Sec Hit status for sequence
    • TOP = TopHit for at least one other sequence
    • SEC = Not a TopHit but SecHit at least one other sequence
    • NON = Neither a TopHit nor SecHit for any other sequence
  5. MEDK: Assembly redundancy based on KAT assembly kmers
    • PRI = Over 50% unique kmers (SelfMedK = 1)
    • ALT = Median kmer frequency of two (SelfMedK = 2)
    • REP = Median kmer frequency exceeds two (SelfMedK > 2)
  6. BUSCO: Dominant BUSCO class (can help decide about risk)
    • COMP = 1+ Complete BUSCO genes and more Complete than Duplicated
    • DUPL = 1+ Duplicated BUSCO genes and more Duplicated than Complete
    • FRAG = 1+ Fragmented BUSCO genes and no Complete or Duplicated
    • NONE = No Complete, Duplicated or Fragmented BUSCO genes
  7. +TEL: If any telomeres are detected, +TEL is added
  8. +VEC: If any contamination is detected, +VEC is added

3.2.5 Purge modes [purgemode=X]

Diploidocus rating based on purgemode=X adds the following fields to the compiled table:

  • Rating = Main Diploidocus rating of sequence (see below)
  • TopRating = Rating of TopHit (if any)
  • SecRating = Rating of SecHit (if any)
  • Set = Diploidocus output set (keep/repeat/quarantine/junk)

There are three basic purgemode implementations:

  • purgemode=complex [Default] uses all of the Diploidocus classification data to partition sequences into classes and then partition those classes into the main output sets.
  • purgemode=simple uses a simplified set of Diploidocus classification rules data to partition sequences into classes before partitioning those classes into the main output sets. This is conceptually easier than the complex mode but unlikely to produce as good results.
  • purgemode=crude uses a crude set of Diploidocus classification rules data to partition sequences into classes before partitioning those classes into the main output sets. This is more complicated than simple but still conceptually easier than the complex mode but unlikely to produce as good results.
  • purgemode=nala uses a simplified and conservative set of filtering rules used for the Nala German Shepherd Dog assembly.

Details for these modes can be found below.

Ratings are based on a combination of commandline parameters and hard-coded cutoffs:

  • minlen=INT : Minimum scaffold length to avoid low quality filter [500]
  • mindipcov=20 : Minimim DipPerc coverage to assign a “KEEP” rating (purgemode=crude)
  • artefactcov=80 : Min High/Low coverage to assign as artefacts (purgemode=crude, purgemode=nala)
  • hpurgecov=80 : Min TopHitCov to assign HPURGE rating.
  • covwarning=95 : Read coverage threshold for warnings

3.2.5.1 Complex rating

Diploidocus performs a hierarchical rating of scaffolds, based on their classifications and compiled data:

  1. Initial low quality filter:
    • CONTAMINATION = 50%+ identified contamination (ScreenPerc)
    • LOWCOV = Poor median read coverage (Median_fold < minmedian=INT)
    • LOWQUAL = Scaffolds below the sequence length set by minlen=INT
    • LOWCOV = Low read depth and 50%+ sequence without short read kmers (PURE|LOW|... or GOOD|LOW|..., and MedK=0)
    • LOWCOV = Low read depth and high frequency assembly kmers (...|LOW|...|REP|...)
  2. High quality scaffolds to keep:
    • QUALITY = Highest quality scaffolds: pure duploid, complete BUSCOs, no Duplicated BUSCOs (PURE|DIP|UNIQ|...|PRI|COMP & Duplicated = 0)
    • FINAL = As Quality but PRI|FRAG and PRI|NONE also allowed (PURE|DIP|UNIQ|...|PRI|... & Duplicated = 0)
    • CORE = Predominantly Diploid with <50% covered by TopHit and SelfMedK=1 ([PURE/GOOD]|DIP|[UNIQ/PART]|...|PRI|...)
    • COREHAP = Predominantly haploid but <50% covered by TopHit and 1+ Complete BUSCOs (...|HAP|[UNIQ/PART]|... & Complete > 0)
    • PRIMARY = Putative primary scaffold but with possible alternative scaffolds still in assembly and/or low quality regions (...|DIP|[UNIQ/PART/HAPL/HOMO]|...|[PRI/ALT]|...)
  3. Repetitive scaffolds to keep:
    • PRIRPT = Putative primary scaffold but >50% repeated (...|DIP|[UNIQ/PART]|...|REP|... or [PURE/GOOD]|DIP|...|REP|...)
    • COLLAPSED = High coverage scaffolds representing putative collapsed repeats. (Check for other contamination.) (...|EXS|[UNIQ/PART]|... or [GOOD/PURE]|EXS|...|TOP|..., or WEAK|EXS|... and less of the sequence in the Diploid coverage bin than below Diploid coverage.
    • REPEAT = Predominantly Diploid scaffolds that have major signs of redundancy, probably due to presence of alternative contigs (...|DIP|REPT|... or WEAK|DIP|[HAPL/HOMO]|...|REP|...)
  4. Haploid coverage scaffolds to keep (NOTE: any HAPLOTIG rating will be converted to HAPRPT if ...|REP|...)
    • HAPLOID = Predominantly haploid coverage but enough unique sequence to keep for now? (Option to quarantine?) Might be very heterozygous Alternative haplotigs (...|HAP|[UNIQ/PART]|PRI|...)
    • HAPLOTIG = Predominantly haploid coverage but enough unique sequence to keep for now - possible Alternative haplotig (...|HAP|[UNIQ/PART]|ALT|... or ...|HAP|...|TOP|[PRI/ALT]|...)
    • HAPRPT = Low quality scaffold that is probably most repeats, but not bad enough to dump outright (...|HAP|[UNIQ/PART]|...|REP|... or ...|HAP|...|TOP|...|REP|...)
    • HAPLOTIG = Low/haploid coverage scaffold with insufficient coverage of a Diploid scaffold to purge (...|[LOW/HAP]|[HAPL/HOMO/REPT]|... and |DIP| in TopHit rating and TopHitCov < hpurgecov (80%))
    • HAPLOTIG = Low/haploid coverage scaffold with insufficient coverage of another scaffold to purge ([PURE/GOOD]|HAP|HOMO|...' andTopHitCov<hpurgecov` (80%))
    • HAPRPT = Other haploid coverage scaffold with insufficient coverage of another scaffold to purge (...|HAP|... and TopHitCov < hpurgecov (80%))
    • HAPRPT = Low coverage and collapsed repeats with insufficient coverage of another scaffold to purge (...|EXS|... and less of the sequence in the Diploid coverage bin than below Diploid coverage and TopHitCov < hpurgecov (80%))
    • REPEAT = High coverage scaffold with insufficient coverage of another scaffold to purge ([GOOD/PURE]|EXS|[HAPL/HOMO/REPT]|[SEC/NON]|... and TopHitCov < hpurgecov (80%))
  5. Scaffolds to quarantine for removal
    • RPURGE = Messy scaffolds that are largely repeats and are sufficiently redundant/low quality to purge (WEAK|EXS and less of the sequence in the Diploid coverage bin than below Diploid coverage, or [GOOD/PURE]|EXS|[HAPL/HOMO/REPT]|[SEC/NON], and TopHitCov >= hpurgecov (80%))
  6. Additional low coverage scaffolds
    • LOWCOV = Low depth of coverage and strong matches to other scaffolds, and not TopHit (...|LOW|[HAPL/HOMO/REPT]|[SEC/NON])
  7. Further scaffolds to quarantine for removal
    • HPURGE = Clear candidate haplotig to purge (...|LOW|[HAPL/HOMO/REPT]|... and |DIP| in TopHit rating, or ...|HAP|..., and TopHitCov >= hpurgecov (80%))
  8. Final low quality filter
    • LOWQUAL = Unconvincing scaffolds that do not fall into a clear class but are not bad enough to dump outright as junk (...|LOW|...)

Any scaffolds that escape the above rules will be UNRATED. (There should not be any!)

3.2.5.2 Simple rating

The simplified rating system is based much more on the dominant read depth class. A subset core scaffolds are identified and the rest are classified on the basis of dominant read depth and REP or REPT status.

  • LOWCOV = Poor median read coverage (Median_fold < minmedian=INT)
  • CONTAMINATION = 50%+ identified contamination (ScreenPerc)
  • LOWQUAL = Scaffolds below the sequence length set by minlen=INT
  • LOWQUAL = Low quality artefacts with less than 50% of the scaffold covered by short-read kmers (MedK=0)
  • Keep Scaffolds with Diploid+ depth ([DIP/EXS]), or limited homology to other scaffolds ([UNIQ/PART]) or Top hits for other scaffolds (TOP) or 1+ Complete BUSCO genes. These are broken down into:
    • PRIMARY = Diploid depth or no TopHit or dominant BUSCO rating is Complete, and not repetitive
    • PRIRPT = As PRIMARY but REPT or REP classifications
    • COLLAPSED = Dominant high depth (EXS)
    • REPEAT = Remaining Scaffolds to keep that have REPT or REP classifications
    • HAPLOID = Remaining Scaffolds to keep without REPT or REP classifications
  • HPURGE = Dominant haploid read depth without REPT or REP classifications
  • RPURGE = Dominant haploid read depth, plus REPT or REP classifications
  • LOWCOV = Any remaining scaffolds (dominant LOW read depth)

3.2.5.3 Crude rating

The crude purge mode is a less nuanced version of the main Diploidocus rating.

  • LOWCOV = Poor median read coverage (Median_fold < minmedian=INT)
  • CONTAMINATION = 50%+ identified contamination (ScreenPerc)
  • LOWQUAL = Scaffolds below the sequence length set by minlen=INT
  • HPURGE = Low+Haploid coverage meet artefactcov cutoff (>=80%) and TopHitCov meets hpurgecov cutoff (>=80%)
  • HAPLOTIG = Haploid coverage exceeds 50%, TopHitCov meets hpurgecov cutoff (>=80%), and SelfMedK=2
  • Scaffolds with a Diploid depth coverage otherwise meeting the mindipcov cutoff (>=20%) are kept:
    • PRIMARY = Median assembly kmer frequency of 1
    • REPEAT = MaxHitCov meets the rephitcov cutoff (>=250%)
    • KEEP = Remaining diploid scaffolds.
  • LOWCOV = Any other scaffolds with 80%+ low coverage bases are filtered as Low Coverage.
  • COLLAPSED = Any other scaffolds with 80%+ high coverage bases are flagged as COLLAPSED.
  • HAPLOTIG = Scaffolds that are 50%+ Haploid Depth if mostly present 2+ times (median assembly kmer frequency = 2)
  • HAPLOID = Scaffolds that are 50%+ Haploid Depth if mostly present once (median assembly kmer frequency = 1)
  • ARTEFACT = Scaffolds <50% Haploid Depth but 80%+ Low or Haploif depth
  • HAPLOTIG = Scaffolds with 80%+ hitting other scaffolds but does not meet the rephitcov cutoff (<250%)
  • HAPRPT = Scaffolds with 80%+ hitting other scaffolds and meets the rephitcov cutoff (>=250%)
  • ARTEFACT = Remaining scaffolds with 80%+ bases in the low/haploid coverage bins

3.2.5.4 Nala rating

The purgemode=nala rating scheme was used for the Nala German Shepherd Dog genome assembly, and features a simplified set of ratings:

  • CONTAMINATION = 50%+ identified contamination (ScreenPerc)
  • LOWCOV = Poor median read coverage (Median_fold < minmedian=INT)
  • LOWQUAL = Scaffolds below the sequence length set by minlen=INT
  • HPURGE = Any scaffold with 80%+ bases in the low/haploid coverage bins (haplotigs or assembly artefacts).
  • PRIMARY = Scaffolds with 20%+ diploid coverage are marked as retention as probable diploids.
  • COLLAPSED = Scaffolds with <20% diploid coverage and 50%+ high coverage are marked as probable collapsed repeats.
  • JUNK/HAPLOTIG/KEEP = Remaining Scaffolds are given the PurgeHaplotigs rating (over 80% low/high coverage will be filtered as a probable artefact)

3.2.5.5 Final sequence sets for output

The purgemode ratings are then converted into subsets of scaffolds for output:

  • keep = Scaffolds that form part of the *.core.fasta and *.diploidocus.fasta outputs:
    • [‘QUALITY’,‘FINAL’,‘CORE’,‘COREHAP’,‘PRIMARY’,‘PRIRPT’,‘HAPLOID’,‘HAPLOTIG’,‘KEEP’]
  • repeat = Scaffolds excluded from the *.core.fasta but part of the *.diploidocus.fasta output:
    • [‘COLLAPSED’, ‘REPEAT’, ‘HAPRPT’]
  • quarantine = Scaffolds removed as probably Haplotigs and/or excess copies of repeats but saved to *.quarantine.fasta in case additional checks are required:
    • [‘HPURGE’, ‘RPURGE’]
  • junk = Scaffolds removed due to low coverage, poor quality, or contamination. These are partitioned into *.junk.fasta:
    • [‘LOWCOV’, ‘LOWQUAL’, ‘CONTAMINATION’,‘JUNK’]

The final ratings for each TopHit and SecHit are also added to the main Diploidocus table as TopHitRating and SecHitRating fields. This might flag up additional scaffolds (e.g. with HAPLOTIG ratings) that the user decides to remove despite failing to meet the automated criteria.

NOTE: A cutdown output of Scaffold classifications and ratings is saved as *.ratings.tdt. This can be used to easily extract a subset of sequence names that can be filtered from the final output using SeqSuite, e.g.

python $SLIMSUITE/tools/seqsuite.py seqin=$PREFIX badseq=$(grep HAPLOTIG | awk '{print $1;}' ORS=",") seqout=$PREFIX.filtered.fasta

3.2.5.6 Rating warnings

Once rating is complete, the following warnings are generated in the log file:

  • All CONTAMINATION scaffolds
  • Scaffolds not classed as junk but:
    • <50% covered by short read kmers
    • any vector contamination flagged
    • long read coverage < 95%
  • Any COREHAP/HAPLOID/HAPLOTIG scaffolds with Duplicated BUSCO genes (more evidence for redundancy)
  • Any purged (junk or quarantine) scaffolds that are the PurgeHaplotigs TopHit for 1+ other scaffolds

3.2.6 Full Diploidocus outputs

Temporary files generated during forking will be generated in a subdirectory set by tmpdir=PATH (Default: ./tmpdir/).

Details of full outputs will be added. Please see other documentation sections for specific outputs, or contact the author if anything is unclear.

3.2.7 Special output mode: diploidify

Diploidocus features a special diploidify=T mode, which aims to generate fully diploid output. This is achieved by adding HPURGE scaffolds to the keep set rather than the quarantine set. In addition, diploid scaffolds are duplicated. For this purpose, diploid scaffolds are defined as those in the keep or repeat sets that meet are:

  • Classified as ...|DUP|...
  • or Classified as WEAK|...|NON|..., i.e. are neither a TopHit nor SecHit for any other sequence

Diploid scaffolds are saved with an X2 suffix on their sequence name.


3.3 DepthSizer Genome size prediction [runmode=gensize]

NOTE: This mode is now primarily documented and updated through DepthSizer.

The main inputs for Diploidocus genome size prediction are:

  • seqin=FILE : Input sequence assembly to tidy [Required].
  • reads=FILELIST: List of fasta/fastq files containing long reads. Wildcard allowed. Can be gzipped. For a single run (not cycling), a BAM file can be supplied instead with bam=FILE. (This will be preferentially used if found, and defaults to $BASEFILE.bam.) Read types (pb/ont/hifi) for each file are set with readtype=LIST, which will be cycled if shorter (default=ont). Optionally, the pre-calculated total read length can be provided with readbp=INT and/or the pre-calculated (haploid) genome size can be provided with genomesize=INT.
  • busco=TSVFILE : BUSCO full table [full_table_$BASEFILE.busco.tsv] used for calculating single copy (“diploid”) read depth. This can be over-ridden by setting scdepth=NUM.
  • quickdepth=T/F : Whether to use samtools depth in place of mpileup (quicker but underestimates?) [Default: False]

This works on the principle that Complete BUSCO genes should represent predominantly single copy (diploid read depth) regions along with some ooor quality and/or repeat regions. Assembly artefacts and collapsed repeats etc. are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the region is actually diploid coverage, the modal read depth is expected to represent the actual single copy read depth.

Diploidocus uses samtools mpileup (or samtools depth if quickdepth=T) to calculate the per-base read depth and extracts the modal read depth for each BUSCO gene along with the overall modal read depth for all gene regions. Genome size is then estimated based on a crude calculation using the total combined sequencing length. This will be caculated from reads=FILELIST unless provided with readbp=INT.

BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE (default full_table_$BASEFILE.busco.tsv). This can be replaced with any table with the fields: [‘BuscoID’,‘Status’,‘Contig’,‘Start’,‘End’,‘Score’,‘Length’]. Entries are reduced to those with Status = Complete and the Contig, Start and End fields are used to define the regions that should be predominantly single copy.

NOTE: There is currently no adjustment for contamination, read mapping or imbalanced insertion:deletion ratios etc. As a consequence, the current genome size prediction appears to be an over-estimate.


3.4 Running Purge_haplotigs using BUSCO-guided cutoffs [runmode=purgehap]

Diploidocus can automate the running of purge_haplotigs is automated, using the single copy (diploid) read depth to set the purge_haplotigs thresholds. If scdepth=NUM is not set, this will be calculated using BUSCO single copy genes (see Genome size prediction [runmode=gensize] for details).

The purge_haplotigs mid cutoff (-m X) will be set at the halfway point between the diploid depth (SCDepth) and the haploid depth (SCDepth/2). The low depth (-l X) and high depth (-h X) cutoffs will then be set to SCDepth/4 and SCDepth*2. This can be over-ridden with:

  • phlow=INT : Low depth cutoff for purge_haplotigs (-l X).
  • phmid=INT : Middle depth for purge_haplotigs (-m X).
  • phhigh=INT : High depth cutoff for purge_haplotigs (-h X).

Output from this run is:

  • *.purge.coverage_stats.csv: purge_haplotigs coverage stats
  • *.purge.reassignments.tsv: purge_haplotigs best sequence hits and reassignments

Additional purge_haplotigs files and output will be saved in a subdirectory:

  • purge_*/

This is to stop multiple purge_haplotigs runs from interfering with each other.


3.5 Telomere finding [runmode=telomere]

Diploidocus has implemented a regex-based search for Telomeres, based on the code at https://github.com/JanaSperschneider/FindTelomeres. This looks for a canonical telomere motif of TTAGGG/CCCTAA, allowing for some variation. For each sequence, Diploidocus trims off any trailing Ns and then searches for telomere-like sequences at sequence ends. For each sequence, the presence/absence and length of trimming are reported for the 5’ end (tel5 and trim5) and 3’ end (tel3 and trim3), along with the total percentage telomeric sequence (TelPerc).

By default, Diploidocus searches for a forward telomere regex sequence of C{2,4}T{1,2}A{1,3} at the 5’ end, and a reverse sequence at the 3’ end of T{1,3}A{1,2}G{2,4}. These can be set with telofwd=X and telorev=X. Telomeres are marked if at least 50% (teloperc=PERC) of the terminal 50 bp (telosize=INT) matches the appropriate regex. If either end contains a telomere, the total percentage of the sequence matching either regex is calculated as TelPerc. Note that this number neither restricts matches to the termini, not includes sequences within predicted telomeres that do not match the regex.

By default, only sequences with telomeres are output to the *.telomeres.tdt output, but switching telonull=T will output all sequences.


3.6 Vector/contamination screening [runmode=vecscreen]

This mode screens scaffolds for possible contaminants, given by screendb=FILE.

3.6.1 VecScreen overview

First, a blastn search of the screendb=FILE sequences is performed, using the NCBI VecScreen search and match strategy (see below). These parameters can be modified using blastopt=X and/or optionfile=FILE, which will be appended to the run command. Alternatively, the $BASEFILE.vecscreen.blast file produced can be pre-made with different parameters (if force=F).

Results are then parsed into a local hits table and rated according to the strength of a match. This is performed iteratively, re-assigning internal matches as “proximal” if they are withing 25 bases of another match (of any strength) and re-rating the strength of the match, until no ratings changes occur. Two additional fields are added to the local hits table during this process: MatchPos and MatchStr. Once all assignments have been made, segments of the assembly between two matches and/or sequence ends are added to the table as Suspect.

MatchPos will have a value of:

  • Terminal = within 25 bp of either end of the sequence.
  • Proximal = within 25 bp of a vecsreen match (Weak, Moderate or Strong).
  • Internal = over 25 bp from a sequence end or vecsreen match.
  • Suspect = Segments added as Suspect.

MatchStr will have a value of:

  • Strong = Terminal/Proximal match with Score >= 24, or Internal match with Score >= 30.
  • Moderate = Terminal/Proximal match with Score 19 to 23, or Internal match with Score 25 to 29.
  • Weak = Terminal/Proximal match with Score 16 to 18, or Internal match with Score 23 to 24.
  • Suspect = Any segment of fewer than 50 bases between two vector matches or between a match and an end.

These parameters seem to be designed for small (Illumina?) assemblies, and give excessive false positives for larger genomes. Diploidocus therefore features two additional contaminant identification filters:

  • eFDR = The “Expected False Discovery Rateis calculated for each contaminant sequence. This is simply theExpectvalue for that hit, divided by the total number of hits with thatExpectvalue (or lower). Ifefdr=Xis set, any hits with aneFDRvalue exceeding the threshold will be removed. By default, this is set to1.0, i.e. any contaminants with fewer assembly hits than theirExpect` value will be dropped.
  • Minimum hit length = Short hits are inevitable in large assemblies and unlikely to be real for long read assemblies (e.g. without cloning etc.). An additional minvechit=X setting will remove any really short hits. This is set by default to minvechit=50, meaning that a hit of at least 50 bp is required.

Finally, the percentage coverage per scaffold is calculated from the filtered hits. This is performed first for each contaminant individually, before being collapsed into total contamination coverage per query.

Results are output into two main delimited results files:

  • *.vecscreen.tdt = Contaminant local hit table with VecScreen ratings and eFDR calculation. (Unfiltered)
  • *.screencov.tdt = Query coverage per contaminant
    • Query = Contaminant name. For total coverage, this will be TOTAL.
    • Hit = Scaffold name.
    • HitLen = Length of scaffold (bp).
    • Coverage = Covered length of scaffold (bp).
    • CovPC = Percentage coverage of scaffold (0-100).

3.6.2 VecScreen BLAST+ parameters (from NCBI website)

The VecScreen BLAST+ parameters are pre-set using blastn options: -task blastn -reward 1 -penalty -5 -gapopen 3 -gapextend 3 -dust yes -soft_masking true -evalue 700 -searchsp 1750000000000

VecScreen Match Categories Vector contamination usually occurs at the beginning or end of a sequence; therefore, different criteria are applied for terminal and internal matches. VecScreen considers a match to be terminal if it starts within 25 bases of the beginning of the query sequence or stops within 25 bases of the end of the sequence. Matches that start or stop within 25 bases of another match are also treated like terminal matches. Matches are categorized according to the expected frequency of an alignment with the same score occurring between random sequences.

Strong Match to Vector (Expect 1 random match in 1,000,000 queries of length 350 kb.) Terminal match with Score >= 24. Internal match with Score >= 30.

Moderate Match to Vector (Expect 1 random match in 1,000 queries of length 350 kb.) Terminal match with Score 19 to 23. Internal match with Score 25 to 29.

Weak Match to Vector (Expect 1 random match in 40 queries of length 350 kb.) Terminal match with Score 16 to 18. Internal match with Score 23 to 24.

Segment of Suspect Origin Any segment of fewer than 50 bases between two vector matches or between a match and an end.

3.6.3 Vector coverage checking

If veccheck=T then long reads given by reads=FILELIST readtype=LIST will be mapped onto the assembly using minimap2 and complete coverage of each hit reported. This will report the number of reads that span the entrie contaminant hit, plus flanking distances each side set by checkflanks=LIST (default: 0,100,1000,5000). If the flanking distance (bp) or the end of the sequence is reached before the end of the read in both directions, that read will be added to the coverage count. Because this can inflate apparent flanking coverage for hits near the end of a sequence, the distance from each end of the sequence is also returned as MaxFlank5 and MaxFlank3.


3.7 Depth trimming [runmode=deptrim]

Depth trimming (deptrim) mode trims sequence termini of at least mintrim=INT bp with less than deptrim=INT read depth. First, samtools mpileup or depth (depmethod=X) is used to find the first and last positions that exceed deptrim=INT. If no positions meet this criterio, the entire sequence will removed. Otherwise, either terminus that exceeds mintrim=INT base pairs of insufficent read depth are trimmed.


3.8 Assembly region read-spanning and copy number analysis [runmode=regcheck/regcnv]

Region checking, whether for read spanning analysis (runmode=regcheck) or copy number analysis (runmode=regcnv or runmode=regcheck regcnv=T), analyses regions extracted from a delimited file given by: regcheck=FILE. This can be a GFF file, in which case gfftype=LIST will restrict analysis to specific feature types, or regions can be defined by checkfields=LIST, which defines the locus, start and end positions. These fields default to SeqName, Start and End fields. If these fields cannot be found, the first three fields of the regcheck=FILE file will be used.

3.8.1 Region read-spanning analysis [runmode=regcheck]

Long read data, given with the reads=FILELIST and readtype=LIST options, are mapped onto the assembly (seqin=FILE) using minimap2 to generate a PAF file. This is then parsed and reads spanning each feature based on their positions and the target start and end positions in the PAF file. In addition to absolute spanning of regions, reads spanning a region +/- distances set by checkflanks=LIST will also be calculated. If the end of a sequence is reached before the end of the read, this will also count as flanking. Such regions can be identified using the MaxFlank5 and MaxFlank3 fields, which identify the maximum distance 5’ and 3’ that a read can span due to sequence length constraints.

If regcnv=T then the region copy number analysis (below) will also be performed.

NOTE: Depth calculations are performed in parallel in the directory set with tmpdir=PATH. The number of parallel calculations is set by forks=INT.


3.8.2 Region copy number analysis [runmode=regcnv]

Copy number analysis uses the same single copy depth profile analysis as the DepthSizer (Diploidocus runmode=gensize) genome size prediction. In short, the modal read depth of BUSCO single copy Complete genes is calculated using samtools mpileup (or samtools depth if quickdepth=T) and used to defined “single copy read depth”. BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE (default full_table_$BASEFILE.busco.tsv). This can be replaced with any table with the fields: [‘BuscoID’,‘Status’,‘Contig’,‘Start’,‘End’,‘Score’,‘Length’]. Entries are reduced to those with Status = Complete and the Contig, Start and End fields are used to define the regions that should be predominantly single copy.

Single-copy read depth can also be set using scdepth=NUM to re-used or over-ride previously calculated values.

Long read data, given with the reads=FILELIST and readtype=LIST options, are then mapped onto the assembly (seqin=FILE) using minimap2. This can be skipped by providing a BAM file with bam=FILE. For each regcheck feature, the same samtools depth calculation is performed as for the BUSCO data. The mean depth across the region is then divided by the single copy depth to estimate total copy number across the region. Note that unlike the single copy depth estimation itself, this might be biased by repeat sequences etc. that have a different depth profile to the main region. One way to control for this might be to restrict analysis to a subset of reads that meet a certain minimum length cutoff, e.g. 10kb.

Query-based CNV analysis. If the regcheck=FILE file has additional Qry, QryLen, QryStart and QryEnd fields, the copy number analysis will have an additional query-focus. In this case, each region mapping onto a specific query is summed up, adjusted for the proportion of the query covered by that region. For example, 3.5X mean depth of a 100% length copy and 3.0X coverage of a 50% length copy would sum to (3.5x1.0 + 3x0.5 = 5 total copies). If these fields are not present, each region will be analysed independently.

NOTE: Depth calculations are performed in parallel in the directory set with tmpdir=PATH. The number of parallel calculations is set by forks=INT.


3.9 GapSpanner functions [runmode=gapspan/gapass/gapfill]

NOTE: These modes are now primarily documented and updated through GapSpanner.

3.9.1 Assembly gap read-spanning analysis [runmode=gapspan]

This mode first identifies all the gaps in an assembly (seqin=FILE) (using SeqList gapstats or $SEQBASE.gaps.tdt if pre- existing) and then runs the read spanning analysis (runmode=regcheck) with regcnv=F. Long read data, given with the reads=FILELIST and readtype=LIST options, are mapped onto the assembly using minimap2 to generate a PAF file. This is then parsed and reads spanning each gap are identified based on their positions and the target start and end positions in the PAF file. In addition to absolute spanning of regions, reads spanning a region +/- distances set by checkflanks=LIST will also be calculated. If the end of a sequence is reached before the end of the read, this will also count as flanking. Such regions can be identified using the MaxFlank5 and MaxFlank3 fields, which identify the maximum distance 5’ and 3’ that a read can span due to sequence length constraints.

Spanning spanid output is also generated for each gap and saved in $BASEFILE_spanid. Each gap will be named: seqname.start-end.


3.9.2 Assembly gap re-assembly [runmode=gapass]

In addition to the gapspan analysis, reads identified as spanning each gap are extracted and assembled using flye in a $BASEFILE__gapassemble/ output directory. Only gaps with at least mingapspan=INT (default 2) reads are re-assembled.


3.9.3 Re-assembled gap-filling [runmode=gapfill]

In addition to the gapspan and gapass outputs, re-assembled gap regions are compiled into a single file and then mapped back on the original assembly using Minimap2, with tabulated hit output into $BASEFILE__gapfill/. Local hits are reduced to unique coverage of the assembly sequences. Gaps are filled if one of the two conditions are met:

  1. A single local alignment spans an entire gap.
  2. A pair of concordant local alignments from the same re-assembly contig (same orientation) flank an entire gap.

In the case of a single spanning local alignment, the entire assembly region is replaced by the corresponding re-assembly contig region. For a pair of hits, the region between the two hits is replaced.


3.10 Sorted non-redundant assembly cleanup [runmode=sortnr]

The sorted non-redundant assembly cleanup mode (runmode=sortnr) screens out any sequences that are 100% gap, then removes any sequences that are 100% redundant with other sequences in the input. This includes full and partial matches, i.e. if sequence X is wholly contained within sequence Y then X will be removed.

First, sequences are loaded from the file given with seqin=FILE and any rje_seqlist filters and sequence sorting are applied to the input. Sequences that are 100% Ns are removed and any gaps exceeding 10 nt are reduced to 10 Ns (NNNNNNNNNN) to prevent minimap2 from splitting sequences on long gaps. These gap-reduced sequences are output to $BASEFILE.tmp.fasta and used for an all-by-all minimap2 search.

By default, minimap2 is run with the options to generate a $BASEFILE.tmp.paf file:

--cs -p 0.0001 -t 4 -x asm20 -N 250

To modify minimap2 search settings, please see the rje_paf documentation.

NOTE: These run options can probably be made more stringent to speed up minimap2 without loss of function. Future releases may alter defaults accordingly.

Minimap2 output is parsed to identify scaffold-scaffold matches. Self-hits are ignored. The minimum (gap-reduced) sequence length is used as a rapid parsing filter: any minimap2 matches that are less than 95% of the query sequence (Length+nn fields) or less that 100% identity (Identity+nn)/(Length+nn) are filtered during parsing.

NOTE: Future releases may feature an option to reduce the global percentage identity cut-off. Please contact the author if you wish to see this implemented.

Minimap2 hits are then processed reverse-sorted by reference sequence size (e.g. scaffold length). Any hits where either sequence has already been filtered are skipped. Otherwise, if the match (as determined by the length of : regions in the CS string) matches the query length, the Query sequence will be flagged for remove as “identical to” or “contained within” the Hit. (Mutually partial overlapping exact matches are NOT filtered.) Filtered IDs and their matches are output to $BASEFILE.redundant.txt.

Once all sequences have been filtered, the remaining sequences are output to: $BASEFILE.nr.fasta.

NOTE: By default, sequences are output in the same order as in the input. To output in reverse size order, add the sortseq=invsize command to the Diploidocus run command.

Finally, the input and output files are summarised (unless summarise=F) and statistics output to: $BASEFILE.summarise.tdt.

Temporary gap-reduced and minimap2 PAF files are deleted unless running in debug or dev modes.


3.11 Pseudodiploid to primary and alternative haploptigs [runmode=diphap(nr)]

This protocol is based on 10x assemblies made for multiple organisms with supernova v2.0.0 and supernova v2.1.1. In each case, some redundancy in output was discovered (a) within pseudohap output, and (b) in terms of fully homozygous (identical) scaffolds shared by both haplotigs. It was also not entirely clear on what basis a particular haplotype was assigned to pseudohap1 or pseudohap2.

The general workflow therefore sought to remove redundancy, generate a set of primary scaffolds based on scaffold length, and generate a non-redundant set of alternative scaffolds where heterozygosity exists. If diphapnr mode is used, the full workflow is implemented by first running the sortnr workflow described above. In the reduced diphap mode, redundancy is not removed first.

Sequences are loaded and matching haplotigs identified based on their names. Sequence names MUST end HAP(\d+), where (\d+) indicates an integer that matches up haplotigs (as produced by supernova pseudohap2 output, for example). This is not a pipeline to identify haplotig pairs, it is purely for splitting identified haplotigs into primary and alternative assemblies.

Processing itself is quite simple. Haplotig pairs are identified based on matching HAP(\d+) numbers. Where a single haplotig is found, it is assigned as diploid, under the assumption that the two haplotigs were identical and one was removed. (It is possible that only one parent had this scaffold, e.g. sex chromosomes, so some post- processing of descriptions may be required.) If two haplotigs with the same number are identified, the longest is assigned to haploidA and the shorter haploidB.

The Primary Assemmbly is then compiled from all haploidA and diploid sequences. These are given pri prefixes and output to $BASEFILE.pri.fasta. The Alternative comprised of all haploidB sequences is output to $BASEFILE.alt.fasta. If redundancy has been removed, this will likely be a subset of the full assembly. The combined set of all primary and alternative sequences is output to $BASEFILE.dipnr.fasta.

NOTE: By default, sequences are output in the same order as in the input. To output in reverse size order, add the sortseq=invsize command to the Diploidocus run command.

Finally, the input and output files are summarised (unless summarise=F) and statistics output to $BASEFILE.summarise.tdt:

  • $BASEFILE.dipnr.fasta = Combined pseudodiploid with haploidA, haploidB and diploid annotation.
  • $BASEFILE.pri.fasta = Primary assembly with haploidA and diploid sequences.
  • $BASEFILE.alt.fasta = Alternative assembly with haploidB sequences.

3.12 In silico diploid generator [runmode=insilico]

This module generates balanced “in silico diploid” PacBio subread data from two sequenced haploid parents. Each parent must first be run through SMRTSCAPE to generate subread summary data. (This will be performed if missing. Each parent needs a *.fofn file of subread file names, *.unique.tdt unique subreads table and *.smrt.tdt SMRT cell identifier table.)

A new set of subreads is then generated from the combined set of parent subreads. This is done by first ranking the unique subreads from each parent by length. First, the longest subread from each parent are compared and the shortest selected to be the first subread of the diploid. (The shortest is taken to minimise length differences between the two parents.) Next, the longest subread from the next parent that is no longer than the previous subread is added. This cycles, picking a read from the the parent with fewest cumulative bases each cycle. The longest subread that is no longer than the previous subread is selected. This continues until one parent runs out of subreads. Additional subreads will be added from the other parent if they reduce the difference in cumulative output for each parent, or until lenfilter=X is reached.

Final output will be a *.LXXXRQXX.fasta file in which each parent has a similar total sequence content and for which the subread length distributions should also be similar. This is to overcome biases in resulting diploid assemblies, where one parent has higher quality data than the other.

NOTE: If performing downstream filtering by Read Quality (RQ), this might reintroduce a bias if one parent has much higher RQ values than the other. The rqfilter=X setting can therefore be used to restrict output to reads with a minimum RQ value. By default this is 0.84. If you do not get enough sequence output, this setting may need to be relaxed. Similarly, only sequences above lenfilter=X in length will be output. These are the figures given in the LXXXRQXX part of the output file, e.g. defaults of RQ>=0.84 and Len>=500 generates *.L500RQ84.fas.


© 2019 Richard Edwards | richard.edwards@unsw.edu.au