PAFScaff is designed for mapping genome assembly scaffolds to a closely-related chromosome-level reference genome assembly. It uses (or runs) Minimap2 to perform an efficient (if rough) all- against-all mapping, then parses the output to assign assembly scaffolds to reference chromosomes.
Mapping is based on minimap2-aligned assembly scaffold (“Query”)
coverage against the reference chromosomes. Scaffolds are “placed” on
the reference scaffold with most coverage. Any scaffolds failing to map
onto any chromosome are rated as “Unplaced”. For each reference
chromosome, PAFScaff then “anchors” placed assembly scaffolds starting
with the longest assembly scaffold. Each placed scaffold is then
assessed in order of decreasing scaffold length. Any scaffolds that do
not overlap with already anchored scaffolds in terms of the Reference
chromosome positions they map onto are also considered “Anchored”. if
newprefix=X
is set, scaffolds are renamed with the
Reference chromosome they match onto. The original scaffold name and
mapping details are included in the description. Unplaced scaffolds are
not renamed.
Finally, Anchored scaffolds are super-scaffolded by inserting gaps of
NnNnNnNnNn
sequence between anchored scaffolds. The lengths
of these gaps are determined by the space between the reference
positions, modified by overhanging query scaffold regions (min. length
10). The alternating case of these gaps makes them easy to identify
later.
PAFScaff outputs renamed, sorted and reoriented scaffolds in fasta format, along with mapping details:
*.anchored.fasta
, *.placed.fasta
and
*.unplaced.fasta
contain the relevant subsets of assembly
scaffolds, renamed and/or reverse-complemented if appropriate.*.scaffolds.fasta
contains the super-scaffolded
anchored scaffolds.*.scaffolds.tdt
contains the details of the PAFScaff
mapping of scaffolds to chromosomes.*.log
contains run details, including any warnings or
errors encountered.NOTE: The precise ordering, orientation and naming
of the output scaffolds depends on the settings for:
refprefix=X newprefix=X sorted=T/F revcomp=T/F
. See main
documentation (below) for details.
The main minimap-based PAFScaff approach has been published as part of the German Shepherd Dog genome paper:
Field MA, Rosen BD, Dudchenko O, Chan EKF, Minoche AM, Edwards RJ, Barton K, Lyons RJ, Enosi Tuipulotu D, Hayes VM, Omer AD, Colaric Z, Keilwagen J, Skvortsova K, Bogdanovic O, Smith MA, Lieberman Aiden E, Smith TPL, Zammit RA & Ballard JWO (2020): Canfam_GSD: De novo chromosome-length genome assembly of the German Shepherd Dog (Canis lupus familiaris) using a combination of long reads, optical mapping, and Hi-C. GigaScience 9(4):giaa027. doi: 10.1093/gigascience/giaa027
PAFScaff is written in Python 2.x and can be run directly from the commandline:
python $CODEPATH/pafscaff.py [OPTIONS]
If running as part of SLiMSuite,
$CODEPATH
will be the SLiMSuite tools/
directory. If running from the standalone PAFScaff git repo,
$CODEPATH
will be the path the to code/
directory. Please see details in the PAFScaff git repo for
running on example data.
For mapping prior to parsing, minimap2 must be installed
and either added to the environment $PATH
or given to
PAFScaff with the minimap2=PROG
setting.
A list of commandline options can be generated at run-time using the
-h
or help
flags. Please see the general SLiMSuite
documentation for details of how to use commandline options,
including setting default values with INI files.
### ~ Input/Output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
pafin=PAFFILE : PAF generated from $REFERENCE $ASSEMBLY mapping; or run minimap2, or use busco [minimap2]
basefile=STR : Base for file outputs [PAFIN basefile]
seqin=FASFILE : Input genome assembly to map/scaffold onto $REFERENCE (minimap2 $ASSEMBLY) []
reference=FILE : Fasta (with accession numbers matching Locus IDs) ($REFERENCE) []
assembly=FASFILE: As seqin=FASFILE
busco=TSVFILE : BUSCO v5 full table (pafin=busco) [full_table_$BASEFILE.busco.tsv]
refbusco=TSVFILE: Reference BUSCO v5 full table [full_table_$REFBASE.busco.tsv]
refprefix=X : Reference chromosome prefix. If None, will use all $REFERENCE scaffolds [None]
newprefix=X : Assembly chromosome prefix. If None, will not rename $ASSEMBLY scaffolds [None]
unplaced=X : Unplaced scaffold prefix. If None, will not rename unplaced $ASSEMBLY scaffolds [None]
ctgprefix=X : Unplaced contig prefix. Replaces unplaced=X when 0 gaps. [None]
purechrom=T/F : Whetheer to always output the first hit to any chromosome without the numerical suffix [False]
sorted=X : Criterion for $ASSEMBLY scaffold sorting (QryLen/Coverage/RefStart/None) [QryLen]
minmap=PERC : Minimum percentage mapping to a chromosome for assignment [0.0]
minpurity=PERC : Minimum percentage "purity" for assignment to Ref chromosome [50.0]
revcomp=T/F : Whether to reverse complement relevant scaffolds to maximise concordance [True]
scaffold=T/F : Whether to "anchor" non-overlapping scaffolds by Coverage and then scaffold [True]
dochtml=T/F : Generate HTML PAFScaff documentation (*.info.html) instead of main run [False]
pagsat=T/F : Whether to output sequence names in special PAGSAT-compatible format [False]
newchr=X : Prefix for short PAGSAT sequence identifiers [ctg]
spcode=X : Species code for renaming assembly sequences in PAGSAT mode [PAFSCAFF]
### ~ Mapping/Classification options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
minimap2=PROG : Full path to run minimap2 [minimap2]
mmsecnum=INT : Max. number of secondary alignments to keep (minimap2 -N) [0]
mmpcut=NUM : Minimap2 Minimal secondary-to-primary score ratio to output secondary mappings (minimap2 -p) [0]
mapopt=CDICT : Dictionary of additional minimap2 options to apply (caution: over-rides conflicting settings) []
purebusco=T/F : Whether to keep BUSCO genes separate rather than generating synteny blocks [False]
### ~ Processing options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X : Sleep time (seconds) between cycles of forking out more process [0]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
Details of reference genome naming requirements to be added.
Chromosomes should start with the prefix given by
refprefix=X
.
The PAF file is generated by running Minimap2 without secondary
alignments. See rje_paf.py
for details.
The PAF file is first parsed and the following fields extracted:
If the refprefix=X
setting was provided, hits are then
cleaned up to (a) strip the prefix from chromosome names and convert
numerical chromosomes to numbers for proper sorting, and (b) drop any
unplaced reference scaffolds. If no refprefix
is given, no
reference scaffold renaming/filtering will be performed.
For each hit, Coverage
is then calculated as
QryEnd
-QryStart
+1. This Coverage
stat is ultimately used for ranking the assignments of assembly
scaffolds to reference chromosomes/scaffolds. For each Qry/Ref/Strand
combination, the Coverage
, Identity
and
Length
statistics are summed, and a new field
N
added with the number of individual alignments. The top
ranked reference strand is selected for each query, and two more fields
added:
Inv
= Query coverage on the opposite strand to the best
reference strandTrans
= Summed query coverage for all other reference
chromosomes/scaffolds.Together, these are used to establish the total percentage of query coverage that is scaffolded, versus mapping to a different reference sequence. These are converted into:
Purity
= 100 * (Coverage
+
Inv
) / (Coverage
+ Inv
+
Trans
)v0.4.0 has introduced a couple of additional parameters than can be
used to increase the stringency of any mapping. This is mainly for the
purpose of reducing situtations where highly repetitive multi-mapping
sequences are assigned to a single chromosome. By default,
minpurity=50.0
, meaning that at least half of the
chromosome mapping should be to the main reference chromosome.
Once queries have been assigned to reference scaffolds, they are then
ordered according to the reference scaffold and start position of the
match to that scaffold. Queries are also sorted for the purposes of
renaming, using the sorted=X
statistic. By default, this is
query length, but could be switched to Coverage
or
RefStart
.
NOTE: This ordering is quite crude and assumes the reference start position is part of the main mapping block. If the query has a fragment mapping to an upstream repeat sequence, for example, this ordering could be messed up. Future releases will check and try to fix this. Visualisation of the mapping is recommended.
Scaffolds mapped to a reference chromosome are designated
Placed
. Anchored
scaffolds are a special
subset of Placed
scaffolds, which can be combined into a
super-scaffold. Any scaffolds failing to map onto a scaffold will be
designated Unplaced
.
Anchor
assignment is only performed if
scaffold=T
. Assembly scaffolds are processed in decreasing
size and assigned to the Anchor
set if either (a) no
previous scaffold has been assigned to the mapped reference chromosome,
or (b) the scaffold mapping positions on the reference chromosome do not
overlap with any previous Anchor
scaffold for that
chromosome. As with the ordering (above), this scaffolding will get
messed up if the earlier mapped scaffolds have fragments a great
distance up or downstream of the main mapped region.
If scaffold=T
, the
’Anchorsequences for each reference chromosome will be scaffolded by concatenating them in reference position order, reverse-complementing where the main mapping was to the negative strand. Gaps will be inserted between scaffolds, consisting of a stretch of
Nn`
repeats that matches the size of the gap between reference positions
(e.g. the end of the one match and the start of the next), with a
minimum gap of 10 nt.
NOTE: This explicitly expects chromosome-level reference scaffolds. It will not scaffold the reference using the assembly to maximise the overall scaffolding. It is purely for chromosome assignment and then additional scaffolding based on assumed synteny.
If newprefix=X
is given, the sequence name will be
replaced with the prefix and reference chromosome identifier (parsed
using refprefix=X
). If multiple assembly sequences map to
the same reference chromosome, these will be numbered .1
,
.2
.. .N
. (NOTE: these numbers will have zero
prefixes to make them sort correctly, e.g. 10+ sequences will start
.01
, 100+ sequences will start .001
etc.
Sequence descriptions take the form:
SEQNAME len=SEQLEN COV% REF(STRAND) START:END; [INV% REF(INVSTRAND);] [OTHER% other;]
where:
SEQNAME
= original assembly sequence name.SEQLEN
= assembly sequence length.COV%
= percentage of query mapping onto reference
strand.REF(STRAND)
= reference chromosome and strand. If
mapped to the -ve strand, the description will have a
RevComp
prefix.START:END
= start and end positions on reference
chromosome of query mapping.INV% REF(INVSTRAND)
= percentage mapped onto inverse
strand of reference chromosome. (Omitted if 0%.)OTHER%
= percentage mapped onto other reference
chromosome. (Omitted if 0%.)Unplaced sequences will either not be renamed, or will have a
different prefix if unplaced=X
is set. If
ctgprefix=X
is also set, any unplaced sequences without
gaps will have a different prefix set by ctgprefix=X
.
Assembly mapping an scaffolding details will be save to
$BASEFILE.scaffolds.tdt
.
Sequences will be output to:
$BASEFILE.anchored.fasta
$BASEFILE.placed.fasta
$BASEFILE.unplaced.fasta
If scaffold=T
, the scaffolded assembly will be saved to
$BASEFILE.scaffolded.fasta
and the sequences summaried in
$BASEFILE.scaffolded.tdt
.
© 2019 Richard Edwards | richard.edwards@unsw.edu.au