Telociraptor is a genome assembly editing tool that combined the standalone telomere prediction functions from Diploidocus with some simple genome assembly scaffolding functions that were originally packaged up as “GenomeTweak”. Telociraptor predicts telomeres at the end of assembly scaffolds, where (by default) at least 50% of the terminal 50 bp matches the telomere regex provided. This approach is based on FindTelomeres by Jana Sperschneider. In addition, Telociraptor will generate a simple text assembly map (see below) with contig positions and gap lengths. By default, Telomere predictions are also made for each contig, and internal telomeres marked in the map. Finally, Telociraptor will try to fix scaffolding errors based on telomere predictions, either inverting or trimming the ends of scaffolds (see below). Alternatively, the assembly map can be manually edited/generated and used to create a new fasta file from an existing assembly. As an extra bonus, Telociraptor will generate the “telonull” telomere output and assembly gaps table used as input by ChromSyn.
Telociraptor performs a regex-based search for Telomeres, based on FindTelomeres.
By default, this looks for a canonical telomere motif of TTAGGG/CCCTAA,
allowing for some variation. Telociraptor searches for a forward
telomere regex sequence of C{2,4}T{1,2}A{1,3} at the 5’
end, and a reverse sequence at the 3’ end of
T{1,3}A{1,2}G{2,4}. These can be set with
telofwd=X and telorev=X. For each sequence,
Telociraptor trims off any trailing Ns and then searches for
telomere-like sequences at sequence ends. For each sequence, the
presence/absence and length of trimming are reported for the 5’ end
(tel5 and trim5) and 3’ end (tel3 and trim3), along with the total
percentage telomeric sequence (TelPerc).
Telomeres are marked if at least 50% (teloperc=PERC) of
the terminal 50 bp (telosize=INT) matches the appropriate
regex. If either end contains a telomere, the total percentage of the
sequence matching either regex is calculated as TelPerc. Note that this
number neither restricts matches to the termini, not includes sequences
within predicted telomeres that do not match the regex. By default, only
sequences with telomeres are output to the *.telomeres.tdt
output, but switching telonull=T will output all sequences.
This can be useful if you also need a table of sequence lengths, and is
the recommended input for ChromSyn.
GenomeTweak was a simple tool designed to help manual curation of
genome assembly scaffolding. A genome assembly file is provided with
seqin=FILE. If run with mapout=FiLE then an
assembly map will be generated from this assembly file. If a
*.gaps.tdt is found (or gapfile=TDT), this
will be used for gap positions. Otherwise, rje_seqlist will
generate the gaps file and use that. If telofile=TDT is
provided, telomere positions for the contigs will be loaded in. (These
must match elements of the assembly map.) If that file is missing, the
assembly will be broken into contigs and telomere prediction run on the
contigs to generate.
If mapin=FILE is given, then GenomeTweak will generate a
new assembly (seqout=FILE
[$BASEFILE.tweak.fasta]) based on the map file and
extracting sequences from the input assembly.
The map format is as follows:
||NewName Description>>SeqName:Start-End:Strand|~GapLen~| ... |SeqName:Start-End:Strand<<
Where there is a full-length sequence, Start-End is not required:
||NewName Description>>SeqName|~GapLen~| ... |SeqName<<
Gaps can have zero length.
If telomere predictions have been loaded from a table (must match
SeqName exactly) then 5’ and 3’ telomeres will be annotated in the file
with { and }:
||NewName Description>>{SeqName:Start-End:Strand|~GapLen~| ... |SeqName:Start-End:Strand}<<
GenomeTweak mode can also be used for some direct auto-correction of
genome assemblies (autofix=T), producing
*.tweak.* output (or seqout=FILE and
fixout=FILE for the assembly and assembly map,
respectively). This goes through up to four editing cycles, depending on
input settings:
Removal of any contigs given by badcontig=LIST and
descaffold=LIST. The former will be entirely removed from
the assembly (e.g. contigs that have been identified by DepthKopy as
lacking sequence coverage), whilst the latter will be excised from
scaffolds but left in the assembly as a contig. (These can be comma
separated lists, or text files containing one contig per line.) If
removed, the downstream gap will also be removed, unless it is the 3’
end contig, in which case the upstream gap will be removed. If
gapsize=INT has been set, all gaps will be standardised to
this size.
Any regions in the MapIn file flanked by /
characters (/../) will be inverted.
Contigs and regions provided by invert=LIST will be
inverted. These are searched directly in the sequence map and can either
be direct matches, or have an internal .. that will map
onto any internal sequence map elements. In this case, the entire chunk
including the intervening region will be inverted.
Finally, each sequence is considered in term and assessed with
respect to internal telomere positions. First, end inversions are
identified as inward-facing telomeres that are (a) within
invlimit=INT bp of the end, and
trimlimit=INT bp of the end. If
trimfrag=T then these will be split into contigs, otherwise
the whole scaffold chunk will be a new sequence. Where a possible
inversion or trim is identified but beyond the distance limits, it will
appear in the log as a #NOTE.Telociraptor has not yet been published. Please cite github in the meantime.
Telociraptor is written in Python 2.x or Python 3.x and can be run directly from the commandline:
python $CODEPATH/telociraptor.py [OPTIONS]
If running as part of SLiMSuite,
$CODEPATH will be the SLiMSuite tools/
directory. If running from the standalone Telociraptor git
repo, $CODEPATH will be the path the to
code/ directory. Please see details in the Telociraptor git
repo for running on example data.
The main Telociraptor functions do not currently have any
dependencies. To generate documentation with dochtml, R
will need to be installed and a pandoc environment variable must be set,
e.g.
export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc
For Telociraptor documentation, run with dochtml=T and
read the *.docs.html file generated.
A list of commandline options can be generated at run-time using the
-h or help flags. Please see the general SLiMSuite
documentation for details of how to use commandline options,
including setting default values with INI files.
### ~ Main Telociraptor run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE : Input sequence assembly [None]
basefile=FILE : Root of output file names [$SEQINBASE]
tweak=T/F : Whether to execute GenomeTweak pipeline [False]
chromsyn=T/F : Whether to execute ChromSyn preparation (gaps.tdt and telonull telomere prediction) [False]
dochtml=T/F : Generate HTML Telociraptor documentation (*.docs.html) instead of main run [False]
### ~ Telomere options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
telomeres=T/F : Whether to generate telomere predictions [True]
telonull=T/F : Whether to output sequences without telomeres to telomere table [False]
telofile=TDT : Delimited file of `seqname seqlen tel5 tel3` [$SEQINBASE.telomeres.tdt]
telofwd=X : Regex for 5' telomere sequence search [C{2,4}T{1,2}A{1,3}]
telorev=X : Regex for 5' telomere sequence search [T{1,3}A{1,2}G{2,4}]
telosize=INT : Size of terminal regions (bp) to scan for telomeric repeats [50]
teloperc=PERC : Percentage of telomeric region matching telomeric repeat to call as telomere [50]
### ~ Genome Tweak run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqout=FILE : Output sequence filename [$BASEFILE.tweak.fasta]
mapin=FILE : Text file input for genome assembly map [None]
mapout=FILE : Text file output for genome assembly map [$BASEFILE.ctgmap.txt]
gapfile=TDT : Delimited file of `seqname start end seqlen gaplen` [$SEQINBASE.gaps.tdt]
autofix=T/F : Whether to try to fix terminal inversions based on telomere predictions [True]
fixout=FILE : Text file output for auto-fixed genome assembly map [$BASEFILE.tweak.txt]
badcontig=LIST : List of contigs to completely remove from the assembly (prior to inversion) []
descaffold=LIST : List of contigs to remove from scaffolds but keep in assembly (prior to inversion) []
invert=LIST : List of contigs/regions to invert (in order) []
invlimit=NUM : Limit inversion distance to within X% (<=100) or Xbp (>100) of chromosome termini [25]
trimlimit=NUM : Limit trimming distance to within X% (<=100) or Xbp (>100) of chromosome termini [5]
trimfrag=T/F : Whether to fragment trimmed scaffold chunks into contigs [False]
gapsize=INT : Set all gaps to size INT bp (0=removes gaps; <0 leave unchanged) [-1]
minlen=INT : Minimum sequence length (bp) [0]
### ~ Chromosome rennaming options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
chromsort=T/F : Whether to size sort sequences and then rename as CHR, CTG or SCF [False]
minchrom=INT : Minimum sequence length to be named as chromosome [1e7]
newprefix=X : Assembly chromosome prefix. If None, will just use chr, ctg and scf [None]
### ~ Telomeric read extraction ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
teloreads=FILE : Extract telomeric reads from FILE into $BASEFILE.teloreads.fa [None]
minreadlen=INT : Minimum length for filtered telomeric reads [1000]
teloseed=INT : Seed number of telomere repeats to pull out telomeric reads for the findTelomere() method [3]
termseed=T/F : Whether the seed zgrep for multiple telomeric repeats needs to be terminally constrained [False]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
Details of the Telociraptor workflow will be added with time. Please contact the author or raise an issue on GitHub if you have any questions.
© 2023 Richard Edwards | rich.edwards@uwa.edu.au