Support scripts
In addition to the main species_separator script, execution of the Sargasso pipeline relies on a number of supporting Python and Bash scripts. Their usage patterns are described here; note, however, that in normal usage these scripts need not be executed directly by the user.
build_bowtie2_index (Bash)
Usage:
build_bowtie2_index
<sequence-fasta-file> <num-threads> <index-dir> <bowtie2-build-executable>
Build a Bowtie2 index for a species’ genome. build_bowtie2_index is called from the species separation Makefile.
Options:
<sequence-fasta-file> (file path): Path to a FASTA file containing genome sequences.
<num-threads> (integer): Number of threads to be used for genome generation.
<index-dir> (file path): Path to directory where genome index files will be stored.
<bowtie2-build-executable> (file path): Path to, or name of, bowtie2-build executable.
build_star_index (Bash)
Usage:
build_star_index
<sequence-fasta-files> <gtf-file> <num-threads> <index-dir> <star-executable>
Build a STAR index for a species’ genome. build_star_index is called from the species separation Makefile.
Options:
<sequence-fasta-files> (list of file paths): Space-separated list of genome FASTA files.
<gtf-file> (file path): Path to GTF file containing transcript annotations.
<num-threads> (integer): Number of threads to be used for genome generation.
<index-dir> (file path): Path to directory where genome index files will be stored.
<star-executable> (file path): Path to, or name of, STAR executable.
collate_raw_reads (Bash)
Usage:
collate_raw_reads
<samples> <raw-reads-directory> <reads-dir> <reads-type>
<raw-read-files-1> <raw-read-files-2>
Assemble links to the FASTQ files containing raw sequencing reads for each sample. collate_raw_reads is called from the species separation Makefile.
Options:
<samples> (text parameter): Space-separated list of sample names.
<raw-reads-directory> (file path): Base directory for raw sequencing read data files.
<reads-dir> (file path): Directory in which links to raw sequencing read files will be collated.
<reads-type> (text parameter): Either “single” for single-end reads, or “paired” for paired-end reads.
<raw-read-files-1> (list of lists of file paths): Space-separated list of comma-separated lists of paths to raw sequencing read files. Each comma-separated list should correspond to a sample name in the <samples> parameter, and paths should be given relative to the <raw-reads-directory> parameter. In the case of paired-end reads, the read files should correspond to the first read of the pair.
<raw-read-files-2> (list of lists of file paths): Space-separated list of comma-separated lists of paths to raw sequencing read files. Each comma-separated list should correspond to a sample name in the <samples> parameter, and paths should be given relative to the <raw-reads-directory> parameter. In the case of paired-end reads, the read files should correspond to the second read of the pair. In the case of single-end reads, this parameter should be omitted.
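The nested list format of <raw-read-files-1> and <raw-read-files-2> is easy to get wrong by hand. The following is a minimal sketch, not part of Sargasso, of how the positional arguments could be assembled; the function name build_collate_args, the sample names, and the file paths are all hypothetical.

```python
def build_collate_args(sample_files, raw_reads_dir, reads_dir, paired):
    """Return the positional arguments for collate_raw_reads as a list.

    sample_files maps each sample name to a (first_read_files,
    second_read_files) pair of lists of paths, given relative to
    raw_reads_dir. second_read_files is ignored for single-end data.
    """
    samples = list(sample_files)
    # One comma-separated list per sample, joined by spaces, in sample order.
    files_1 = " ".join(",".join(sample_files[s][0]) for s in samples)
    args = [" ".join(samples), raw_reads_dir, reads_dir,
            "paired" if paired else "single", files_1]
    if paired:
        files_2 = " ".join(",".join(sample_files[s][1]) for s in samples)
        args.append(files_2)
    return args


# Hypothetical samples and paths, for illustration only.
example = build_collate_args(
    {
        "sample1": (["run1/s1_1.fastq.gz", "run2/s1_1.fastq.gz"],
                    ["run1/s1_2.fastq.gz", "run2/s1_2.fastq.gz"]),
        "sample2": (["run1/s2_1.fastq.gz"], ["run1/s2_2.fastq.gz"]),
    },
    "/data/raw", "reads", paired=True,
)
```

Each element of the returned list is one shell argument, so the list can be passed to subprocess.run after the script name without further quoting.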
filter_control (Python)
Usage:
filter_control
[--log-level=<log-level>] [--reject-multimaps]
<block-dir> <output-dir> <sample-name>
<mismatch-threshold> <minmatch-threshold> <multimap-threshold>
(<species>) (<species>) ...
Takes as input a directory containing sets of BAM files, each set being the result of mapping a set of mixed-species sequencing reads against each species’ genome (in normal operation, all sets of BAM files correspond to a single sample, whose mappings have been split into pieces for efficient filtering). Each set of BAM files is passed to an instance of the script filter_sample_reads, running on a separate thread, which writes filtered read mappings to a set of species-specific output BAM files.
filter_control is called by the script filter_reads.
Options:
--log-level=<log-level> (text parameter): Sets the minimum severity level at which log messages will be output (one of “debug”, “info”, “warning”, “error” or “critical”).
--reject-multimaps (flag): If set, any read which multimaps to any species’ genome will be rejected and not be assigned to any species.
<block-dir> (file path): Directory containing sets of mapped read BAM files.
<output-dir> (file path): Directory into which species-separated reads will be written.
<sample-name> (text parameter): Name of sample being processed.
<mismatch-threshold> (float): Maximum percentage of read bases allowed to be mismatches against the genome during filtering.
<minmatch-threshold> (float): Maximum percentage of read length allowed to not be mapped during filtering.
<multimap-threshold> (integer): Maximum number of multi-mappings allowed during filtering.
<species> (text parameter): Name of the nth species.
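The one-set-per-thread arrangement described for filter_control can be sketched with the Python standard library. This is an illustration only, not the script's actual implementation: the function filter_blocks, the placeholder worker describe, the num_threads argument, and the block file paths are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_blocks(block_sets, filter_one_set, num_threads):
    """Run filter_one_set on every set of BAM files, one task per set.

    block_sets is a list of dicts mapping species name to input BAM path;
    filter_one_set is a callable standing in for one filter_sample_reads
    run. Results are returned in the same order as block_sets.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(filter_one_set, block_sets))


def describe(block):
    # Placeholder worker: reports which species a set covers, without filtering.
    return sorted(block)


# Hypothetical block layout, for illustration only.
blocks = [
    {"mouse": "block1/mouse.bam", "rat": "block1/rat.bam"},
    {"mouse": "block2/mouse.bam", "rat": "block2/rat.bam"},
]
results = filter_blocks(blocks, describe, num_threads=2)
```

In a real run the worker would launch filter_sample_reads for the set rather than return a description of it.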
filter_reads (Bash)
Usage:
filter_reads
<data-type> <samples>
<input-dir> <output-dir> <num-threads>
<mismatch-threshold> <minmatch-threshold> <multimap-threshold>
<reject-multimaps>
(<species>) (<species>) ...
For each sample, take the sequencing reads mapping to each genome, and assign them to their correct species of origin. filter_reads is called by the species separation Makefile.
Options:
<data-type> (text parameter): One of “dnaseq” or “rnaseq”.
<samples> (text parameter): Space-separated list of sample names.
<input-dir> (file path): Directory containing, for each sample and each species, name-sorted BAM files containing read mappings for that sample’s sequencing reads to the species’ genome reference.
<output-dir> (file path): Directory into which species-separated BAM files are to be written.
<num-threads> (integer): Number of threads to be used during species separation.
<mismatch-threshold> (float): Maximum percentage of read bases allowed to be mismatches against the genome during filtering.
<minmatch-threshold> (float): Maximum percentage of read length allowed to not be mapped during filtering.
<multimap-threshold> (integer): Maximum number of multi-mappings allowed during filtering.
<reject-multimaps> (text parameter): If set to “--reject-multimaps”, any read which multimaps to any species’ genome will be rejected and not be assigned to any species.
<log-level> (text parameter): Sets the minimum severity level at which log messages will be output (one of “debug”, “info”, “warning”, “error” or “critical”).
<species> (text parameter): Name of the nth species.
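One plausible reading of the three thresholds is sketched here. It is not taken from the Sargasso source: the function within_thresholds and its parameter names are hypothetical, and the choices to express both percentages relative to read length and to treat every threshold as an inclusive upper bound are assumptions of the sketch.

```python
def within_thresholds(read_length, num_mismatches, num_unmapped_bases,
                      num_mappings, mismatch_threshold, minmatch_threshold,
                      multimap_threshold):
    """Return True if a read mapping satisfies all three thresholds.

    Percentages are computed relative to read_length, and each threshold
    is treated as an inclusive maximum (both are assumptions).
    """
    mismatch_pct = 100.0 * num_mismatches / read_length
    unmapped_pct = 100.0 * num_unmapped_bases / read_length
    return (mismatch_pct <= mismatch_threshold
            and unmapped_pct <= minmatch_threshold
            and num_mappings <= multimap_threshold)
```

Under these assumptions, a 100-base read with a mismatch threshold of 2 tolerates at most two mismatched bases, and a multimap threshold of 1 accepts only uniquely mapped reads.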
filter_sample_reads (Python)
Usage:
filter_sample_reads
[--log-level=<log-level>] [--reject-multimaps]
<mismatch-threshold> <minmatch-threshold> <multimap-threshold>
(<species> <species-input-bam> <species-output-bam>)
(<species> <species-input-bam> <species-output-bam>) ...
filter_sample_reads takes a set of BAM files as input, the results of mapping a set of mixed-species sequencing reads against each species’ genome, and determines, where possible, from which species each read or read pair originates. Disambiguated read mappings are written to a set of species-specific output BAM files. Note that the input BAM files must be sorted by read name (and should contain mappings for the same set of reads); failure to ensure input BAM files are correctly sorted will result in erroneous output.
filter_sample_reads is called by the script filter_control.
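Because incorrectly sorted input produces erroneous output rather than an error, it can be worth checking the ordering beforehand. The sketch below assumes the read names have already been extracted from each BAM file by some other means (for example with a BAM-reading library); the helper names distinct_names and same_read_order are hypothetical. It is stricter than the requirement stated above, since it insists that every file contains exactly the same reads.

```python
from itertools import groupby


def distinct_names(read_names):
    """Collapse consecutive repeats (mates, multimaps) into one name each."""
    return [name for name, _ in groupby(read_names)]


def same_read_order(*name_streams):
    """Return True if all streams list the same reads in the same order."""
    collapsed = [distinct_names(stream) for stream in name_streams]
    return all(names == collapsed[0] for names in collapsed[1:])
```

For large files, comparing the streams lazily would avoid holding every read name in memory; the list-based version is kept here for clarity.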
Options:
--log-level=<log-level> (text parameter): Sets the minimum severity level at which log messages will be output (one of “debug”, “info”, “warning”, “error” or “critical”).
--reject-multimaps (flag): If set, any read which multimaps to any species’ genome will be rejected and not be assigned to any species.
<mismatch-threshold> (float): Maximum percentage of read bases allowed to be mismatches against the genome during filtering.
<minmatch-threshold> (float): Maximum percentage of read length allowed to not be mapped during filtering.
<multimap-threshold> (integer): Maximum number of multi-mappings allowed during filtering.
<species> (text parameter): Name of the nth species.
<species-input-bam> (file path): BAM file containing reads mapped against the nth species’ genome.
<species-output-bam> (file path): BAM file to which read mappings assigned to the nth species after filtering will be written.
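The usage pattern above places the optional flags first, then the three thresholds, then one (species, input BAM, output BAM) triple per species. A minimal sketch of assembling such a command line follows; the helper build_filter_sample_reads_args, the species names, and the file paths are hypothetical and serve only to show the argument order.

```python
def build_filter_sample_reads_args(thresholds, species_bams,
                                   log_level=None, reject_multimaps=False):
    """Return an argument list following the filter_sample_reads usage.

    thresholds is (mismatch, minmatch, multimap); species_bams is a list
    of (species, input_bam, output_bam) triples, one per species.
    """
    args = ["filter_sample_reads"]
    if log_level is not None:
        args.append("--log-level={}".format(log_level))
    if reject_multimaps:
        args.append("--reject-multimaps")
    args.extend(str(value) for value in thresholds)
    for species, input_bam, output_bam in species_bams:
        args.extend([species, input_bam, output_bam])
    return args


# Hypothetical species names and paths, for illustration only.
command = build_filter_sample_reads_args(
    (0, 0, 1),
    [("mouse", "in/mouse.bam", "out/mouse.bam"),
     ("rat", "in/rat.bam", "out/rat.bam")],
    log_level="info",
    reject_multimaps=True,
)
```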
map_reads_dnaseq (Bash)
Usage:
map_reads_dnaseq
<species> <samples> <bowtie-indexes-dir> <num-threads>
<input-dir> <output-dir> <reads-type> <bowtie2-executable>
For each sample, map raw sequencing reads to each species’ genome. map_reads_dnaseq is called by the species separation Makefile.
Options:
<species> (text parameter): Space-separated list of species names.
<samples> (text parameter): Space-separated list of sample names.
<bowtie-indexes-dir> (file path): Directory containing Bowtie2 index directories for each species (or links to index directories).
<num-threads> (integer): Number of threads to be used by Bowtie2 during read mapping.
<input-dir> (file path): Directory containing per-sample directories, each of which contains links to the input raw sequencing read files for that sample.
<output-dir> (file path): Directory into which to write BAM files containing read mappings.
<reads-type> (text parameter): Either “single” for single-end reads, or “paired” for paired-end reads.
<bowtie2-executable> (file path): Path to, or name of, the Bowtie2 executable.
map_reads_rnaseq (Bash)
Usage:
map_reads_rnaseq
<species> <samples> <star-indexes-dir> <num-threads>
<input-dir> <output-dir> <reads-type> <star-executable>
For each sample, map raw RNA-seq reads to each species’ genome. map_reads_rnaseq is called by the species separation Makefile.
Options:
<species> (text parameter): Space-separated list of species names.
<samples> (text parameter): Space-separated list of sample names.
<star-indexes-dir> (file path): Directory containing STAR index directories for each species (or links to index directories).
<num-threads> (integer): Number of threads to be used by STAR during read mapping.
<input-dir> (file path): Directory containing per-sample directories, each of which contains links to the input raw sequencing read files for that sample.
<output-dir> (file path): Directory into which to write BAM files containing read mappings.
<reads-type> (text parameter): Either “single” for single-end reads, or “paired” for paired-end reads.
<star-executable> (file path): Path to, or name of, the STAR executable.
sort_reads (Bash)
Usage:
sort_reads
<species> <samples> <num-threads> <input-dir> <output-dir> <tmp-dir>
For each sample, sort mapped reads for each species into name order. sort_reads is called by the species separation Makefile.
Options:
<species> (text parameter): Space-separated list of species names.
<samples> (text parameter): Space-separated list of sample names.
<num-threads> (integer): Number of threads to be used by Sambamba during read sorting.
<input-dir> (file path): Directory containing BAM files containing read mappings for each sample and species.
<output-dir> (file path): Directory into which to write name-ordered BAM files containing read mappings.
<tmp-dir> (file path): Temporary directory to be used by Sambamba.
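For orientation, a name sort of a single BAM file with Sambamba might be invoked roughly as sketched here. The exact command issued by sort_reads is not documented in this section, so the choice of flags (-n for name order, -t for threads, --tmpdir, -o for output) is an assumption based on Sambamba's own command-line interface. The helper build_name_sort_command and the file paths are hypothetical.

```python
def build_name_sort_command(input_bam, output_bam, num_threads, tmp_dir,
                            sambamba="sambamba"):
    """Return an argument list for a Sambamba sort by read name."""
    return [
        sambamba, "sort",
        "-n",                             # sort by read name, not coordinate
        "-t", str(num_threads),           # number of threads
        "--tmpdir={}".format(tmp_dir),    # directory for temporary files
        "-o", output_bam,                 # output BAM file
        input_bam,
    ]


# Hypothetical paths, for illustration only.
sort_command = build_name_sort_command(
    "mapped/sample1.mouse.bam", "sorted/sample1.mouse.bam", 4, "tmp")
```

The resulting list can be passed to subprocess.run once per sample and species.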