Introduction
Snakemake
integrates with the package manager conda
and the container engine Signlarity
Snakemake is a general purpose workflow management system not only for Bioinformatics but also for any discipline.
mamba
is a package manager, which is similar to conda
but faster than miniconda
. It’s recommended to use mamba
to install snakemake
,but it’s not necessary. Alternatively, you can install snakemake
just using conda
.
A Snakemake
workflow is defined by specifying rules in a Snakefile
, in which rules decompose the workflow into small steps.
Here, we created an example workflow of genome analysis, which includes several steps from reads mapping to a reference genome to variants calling.
Step 1: Mapping reads
Firstly, we create a new file called Snakefile
with an editor of your choice (I used vim
here). In the Snakefile
, define the following rule:
1
2
3
4
5
6
7
8
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
A Snakemake
rule has a name (here bwa_map
) and a number of directives, here input
, output
and shell
. The input
and output
are just explicit Python string.
Notably, it’s compulsory to specify the maximum number of CPU cores to use at the same time. If you want to use N cores, say --cores N
or -cN
. For all cores on your system (be sure that this is appropriate) use --cores all
. For no parallelization use --cores 1
or -c1
.
Execute the snakefile
:
1
snakemake --cores 1 mapped_reads/{A,B}.bam
Step 2: Sorting read alignments
1
2
3
4
5
6
7
8
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
Note that snakemake
automatically creates missing directories before jobs are executed.