Wildcards

In this chapter, we will learn about wildcards, one of Snakemake's most powerful features, which make input and output directives far more flexible. By replacing hardcoded file names in input and output directives with wildcards, you can greatly reduce the amount of code needed and have the pipeline work on new data without modification.

We will use wildcards to create a general pipeline that tells you how long it takes to get a reply from internet hosts, such as DuckDuckGo or GitHub. First, we will solve the problem without wildcards. Afterwards, we will introduce them and clean up the code.

To start with, we tell Snakemake what files to produce in the pipeline. We do this in the topmost rule, which we call targets. (Besides all, targets is the most common name for this rule.)

rule targets:
 input: "results/github.txt", "results/duckduckgo.txt"

Now we create the rule to fetch these host response times. We call it ping, which is the name of the command line tool that does this. (Notice that ping is another rule with no input files.)

rule ping:
 output: "results/github.txt", "results/duckduckgo.txt"
 run:
  shell("ping github.com > results/github.txt")
  shell("ping duckduckgo.com > results/duckduckgo.txt")

There are two new elements here: the run directive and the shell function.

The directive we have used up until now has been shell. It runs command line programs for you. The directive run is for Python code. Here we use it to call the function shell, which does the same thing as the shell directive. This means you can call the shell function within a block of Python code.

The reason we use the run directive with the shell function is that we need to call several shell commands. We cannot have several shell directives in the same rule, but we can have several shell functions in one run directive.
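To make the idea concrete, here is a small plain-Python sketch of several shell commands being issued from one block of Python code. The function shell_stand_in is a hypothetical helper built on subprocess, not Snakemake's own shell function, and it uses echo instead of ping so that it finishes immediately and needs no network.

```python
import subprocess


def shell_stand_in(command):
    """Hypothetical stand-in for Snakemake's shell(): run one command line.

    Raises an error if the command fails and returns what it printed.
    """
    completed = subprocess.run(
        command, shell=True, check=True, capture_output=True, text=True
    )
    return completed.stdout


# Several commands in one block of Python code, as in a run directive.
outputs = []
for host in ["github", "duckduckgo"]:
    outputs.append(shell_stand_in(f"echo pinging {host}.com"))

print("".join(outputs), end="")
```

Inside a real run directive you would call Snakemake's own shell function instead, as in the rule above.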

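You can now store these two rules in a file called ping.Snakefile and run it with snakemake --snakefile ping.Snakefile (recent Snakemake versions also require a core count, for example --cores 1). Note that on Linux and macOS, ping keeps running until interrupted unless you limit the number of packets, for example with ping -c 4 github.com. Everything works, yay!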

But what if you were now told to update the pipeline to create logs for three more hosts, namely google.com, apple.com and example.com? This would be a lot of drudge work: you would need to add three new files to the input directive of the rule targets and to the output directive of the rule ping. You would also need to call the function shell("ping ... ") three more times in the run directive. Snakemake has a feature called wildcards that lets you solve this problem much more efficiently.

If you look closely, the rule ping is actually two rules in one:

rule ping_github:
 output: "results/github.txt"
 run:
  shell("ping github.com > results/github.txt")

and

rule ping_duckduckgo:
 output: "results/duckduckgo.txt"
 run:
  shell("ping duckduckgo.com > results/duckduckgo.txt")

The two rules do exactly the same thing, one for github.com and the other for duckduckgo.com.

Wildcards let you make the rule ping more general. With them, you do not need to hardcode the names into the rule. Instead, you make a slot in which any value can fit. Let's look more closely at how this works:

rule ping:
 output: "results/{host}.txt"
 run:
  shell("ping {wildcards.host}.com > results/{wildcards.host}.txt")

We no longer say that the output files should be named results/duckduckgo.txt or results/github.txt. Instead, we have made a slot, or rather a wildcard, in the file name, where both github and duckduckgo can fit.

Where do you tell the pipeline to ping github.com and duckduckgo.com, though? In the targets (or all) rule, just like before.

In the rule targets, you specify that you want Snakemake to produce the files "results/github.txt" and "results/duckduckgo.txt". So when you run Snakemake now, it will look for a rule that can produce files like these. It will find our rule ping and see that it can produce a file named "results/github.txt" by replacing the {host} part of the output file with github.

The {host} part of the path in the output directive, written in braces, is a wildcard. When Snakemake uses the rule ping to produce the output file results/github.txt, the {host} wildcard of the rule is replaced by github. The point of wildcards is that the code of the rule can change depending on the output files produced. You see this in the shell command, where "ping {wildcards.host}.com > results/{wildcards.host}.txt" is replaced with "ping github.com > results/github.txt". Note that in the run or shell directives of rules, you have to write "{wildcards.wildcard_name}" such as "{wildcards.host}" instead of just "{host}".
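The substitution described above can be imitated in plain Python. The sketch below is only an illustration, not Snakemake's actual implementation: the helper names match_output and fill_command are made up for this example, and it assumes each wildcard may match any non-empty text.

```python
import re
from types import SimpleNamespace


def match_output(pattern, target):
    """Return the wildcard values that turn `pattern` into `target`, or None."""
    regex = ""
    pos = 0
    # Turn every {name} slot into a named regex group; escape the rest literally.
    for slot in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:slot.start()])
        regex += f"(?P<{slot.group(1)}>.+)"
        pos = slot.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, target)
    return SimpleNamespace(**found.groupdict()) if found else None


def fill_command(command, wildcards):
    """Replace {wildcards.name} placeholders in a command string."""
    return command.format(wildcards=wildcards)


# The targets rule asks for this file; the rule ping offers this output pattern.
wildcards = match_output("results/{host}.txt", "results/github.txt")
command = fill_command(
    "ping {wildcards.host}.com > results/{wildcards.host}.txt", wildcards
)
print(wildcards.host)
print(command)
```

A requested file that does not fit the pattern, such as "logs/github.txt", makes match_output return None, which mirrors the case where a rule cannot produce a file.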

The rule ping is now general; it can take any address of the form "{host}.com" and ping it. So if you now want to ping a lot of other hosts, you can just add their output files to the targets rule.

Takeaways

  • If you want to run the same code on many different inputs (with the same structure), use wildcards.
  • In the input and output directives of rules, wildcards are written in braces: "{sample}/{datatype}.txt".
  • In the shell function or directive, wildcards can be accessed by prefacing the name of the wildcard with wildcards., like: "{wildcards.sample}" and "{wildcards.datatype}".
  • In addition to the shell directive, there is a run directive which executes Python code.

Exercises

  1. Can you make the pipeline above ping the hosts example.com, apple.com and google.com?

Advanced exercises

  1. Are you able to make the pipeline accept hosts with other extensions, such as .net and .org, not only ones that end in .com? Hint: you will need to include the extension in the output filename, like so: "results/bitbucket.org.txt".
