Expand

When you are writing real bioinformatics pipelines, you will often have mountains of data to analyze. Perhaps you have a time series experiment, with nine timepoints, three types of tissue and five biological replicates. If you want to produce one output file for each combination of timepoint, tissue and replicate, there are 135 output files in total. Writing these by hand in the targets/all rule will be a chore. This is the motivation for Snakemake's expand function.

rule all:
    input: ["time_0h/tissue_blood/replicate_1/result.csv",  ...] # and 134 more of these

If you have the timepoints ["0h", "3h", "6h", "9h", "12h", "15h", "18h", "21h", "24h"], tissues ["blood", "hair", "nails"] and replicates [1, 2, 3] you can list all the 135 different files you want like so:

timepoints = ["0h", "3h", "6h", "9h", "12h", "15h", "18h", "21h", "24h"]
tissues = ["blood", "hair", "nails"]
replicates = [1, 2, 3]
expand("time_{time}/tissue_{tissue}/replicate_{replicate}/result.csv", time=timepoints, tissue=tissues, replicate=replicates)

This produces the following list: ...

Now your all rule just became a lot simpler:

timepoints = ["0h", "3h", "6h", "9h", "12h", "15h", "18h", "21h", "24h"]
tissues = ["blood", "hair", "nails"]
replicates = [1, 2, 3]

rule all:
    input: expand("time_{time}/tissue_{tissue}/replicate_{replicate}/result.csv", time=timepoints,
                  tissue=tissues, replicate=replicates)

This works very well when the data are orderly and all follow the same pattern, but what if you have some irregularities in your data, like say, that the tissue blood only contains two samples, not three?

The easiest way to fix this is to have two different expand rules in the all rule, like so:

timepoints = ["0h", "3h", "6h", "9h", "12h", "15h", "18h", "21h", "24h"]

rule all:
    input: expand("time_{time}/tissue_{tissue}/replicate_{replicate}/result.csv",
                  time=timepoints, tissue=["hair", "nails"], replicate=[1, 2, 3]),
           expand("time_{time}/tissue_{tissue}/replicate_{replicate}/result.csv",
                  time=timepoints, tissue="blood", replicate=[1, 2])

Or what if you have two different types of data, with each one requiring different parameter settings? Let us say you have two ChIP-Seq datasets with histone H3K27me3 and H3K4me3 and they require the parameter bin_size to be 200 and 50, respectively. The path template for the output files are "histone{histone}/bin_size{bin_size}/result.csv"

Then you want the result of the expand function to be

"histone_H3K27me3/bin_size_200/result.csv", "histone_H3K4me3/bin_size_50/result.csv"

However, if you call

expand("histone_{histone}/bin_size_{bin_size}/result.csv",
       histone=["H3K27me3", "H3K4me3"], bin_size=[200, 50])

you get all four combinations, namely:

["histone_H3K27me3/bin_size_200/result.csv", "histone_H3K4me3/bin_size_50/result.csv",
 "histone_H3K27me3/bin_size_50/result.csv", "histone_H3K4me3/bin_size_200/result.csv"]

You can fix this by adding the argument zip after the file path template, like so:

expand("histone_{histone}/bin_size_{bin_size}/result.csv", zip,
       histone=["H3K27me3", "H3K4me3"], bin_size=[200, 50])

The zip argument is actually a Python function. It takes two or more lists and creates a new list where the nth element is a tuple made from all the nth elements in each list, like so:

list(zip([1, 2, 3], [4, 5, 6]))
[(1, 4), (2, 5), (3, 6)]

Here you see the first element in each list has become a tuple, the second element in each list has become a tuple and so on... This is what happens when you add the zip argument to expand too; "H3K27me3" and 200, the first elements, becomes one tuple, while the second elements, "H3K4me3" and 50 becomes another tuple.

There is one thing you should be aware of with the zip function though: If one of the lists is shorter than another, let us say the first list has three elements and the second one has four, it only creates three tuples, without any warning:

list(zip([1, 2, 3, 0], [4, 5, 6]))
[(1, 4), (2, 5), (3, 6)]

Note that the same thing happens when you use zip with the expand function in Snakemake:

expand("histone_{histone}/bin_size_{bin_size}/result.csv", zip,
       histone=["H3K27me3", "H3K4me3", "PolII"], bin_size=[200, 50])

Here you have forgotten to add a bin_size to go with the argument "PolII", but instead of throwing an error, Snakemake silently ignores PolII, and only produces two files, the ones for the tuples ("H3K27me3", 200) and ("H3K4me3", 50). This can be a subtle source of bugs, so in many cases it might be best to just call the expand function several times instead of using zip.

Takeaways

  • If your all rule should produce many files, there are more convenient ways of producing that list of files than writing all the filenames out by hand.
  • The expand function is one way to create a list of filenames with a similar structure.

Exercises

  1. Produce the list ["sample1/data.txt", "sample2/data.txt", "sample1/background.txt", "sample2/background.txt"] using the expand function.
  2. Produce the list ["sample1/data1.txt", "sample2/data2.txt", "sample1/background1.txt", "sample2/background2.txt"] using the expand function with the zip argument.

To test your code, simply put your code in a file and run it with snakemake --snakefile <your_filename>. You do not need to create any rules, you may just write

result = expand("{name}.txt", name=["file1", "file2"])
print(result)
# will print
# ['file1.txt', 'file2.txt']
# Nothing to be done.

results matching ""

    No results matching ""