This tutorial shows how to use RVenn
, a package for
dealing with multiple sets. The base R functions
(intersect
, union
and setdiff
)
only work with two sets. %>%
can be used from
magrittr
but, for many sets this can be tedious.
reduce
function from purrr
package also
provides a solution, which is the function that is used for set
operations in this package. The functions overlap
,
unite
and discern
abstract away the details,
so one can just construct the universe and choose the sets to operate by
index or set name. Further, by using ggvenn
Venn diagram
can be drawn for 2-3 sets. As you can notice from the name of the
function, ggvenn
is based on ggplot2
, so it is
a neat way to show the relationship among a reduced number sets. For
many sets, it is much better to use UpSet or setmap
function provided within this package. Finally, by using
enrichment_test
function, the p-value of an overlap between
two sets can be calculated. Here, the usage of all these functions will
be shown.
This chunk of code will create 10 sets with sizes ranging from 5 to 25.
set.seed(42)
toy = map(sample(5:25, replace = TRUE, size = 10),
function(x) sample(letters, size = x))
toy[1:3] # First 3 of the sets.
#> [[1]]
#> [1] "y" "e" "n" "t" "r" "o" "c" "i" "d" "z" "u" "m" "q" "v" "b" "h" "w" "a" "l"
#> [20] "f" "j"
#>
#> [[2]]
#> [1] "o" "v" "h" "d" "w" "r" "m" "e" "y"
#>
#> [[3]]
#> [1] "b" "x" "r" "c" "q"
Intersection of all sets:
Intersection of selected sets (chosen with set names or indices, respectively):
overlap_pairs(toy, slice = 1:4)
#> $Set_1...Set_2
#> [1] "y" "e" "r" "o" "d" "m" "v" "h" "w"
#>
#> $Set_1...Set_3
#> [1] "r" "c" "q" "b"
#>
#> $Set_1...Set_4
#> [1] "e" "n" "t" "r" "c" "u" "v" "b" "h" "w" "a" "f" "j"
#>
#> $Set_2...Set_3
#> [1] "r"
#>
#> $Set_2...Set_4
#> [1] "v" "h" "w" "r" "e"
#>
#> $Set_3...Set_4
#> [1] "b" "x" "r" "c"
Union of all sets:
unite(toy)
#> [1] "y" "e" "n" "t" "r" "o" "c" "i" "d" "z" "u" "m" "q" "v" "b" "h" "w" "a" "l"
#> [20] "f" "j" "x" "g" "k" "p" "s"
Union of selected sets (chosen with set names or indices, respectively):
unite_pairs(toy, slice = 1:4)
#> $Set_1...Set_2
#> [1] "y" "e" "n" "t" "r" "o" "c" "i" "d" "z" "u" "m" "q" "v" "b" "h" "w" "a" "l"
#> [20] "f" "j"
#>
#> $Set_1...Set_3
#> [1] "y" "e" "n" "t" "r" "o" "c" "i" "d" "z" "u" "m" "q" "v" "b" "h" "w" "a" "l"
#> [20] "f" "j" "x"
#>
#> $Set_1...Set_4
#> [1] "y" "e" "n" "t" "r" "o" "c" "i" "d" "z" "u" "m" "q" "v" "b" "h" "w" "a" "l"
#> [20] "f" "j" "x"
#>
#> $Set_2...Set_3
#> [1] "o" "v" "h" "d" "w" "r" "m" "e" "y" "b" "x" "c" "q"
#>
#> $Set_2...Set_4
#> [1] "o" "v" "h" "d" "w" "r" "m" "e" "y" "u" "f" "x" "b" "t" "c" "j" "a" "n"
#>
#> $Set_3...Set_4
#> [1] "b" "x" "r" "c" "q" "u" "f" "t" "v" "w" "j" "h" "e" "a" "n"
discern_pairs(toy, slice = 1:4)
#> $Set_1...Set_2
#> [1] "n" "t" "c" "i" "z" "u" "q" "b" "a" "l" "f" "j"
#>
#> $Set_1...Set_3
#> [1] "y" "e" "n" "t" "o" "i" "d" "z" "u" "m" "v" "h" "w" "a" "l" "f" "j"
#>
#> $Set_1...Set_4
#> [1] "y" "o" "i" "d" "z" "m" "q" "l"
#>
#> $Set_2...Set_3
#> [1] "o" "v" "h" "d" "w" "m" "e" "y"
#>
#> $Set_2...Set_4
#> [1] "o" "d" "m" "y"
#>
#> $Set_3...Set_4
#> [1] "q"
#>
#> $Set_2...Set_1
#> character(0)
#>
#> $Set_3...Set_1
#> [1] "x"
#>
#> $Set_4...Set_1
#> [1] "x"
#>
#> $Set_3...Set_2
#> [1] "b" "x" "c" "q"
#>
#> $Set_4...Set_2
#> [1] "u" "f" "x" "b" "t" "c" "j" "a" "n"
#>
#> $Set_4...Set_3
#> [1] "u" "f" "t" "v" "w" "j" "h" "e" "a" "n"
For two sets:
For three sets:
Without clustering
er = enrichment_test(toy, 6, 7)
er$Significance
#> [1] 0.6028
qplot(er$Overlap_Counts, geom = "blank") +
geom_histogram(fill = "lemonchiffon4", bins = 8, color = "black") +
geom_vline(xintercept = length(overlap(toy, c(6, 7))), color = "firebrick2",
size = 2, linetype = "dashed", alpha = 0.7) +
ggtitle("Null Distribution") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_continuous(name = "Overlap Counts") +
scale_y_continuous(name = "Frequency")
#> Warning: `qplot()` was deprecated in ggplot2 3.4.0.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
The test above, of course, is not very meaningful as we randomly
created the sets; therefore, we get a high p-value. However, when you
are working with actual data, e.g. to check if a motif is enriched in
the promoter regions of the genes in a gene set, you can use this test.
In that case, set1
will be the gene set of interest,
set2
will be the all the genes that the motif is found in
the genome and univ
will be all the genes of a genome.