diff --git a/_freeze/posts/2025-10-25-tidySummarizedExperiment-optimization/index/execute-results/html.json b/_freeze/posts/2025-10-25-tidySummarizedExperiment-optimization/index/execute-results/html.json new file mode 100644 index 0000000..e3f751e --- /dev/null +++ b/_freeze/posts/2025-10-25-tidySummarizedExperiment-optimization/index/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "ed26b8fc10ef321cc5e9274df88fa34b", + "result": { + "engine": "knitr", + "markdown": "---\ntitle: \"Speed-up tidySummarizedExperiment through query optimisation and plyxp backend\"\nauthor: \"Stefano Mangiola\"\ncontributors:\n - Michael Love\n - Justin Landis\ndate: \"2025-10-25\"\npackage: tidySummarizedExperiment\ntags:\n - tidyomics/tidyomicsBlog\n - optimization\n - performance\n - plyxp\n - SummarizedExperiment\n - benchmarking\ndescription: \"Performance optimization of tidySummarizedExperiment using plyxp for enhanced bulk RNA-seq data analysis workflows.\"\nimage: benchmark_plot.png\nformat:\n html:\n toc: true\n toc-float: true\n theme: yeti\n css: ../../../styles.css\nexecute:\n freeze: true\n---\n\n{width=\"150px\" fig-align=\"left\"}\n\n\n::: {.cell}\n\n:::\n\n\n\nThe generality of [`tidySummarizedExperiment`](https://bioconductor.org/packages/tidySummarizedExperiment) makes heasy to interface with several of the [`tidyverse`](https://www.tidyverse.org/) packages (e.g. [`dplyr`](https://CRAN.R-project.org/package=dplyr), [`tidyr`](https://CRAN.R-project.org/package=tidyr), [`ggplot2`](https://CRAN.R-project.org/package=ggplot2), [`purrr`](https://CRAN.R-project.org/package=purrr), [`plotly`](https://CRAN.R-project.org/package=plotly)). This is possible thanks to its approach of converting [`SummarizedExperiment`](https://bioconductor.org/packages/SummarizedExperiment) objects to tibbles, performing operations, and converting back to the original format. This conversion process introduces substantial overhead when working with large-scale datasets. Each operation requires multiple data transformations, with the conversion to tibble format creating memory copies of the entire dataset, followed by the reverse conversion back to [`SummarizedExperiment`](https://bioconductor.org/packages/SummarizedExperiment). For datasets containing hundreds of samples and tens of thousands of genes, these repeated conversions can consume gigabytes of additional memory and add significant computational time to even simple operations like filtering or grouping. \n\nWith the new [`tidySummarizedExperiment`](https://bioconductor.org/packages/tidySummarizedExperiment) release ([v1.19.7](https://github.com/tidyomics/tidySummarizedExperiment/releases/tag/v1.19.7)), we have introduced a new optimization that addresses these performance limitations. This optimization is powered by:\n\n1) Check for the query domain (assay, colData, rowData), and execute specialised operation.\n2) Use of [`plyxp`](https://bioconductor.org/packages/plyxp) for complex domain-specific queries.\n\n\n# What is plyxp?\n\n[`plyxp`](https://bioconductor.org/packages/plyxp) provides `rlang` data masks for the `SummarizedExperiment` class. This enables evaluation of unquoted expressions directly in the relevant contexts of a `SummarizedExperiment` (e.g. assays, `colData`, `rowData`), with optional access to other contexts when needed. The goal is for evaluation to feel like operating on a regular data.frame—without ever unwinding to a rectangular tibble/data.frame. This avoids costly round‑trips through intermediate tabular representations while preserving the semantics and integrity of the original Bioconductor object.\n\n### Motivation and design principles\n\nThis benchmark supports ongoing work to improve mutate() execution paths, including exploring plyxp-backed execution for mixed-slot operations (see issue: [Attempt using plyxp for some cases in tidySummarizedExperiment](https://github.com/tidyomics/genomics-todos/issues/19)). The current proposal is grounded in three principles:\n\n- Decompose operation series: break `mutate(a=..., b=..., c=...)` into single operations for simpler handling and clearer routing. Reference implementation in `R/mutate.R` (decomposition step) at [L146](https://github.com/tidyomics/tidySummarizedExperiment/blob/92072d71f9d3b9a82cfc5fdced8e52477c44d80f/R/mutate.R#L146).\n- Analyze scope: infer whether each expression targets `colData`, `rowData`, `assays`, or a mix (noting that the current analyser is likely over-engineered and could be simplified). See [L149](https://github.com/tidyomics/tidySummarizedExperiment/blob/92072d71f9d3b9a82cfc5fdced8e52477c44d80f/R/mutate.R#L149).\n- Route mixed operations via plyxp: when an expression touches multiple slots, prefer the plyxp path for correctness and performance. See [L155](https://github.com/tidyomics/tidySummarizedExperiment/blob/92072d71f9d3b9a82cfc5fdced8e52477c44d80f/R/mutate.R#L155).\n\nThese design choices aim to preserve dimnames, avoid unnecessary tibble round-trips, and provide predictable performance across simple and mixed-slot scenarios.\n\n# Benchmarking Overview\n\nThis vignette benchmarks a set of [`mutate()`](https://tidyomics.github.io/tidySummarizedExperiment/reference/mutate.html), [`filter()`](https://tidyomics.github.io/tidySummarizedExperiment/reference/filter.html), [`select()`](https://tidyomics.github.io/tidySummarizedExperiment/reference/select.html), and [`distinct()`](https://tidyomics.github.io/tidySummarizedExperiment/reference/distinct.html) scenarios (see [documentation](https://bioconductor.org/packages/tidySummarizedExperiment)) comparing performance before and after optimization, by explicitly checking out specific commits via git worktree, loading each commit's code with `devtools::load_all()`, running the same scenarios multiple times, and comparing the runtimes with ggplot boxplots.\n\n- Before optimization: [commit 87445757d2d0332e7d335d22cd28f73568b7db66](https://github.com/tidyomics/tidySummarizedExperiment/commit/87445757d2d0332e7d335d22cd28f73568b7db66)\n- After optimization: [commit 9f7c26e0519c92f9682b270d566da127367bcbc0](https://github.com/tidyomics/tidySummarizedExperiment/commit/9f7c26e0519c92f9682b270d566da127367bcbc0)\n\n\n### Setup helper functions\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsuppressPackageStartupMessages({\n library(ggplot2)\n library(dplyr)\n library(SummarizedExperiment)\n library(rlang)\n library(devtools)\n library(airway)\n library(microbenchmark)\n})\n\nload_branch_code <- function(worktree_dir) {\n if (!requireNamespace(\"devtools\", quietly = TRUE)) stop(\"Please install devtools to run this vignette.\")\n # Debug: print the directory we're looking for\n cat(\"Looking for worktree directory:\", worktree_dir, \"\\n\")\n cat(\"Directory exists:\", dir.exists(worktree_dir), \"\\n\")\n cat(\"Current working directory:\", getwd(), \"\\n\")\n # Check if directory exists\n if (!dir.exists(worktree_dir)) {\n stop(paste(\"Worktree directory does not exist:\", worktree_dir))\n }\n suppressMessages(devtools::load_all(worktree_dir, quiet = TRUE))\n}\n\ncreate_airway_test_se <- function() {\n suppressPackageStartupMessages(library(airway))\n data(airway)\n se <- airway\n se[1:200, ]\n}\n\nbenchmark_scenarios <- function() {\n list(\n coldata_simple_assignment = quo({ se |> mutate(new_dex = dex) }),\n coldata_arithmetic = quo({ se |> mutate(avgLength_plus_5 = avgLength + 5) }),\n coldata_concat = quo({ se |> mutate(sample_info = paste(cell, dex, SampleName, sep = \"_\")) }),\n coldata_grouped_mean = quo({ se |> group_by(dex) |> mutate(avgLength_group_mean = mean(avgLength)) |> ungroup() }),\n assay_simple_assignment = quo({ se |> mutate(counts_copy = counts) }),\n assay_plus_one = quo({ se |> mutate(counts_plus_1 = counts + 1) }),\n assay_log = quo({ se |> mutate(log_counts_manual = log2(counts + 1)) }),\n complex_conditional_coldata = quo({ se |> mutate(length_group = ifelse(avgLength > mean(avgLength), \"longer\", \"shorter\")) }),\n complex_nested = quo({ se |> mutate(complex_category = ifelse(dex == \"trt\" & avgLength > mean(avgLength), \"treated_long\", ifelse(dex == \"untrt\", \"untreated\", \"other\"))) }),\n mixed_assay_coldata = quo({ se |> mutate(new_counts = counts * avgLength) }),\n multiple_simple_assay = quo({ se |> mutate(normalized_counts = counts / 1000, sqrt_counts = sqrt(counts)) }),\n chained_mutates = quo({ se |> mutate(tmp = avgLength * 2) |> mutate(flag = ifelse(tmp > mean(tmp), 1, 0)) }),\n\n # Filter benchmarks (scoped and non-rectangular)\n filter_coldata_simple = quo({ se |> filter(dex == \"trt\") }),\n filter_coldata_numeric = quo({ se |> filter(avgLength > median(avgLength)) }),\n filter_assay_nonrect = quo({ se |> filter(counts > 0) }),\n\n # Select benchmarks (covering colData-only, rowData-only, assays-only, mixed)\n select_coldata_simple = quo({ se |> select(.sample, dex) }),\n select_rowdata_simple = quo({ se |> select(.feature) }),\n select_assay_only = quo({ se |> select(counts) }),\n select_mixed_keys_counts = quo({ se |> select(.sample, .feature, counts) }),\n select_coldata_wide = quo({ se |> select(.sample, dex, avgLength, SampleName) }),\n\n # Distinct benchmarks (covering colData-only, rowData-only, assays-only, mixed)\n distinct_coldata_simple = quo({ se |> distinct(dex) }),\n distinct_coldata_multiple = quo({ se |> distinct(dex, avgLength) }),\n distinct_rowdata_simple = quo({ se |> distinct(.feature) }),\n distinct_assay_only = quo({ se |> distinct(counts) }),\n distinct_mixed_keys_counts = quo({ se |> distinct(.sample, .feature, counts) }),\n distinct_coldata_wide = quo({ se |> distinct(.sample, dex, avgLength, SampleName) }),\n distinct_with_keep_all = quo({ se |> distinct(dex, .keep_all = TRUE) }),\n distinct_complex_expression = quo({ se |> distinct(dex, avgLength) })\n )\n}\n\nrun_one <- function(expr_quo, reps = 5L) {\n se_base <- create_airway_test_se()\n mb <- microbenchmark::microbenchmark(\n eval_tidy(expr_quo),\n times = reps,\n setup = { se <- se_base }, # reuse the same input, avoid recreating inside the timed expr\n control = list(warmup = 2L)\n )\n # microbenchmark returns nanoseconds; convert to milliseconds\n as.numeric(mb$time) / 1e6\n}\n\nrun_all_scenarios <- function(branch_label, reps = 7L) {\n scenarios <- benchmark_scenarios()\n out <- list()\n for (nm in names(scenarios)) {\n tms <- run_one(scenarios[[nm]], reps = reps)\n out[[length(out) + 1L]] <- data.frame(\n branch = branch_label,\n scenario = nm,\n replicate = seq_along(tms),\n elapsed_ms = tms,\n stringsAsFactors = FALSE\n )\n }\n bind_rows(out)\n}\n\n# Parallel version: run each scenario on a separate worker\nrun_all_scenarios_parallel <- function(branch_label, reps = 20L, workers = 1L, initializer = NULL) {\n scenarios <- benchmark_scenarios()\n nms <- names(scenarios)\n old_plan <- future::plan()\n on.exit(future::plan(old_plan), add = TRUE)\n future::plan(future::multisession, workers = workers)\n res <- future.apply::future_lapply(nms, function(nm) {\n if (!is.null(initializer)) initializer()\n tms <- run_one(scenarios[[nm]], reps = reps)\n data.frame(\n branch = branch_label,\n scenario = nm,\n replicate = seq_along(tms),\n elapsed_ms = tms,\n stringsAsFactors = FALSE\n )\n }, future.seed = TRUE)\n dplyr::bind_rows(res)\n}\n```\n:::\n\n\n### Run benchmarks on both branches\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Worktree directories (already exist in the post directory)\nwt_before <- \".__bench_before__\"\nwt_after <- \".__bench_after__\"\n\n# Verify worktrees exist\nif (!dir.exists(wt_before)) {\n stop(\"Worktree directory does not exist: \", wt_before)\n}\nif (!dir.exists(wt_after)) {\n stop(\"Worktree directory does not exist: \", wt_after)\n}\n\n# Before optimization (commit 87445757)\nload_branch_code(wt_before)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLooking for worktree directory: .__bench_before__ \nDirectory exists: TRUE \nCurrent working directory: /Users/a1234450/Documents/GitHub/tidyomicsBlog/posts/2025-10-25-tidySummarizedExperiment-optimization \n```\n\n\n:::\n\n```{.r .cell-code}\nres_before <- run_all_scenarios(branch_label = \"before_optimization\", reps = 10L)\n\n# After optimization (commit 9f7c26e)\nload_branch_code(wt_after)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nLooking for worktree directory: .__bench_after__ \nDirectory exists: TRUE \nCurrent working directory: /Users/a1234450/Documents/GitHub/tidyomicsBlog/posts/2025-10-25-tidySummarizedExperiment-optimization \n```\n\n\n:::\n\n```{.r .cell-code}\nres_after <- run_all_scenarios(branch_label = \"after_optimization\", reps = 10L)\n\nresults <- dplyr::bind_rows(res_before, res_after) |>\n dplyr::mutate(operation = dplyr::case_when(\n grepl(\"^filter\", scenario) ~ \"filter\",\n grepl(\"^select\", scenario) ~ \"select\",\n grepl(\"^distinct\", scenario) ~ \"distinct\",\n TRUE ~ \"mutate\"\n ))\n\nsummary_table <- results |>\n group_by(branch, scenario) |>\n summarise(median_ms = median(elapsed_ms), .groups = \"drop\") |>\n tidyr::pivot_wider(names_from = branch, values_from = median_ms) |> \n dplyr::mutate(speedup = round(before_optimization / after_optimization, 2))\n```\n:::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n\n```{=html}\n
\n\n```\n\n:::\n:::\n\n\n# Visualise with ggplot boxplots\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndodge_w <- 0.7\n\nggplot(results, aes(x = scenario, y = elapsed_ms, fill = branch)) +\n geom_boxplot(position = position_dodge(width = dodge_w), width = 0.7, outlier.shape = NA) +\n\n # Add jittered points aligned with the dodged boxplots\n geom_point(\n position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0, dodge.width = dodge_w), \n alpha = 0.6, \n size = 0.5\n ) +\n scale_y_log10() + \n coord_flip() +\n facet_grid(operation ~ ., scales = \"free_y\", space = \"free_y\") +\n annotation_logticks(sides = \"b\") +\n labs(title = \"Performance comparison: Before vs After optimization\",\n x = \"Scenario\",\n y = \"Elapsed (ms)\") +\n theme_bw() +\n \n # Angle x labels \n theme(legend.position = \"top\", axis.text.x = element_text(angle = 45, hjust = 1))\n```\n\n::: {.cell-output-display}\n{width=960}\n:::\n\n```{.r .cell-code}\n # Save the plot\n ggsave(\"benchmark_plot.png\", width = 10, height = 8)\n```\n:::\n\n\n# Session Info\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nR version 4.5.0 (2025-04-11)\nPlatform: x86_64-apple-darwin20\nRunning under: macOS Sonoma 14.6.1\n\nMatrix products: default\nBLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib \nLAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1\n\nlocale:\n[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8\n\ntime zone: Australia/Adelaide\ntzcode source: internal\n\nattached base packages:\n[1] stats4 stats graphics grDevices utils datasets methods \n[8] base \n\nother attached packages:\n [1] tidySummarizedExperiment_1.19.7 tidyr_1.3.1 \n [3] testthat_3.2.3 ttservice_0.5.3 \n [5] rlang_1.1.6 microbenchmark_1.5.0 \n [7] airway_1.29.0 SummarizedExperiment_1.39.2 \n [9] Biobase_2.69.1 GenomicRanges_1.61.6 \n[11] Seqinfo_0.99.3 IRanges_2.43.5 \n[13] S4Vectors_0.47.4 BiocGenerics_0.55.4 \n[15] generics_0.1.4 MatrixGenerics_1.21.0 \n[17] matrixStats_1.5.0 dplyr_1.1.4 \n[19] ggplot2_4.0.0 devtools_2.4.6 \n[21] usethis_3.2.1 \n\nloaded via a namespace (and not attached):\n [1] gtable_0.3.6 xfun_0.53 htmlwidgets_1.6.4 \n [4] remotes_2.5.0 lattice_0.22-7 crosstalk_1.2.2 \n [7] vctrs_0.6.5 tools_4.5.0 fansi_1.0.6 \n[10] tibble_3.3.0 pkgconfig_2.0.3 Matrix_1.7-4 \n[13] data.table_1.17.8 RColorBrewer_1.1-3 S7_0.2.0 \n[16] desc_1.4.3 lifecycle_1.0.4 stringr_1.5.2 \n[19] compiler_4.5.0 farver_2.1.2 textshaping_1.0.4 \n[22] brio_1.1.5 plyxp_1.3.11 htmltools_0.5.8.1 \n[25] lazyeval_0.2.2 yaml_2.3.10 plotly_4.11.0 \n[28] pillar_1.11.1 crayon_1.5.3 ellipsis_0.3.2 \n[31] cachem_1.1.0 DelayedArray_0.35.3 sessioninfo_1.2.3 \n[34] abind_1.4-8 tidyselect_1.2.1 digest_0.6.37 \n[37] stringi_1.8.7 purrr_1.1.0 rprojroot_2.1.1 \n[40] fastmap_1.2.0 grid_4.5.0 cli_3.6.5 \n[43] SparseArray_1.9.1 magrittr_2.0.4 S4Arrays_1.9.1 \n[46] pkgbuild_1.4.8 reactR_0.6.1 withr_3.0.2 \n[49] scales_1.4.0 rmarkdown_2.30 XVector_0.49.1 \n[52] httr_1.4.7 ragg_1.5.0 memoise_2.0.1 \n[55] evaluate_1.0.5 knitr_1.50 viridisLite_0.4.2 \n[58] reactable_0.4.4 glue_1.8.0 pkgload_1.4.1 \n[61] rstudioapi_0.17.1 jsonlite_2.0.0 R6_2.6.1 \n[64] systemfonts_1.3.1 fs_1.6.6 \n```\n\n\n:::\n:::\n\n\n\n", + "supporting": [ + "index_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n\n\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/posts/2025-10-25-tidySummarizedExperiment-optimization/index/figure-html/plot-1.png b/_freeze/posts/2025-10-25-tidySummarizedExperiment-optimization/index/figure-html/plot-1.png new file mode 100644 index 0000000..da7c49a Binary files /dev/null and b/_freeze/posts/2025-10-25-tidySummarizedExperiment-optimization/index/figure-html/plot-1.png differ diff --git a/posts/2025-10-25-tidySummarizedExperiment-optimization/.gitignore b/posts/2025-10-25-tidySummarizedExperiment-optimization/.gitignore new file mode 100644 index 0000000..66141ab --- /dev/null +++ b/posts/2025-10-25-tidySummarizedExperiment-optimization/.gitignore @@ -0,0 +1,2 @@ +.__bench_before__ +.__bench_after__ \ No newline at end of file diff --git a/posts/2025-10-25-tidySummarizedExperiment-optimization/benchmark_plot.png b/posts/2025-10-25-tidySummarizedExperiment-optimization/benchmark_plot.png new file mode 100644 index 0000000..67b7598 Binary files /dev/null and b/posts/2025-10-25-tidySummarizedExperiment-optimization/benchmark_plot.png differ diff --git a/posts/2025-10-25-tidySummarizedExperiment-optimization/index.qmd b/posts/2025-10-25-tidySummarizedExperiment-optimization/index.qmd new file mode 100644 index 0000000..c730fb5 --- /dev/null +++ b/posts/2025-10-25-tidySummarizedExperiment-optimization/index.qmd @@ -0,0 +1,315 @@ +--- +title: "Speed-up tidySummarizedExperiment through query optimisation and plyxp backend" +author: "Stefano Mangiola" +contributors: + - Michael Love + - Justin Landis +date: "2025-10-25" +package: tidySummarizedExperiment +tags: + - tidyomics/tidyomicsBlog + - optimization + - performance + - plyxp + - SummarizedExperiment + - benchmarking +description: "Performance optimization of tidySummarizedExperiment using plyxp for enhanced bulk RNA-seq data analysis workflows." +image: benchmark_plot.png +format: + html: + toc: true + toc-float: true + theme: yeti + css: ../../../styles.css +execute: + freeze: true +--- + +{width="150px" fig-align="left"} + +```{r setup} +#| echo: false +#| message: false +#| warning: false + +# Install missing packages if needed (universal approach) +required_packages <- c("devtools", "ggplot2", "dplyr", "SummarizedExperiment", + "airway", "microbenchmark", "rlang") + +for (pkg in required_packages) { + if (!require(pkg, character.only = TRUE, quietly = TRUE)) { + if (pkg %in% c("SummarizedExperiment", "airway")) { + if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") + BiocManager::install(pkg, ask = FALSE) + } else { + install.packages(pkg) + } + } +} +``` + + +The generality of [`tidySummarizedExperiment`](https://bioconductor.org/packages/tidySummarizedExperiment) makes heasy to interface with several of the [`tidyverse`](https://www.tidyverse.org/) packages (e.g. [`dplyr`](https://CRAN.R-project.org/package=dplyr), [`tidyr`](https://CRAN.R-project.org/package=tidyr), [`ggplot2`](https://CRAN.R-project.org/package=ggplot2), [`purrr`](https://CRAN.R-project.org/package=purrr), [`plotly`](https://CRAN.R-project.org/package=plotly)). This is possible thanks to its approach of converting [`SummarizedExperiment`](https://bioconductor.org/packages/SummarizedExperiment) objects to tibbles, performing operations, and converting back to the original format. This conversion process introduces substantial overhead when working with large-scale datasets. Each operation requires multiple data transformations, with the conversion to tibble format creating memory copies of the entire dataset, followed by the reverse conversion back to [`SummarizedExperiment`](https://bioconductor.org/packages/SummarizedExperiment). For datasets containing hundreds of samples and tens of thousands of genes, these repeated conversions can consume gigabytes of additional memory and add significant computational time to even simple operations like filtering or grouping. + +With the new [`tidySummarizedExperiment`](https://bioconductor.org/packages/tidySummarizedExperiment) release ([v1.19.7](https://github.com/tidyomics/tidySummarizedExperiment/releases/tag/v1.19.7)), we have introduced a new optimization that addresses these performance limitations. This optimization is powered by: + +1) Check for the query domain (assay, colData, rowData), and execute specialised operation. +2) Use of [`plyxp`](https://bioconductor.org/packages/plyxp) for complex domain-specific queries. + + +# What is plyxp? + +[`plyxp`](https://bioconductor.org/packages/plyxp) provides `rlang` data masks for the `SummarizedExperiment` class. This enables evaluation of unquoted expressions directly in the relevant contexts of a `SummarizedExperiment` (e.g. assays, `colData`, `rowData`), with optional access to other contexts when needed. The goal is for evaluation to feel like operating on a regular data.frame—without ever unwinding to a rectangular tibble/data.frame. This avoids costly round‑trips through intermediate tabular representations while preserving the semantics and integrity of the original Bioconductor object. + +### Motivation and design principles + +This benchmark supports ongoing work to improve mutate() execution paths, including exploring plyxp-backed execution for mixed-slot operations (see issue: [Attempt using plyxp for some cases in tidySummarizedExperiment](https://github.com/tidyomics/genomics-todos/issues/19)). The current proposal is grounded in three principles: + +- Decompose operation series: break `mutate(a=..., b=..., c=...)` into single operations for simpler handling and clearer routing. Reference implementation in `R/mutate.R` (decomposition step) at [L146](https://github.com/tidyomics/tidySummarizedExperiment/blob/92072d71f9d3b9a82cfc5fdced8e52477c44d80f/R/mutate.R#L146). +- Analyze scope: infer whether each expression targets `colData`, `rowData`, `assays`, or a mix (noting that the current analyser is likely over-engineered and could be simplified). See [L149](https://github.com/tidyomics/tidySummarizedExperiment/blob/92072d71f9d3b9a82cfc5fdced8e52477c44d80f/R/mutate.R#L149). +- Route mixed operations via plyxp: when an expression touches multiple slots, prefer the plyxp path for correctness and performance. See [L155](https://github.com/tidyomics/tidySummarizedExperiment/blob/92072d71f9d3b9a82cfc5fdced8e52477c44d80f/R/mutate.R#L155). + +These design choices aim to preserve dimnames, avoid unnecessary tibble round-trips, and provide predictable performance across simple and mixed-slot scenarios. + +# Benchmarking Overview + +This vignette benchmarks a set of [`mutate()`](https://tidyomics.github.io/tidySummarizedExperiment/reference/mutate.html), [`filter()`](https://tidyomics.github.io/tidySummarizedExperiment/reference/filter.html), [`select()`](https://tidyomics.github.io/tidySummarizedExperiment/reference/select.html), and [`distinct()`](https://tidyomics.github.io/tidySummarizedExperiment/reference/distinct.html) scenarios (see [documentation](https://bioconductor.org/packages/tidySummarizedExperiment)) comparing performance before and after optimization, by explicitly checking out specific commits via git worktree, loading each commit's code with `devtools::load_all()`, running the same scenarios multiple times, and comparing the runtimes with ggplot boxplots. + +- Before optimization: [commit 87445757d2d0332e7d335d22cd28f73568b7db66](https://github.com/tidyomics/tidySummarizedExperiment/commit/87445757d2d0332e7d335d22cd28f73568b7db66) +- After optimization: [commit 9f7c26e0519c92f9682b270d566da127367bcbc0](https://github.com/tidyomics/tidySummarizedExperiment/commit/9f7c26e0519c92f9682b270d566da127367bcbc0) + + +### Setup helper functions + +```{r helpers} +#| message: false +#| warning: false + +suppressPackageStartupMessages({ + library(ggplot2) + library(dplyr) + library(SummarizedExperiment) + library(rlang) + library(devtools) + library(airway) + library(microbenchmark) +}) + +load_branch_code <- function(worktree_dir) { + if (!requireNamespace("devtools", quietly = TRUE)) stop("Please install devtools to run this vignette.") + # Debug: print the directory we're looking for + cat("Looking for worktree directory:", worktree_dir, "\n") + cat("Directory exists:", dir.exists(worktree_dir), "\n") + cat("Current working directory:", getwd(), "\n") + # Check if directory exists + if (!dir.exists(worktree_dir)) { + stop(paste("Worktree directory does not exist:", worktree_dir)) + } + suppressMessages(devtools::load_all(worktree_dir, quiet = TRUE)) +} + +create_airway_test_se <- function() { + suppressPackageStartupMessages(library(airway)) + data(airway) + se <- airway + se[1:200, ] +} + +benchmark_scenarios <- function() { + list( + coldata_simple_assignment = quo({ se |> mutate(new_dex = dex) }), + coldata_arithmetic = quo({ se |> mutate(avgLength_plus_5 = avgLength + 5) }), + coldata_concat = quo({ se |> mutate(sample_info = paste(cell, dex, SampleName, sep = "_")) }), + coldata_grouped_mean = quo({ se |> group_by(dex) |> mutate(avgLength_group_mean = mean(avgLength)) |> ungroup() }), + assay_simple_assignment = quo({ se |> mutate(counts_copy = counts) }), + assay_plus_one = quo({ se |> mutate(counts_plus_1 = counts + 1) }), + assay_log = quo({ se |> mutate(log_counts_manual = log2(counts + 1)) }), + complex_conditional_coldata = quo({ se |> mutate(length_group = ifelse(avgLength > mean(avgLength), "longer", "shorter")) }), + complex_nested = quo({ se |> mutate(complex_category = ifelse(dex == "trt" & avgLength > mean(avgLength), "treated_long", ifelse(dex == "untrt", "untreated", "other"))) }), + mixed_assay_coldata = quo({ se |> mutate(new_counts = counts * avgLength) }), + multiple_simple_assay = quo({ se |> mutate(normalized_counts = counts / 1000, sqrt_counts = sqrt(counts)) }), + chained_mutates = quo({ se |> mutate(tmp = avgLength * 2) |> mutate(flag = ifelse(tmp > mean(tmp), 1, 0)) }), + + # Filter benchmarks (scoped and non-rectangular) + filter_coldata_simple = quo({ se |> filter(dex == "trt") }), + filter_coldata_numeric = quo({ se |> filter(avgLength > median(avgLength)) }), + filter_assay_nonrect = quo({ se |> filter(counts > 0) }), + + # Select benchmarks (covering colData-only, rowData-only, assays-only, mixed) + select_coldata_simple = quo({ se |> select(.sample, dex) }), + select_rowdata_simple = quo({ se |> select(.feature) }), + select_assay_only = quo({ se |> select(counts) }), + select_mixed_keys_counts = quo({ se |> select(.sample, .feature, counts) }), + select_coldata_wide = quo({ se |> select(.sample, dex, avgLength, SampleName) }), + + # Distinct benchmarks (covering colData-only, rowData-only, assays-only, mixed) + distinct_coldata_simple = quo({ se |> distinct(dex) }), + distinct_coldata_multiple = quo({ se |> distinct(dex, avgLength) }), + distinct_rowdata_simple = quo({ se |> distinct(.feature) }), + distinct_assay_only = quo({ se |> distinct(counts) }), + distinct_mixed_keys_counts = quo({ se |> distinct(.sample, .feature, counts) }), + distinct_coldata_wide = quo({ se |> distinct(.sample, dex, avgLength, SampleName) }), + distinct_with_keep_all = quo({ se |> distinct(dex, .keep_all = TRUE) }), + distinct_complex_expression = quo({ se |> distinct(dex, avgLength) }) + ) +} + +run_one <- function(expr_quo, reps = 5L) { + se_base <- create_airway_test_se() + mb <- microbenchmark::microbenchmark( + eval_tidy(expr_quo), + times = reps, + setup = { se <- se_base }, # reuse the same input, avoid recreating inside the timed expr + control = list(warmup = 2L) + ) + # microbenchmark returns nanoseconds; convert to milliseconds + as.numeric(mb$time) / 1e6 +} + +run_all_scenarios <- function(branch_label, reps = 7L) { + scenarios <- benchmark_scenarios() + out <- list() + for (nm in names(scenarios)) { + tms <- run_one(scenarios[[nm]], reps = reps) + out[[length(out) + 1L]] <- data.frame( + branch = branch_label, + scenario = nm, + replicate = seq_along(tms), + elapsed_ms = tms, + stringsAsFactors = FALSE + ) + } + bind_rows(out) +} + +# Parallel version: run each scenario on a separate worker +run_all_scenarios_parallel <- function(branch_label, reps = 20L, workers = 1L, initializer = NULL) { + scenarios <- benchmark_scenarios() + nms <- names(scenarios) + old_plan <- future::plan() + on.exit(future::plan(old_plan), add = TRUE) + future::plan(future::multisession, workers = workers) + res <- future.apply::future_lapply(nms, function(nm) { + if (!is.null(initializer)) initializer() + tms <- run_one(scenarios[[nm]], reps = reps) + data.frame( + branch = branch_label, + scenario = nm, + replicate = seq_along(tms), + elapsed_ms = tms, + stringsAsFactors = FALSE + ) + }, future.seed = TRUE) + dplyr::bind_rows(res) +} +``` + +### Run benchmarks on both branches + +```{r worktrees} +#| cache: false +#| warning: false +#| message: false +# Worktree directories (already exist in the post directory) +wt_before <- ".__bench_before__" +wt_after <- ".__bench_after__" + +# Verify worktrees exist +if (!dir.exists(wt_before)) { + stop("Worktree directory does not exist: ", wt_before) +} +if (!dir.exists(wt_after)) { + stop("Worktree directory does not exist: ", wt_after) +} + +# Before optimization (commit 87445757) +load_branch_code(wt_before) +res_before <- run_all_scenarios(branch_label = "before_optimization", reps = 10L) + +# After optimization (commit 9f7c26e) +load_branch_code(wt_after) +res_after <- run_all_scenarios(branch_label = "after_optimization", reps = 10L) + +results <- dplyr::bind_rows(res_before, res_after) |> + dplyr::mutate(operation = dplyr::case_when( + grepl("^filter", scenario) ~ "filter", + grepl("^select", scenario) ~ "select", + grepl("^distinct", scenario) ~ "distinct", + TRUE ~ "mutate" + )) + +summary_table <- results |> + group_by(branch, scenario) |> + summarise(median_ms = median(elapsed_ms), .groups = "drop") |> + tidyr::pivot_wider(names_from = branch, values_from = median_ms) |> + dplyr::mutate(speedup = round(before_optimization / after_optimization, 2)) + +``` + +```{r summary_table} +#| echo: false +#| message: false +#| warning: false + +reactable::reactable( + summary_table, + searchable = TRUE, + filterable = TRUE, + sortable = TRUE, + pagination = TRUE, + defaultPageSize = 10, + highlight = TRUE, + striped = TRUE, + bordered = TRUE, + compact = TRUE, + defaultColDef = reactable::colDef(align = "left", minWidth = 120), + columns = list( + scenario = reactable::colDef(name = "Scenario", minWidth = 220), + after_optimization = reactable::colDef(name = "After (ms)", format = reactable::colFormat(digits = 1)), + before_optimization = reactable::colDef(name = "Before (ms)", format = reactable::colFormat(digits = 1)), + speedup = reactable::colDef(name = "Speedup (x)", format = reactable::colFormat(digits = 2)) + ) +) +``` + +# Visualise with ggplot boxplots + +```{r plot} +#| fig-width: 10 +#| fig-height: 8 +dodge_w <- 0.7 + +ggplot(results, aes(x = scenario, y = elapsed_ms, fill = branch)) + + geom_boxplot(position = position_dodge(width = dodge_w), width = 0.7, outlier.shape = NA) + + + # Add jittered points aligned with the dodged boxplots + geom_point( + position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0, dodge.width = dodge_w), + alpha = 0.6, + size = 0.5 + ) + + scale_y_log10() + + coord_flip() + + facet_grid(operation ~ ., scales = "free_y", space = "free_y") + + annotation_logticks(sides = "b") + + labs(title = "Performance comparison: Before vs After optimization", + x = "Scenario", + y = "Elapsed (ms)") + + theme_bw() + + + # Angle x labels + theme(legend.position = "top", axis.text.x = element_text(angle = 45, hjust = 1)) + + # Save the plot + ggsave("benchmark_plot.png", width = 10, height = 8) + + +``` + +# Session Info +```{r sessionInfo} +#| echo: false +sessionInfo() +``` + + diff --git a/posts/2025-10-25-tidySummarizedExperiment-optimization/logo.png b/posts/2025-10-25-tidySummarizedExperiment-optimization/logo.png new file mode 100644 index 0000000..5de8a74 Binary files /dev/null and b/posts/2025-10-25-tidySummarizedExperiment-optimization/logo.png differ