CropXR Datapilot Visualization
Data visualization for the Datapilot project.
Graphs can be filtered by clicking on the categories in legends.
Nomenclature:
- Job: a computation that takes inputs, creates intermediates, and produces outputs.
- Pipeline: orchestrates jobs such that some job outputs are fed to job inputs, creating a pipelined computation. A pipeline is configurable, has defined inputs, and defined outputs.
- Run: a full run of a pipeline with some configuration and input files. A run consists of executed jobs, intermediate files, and output files.
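For illustration, these concepts could be modeled roughly as follows (a minimal sketch with hypothetical dataclasses, not the actual data model used by the pipeline engine):

```python
from dataclasses import dataclass, field

# Illustrative only: hypothetical classes mirroring the nomenclature above.
@dataclass
class Job:
    name: str
    inputs: list[str]          # input file paths
    intermediates: list[str]   # intermediate file paths
    outputs: list[str]         # output file paths

@dataclass
class Pipeline:
    name: str
    config: dict               # pipeline configuration
    jobs: list[Job]            # jobs orchestrated by the pipeline

@dataclass
class Run:
    pipeline: Pipeline
    config: dict               # configuration used for this run
    input_files: list[str]
    executed_jobs: list[Job] = field(default_factory=list)
    intermediate_files: list[str] = field(default_factory=list)
    output_files: list[str] = field(default_factory=list)
```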
Main Comparison
We first compare metrics of compute/storage combinations, for each pipeline and input sample count. For each subsection, we have two sets of graphs:
- Metrics aggregated as a mean and standard error over multiple runs with the same compute provider, and in some graphs also the storage provider.
- Metrics for each run as a separate data point over time, for consistency checking. These graphs can be panned and zoomed.
Each set of graphs is faceted by input sample count in columns and, where necessary, by pipeline in rows.
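As an illustration of this faceting, a sketch assuming Plotly Express as the plotting backend (column names and values are hypothetical placeholders, not actual measurements):

```python
import pandas as pd
import plotly.express as px

# Hypothetical aggregated data; values are placeholders for illustration only.
df = pd.DataFrame({
    "combination":  ["Snellius + Azure Blob", "DAIC + Research Drive"] * 2,
    "pipeline":     ["RNA"] * 4,
    "sample_count": [2, 2, 8, 8],
    "mean_time_s":  [1200, 1500, 4100, 5300],
    "stderr_s":     [40, 90, 120, 310],
})

fig = px.bar(
    df,
    x="combination",
    y="mean_time_s",
    error_y="stderr_s",          # standard error over multiple runs
    facet_col="sample_count",    # one column per input sample count
    facet_row="pipeline",        # one row per pipeline, where necessary
    color="combination",         # clicking legend entries filters the traces
)
fig.show()
```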
Run Time Spent
Time spent by runs. The time spent by a run is the sum of time spent by the jobs of that run. For clustered runs, this total time spent is much higher than the “actual time” it takes to complete the pipeline, due to high job parallelism.
Variance is expected due to the availability of cluster nodes, resource sharing on a cluster node, network throughput, filesystem throughput, etc.
Metrics:
- Work: Time spent on the actual work of a job, not including job setup (see below).
- Job: Time spent by a job, including the actual work and job setup (see below).
- Job Setup: Time spent setting up jobs. Job setup includes: pulling container images, starting containers, file staging/unstaging, etc.
- Job Pending: Time spent waiting for jobs to be scheduled on a cluster node.
- Full: Full time spent by the run, including job execution, setup, and pending times.
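A minimal sketch of how these run-level time metrics could be aggregated from per-job timings (field names and values are illustrative, not the actual schema):

```python
# Hypothetical per-job timing records; values are placeholders.
jobs = [
    {"work_s": 120.0, "setup_s": 15.0, "pending_s": 30.0},
    {"work_s": 300.0, "setup_s": 20.0, "pending_s": 5.0},
]

work    = sum(j["work_s"] for j in jobs)      # "Work": actual work only
setup   = sum(j["setup_s"] for j in jobs)     # "Job Setup": pulling images, staging, etc.
job     = work + setup                        # "Job": work plus setup
pending = sum(j["pending_s"] for j in jobs)   # "Job Pending": waiting to be scheduled
full    = job + pending                       # "Full": execution + setup + pending
```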
Run IO
Amount of input/output (IO) data that a run reads/writes.
Data is calculated by summing data from /proc/$pid/io for each job of a run, with $pid being the PID of each job.
Metrics:
- All reads: rchar; bytes returned by successful read and similar system calls.
- All writes: wchar; bytes returned by successful write and similar system calls.
- Disk reads: read_bytes; bytes really fetched from the storage layer. Accurate for block-backed filesystems.
- Disk writes: write_bytes; bytes really sent to the storage layer.
- Non-disk reads: rchar - read_bytes
- Non-disk writes: wchar - write_bytes
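A rough sketch of this collection, assuming the counters are read directly from /proc/$pid/io and summed over the jobs of a run (function names are hypothetical):

```python
def read_proc_io(pid: int) -> dict[str, int]:
    """Parse the counters from /proc/$pid/io (lines like 'rchar: 12345')."""
    counters = {}
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            key, value = line.split(":")
            counters[key.strip()] = int(value)
    return counters

def run_io(job_pids: list[int]) -> dict[str, int]:
    """Sum the IO counters of all jobs of a run and derive the non-disk metrics."""
    total = {"rchar": 0, "wchar": 0, "read_bytes": 0, "write_bytes": 0}
    for pid in job_pids:
        io = read_proc_io(pid)
        for key in total:
            total[key] += io.get(key, 0)
    total["non_disk_reads"] = total["rchar"] - total["read_bytes"]
    total["non_disk_writes"] = total["wchar"] - total["write_bytes"]
    return total
```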
Compute Notes
- Snellius: Disk reads and writes are always an extremely low number or 0, possibly due to its use of GPFS.
- Research Cloud: Disk reads and writes are always an extremely low number or 0, possibly also due to network-backed storage.
Run Rclone IO
Amount of input/output (IO) data that a run reads/writes through Rclone. Metrics are identical to the Run IO section.
Rclone is used to mount remote storages so that pipelines can interact with those remote storages as if they were on the local filesystem. With clustered runs, each cluster node needs to mount the remote storage via Rclone; since the jobs of a clustered run can run on different cluster nodes, a single run can therefore have multiple Rclone daemon processes.
When a job starts an Rclone daemon process, we store its PID along with data from /proc/$pid/io. We then calculate the final data by taking the maximum values for each Rclone daemon process (identified by PID), and summing those together.
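A sketch of this max-then-sum aggregation (the data structure and names are hypothetical):

```python
def aggregate_rclone_io(samples: dict[int, list[dict[str, int]]]) -> dict[str, int]:
    """Aggregate IO over Rclone daemons.

    `samples` maps each Rclone daemon PID to the /proc/$pid/io snapshots
    recorded for it over the course of the run.
    """
    fields = ("rchar", "wchar", "read_bytes", "write_bytes")
    total = {f: 0 for f in fields}
    for pid, snapshots in samples.items():
        for f in fields:
            # Counters are monotonic per process, so the maximum observed value
            # is the final amount transferred by that daemon; sum over daemons.
            total[f] += max(s.get(f, 0) for s in snapshots)
    return total
```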
We run Rclone mount with --vfs-cache-mode writes, so files opened as write-only and read/write are first buffered to disk. This cache is removed after a successful run, and before a fresh run, such that caches from separate runs do not interfere.
The horizontal line is the size of the input data for read metrics, and the size of the output data for write metrics. Rclone should transfer roughly this much data if it downloads/uploads the input/output once. This data is taken from the Pipeline Input/Output File Sizes section.
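If the figures are Plotly figures (an assumption), the reference line could be drawn roughly like this (values are placeholders):

```python
import plotly.graph_objects as go

# Hypothetical value: total size of the run's input data, in bytes.
input_size_bytes = 12_000_000_000

fig = go.Figure()  # stand-in for the Rclone read-metrics figure
# Dashed reference line: roughly how much Rclone should transfer if it
# downloads the input exactly once.
fig.add_hline(y=input_size_bytes, line_dash="dash", annotation_text="input size")
```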
Notes
- Some combinations report much higher non-disk read numbers than the Rclone log file. For example, Azure Files non-disk reads are around 400 GB, while the log file shows ~85 GB of file transfers. I don’t know what is causing this. Combinations affected:
- DAIC with Azure Files and Research Drive
- Research Cloud with Azure Blob (and maybe others in the future)
- Failed/retried jobs can incur additional Rclone transfers, increasing IO of the Rclone mount daemon, which can cause variance.
Pipeline Input/Output File Sizes
Size of input/output files of runs, aggregated by pipeline.
Notes
- Intermediates started being tracked on 2026-01-22
- Small variance is caused by:
- Rclone still uploading output files to the remote, at the moment of measurement.
- MultiQC reports not being fully generated and/or uploaded.
Pipeline Input/Output File Counts
Number of input/output files of runs, aggregated by pipeline.
Notes
- Intermediate files are not counted; we only track their size.
- Small variance is caused by:
- Rclone still uploading output files to the remote, at the moment of measurement.
- MultiQC reports not being fully generated and/or uploaded.
Additional Run Data
Additional data about runs which may be of some use.
STAR Align Jobs Time Spent
Time spent by the STAR align jobs of runs; STAR align is the longest-running job in the RNA pipelines. Runs include multiple STAR align jobs, one for each input sample.
Metrics are identical to the Run Time Spent section.
Run IO Operations
Amount of input/output (IO) operations of a run. Data calculation identical to the Run IO section.
Metrics:
- Read system calls: syscr; number of read and similar system calls.
- Write system calls: syscw; number of write and similar system calls.
Run Rclone IO Operations
Amount of input/output (IO) operations that a run makes through Rclone. Data calculation and metrics are identical to the Run IO Operations section.
Run Memory Usage
Maximum memory usage of runs. The maximum memory usage of a run is the maximum memory usage among the jobs of that run.
Metrics:
- Peak real memory: VmHWM from /proc/$pid/status, with $pid being the PID of each job.
- Peak virtual memory: VmPeak from /proc/$pid/status, with $pid being the PID of each job.
- Real memory: ps -o rss at the end of the job.
- Virtual memory: ps -o vsz at the end of the job.
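A sketch of how these memory metrics could be collected (hypothetical helper functions; the actual collection code may differ):

```python
import subprocess

def peak_memory_kib(pid: int) -> dict[str, int]:
    """Read VmHWM (peak real) and VmPeak (peak virtual) from /proc/$pid/status."""
    peaks = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("VmHWM:", "VmPeak:")):
                key, value = line.split(":")
                peaks[key] = int(value.strip().split()[0])  # reported in kB
    return peaks

def end_of_job_memory_kib(pid: int) -> dict[str, int]:
    """Read current RSS (real) and VSZ (virtual) via ps at the end of a job."""
    out = subprocess.run(["ps", "-o", "rss=,vsz=", "-p", str(pid)],
                         capture_output=True, text=True, check=True).stdout.split()
    return {"rss": int(out[0]), "vsz": int(out[1])}

# Run-level value: the maximum among the jobs of the run, e.g.
# run_peak = max(peak_memory_kib(pid).get("VmHWM", 0) for pid in job_pids)
```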
Run Context Switches
Amount of context switches of a run. Data is calculated by summing data from /proc/$pid/status for each job of a run, with $pid being the PID of each job.
Metrics:
- Voluntary context switches: voluntary_ctxt_switches
- Involuntary context switches: nonvoluntary_ctxt_switches
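A sketch of this summation, assuming the counters are read from /proc/$pid/status (function name is hypothetical):

```python
def run_context_switches(job_pids: list[int]) -> dict[str, int]:
    """Sum the context-switch counters over all jobs of a run."""
    total = {"voluntary_ctxt_switches": 0, "nonvoluntary_ctxt_switches": 0}
    for pid in job_pids:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                key, _, value = line.partition(":")
                if key in total:
                    total[key] += int(value)
    return total
```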
Consistency Checking
Some data for consistency checking.
Job Status
Total number of jobs with a certain status, categorized by compute/storage combination.
Job Attempts
Total number of jobs with a certain attempt count. Only jobs which were attempted more than once (retried) are shown.
Run Count
Raw Data
Raw data used in this visualization.
All Jobs
All Files
Aggregated Files
Aggregated IO
Aggregated Rclone IO