
Conversation

@a-frantz (Member) commented Aug 28, 2025

@a-frantz a-frantz force-pushed the rfcs/sprocket-test branch from c50ef7f to 439168c Compare August 28, 2025 19:38
@a-frantz a-frantz force-pushed the rfcs/sprocket-test branch from 182296a to 5371c9d Compare August 28, 2025 20:36
@a-frantz a-frantz force-pushed the rfcs/sprocket-test branch from 36d0af6 to 66dcd35 Compare August 28, 2025 20:57

@sckott commented Aug 29, 2025

This looks great, a few thoughts.

  1. Presumably test code in sprocket will be implemented in rust?
  2. At a high level this approach of testing by config (TOML) files limits what you can test, right? That is, you can't parse outputs and run arbitrary code on them? Or can you? Perhaps it makes sense to have another set of tests using pytest and python or similar if more complicated output parsing is needed? I guess this section in the PR is relevant here
  3. Will it be possible to have good/useful feedback on test failures and associated .wdl file line numbers?
  4. What about mocks? That is, presumably there will be a need for mocks when e.g., users want to avoid either time it takes to connect to a database or remote service or just can't do it given where the CI runs.

@a-frantz (Member Author):

Thanks for the review and feedback!

  1. Presumably test code in sprocket will be implemented in rust?

Are you referring to the builtin tests? Because yes, those will be written in Rust, although I consider that an implementation detail users shouldn't have to be concerned with.

  2. At a high level this approach of testing by config (TOML) files limits what you can test, right? That is, you can't parse outputs and run arbitrary code on them?

As the proposal currently stands, that is correct. No arbitrary code, although:

I guess this section in the PR is relevant here

I think this is probably going to be a "need to have" feature for many users. I have not fully landed on any implementation specifics right now, which is why it's off at the end of the document. I'd welcome some brainstorming on what this might look like!

  3. Will it be possible to have good/useful feedback on test failures and associated .wdl file line numbers?

Yes, absolutely!

  4. What about mocks? That is, presumably there will be a need for mocks when e.g., users want to avoid either time it takes to connect to a database or remote service or just can't do it given where the CI runs.

I'm not entirely sure. Could you elaborate?

Perhaps a bit off-target from the question, but I think another must have feature (which I neglected to include in this first draft) will be some test annotation and filtering capabilities. At the most basic level, users need to be able to note "this test is slow" and only run the "slow" tests conditionally (not on every commit). A more advanced use case would be something like our workflows CI, which parallelizes the tests by tag - https://github.com/stjudecloud/workflows/blob/main/.github/workflows/pytest.yaml
I intend to more formally add this to the document when I get a chance.

@sckott commented Aug 29, 2025

  1. yes, the builtin tests. Right, an implementation detail. Just curious.

  2. I'd welcome some brainstorming on what this might look like!

I'll comment if I have any ideas.

  4. Say a user wants tests to run faster than it would take to run the entire workflow under test. To do that they could use mocks to supply what would be returned by the long-running operation they want to avoid waiting for.

Agree filtering would be really nice to have. For example, I use pytest -k ... quite a lot.

I don't know the Rust space very well, so these questions may be completely stupid, but here goes:

  • Does this PR re-invent existing tooling that could be used?
  • Relatedly, for the builtin test conditions, are they adopted from patterns used in other tooling or invented from whole cloth?

@a-frantz (Member Author) commented Aug 29, 2025

  4. Say a user wants tests to run faster than it would take to run the entire workflow under test. To do that they could use mocks to supply what would be returned by the long-running operation they want to avoid waiting for.

I welcome other people's thoughts on this, but my gut instinct is that this would be out of scope for this PR.

I am most immediately focused on enabling unit testing, where the core target is really the individual tasks which together compose a workflow, as opposed to testing workflows in their entirety. This is mentioned in my RFC, though maybe not elaborated on very far: I view testing workflows in their entirety as a sufficiently distinct use case. Most tasks can be configured to run fast and light; the same is often not true for workflows.

However the API differences between tasks and workflows are practically non-existent (they have the same specifications for inputs and outputs, the rest is reduced to implementation details), which means it would be odd to block workflows from being run in this framework.

So my initial answer to this question (though I could certainly have my mind changed) is that we are punting on this problem to deal with further down the line. I do want the problem of workflow validation to be better addressed, but I'm also trying not to bite off more than I can chew 😅

@a-frantz (Member Author):

  • Does this PR re-invent existing tooling that could be used?

This is partially addressed in the prior art portion of the RFC. There are some existing frameworks and tools, but IMO they aren't a great fit for what I'm setting out to achieve. I think we can build something better suited to WDL users than what currently exists. If there is existing tooling I have not mentioned that sufficiently addresses this case, I am not aware of it and would appreciate being corrected before I dedicate more of my time and energy to this 🤣

I will say that I have not investigated what Rust crates are currently out there for enabling something like what I'm setting out to build, but I would like to avoid re-inventing any wheels and will happily outsource any work I think could be better handled by someone else!

  • Relatedly, for the builtin test conditions are they adopted from patterns used in other tooling or invented from whole cloth?

The builtin conditions I wrote up are very similar to what pytest-workflow offers (relevant docs here) and I won't claim credit for inventing them.

The builtins I included "for initial release" are intended to be:

  1. able to cover a large swath of common test cases
  2. easy to implement, so that we can ship something out the door that will get used

As stated in Future Possibilities, the tests detailed are meant as a starting point, and they can and should be added to! But again, trying not to take too large a bite here 😂

]
prefix = "test.merged"
[merge_sam_files.tests]
custom = "quickcheck.sh"

Member:

Based on the below, it looks like anything found in custom is always passed a single argument (the outputs.json file), but that should be clarified here.

Member:

Agree, and just generally feel that you should explicitly include the command line that is going to be run (e.g., <custom_executable> inputs.json outputs.json)


out_json="$1"  # path to the outputs.json passed as the script's sole argument

out_bam=$(jq -r .bam "$out_json")

samtools quickcheck "$out_bam"

Member:

How does this work in a CI environment? Does the CI need to maintain and install a list of tools? Should the users be encouraged to run these in containers or should the test framework allow specification of a container?

Member Author:

Good question. In short, the answer to "Does the CI need to maintain and install a list of tools?" is yes.

Running sprocket test within a container would cause "docker in docker" headaches, but maybe the test framework could support spinning up a container just for the custom test execution? Although that might be more headache than it's worth.

I don't think expecting CI maintainers to get their dependencies in order before running sprocket test is a terrible dealbreaker. We're already doing that on workflows for pytest-workflow - https://github.com/stjudecloud/workflows/blob/main/.github/workflows/pytest.yaml#L37-L44
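
For illustration only (samtools and jq are just the tools the quickcheck.sh example above happens to call; the actual list depends on each repo's custom scripts), such a CI step might boil down to:

# Hypothetical CI step: install the external tools that custom tests shell out to,
# then run the suite. Package names and availability depend on the runner image.
sudo apt-get update
sudo apt-get install -y samtools jq
sprocket test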

Member:

In a future version, I wonder if it would be a good idea to allow custom bash invocations within a container. For example, you can elide the creation of a bash script altogether if you did something like

[[merge_sam_files.assertions.custom]]
container = "ubuntu:latest"
command = "samtools quickcheck $1"

Member Author:

My concern is that it adds complexity for both user and implementer. I'm not opposed to the idea in itself, but I question whether it's worth that complexity. I can add it to the future possibilities section, as I think it's worth tracking more formally than just this comment thread.

{ include_if_all = "0x0", exclude_if_any = "0x900", include_if_any = "0x0", exclude_if_all = "0x0" },
{ include_if_all = "00", exclude_if_any = "0x904", include_if_any = "3", exclude_if_all = "0" },
]
[[bam_to_fastq.matrix]]

Member:

Is there a reason to separate out some inputs into separate matrix tables? It seems like it would be much clearer to have a single matrix table like:

[[bam_to_fastq.matrix]]
bam = [
    "$FIXTURES/test1.bam",
    "$FIXTURES/test2.bam",
    "$FIXTURES/test3.bam",
]
bam_index = [
    "$FIXTURES/test1.bam.bai",
    "$FIXTURES/test2.bam.bai",
    "$FIXTURES/test3.bam.bai",
]
bitwise_filter = [
    { include_if_all = "0x0", exclude_if_any = "0x900", include_if_any = "0x0", exclude_if_all = "0x0" },
    { include_if_all = "00", exclude_if_any = "0x904", include_if_any = "3", exclude_if_all = "0" },
]
paired_end = [true, false]
retain_collated_bam = [true, false]
append_read_number = [true, false]
output_singletons = [true, false]

Member:

Wait, is this saying that every combination found under a matrix is run against every other matrix? So your example would yield 3 * 2 * 2 * 2 * 2 * 2 (96) tests?

Member:

I should have kept reading. I see now that you were demonstrating the second case with 96 tests.


Users will be able to annotate each test with arbitrary tags which will allow them to run subsets of the entire test suite. They will also be able to run the tests in a specific file, as opposed to the default `sprocket test` behavior which will be to recurse the `test` directory and run all found tests. This will facilitate a variety of applications, most notably restricting the run to only what the developer knows has changed and parallelizing CI runs.

We may also want to give some tags special meaning: it is common to annotate "slow" tests and to exclude them from runs by default, and we may want to reduce friction in configuring that case.

Member:

I think sprocket test should end up with arguments like --include-tag and --exclude-tag that would control which tags get included or excluded from the default set.
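
A hypothetical invocation of those flags (names are not final, and "slow" is just the example tag from the excerpt above):

# Run only the tests tagged "slow", e.g. on a nightly schedule
sprocket test --include-tag slow

# Run the default set minus anything tagged "slow", e.g. on every commit
sprocket test --exclude-tag slow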

@sckott commented Sep 3, 2025

@a-frantz

I welcome other people's thoughts on this, but my gut instinct is that this would be out of scope for this PR.

I think that makes sense.

Yeah, I'm not able to help on prior art - just thought I'd ask the question - I'm happy with your answer.

The part on custom tests https://stjude-rust-labs.github.io/rfcs/branches/rfcs/sprocket-test/0001-sprocket-test.html#custom-tests looks good to me. We're currently using Python for WDL tests, so that's perfect if Python scripts are supported.

@sckott commented Sep 3, 2025

If you don't have one already, will there be a repo for demonstration purposes where folks can see how you structure these tests for one or more WDLs?

@sckott commented Sep 3, 2025

(Disregard if not a good fit for this PR.) Maybe this is just an implementation detail, but I'm curious what the top of the output will look like. E.g., I like pytest's output, which has useful info about versions/etc. For sprocket test it would be nice to have maybe: sprocket version, Rust version?, test-wide config options, what else?

@a-frantz (Member Author) commented Sep 3, 2025

If you don't have one already, will there be a repo for demonstration purposes where folks can see how you structure these tests for one or more WDLs?

I just copied and pasted the existing examples into this branch - stjudecloud/workflows#263
Is something like this what you had in mind?

@sckott commented Sep 3, 2025

@a-frantz

Is something like this what you had in mind?

Yes, perfect

@acfoltzer (Contributor) left a comment:

Thank you for writing this up, Ari!

My comments fall into two categories: first, there are a few specific comments regarding alternatives I'd like to explore that don't put us on the hook for authoring and supporting an entirely new tool for an entirely novel (semantically, at least) test definition language. The remainder and majority of the comments are premised on us exploring those possibilities but deciding to stick with the TOML-based approach that you've outlined so far in the draft.

Please take a look at the higher-level approach comments first, down in the Drawbacks and Alternatives sections. I don't want to get you too bogged down in the details if you end up finding one of the alternatives compelling enough to change course.


The Sprocket test framework is primarily specified in TOML, which is expected to be within a `tests/` directory at the root of the WDL workspace. `sprocket test` does not require any special syntax or modification of actual WDL files, and any spec-compliant WDL workspace can create a `tests` directory respected by Sprocket.

The `tests/` directory is expected to mirror the WDL workspace, where it should have the same path structure, but with `.wdl` file extensions replaced by `.toml` extensions. The TOML files contain tests for tasks and workflows defined at their respective WDL counterpart in the main workspace. e.g. a WDL file located at `data_structures/flag_filter.wdl` would have accompanying tests defined in `tests/data_structures/flag_filter.toml`. Following this structure frees the TOML from having to contain any information about _where_ to find the entrypoints of each test. All the test entrypoints (tasks or workflows) in `tests/data_structures/flag_filter.toml` are expected to be defined in the WDL located at `data_structures/flag_filter.wdl`.

Contributor:

I do not think we should be separating test definitions from the code they're testing like this. Having parallel file system hierarchies has the advantage for writing a set of tests external to workflows that the test author does not control, but other than that it introduces headaches.

Just from an editor experience point of view, if I'm looking at /some/path/to/foo.wdl and want to create a corresponding test, I might have to manually create up to four directories (counting /tests/) instead of making a new file in the same directory, which is a whole lot simpler in Emacs at least.

Is "spec-compliant WDL workspace" a defined concept? We'd need to nail that down in order to resolve questions like "what happens if there are multiple tests/ directories in the hierarchy?"

More importantly, the further a test definition is from the code it is testing, the more likely it will be neglected when changes are made. If we're lucky, any deficiencies in the change will trigger a test failure that will make us come back around and improve the test as we fix the bug, but otherwise it's way too easy for something to stay out-of-sight and out-of-mind if it's shunted off in this way.

Member Author:

I do not think we should be separating test definitions from the code they're testing like this.

I agree with you after having this all pointed out. I can rework this to instead be based off a sibling file with the same basename but a different file extension.

Is "spec-compliant WDL workspace" a defined concept? We'd need to nail that down in order to resolve questions like "what happens if there are multiple tests/ directories in the hierarchy?"

No, it is currently not defined, although doc is based on the same concept. I think the definition would just be a directory which is recursively searched for .wdl files. My instinct here is that none of the sprocket commands should "go up" and search parent directories, but instead "look down" from the CWD (as many of the current commands do).


## E2E testing

As stated in the "motivation" section, this proposal is ignoring end-to-end (or E2E) tests and is really just focused on enabling unit testing for CI purposes. Perhaps some of this could be re-used for an E2E API, but I have largely ignored that aspect. (Also I have lots of thoughts about what that might look like, but for brevity will not elaborate further.)

Contributor:

If we can use this tool to run workflows, providing their inputs and making assertions about their outputs, what distinguishes this proposal from E2E testing?

Member Author:

I think a user could use this for E2E testing if they really wanted to, it just wouldn't be very ergonomic.

I think that the proposed TOML tests are going to be too limited for proper E2E testing, especially given that many bioinformatics tools/pipelines are non-deterministic. E2E testing using this framework would probably end up making hefty use of the custom feature, which is not the best UX.

To wax poetically, this framework can be used as a hammer if we treat all forms of testing as nails, but I'd argue that not all forms of testing are nails and different tools should be used.

Contributor:

Hm, do you have a reference for what constitutes "proper E2E testing"? It sounds like there are much more interesting properties involved than I was thinking about.

Member Author:

I suppose I don't really have any citations here 😅 It's something that we've been discussing as a desire to implement for a long time, but we haven't quite nailed down specifics. I think an example would be most illustrative:

We want to provide some form of evidence that version X of a pipeline (more concretely, a WDL workflow) is functionally equivalent to version Y. This problem of functional equivalence is a bit nebulous, and the current state of affairs is that fear of non-equivalence locks people into not updating their software. E2E testing (by my definition at least) is about proving functional equivalence by some metric(s). This would probably have to be defined per-workflow and be very domain-specific (i.e. not something easily generalized).

For the rnaseq-standard pipeline, an E2E test we've discussed is comparing some "truth" run of the pipeline's feature_counts output to the current commit's output, and considering an R^2 value above ~99% as being functionally equivalent.
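
For concreteness, and assuming R^2 here means the squared Pearson correlation between the per-feature counts of the two runs (call them x and y; this is my reading, not something we've pinned down), the criterion would be:

$$
R^2 = \left( \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} \right)^2 \ge 0.99
$$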

We can then show this evidence to biologists and say "have no fear! Please update your software 🙏 "

CC to see if @adthrasher is in agreement with the above.

Contributor:

Thanks, that makes a lot of sense! I think these properties fall under the classic V&V banner, though there's a lot of grey area and overlap between these terms.

The approach you describe with rnaseq-standard seems like something we should absolutely target. I'll have to look into literature about fuzzing probabilistic systems, but it could also be interesting to set up a test bench that continuously runs and compares the output for randomly-generated or randomly-perturbed inputs between tools.

Member Author:

Some bio context you may not be aware of yet:

Fuzzing biological data gets tricky fast. There's been some vague discussion of building out that functionality into this Rust Labs tool - https://github.com/stjude-rust-labs/fq?tab=readme-ov-file#generate

My understanding of the general consensus in bioinformatics is that "fake" data has very limited value. Grain of salt - I'm not super well read in this area - but as I understand it, the distrust of synthetic data goes further than the current methods not being robust enough; it's a matter of principle that synthetic data will never capture the "unknown unknowns" inherent to biological data.

A more common approach (again, from my limited POV) is to subset real biological data until it's useful for test purposes, but ultimately everything needs to run the gamut on real cohorts.

Fuzzed data would be helpful to have for this RFC, in that it's useful for integration testing and just ensuring nothing is broken. But for V&V or E2E or whatever we call it, I think synthetic data may not be suitable for convincing biologists that the methods are sound.

Not to say that what you're describing is a dead end, but more so that the scope of what it could be used for is maybe more limited than in other fields of computation.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

REVIEWERS: I've thought through quite a wide variety of implementations that have not made it into writing, and I'm not sure how valuable my musings on alternatives I _didn't like_ are. I can expand on this section if it would be informative.

Contributor:

Another alternative to explore is to make it possible for a test author to drive Sprocket from a pytest, Rust #[test], or other general-purpose programming environment. I believe this is the direction that pytest-wdl takes, and I think that is worth a closer look even though that particular project seems abandoned. Taking advantage of a widely-used and widely-supported testing framework would save us from having to support yet another CLI front-end, and would probably offer smoother integration into CI and reporting systems.

This would probably work best by creating a programmatic interface from Python to Sprocket, which would be a whole lot of work. That effort could pay off beyond testing, though, given Python's popularity in bioinformatics. Alternately, a wrapper around the Sprocket CLI probably wouldn't be that bad to write, even if doing an FFI sounds more fun 😉


## Custom tests

While the builtin test conditions should try to address many common use cases, users need a way to test for things outside the scope of the builtins (especially at launch, when the builtins will be minimal). There needs to be a way for users to execute arbitrary code on the outputs of a task or workflow for validation. This will be exposed via the `tests.custom` test, which will accept a name or array of names of user-written executables (most commonly shell or Python scripts) which are expected to be found in a `tests/custom/` directory. These executables will be invoked with a positional argument which is a path to the task or workflow's `outputs.json`. Users will be responsible for parsing that JSON and performing any validation they desire. So long as the invoked executable exits with a code of zero, the test will be considered passed.

Contributor:

Having the inputs.json available as well would be useful for writing custom assertions where the output is dependent on some property of the input.

Member Author:

Yes, I started considering adding either other positional args or ENV variables that could be useful to have access to. I think there's some additional info we may want to expose beyond just an outputs.json, but I'm not sure how important nailing all that down is for the RFC. This seems like something we can iterate on during development.

Contributor:

not sure how important nailing all that down is for the RFC. This seems like something we can iterate on during development

The thing I'd watch out for here is the potential for churn if positional arguments are changed around. Since we're targeting an experimental opt-in mode for now, that probably isn't a big threat.


## Custom tests

While the builtin test conditions should try to address many common use cases, users need a way to test for things outside the scope of the builtins (especially at launch, when the builtins will be minimal). There needs to be a way for users to execute arbitrary code on the outputs of a task or workflow for validation. This will be exposed via the `tests.custom` test, which will accept a name or array of names of user-written executables (most commonly shell or Python scripts) which are expected to be found in a `tests/custom/` directory. These executables will be invoked with a positional argument which is a path to the task or workflow's `outputs.json`. Users will be responsible for parsing that JSON and performing any validation they desire. So long as the invoked executable exits with a code of zero, the test will be considered passed.

Contributor:

I could also imagine providing a hook for a custom executable which could provide the inputs rather than asserting on the outputs. This would be a way for a user to write their own version of matrix testing by e.g. emitting an array of objects that each would be suitable to use as an input.json.
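
A minimal sketch of what such a hook could emit, assuming the framework simply consumed the executable's stdout as an array of candidate inputs (input names are borrowed from the bam_to_fastq example above; none of this is settled API):

#!/usr/bin/env bash
# Hypothetical inputs hook: print an array of objects, each usable as an inputs.json.
# The literal "$FIXTURES" placeholders would presumably be resolved by the framework,
# just as for inputs written directly in the test file.
cat <<'EOF'
[
  { "bam": "$FIXTURES/test1.bam", "paired_end": true },
  { "bam": "$FIXTURES/test2.bam", "paired_end": false }
]
EOF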

@a-frantz (Member Author) left a comment:

Some of the big picture Qs raised are being discussed in Slack. This response addresses some of the medium picture Qs 😅



## Test Data

Most WDL tasks and workflows have `File` type inputs and outputs, so there should be an easy way to incorporate test files into the framework. This can be accomplished with a `tests/fixtures/` directory in the root of the workspace which can be referred to from any TOML test. If the string `$FIXTURES` is found within a TOML string value within the `inputs` table, the correct path to the `fixtures` directory will be dynamically inserted at test run time. This avoids having to track relative paths from TOML that may be arbitrarily nested in relation to test data. For example, let's assume there are `test.bam`, `test.bam.bai`, and `reference.fa.gz` files located within the `tests/fixtures/` directory; the following TOML `inputs` table could be used regardless of where that actual `.toml` file resides within the WDL workspace:
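
The table itself falls outside the quoted hunk; purely as a sketch (the task name is a placeholder, and the fixture names are the ones listed above), it might look like:

# Illustrative only: "align_reads" is a hypothetical task name; $FIXTURES resolves
# to the tests/fixtures/ directory at test run time.
[align_reads.inputs]
bam = "$FIXTURES/test.bam"
bam_index = "$FIXTURES/test.bam.bai"
reference = "$FIXTURES/reference.fa.gz"
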
Member:

Splitting the fixtures into a tests/ directory while keeping the tests themselves alongside the WDL documents seems a bit strange to me. Just wanted to flag it here; maybe we should choose one or the other to prevent splitting context for users.

@a-frantz (Member Author) commented Oct 8, 2025

I've realized that the TOML format is incompatible with WDL due to it lacking an equivalent to WDL's None type. This is unfortunate, as I'm a fan of TOML and prefer writing TOML to any of the other similar formats.

The obvious alternatives are YAML and JSON, both of which I find painful to write by hand. Of the two, YAML is easier for humans to read and write, so I plan to have this test framework specified in YAML. It should be a pretty straightforward drop-in replacement for TOML in this RFC.

I do not intend to update this document at this point in time, as I think the RFC has more or less run its course, but I'm leaving this comment for posterity and also to solicit ideas for alternatives I may not have considered.

Alternatives I considered but decided against:

  • hacking TOML to add a null type we can convert to WDL None
  • rolling our own bespoke format for test specification

@a-frantz (Member Author) commented Dec 1, 2025

Please see PR stjude-rust-labs/sprocket#468 for the proposed YAML test syntax. The linked PR is meant to solidify the matrix computation from user-defined inputs. Given input YAML test definitions, it is possible to see all the executions which will be run (in a future version of the test command) by running sprocket dev test <path to WDL with accompanying test definitions>

@a-frantz (Member Author) commented Dec 1, 2025

Open questions not yet settled which will need to be addressed soon:

  1. Fixtures (referring to test files/directories to use as input to WDL executions)
    a. Where should fixtures reside?
    b. How should users refer to fixture paths from within the YAML they write?
  2. custom tests (providing a way for users to execute arbitrary binaries/scripts on the outputs of a given test)
    a. syntax for this is tentatively proposed to be an arbitrary shell command, with the environment inherited from the parent process and special substitution for the variables $INPUTS (<path to the test execution's inputs.json>) and $OUTPUTS (<path to the test execution's outputs.json>); see the sketch after this list
    b. where do the custom scripts/binaries for the custom tests reside? Whatever directory they are in will need to be added to the $PATH of the sub-process
  3. what "built-in" tests need to be present at first launch?
  4. how to annotate tests with tags that can be filtered/selected at the command line so only a subset runs?
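
A minimal sketch of 2a, reusing the quickcheck example from earlier in the thread (nothing about the final syntax is settled):

# $OUTPUTS is substituted with the path to the test execution's outputs.json;
# the command passes (exit code 0) iff the named BAM survives samtools quickcheck.
samtools quickcheck "$(jq -r .bam "$OUTPUTS")"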

If anyone has any thoughts on these open questions, I'd love to hear them! They will be addressed in future PRs which will link back to here.

@a-frantz (Member Author) commented Dec 1, 2025

Given the statement of intentions merged in #4, this RFC will be merged tomorrow, as all the above conversations seem to have run their course.

Future Sprocket PRs (like stjude-rust-labs/sprocket#468) will continue to link to this PR for context, although the details of sprocket test have already deviated by some degree from this document (most notably, swapping TOML for YAML).

New conversations may continue to use this PR as an anchor of sorts, particularly if this document has made any glaring or fundamental mistake. The remaining details (as outlined in this comment) would be best discussed on upcoming PRs which deal with their implementation rather than this PR, which will now function primarily as an archive of design goals and intentions.

@a-frantz a-frantz merged commit 4793746 into main Dec 2, 2025
1 of 2 checks passed
github-actions bot pushed a commit that referenced this pull request Dec 2, 2025
* Create 0001-sprocket-test.md

* Update ci-build.sh

* Update 0000-placeholder.md

* docs: TODO -> REVIEWERS

* docs: link to template

* revise: elaborate on pytest-workflow

* feat: more prior art

* feat: elaborate on some other future possibilities

* feat: test filtering

* feat: future possibility: caching

* feat: custom test details

* feat: more discussion of the custom test design

* chore: typos, rephrasing awkward clauses, etc

* chore: review feedback

* revise: "test" -> "assertion"

* chore: review feedback

elaborating on pytest-workflow and mentioning adoption by WDL spec

* ci: remove bad key
@a-frantz a-frantz deleted the rfcs/sprocket-test branch December 3, 2025 13:27