
Finalize workflow output definition #5103

@bentsherman

Description

Continuation of #4670

Enumerating the proposed changes that we've collected so far:

  • Support additional index file formats (json, yaml)
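
    A sketch of how this might look, assuming the index file format would simply be inferred from the file extension (not yet decided):

    output {
      'foo' {
        index {
          path 'samples.json'  // or 'samples.yaml'
          // ...
        }
      }
    }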

  • Generate output schema. Essentially a list of index file schemas. Should be generated on each run or via some separate command. Should eventually be used with parameter schema for chaining pipelines. See DSL2+ nf-core/fetchngs#312 for a concrete example (schema_outputs.yml). -> Generate output schema from output definition #5213
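
    For illustration only, the generated schema might look something like this (the exact structure is to be defined in #5213; these field names are hypothetical):

    # schema_outputs.yml (hypothetical structure)
    outputs:
      - name: foo
        index: index.csv
        fields:
          - id
          - fastq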

  • Dynamic path mapping. Allow the path option in target definition to be a closure:

    workflow {
      main:
      ch_foo = Channel.of( [ [id: 1], file('foo.fastq') ] )
    
      publish:
      ch_foo >> 'foo'
    }
    
    output {
      'foo' {
        path { meta, fastq -> "foo/${meta.id}" }
      }
    }

    Note that the dynamic path need only define the directory, not the full filename. Since a channel value may contain multiple files, an alternative syntax could be to provide the file and the channel value to the closure, so that it's clear which file is being published:

        path { file, value -> "foo/${value[0].id}" }

  • Move publish options to config. Publish options like directory and mode typically need to be configurable by the user, which currently would require you to define a special param for each option. Therefore it makes more sense for them to be config settings rather than pipeline code:

    // nextflow.config
    workflow {
      output {
        directory = 'results'
        mode = 'copy'
    
        withTarget:'foo' {
          mode = 'link'
        }
      }
    }
    
    // main.nf
    output {
      'foo' {
        index {
          path 'index.csv'
          // ...
        }
      }
    }

    The output block should be used only to define index files (i.e. the output schema). In other words, the pipeline code should define what is published and the config should define how it is published.

    For the output directory, it has also been proposed to provide a CLI option, e.g. -output-dir, and a shorter config option, e.g. outputDir. The output directory would then be available in the pipeline code as part of the workflow metadata, i.e. workflow.outputDir.
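
    A sketch of how these pieces would fit together, assuming the proposed names are adopted:

    // nextflow.config
    outputDir = 'results'

    // equivalent CLI override:
    //   nextflow run main.nf -output-dir results

    // main.nf: the resolved directory is exposed as workflow metadata
    workflow.onComplete {
      println "Outputs written to ${workflow.outputDir}"
    }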

  • Remove publish section from process definition. Still under discussion. The rationale for the process publish section was to provide some sensible defaults which can be overridden. However, I've come to think that it only makes it harder to determine what is being published: instead of enumerating the publish targets in one place, they are scattered throughout the pipeline code. Also, process definitions are abstracted from any particular pipeline, so it doesn't make much sense to put pipeline-specific details like params and publishing in the process definition.

    A better way to give some sensible defaults for a process would be to write an entry workflow in the process definition that gives an example usage:

    process foo {
      // ...
    }
    
    workflow {
      main:
      params.input = '...'
      foo( params.input )
    
      publish:
      foo.out >> 'foo'
    }

    This workflow will be ignored when importing the process as a module, but it provides a concrete example and can even be used to run the process directly. In fact it could even replace the custom nf-test DSL eventually.

  • Allow publish section only in entry workflow. I am less certain about this one but wanted to lay out the case for it. Building on the previous item, having publish sections potentially spread across several workflows in a pipeline makes it hard to see everything that is being published. Instead, perhaps named workflows should only be allowed to emit outputs, and only the entry workflow should be able to publish them. As with the previous point, you could write an entry workflow for each named workflow which gives some example publishing (and allows you to run the workflow as a standalone pipeline).

    This idea is inspired by a principle in software engineering that side effects (a.k.a. I/O, publishing) should be pushed to the "boundaries" of the code, to make it more readable and testable, and to make it easier to swap out different I/O strategies (file copy, database insert, API call, etc).

    At the same time, I appreciate that publishing from a named workflow is a convenient shorthand, especially when you consider having to propagate outputs back up through potentially several nested workflows. But I wonder if being more strict here would be better in the long run. The example entry workflow is something that will be written anyway, both for testing and to run workflows as standalone pipelines.
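
    To make this concrete, a named workflow would only emit its outputs, and the entry workflow would decide what to publish (a sketch; the names are illustrative):

    workflow align {
      main:
      ch_bam = Channel.of( file('sample.bam') )  // placeholder

      emit:
      bam = ch_bam
    }

    workflow {
      main:
      align()

      publish:
      align.out.bam >> 'alignments'
    }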

  • Runtime enhancements

    • report target names in the output block that weren't defined in a process or workflow
    • prevent leading or trailing slashes
    • detect published file collisions at runtime
    • detect outputs that aren't published or used by downstream processes

  • Include output targets in inspect command. Similar to how inspect lists all processes with their resolved directives, it could also show the resolved list of output targets. Not as essential if we implement some of the above points, but still useful for things like resolving params.
