
Finalize workflow output definition #5103

@bentsherman

Description

Continuation of #4670

Enumerating the proposed changes that we've collected so far:

  • Support additional index file formats (json, yaml)
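
    A sketch of how this might look, assuming the index file format would simply be inferred from the file extension (not yet decided):

    output {
      'foo' {
        index {
          path 'samples.json'  // or 'samples.yaml'
          // ...
        }
      }
    }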

  • Generate output schema. Essentially a list of index file schemas. Should be generated on each run or via some separate command. Should eventually be used with parameter schema for chaining pipelines. See DSL2+ nf-core/fetchngs#312 for a concrete example (schema_outputs.yml). -> Generate output schema from output definition #5213
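
    For illustration only, the generated schema might look something like this (the exact structure is to be defined in #5213; these field names are hypothetical):

    # schema_outputs.yml (hypothetical structure)
    outputs:
      - name: foo
        index: index.csv
        fields:
          - id
          - fastq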

  • Dynamic path mapping. Allow the path option in target definition to be a closure:

    workflow {
      main:
      ch_foo = Channel.of( [ [id: 1], file('foo.fastq') ] )
    
      publish:
      ch_foo >> 'foo'
    }
    
    output {
      'foo' {
        path { meta, fastq -> "foo/${meta.id}" }
      }
    }

    Note that the dynamic path need only define the directory, not the full filename. Since a channel value may contain multiple files, an alternative syntax could be to provide the file and the channel value to the closure, so that it's clear which file is being published:

        path { file, value -> "foo/${value[0].id}" }

  • Move publish options to config. Publish options like directory and mode typically need to be configurable by the user, which currently would require you to define a special param for each option. Therefore it makes more sense for them to be config settings rather than pipeline code:

    // nextflow.config
    workflow {
      output {
        directory = 'results'
        mode = 'copy'
    
        withTarget:'foo' {
          mode = 'link'
        }
      }
    }
    
    // main.nf
    output {
      'foo' {
        index {
          path 'index.csv'
          // ...
        }
      }
    }

    The output block should be used only to define index files (i.e. the output schema). In other words, the pipeline code should define what is published and the config should define how it is published.

    For the output directory, it has also been proposed to provide a CLI option, e.g. -output-dir, and a shorter config option, e.g. outputDir. The output directory would then be available in the pipeline code as part of the workflow metadata, i.e. workflow.outputDir.
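
    A sketch of how these pieces would fit together, assuming the proposed names are adopted:

    // nextflow.config
    outputDir = 'results'

    // equivalent CLI override:
    //   nextflow run main.nf -output-dir results

    // main.nf: the resolved directory is exposed as workflow metadata
    workflow.onComplete {
      println "Outputs written to ${workflow.outputDir}"
    }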

  • Remove publish section from process definition. Still under discussion. The rationale for the process publish section was to provide some sensible defaults which can be overridden. However, I've come to think that it only makes it harder to determine what is being published: instead of enumerating the publish targets in one place, they are scattered throughout the pipeline code. Also, process definitions are abstracted from any particular pipeline, so it doesn't make much sense to put pipeline-specific details like params and publishing in the process definition.

    A better way to give some sensible defaults for a process would be to write an entry workflow in the process definition that gives an example usage:

    process foo {
      // ...
    }
    
    workflow {
      main:
      params.input = '...'
      foo( params.input )
    
      publish:
      foo.out >> 'foo'
    }

    This workflow will be ignored when importing the process as a module, but it provides a concrete example and can even be used to run the process directly. In fact it could even replace the custom nf-test DSL eventually.

  • Allow publish section only in entry workflow. I am less certain about this one but wanted to lay out the case for it. Building on the previous item, having publish sections potentially spread across several workflows in a pipeline makes it hard to see everything that is being published. Instead, perhaps named workflows should only be allowed to emit outputs, and only the entry workflow should be able to publish them. As with the previous point, you could write an entry workflow for each named workflow which gives some example publishing (and allows you to run the workflow as a standalone pipeline).

    This idea is inspired by a principle in software engineering that side effects (a.k.a. I/O, publishing) should be pushed to the "boundaries" of the code, to make it more readable and testable, and to make it easier to swap out different I/O strategies (file copy, database insert, API call, etc).

    At the same time, I appreciate that publishing from a named workflow is a convenient shorthand, especially when you consider having to propagate outputs back up through potentially several nested workflows. But I wonder if being more strict here would be better in the long run. The example entry workflow is something that will be written anyway, both for testing and to run workflows as standalone pipelines.
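
    To make this concrete, a named workflow would only emit its outputs, and the entry workflow would decide what to publish (a sketch; the names are illustrative):

    workflow align {
      main:
      ch_bam = Channel.of( file('sample.bam') )  // placeholder

      emit:
      bam = ch_bam
    }

    workflow {
      main:
      align()

      publish:
      align.out.bam >> 'alignments'
    }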

  • Runtime enhancements

    • report target names in the output block that weren't defined in a process or workflow
    • prevent leading or trailing slashes
    • detect published file collisions at runtime
    • detect outputs that aren't published or used by downstream processes

  • Include output targets in inspect command. Similar to how inspect lists all processes with their resolved directives, it could also show the resolved list of output targets. Not as essential if we implement some of the above points, but still useful for things like resolving params.
