feat: coffea.compute (WIP) #1470

base: master
Conversation
is it faster than generators? no idea
This is great to see @nsmith-. I've worked with the current executor interface quite a bit now, and you've added many high-quality improvements in this new executor implementation.

These are some comments about ideas for improvement:

```python
class RunningBackend(Protocol):
    """A RunningBackend represents an active backend context.

    This is the type returned by Backend.__enter__.
    It may be distinct from Backend to separate resource management
    from computation submission.
    """

    def compute(self, item: Computable[InputT, ResultT], /) -> Task[InputT, ResultT]:
        return self.reduce(self.map(item))

    def map(self, item: Computable[InputT, ResultT], /) -> Iterable[Task[InputT, ResultT]]:
        ...  # e.g. `return (f() for f in item)`

    def reduce(self, tasks: Iterable[Task[InputT, ResultT]], /) -> Task[InputT, ResultT]:
        ...  # e.g. `return sum(tasks, EmptyResult())`
```

Then we can implement custom backends and just swap out the `map`/`reduce` implementations.

```python
out, unfinished, status = task.result()  # non-blocking (could e.g. filter the tasks by status without waiting for them)
if status == TaskStatus.COMPLETE:
    assert len(unfinished) == 0
    print("Success")
```

Some minor comments:

I would like to contribute to this PR (and with that deprecate the development of #1468 to add this here instead). I think I can add the futures and dask backend, and implement two strategies for both:
Looks nice! How do you picture the backends interacting with the computable objects? For example, if I want a backend that dynamically modifies the chunksize, would it be OK to have an option where the backend gets the computable object rather than the iterator? That way, the backend can modify attributes of the computable object that may change the size of subsequent work elements.
@pfackeldey thanks for the quick turnaround!
So, the idea is that
I had originally thought we would handle skimming / writing in a function wrapper and simply return a list of what was written as the result.

To your general suggestion on splitting map/reduce, currently the

```python
from typing import Callable, Generic, Iterable, Iterator, Protocol, Self, TypeVar

from coffea.compute.protocol import (
    Addable,
    Computable,
    InputT,
    ResultT,
    WorkElement,
)


class Future(Generic[ResultT]):
    def __init__(self, item: WorkElement[InputT, ResultT]) -> None:
        "This should schedule the work element for execution"
        ...

    def result(self) -> ResultT: ...

    def add_done_callback(self, fn: Callable[[Self], None]) -> None: ...


R = TypeVar("R", bound=Addable)


class ResultFuture(Generic[R]):
    def add(self, item: Future[R]) -> None: ...

    def result(self) -> R: ...


class RunningBackend(Protocol):
    def compute(self, item: Computable[InputT, ResultT], /) -> ResultFuture[ResultT]:
        return self.reduce(self.map(item))

    def map(self, item: Computable[InputT, ResultT], /) -> Iterator[Future[ResultT]]:
        "Example implementation"
        return (Future(f) for f in item)

    def reduce(self, futures: Iterable[Future[ResultT]], /) -> ResultFuture[ResultT]:
        "Example implementation"
        out = ResultFuture[ResultT]()
        for f in futures:
            f.add_done_callback(out.add)
        return out
```

The tricky bit is for distributed backends: how will the data of the Future's result get shipped to the ResultFuture?
I would implement pre-loading by subclassing the
For now,
The happy path should be simple:
Agreed. That's the idea behind
Yes! Feel free to make a PR to this PR. Though, maybe some more discussion on map/reduce is warranted before committing too much effort.
@pfackeldey The JSON-serializable metadata is handled via the pydantic filespecs; currently there is an InputFiles component to a dataset which gets replaced by differently typed PreprocessedFiles (still some differentiation and lfn/pfn handling to add): #1398. Also @nsmith- @lgray @ikrommyd, would we all agree on Fileset -> DataGroup?
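(A hypothetical sketch of the layering described above; the real models live in #1398 and may differ:)

```python
from pydantic import BaseModel


class InputFiles(BaseModel):
    """Files as specified by the user, before preprocessing."""

    paths: list[str]


class PreprocessedFiles(BaseModel):
    """Files after preprocessing, with per-file step boundaries known."""

    paths: list[str]
    steps: dict[str, list[tuple[int, int]]]  # path -> (start, stop) chunks


class Dataset(BaseModel):
    # the InputFiles component is replaced by PreprocessedFiles after preprocessing
    files: InputFiles | PreprocessedFiles
```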
@btovar thanks for bringing up the dynamic chunking; I now remember you had implemented that at some point in the TaskVine Executor. So then we re-conceptualize as follows:

```python
from typing import Generator, Protocol, TypeAlias

from coffea.compute.protocol import (
    Computable,
    EmptyResult,
    InputT,
    ResultT,
    WorkElement,
)


class SizedWorkElement(WorkElement[InputT, ResultT], Protocol):
    def __len__(self) -> int:
        "Return the size of this work element in some unit (e.g., number of events)"
        ...


NewSizeRequest: TypeAlias = int


class ResizableComputable(Computable[InputT, ResultT], Protocol):
    def generate(
        self,
    ) -> Generator[SizedWorkElement[InputT, ResultT], NewSizeRequest, None]:
        "Generate work elements, possibly adapting their size based on external factors"
        ...


def compute_now(items: ResizableComputable[InputT, ResultT]) -> ResultT | EmptyResult:
    out = EmptyResult()
    work_gen = items.generate()
    # Let it tell us the initial size
    work_element = next(work_gen, None)
    if work_element is None:
        return out
    while True:
        result = work_element()
        out += result
        # Here we could adapt the size of future work elements based on performance metrics
        # For simplicity, we just request the same size
        try:
            work_element = work_gen.send(len(work_element))
        except StopIteration:
            break
    return out
```

(A real implementation would have to wrap this into a Task.) This would require a bit of re-imagining how the
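To make the send-based resizing concrete, a resizable computable over a single file might look like the following (an illustrative sketch; the work element's `__call__` body is a placeholder):

```python
from dataclasses import dataclass


@dataclass
class StepWorkElement:
    """A SizedWorkElement over a half-open event range of one file."""

    path: str
    step: tuple[int, int]

    def __len__(self) -> int:
        return self.step[1] - self.step[0]

    def __call__(self):
        ...  # placeholder: read events `self.step` from `self.path` and process them


@dataclass
class ResizableFileSteps:
    """A ResizableComputable whose chunk size can be steered via generator.send()."""

    path: str
    n_events: int
    initial_size: int

    def generate(self):
        start, size = 0, self.initial_size
        while start < self.n_events:
            stop = min(start + size, self.n_events)
            requested = yield StepWorkElement(self.path, (start, stop))
            if requested is not None:
                size = requested  # consumer requested a new chunk size
            start = stop
```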
On preloading columns, one thing I have been thinking about is making the form required for preprocessed files. If you guarantee that, then at the file or task-element level (so long as you have the DatasetSpec, which stores the union form), you can store indices into the form for preloading, or just record which columns/branches are in the file (since the union form has the superset). The latter could be useful e.g. to filter on files where trigger branches in CMS have evolved; I think that's a potential activation-energy barrier for debugging that we could smooth over with such a mechanism. We'd have to pass the DatasetSpec down to the sub-element, but it's metadata; or, in line with resizable compute task elements, updated metadata could be retrieved via messages.
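(A hypothetical sketch of the branch-presence idea, with plain Python stand-ins for the form machinery:)

```python
# The dataset-level union form yields the superset of branch names; each file
# records indices into it for the branches it actually contains.
union_branches = ["pt", "eta", "HLT_IsoMu24", "HLT_IsoMu27"]

file_branches = {
    "file1.root": {0, 1, 2},  # indices into union_branches
    "file2.root": {0, 1, 3},  # trigger menu evolved in this era
}


def files_with_branch(branch: str) -> list[str]:
    """Filter files on the presence of a branch, e.g. an evolved trigger path."""
    idx = union_branches.index(branch)
    return [path for path, present in file_branches.items() if idx in present]


assert files_with_branch("HLT_IsoMu24") == ["file1.root"]
```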
Hi @nsmith-,
I missed that you define this using
My main motivation would be to reuse logic in
I don't think it is necessary to concretely define different Future types. We just need a single Future protocol and every backend implements that based on its own concept of futures, e.g.
Isn't every backend we have currently using futures (except for IterativeExecutor)? I was just imagining something like this, which would separate a bit the Task/Future concept (that every backend already provides afaik) from additional info, e.g. failures/continuation, using some sort of `State` object:

```python
import operator
import typing as tp
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, Iterator, Self

from dask.distributed import as_completed

# Computable, EmptyResult, InputT, ResultT, and RunningBackend as defined in
# the protocol snippets earlier in this thread


class Future(tp.Protocol[ResultT]):
    """This protocol supports the minimal subset of features we need from
    https://docs.python.org/3/library/concurrent.futures.html#future-objects
    (and other backend future types)."""

    def result(self) -> ResultT: ...

    def add_done_callback(self, fn: Callable[[Self], None]) -> None: ...


@dataclass(frozen=True, slots=True)
class State(Generic[InputT, ResultT]):
    out: Future[ResultT]
    failures: Computable[InputT, ResultT]


class SimpleDaskBackend(RunningBackend):
    def compute(self, item: Computable[InputT, ResultT], /) -> State[InputT, ResultT]:
        return self.reduce(self.map(item))

    def map(self, item: Computable[InputT, ResultT], /) -> Iterator[Future[ResultT]]:
        return self.client.map(lambda f: f(), item)

    def reduce(self, futures: Iterable[Future[ResultT]], /) -> State[InputT, ResultT]:
        out_f = self.client.submit(lambda: EmptyResult())
        failures = []
        for f in as_completed(futures):
            # handle failures
            if issubclass(f.type, BaseException) and f.key.startswith("Processor-"):
                failures.append(f)  # <- translate them back to work elements, e.g. like `Continuation` does
            else:
                out_f = self.client.submit(operator.add, out_f, f)
        return State(out_f, failures)


class FancyDaskBackend(SimpleDaskBackend):
    def reduce(self, futures: Iterable[Future[ResultT]], /) -> State[InputT, ResultT]:
        """A fancier reduce implementation"""
        ...
```
Hello, I have skimmed through the discussion (not looked at actual code) and this is all looking very good! I was wondering about two things for starters. When I originally heard the term

Regarding filesets, I haven't really been following the fileset spec/pydantic stuff, but I saw this in your example above:

```python
dataset = Dataset(
    files=[
        File(path="file1.root", steps=[(0, 100), (100, 200), (200, 300)]),
        File(path="file2.root", steps=[(0, 100), (100, 200), (200, 300)]),
    ],
    metadata=ContextDataset(dataset_name="singlemuon", cross_section=None),
)
```

Will the ability to pass good old fashioned JSON filesets remain? People tend to have such filesets already stored somewhere.
The original seed idea was indeed a function akin to dask.compute.

For jsons, no, we should not allow that. Part of this was all motivated by not having to handle 3-4 different levels of dictionary nesting that may or may not (accidentally) get passed into a function (which is the current level of inception-ness in coffea's Filesets, plus an additional layer in the column-joining paradigm). And we shouldn't handle it because it's a one-liner to convert to the pydantic input types:

coffea/tests/test_dataset_tools.py Line 327 in 95666fd

Literally just wrap your python dict in FilesetSpec(&lt;dict&gt;) or DatasetSpec(&lt;dict&gt;) etc. If anything fails, people should update, then save the pydantic version (i.e. that should be a one-time conversion).
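(For example, loading an existing JSON fileset; the exact import path for `FilesetSpec` is an assumption, see the test linked above for real usage:)

```python
import json

from coffea.dataset_tools import FilesetSpec  # import path assumed

with open("my_fileset.json") as f:
    fileset = FilesetSpec(json.load(f))
```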
Oh yeah, that was a mistake on my part. I did not mean to literally allow jsons, but an easy way (one-liner) for people to get a proper fileset structure from an existing json that they have. But you're saying that this will be supported.
I thought about that originally, but context managers are very nice things to have. The bugs I see in clearing the progress bar, for example, are probably a result of the current lack of them, and it would be nice to get past those once and for all.
My preference would be to capture any backend-agnostic reduce logic in

What you have looks like most of an initial dask backend, at least once the

edit: note, the

coffea/src/coffea/compute/protocol.py Lines 84 to 93 in d70db53
Assuming #1444 will be merged before this one
Sounds good, I can give it a try 👍
Lots of material here, this is great to see! There are a few API layers and I suspect the
Yeah, maybe to riff on that example, it might look something like:

```python
dataset = InputDataset(
    files=["file1.root", "file2.root"],
    metadata=ContextDataset(dataset_name="my_dataset", cross_section=1.0),
)


def get_histo_metadata(file_element: OpenROOTFile) -> MyMetadata:
    # Extract some metadata from the file
    histo = file_element.root_dir["histo"].values()
    return MyMetadata(histo_data=list(histo))


def do_analysis(input: ContextInput[EventsArray, MyMetadata]) -> int:
    # Perform some analysis using the events and metadata
    return len(input.data) + sum(input.context.histo_data)


with Backend() as backend:
    preparer = PrepareFile(metadata_extractor=get_histo_metadata)
    prepared = backend.compute(dataset.map_files(preparer)).result()
    assert isinstance(prepared, PreparedDataset)
    prepared.to_json("processable_dataset.json")

    prepared = PreparedDataset.from_json("processable_dataset.json")
    task = backend.compute(prepared.map_steps(do_analysis))
    out = task.result()
    coffea.util.save(out, "output.coffea")
```

where the definitions of all the extra classes are as below:

```python
from dataclasses import dataclass
from typing import Any, Callable

import fsspec
from pydantic import BaseModel

import coffea.util
from coffea.compute.context import Context, ContextDataElement, ContextInput
from coffea.compute.data.inputdata import InputDataset
from coffea.compute.data.processable_data import (
    ContextDataset,
    StepElement,
    StepIterable,
)
from coffea.compute.data.rootfile import OpenROOTFile
from coffea.compute.func import EventsArray
from coffea.compute.protocol import Backend


@dataclass(frozen=True)
class MyMetadata(Context):
    histo_data: list[int]


class FileInfo(BaseModel):
    file_path: str
    metadata: MyMetadata
    # lineage/provenance?

    def iter_steps(self):
        # For simplicity, assume a single step covering the whole file
        yield ContextDataElement(StepElement((0, -1), self.file_path), self.metadata)


class PreparedDataset(BaseModel, StepIterable[MyMetadata]):
    files: list[FileInfo]

    def iter_steps(self):
        for fileinfo in self.files:
            yield from fileinfo.iter_steps()

    def __add__(self, other: "PreparedDataset") -> "PreparedDataset":
        return PreparedDataset(files=self.files + other.files)

    def to_json(self, path: str) -> None:
        with fsspec.open(path, mode="w", compression="infer") as f:
            f.write(self.model_dump_json())  # type: ignore

    @classmethod
    def from_json(cls, path: str) -> "PreparedDataset":
        with fsspec.open(path, mode="r", compression="infer") as f:
            return cls.model_validate_json(f.read())  # type: ignore


@dataclass
class PrepareFile:
    metadata_extractor: Callable[[OpenROOTFile], MyMetadata]

    def __call__(self, item: ContextInput[OpenROOTFile, Any]) -> PreparedDataset:
        metadata = self.metadata_extractor(item.data)
        return PreparedDataset(
            files=[FileInfo(file_path=item.data.file_path, metadata=metadata)],
        )
```

I'm not sure how much of that would be user-driven, it's a lot of classes to define, but it may be useful for framework developers that live on top of coffea.
🗒️ Status of this PR is tracked at #1482 since otherwise the discussion will get quite long.
This pull request introduces the initial implementation of the new `coffea.compute` module, envisioned as the successor of the Processor/Executor/Runner interface. Currently it is all new additions in the `src/coffea/compute/` directory.

To get started, the best place to look is at `protocol.py`, which is sketched below:
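(A condensed, illustrative sketch assembled from the names used elsewhere in this thread; `protocol.py` itself is authoritative.)

```python
from typing import Iterator, Protocol, TypeVar

InputT = TypeVar("InputT")
ResultT = TypeVar("ResultT")


class Addable(Protocol):
    def __add__(self, other): ...


class WorkElement(Protocol[InputT, ResultT]):
    """A single unit of work: calling it produces a partial result."""

    def __call__(self) -> ResultT: ...


class Computable(Protocol[InputT, ResultT]):
    """An iterable of work elements; defines the map part of a computation."""

    def __iter__(self) -> Iterator[WorkElement[InputT, ResultT]]: ...


class Task(Protocol[InputT, ResultT]):
    """A handle (really, a future) to a running computation."""

    def result(self):
        """Non-blocking; returns (partial_result, unfinished, status)."""
        ...


class RunningBackend(Protocol):
    """An active backend context, as returned by Backend.__enter__."""

    def compute(self, item: Computable[InputT, ResultT], /) -> Task[InputT, ResultT]: ...


class Backend(Protocol):
    def __enter__(self) -> RunningBackend: ...

    def __exit__(self, *exc) -> None: ...
```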
This protocol allows us to keep things very modular.

Backends
Backends are essentially the replacement for Executors. The `Computable` defines the map part, and the `ResultT`'s add function defines reduction. This is actually very similar to what we have in BaseExecutor:

coffea/src/coffea/processor/executor.py Lines 351 to 359 in f7e1249
So I'm hoping it is enough for our use. Already with the `partial_result` we are extending what we could do before. The other big change is that `RunningBackend.compute` is not blocking, but rather returns a `Task` handle (really, a future). A trivial implementation of a backend would be:
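(An illustrative reconstruction; `backends/threaded.py` mentioned below is the real reference implementation. `CompletedTask` and the `TaskStatus` location are assumptions.)

```python
from dataclasses import dataclass

# EmptyResult and TaskStatus as in coffea.compute.protocol (exact location of
# TaskStatus assumed)


@dataclass
class CompletedTask:
    """Hypothetical Task-like handle around an already-finished result."""

    out: object

    def result(self):
        # (partial_result, unfinished, status), as used elsewhere in this thread
        return self.out, (), TaskStatus.COMPLETE


class TrivialBackend:
    """Runs every work element eagerly, in order, on the calling thread."""

    def __enter__(self) -> "TrivialBackend":
        return self

    def __exit__(self, *exc) -> None:
        pass

    def compute(self, item, /):
        out = EmptyResult()
        for work_element in item:
            out = out + work_element()
        return CompletedTask(out)
```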
To help prototype, a complete single-threaded backend is implemented in `backends/threaded.py`, and a generic backend test is implemented in `test_compute_backends.py`.

Error handling
Although not yet part of the protocol, the SingleThreadedBackend already handles errors and enables retries, using a wrapper around the WorkElement to keep track of exceptions. See `errors.py` for details. A TODO for the protocol is to define exactly what the user interface should be for fetching the exceptions from a failed task. A quick example of re-running a failed job is as follows:
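(A sketch based on the `task.result()` triple shown earlier in this thread; the exact status values are assumptions.)

```python
out, unfinished, status = task.result()
if status != TaskStatus.COMPLETE:
    # `unfinished` holds the not-yet-successful work elements, so it can be
    # resubmitted directly to retry only the failed work
    retry_task = backend.compute(unfinished)
    more, still_unfinished, status = retry_task.result()
    out = out + more
```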
Data objects

To build a `Computable`, we need a user interface that binds some function to some data. Hence the `func.py` and `data.py` modules. So far, `func.py` is not very complex, but this is eventually the place where the details of preprocessing (preparing?) would go, as well as whatever NanoEvents adapters are needed. For the data, the structure is heavily inspired by, though far from feature parity with, #1453. To apply a function, the user interface looks something like:
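(A condensed sketch consistent with the fuller example earlier in this thread; `my_process` and `MyResult` are placeholders.)

```python
def my_process(events: ContextInput[EventsArray, ContextDataset]) -> MyResult:
    ...  # user analysis code operating on one step of events


with Backend() as backend:
    task = backend.compute(dataset.map_steps(my_process))
    out = task.result()
```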
All this Context business is a way to get typed metadata into the user function. Details are in `context.py`. Of course, we wrap this so the classic interface still works:

Abstractions
Power users may define new Dataset and File classes. Two ABCs exist to define the protocol for iterating over steps (chunks) or files:
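(A hypothetical example of the step-iteration side, modeled on the `StepIterable` usage in the example earlier in this thread.)

```python
from coffea.compute.context import ContextDataElement
from coffea.compute.data.processable_data import StepElement, StepIterable


class MyCustomDataset(StepIterable[None]):
    """Fixed-size steps over files with a known, uniform event count."""

    def __init__(self, paths: list[str], n_events: int, step_size: int) -> None:
        self.paths = paths
        self.n_events = n_events
        self.step_size = step_size

    def iter_steps(self):
        for path in self.paths:
            for start in range(0, self.n_events, self.step_size):
                stop = min(start + self.step_size, self.n_events)
                # no per-step metadata here, hence the None context
                yield ContextDataElement(StepElement((start, stop), path), None)
```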
Preprocessing
The goal, much like in the executor case, is to re-use the same protocol also for preprocessing. A sketch of what this might look like is in `test_compute_data.py::test_prepare`. Some more work needs to be done to understand the UI for joining the prepared files with the metadata of the input specification. For now, to get the ball rolling, there is a generic `GroupedResult` (see `group.py`) that uses information in the `Context` to derive a key for the addable output dictionary.

This is another departure from Processor: we try to handle the hierarchy of `DataGroup -> Dataset -> Result` for the user, letting them define the process function without the need to build a Result object that splits out the different datasets into dictionaries. The process function can still respond to the dataset name via the context, with the nice bonus that it is indicated in the type signature: a process function like the first sketched below will not fill any dataset-dependent axes, whereas the second could.
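(Illustrative signatures; `MyResult` and the exact context spellings are placeholders.)

```python
def process_plain(events: ContextInput[EventsArray, None]) -> MyResult:
    ...  # no dataset information in the context: cannot fill a dataset axis


def process_per_dataset(events: ContextInput[EventsArray, ContextDataset]) -> MyResult:
    ...  # events.context.dataset_name is available to fill a dataset axis
```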
Typing
Clearly there is a lot of fun to be had here! For example, in my IDE I get hints like:

i.e., the return type of the user function makes it all the way through the backend.
What's next
Status of this PR tracked at #1482
There are about 20 TODOs and several design questions left. Please feel free to comment on this draft PR.
@btovar @pfackeldey @ikrommyd @lgray @alexander-held @NJManganelli in particular I'd be interested in your thoughts.