Commit 8667cda

fix: Enable incremental syncing for workflow_runs stream (#456)
## Problem

The `workflow_runs` stream currently fetches all historical workflow runs instead of respecting the `start_date` configuration. This causes significant performance issues for repositories with thousands of workflow runs, because the tap processes every page of historical data on every run.

## Root Cause

The `WorkflowRunsStream` was configured with:

- `replication_key = None` (inheriting `updated_at` from parent `RepositoryStream`)
- `ignore_parent_replication_key = False`
- No custom URL parameter handling, so it falls through to the shared logic in the base class

**File: `tap_github/client.py`**

```python
class GitHubRestStream(RESTStream):
    ...
        if self.replication_key == "updated_at":
            params["sort"] = "updated"
            params["direction"] = "desc" if self.use_fake_since_parameter else "asc"
```

and tries to use the `since` logic. However, the [GitHub Actions API endpoint](https://docs.github.com/en/rest/actions/workflow-runs?apiVersion=2022-11-28#list-workflow-runs-for-a-repository) (`/repos/{owner}/{repo}/actions/runs`) only supports:

- `created` parameter for date filtering (not `since`)

## Solution

This PR modifies the `WorkflowRunsStream` to:

1. **Set `replication_key = "created_at"`** - Use the field that the GitHub API supports
2. **Set `ignore_parent_replication_key = True`** - Use independent replication logic
3. **Add custom `get_url_params` method** - Use the `created` parameter instead of `since`

## Changes Made

**File: `tap_github/repository_streams.py`**

```python
class WorkflowRunsStream(GitHubRestStream):
    replication_key = "created_at"  # Changed from None
    ignore_parent_replication_key = True  # Changed from False

    def get_url_params(self, context, next_page_token):
        params = super().get_url_params(context, next_page_token)

        # GitHub Actions API uses 'created' parameter instead of 'since'
        since = self.get_starting_timestamp(context)
        if self.replication_key and since:
            params["created"] = f"{since.isoformat(sep='T')}..*"

        return params
```

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
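
For illustration, the value sent in the `created` parameter can be reproduced outside the tap. This is a minimal sketch, assuming a timezone-aware starting timestamp; `build_created_filter` is a hypothetical helper name and is not part of tap-github. The `..*` suffix is the open-ended range form of GitHub's search date syntax, which the endpoint documentation refers to for `created`.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_created_filter(since: datetime) -> str:
    """Return an open-ended date range starting at `since` (hypothetical helper)."""
    return f"{since.isoformat(sep='T')}..*"


# Example start date; in the tap this would come from state or `start_date`.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
created = build_created_filter(start)

# The "+" in the UTC offset must be percent-encoded in a query string;
# urlencode does this here (HTTP clients such as requests encode params too).
query = urlencode({"created": created})
```

Here `created` is `2024-01-01T00:00:00+00:00..*`, and the encoded query carries the offset's `+` as `%2B` so it is not read as a space.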
1 parent dcbe056 commit 8667cda

1 file changed: +17 -2 lines changed

tap_github/repository_streams.py

Lines changed: 17 additions & 2 deletions
@@ -2961,9 +2961,9 @@ class WorkflowRunsStream(GitHubRestStream):
     name = "workflow_runs"
     path = "/repos/{org}/{repo}/actions/runs"
     primary_keys: ClassVar[list[str]] = ["id"]
-    replication_key = None
+    replication_key = "created_at"
     parent_stream_type = RepositoryStream
-    ignore_parent_replication_key = False
+    ignore_parent_replication_key = True
     state_partitioning_keys: ClassVar[list[str]] = ["repo", "org"]
     records_jsonpath = "$.workflow_runs[*]"

@@ -3006,6 +3006,21 @@ class WorkflowRunsStream(GitHubRestStream):
         th.Property("workflow_url", th.StringType),
     ).to_dict()
 
+    def get_url_params(
+        self,
+        context: Context | None,
+        next_page_token: Any | None,  # noqa: ANN401
+    ) -> dict[str, Any]:
+        """Return a dictionary of values to be used in URL parameterization."""
+        params = super().get_url_params(context, next_page_token)
+
+        # GitHub Actions API uses 'created' parameter instead of 'since'
+        since = self.get_starting_timestamp(context)
+        if self.replication_key and since:
+            params["created"] = f"{since.isoformat(sep='T')}..*"
+
+        return params
+
     def parse_response(self, response: requests.Response) -> Iterable[dict]:
         """Parse the response and return an iterator of result rows."""
         yield from extract_jsonpath(self.records_jsonpath, input=response.json())
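
The override pattern in this diff can be exercised without the Singer SDK. The following is a rough sketch only: `FakeBaseStream` and `FakeWorkflowRunsStream` are hypothetical stand-ins, and the `per_page` value returned by the fake base class is an assumption made for the example, not what `GitHubRestStream` actually returns. It shows that the `created` filter is layered on top of the parent's parameters and is omitted when no starting timestamp is available.

```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Any


class FakeBaseStream:
    """Hypothetical stand-in for GitHubRestStream; not the real class."""

    replication_key: str | None = None

    def __init__(self, start: datetime | None) -> None:
        self._start = start

    def get_starting_timestamp(self, context: dict | None) -> datetime | None:
        # The real SDK derives this from state or `start_date`; here it is fixed.
        return self._start

    def get_url_params(self, context: dict | None, next_page_token: Any) -> dict[str, Any]:
        # Assumed base parameters, for illustration only.
        return {"per_page": 100}


class FakeWorkflowRunsStream(FakeBaseStream):
    replication_key = "created_at"

    def get_url_params(self, context: dict | None, next_page_token: Any) -> dict[str, Any]:
        params = super().get_url_params(context, next_page_token)
        since = self.get_starting_timestamp(context)
        # Only add the filter when a replication key and a start point exist.
        if self.replication_key and since:
            params["created"] = f"{since.isoformat(sep='T')}..*"
        return params


with_start = FakeWorkflowRunsStream(datetime(2024, 1, 1, tzinfo=timezone.utc))
without_start = FakeWorkflowRunsStream(None)

params_with = with_start.get_url_params(None, None)
params_without = without_start.get_url_params(None, None)
```

With a start timestamp the result contains both the base parameters and `created`; without one, the request is left unfiltered and only the base parameters are sent.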
