fix: Enable incremental syncing for workflow_runs stream (#456)
## Problem
The `workflow_runs` stream currently fetches all historical workflow
runs instead of respecting the `start_date` configuration. This causes
significant performance issues for repositories with thousands of
workflow runs, as the tap processes all pages of historical data on
every run.
## Root Cause
The `WorkflowRunsStream` was configured with:
- `replication_key = None` (inheriting `updated_at` from parent
`RepositoryStream`)
- `ignore_parent_replication_key = False`
- No custom URL parameter handling, so the stream falls through to the base implementation in
**File: `tap_github/client.py`**
```python
class GitHubRestStream(RESTStream):
    ...
        if self.replication_key == "updated_at":
            params["sort"] = "updated"
            params["direction"] = "desc" if self.use_fake_since_parameter else "asc"
```
As a result, the stream tries to apply the `since`-based filtering logic.
However, the [GitHub Actions API
endpoint](https://docs.github.com/en/rest/actions/workflow-runs?apiVersion=2022-11-28#list-workflow-runs-for-a-repository)
(`/repos/{owner}/{repo}/actions/runs`) only supports:
- `created` parameter for date filtering (not `since`)
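For illustration, a minimal sketch of what a request URL with the `created` filter could look like. The helper `build_runs_url`, the `octo-org`/`octo-repo` names, and the `per_page` value are assumptions made for this example and are not part of the tap; the `..*` suffix is the open-ended range form from GitHub's search syntax.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ROOT = "https://api.github.com"


def build_runs_url(owner: str, repo: str, created_from: str) -> str:
    """Build a list-workflow-runs URL limited to runs created at or after `created_from`.

    Hypothetical helper for illustration only; the tap builds its requests
    through the Singer SDK rather than by hand.
    """
    # "<timestamp>..*" is an open-ended range: everything from the timestamp onwards.
    query = urlencode({"created": f"{created_from}..*", "per_page": 100})
    return f"{API_ROOT}/repos/{owner}/{repo}/actions/runs?{query}"


url = build_runs_url("octo-org", "octo-repo", "2024-01-01T00:00:00Z")
print(url)
```

A `since` parameter sent to this endpoint would not restrict the result set, which is why the tap ends up paging through the full history.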
## Solution
This PR modifies the `WorkflowRunsStream` to:
1. **Set `replication_key = "created_at"`** - Use the field that GitHub
API supports
2. **Set `ignore_parent_replication_key = True`** - Use independent
replication logic
3. **Add custom `get_url_params` method** - Use `created` parameter
instead of `since`
## Changes Made
**File: `tap_github/repository_streams.py`**
```python
class WorkflowRunsStream(GitHubRestStream):
    replication_key = "created_at"  # Changed from None
    ignore_parent_replication_key = True  # Changed from False

    def get_url_params(self, context, next_page_token):
        params = super().get_url_params(context, next_page_token)
        # GitHub Actions API uses 'created' parameter instead of 'since'
        since = self.get_starting_timestamp(context)
        if self.replication_key and since:
            params["created"] = f"{since.isoformat(sep='T')}..*"
        return params
```
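To make the resulting filter value concrete, here is a small standalone sketch of the same formatting step. The function name `created_filter` and the sample date are assumptions for this example; only the f-string mirrors the change above.

```python
from datetime import datetime, timezone


def created_filter(since: datetime) -> str:
    """Format a starting timestamp as an open-ended `created` range.

    Hypothetical standalone version of the formatting done in get_url_params.
    """
    return f"{since.isoformat(sep='T')}..*"


# Example: a start_date of 2024-01-01 in UTC (an assumed value).
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(created_filter(start))  # 2024-01-01T00:00:00+00:00..*
```

Note that the `+` in the UTC offset has to be percent-encoded in the query string; HTTP client libraries typically do this when parameters are passed as a dict.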
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>