Skip to content

Conversation

cbrammer
Copy link
Contributor

This tap uses a custom use_fake_since_parameter for items API's in GitHub that don't have a since parameter. This effects all items that use use_fake_since_parameter. This filters out any returned items that are before the start_date.

An example is PullRequestStream. It would get an entire page of 100 items, and 99 of them could be past the since date. But all 100 were then spinning up child streams to get comments/commits etc. That was causing a huge extra usage of the API request limit. This filters those out.

Comment on lines 148 to 149
# save the context from the requests so it can be available to the parse_response method
self.context = context
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK this is not necessary. The stream class already has a context attribute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I don't think it is available 😞 on the core RESTStream class. That is why all other method signatures include it to be passed in. For some reason it was excluded from this method.

@edgarrmondragon edgarrmondragon changed the title Filter replication key items for use_fake_since_parameter fix: Filter replication key items for use_fake_since_parameter Mar 25, 2025
@edgarrmondragon edgarrmondragon self-assigned this Mar 25, 2025
@edgarrmondragon edgarrmondragon self-requested a review March 25, 2025 23:18
cbrammer and others added 2 commits March 27, 2025 15:41
Co-authored-by: Edgar Ramírez Mondragón <[email protected]>
Co-authored-by: Edgar Ramírez Mondragón <[email protected]>
Sadly, the GitHub API returns NUL values in some fields.
This function recursively replaces them with empty strings.
Otherwise postgres will raise an error when inserting the data.
Copy link

sonarqubecloud bot commented Apr 4, 2025

@cbrammer
Copy link
Contributor Author

cbrammer commented Apr 4, 2025

I added on to this PR since it was pending and the changes were in the same spot.

It looks like GiHub is returning NUL (\x00) values in some responses! I originally had it only on the specific stream that was causing issues, but I figured if it happens once... so I pushed it to the client layer. This does add overhead to see if there is a nul value and replace it, but it is better than dealing with bad data

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants