fix: Filter replication key items for use_fake_since_parameter
#379
base: main
tap_github/client.py
Outdated
# save the context from the request so it is available to the parse_response method
self.context = context
AFAIK this is not necessary. The stream class already has a `context` attribute.
Unfortunately, I don't think it is available 😞 on the core `RESTStream` class. That is why all the other method signatures take it as a parameter. For some reason it was excluded from this method.
use_fake_since_parameter
Co-authored-by: Edgar Ramírez Mondragón <[email protected]>
Sadly, the GitHub API returns NUL values in some fields. This function recursively replaces them with empty strings. Otherwise Postgres will raise an error when inserting the data.
I added on to this PR since it was pending and the changes were in the same spot. It looks like GitHub is returning NUL (`\x00`) values in some responses! I originally had the fix only on the specific stream that was causing issues, but I figured if it happens once it can happen elsewhere, so I pushed it to the client layer. This does add the overhead of checking every value for a NUL and replacing it, but that is better than dealing with bad data.
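For readers following along, here is a minimal sketch of what a recursive NUL replacement could look like. The function name `replace_nul` is hypothetical and this is not necessarily the code in the PR; it only assumes that records are plain JSON-like structures (dicts, lists, strings, and scalars).

```python
from typing import Any


def replace_nul(value: Any) -> Any:
    """Recursively strip NUL (\\x00) characters from strings.

    Hypothetical helper for illustration; the name and shape are assumptions.
    """
    if isinstance(value, str):
        # Replacing NUL with an empty string removes it from the text.
        return value.replace("\x00", "")
    if isinstance(value, dict):
        return {key: replace_nul(item) for key, item in value.items()}
    if isinstance(value, list):
        return [replace_nul(item) for item in value]
    # Numbers, booleans, and None pass through unchanged.
    return value
```

Because every value is visited, the cost grows with the size of each record, which is the overhead mentioned above.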
This tap uses a custom `use_fake_since_parameter` for GitHub item APIs that don't have a `since` parameter. This change affects all streams that use `use_fake_since_parameter`: it filters out any returned items that are before the `start_date`.

An example is `PullRequestStream`. It would get an entire page of 100 items, and 99 of them could be older than the `since` date, but all 100 were then spinning up child streams to get comments, commits, etc. That was causing a huge amount of extra usage of the API request limit. This filters those out.
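As a rough illustration of the filtering idea, the sketch below drops records whose replication key is older than a cutoff before they are passed on. The function name `filter_since`, its parameters, and the choice to keep records that lack the key are all assumptions made for this example, not the PR's actual implementation. It assumes timestamps are ISO 8601 strings with a trailing `Z`.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Iterator


def filter_since(
    records: Iterable[Dict[str, Any]],
    replication_key: str,
    since: datetime,
) -> Iterator[Dict[str, Any]]:
    """Yield only records whose replication key is at or after `since`.

    Hypothetical helper for illustration; `since` must be timezone-aware.
    """
    for record in records:
        raw = record.get(replication_key)
        if raw is None:
            # Assumption: keep records that have no replication key value.
            yield record
            continue
        # Convert the trailing "Z" so datetime.fromisoformat can parse it.
        timestamp = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        if timestamp >= since:
            yield record


page = [
    {"id": 1, "updated_at": "2023-06-01T12:00:00Z"},
    {"id": 2, "updated_at": "2021-01-15T08:30:00Z"},
]
cutoff = datetime(2023, 1, 1, tzinfo=timezone.utc)
kept = list(filter_since(page, "updated_at", cutoff))
```

Only the records that survive the filter would go on to spin up child streams, which is where the API request savings come from.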