Search feature #370
base: master
@@ -0,0 +1,55 @@
from datetime import datetime

import click
from dateparser import parse
from scrapinghub import ScrapinghubClient


HELP = """
Given a project key and part of an url, fetch job ids from Scrapy Cloud.
This is useful when you want to find a job in an efficient way starting from
an url.
The project key and an url (or part of it). The matching is case sensitive!
shub search 123456 "B07F3NG1234"
You can provide other parameters to narrow down the search significantly such
as the spider name and the date interval to search for. Or both! The default
is to search only the last 6 months.
shub search 123456 "B07F3NG1234" --spider="amazon"
shub search 123456 "B07F3NG1234" --start_date="last week" --end_date="2 days ago"
"""

SHORT_HELP = "Fetch job ids from Scrapy Cloud based on urls"


@click.command(help=HELP, short_help=SHORT_HELP)
@click.argument('project_key')
@click.argument('url_content')
@click.option(
    '--start_date',
    default='6 months ago',
    help='date to start searching from, defaults to 6 months ago'
)
@click.option('--end_date', default='now', help='date to end the search')
@click.option('-s', '--spider', help='the spider to search')
def cli(project_key, url_content, start_date, end_date, spider):
    def date_string_to_seconds(date):
        return int((parse(date) - datetime(1970, 1, 1)).total_seconds() * 1000)

    start_time = date_string_to_seconds(start_date)
    end_time = date_string_to_seconds(end_date)

    project = ScrapinghubClient().get_project(project_key)

    jobs = project.jobs.iter(startts=start_time, endts=end_time, spider=spider)
    for job_dict in jobs:
        job = project.jobs.get(job_dict['key'])
        for req in job.requests.iter(filter=[('url', 'contains', [url_content])]):
            click.echo(job_dict['key'])
            break

Review comments on the --start_date option (default='6 months ago'):

I considered 6 months to be a reasonable default. My main concern is that people will use this without setting spider and start date and it will be slow for them and a lot of requests for scrapinghub cloud but making any of this mandatory might be too restrictive. What do you think?

I'd say it's too much: we have 120 days of data retention for the professional plan, so it doesn't make sense to have it larger than 4 months anyway. And from the usage perspective, how often do you search for a job that you ran a few months ago? I would set the option's default even lower, say a week or two.
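Two details of the diff above may be worth a closer look. The helper is named date_string_to_seconds but multiplies by 1000, so it actually returns milliseconds, and it subtracts a naive epoch, which would shift the result by the local UTC offset if the parser returns a naive local-time value. The following is a minimal standard-library sketch, not the PR's code, of a conversion that avoids both issues, plus a clamp to the 120-day retention figure quoted in the review comment. The names to_epoch_millis, clamp_start and RETENTION are hypothetical, and treating naive datetimes as local time is an assumption.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Assumption: 120 days, the retention figure quoted in the review comment.
RETENTION = timedelta(days=120)


def to_epoch_millis(dt: datetime) -> int:
    """Return whole milliseconds since the Unix epoch for a datetime."""
    if dt.tzinfo is None:
        # Assumption: naive values are in the machine's local time zone.
        dt = dt.astimezone()
    # Integer division of timedeltas avoids float rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def clamp_start(start: datetime, now: datetime) -> datetime:
    """Move a start date forward so it never predates the retention window.

    Both arguments are expected to be timezone-aware.
    """
    return max(start, now - RETENTION)
```

With helpers like these, a start date older than the retention window would be pulled forward before being converted and passed on as a start timestamp, so the default could stay permissive without searching a period that holds no data.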
@@ -51,6 +51,7 @@ def cli():
    "migrate_eggs",
    "image",
    "cancel",
    "search"
]

for command in commands:

This could be avoided by using a specific format and parsing that, but ofc, this offers more flexibility. Should I remove this?
dateparser is a nice and very convenient library. I don't mind adding it; we could use it in the other commands later if needed 👍
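For comparison, the fixed-format alternative raised in this thread could look roughly like the sketch below, using only the standard library. The YYYY-MM-DD format, the DATE_FORMAT constant and the parse_fixed_date name are assumptions for illustration and are not part of the PR.

```python
from datetime import datetime, timezone

# Assumed input format; the PR does not specify one.
DATE_FORMAT = "%Y-%m-%d"


def parse_fixed_date(text: str) -> datetime:
    """Parse a YYYY-MM-DD string into a UTC-aware datetime.

    Raises ValueError for anything that does not match DATE_FORMAT.
    """
    return datetime.strptime(text, DATE_FORMAT).replace(tzinfo=timezone.utc)
```

A parser like this removes the extra dependency, but it rejects inputs such as "last week" or "2 days ago" from the help text, which is the flexibility trade-off mentioned above.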