1 change: 1 addition & 0 deletions setup.py
@@ -39,6 +39,7 @@
'six>=1.7.0',
'tqdm',
'toml',
'dateparser'
Author
This could be avoided by requiring a specific date format and parsing it, but of course dateparser offers more flexibility. Should I remove it?
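For reference, a minimal sketch of the two approaches being weighed here; the fixed-format helper is hypothetical and not part of this PR:

```python
from datetime import datetime

import dateparser


def parse_fixed(date_string):
    # Hypothetical alternative: accept only one explicit format, e.g. "2019-09-27".
    return datetime.strptime(date_string, '%Y-%m-%d')


# What the PR uses: dateparser handles free-form input such as "2 days ago".
print(parse_fixed('2019-09-27'))
print(dateparser.parse('2 days ago'))
```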

Contributor
dateparser is a nice and very convenient library; I don't mind adding it, and we could use it in other commands later if needed 👍

],
classifiers=[
'Development Status :: 5 - Production/Stable',
55 changes: 55 additions & 0 deletions shub/search.py
@@ -0,0 +1,55 @@
from datetime import datetime

import click
from dateparser import parse
from scrapinghub import ScrapinghubClient


HELP = """
Given a project key and part of an url, fetch job ids from Scrapy Cloud.
This is useful when you want to find a job in an efficient way starting from
an url.
The project key and an url (or part of it). The matching is case sensitive!
shub search 123456 "B07F3NG1234"
You can provide other parameters to narrow down the search significantly such
as the spider name and the date interval to search for. Or both! The default
is to search only the last 6 months.
shub search 123456 "B07F3NG1234" --spider="amazon"
shub search 123456 "B07F3NG1234" --start_date="last week" --end_date="2 days ago"
"""

SHORT_HELP = "Fetch job IDs from Scrapy Cloud based on URLs"


@click.command(help=HELP, short_help=SHORT_HELP)
@click.argument('project_key')
@click.argument('url_content')
@click.option(
    '--start_date',
    default='6 months ago',
    help='date to start searching from, defaults to 6 months ago'
Author
I considered 6 months a reasonable default. My main concern is that people will run this without setting a spider or start date, which will be slow for them and generate a lot of requests to Scrapy Cloud, but making either option mandatory might be too restrictive. What do you think?

Contributor
@vshlapakov, Sep 27, 2019
I'd say it's too much: we have 120 days of data retention for the professional plan, so it doesn't make sense to have it larger than 4 months anyway. And from the usage perspective, how often do you search for a job that you ran a few months ago? I would set the option's default even lower, say a week or two.
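For illustration, the lower default suggested above would only change the option declaration; this is a hypothetical tweak, not part of this PR:

```python
import click


@click.command()
@click.option(
    '--start_date',
    default='2 weeks ago',  # hypothetical lower default suggested in review
    help='date to start searching from, defaults to 2 weeks ago'
)
def cli(start_date):
    click.echo(start_date)
```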

)
@click.option('--end_date', default='now', help='date to end the search')
@click.option('-s', '--spider', help='the spider to search')
def cli(project_key, url_content, start_date, end_date, spider):
    def date_string_to_seconds(date):
        # The Scrapy Cloud API expects epoch timestamps in milliseconds,
        # hence the * 1000.
        return int((parse(date) - datetime(1970, 1, 1)).total_seconds() * 1000)

    start_time = date_string_to_seconds(start_date)
    end_time = date_string_to_seconds(end_date)

    project = ScrapinghubClient().get_project(project_key)

    jobs = project.jobs.iter(startts=start_time, endts=end_time, spider=spider)
    for job_dict in jobs:
        job = project.jobs.get(job_dict['key'])
        # Echo the job key once as soon as any of its requests matches the URL fragment.
        for req in job.requests.iter(filter=[('url', 'contains', [url_content])]):
            click.echo(job_dict['key'])
            break

1 change: 1 addition & 0 deletions shub/tool.py
@@ -51,6 +51,7 @@ def cli():
"migrate_eggs",
"image",
"cancel",
"search"
]

for command in commands:
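For readers unfamiliar with how this list is consumed: a hedged sketch of the registration pattern, assuming the elided loop body imports each `shub.<command>` module and attaches its `cli` to the main click group (the exact import and attribute names are assumptions, not shown in this diff):

```python
import importlib

import click


@click.group()
def cli():
    pass


commands = [
    "search",  # the entry added by this PR
]

for command in commands:
    # Assumed pattern: each name maps to a shub.<name> module exposing a `cli` command.
    module = importlib.import_module("shub." + command)
    cli.add_command(module.cli, command)  # makes `shub search ...` available
```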