Add SE publisher (Aftonbladet) #803
base: master
addie9800 left a comment
Thanks for adding our second Swedish publisher. I only have a couple of remarks before we can go ahead.
    class AftonbladetParser(ParserProxy):
This publisher is a special case. It does not seem to implement the isAccessibleForFree attribute, which the default handling of the free_access attribute relies on, so you would need to find a custom implementation for this publisher. Example: https://www.aftonbladet.se/bil/a/bmRM95/privatleasade-audi-som-gick-sonder-far-betala-manadsavgift. This article is marked as free to access.
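A minimal sketch of what such a fallback could look like, written against raw HTML with the standard library only and not against the project's parser classes. The helper names and the `data-test-tag="paywall"` marker are hypothetical; the real paywall indicator on the publisher's pages would have to be looked up first.

```python
import json
import re
from typing import Optional

# Matches the body of JSON-LD script tags in a raw HTML string.
LD_PATTERN = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>', re.DOTALL
)

# Hypothetical marker, used for illustration only.
PAYWALL_MARKER = 'data-test-tag="paywall"'


def ld_free_flag(html: str) -> Optional[bool]:
    """Return isAccessibleForFree from the first JSON-LD object that has it, else None."""
    for raw in LD_PATTERN.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and "isAccessibleForFree" in data:
            # The value may be a boolean or a string such as "False".
            return str(data["isAccessibleForFree"]).lower() == "true"
    return None


def free_access(html: str) -> bool:
    flag = ld_free_flag(html)
    if flag is not None:
        return flag
    # Fallback when the attribute is missing: check for the assumed paywall marker.
    return PAYWALL_MARKER not in html
```

The structured-data value wins whenever it is present, so the fallback only changes behaviour for publishers that omit the attribute.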
    @attribute
    def images(self) -> List[Image]:
        return image_extraction(doc=self.precomputed.doc, paragraph_selector=self._paragraph_selector)
Parsing of the image authors does not work for this article: https://www.aftonbladet.se/senastenytt/ttsport/sport/a/lw7vKA/ryss-kan-straffas-efter-schackmastarens-dod
    _summary_selector = XPath("//p[contains(@data-test-tag,'lead-text')]")
    _paragraph_selector = XPath(
        "//p[starts-with(@class,'hyperion-css-') and not(contains(@data-test-tag,'lead-text'))]"
    )
This article also has subheadlines: https://www.aftonbladet.se/nyheter/a/yEkM0e/hemliga-kallaren-under-ica-baner-ags-av-leif-jonsson
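To illustrate what handling subheadlines means for the body extraction, here is a rough standard-library sketch that groups paragraphs under the subheadline preceding them. It restates the paragraph selector from this PR as a plain Python predicate; the sample markup and the assumption that subheadlines are `h2` elements with the same class prefix are hypothetical and need to be checked against the real page.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical, well-formed stand-in for an article body; real markup may differ.
SAMPLE = """
<article>
  <p class="hyperion-css-a" data-test-tag="lead-text">Summary text.</p>
  <p class="hyperion-css-b">First paragraph.</p>
  <h2 class="hyperion-css-c">A subheadline</h2>
  <p class="hyperion-css-b">Second paragraph.</p>
  <p class="other">Unrelated text.</p>
</article>
"""


def is_paragraph(el: ET.Element) -> bool:
    # Same conditions as the PR's _paragraph_selector, expressed in Python.
    return (
        el.tag == "p"
        and el.get("class", "").startswith("hyperion-css-")
        and "lead-text" not in el.get("data-test-tag", "")
    )


def is_subheadline(el: ET.Element) -> bool:
    # Assumption: subheadlines are <h2> elements sharing the class prefix.
    return el.tag == "h2" and el.get("class", "").startswith("hyperion-css-")


def split_sections(markup: str) -> List[Tuple[str, List[str]]]:
    """Group paragraphs under the subheadline that precedes them, in document order."""
    sections: List[Tuple[str, List[str]]] = [("", [])]  # leading section has no headline
    for el in ET.fromstring(markup.strip()).iter():
        text = "".join(el.itertext()).strip()
        if is_subheadline(el):
            sections.append((text, []))
        elif is_paragraph(el):
            sections[-1][1].append(text)
    return sections


sections = split_sections(SAMPLE)
```

In the parser itself this would presumably be expressed as an additional XPath selector next to the existing ones rather than as Python predicates.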
    class AftonbladetParser(ParserProxy):
        class V1(BaseParser):
            _summary_selector = XPath("//p[contains(@data-test-tag,'lead-text')]")
            _paragraph_selector = XPath(
Some bloat at the end of the article is extracted here: https://www.aftonbladet.se/debatt/a/Av4LX3/jagarnas-riksforbund-stoppa-vansinnesdad-inte-helt-vanliga-jagare
Please review if you have time. Needs to be merged after SE Expressen (#800)