Skip to content

Conversation

@addie9800
Copy link
Collaborator

No description provided.

@addie9800 addie9800 requested a review from MaxDall November 10, 2025 17:55
Copy link
Collaborator

@MaxDall MaxDall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@addie9800 How come we fix this publisher rather than creating a new version?

@addie9800
Copy link
Collaborator Author

addie9800 commented Nov 16, 2025

@addie9800 How come we fix this publisher rather than creating a new version?

My train of thought usually is that if the layout changes are minimal, i.e. you don't have to create new selectors and it is enough to replace @class='...') with contains(@class, '...'), it's not that necessary to create a new version. Additionally, if the functionality that is added extends the functionality of the current version (i.e. adding a caption selector), the previous version might profit off the change as well, in case we didn't spot the missing selector earlier

@MaxDall
Copy link
Collaborator

MaxDall commented Nov 17, 2025

My train of thought usually is that if the layout changes are minimal, i.e. you don't have to create new selectors and it is enough to replace @Class='...') with contains(@Class, '...'), it's not that necessary to create a new version. Additionally, if the functionality that is added extends the functionality of the current version (i.e. adding a caption selector), the previous version might profit off the change as well, in case we didn't spot the missing selector earlier

@addie9800 In general, that seems like a solid rule of thumb. I was actually about to ask if you could document this somewhere in our docs, but then another thought occurred to me: does the benefit of having a timestamp for the change (for instance, if we later discover significant changes we couldn’t foresee at the time due to unimplemented features) outweigh the extra work of creating a new version? I’d be really interested in your opinion on that.

@addie9800
Copy link
Collaborator Author

the benefit of having a timestamp for the change (for instance, if we later discover significant changes we couldn’t foresee at the time due to unimplemented features)

Can you maybe explain your idea a bit more? I can't really imagine what you have in mind at the moment. I think my main concern is that we would probably not notice any issues in backwardcompatibility, no matter if we have a timestamp or not. If we implement a new feature at a later point which is unavailable for the older articles, we wouldn't notice it during testing, since we usually test through forward crawling. Unless of course it triggers the tests for the articles saved at creation of the parser version.

the extra work of creating a new version?

I don't think creating a new version is a significant amount of extra work. My main concern with creating (too many) extra versions is that we might end up spamming extra versions, if we create them too lightly.

@MaxDall
Copy link
Collaborator

MaxDall commented Nov 21, 2025

Can you maybe explain your idea a bit more? I can't really imagine what you have in mind at the moment.

There's no idea on my mind right now. I just wondered if it is worth the extra step of a new version so we have a timestamp for the layout change. My concern was, that it, for now, looks like a minor change to the layout, but we later find out that having a new version and therefore a timestamp would have been beneficial.

@addie9800
Copy link
Collaborator Author

My concern was, that it, for now, looks like a minor change to the layout, but we later find out that having a new version and therefore a timestamp would have been beneficial.

Ah ok, I see. I thought you wanted to introduce some kind of logging system to log the time of the change. While I do see your concern, I think in this specific case, I would assume that the risk is rather low of us having issues at a later point. But maybe, we can discuss it in todays meeting.

Copy link
Collaborator

@MaxDall MaxDall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these two articles [1], [2] the author is extracted as article text. Further, it seems like the summary isn't correctly labeled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants