Commit 266f5e1

feat: add contents from html2rss gem readme
Signed-off-by: Gil Desmarais <[email protected]>
1 parent 56d9b89 commit 266f5e1

31 files changed, +1297 -127 lines changed

about.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
---
layout: default
title: About html2rss
# nav_order: 2
---

# About html2rss

`html2rss` is an open-source project dedicated to empowering users to take control of their web content consumption. In an age where many websites no longer offer traditional RSS feeds, `html2rss` bridges this gap by providing a robust and flexible solution for converting any HTML content into a structured RSS format.

The project was started in 2018 and has since grown into a suite of tools that help users create and consume RSS feeds.

---

### Our Mission

Our mission is to provide a simple, powerful, and accessible tool that enables individuals and developers to create custom RSS feeds from any web page. We believe in the power of open standards and the freedom to access information on your own terms.

---

### The html2rss Ecosystem

The `html2rss` project is more than just a single tool. It's a collection of tools that work together to provide a complete RSS solution:

- **[`html2rss`](https://github.com/html2rss/html2rss):** The core Ruby gem that provides the main functionality for converting HTML to RSS.
- **[`html2rss-web`](https://github.com/html2rss/html2rss-web):** A web application that allows you to create and manage your RSS feeds through a user-friendly interface.
- **[`html2rss-configs`](https://github.com/html2rss/html2rss-configs):** A collection of pre-built feed configs for popular websites, so you can get started quickly.

---

### Project Philosophy

- **User Empowerment:** Give users the tools to customize their web experience.
- **Simplicity & Power:** Offer an easy-to-use interface with powerful underlying capabilities.
- **Open Source:** Foster a collaborative environment where the community can contribute and improve the project.
- **Reliability:** Strive for a stable and dependable tool that consistently delivers.

---

### The Team

`html2rss` is maintained by a dedicated group of volunteers and contributors from around the world. We are passionate about open source and committed to continuously improving the project.

Want to join us? Check out our [Contributing Guide]({{ '/contributing/' | relative_url }})!

api-reference.md

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
---
layout: default
title: API Reference
nav_order: 8
---

# API Reference

This section provides a reference for the `html2rss` command-line interface (CLI).

For detailed documentation on the Ruby API, please refer to the official YARD documentation.

[**📚 View the Ruby API Docs on rubydoc.info**](https://www.rubydoc.info/gems/html2rss)

---

### Command-Line Interface (CLI)

The `html2rss` executable provides the primary way to interact with the tool from your terminal.

#### `html2rss auto <URL>`

Automatically generates an RSS feed from the provided URL.

- `<URL>` (Required): The URL of the website to generate a feed from.

**Example:**

```bash
html2rss auto https://unmatchedstyle.com/
```

#### `html2rss feed <CONFIG_FILE>`

Generates an RSS feed based on the provided YAML configuration file.

- `<CONFIG_FILE>` (Required): Path to your YAML configuration file.

**Examples:**

```bash
# Generate and print to console
html2rss feed my_feed.yml

# Generate and save to an XML file
html2rss feed my_feed.yml > my_feed.xml
```

#### `html2rss help`

Displays the help message with available commands and options.

#### `html2rss --version`

Displays the currently installed version of `html2rss`.

components/html2rss-configs.md

Lines changed: 0 additions & 23 deletions
This file was deleted.

components/html2rss-web.md

Lines changed: 0 additions & 28 deletions
This file was deleted.

components/html2rss.md

Lines changed: 0 additions & 32 deletions
This file was deleted.

components/index.md

Lines changed: 0 additions & 7 deletions
This file was deleted.

configs/index.html

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
 ---
 layout: default
-title: All feeds
+title: Ready-to-use configs
 noindex: true
-nav_order: 1
+# nav_order: 1
 ---

 <noscript>

configuration/auto_source.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
---
layout: default
title: Auto Source
nav_order: 4
parent: Configuration
---

# `auto_source`

The `auto_source` scraper is the easiest way to create a feed. It intelligently finds items on a page without requiring you to specify CSS selectors.

You can enable it in your YAML config like this:

```yaml
channel:
  url: https://example.com
auto_source: {}
```

---

## How it Works

The `auto_source` scraper uses a series of strategies to find content:

1. **`schema`:** It looks for structured data in the form of `<script type="application/ld+json">` tags. Many websites use this to provide machine-readable information about their content, often following the [Schema.org](https://schema.org/) standard.
2. **`semantic_html`:** It searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`. These tags are often used to define the main content of a page.
3. **`html`:** As a last resort, it analyzes the entire HTML structure to find frequently occurring selectors that are likely to contain the main content.
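
To make the `schema` strategy more concrete, here is a minimal Ruby sketch of reading JSON-LD from a page. It is an illustration only and not how `html2rss` implements the scraper: the helper `json_ld_objects`, the sample markup, and the regex-based extraction are all assumptions chosen to keep the example stdlib-only.

```ruby
require 'json'

# Hypothetical sample page; not taken from any real site.
sample_html = <<~'PAGE'
  <html><head>
  <script type="application/ld+json">
  {"@type": "NewsArticle", "headline": "Hello RSS", "url": "https://example.com/hello"}
  </script>
  </head><body><article><h2>Hello RSS</h2></article></body></html>
PAGE

# Return the parsed contents of every JSON-LD <script> tag in the markup.
# A regex keeps this sketch dependency-free; a real scraper would use an HTML parser.
def json_ld_objects(html)
  pattern = %r{<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>}mi
  html.scan(pattern).flatten.filter_map do |body|
    JSON.parse(body)
  rescue JSON::ParserError
    nil # skip malformed blocks instead of failing the whole page
  end
end

puts json_ld_objects(sample_html).first['headline']
```

Pages without such a tag yield an empty array here, which is the point where a fallback strategy such as `semantic_html` or `html` would take over.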

---

## Fine-Tuning `auto_source`

You can customize the behavior of the `auto_source` scraper to improve its accuracy.

### Scraper Options

You can enable or disable specific scrapers and adjust their settings.

```yaml
auto_source:
  scraper:
    schema:
      enabled: false # default: true
    semantic_html:
      enabled: false # default: true
    html:
      enabled: true
      minimum_selector_frequency: 3 # default: 2
      use_top_selectors: 3 # default: 5
```

- `minimum_selector_frequency`: The minimum number of times a selector must appear to be considered a candidate for the main content.
- `use_top_selectors`: The number of top candidate selectors to consider.
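
The interplay of the two `html` scraper options can be sketched as a frequency count followed by a top-N cut. The Ruby snippet below is a hypothetical illustration, not the gem's code: `candidate_selectors`, its keyword names, and the sample selector strings are assumptions. Only the default values (2 and 5) come from the comments in the config above.

```ruby
# Hypothetical selector paths, one entry per link found on a page.
selectors = [
  'main > article', 'main > article', 'main > article',
  'nav > li', 'nav > li',
  'footer > a'
]

# Keep selectors seen at least `minimum_frequency` times, order them by
# frequency (most frequent first), and return at most `top` of them.
def candidate_selectors(selectors, minimum_frequency: 2, top: 5)
  selectors.tally
           .select { |_selector, count| count >= minimum_frequency }
           .sort_by { |_selector, count| -count }
           .first(top)
           .map(&:first)
end

p candidate_selectors(selectors)
p candidate_selectors(selectors, minimum_frequency: 3, top: 3)
```

With the defaults, `footer > a` is dropped because it appears only once. Raising the minimum frequency to 3 leaves `main > article` as the only candidate.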

### Cleanup Options

You can also clean up the results to remove unwanted items.

```yaml
auto_source:
  cleanup:
    keep_different_domain: false # default: true
    min_words_title: 4 # default: 3
```

- `keep_different_domain`: Whether to keep items that link to a different domain.
- `min_words_title`: The minimum number of words a title must have to be included.
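
As a rough illustration of what these two cleanup options filter out, consider the following Ruby sketch. It is not the gem's implementation: the `cleanup` helper and the sample items are hypothetical, and comparing exact host names is an assumption about what "different domain" means.

```ruby
require 'uri'

# Hypothetical scraped items for a channel at https://example.com.
items = [
  { title: 'A long enough headline here', url: 'https://example.com/a' },
  { title: 'Too short', url: 'https://example.com/b' },
  { title: 'Partner story from another site', url: 'https://other.example.org/c' }
]

# Drop items whose title has too few words and, unless
# keep_different_domain is true, items hosted on another domain.
def cleanup(items, channel_url:, keep_different_domain: true, min_words_title: 3)
  channel_host = URI(channel_url).host
  items.select do |item|
    same_domain = URI(item[:url]).host == channel_host
    long_enough = item[:title].split.size >= min_words_title
    (keep_different_domain || same_domain) && long_enough
  end
end

kept = cleanup(items, channel_url: 'https://example.com',
                      keep_different_domain: false, min_words_title: 4)
puts kept.map { |item| item[:url] }
```

With the settings from the YAML example, only the first item survives: the second title is too short and the third links to another host.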

configuration/channel.md

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
---
layout: default
title: Channel
nav_order: 1
parent: Configuration
---

# `channel`

The `channel` key contains information about the RSS feed itself, such as its title, URL, and description.

```yaml
channel:
  url: https://example.com
  title: "My Custom Feed"
  description: "A feed of the latest news from Example.com"
  author: "[email protected] (Jane Doe)"
  ttl: 60
  language: "en-us"
  time_zone: "Europe/Berlin"
```

---

## Channel Options

| Attribute     | Required     | Type    | Default        | Remark |
| :------------ | :----------- | :------ | :------------- | :----- |
| `url`         | **Required** | String  |                | The URL of the website to scrape. |
| `title`       | Optional     | String  | Auto-generated | The title of the RSS feed. |
| `description` | Optional     | String  | Auto-generated | Retrieved from meta description tags. |
| `author`      | Optional     | String  | Blank          | Format: `email (Name)`. |
| `ttl`         | Optional     | Integer | Auto-generated | Time to live in minutes. `html2rss` will use the `max-age` from the response headers if available, otherwise it will default to `360`. |
| `language`    | Optional     | String  | Auto-generated | Determined by the `lang` attribute of the `<html>` tag. |
| `time_zone`   | Optional     | String  | `'UTC'`        | The time zone to use for parsing dates. See a [list of valid time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones). |
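
The `ttl` fallback described in the table can be sketched as follows. This is a hedged illustration rather than the gem's code: the `ttl_minutes` helper is hypothetical, and converting `max-age` (which HTTP defines in seconds) to minutes by rounding up is an assumption. Only the `360` default comes from the table.

```ruby
# Derive a feed TTL in minutes from a Cache-Control header value.
# Falls back to `default` when no max-age directive is present.
def ttl_minutes(cache_control, default: 360)
  match = cache_control.to_s.match(/max-age=(\d+)/i)
  return default unless match

  seconds = match[1].to_i
  # Assumption: round up and never report less than one minute.
  [(seconds / 60.0).ceil, 1].max
end

puts ttl_minutes('public, max-age=3600')
puts ttl_minutes('no-store')
puts ttl_minutes(nil)
```

Setting `ttl` explicitly in the `channel` block, as in the YAML example, sidesteps this derivation entirely.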

configuration/headers.md

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
---
layout: default
title: Headers
nav_order: 1
parent: Configuration
---

# `headers`

The `headers` key allows you to set custom HTTP headers for your requests. This is useful for accessing protected content or interacting with APIs.

```yaml
headers:
  User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
  Authorization: "Bearer YOUR_TOKEN"
```

The same key works for APIs that require an authorization token or other custom headers.

Dynamic parameters can be used in headers to pass values at runtime. See [Advanced Topics](/configuration/advanced-topics/) for more details.

## Example Configuration

This example demonstrates how to add custom HTTP headers to your feed request:

```yaml
channel:
  url: https://example.com/protected-content
  headers:
    User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
    Authorization: "Bearer your_api_token_here"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2.title"
  url:
    selector: "h2.title a"
    extractor: "href"
  description:
    selector: ".summary"
```

### Explanation

- **`channel.headers`**: Defines custom HTTP headers to include in the request.
- **`User-Agent`**: Some websites require a specific user agent string.
- **`Authorization`**: Example of an API token for protected content.
- The rest of the configuration extracts articles as usual.

Use this configuration to access content that requires authentication or specific headers.
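
For readers who want to see what such headers amount to at the HTTP level, here is a minimal Ruby sketch using only the standard library. It makes no claim about the HTTP client `html2rss` uses internally. The `build_request` helper is hypothetical, and the request is only constructed, never sent.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a GET request carrying the configured headers.
def build_request(url, headers)
  request = Net::HTTP::Get.new(URI(url))
  headers.each { |name, value| request[name] = value }
  request
end

# Header values copied from the example configuration above.
headers = {
  'User-Agent' => 'Mozilla/5.0 (compatible; html2rss/1.0)',
  'Authorization' => 'Bearer your_api_token_here'
}

request = build_request('https://example.com/protected-content', headers)
puts request['Authorization']
```

Inspecting a request object like this can help confirm the exact header names and values before debugging an authentication failure.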
