
A fast, modern, strict HTML meta tag parser for Python with snippet support — making Open Graph and Twitter Cards easy to access


xfenix/meta-tags-parser


Meta tags parser


Fast, modern, pure Python meta tag parser and snippet creator with full support for type annotations. The base package ships with py.typed and provides structured output. No jelly dicts, only typed structures!

Requirements

Install

pip install meta-tags-parser

Usage

TL;DR

  1. Parse meta tags from a source:

    from meta_tags_parser import parse_meta_tags_from_source, structs
    
    
    desired_result: structs.TagsGroup = parse_meta_tags_from_source("""... html source ...""")
    # desired_result is what you want
  2. Parse meta tags from a URL:

    from meta_tags_parser import parse_tags_from_url, parse_tags_from_url_async, structs
    
    
    desired_result: structs.TagsGroup = parse_tags_from_url("https://xfenix.ru")
    # and async variant
    desired_result: structs.TagsGroup = await parse_tags_from_url_async("https://xfenix.ru")
    # desired_result is what you want in both cases
  3. Parse a social media snippet from a source:

    from meta_tags_parser import parse_snippets_from_source, structs
    
    
    snippet_obj: structs.SnippetGroup = parse_snippets_from_source("""... html source ...""")
    # snippet_obj is what you want
    # access like snippet_obj.open_graph.title, ...
  4. Parse a social media snippet from a URL:

    from meta_tags_parser import parse_snippets_from_url, parse_snippets_from_url_async, structs
    
    
    snippet_obj: structs.SnippetGroup = parse_snippets_from_url("https://xfenix.ru")
    # and async variant
    snippet_obj: structs.SnippetGroup = await parse_snippets_from_url_async("https://xfenix.ru")
    # snippet_obj is what you want
    # access like snippet_obj.open_graph.title, ...

Huge note: the *_from_url functions are provided only for convenience and are error-prone, so reconnection and error handling are entirely up to you. I deliberately avoid pulling in the heavy dependencies that robust connection handling would require, since most users don't expect that from this library. If you really need it, contact me.
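Since retries are left to the caller, here is a minimal sketch of one way to wrap a fetching function. The helper name call_with_retries and its parameters are hypothetical and not part of this library. It catches every Exception because the exceptions raised by the *_from_url functions are not documented here; narrow that down once you know what your setup raises.

```python
import time
from typing import Callable, TypeVar

ResultT = TypeVar("ResultT")


def call_with_retries(
    fetch: Callable[[str], ResultT],
    url: str,
    attempts: int = 3,
    base_delay: float = 0.5,
) -> ResultT:
    """Call fetch(url), retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt_number in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            # Out of attempts: let the caller see the original error.
            if attempt_number == attempts:
                raise
            # With the defaults this waits 0.5s, then 1s, then 2s, and so on.
            time.sleep(base_delay * 2 ** (attempt_number - 1))
    raise AssertionError("unreachable")
```

With the library installed, something like call_with_retries(parse_snippets_from_url, "https://xfenix.ru") should work, since the helper only needs a callable that takes a URL.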

Basic snippet parsing

Let's say you want to extract a snippet for Twitter from an HTML page:

from meta_tags_parser import parse_snippets_from_source, structs


my_result: structs.SnippetGroup = parse_snippets_from_source("""
    <meta property="og:card" content="summary_large_image">
    <meta property="og:url" content="https://github.com/">
    <meta property="og:title" content="Hello, my friend">
    <meta property="og:description" content="Content here, yehehe">
    <meta property="twitter:card" content="summary_large_image">
    <meta property="twitter:url" content="https://github.com/">
    <meta property="twitter:title" content="Hello, my friend">
    <meta property="twitter:description" content="Content here, yehehe">
""")

print(my_result)
# What will be printed:
"""
SnippetGroup(
    open_graph=SocialMediaSnippet(
        title='Hello, my friend',
        description='Content here, yehehe',
        image='',
        url='https://github.com/'
    ),
    twitter=SocialMediaSnippet(
        title='Hello, my friend',
        description='Content here, yehehe',
        image='',
        url='https://github.com/'
    )
)
"""
# You can access attributes like this
my_result.open_graph.title
my_result.twitter.image
# All fields are required and will always be available, even if they contain no data
# So you don't need to worry about attribute existence (though you may need to check their values)
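Because every field is always present but may be empty, a common need is to fall back from one network's snippet to the other. The sketch below shows one way to do that. The dataclasses are local stand-ins that mirror the printed output above so the example runs without the library, and pick_field is a hypothetical helper, not a library function.

```python
from dataclasses import dataclass


@dataclass
class SocialMediaSnippet:
    """Stand-in with the four fields shown in the printed output."""
    title: str = ""
    description: str = ""
    image: str = ""
    url: str = ""


@dataclass
class SnippetGroup:
    """Stand-in for the group returned by parse_snippets_from_source."""
    open_graph: SocialMediaSnippet
    twitter: SocialMediaSnippet


def pick_field(snippets, field_name: str, default: str = "") -> str:
    """Return the first non-empty value, preferring Open Graph over Twitter."""
    for source in (snippets.open_graph, snippets.twitter):
        value = getattr(source, field_name)
        if value:
            return value
    return default


group = SnippetGroup(
    open_graph=SocialMediaSnippet(url="https://github.com/"),
    twitter=SocialMediaSnippet(title="Hello, my friend"),
)
print(pick_field(group, "title"))
print(pick_field(group, "image", default="no image"))
```

The helper only reads attributes by name, so it should accept a real SnippetGroup as well, assuming its fields are named as in the output above.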

Basic meta tag parsing

The main function is parse_meta_tags_from_source. Use it like this:

from meta_tags_parser import parse_meta_tags_from_source, structs


my_result: structs.TagsGroup = parse_meta_tags_from_source("""... html source ...""")
print(my_result)

# What will be printed:
"""
structs.TagsGroup(
    title="...",
    twitter=[
        structs.OneMetaTag(
            name="title", value="Hello",
            ...
        )
    ],
    open_graph=[
        structs.OneMetaTag(
            name="title", value="Hello",
            ...
        )
    ],
    basic=[
        structs.OneMetaTag(
            name="title", value="Hello",
            ...
        )
    ],
    other=[
        structs.OneMetaTag(
            name="article:name", value="Hello",
            ...
        )
    ]
)
"""

As you can see from this example, we don't use any jelly dicts—only structured dataclasses. Let's see another example:

from meta_tags_parser import parse_meta_tags_from_source, structs


my_result: structs.TagsGroup = parse_meta_tags_from_source("""
    <meta property="twitter:card" content="summary_large_image">
    <meta property="twitter:url" content="https://github.com/">
    <meta property="twitter:title" content="Hello, my friend">
    <meta property="twitter:description" content="Content here, yehehe">
""")

print(my_result)
# What will be printed:
"""
TagsGroup(
    title='',
    basic=[],
    open_graph=[],
    twitter=[
        OneMetaTag(name='card', value='summary_large_image'),
        OneMetaTag(name='url', value='https://github.com/'),
        OneMetaTag(name='title', value='Hello, my friend'),
        OneMetaTag(name='description', value='Content here, yehehe')
    ],
    other=[]
)
"""

for one_tag in my_result.twitter:
    if one_tag.name == "title":
        print(one_tag.value)
# What will be printed:
"""
Hello, my friend
"""

Improving speed

You can specify exactly what to parse:

from meta_tags_parser import parse_meta_tags_from_source, structs
from meta_tags_parser.structs import WhatToParse  # assumed location of the enum


result: structs.TagsGroup = parse_meta_tags_from_source("""... source ...""",
    what_to_parse=(WhatToParse.TITLE, WhatToParse.BASIC, WhatToParse.OPEN_GRAPH, WhatToParse.TWITTER, WhatToParse.OTHER)
)

Reducing this tuple of parsing requirements may increase overall parsing speed.
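Whether a smaller tuple helps on your pages is easiest to settle by measuring. Below is a minimal timing sketch using the standard timeit module. seconds_per_call is a hypothetical helper, and the workload in the example is a placeholder rather than a real parse.

```python
import timeit
from typing import Any, Callable


def seconds_per_call(parse: Callable[[], Any], repeats: int = 200) -> float:
    """Average wall-clock seconds for one call of parse()."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    total = timeit.timeit(parse, number=repeats)
    return total / repeats


# Placeholder workload; with the library installed you would pass e.g.
# lambda: parse_meta_tags_from_source(html_source, what_to_parse=...)
average = seconds_per_call(lambda: sorted(range(1000)), repeats=50)
print(f"{average:.6f} s per call")
```

Run it once with the full tuple and once with the reduced one on the same HTML, and compare the two averages.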

Important notes

  • Any name in a meta tag (name or property attribute) is lowercased.
  • The og: and twitter: prefixes are stripped from the original attribute values; the list a tag ends up in (open_graph or twitter) tells you which prefix it had. For example, a meta tag with the property og:name appears in the my_result.open_graph list.
  • HTML is parsed with selectolax's LexborHTMLParser. It is fast and tolerant but does not emulate a browser, so extremely malformed markup or tags generated by JavaScript may not be handled.
  • The page title (e.g., <title>Something</title>) is available as the string my_result.title (you'll receive Something)
  • "Standard" tags like title and description (see the full list in ./meta_tags_parser/structs.py in the BASIC_META_TAGS constant) are available as a list in my_result.basic
  • Other tags are available as a list in my_result.other, and their names are preserved, unlike the og:/twitter: behavior
  • For structured snippets, use the parse_snippets_from_source function
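The naming rules above can be illustrated with a few lines of plain Python. This is only a sketch of the rules as described in the notes, not the library's implementation (which works on selectolax nodes). classify_tag_name is a hypothetical function, and ASSUMED_BASIC_NAMES is a made-up two-item subset standing in for the real BASIC_META_TAGS constant.

```python
from typing import FrozenSet, Tuple

# Hypothetical subset; the real list is BASIC_META_TAGS in structs.py.
ASSUMED_BASIC_NAMES: FrozenSet[str] = frozenset({"title", "description"})


def classify_tag_name(
    raw_name: str,
    basic_names: FrozenSet[str] = ASSUMED_BASIC_NAMES,
) -> Tuple[str, str]:
    """Return (group, name) for a meta tag's name or property attribute."""
    lowered = raw_name.lower()
    # og: and twitter: prefixes pick the group and are stripped from the name.
    for prefix, group in (("og:", "open_graph"), ("twitter:", "twitter")):
        if lowered.startswith(prefix):
            return group, lowered[len(prefix):]
    if lowered in basic_names:
        return "basic", lowered
    # Everything else keeps its full (lowercased) name.
    return "other", lowered


for attribute in ("OG:Title", "twitter:card", "Description", "article:name"):
    print(attribute, "->", classify_tag_name(attribute))
```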

Changelog

See the release page at https://github.com/xfenix/meta-tags-parser/releases/.
