Extracting Structured Data from HTML

Hext is a domain-specific language to extract structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.

This post is a short introduction to Hext, a small project of mine. Visit Hext’s project page for full documentation and a live demo.

Examples

The following Hext snippet collects all hyperlinks and extracts the href and the clickable text.

<a href:link @text:title />

Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.

HTML Input
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Example</title>
  </head>
  <body>
    <a href="one.html">  Page 1</a>
    <a href="two.html">  Page 2</a>
    <a href="three.html">Page 3</a>
  </body>
</html>
Extracted Data
{
  "link": "one.html",
  "title": "Page 1"
},
{
  "link": "two.html",
  "title": "Page 2"
},
{
  "link": "three.html",
  "title": "Page 3"
}

You can use this example in Hext’s live code editor.

Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
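To make the matching rule above more concrete, here is a rough standard-library approximation of what the one-line snippet collects. This is not how Hext works internally; it is only a sketch that visits every element, requires the tag a plus an href attribute, and stores the attribute and the whitespace-collapsed text. The names LinkCollector and collect_links are made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href and visible text of every <a> that has an href."""

    def __init__(self):
        super().__init__()
        self.results = []    # one dict per matched <a>
        self._href = None    # href of the <a> currently open, if any
        self._chunks = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Like the rule, require both the tag "a" and an href attribute.
        if tag == "a" and "href" in attributes:
            self._href = attributes["href"] or ""
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a matched <a>.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace, so "  Page 1" becomes "Page 1".
            title = " ".join("".join(self._chunks).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


sample = '<a href="one.html">  Page 1</a>\n<a name="top">skipped</a>'
print(collect_links(sample))
# [{'link': 'one.html', 'title': 'Page 1'}]
```

That is roughly thirty lines for something the Hext snippet expresses in one, and every additional condition or nested element makes the hand-written version grow further.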

Using Hext on the Command Line

Hext ships with a command line utility called htmlext, which applies Hext snippets to HTML documents and outputs JSON.

Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc

Use cases like this were my primary motivation for building Hext: inspecting a website, figuring out its structure, and using that structure to quickly extract data.
[More about htmlext]
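The same call can also be driven from a script. The following is a minimal sketch, assuming htmlext is on the PATH and the page has already been saved to a local file. It only uses the -i, -s and -f flags shown above. The helper names htmlext_argv and run_htmlext are hypothetical, and I am assuming that with -f the tool prints one captured value per line, as the xargs pipeline suggests.

```python
import subprocess


def htmlext_argv(html_path, snippet, key=None):
    """Build the argument list for an htmlext call like the one above."""
    argv = ["htmlext", "-i", html_path, "-s", snippet]
    if key is not None:
        # Restrict the output to the values captured under this name.
        argv += ["-f", key]
    return argv


def run_htmlext(html_path, snippet, key=None):
    """Run htmlext and return its standard output split into lines."""
    completed = subprocess.run(
        htmlext_argv(html_path, snippet, key),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,
        text=True,
    )
    return completed.stdout.splitlines()
```

Passing the arguments as a list avoids shell quoting problems with the angle brackets and double quotes inside the Hext snippet.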

Hacker News Submissions

The following Hext snippet collects submissions from Hacker News.

<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>

Each submission will get its own dictionary of key-value pairs.

{
  "comment_count": "32",
  "href": "https://example.com/",
  "rank": "1.",
  "score": "208 points",
  "title": "Example Title #1",
  "user": "user1"
},
{
  "comment_count": "10",
  "href": "https://example.com/",
  "rank": "2.",
  "score": "53 points",
  "title": "Example Title #2",
  "user": "user2"
}
// ...
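All extracted values are strings, so "rank" comes back as "1." and "score" as "208 points". If you need numbers, a small post-processing step is enough. The sketch below is plain Python and not part of Hext; normalize is a hypothetical helper name. It tolerates missing keys, on the assumption that the second row (written as <?tr>) is optional and may not match for every submission.

```python
import re


def normalize(record):
    """Return a copy with rank, score and comment_count as integers."""

    def first_int(text):
        # Take the first run of digits, e.g. "208 points" -> 208.
        match = re.search(r"\d+", text)
        return int(match.group()) if match else None

    cleaned = dict(record)
    for key in ("rank", "score", "comment_count"):
        if key in cleaned:
            cleaned[key] = first_int(cleaned[key])
    return cleaned


submission = {
    "comment_count": "32",
    "href": "https://example.com/",
    "rank": "1.",
    "score": "208 points",
    "title": "Example Title #1",
    "user": "user1",
}
print(normalize(submission)["score"])
# 208
```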

Using Hext

Hext is written in C++, but comes with language bindings for Python, Node.js, Ruby and PHP.

As an example, this is a Python 3 script showcasing Hext.

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext snippet.
# Throws an exception of type ValueError on invalid syntax.
rule = hext.Rule("""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

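# Print each dictionary as compact JSON, one per line.
# The loop variable is called "record" so that it does not
# shadow Python's built-in map().
for record in result:
    print(json.dumps(record, ensure_ascii=False,
                     separators=(',', ':')))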

There’s a simplified htmlext clone for each of the available language bindings.

Installing Hext

See here for instructions on how to compile and install Hext from source.

Binary Packages

If you are using Ubuntu 18.04, you can get Hext by installing the Debian packages.

cd /tmp
wget https://github.com/thomastrapp/hext/releases/download/v0.7.0/hext-amd64-ubuntu18.04-v0.7.0.deb
wget https://github.com/thomastrapp/hext/releases/download/v0.7.0/hext-python3.6-amd64-ubuntu18.04-v0.7.0.deb
# install libhext and the htmlext command line utility
sudo apt install ./hext-amd64-ubuntu18.04-v0.7.0.deb
# install the hext python module
sudo apt install ./hext-python3.6-amd64-ubuntu18.04-v0.7.0.deb

See Hext’s release page on GitHub for more. If you are interested in Hext and need a binary package for your system or language, please raise an issue on GitHub.

Roundup

Hopefully this post gives an overview of what Hext is and how to use it. Visit Hext’s project page for more.

I built Hext to scratch a personal itch, and I am unsure whether it is useful to others. If it is, please let me know!

Acknowledgements

Hext wouldn’t have happened if it weren’t for Gumbo, which is used as the HTML parser that powers Hext. It is a wonderful piece of work, incredibly easy to integrate and blazingly fast. What more to wish for?

It would be a mistake not to mention Ragel, a state machine compiler. It is a blessing to work with, especially if you know the pain that is Bison/Yacc. If you are building a domain-specific language that can be parsed by a state machine, give Ragel a serious try.

The language bindings were built with SWIG. SWIG is an incredible achievement: I wouldn’t have thought it possible to bring all those wacky scripting languages together under one roof.

And of course I am thankful for the work of countless other authors, like the folks behind GCC, Boost and CMake. Without free software (as in freedom), I would not be a developer today.

Thank you for reading!

If you have feedback of any kind, please don’t hesitate to open an issue on GitHub or drop me an
