Extracting Structured Data from Unstructured HTML Documents
Hext is a domain-specific language to extract structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.
Visit Hext’s project page for up-to-date documentation and a live demo.
Examples
The following Hext template collects all hyperlinks and extracts the href and the clickable text.
<a href:link @text:title />
Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its attribute href and its textual representation are stored as link and title, respectively.
HTML Input
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Example</title>
</head>
<body>
<a href="one.html"> Page 1</a>
<a href="two.html"> Page 2</a>
<a href="three.html">Page 3</a>
</body>
</html>
Extracted Data
{
"link": "one.html",
"title": "Page 1"
},
{
"link": "two.html",
"title": "Page 2"
},
{
"link": "three.html",
"title": "Page 3"
}
You can use this example in Hext’s live code editor.
Visit Hext’s documentation and its section “How Hext Matches Elements” for a more thorough explanation.
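To make the matching rule from the first example more concrete, here is a rough stand-in written with Python's standard html.parser module. It is not how Hext is implemented and it does not use Hext at all; the class name LinkCollector is made up for this sketch. It only mimics what the template <a href:link @text:title /> asks for: every a element with an href attribute yields one record with the keys link and title.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for the template <a href:link @text:title />."""

    def __init__(self):
        super().__init__()
        self.results = []     # one dict per matched <a> element
        self._current = None  # the record being built while inside an <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An element matches only if it is an <a> that has an href attribute.
        if tag == "a" and "href" in attrs:
            self._current = {"link": attrs["href"], "title": ""}

    def handle_data(self, data):
        # Collect the text inside the currently matched element.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Trim surrounding whitespace, as in the extracted data above.
            self._current["title"] = self._current["title"].strip()
            self.results.append(self._current)
            self._current = None


html = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>Example</title></head>
<body>
<a href="one.html"> Page 1</a>
<a href="two.html"> Page 2</a>
<a href="three.html">Page 3</a>
</body>
</html>"""

collector = LinkCollector()
collector.feed(html)
for record in collector.results:
    print(record)
```

The stand-in needs about thirty lines for what the Hext template expresses in one, and it handles only this single flat case.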
Using Hext on the Command Line
Hext ships with a command line utility called htmlext, which applies Hext templates to HTML documents and outputs JSON.
Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:
htmlext \
-i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
-s '<a class="title" href:x />' \
-f x \
| xargs vlc
Use cases like these were my primary motivation for building Hext: inspecting a website, figuring out its structure, and using that structure to quickly extract data. Visit htmlext’s documentation for more.
Hacker News Submissions
The following Hext template collects submissions from Hacker News.
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>
Each submission will get its own dictionary of key-value pairs.
$ htmlext hacker-news.hext <(curl https://news.ycombinator.com/)
{
"comment_count": "32",
"href": "https://example.com/",
"rank": "1.",
"score": "208 points",
"title": "Example Title #1",
"user": "user1"
},
{
"comment_count": "10",
"href": "https://example.com/",
"rank": "2.",
"score": "53 points",
"title": "Example Title #2",
"user": "user2"
},
# ...
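All extracted values are strings, so rank arrives as "1." and score as "208 points". The following is a small post-processing sketch in plain Python that turns those fields into integers. The function name clean_submission is made up for this example, and the sketch assumes that the fields from the second row (score, user, comment_count) may be absent, since that row is marked with ? in the template.

```python
import re


def clean_submission(raw):
    """Convert the numeric string fields of one extracted submission to int."""

    def first_int(text):
        # Return the first run of digits as an int, or None if there is none.
        match = re.search(r"\d+", text or "")
        return int(match.group()) if match else None

    return {
        "rank": first_int(raw.get("rank")),
        "score": first_int(raw.get("score")),
        "comment_count": first_int(raw.get("comment_count")),
        "title": raw.get("title"),
        "href": raw.get("href"),
        "user": raw.get("user"),
    }


# Sample record copied from the example output above.
sample = {
    "comment_count": "32",
    "href": "https://example.com/",
    "rank": "1.",
    "score": "208 points",
    "title": "Example Title #1",
    "user": "user1",
}

cleaned = clean_submission(sample)
print(cleaned)
```

Missing fields come back as None rather than raising, which keeps a loop over many submissions from stopping at the first incomplete one.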
Using Hext
Hext is written in C++, but comes with language bindings for Python, Node.js, Ruby and PHP.
As an example, this is a Python 3 script showcasing Hext.
import hext
import requests
import json
res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()
# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)
# hext.Rule's constructor expects a single argument
# containing a Hext template.
# Throws an exception of type ValueError on invalid syntax.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")
# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)
# Print each dictionary as JSON
for item in result:
    print(json.dumps(item, ensure_ascii=False,
                     separators=(',', ':')))
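The script above prints one compact JSON object per line. If that output is saved or piped into another program, it can be read back with the standard json module alone. This is a minimal sketch; the function name parse_json_lines and the sample text are made up for illustration.

```python
import json


def parse_json_lines(text):
    """Parse text holding one JSON object per line into a list of dicts."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Made-up sample in the shape the script above prints.
output = (
    '{"rank":"1.","title":"Example Title #1"}\n'
    '{"rank":"2.","title":"Example Title #2"}\n'
)

records = parse_json_lines(output)
print([record["title"] for record in records])
```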
Installing Hext
The easiest way to install the htmlext command-line utility and the Hext Python module is through pip:
pip install hext
Hext is available for different languages and environments:
C++, Python, Node, Ruby, PHP, JavaScript (WebAssembly), Linux and macOS.
See Hext’s download page for more. If you are interested in Hext and in need of a binary package for your system or language, please raise an issue on GitHub.
Roundup
Hopefully this post gives an overview of what Hext is and how to use it. Visit Hext’s project page for more.
I have built Hext to scratch a personal itch of mine. I am unsure whether Hext might be useful to others. If it is, please let me know!
Acknowledgements
Hext wouldn’t have happened if it weren’t for Gumbo, which is used as the HTML parser that powers Hext. It is a wonderful piece of work, incredibly easy to integrate and blazingly fast. What more to wish for?
It would be a mistake not to mention Ragel, a state machine compiler. It is a blessing to work with, especially if you know the pain that is Bison/Yacc. If you are building a domain-specific language that can be parsed by a state machine, give Ragel a serious try.
The language bindings were built with SWIG. SWIG is an incredible achievement: I wouldn’t have thought it possible to bring all those wacky scripting languages together under one roof.