Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which web developers typically use to structure content on the web.
In this blog post I aim to present an overall introduction to my little project called Hext. Visit Hext’s project page for full documentation and a live demo.
The following Hext snippet collects all hyperlinks and extracts the href and the clickable text.
Hext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag a and an attribute called href. If the element matches, its href attribute and its textual representation are stored as key-value pairs.
You can use this example in Hext’s live code editor.
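To make the matching rule described above more concrete, here is a rough plain-Python equivalent that uses only the standard library. It is a sketch of the behaviour, not of how Hext works internally. The names LinkCollector and collect_links, as well as the keys "href" and "text", are made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and the text of every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.results = []      # one dictionary per matched element
        self._current = None   # the matched <a> that is currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if "href" in attributes:
            # Tag is "a" and an href attribute exists: the element matches.
            # The key names "href" and "text" are hypothetical.
            self._current = {"href": attributes["href"] or "", "text": ""}
        else:
            # An <a> without href does not match.
            self._current = None

    def handle_data(self, data):
        # Text of nested elements counts towards the link's text as well.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse runs of whitespace before storing the result.
            self._current["text"] = " ".join(self._current["text"].split())
            self.results.append(self._current)
            self._current = None


def collect_links(html_text):
    """Return a list of dictionaries, one per <a href=...> in html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.results


if __name__ == "__main__":
    page = '<p><a href="/a">First <b>link</b></a> <a name="x">anchor</a></p>'
    for link in collect_links(page):
        print(link)
```

Even for a rule this small, the hand-written version needs a fair amount of bookkeeping, which is the kind of work a declarative snippet is meant to take over.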
Using Hext on the Command Line
Hext ships with a command line utility called htmlext, which applies Hext snippets to HTML documents and outputs JSON.
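Because the output is JSON, it is easy to pick up from a script. The following is a minimal sketch of that, assuming the output arrives as one JSON object per line. That layout is an assumption made for this example, so check the htmlext documentation for the actual format. The function values_for_key and the sample data are hypothetical.

```python
import json


def values_for_key(json_lines, key):
    """Return the values stored under `key` in line-delimited JSON objects."""
    values = []
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if key in record:
            values.append(record[key])
    return values


# Hypothetical output, shaped like the one-object-per-line JSON assumed above.
sample_output = (
    '{"href": "https://example.com/a", "text": "First"}\n'
    '{"href": "https://example.com/b", "text": "Second"}\n'
)

if __name__ == "__main__":
    for url in values_for_key(sample_output, "href"):
        print(url)
```

Reading from sys.stdin instead of a fixed string would let a script like this sit at the end of a shell pipeline.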
Ever wanted to watch the submissions on /r/videos in vlc? Well, take a look at this little guy right here:
Use cases like these were my primary motivation for building Hext: inspecting a website, figuring out its structure, and using that structure to quickly extract data.
[More about htmlext]
Hacker News Submissions
The following Hext snippet collects submissions from Hacker News.
Each submission will get its own dictionary of key-value pairs.
As an example, this is a Python 3 script showcasing Hext.
There’s a simplified htmlext clone for all available language bindings.
The easiest way to install the htmlext command-line utility and the Hext Python module is through pip.
See here for instructions on how to compile and install Hext from source.
If you are using Ubuntu 18.04, you can get Hext by installing the Debian package.
Hopefully this post gives an overview of what Hext is and how to use it. Visit Hext’s project page for more.
I built Hext to scratch a personal itch. I am unsure whether it will be useful to others. If it is, please let me know!
Hext wouldn’t have happened if it weren’t for Gumbo, the HTML parser that powers Hext. It is a wonderful piece of work, incredibly easy to integrate and blazingly fast. What more could you wish for?
It would be a mistake not to mention Ragel, a state machine compiler. It is a blessing to work with, especially if you know the pain that is Bison/Yacc. If you are building a domain-specific language that can be parsed by a state machine, give Ragel a serious try.
The language bindings were built with SWIG. SWIG is an incredible achievement: I wouldn’t have thought it possible to bring all those wacky scripting languages together under one roof.
And of course I am thankful for the work of countless other authors, like the folks behind GCC, Boost and CMake. Without free software (as in freedom), I would not be a developer today.
Thank you for reading!
If you have feedback of any kind, please don’t hesitate to open an issue on GitHub or drop me an email.