Building a PyPI Package for a Modern C++ Project

These are my notes on publishing a Python module on PyPI in 2018 using C++17, Boost and Swig.

# install hext for python
$ pip install hext
# ... you're good to go!

Note: This is indeed my first Python rodeo! The primary target audience is ‘Me in 6 Months’, so YMMV. But hopefully, it may be of use to anyone going down a similar route of using C++17, Boost and Swig to build and package a Python module.

Hext is my little library that has (simple) language bindings for Python, i.e. you are able to use Hext within a Python project.

Hext’s Dependencies

If a user were to install the Debian-packages, Hext’s dependencies would automatically be installed through Apt.

All the required libraries would automatically be copied into memory by the dynamic linker and be available to the application at runtime.

import hext instructs the Python interpreter to load hext.py, which in turn loads the compiled Python module _hext.so, the "glue" between Python and Hext. This shared object depends on other libraries, which the dynamic linker resolves at load time.

A YOLO approach to dependency management

Unfortunately, we cannot expect the target system to have Gumbo and Boost installed, nor a standard library for C++17.

That leaves us with only one option: Producing a Python module that includes all (most) dependencies by linking statically. Linking statically brings its own bag of problems, especially since security updates require recompilation and redistribution.
This is also known as the YOLO method, because ‘You Only Link Once’ :)

A Build Environment for Python Modules

Binary modules and system compatibility

To be binary compatible with most systems, it is recommended to use CentOS 5 as a build target (PEP 513). CentOS 5 was first released in 2007 and has been end-of-life since March 2017.

The manylinux project provides a Docker image (quay.io/pypa/manylinux1_x86_64) that comes pre-installed with CentOS 5 and everything required to build modules for 6 Python ABI versions (2.7m, 2.7mu, 3.4m, 3.5m, 3.6m, 3.7m).

The image also ships GCC 4.8.2, which is quite old, dare I say :) It predates C++17 support, so we need to either compile our own GCC or use somebody else’s precompiled GCC for CentOS 5.

Fortunately, compiling the most recent GCC (8.2.0 in my case) on CentOS 5 is straightforward. With the new GCC toolchain in place, building all other dependencies is a non-issue.

Building GCC

$ ./contrib/download_prerequisites
$ ./configure --enable-languages=c,c++ --disable-multilib
$ make -j4
$ make install
$ export PATH="/usr/local/bin:$PATH"
$ export CC=/usr/local/bin/gcc CXX=/usr/local/bin/g++

Building Boost

# install unicode support for boost regex
$ yum install libicu libicu-devel
# select your libraries
$ ./bootstrap.sh --with-libraries=program_options,regex
# build and install the static version of boost
# cxxflags="-fPIC":
#   The statically built boost libraries will end up in the Python module,
#   which is a shared library, and therefore needs position independent code.
$ ./b2 -j4 cxxflags="-fPIC" runtime-link=static variant=release link=static install

Building CMake

CMake provides binary releases for linux-x86_64, but those require Glibc 2.6, which is not available on CentOS 5 (which is stuck with 2.5).

$ ./bootstrap --parallel=4
$ make -j4
$ make install

An alternative is to install a precompiled version of CMake through pip.

Building Swig

$ yum install pcre pcre-devel
$ ./configure --disable-perl --disable-ruby --disable-csharp --disable-r --disable-java
$ make -j4
$ make install

Building Gumbo

$ ./autogen.sh
$ CFLAGS="-fPIC" ./configure --enable-shared=no
$ make -j4
$ make install

Building A Python Module

We now have everything set up to actually build the Python module.

Make sure to add the include path of the Python version you want to build against:

$ ls -d /opt/python/*/include/*/
/opt/python/cp27-cp27m/include/python2.7/
/opt/python/cp27-cp27mu/include/python2.7/
/opt/python/cp34-cp34m/include/python3.4m/
/opt/python/cp35-cp35m/include/python3.5m/
/opt/python/cp36-cp36m/include/python3.6m/
/opt/python/cp37-cp37m/include/python3.7m/
$ mkdir build ; cd build
$ MY_PYTHON_PATH=/opt/python/cp27-cp27m/include/python2.7/
$ cmake -DPYTHON_INCLUDE_PATH=$MY_PYTHON_PATH your-project-dir/

In CMakeLists.txt:

INCLUDE_DIRECTORIES(SYSTEM ${PYTHON_INCLUDE_PATH})
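
Instead of copying the include path by hand, each interpreter can be asked where its headers live. The following is a minimal sketch using the standard sysconfig module; the function name python_include_dir and the script name mentioned below are my own choices for illustration, not part of the original setup.

```python
import sysconfig


def python_include_dir():
    """Return the directory that holds Python.h for the running interpreter."""
    # sysconfig.get_paths() returns a dict of installation paths;
    # the "include" entry is the C header directory.
    return sysconfig.get_paths()["include"]


if __name__ == "__main__":
    print(python_include_dir())
```

Saved as, say, include_dir.py (a hypothetical filename), this could be run with the interpreter you build against, e.g. /opt/python/cp37-cp37m/bin/python include_dir.py, and the printed path passed to cmake as -DPYTHON_INCLUDE_PATH.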

Stripping your Module

Remove all the excess leftovers from static linking to considerably reduce the filesize of your module.

$ strip --strip-unneeded _mymodule.so

Minimal Dependencies

Use ldd to list your module’s dependencies on dynamically shared objects:

$ ldd ./_mymodule.so
    linux-vdso.so.1 =>  (0x00007ffdaeb8b000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fca51c08000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fca518a8000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fca52308000)

In the above example, _mymodule.so only depends on very old versions of libm and libc, and is therefore compatible with most linux-based systems. You can ignore linux-vdso.so (vDSO) and ld-linux-x86-64.so (ld.so).

A list of libraries which you can depend on when building Python modules for the manylinux1 platform tag is outlined in PEP 513.
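
The check against that list can be scripted. Below is a rough sketch that parses ldd output and reports every dependency that is not on an allow-list. Note the assumptions: ALLOWED is only an abbreviated subset of the libraries permitted by PEP 513 (consult the PEP for the full list), and the function name unexpected_dependencies is made up for this example.

```python
# Abbreviated allow-list (assumption): a subset of the libraries that
# PEP 513 permits for manylinux1. See the PEP for the complete list.
ALLOWED = frozenset([
    "libc.so.6",
    "libm.so.6",
    "libdl.so.2",
    "librt.so.1",
    "libpthread.so.0",
    "libutil.so.1",
    "libgcc_s.so.1",
    "libstdc++.so.6",
])

# The vDSO and the dynamic linker itself can be ignored.
IGNORED_PREFIXES = ("linux-vdso.so", "ld-linux")

SAMPLE_LDD_OUTPUT = """\
    linux-vdso.so.1 =>  (0x00007ffdaeb8b000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fca51c08000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fca518a8000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fca52308000)
"""


def unexpected_dependencies(ldd_output):
    """Return the sonames in ldd output that are not on the allow-list."""
    unexpected = []
    for line in ldd_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The first field is the soname, or a full path for ld.so.
        soname = fields[0].rsplit("/", 1)[-1]
        if soname.startswith(IGNORED_PREFIXES):
            continue
        if soname not in ALLOWED:
            unexpected.append(soname)
    return unexpected


if __name__ == "__main__":
    print(unexpected_dependencies(SAMPLE_LDD_OUTPUT))
```

An empty result means every dependency is on the allow-list; anything else points at a library that was linked dynamically by accident.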

You can get more verbose output by invoking the dynamic linker directly and setting some environment variables:

$ LD_TRACE_LOADED_OBJECTS=1 LD_VERBOSE=1 /lib64/ld-linux-x86-64.so.2 ./_mymodule.so
  linux-vdso.so.1 =>  (0x00007ffe4c2cb000)
  libm.so.6 => /lib64/libm.so.6 (0x00007f7702388000)
  libc.so.6 => /lib64/libc.so.6 (0x00007f7702028000)
  /lib64/ld-linux-x86-64.so.2 (0x00007f7702a88000)

  Version information:
  ./_mymodule.so:
    ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
    libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
  /lib64/libm.so.6:
    libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
  /lib64/libc.so.6:
    ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
    ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2
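
The version information is the interesting part: PEP 513 caps manylinux1 at GLIBC 2.5 symbol versions. As a sketch, the highest GLIBC version mentioned in such a trace can be extracted with a regular expression. The names max_glibc_version and MANYLINUX1_MAX_GLIBC are my own, and the function scans the whole trace, including the entries for the system libraries, so it is a coarse check.

```python
import re

# Matches "(GLIBC_2.3)" or "(GLIBC_2.2.5)", but not "(GLIBC_PRIVATE)".
GLIBC_TAG = re.compile(r"\(GLIBC_(\d+(?:\.\d+)*)\)")

# Upper bound for manylinux1 according to PEP 513.
MANYLINUX1_MAX_GLIBC = (2, 5)

SAMPLE_TRACE = """\
  Version information:
  ./_mymodule.so:
    ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
    libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
  /lib64/libm.so.6:
    libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
  /lib64/libc.so.6:
    ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
    ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2
"""


def max_glibc_version(trace):
    """Return the highest GLIBC version in the trace as a tuple, or None."""
    versions = [
        tuple(int(part) for part in match.split("."))
        for match in GLIBC_TAG.findall(trace)
    ]
    return max(versions) if versions else None


if __name__ == "__main__":
    found = max_glibc_version(SAMPLE_TRACE)
    print(found, found is not None and found <= MANYLINUX1_MAX_GLIBC)
```

Versions are compared as integer tuples so that 2.10 would sort above 2.5, which a plain string comparison would get wrong.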

Static libstdc++ and libgcc

When building your Python module it is important to tell GCC to statically link libstdc++ and libgcc, i.e. -static-libgcc -static-libstdc++, so as not to accidentally introduce a dependency on your toolchain.

CMake & Static Libraries

If CMake has trouble picking up the static version of libraries, experiment with the following CMake flags: -DCMAKE_FIND_LIBRARY_SUFFIXES=.a and -DBoost_USE_STATIC_LIBS=On.

Swig and Python 2 Unicode

If you are using Swig and building a module for Python 2 which accepts strings passed from Python to C++, make sure to add the following to your interface file:

%begin %{
#define SWIG_PYTHON_2_UNICODE
%}

See Swig’s documentation on Python 2 Unicode.

An easy way to test your Python interface on whether it accepts Unicode strings:

import foo

# pass a raw byte string
from_rawstr = foo.Bar("a raw byte string")

# pass a unicode string
# notice the u-prefix on the string literal
from_unistr = foo.Bar(u"a unicode string")

If there’s a TypeError thrown, like the following, the Python Unicode string is not accepted as an argument for a parameter of type std::string:

TypeError: in method 'new_Bar', argument 1 of type 'std::string'
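
The manual test above can be turned into a small reusable probe. This is only a sketch: accepts_unicode is a name I made up, and BytesOnly and Lenient are stand-ins that mimic the two behaviours, since the real wrapper class (foo.Bar in the example above) is not available here.

```python
def accepts_unicode(factory):
    """Return True if factory accepts a unicode string, False on TypeError."""
    try:
        factory(u"a unicode string")
    except TypeError:
        return False
    return True


class BytesOnly(object):
    """Stand-in for a wrapper that rejects unicode strings."""

    def __init__(self, value):
        if not isinstance(value, bytes):
            raise TypeError("argument 1 of type 'std::string'")
        self.value = value


class Lenient(object):
    """Stand-in for a wrapper that accepts any string type."""

    def __init__(self, value):
        self.value = value


if __name__ == "__main__":
    print(accepts_unicode(BytesOnly), accepts_unicode(Lenient))
```

Against a real Swig module you would pass the wrapped class itself, e.g. accepts_unicode(foo.Bar).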

Linking libpython is neither necessary nor recommended.

Modules and __init__.py

When the Python interpreter encounters the following line:

import mymodule

It will try to find a directory called mymodule that contains a file with the special name __init__.py, which is then executed:

$ ls site-packages/mymodule/
__init__.py
_mymodule.so

In other words, __init__.py is responsible for loading the shared library and setting up the module.
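
To see this mechanism in isolation, here is a small experiment, assuming Python 3. It creates a throwaway package directory containing only an __init__.py, imports it, and shows that the file was executed. The helper name import_from_fresh_package and the package name demo_pkg_for_notes are invented for this sketch.

```python
import importlib
import os
import shutil
import sys
import tempfile


def import_from_fresh_package(package_name, init_source):
    """Create <tmp>/<package_name>/__init__.py and import the package."""
    root = tempfile.mkdtemp()
    package_dir = os.path.join(root, package_name)
    os.mkdir(package_dir)
    with open(os.path.join(package_dir, "__init__.py"), "w") as f:
        f.write(init_source)
    # Make the temporary directory visible to the import system.
    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        return importlib.import_module(package_name)
    finally:
        sys.path.remove(root)
        shutil.rmtree(root, ignore_errors=True)


demo = import_from_fresh_package(
    "demo_pkg_for_notes", "loaded_by = '__init__.py'\n"
)
print(demo.loaded_by)
```

The attribute loaded_by only exists because the interpreter ran __init__.py during the import, which is exactly the hook used to load the shared library.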

Swig and __init__.py

Swig generates a mymodule.py to be used as a loader for the compiled module _mymodule.so. Unfortunately, you cannot just rename this file to __init__.py and be done with it. I am not sure, but it seems that the generated Python script isn’t supposed to be used as an __init__.py:

# This file was automatically generated by SWIG (http://www.swig.org).
# ...
if _swig_python_version_info >= (2, 7, 0):
  # ...
elif _swig_python_version_info >= (2, 6, 0):
  # ...
else:
  import _mymodule
del _swig_python_version_info
# ...

The easiest way to load a shared library residing in the same directory as __init__.py, which works with Python ≥ 2.7:

from . import _mymodule

I am using the following bash script to automatically replace Swig’s loader with the above line:

cat mymodule.py \
 | sed '/^# This file was automatically generated by SWIG/,/^del _swig_python_version_info$/d' \
 | cat <(echo "from . import _mymodule") - \
 > __init__.py
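
For build scripts that are already written in Python, the same transformation could look roughly like the sketch below. The function name make_init_source is my own. Unlike the sed version, which would delete through the end of the file when the closing marker is missing, this sketch raises an error if either marker cannot be found.

```python
START_MARKER = "# This file was automatically generated by SWIG"
END_MARKER = "del _swig_python_version_info"


def make_init_source(swig_source, module_name="_mymodule"):
    """Replace Swig's loader preamble with a relative import."""
    lines = swig_source.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.startswith(START_MARKER)),
        None,
    )
    if start is None:
        raise ValueError("Swig header not found")
    end = next(
        (i for i in range(start + 1, len(lines)) if lines[i] == END_MARKER),
        None,
    )
    if end is None:
        raise ValueError("end of Swig loader not found")
    # Drop the loader (both markers inclusive), keep everything else.
    kept = lines[:start] + lines[end + 1:]
    return "\n".join(["from . import " + module_name] + kept) + "\n"


if __name__ == "__main__":
    sample = (
        "# This file was automatically generated by SWIG (http://www.swig.org).\n"
        "import _mymodule\n"
        "del _swig_python_version_info\n"
        "class Bar(object):\n"
        "    pass\n"
    )
    print(make_init_source(sample))
```

Failing loudly seemed preferable here, because a future Swig release might change the generated preamble and a silent no-op would produce a broken __init__.py.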

Packaging a precompiled module for PyPI

Precompiled modules and executables are uploaded to PyPI in the form of wheels.

Wheels are basically zip-files that have a certain filename and contain a certain directory layout. See PEP 427 – The Wheel Binary Package Format 1.0 for details.

For example, the filename of my wheel hext-0.2.0-cp37-cp37m-manylinux1_x86_64.whl tells us:

  • hext-0.2.0: The package provided is hext, in version 0.2.0
  • cp37: For CPython version 3.7
  • cp37m: Linked against the Python 3.7 Application Binary Interface (for example, Python 2.7 has ABIs for different Unicode string types)
  • manylinux1: It is compatible with systems that fulfill the manylinux1 platform tag
  • x86_64: Expected system architecture
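
The naming scheme is regular enough to take apart mechanically. The following sketch splits a wheel filename into the components defined by PEP 427 (distribution, version, optional build tag, Python tag, ABI tag, platform tag). WheelName and parse_wheel_filename are names chosen for this example, not an existing API.

```python
from collections import namedtuple

WheelName = namedtuple(
    "WheelName",
    "distribution version build python_tag abi_tag platform_tag",
)


def parse_wheel_filename(filename):
    """Split a wheel filename into its PEP 427 components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: " + filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 5:
        distribution, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between version and Python tag.
        distribution, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel filename: " + filename)
    return WheelName(
        distribution, version, build, python_tag, abi_tag, platform_tag
    )


if __name__ == "__main__":
    print(parse_wheel_filename("hext-0.2.0-cp37-cp37m-manylinux1_x86_64.whl"))
```

Note that the platform tag is the single component manylinux1_x86_64; the list above merely discusses its two halves separately.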

You can upload as many wheels as you want. The user’s package manager (pip) will choose the appropriate wheel for the user’s environment.

Building Wheels

Packaging Python modules is done through setuptools and a setup file traditionally named setup.py.

Example project layout:

# mymodule is the python module to package
mymodule/
    # module initialization, loads _mymodule.so
    __init__.py
    # shared library
    _mymodule.so
README.md
# packaging instructions for setuptools
setup.py

setup.py might look like this:

from setuptools import setup, dist
import os

# force setuptools to recognize that this is
# actually a binary distribution
class BinaryDistribution(dist.Distribution):
    def has_ext_modules(self):
        return True

# optional, use README.md as long_description
this_directory = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(this_directory, 'README.md')) as f:
    long_description = f.read()

setup(
    # this package is called mymodule
    name='mymodule',

    # this package contains one module,
    # which resides in the subdirectory mymodule
    packages=['mymodule'],

    # make sure the shared library is included
    package_data={'mymodule': ['_mymodule.so']},
    include_package_data=True,

    description="This is a short description",
    # optional, the contents of README.md that were read earlier
    long_description=long_description,
    long_description_content_type="text/markdown",

    # See class BinaryDistribution that was defined earlier
    distclass=BinaryDistribution,

    version='0.0.1',
    url='http://example.com/',
    author='...',
    author_email='...@example.com',
    # ...
)

If all the pieces are in the right place, you can now create a wheel:

# build a binary wheel for python 2.7
$ /opt/python/cp27-cp27m/bin/python setup.py bdist_wheel 
# check if all the files are there
$ unzip -l ./dist/mymodule-0.0.1-cp27-cp27m-linux_x86_64.whl
Archive:  ./dist/mymodule-0.0.1-cp27-cp27m-linux_x86_64.whl
----
mymodule-0.0.1.data/purelib/mymodule/__init__.py
mymodule-0.0.1.data/purelib/mymodule/_mymodule.so
mymodule-0.0.1.dist-info/top_level.txt
mymodule-0.0.1.dist-info/WHEEL
mymodule-0.0.1.dist-info/METADATA
mymodule-0.0.1.dist-info/RECORD
-------
6 files
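
Since wheels are zip files, the same check can be automated with the standard zipfile module. A minimal sketch, with missing_from_wheel being a made-up helper name:

```python
import zipfile


def missing_from_wheel(wheel_path, required_suffixes):
    """Return the required path suffixes that no archive member ends with."""
    with zipfile.ZipFile(wheel_path) as wheel:
        names = wheel.namelist()
    return [
        suffix
        for suffix in required_suffixes
        if not any(name.endswith(suffix) for name in names)
    ]
```

For the wheel above, missing_from_wheel(path, ["mymodule/__init__.py", "mymodule/_mymodule.so"]) should come back empty; a non-empty list names the files that did not make it into the archive.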

So this is the wheel: mymodule-0.0.1-cp27-cp27m-linux_x86_64.whl. Notice how it says -linux instead of -manylinux1. This is because setuptools cannot tell which subset of linux-systems this wheel might be compatible with. Fortunately, renaming the wheel is enough, i.e. replace linux with manylinux1:

# quote the variables so that paths containing spaces survive the rename
find . -iname "*.whl" | while IFS= read -r wheel ; do
  mv "$wheel" "$(echo "$wheel" | sed 's/-linux_/-manylinux1_/')"
done
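
If the build is driven from Python anyway, the rename could be sketched as follows. The function name retag_wheels is hypothetical. One difference from the shell loop: this version only rewrites the filename, never a directory name that happens to contain -linux_.

```python
import os


def retag_wheels(directory, old="-linux_", new="-manylinux1_"):
    """Rename *-linux_*.whl files below directory and return the new paths."""
    renamed = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            if name.lower().endswith(".whl") and old in name:
                target = os.path.join(root, name.replace(old, new, 1))
                os.rename(os.path.join(root, name), target)
                renamed.append(target)
    return renamed
```

Wheels that already carry the manylinux1 tag are left alone, because -manylinux1_ does not contain the substring -linux_.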

Publishing Wheels on pypi.org

Now the only task left is to create an account on pypi.org and to finally publish your wheel. The recommended way to upload wheels is via twine:

twine upload dist/*.whl

The complete setup.py for hext

As an example, this is the setup.py I used for packaging hext v0.2.0. Note that Hext also includes a command-line utility called htmlext.

from setuptools import setup, dist
from setuptools.command.install import install
import os

class BinaryDistribution(dist.Distribution):
    def has_ext_modules(self):
        return True

class PostInstallCommand(install):
    def run(self):
        install.run(self)
        if not os.path.isdir(self.install_scripts):
            os.makedirs(self.install_scripts)
        package_dir = os.path.dirname(os.path.abspath(__file__))
        binary_dir = os.path.join(package_dir, "bin")
        binary = "htmlext"
        source = os.path.join(binary_dir, binary)
        target = os.path.join(self.install_scripts, binary)
        if os.path.isfile(target):
            os.remove(target)
        self.copy_file(source, target)

this_directory = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(this_directory, 'README.md')) as f:
    long_description = f.read()

setup(
    name='hext',
    package_data={'hext': ['_hext.so', 'gumbo.license', 'rapidjson.license']},
    version='0.2.0',
    description="A module and command-line utility to extract structured data from HTML",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url='http://hext.thomastrapp.com/',
    author='Thomas Trapp',
    author_email='[redacted]',
    include_package_data=True,
    distclass=BinaryDistribution,
    cmdclass={'install': PostInstallCommand},
    packages=['hext'],
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: Apache Software License',
        'Operating System :: POSIX :: Linux',
        'Programming Language :: C++',
        'Topic :: Internet :: WWW/HTTP',
        'Topic :: Software Development :: Libraries :: Python Modules',
    ],
    keywords='html-extraction scraping html data-extraction',
    project_urls={
        'Github': 'https://github.com/thomastrapp/hext/',
        'Bug Reports': 'https://github.com/thomastrapp/hext/issues',
        'Author': 'https://thomastrapp.com/'
    },
)

And this is the directory layout:

bin/
    htmlext
hext/
    __init__.py
    _hext.so
    gumbo.license
    rapidjson.license
# MANIFEST.in content: "include bin/htmlext"
MANIFEST.in
README.md
setup.py
