Apache Superset from Scratch: Day 1 (Python Setup)
December 23, 2021
I'm on a quest to understand and map out as much of the Apache Superset code base as I can. In my day job, I use Superset daily, but I'm not intimately familiar with the code paths themselves. This series follows the process on an M1 MacBook Air, but it should generalize to most *nix systems.
My goal is to make noticeable progress on a daily basis. With the preamble out of the way, let's start!
The Superset codebase is large; where does one even begin? For new code bases, I generally like alternating between:
- breadth: starting with an overview of the development / contributor's guide
- depth: recursively going through each component & sub-component
For breadth, I'll start with the Setup Local Environment for Development section from CONTRIBUTING.MD.
Python 3.7.x or 3.8.x are recommended for running the Superset backend. I'm on a Mac, and prefer to leave the default python that ships with the operating system at 2.7.x. Instead, I'll use Homebrew to install Python 3.8:
brew install python@3.8
Now, both the python3 and pip3 commands work as expected (independent of the system python and pip commands):
pip3 --version
pip 21.2.4 from /opt/homebrew/lib/python3.8/site-packages/pip (python 3.8)
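As a small aside, the same version constraint can be checked from inside Python. This is only a sketch: the helper name is_recommended is hypothetical, and the set of versions simply mirrors the 3.7.x / 3.8.x recommendation quoted above.

```python
import sys

# Minor versions recommended for the Superset backend according to this post;
# adjust the set if CONTRIBUTING.MD changes.
RECOMMENDED = {(3, 7), (3, 8)}


def is_recommended(version_info=None):
    """Return True if the (major, minor) pair is one of the recommended ones."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in RECOMMENDED


if __name__ == "__main__":
    print(sys.executable, "recommended:", is_recommended())
```

Running this with python3 should report the Homebrew interpreter rather than the system one, assuming python3 resolves to the Homebrew install.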
Now it's time to create a Python virtual environment. A virtual environment is really a sandbox for your Python libraries that lives within a specific folder / project. This workflow gives you a few benefits:
- The virtual environment lives completely independently of the global Python installation
- It's super quick and easy to delete all of the project-specific Python libraries and re-install, as an escape hatch
- Less time wasted (not zero, sadly) dealing with version / dependency conflicts
Are there any downsides?
- The main one is increased storage requirements, because every Python project on your computer has its own copies of similar libraries
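To make the "sandbox" idea a bit more concrete, here is a minimal sketch of how a script can tell whether it is running inside a virtual environment. The function name in_virtualenv is my own; the sys.prefix and sys.base_prefix attributes are standard in Python 3.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # venv points sys.prefix at the environment folder and keeps the original
    # installation in sys.base_prefix; old virtualenv releases set sys.real_prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


if __name__ == "__main__":
    print("prefix:", sys.prefix)
    print("inside a virtual environment:", in_virtualenv())
```

If this reports False, any pip install goes into the global Python rather than the project folder.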
First, let me install virtualenv:
pip3 install virtualenv
Next, let's give our virtual environment a name. virtualenv creates a folder within your project folder and stuffs all of the Python libraries you install there. So we're really trying to decide on the name of this folder.
The CONTRIBUTING.MD file in the Superset repo suggests naming it venv:
python3 -m venv venv
- The first venv is the standard-library module that python3 -m runs (the name is short-hand for virtual environment)
- The second venv refers to the name of the folder we're creating (venv/)
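The same two pieces (the venv module and the folder name) show up if you create the environment from Python instead of the shell. This is a sketch under a few assumptions: create_env is a hypothetical helper, the demo uses a throwaway temp folder rather than a real Superset checkout, and pip is deliberately left out to keep it quick.

```python
import tempfile
import venv
from pathlib import Path


def create_env(project_dir, name="venv"):
    """Create a virtual environment folder called `name` inside project_dir."""
    env_dir = Path(project_dir) / name
    # with_pip=False skips bootstrapping pip, which keeps this sketch fast;
    # the command-line form installs pip by default.
    venv.create(str(env_dir), with_pip=False)
    return env_dir


if __name__ == "__main__":
    # Throwaway folder for the demo (an assumption, not the real repo).
    with tempfile.TemporaryDirectory() as project:
        env = create_env(project)
        print(sorted(p.name for p in env.iterdir()))
```

The folder that comes back contains a pyvenv.cfg file plus the interpreter and library folders, which is everything that "the environment" really is.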
Why should we name it venv/? One hint is in the .gitignore file, which specifies files & folder paths to ignore in version control. This means that each user can have their own local state and those details won't get checked into version control.
The .gitignore file itself is version controlled though. So this file provides a "universal" agreement between all of the contributors to Superset that these files should not be checked into version control. Let's search for any string values containing "env" in the .gitignore file:
cat .gitignore | grep 'env'
.env
.envrc
env
venv*
env_py3
envpy3
env36
venv
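For anyone less familiar with grep, the search is just a substring filter over the lines of the file. A rough Python equivalent is sketched below; grep_lines is a hypothetical helper, and the sample text reuses the entries from the output above plus one made-up line ("*.pyc") so there is something to filter out.

```python
def grep_lines(text, needle):
    """Return the lines of text that contain needle, like a plain-substring grep."""
    return [line for line in text.splitlines() if needle in line]


# Sample content: the matching entries come from the grep output in this post;
# "*.pyc" is a made-up extra line that should not match.
SAMPLE_GITIGNORE = """\
*.pyc
.env
.envrc
env
venv*
env_py3
envpy3
env36
venv
"""

if __name__ == "__main__":
    for line in grep_lines(SAMPLE_GITIGNORE, "env"):
        print(line)
```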
While some open source projects use the .venv/ convention for virtualenv, the Superset one uses venv it seems. So this means:
- we can party in our local venv/ and none of those changes will make it into any code PRs we may want to make
- if we want to use .venv/ instead, the git version control system will detect the new folder as a change, since none of the patterns above match it
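Here is a rough way to sanity-check that reasoning. It is only an approximation: real gitignore matching has more rules (negation, slashes, nested paths) than the simple glob comparison below, and roughly_ignored is a hypothetical helper. For an authoritative answer inside the repo, git check-ignore -v followed by a path is the tool to reach for.

```python
from fnmatch import fnmatchcase

# The "env"-related patterns reported by the grep on .gitignore.
PATTERNS = [".env", ".envrc", "env", "venv*", "env_py3", "envpy3", "env36", "venv"]


def roughly_ignored(name, patterns=PATTERNS):
    """Approximate check: does a top-level file or folder name match any pattern?

    This compares one name against simple glob patterns only; it does not
    implement the full gitignore rules.
    """
    return any(fnmatchcase(name, pattern) for pattern in patterns)


if __name__ == "__main__":
    for candidate in ("venv", ".venv"):
        print(candidate, roughly_ignored(candidate))
```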
Let's stick to the community convention, and run the suggested command:
python3 -m venv venv
If we run ls while within the superset/ folder, we'll see venv listed as a folder. Success!
Usually, the Python requirements are specified in a requirements.txt file. In the case of Superset, we're blessed with a folder of .txt files. There's a lot we could explore and unpack here, but I'm going to focus on getting everything set up first.
If we look to CONTRIBUTING.MD, we see:
pip install -r requirements/testing.txt
If we open that file, we see something that resembles a standard requirements.txt file, but with this header:
# This file is autogenerated by pip-compile-multi
I've made a mental note to investigate & explore pip-compile-multi later, a library for compiling multiple requirement files. For now, let's run the following command to install the dependencies:
pip3 install -r requirements/testing.txt
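Since the file is a list of exact pins, it can help to see how little structure there is to it. The sketch below reads name==version pairs out of a requirements-style text. Everything here is illustrative: parse_pins is a hypothetical helper, the sample is not the real contents of requirements/testing.txt (only the mysqlclient pin is taken from this post, and examplepkg is made up), and pip's own parser handles many more cases than this does.

```python
def parse_pins(text):
    """Return {name: version} for every `name==version` line in text."""
    pins = {}
    for raw in text.splitlines():
        # Drop comments (naive: would also cut URLs containing '#').
        line = raw.split("#", 1)[0].strip()
        # Skip blanks and option lines such as "-r other.txt" or "-e ."
        if not line or line.startswith("-"):
            continue
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


# Made-up sample in the style of a compiled requirements file.
SAMPLE = """\
# This file is autogenerated by pip-compile-multi
-r base.txt
examplepkg==1.2.3
    # via -r requirements/example.in
mysqlclient==2.1.0
"""

if __name__ == "__main__":
    for name, version in parse_pins(SAMPLE).items():
        print(name, version)
```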
Error 1: MySQL
I ran into this issue, with scary red error text, on my M1 MacBook:
Collecting mysqlclient==2.1.0
  Using cached mysqlclient-2.1.0.tar.gz (87 kB)
  ERROR: Command errored out with exit status 1:
   command: /firstname.lastname@example.org/bin/python3.8 -c 'import io, os, sys, setuptools, tokenize; sys.argv = '"'"'/private/var/folders/6d/f0fzvlyn6sd58q5rmx6s6df00000gn/T/pip-install-6c548wua/mysqlclient_a8c054d3233d4d00acb42d6a6bf2a562/setup.py'"'"'; __file__='"'"'/private/var/folders/6d/f0fzvlyn6sd58q5rmx6s6df00000gn/T/pip-install-6c548wua/mysqlclient_a8c054d3233d4d00acb42d6a6bf2a562/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/6d/f0fzvlyn6sd58q5rmx6s6df00000gn/T/pip-pip-egg-info-0735tk4h
WARNING: Discarding https://files.pythonhosted.org/packages/de/79/d02be3cb942afda6c99ca207858847572e38146eb73a7c4bfe3bdf154626/mysqlclient-2.1.0.tar.gz#sha256=973235686f1b720536d417bf0a0d39b4ab3d5086b2b6ad5e6752393428c02b12 (from https://pypi.org/simple/mysqlclient/) (requires-python:>=3.5). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement mysqlclient==2.1.0 (from versions: 1.3.0, 1.3.1, 1.3.2, 1.3.3, 1.3.4, 1.3.5, 1.3.6, 1.3.7, 1.3.8, 1.3.9, 1.3.10, 1.3.11rc1, 1.3.11, 1.3.12, 1.3.13, 1.3.14, 1.4.0rc1, 1.4.0rc2, 1.4.0rc3, 1.4.0, 1.4.1, 1.4.2, 1.4.2.post1, 1.4.3, 1.4.4, 1.4.5, 1.4.6, 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.1.0rc1, 2.1.0)
ERROR: No matching distribution found for mysqlclient==2.1.0
Some StackOverflow sleuthing suggested that I needed to install the MySQL server via Homebrew so that the installation process for the Python client library would work. So this may not be an M1-related issue after all:
brew install mysql
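My understanding (an assumption on my part, not something I verified in the mysqlclient source for this post) is that the build looks for a helper executable called mysql_config, which the Homebrew package provides. A quick, hedged way to check for that kind of prerequisite before re-running pip is sketched below; missing_tools is a hypothetical helper.

```python
import shutil


def missing_tools(names):
    """Return the subset of executable names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # mysql_config is assumed to be what the mysqlclient build needs.
    print("missing:", missing_tools(["mysql_config"]))
```

An empty list means the executable is visible to the shell that pip runs in.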
Error 2: Postgres
Once mysqlclient installed successfully, pip got stuck on Postgres:
Error: pg_config executable not found.

pg_config is required to build psycopg2 from source. Please add the directory
containing pg_config to the $PATH or specify the full executable path with the
option:

    python setup.py build_ext --pg-config /path/to/pg_config build ...

or with the pg_config option in 'setup.cfg'.

If you prefer to avoid building psycopg2 from source, please install the PyPI
'psycopg2-binary' package instead.
I'm going to move forward with finding the path to the pg_config file and adding it to my PATH. I'll first crack open the Postgres.app folder.
After jumping through folders, I found the pg_config executable. As suggested on StackOverflow, I'm going to add that executable's folder to my PATH.
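As a rough sketch of what adding a folder to PATH accomplishes, the snippet below prepends a folder and then asks Python to resolve pg_config against the new value. The folder here is a temp folder with a fake pg_config script standing in for the real Postgres bin folder, and prepend_to_path is a hypothetical helper; the demo assumes a *nix system.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def prepend_to_path(directory, current=None):
    """Return a PATH string with directory placed in front of the current value."""
    if current is None:
        current = os.environ.get("PATH", "")
    return os.pathsep.join([str(directory), current]) if current else str(directory)


if __name__ == "__main__":
    # Stand-in for the real Postgres bin folder: a temp folder with a fake pg_config.
    with tempfile.TemporaryDirectory() as bin_dir:
        fake = Path(bin_dir) / "pg_config"
        fake.write_text("#!/bin/sh\necho fake\n")
        fake.chmod(fake.stat().st_mode | stat.S_IXUSR)
        new_path = prepend_to_path(bin_dir)
        # Resolves to the fake executable because its folder is searched first.
        print(shutil.which("pg_config", path=new_path))
```

Putting the folder at the front means it wins over any other pg_config that might already be on PATH.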
Now when I run pip3 install -r requirements/testing.txt again, everything works beautifully!
Now, we're ready to install Superset in "editable" mode. Editable mode lets us modify and test code changes in Superset quickly, which is ideal when developing features or fixing bugs.
pip3 install -e .
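One way to see what editable mode changes is to ask Python where it would import a module from. With a regular install the answer sits under site-packages; with an editable install I would expect it to point into the source checkout instead. The helper module_origin below is hypothetical, and the demo uses the standard library's json module so it runs anywhere; swapping in "superset" is left as an experiment.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    print("json ->", module_origin("json"))
    print("superset ->", module_origin("superset"))
```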
To test the installation, run the superset command and the Superset CLI should appear.
That's it for Day 1. In Day 2, I'll play with setting up the metadata database, creating roles & permissions, loading example data, and starting the backend server.
If you want to follow along, use the RSS feed. Stay tuned! 📺