Srini Kadamati

Apache Superset from Scratch: Day 6 (Database Class)

December 29, 2021

I ended Day 5 with the knowledge of the Superset shell and a hunch that it might be a better tool for understanding the different code paths for how an example is loaded.

Now I'm going to try running some commands to begin emulating what the app is doing when loading an example. First things first, let's run the utils.get_example_database() function call:

>>> from superset.utils import core as utils
>>> database = utils.get_example_database()
>>> database
examples

Interesting. Superset returns the string value "examples". This is likely just the string representation of the returned "Database" object. We know that the examples database in our Superset installation lives in my home directory, as a SQLite file. So running the next command within the load_world_bank_health_n_pop() function should give us that information:

>>> engine = database.get_sqla_engine()
>>> engine
Engine(sqlite:////Users/srinik/.superset/superset.db)

Success!

The Superset Database Class

Next, I want to better understand the returned database object. The class for Database is defined in superset/superset/models/core.py:

class Database(
    Model, AuditMixinNullable, ImportExportMixin
):  # pylint: disable=too-many-public-methods

    """An ORM object that stores Database related information"""

    __tablename__ = "dbs"
    type = "table"
    __table_args__ = (UniqueConstraint("database_name"),)

    id = Column(Integer, primary_key=True)
    verbose_name = Column(String(250), unique=True)
    # short unique name, used in permissions
    database_name = Column(String(250), unique=True, nullable=False)
    ...

This is the same file that has the model definition for the CSSTemplate class, as I stumbled into earlier in this series! At the top of core.py is the following text:

A collection of ORM sqlalchemy models for Superset

This file contains the class definitions for the following models:

Let's dive deeper into the Database class!

Columns / Fields

The Database class defined in core.py maps to the "dbs" table in the metadata database, as suggested by this line of code:

__tablename__ = "dbs"

What other columns are defined?

String Representation

We know from our earlier exploration that running print() on a Superset Database object returns the database name. This aligns with the __repr__() definition for this model!

def __repr__(self) -> str:
        return self.name

Name Attribute

If I call the .name attribute on a Database object, the following is evaluated:

def name(self) -> str:
    return self.verbose_name if self.verbose_name else self.database_name

Interesting -- now we know how verbose_name is used! It's the preference for showing to humans, and database_name is the backup value displayed.

Data Attribute

What's next? The .data attribute looks interesting:

@property
    def data(self) -> Dict[str, Any]:

I want to run this for my Examples SQLite database and see what's returned:

>>> database.data
    {'id': 1, 
    'name': 'examples', 
    'backend': 'sqlite', 
    'configuration_method': 'sqlalchemy_form', 
    'allow_multi_schema_metadata_fetch': False, 
    'allows_subquery': True, 
    'allows_cost_estimate': False, 
    'allows_virtual_table_explore': True, 
    'explore_database_id': 1, 
    'parameters': {}, 
    'parameters_schema': {}}

Reserved Words

Neat! I can also retrieve the reserved words for the database:

>>> database.get_reserved_words()
{'right', 'select', 'check', 'having', 'virtual', 'before', 'fail', 'conflict', 'current_timestamp', 'escape', 'full', 'case', 'references', 'drop', 'begin', 'cast', 'view', 'of', 'insert', 'on', 'outer', 'cascade', 'in', 'attach', 'inner', 'vacuum', 'deferred', 'add', 'for', 'temporary', 'union', 'update', 'offset', 'as', 'where', 'transaction', 'explain', 'indexed', 'group', 'limit', 'to', 'pragma', 'unique', 'raise', 'initially', 'distinct', 'column', 'asc', 'notnull', 'null', 'between', 'rollback', 'end', 'when', 'deferrable', 'detach', 'match', 'all', 'temp', 'isnull', 'join', 'trigger', 'query', 'from', 'autoincrement', 'ignore', 'after', 'table', 'order', 'alter', 'reindex', 'is', 'intersect', 'primary', 'then', 'and', 'set', 'like', 'index', 'by', 'default', 'else', 'rename', 'plan', 'except', 'row', 'instead', 'natural', 'analyze', 'foreign', 'database', 'if', 'current_time', 'glob', 'current_date', 'cross', 'key', 'values', 'into', 'constraint', 'exists', 'left', 'delete', 'each', 'or', 'false', 'commit', 'exclusive', 'immediate', 'restrict', 'not', 'create', 'desc', 'true', 'using', 'replace', 'collate'}

Previewing Raw Data

I want to peek at the data in my SQLite database. The DB Browser for SQLite app on Mac is a good option for this. Better yet, once installed, I can use my terminal to pass the app the file location of my sqlite DB!

open -a "DB Browser for SQLite" /Users/srinik/.superset/superset.db

And voila!

DB Browser

Let's preview the dbs table (which corresponds to the Database model).

DB Browser DBs

It's nice to see all of the columns reflected here from the Database model.

World Health Dashboard: Examples Database

Let's revisit the load_world_bank_health_n_pop() function that loads the World Health Dashboard.

def load_world_bank_health_n_pop(  # pylint: disable=too-many-locals, too-many-statements
    only_metadata: bool = False, force: bool = False, sample: bool = False,
) -> None:
    """Loads the world bank health dataset, slices and a dashboard"""

    tbl_name = "wb_health_population"
    database = utils.get_example_database()
    engine = database.get_sqla_engine()
    schema = inspect(engine).default_schema_name
    table_exists = database.has_table_by_name(tbl_name)

This code does the following:

World Health Dashboard: Pandas Transformation

As someone who's spent years writing pandas code, the next part of of the load_world_bank_health_n_pop() function looks very familiar:

if not only_metadata and (not table_exists or force):
        data = get_example_data("countries.json.gz")
        pdf = pd.read_json(data)
        pdf.columns = [col.replace(".", "_") for col in pdf.columns]
        if database.backend == "presto":
            pdf.year = pd.to_datetime(pdf.year)
            pdf.year = pdf.year.dt.strftime("%Y-%m-%d %H:%M%:%S")
        else:
            pdf.year = pd.to_datetime(pdf.year)
        pdf = pdf.head(100) if sample else pdf

        pdf.to_sql(
            tbl_name,
            engine,
            schema=schema,
            if_exists="replace",
            chunksize=50,
            dtype={
                # TODO(bkyryliuk): use TIMESTAMP type for presto
                "year": DateTime if database.backend != "presto" else String(255),
                "country_code": String(3),
                "country_name": String(255),
                "region": String(255),
            },
            method="multi",
            index=False,
        )

Here's my breakdown of the code:

Phew! That's it for today. Tomorrow, I want to finish understanding how the Superset-specific metadata is loaded.