This article is the day-6 entry of the VOYAGE GROUP Engineer Blog Advent Calendar 2014.
Hello, I'm [@hagino3000](https://twitter.com/hagino3000), doing data scientist work on the side at VOYAGE GROUP.
With the recent BigQuery momentum, I think many people have started loading their analysis data into BigQuery. However, once you start using BigQuery, test code for things like aggregation batches can no longer be self-contained in the local environment, and you end up wanting to hit BigQuery itself. This article introduces several approaches.
The sample code uses Python + nose + BigQuery-Python.
The reason I worry about test code in the first place is that BigQuery has two characteristics: a query takes a long time (several seconds or more) to come back, and there is no local environment to run it in. The query latency in particular is something I want to keep as short as possible in tests.
For example, the test code of BigQuery-Python itself does not access BigQuery at all.
https://github.com/tylertreat/BigQuery-Python/blob/master/bigquery/tests/test_client.py
This has the advantage of running fast, but it cannot confirm whether INSERT processing or SELECT statements actually work, and the test code ends up full of mocks.
Simply put, it would be enough to have a dataset dedicated to unit tests, but sharing one would cause interference when several people run the tests at the same time, so a separate dataset is needed for each test run. This is the same thing Django does by creating a fresh database every time you run its tests.
First, here is the code that creates a disposable dataset (and its tables).
tests/helper.py
# coding=utf-8
from datetime import datetime
import glob
import json
import os
import random
import re


def setup_dataset(client, test_name):
    """
    Prepare a dataset for testing.

    Parameters
    ----------
    client : bigquery.client
        See https://github.com/tylertreat/BigQuery-Python
    test_name : string
        Used as part of the generated dataset ID

    Returns
    -------
    dataset_id : string
        ID of the created dataset (ex. ut_hoge_test_359103)
    schemas : dict (key: string, value: list)
        The key is the table name and the value is the schema definition
        list used to create the table
    """
    # Create the dataset
    dataset_id = 'ut_%s_%d' % (test_name, int(random.random() * 1000000))
    client.create_dataset(
        dataset_id,
        friendly_name='For unit test started at %s' % datetime.now())

    # Create tables from the schema definition files
    schemas = {}
    BASE_DIR = os.path.normpath(os.path.join(os.path.dirname(__file__), '../'))
    for schema_file in glob.glob(os.path.join(BASE_DIR, 'schema/*.json')):
        table_name = re.search(r'([^/.]+)\.json$', schema_file).group(1)
        schema = json.loads(open(schema_file).read())
        schemas[table_name] = schema
        client.create_table(dataset_id, table_name, schema)

    return dataset_id, schemas
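For reference, the schema definition files read above are ordinary BigQuery table schema definitions (a JSON array of field descriptions). A hypothetical schema/events.json, with made-up field names just for illustration, could look like this:

[
  {"name": "time", "type": "TIMESTAMP", "mode": "REQUIRED"},
  {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "event", "type": "STRING", "mode": "NULLABLE"}
]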
The dataset is created in the setup of the test module. In addition, the function that returns the dataset ID is mocked so that the code under test uses this dataset.
test_hoge.py
# coding=utf-8
import time

import mock
from nose.tools import eq_
from nose.plugins.attrib import attr

import myapp.bq
import myapp.calc_daily_state

from . import helper

dataset_id = None
bq_client = None

# Run in parallel
_multiprocess_can_split_ = True


def setup():
    global dataset_id
    global bq_client
    # Get a BigQuery-Python client instance
    bq_client = myapp.bq.get_client(readonly=False)
    # Create the dataset
    dataset_id, schemas = helper.setup_dataset(bq_client, 'test_hoge')
    # Mock the function that returns the dataset ID
    myapp.bq.get_dataset_id = mock.Mock(return_value=dataset_id)
    # INSERT the test data
    bq_client.push_rows(dataset_id, 'events', [....omitted....])
    # A query issued immediately after an INSERT may not see the rows yet, so sleep
    time.sleep(10)


@attr('slow')
def test_calc_dau():
    # Test that references BigQuery
    ret = myapp.calc_daily_state.calc_dau('2014/08/01')
    eq_(ret, "....omitted....")


@attr('slow')
def test_calc_new_user():
    # Test that references BigQuery
    ret = myapp.calc_daily_state.calc_new_user('2014/08/01')
    eq_(ret, "....omitted....")


def teardown():
    # It may be better to leave the dataset undeleted when a test has failed
    bq_client.delete_dataset(dataset_id)
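For context, myapp.bq above is the application's own thin wrapper. A minimal sketch of what it might look like, assuming BigQuery-Python's bigquery.get_client (the project ID, service account and key file below are placeholders, and the exact keyword arguments can differ between library versions):

# myapp/bq.py -- sketch only; all values are placeholders
from bigquery import get_client as bq_get_client

PROJECT_ID = 'my-project-id'
SERVICE_ACCOUNT = 'xxxx@developer.gserviceaccount.com'
PRIVATE_KEY_FILE = '/path/to/key.pem'


def get_client(readonly=True):
    # Build a BigQuery-Python client for this project
    return bq_get_client(
        PROJECT_ID,
        service_account=SERVICE_ACCOUNT,
        private_key_file=PRIVATE_KEY_FILE,
        readonly=readonly)


def get_dataset_id():
    # Dataset used outside of tests; mocked out in the tests above
    return 'production_dataset'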
In this example the code under test is assumed to be read-only, so the dataset only needs to be created once. Each test case still takes about 5 seconds, so I want to stick to one test, one assert.
Creating the dataset and loading the data in setup takes about a second. Since every case takes time, the total run can be shortened to some extent by running the tests in parallel.
# Run the tests in 5 parallel processes
nosetests --processes=5 --process-timeout=30
Multiprocess: parallel testing — nose 1.3.4 documentation http://nose.readthedocs.org/en/latest/plugins/multiprocess.html
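Since the tests that touch BigQuery are tagged with @attr('slow'), nose's attrib plugin can also be used to separate them from the fast, mock-only tests:

# Run everything except the slow tests
nosetests -a '!slow'

# Run only the slow tests, in parallel
nosetests -a slow --processes=5 --process-timeout=30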
In the example above the dataset was created in the module-level setup, but once you test a process that performs INSERTs, the tests have to be isolated from each other, and this takes even longer: if you run the verification query immediately after the INSERT, the rows do not show up yet. After the INSERT you have to sleep for a few seconds and then run the query (which itself takes about 5 seconds) to check the result.
test_fuga.py
# Run in parallel
_multiprocess_can_split_ = True


@attr('slow')
class TestFugaMethodsWhichHasInsert(object):

    def setup(self):
        # Create a dataset
        # (omitted)
        self.dataset_id = dataset_id
        self.bq_client = bq_client

    def test_insert_foo(self):
        # Test of a process that performs INSERT
        pass  # (omitted)

    def test_insert_bar(self):
        # Test of a process that performs INSERT
        pass  # (omitted)

    def teardown(self):
        self.bq_client.delete_dataset(self.dataset_id)
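Instead of a fixed sleep after every INSERT, a small polling helper can trim the waiting time a little. This is only a sketch; poll_fn is whatever zero-argument function you pass in to run the verification query and return the rows it found (the name is made up for this example):

import time


def wait_for_rows(poll_fn, timeout=30, interval=2):
    # Call poll_fn repeatedly until it returns rows or the timeout expires
    deadline = time.time() + timeout
    while time.time() < deadline:
        rows = poll_fn()
        if rows:
            return rows
        time.sleep(interval)
    raise AssertionError('no rows became visible within %d seconds' % timeout)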
You will be dozing off by the time such a test run finishes, so let a CI tool do it. In this case the tests can be run in parallel, because the interference between them has been removed.
So the realistic options come down to something like the following:

- Use the real BigQuery only for the methods that actually issue queries or insert data, and keep those in a separate directory as slow tests (see the attrib example above for selecting them). For everything else, mock the part that returns the query result.
- Don't write tests for analysis-task code at all.
- Wait for something like DynamoDB Local to appear for BigQuery, or build one.
None of these options is clearly the best right now. If you want fewer mocks and simpler test code, reference BigQuery directly; conversely, if speed matters, use mocks. Remove dependencies between tests so that they can run in parallel. Creating a dataset doesn't take long, so it is fine to create one for every test.
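For the fast tests on the mock side, the idea is simply to patch the function that returns the query result. A sketch, where myapp.bq.query_rows and the expected return value of calc_dau are assumptions made up for this example:

import mock
from nose.tools import eq_

import myapp.calc_daily_state


def test_calc_dau_with_mocked_query():
    # Replace the (hypothetical) query function so no request reaches BigQuery
    fake_rows = [{'user_id': 'u1'}, {'user_id': 'u2'}]
    with mock.patch('myapp.bq.query_rows', return_value=fake_rows):
        ret = myapp.calc_daily_state.calc_dau('2014/08/01')
    eq_(ret, 2)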
I would appreciate it if you could tell me if there is a better pattern.
Tomorrow's entry is by @brtriver. Look forward to it!
As an aside, let's see what the test code for the bq command, which is written in Python, looks like. You can issue a query with `bq query xxx`, so there ought to be a test for that.
https://code.google.com/p/google-bigquery-tools/source/browse/bq/bigquery_client_test.py
There is no test that actually executes a query. (´・ω・`)