This article is the day-6 entry of the VOYAGE GROUP Engineer Blog Advent Calendar 2014.
Hello, I'm [@hagino3000](https://twitter.com/hagino3000), doing data scientist work on the side at VOYAGE GROUP.
With the recent BigQuery momentum, I think many people have started loading their analysis data into BigQuery. However, once you start using BigQuery, test code for things like aggregation batches can no longer be self-contained in the local environment, and you end up wanting to hit BigQuery itself. This article introduces several approaches.
The sample code uses Python + nose + BigQuery-Python.
The reason I worry about test code in the first place is that BigQuery has two characteristics: a query takes a long time (several seconds or more) to come back, and there is no local environment to run it in. The query latency in particular is something I want to keep as short as possible in tests.
For example, the test code of BigQuery-Python itself does not access BigQuery at all.
https://github.com/tylertreat/BigQuery-Python/blob/master/bigquery/tests/test_client.py
This has the advantage of running fast, but it cannot confirm whether INSERT processing or SELECT statements actually work, and the test code ends up full of mocks.
Simply put, it would be enough to have a dataset dedicated to unit tests, but sharing one would cause interference when several people run the tests at the same time, so a separate dataset is needed for each test run. This is the same thing Django does by creating a fresh database every time you run its tests.
First, here is the code that creates a disposable dataset (and its tables).
tests/helper.py
# coding=utf-8
from datetime import datetime
import glob
import json
import os
import random
import re


def setup_dataset(client, test_name):
    """
    Prepare a dataset for testing.

    Parameters
    ----------
    client : bigquery.client
        See https://github.com/tylertreat/BigQuery-Python
    test_name : string
        Used as part of the generated dataset ID

    Returns
    -------
    dataset_id : string
        ID of the created dataset (ex. ut_hoge_test_359103)
    schemas : dict (key: string, value: list)
        The key is the table name and the value is the schema definition
        list used to create the table
    """
    # Create the dataset
    dataset_id = 'ut_%s_%d' % (test_name, int(random.random() * 1000000))
    client.create_dataset(
        dataset_id,
        friendly_name='For unit test started at %s' % datetime.now())

    # Create tables from the schema definition files
    schemas = {}
    BASE_DIR = os.path.normpath(os.path.join(os.path.dirname(__file__), '../'))
    for schema_file in glob.glob(os.path.join(BASE_DIR, 'schema/*.json')):
        table_name = re.search(r'([^/.]+)\.json$', schema_file).group(1)
        schema = json.loads(open(schema_file).read())
        schemas[table_name] = schema
        client.create_table(dataset_id, table_name, schema)

    return dataset_id, schemas
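For reference, the schema definition files read above are ordinary BigQuery table schema definitions (a JSON array of field descriptions). A hypothetical schema/events.json, with made-up field names just for illustration, could look like this:

[
  {"name": "time", "type": "TIMESTAMP", "mode": "REQUIRED"},
  {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "event", "type": "STRING", "mode": "NULLABLE"}
]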
The dataset is created in the setup of the test module. In addition, the function that returns the dataset ID is mocked so that the code under test uses this dataset.
test_hoge.py
# coding=utf-8
import time

import mock
from nose.tools import eq_
from nose.plugins.attrib import attr

import myapp.bq
import myapp.calc_daily_state

from . import helper

dataset_id = None
bq_client = None

# Run in parallel
_multiprocess_can_split_ = True


def setup():
    global dataset_id
    global bq_client
    # Get a BigQuery-Python client instance
    bq_client = myapp.bq.get_client(readonly=False)
    # Create the dataset
    dataset_id, schemas = helper.setup_dataset(bq_client, 'test_hoge')
    # Mock the function that returns the dataset ID
    myapp.bq.get_dataset_id = mock.Mock(return_value=dataset_id)
    # INSERT the test data
    bq_client.push_rows(dataset_id, 'events', [....omitted....])
    # A query issued immediately after an INSERT may not see the rows yet, so sleep
    time.sleep(10)


@attr('slow')
def test_calc_dau():
    # Test that references BigQuery
    ret = myapp.calc_daily_state.calc_dau('2014/08/01')
    eq_(ret, "....omitted....")


@attr('slow')
def test_calc_new_user():
    # Test that references BigQuery
    ret = myapp.calc_daily_state.calc_new_user('2014/08/01')
    eq_(ret, "....omitted....")


def teardown():
    # It may be better to leave the dataset undeleted when a test has failed
    bq_client.delete_dataset(dataset_id)
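For context, myapp.bq above is the application's own thin wrapper. A minimal sketch of what it might look like, assuming BigQuery-Python's bigquery.get_client (the project ID, service account and key file below are placeholders, and the exact keyword arguments can differ between library versions):

# myapp/bq.py -- sketch only; all values are placeholders
from bigquery import get_client as bq_get_client

PROJECT_ID = 'my-project-id'
SERVICE_ACCOUNT = 'xxxx@developer.gserviceaccount.com'
PRIVATE_KEY_FILE = '/path/to/key.pem'


def get_client(readonly=True):
    # Build a BigQuery-Python client for this project
    return bq_get_client(
        PROJECT_ID,
        service_account=SERVICE_ACCOUNT,
        private_key_file=PRIVATE_KEY_FILE,
        readonly=readonly)


def get_dataset_id():
    # Dataset used outside of tests; mocked out in the tests above
    return 'production_dataset'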
In this example the code under test is assumed to be read-only, so the dataset only needs to be created once. Each test case still takes about 5 seconds, so I want to stick to one test, one assert.
Creating the dataset and loading the data in setup takes about a second. Since every case takes time, the total run can be shortened to some extent by running the tests in parallel.
# Run the tests in 5 parallel processes
nosetests --processes=5 --process-timeout=30
Multiprocess: parallel testing — nose 1.3.4 documentation http://nose.readthedocs.org/en/latest/plugins/multiprocess.html
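Since the tests that touch BigQuery are tagged with @attr('slow'), nose's attrib plugin can also be used to separate them from the fast, mock-only tests:

# Run everything except the slow tests
nosetests -a '!slow'

# Run only the slow tests, in parallel
nosetests -a slow --processes=5 --process-timeout=30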
In the example above the dataset was created in the module-level setup, but once you test a process that performs INSERTs, the tests have to be isolated from each other, and this takes even longer: if you run the verification query immediately after the INSERT, the rows do not show up yet. After the INSERT you have to sleep for a few seconds and then run the query (which itself takes about 5 seconds) to check the result.
test_fuga.py
# Run in parallel
_multiprocess_can_split_ = True


@attr('slow')
class TestFugaMethodsWhichHasInsert(object):

    def setup(self):
        # Create a dataset
        # (omitted)
        self.dataset_id = dataset_id
        self.bq_client = bq_client

    def test_insert_foo(self):
        # Test of a process that performs INSERT
        pass  # (omitted)

    def test_insert_bar(self):
        # Test of a process that performs INSERT
        pass  # (omitted)

    def teardown(self):
        self.bq_client.delete_dataset(self.dataset_id)
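Instead of a fixed sleep after every INSERT, a small polling helper can trim the waiting time a little. This is only a sketch; poll_fn is whatever zero-argument function you pass in to run the verification query and return the rows it found (the name is made up for this example):

import time


def wait_for_rows(poll_fn, timeout=30, interval=2):
    # Call poll_fn repeatedly until it returns rows or the timeout expires
    deadline = time.time() + timeout
    while time.time() < deadline:
        rows = poll_fn()
        if rows:
            return rows
        time.sleep(interval)
    raise AssertionError('no rows became visible within %d seconds' % timeout)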
You will be dozing off by the time such a test run finishes, so let a CI tool do it. In this case the tests can be run in parallel, because the interference between them has been removed.
So the realistic options come down to something like the following:

- Use the real BigQuery only for the methods that actually issue queries or insert data, and keep those in a separate directory as slow tests (see the attrib example above for selecting them). For everything else, mock the part that returns the query result.
- Don't write tests for analysis-task code at all.
- Wait for something like DynamoDB Local to appear for BigQuery, or build one.
None of these options is clearly the best right now. If you want fewer mocks and simpler test code, reference BigQuery directly; conversely, if speed matters, use mocks. Remove dependencies between tests so that they can run in parallel. Creating a dataset doesn't take long, so it is fine to create one for every test.
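For the fast tests on the mock side, the idea is simply to patch the function that returns the query result. A sketch, where myapp.bq.query_rows and the expected return value of calc_dau are assumptions made up for this example:

import mock
from nose.tools import eq_

import myapp.calc_daily_state


def test_calc_dau_with_mocked_query():
    # Replace the (hypothetical) query function so no request reaches BigQuery
    fake_rows = [{'user_id': 'u1'}, {'user_id': 'u2'}]
    with mock.patch('myapp.bq.query_rows', return_value=fake_rows):
        ret = myapp.calc_daily_state.calc_dau('2014/08/01')
    eq_(ret, 2)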
I would appreciate it if you could tell me if there is a better pattern.
Tomorrow's entry is by @brtriver. Look forward to it!
As an aside, let's see what the test code for the bq command, which is written in Python, looks like. You can issue a query with `bq query xxx`, so there ought to be a test for that.
https://code.google.com/p/google-bigquery-tools/source/browse/bq/bigquery_client_test.py
There is no test that actually executes a query. (´・ω・`)