Automating tests for Python programs that read and write Amazon Redshift on AWS: a record of the workarounds I devised at the time.
**Environment used**
I'm using Redshift as the database here, but I think the same approach can be adapted to other database products.
I wanted to verify the result data that the program under test wrote to the database (Redshift). In other words, I wanted to do something like DBUnit in Java/JUnit: verify the data while connected to the database, without going through a DAO.
Automatic test flow
The flow consists of the following phases: before the whole pytest session, save all Redshift data to S3 and restore it afterward; before each test function, clean Redshift and load the test data; then run the program under test and verify the results.
To achieve this, I mainly used pytest fixture functions.
Before pytest runs the program under test (Python) that reads and writes Redshift, all data is saved to S3 so that the data stored in Redshift is not corrupted. After the pytest run completes, all the data is restored, returning Redshift to its pre-run state.
Implementation example
```python
import pytest

# Specifying scope='session' runs this fixture once, before and after
# the whole series of pytest executions
@pytest.fixture(scope='session')
def fixture_session():
    # UNLOAD all data from Redshift to S3 here
    yield
    # COPY all the data back from S3 to Redshift here
```
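As a sketch of what the UNLOAD/COPY steps might look like, assuming the redshift_connector driver and placeholder values for the backup bucket, table list, and IAM role (adapt these to your environment):

```python
import redshift_connector

# Placeholder backup location, tables, and IAM role for illustration
BACKUP_PREFIX = 's3://my-backup-bucket/pytest-backup'
IAM_ROLE = 'arn:aws:iam::123456789012:role/RedshiftS3Role'
TABLES = ['public.orders', 'public.customers']

def unload_all(conn):
    # Save every table to S3 before the test session starts
    cur = conn.cursor()
    for table in TABLES:
        cur.execute(
            f"UNLOAD ('SELECT * FROM {table}') "
            f"TO '{BACKUP_PREFIX}/{table}/' "
            f"IAM_ROLE '{IAM_ROLE}' FORMAT AS PARQUET ALLOWOVERWRITE"
        )

def restore_all(conn):
    # Empty each table and load the saved data back after the session
    cur = conn.cursor()
    for table in TABLES:
        cur.execute(f"TRUNCATE {table}")
        cur.execute(
            f"COPY {table} FROM '{BACKUP_PREFIX}/{table}/' "
            f"IAM_ROLE '{IAM_ROLE}' FORMAT AS PARQUET"
        )
    conn.commit()
```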
Because the restore runs after the yield, the data is restored safely even if you cancel pytest partway through. Note that if Redshift already holds a large amount of data, saving and restoring it takes a long time, and pytest runs will not be smooth.
**Spectrum case**
- To save the actual data objects in S3, move them to another S3 path, preserving the same prefix/partition structure (see the sketch below)
- Get the S3 path of each Spectrum table from the "location" column of the SVV_EXTERNAL_TABLES view
- Note that when an S3 object is moved, its ACL permissions revert to the default settings
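As a sketch of that save step, assuming boto3 and a hypothetical backup bucket (the copy-then-delete is what resets the object ACLs mentioned above):

```python
import boto3

s3 = boto3.client('s3')
BACKUP_BUCKET = 'my-backup-bucket'  # placeholder backup bucket

def move_spectrum_objects(cur):
    # SVV_EXTERNAL_TABLES holds the S3 location of each Spectrum table
    cur.execute('SELECT location FROM svv_external_tables')
    for (location,) in cur.fetchall():
        bucket, _, prefix = location.replace('s3://', '').partition('/')
        paginator = s3.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get('Contents', []):
                # "Move" = copy to the backup bucket, then delete the original
                s3.copy_object(
                    Bucket=BACKUP_BUCKET,
                    Key=obj['Key'],
                    CopySource={'Bucket': bucket, 'Key': obj['Key']},
                )
                s3.delete_object(Bucket=bucket, Key=obj['Key'])
```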
Every time a test function (which runs the program under test) executes, all Redshift data is deleted first, cleaning up any test data left in Redshift. Then, before the test function executes, the test data it uses is loaded into Redshift.
Implementation example
```python
import pytest

# Specifying scope='function' runs this fixture around each test function;
# autouse=True makes it run for every test function automatically
@pytest.fixture(scope='function', autouse=True)
def fixture_function():
    # Delete all data from Redshift here, cleaning up leftover test data
    ...

@pytest.fixture
def load_redshift_1_1():
    # COPY the CSV-format test data into Redshift here;
    # the test data is loaded before the test function executes
    ...

def test_1_1(load_redshift_1_1):
    # Run the program under test here
    ...
```
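As a sketch of what those fixture bodies might contain, assuming a DB-API cursor and placeholder table names, S3 path, and IAM role:

```python
# Hypothetical helpers the fixtures above could call; the table names,
# S3 path, and IAM role are placeholders for illustration
TABLES = ['public.orders', 'public.customers']

def delete_all(cur):
    # Clean out any data left over from a previous test run
    for table in TABLES:
        cur.execute(f'TRUNCATE {table}')

def load_test_data_1_1(cur):
    # Load the CSV test data for test_1_1 from S3 into Redshift
    cur.execute(
        "COPY public.orders "
        "FROM 's3://my-test-bucket/testdata/test_1_1/orders.csv' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role' "
        "FORMAT AS CSV IGNOREHEADER 1"
    )
```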
**Spectrum case**
- Upload the test data to the corresponding "location" path in S3, preserving the same prefix/partition structure (see the sketch below)
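A minimal sketch of that upload with boto3; the local file, bucket, and partitioned key are placeholders:

```python
import boto3

s3 = boto3.client('s3')

# Put the test data under the external table's "location" path,
# keeping the same prefix/partition structure
s3.upload_file(
    'testdata/test_1_1/sales.csv',             # local test data (placeholder)
    'my-spectrum-bucket',                      # bucket from the location column
    'external/sales/dt=2020-01-01/sales.csv',  # partitioned key (placeholder)
)
```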
Verify the execution result of the program under test. If the program under test only SELECTs from Redshift, you do not need to validate the data in Redshift itself. If the program under test changes Redshift (INSERT, UPDATE, DELETE), compare the resulting (actual) data in Redshift against expected data prepared in advance.
Implementation example for verifying the Redshift result data
```python
import pandas as pd

def test_1_1(load_redshift_1_1):
    # Run the program under test here

    # ex_df = expected data, loaded from CSV into a pandas DataFrame
    # ac_df = actual data, read back from Redshift into a pandas DataFrame

    # Compare the two DataFrames; assert_frame_equal raises on mismatch
    pd.testing.assert_frame_equal(
        # Index both frames on the primary key (id)
        ex_df.set_index('id'),
        ac_df.set_index('id'),
        # Ignore the order of the index and the columns
        check_like=True,
    )
```
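For reference, the two DataFrames could be obtained like this, assuming redshift_connector and a hypothetical expected-data CSV:

```python
import pandas as pd
import redshift_connector

# Placeholder connection settings
conn = redshift_connector.connect(
    host='my-cluster.xxxx.us-east-1.redshift.amazonaws.com',
    database='dev',
    user='test_user',
    password='...',
)

# Expected data prepared in advance as a CSV file
ex_df = pd.read_csv('testdata/test_1_1/expected_orders.csv')

# Actual data the program under test wrote to Redshift
ac_df = pd.read_sql('SELECT * FROM public.orders', conn)
```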
**Spectrum case**
- Since the program under test cannot change a Spectrum table, the data objects in S3 are not validated
- (Repost) If Redshift already holds a large amount of data, saving and restoring it takes time, and pytest runs will not be smooth
- Since a single database is shared, pytest runs cannot be executed concurrently; concurrency control is required (one possible approach is sketched below)
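One simple way to serialize runs on a single machine is an inter-process lock around the whole session, for example with the filelock package. This is my own assumption, not part of the original setup, and runs launched from separate machines would need an external lock instead:

```python
import pytest
from filelock import FileLock

@pytest.fixture(scope='session', autouse=True)
def serialize_pytest_runs():
    # Block until no other pytest run on this machine holds the lock
    with FileLock('/tmp/redshift-pytest.lock'):
        yield
```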
Alternative automated-test approaches that were considered and rejected
- Instead of the existing database, use a database that duplicates the definitions in the existing database
- The program under test connects to the duplicated database during the pytest run and processes it
- No need to save/restore the existing data to keep it from being corrupted
- pytest runs can be executed concurrently
- Rejected because the following costs seemed high:
  - How to duplicate the database together with every definition it contains
  - How to replace the connection information the program under test uses with the duplicated database's connection information
- Instead of using the existing database, create a new database with the same definitions as the existing one
- The program under test connects to the newly created database during the pytest run and processes it
- The DDL is obtained from a managed Git repository or similar
- No need to save/restore the existing data to keep it from being corrupted
- pytest runs can be executed concurrently
- Rejected because the following costs seemed high:
  - How to fetch the managed DDL and create the new database
  - How to replace the connection information the program under test uses with the new database's connection information