[JAVA] Try HiveRunner

motivation

I want to test Hive queries.

With mysql, docker-compose is relatively easy to test, but I want to test hive queries in the same way, but what should I do? (It seems a little painful to mess with docker ..)

That's why I chose a plug-in that looked good and tried it.

I chose the git star among the some officially introduced tools. There were many HiveRunner. (Version is 4.1.0)

Preparation

HiveRunner basically tests your query as a test of Junit.

It doesn't need any external dependencies, it looks like an image of HiveServer on JVM and Junit running hive sql on that HiveServer.

This time we will create a project with maven. The following is the pom when executing this time. There is no particular reason, but I have specified the BEELINE emulator.

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>2.21.0</version>
            <configuration>
                <forkMode>always</forkMode>
                <systemProperties>
                    <!-- Defaults to HIVE_CLI, other options include BEELINE and HIVE_CLI_PRE_V200 -->
                    <commandShellEmulator>BEELINE</commandShellEmulator>
                </systemProperties>
            </configuration>
        </plugin>
    </plugins>
</build>

<dependencies>
    <dependency>
        <groupId>com.klarna</groupId>
        <artifactId>hiverunner</artifactId>
        <version>4.1.0</version>
        <scope>test</scope>
    </dependency>
</dependencies>

I will try it for the time being


@RunWith(StandaloneHiveRunner.class)
public class TestHiveRunner {
    @HiveSQL(files = {})
    private HiveShell shell;

    /*
     *Create DB and table
     */
    @Before
    public void setupSourceDatabase() {
        shell.execute(
                "CREATE DATABASE source_db; " +
                "CREATE TABLE source_db.test_table (year STRING, value INT);");

        shell.execute(
                "CREATE DATABASE my_schema; " +
                "CREATE EXTERNAL TABLE my_schema.result " +
                "(year STRING, value INT) " +
                "STORED AS PARQUET " +
                "TBLPROPERTIES (\"parquet.compress\"=\"snappy\")");
    }

    @Test
    public void testMaxValueByYear() {
        /*
         *Store test data in the table to be aggregated
         */
        shell.insertInto("source_db", "test_table")
                .withColumns("year", "value")
                .addRow("2014", 3)
                .addRow("2014", 4)
                .addRow("2015", 2)
                .addRow("2015", 5)
                .commit();

        /*
         *Execute aggregate query(INSERT query)
         */
        shell.executeStatement("INSERT INTO my_schema.result " +
                "SELECT " +
                "   year, " +
                "   MAX(value) " +
                "FROM " +
                "   source_db.test_table " +
                "GROUP BY year");

        /*
         *Get the result from the table where the aggregated result is INSERTed
         */
        List<Object[]> result = shell.executeStatement("SELECT * FROM my_schema.result");

        assertEquals(2, result.size());
        assertArrayEquals(new Object[]{"2014",4}, result.get(0));
        assertArrayEquals(new Object[]{"2015",5}, result.get(1));
    }
}

This is a slightly rewritten version of the code shown in most of the above examples.

It looks like it's very easy to test.

It is also possible to read the extracted query, or when the number of test data inserts increases and it is separately cut out to tsv etc.

Let's take a look at each

@HiveSQL


//Src by default/test/You are referencing the resources directory.
@HiveSQL(files = {"create_test_table.sql", "create_max.sql"})
private HiveShell shell;

By specifying the SQL file in files = {}, it will be executed automatically after the instance is created.

By setting @HiveSQL (files = {...}, autoStart = false), you can start it after performing ** arbitrary setup **. (Call the start () method)

By the way, the items that can be set in ** Arbitrary setup ** are as follows. (There are many overloaded methods.)


//Set HiveConf
void setProperty(String key, String value);
void setHiveConfValue(String key, String value);

//Copy test data to HDFS
void addResource(String targetFile, File sourceFile);

//Registration of script to be executed when HiveShell is started
// @The same is true for HiveSetupScript, but the script execution order is guaranteed for the following.
void addSetupScript(String script);
void addSetupScripts(Charset charset, File... scripts);

//It seems that it will open the stream and write the test data to HDFS
OutputStream getResourceOutputStream(String targetFile);

execute

//Direct writing
shell.execute("CREATE DATABASE source_db; " +
        "CREATE TABLE source_db.test_table (year STRING, value INT);");


//It is also possible to read the cut out SQL
shell.execute(Paths.get("src/test/resources/calculate_max.sql"));

Script (query) execution with no return value.

Multiple queries can be executed by separating them with ;.

Does that mean there is no return value because you can execute multiple queries?

executeQuery & executeStatement


//You can also execute it with executeQuery below.
shell.executeStatement("INSERT INTO my_schema.result " +
        "SELECT " +
        "   year, " +
        "   MAX(value) " +
        "FROM " +
        "   source_db.test_table " +
        "GROUP BY year");

//Only executeQuery can read the cut out SQL
shell.execute(Paths.get("src/test/resources/calculate_max.sql"));

Unlike ʻexecute, multiple queries cannot be executed at once, and an error will occur if; `is included at the end of the sentence.

Instead, List <String> will return the results of the query.

insertInto


shell.insertInto("source_db", "test_table")
        .withColumns("year", "value")
        .addRow("2014", 3)
        .addRow("2014", 4)
        .addRow("2015", 2)
        .addRow("2015", 5)
        .commit();

//You can also insert data from tsv etc.
shell.insertInto("source_db", "test_table")
        .withColumns("year", "value")
        .addRowsFromTsv(new File("src/test/resources/insert_data_of_test_table.tsv"))
        .commit();

This method will not be executed without commit ().

Preparing the test data is troublesome, but it is very convenient to insert it with tsv.

Summary

It seems to be useful (although it feels a little heavy) when you want to test a query or try a little query.

On the other hand, in a table with a large number of columns, a test when the amount of test data is large, a test of a very complicated query, etc., it seems to be difficult in terms of memory and execution speed, but what happens (README)? There are instructions on how to deal with it ..)

That said, it's tempting to be able to test sql for hive simply by putting hiverunner in pom.

Recommended Posts

Try HiveRunner
Try Mockito
Try Selenium
Try DbUnit
try docker-compose
Try Lombok
Try App Clips
Try using libGDX
First try ~ catch
Try using Maven
Try using powermock-mockito2-2.0.2
Try using GraalVM
Try Java 8 Stream
Try using jmockit 1.48
Try using sql-migrate
Try using SwiftLint
Try using Log4j 2.0
Try Ruby Minitest
Roughly try Java 9