[JAVA] For aggregation processing that cannot be done with the Hive predefined function, first try the Reflect function.

Overview

When performing aggregation with Hadoop / Hive, aggregation processing that cannot be realized by the defined function will be handled by creating a custom UDF, outputting the result, and then processing it with another program. However, it is easy to handle it in Hive query because it is troublesome to write another code and management is troublesome.

If the processing you want to perform can be handled by an existing Java class, ** Hive's Reflect function may be able to do it without writing your own wrapper **.

environment

Hive 0.9 or higher (verification environment is 2.1.0)

Reflect UDF https://cwiki.apache.org/confluence/display/Hive/ReflectUDF Instantiate and call a method of an object using Java reflection. You can also call a static function.

This method should return a primitive type or a type that Hive knows how to serialize.

Basic


reflect(class, method[, arg1[, arg2..]])

Example 1. Character string / numerical processing

The example is a function that can be realized with a predefined function, Since you can freely call the processing implemented in the java class, you can reach the itchy place.


SELECT reflect("java.lang.String", "valueOf", 1),
       reflect("java.lang.String", "isEmpty"),
       reflect("java.lang.Math", "max", 2, 3),
       reflect("java.lang.Math", "min", 2, 3),
       reflect("java.lang.Math", "round", 2.5),
       reflect("java.lang.Math", "exp", 1.0),
       reflect("java.lang.Math", "floor", 1.9),
       reflect("java.lang.Math", "abs", -10)
FROM src LIMIT 1;
 
 
1       true    3       2       3       2.718281828459045       1.0     10

https://docs.oracle.com/javase/jp/8/docs/api/java/lang/String.html https://docs.oracle.com/javase/jp/8/docs/api/java/lang/Math.html

Case 2. URL encoding / decoding

The URL of the search result screen includes the user's free input character string (search character string), but since the search character string is usually recorded in the log in the URL-encoded state, it is decoded into a character string that can be read by the analyst. The processing to do is convenient. By combining this process, ** generation of search string ranking etc. from access log can be completed with one query **

It's a simple case, so it's a great case to use the reflect function.


SELECT 
  encoded_url,
  reflect("org.apache.commons.codec.net.URLCodec", "decode", encoded_url, "UTF-8"), 
  reflect("java.net.URLDecoder", "decode", encoded_url, "UTF-8") 
FROM src LIMIT 1;

%E3%83%86%E3%82%B9%E3%83%88%E3%83%A1%E3%83%83%E3%82%BB%E3%83%BC%E3%82%B8 test message test message

https://docs.oracle.com/javase/jp/8/docs/api/java/net/URLDecoder.html https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/net/URLCodec.html

Case 3. Hash value generation

Many hash functions are already available in built-in functions since Hive 1.3, Even in earlier versions, you can freely call and generate formats that are not in the built-in functions.

SELECT
  md5('hoge'), 
  sha2('hoge',256), 
  sha2('hoge',384), 
  sha2('hoge',512),
  reflect('org.apache.commons.codec.digest.DigestUtils', 'md5Hex', 'hoge'), 
  reflect('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', 'hoge'), 
  reflect('org.apache.commons.codec.digest.DigestUtils', 'sha384Hex', 'hoge'), 
  reflect('org.apache.commons.codec.digest.DigestUtils', 'sha512Hex', 'hoge')
from src LIMIT 1;

ea703e7aa1efda0064eaa507d9e8ab7e 
ecb666d778725ec97307044d642bf4d160aabb76f56c0069c71ea25b1e926825  cc2aa04a6cb251b8d9dfbacc60b806587456d3fc356dc832116b9ba188713e6adf5f995b750d86b0883b24d07a37c720 dbb50237ad3fa5b818b8eeca9ca25a047e0f29517db2b25f4a8db5f717ff90bf0b7e94ef4f5c4e313dfb06e48fbd9a2e40795906a75c470cdb619cf9c2d4f6d9 

ea703e7aa1efda0064eaa507d9e8ab7e 
ecb666d778725ec97307044d642bf4d160aabb76f56c0069c71ea25b1e926825 cc2aa04a6cb251b8d9dfbacc60b806587456d3fc356dc832116b9ba188713e6adf5f995b750d86b0883b24d07a37c720 dbb50237ad3fa5b818b8eeca9ca25a047e0f29517db2b25f4a8db5f717ff90bf0b7e94ef4f5c4e313dfb06e48fbd9a2e40795906a75c470cdb619cf9c2d4f6d9 

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/digest/DigestUtils.html

Summary

The latest version has a lot of built-in functions, so I don't feel any shortage, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Parts that are out of reach of itchy areas such as URL encoding / decoding will continue to appear in the future. General-purpose / primitive processing that does not require UDF is something that you want to process easily.

If you use the Reflect function, you can enjoy its convenience and convenience.

reference

Recommended Posts

For aggregation processing that cannot be done with the Hive predefined function, first try the Reflect function.
[First Java] Make something that works with Intellij for the time being