==Background==
Hive ships with powerful built-in functions for data processing and analysis, but some requirement-specific logic, or logic that is reused heavily, is very clumsy or even impossible to express with built-ins alone. That is where Hive user-defined functions come in. They fall into three categories:
- UDF (user-defined function): takes one value per row and, after applying the logic, returns one value, in the same way as built-in functions such as floor and round;
- UDAF (user-defined aggregate function): takes multiple rows and returns one result row per GROUP BY group, similar to sum and count;
- UDTF (user-defined table-generating function): takes one input row and expands it into multiple output rows, similar to the built-in function explode;
Steps to build a UDF:
- Extend the base class or implement the interface for UDF, UDAF, or UDTF, and write the method that implements your logic (a minimal skeleton follows this list).
- Package the finished class as a jar and, preferably, upload it to HDFS.
- In the Hive client, create your UDF with
CREATE FUNCTION db_name.your_function_name AS 'your_jar_package_name.your_jar_class_name' USING JAR 'hdfs:///your_jar_hdfs_path/your_jar.jar';
where:
db_name: the Hive database the function belongs to;
your_function_name: the name you give the function;
your_jar_package_name: the package of your Java UDF implementation;
your_jar_class_name: the class name of your Java UDF implementation;
hdfs:///your_jar_hdfs_path/your_jar.jar: the path of the jar on HDFS;
everything else: Hive keywords for creating a function.
- In the Hive client, verify the function works with
select your_function_name(cols)
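For a concrete picture of the first step, here is a minimal skeleton. Note that MyUpper_UDF and its upper-casing behavior are invented purely for illustration; the real example for this post follows below.

package org.example;

import org.apache.hadoop.hive.ql.exec.UDF;

// Minimal one-in-one-out UDF: Hive discovers the evaluate() method by
// reflection, so its signature defines the SQL argument and return types.
public class MyUpper_UDF extends UDF {
    public String evaluate(String input) {
        return input == null ? null : input.toUpperCase();
    }
}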
This post covers the most commonly used of the three, the UDF. Let's get hands-on.
==Hive UDF in Practice==
==Requirement Background==
We have a batch of device data with an aspect-ratio field holding values such as 16:9 and 4:3. The front-end collection was not done well, though, so both 16:9 and 9:16 show up for the same ratio, which makes the data very awkward to use. The warehouse team therefore decided to clean the field with one rule: every ratio is stored as larger:smaller, so only values like 16:9 and 4:3 appear (9:16 becomes 16:9, 3:4 becomes 4:3).
==Java Implementation==
I personally prefer writing the Java in IDEA; as Figure 1 shows, this is a Maven-based Java project implementing the UDF.
Figure 1: the Maven project implementing the UDF logic
The Java code implementing the logic is below; it is fairly simple and needs little explanation:
package org.example;

import org.apache.hadoop.hive.ql.exec.UDF;

public class PicRatio_UDF extends UDF {
    // Hive calls evaluate() via reflection, once per input row.
    public String evaluate(String picratio) {
        try {
            // Split "a:b" and swap the two sides when the smaller number
            // comes first, so the result is always larger:smaller.
            String[] myArray = picratio.split(":");
            int myArray0 = Integer.parseInt(myArray[0]);
            int myArray1 = Integer.parseInt(myArray[1]);
            return myArray0 < myArray1 ? myArray1 + ":" + myArray0 : picratio;
        } catch (Exception e) {
            // Malformed values (no ":", non-numeric parts, null) are
            // returned unchanged instead of failing the whole query.
            System.out.println(e);
            return picratio;
        }
    }
}
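As a side note, org.apache.hadoop.hive.ql.exec.UDF is deprecated in newer Hive releases in favor of GenericUDF. Below is a rough sketch of what the same logic might look like on the GenericUDF API; PicRatioGenericUDF is my own hypothetical name, not part of the original project.

package org.example;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class PicRatioGenericUDF extends GenericUDF {
    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Validate the argument count once per query, not once per row.
        if (arguments.length != 1) {
            throw new UDFArgumentException("picratio() takes exactly one argument");
        }
        // Declare that the function returns a Java String.
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object arg = arguments[0].get();
        if (arg == null) {
            return null; // propagate SQL NULL
        }
        String picratio = arg.toString();
        try {
            String[] parts = picratio.split(":");
            int a = Integer.parseInt(parts[0]);
            int b = Integer.parseInt(parts[1]);
            return a < b ? b + ":" + a : picratio;
        } catch (Exception e) {
            return picratio; // malformed input is passed through unchanged
        }
    }

    @Override
    public String getDisplayString(String[] children) {
        return "picratio(" + children[0] + ")";
    }
}

GenericUDF avoids per-row reflection and supports complex types, at the cost of the more verbose ObjectInspector plumbing.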
To be rigorous, let's also write a main method to check that the Java logic is correct; the test code is as follows:
package org.example;

public class Test_UDF {
    public static void main(String[] args) {
        // "3:4" should be normalized to "4:3".
        String source_PicRatio = "3:4";
        PicRatio_UDF myPicRatio_UDF = new PicRatio_UDF();
        String myPicRatio = myPicRatio_UDF.evaluate(source_PicRatio);
        System.out.println(myPicRatio); // prints 4:3
    }
}
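Since the pom below already declares JUnit 4 as a test dependency, the same checks can also be written as a proper unit test instead of a throwaway main. A minimal sketch, with a test class name of my own invention:

package org.example;

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class PicRatio_UDFTest {
    private final PicRatio_UDF udf = new PicRatio_UDF();

    @Test
    public void flipsRatiosWithTheSmallerNumberFirst() {
        assertEquals("4:3", udf.evaluate("3:4"));
        assertEquals("16:9", udf.evaluate("9:16"));
    }

    @Test
    public void keepsAlreadyNormalizedRatios() {
        assertEquals("16:9", udf.evaluate("16:9"));
    }

    @Test
    public void returnsMalformedInputUnchanged() {
        assertEquals("not-a-ratio", udf.evaluate("not-a-ratio"));
    }
}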
The project is Maven-based; its dependencies and plugins are declared in pom.xml, whose content is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>udf</artifactId>
<version>1.0-SNAPSHOT</version>
<name>udf</name>
<!-- FIXME change it to the project's website -->
<url>http://www.example.com</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.8.5</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-common</artifactId>
<version>2.3.5</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.3.5</version>
</dependency>
</dependencies>
<!-- Required: without this repository the build fails with "Failure to find org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde", a transitive dependency of hive-exec that is missing from Maven Central -->
<repositories>
<repository>
<id>spring-plugin</id>
<url>https://repo.spring.io/plugins-release/</url>
</repository>
</repositories>
<build>
<pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
<plugins>
<!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
<plugin>
<artifactId>maven-clean-plugin</artifactId>
<version>3.1.0</version>
</plugin>
<!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
<plugin>
<artifactId>maven-resources-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.0</version>
</plugin>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.22.1</version>
</plugin>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-install-plugin</artifactId>
<version>2.5.2</version>
</plugin>
<plugin>
<artifactId>maven-deploy-plugin</artifactId>
<version>2.8.2</version>
</plugin>
<!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
<plugin>
<artifactId>maven-site-plugin</artifactId>
<version>3.7.1</version>
</plugin>
<plugin>
<artifactId>maven-project-info-reports-plugin</artifactId>
<version>3.0.0</version>
</plugin>
</plugins>
</pluginManagement>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass></mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
Once everything is ready, package the project with Maven; since the maven-assembly-plugin above is bound to the package phase, mvn clean package produces the jar-with-dependencies (for IDEA-specific steps, see the post on packaging code into a jar with IntelliJ IDEA). Be careful here: make sure the build log finishes without errors, rather than concluding the jar is fine just because a jar file was generated. If you skip the log and copy off a broken jar, Hive will still fail when it references the jar, and you will have to fix the build and repackage anyway. Finally, upload the jar to HDFS; I put mine at hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar.
Everything is in place, so now open the Hive client and create the permanent UDF. In the Hive CLI:
hive> create function rowyet.picratio as 'org.example.PicRatio_UDF' using jar 'hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar';
Added [/opt/hive/log/c0017eab-0d8b-49bd-8157-09237071085c_resources/udf-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar]
OK
Time taken: 6.03 seconds
-- test
hive> select rowyet.picratio("4:3")
> ;
OK
4:3
Time taken: 0.902 seconds, Fetched: 1 row(s)
hive> select rowyet.picratio("3:4");
OK
4:3
Time taken: 0.064 seconds, Fetched: 1 row(s)
You can also register the function only for the current session; the HiveQL is below. This is less common, since permanent functions are the norm. Note that a temporary function is session-scoped and takes an unqualified name, with no database prefix:
create temporary function picratio as 'org.example.PicRatio_UDF' using jar 'hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar';
==Troubleshooting==
- If even loading the jar fails, i.e. the error appears before the "Added ... to class path" log lines, e.g.
FAILED: SemanticException java.lang.IllegalArgumentException: java.net.UnknownHostException: hive
then the Hive client never loaded the jar at all. Check that the jar path is correct and that no / is missing or duplicated. A successful load looks like this:
Added [/opt/hive/log/c0017eab-0d8b-49bd-8157-09237071085c_resources/udf-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar]
OK
- If the jar-loading log does appear but registration then fails with something like
Failed to register rowyet.picratio using class com.ruoyin.hiveudf.PicRatio_UDF
the jar itself was built incorrectly. Go back to the packaging log, find the error, and fix it before re-uploading.
==Dropping a Custom UDF==
Dropping a custom UDF is also supported; the HiveQL is as follows:
hive> drop function rowyet.picratio;
Added [/opt/hive/log/ce6a5159-2063-4796-b4cb-5a6762b0ad15_resources/udf-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar]
OK
Time taken: 0.752 seconds
hive> select rowyet.picratio("4:3")
> ;
FAILED: SemanticException [Error 10011]: Invalid function rowyet.picratio