==Background==

  Hive ships with a rich set of built-in functions for data processing and analysis, but sometimes a requirement's logic, or a highly reusable piece of processing logic, is clumsy or even impossible to express with built-ins alone. That is when Hive user-defined functions come in. They fall into three categories:

  • UDF (user-defined function): takes a value from a single row and returns a single value after the computation, like the built-ins floor and round;
  • UDAF (user-defined aggregation function): takes multiple rows and, for the selected GROUP BY key, returns one row of results, like sum and count;
  • UDTF (user-defined table-generating function): takes a single input row and expands it into multiple output rows, like the built-in explode (all three shapes are sketched below);
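  To make the three shapes concrete, here is a minimal HiveQL sketch using the built-in analogues named above; the table t and its columns are hypothetical:

-- UDF shape: one value in, one value out per row
select floor(price) from t;

-- UDAF shape: many rows in, one row out per GROUP BY key
select dept, sum(price) from t group by dept;

-- UDTF shape: one row in, many rows out
select explode(array(16, 9, 4));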

  Steps to build a UDF:

  1. Extend the base class or implement the interface for UDF, UDAF, or UDTF, and implement the methods containing your logic.
  2. Package the class as a jar and, preferably, upload it to HDFS.
  3. In the Hive client, create your UDF with CREATE FUNCTION db_name.your_function_name AS 'your_jar_package_name.your_jar_class_name' USING JAR 'hdfs:///your_jar_hdfs_path/your_jar.jar', where:

  • db_name: the name of the Hive database;
  • your_function_name: the name you give the function;
  • your_jar_package_name: the package name of your Java UDF implementation;
  • your_jar_class_name: the class name of your Java UDF implementation;
  • hdfs:///your_jar_hdfs_path/your_jar.jar: the jar's path on HDFS;
  • everything else: Hive keywords for creating a function.

  4. In the Hive client, run select your_function_name(cols) to verify the function works (see the template below).
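  Putting steps 3 and 4 together, the statements look like the following template; every name in it is a placeholder, not a real object:

create function db_name.your_function_name
  as 'your_jar_package_name.your_jar_class_name'
  using jar 'hdfs:///your_jar_hdfs_path/your_jar.jar';

select db_name.your_function_name('some_value');  -- smoke test with a literal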

  This post walks through the most commonly used of the three, the UDF; on to the hands-on part.

==Hive UDF in Practice==

==Requirement==

  I have a batch of device data with a screen-aspect-ratio field holding values such as 16:9 or 4:3. The front-end collection was sloppy, though, so both 16:9 and 9:16 show up, which makes the data very awkward to use. The warehouse team therefore decided to clean it with one rule: every device's aspect ratio is written larger:smaller, so only values like 16:9 and 4:3 remain.
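  To make the target mapping concrete, the rule can be sketched in raw HiveQL first (the table device_info and column pic_ratio here are hypothetical names); the UDF simply packages this logic so it doesn't have to be repeated in every query:

select pic_ratio,
       case when cast(split(pic_ratio, ':')[0] as int) < cast(split(pic_ratio, ':')[1] as int)
            then concat(split(pic_ratio, ':')[1], ':', split(pic_ratio, ':')[0])
            else pic_ratio
       end as pic_ratio_clean
from device_info;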

==Java Implementation==

  I personally prefer writing the Java in IDEA; as Figure 1 shows, I build a Maven-based Java project to implement the UDF.

[Figure 1: the IDEA Maven project with the Java code implementing the UDF logic]

  The Java code implementing the logic is below; it is fairly simple and should need little explanation:

package org.example;

import org.apache.hadoop.hive.ql.exec.UDF;

// Normalizes an aspect-ratio string such as "9:16" to larger:smaller form ("16:9").
public class PicRatio_UDF extends UDF {
    public String evaluate(String picratio)
    {
        try
        {
            String[] myArray = picratio.split(":");
            int myArray0 = Integer.parseInt(myArray[0]);
            int myArray1 = Integer.parseInt(myArray[1]);
            // Swap the two sides when the first is smaller; otherwise keep the input as-is.
            return myArray0 < myArray1 ? myArray1 + ":" + myArray0 : picratio;
        }
        catch (Exception e)
        {
            // Malformed input (null, no colon, non-numeric parts) is returned unchanged.
            System.out.println(e);
            return picratio;
        }
    }
}

  To be safe, let's also write a main method to verify that the Java logic is correct; running the test below should print 4:3:

package org.example;

public class Test_UDF {
    public static void main(String[] args)
    {
        // "3:4" should be flipped to "4:3" by the UDF.
        String source_PicRatio = "3:4";
        PicRatio_UDF myPicRatio_UDF = new PicRatio_UDF();
        String myPicRatio = myPicRatio_UDF.evaluate(source_PicRatio);
        System.out.println(myPicRatio);
    }
}

  The project is Maven-based; the dependencies and plugins it uses live in pom.xml, shown below:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>udf</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>udf</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.8.5</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-common</artifactId>
      <version>2.3.5</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>2.3.5</version>
    </dependency>
  </dependencies>
<!-- Required; without this repository the build tends to fail with: Failure to find org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde -->
  <repositories>
    <repository>
      <id>spring-plugin</id>
      <url>https://repo.spring.io/plugins-release/</url>
    </repository>
  </repositories>
  <build>
    <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
      <plugins>
        <!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
        <plugin>
          <artifactId>maven-clean-plugin</artifactId>
          <version>3.1.0</version>
        </plugin>
        <!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
        <plugin>
          <artifactId>maven-resources-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.8.0</version>
        </plugin>
        <plugin>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.22.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-jar-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-install-plugin</artifactId>
          <version>2.5.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-deploy-plugin</artifactId>
          <version>2.8.2</version>
        </plugin>
        <!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
        <plugin>
          <artifactId>maven-site-plugin</artifactId>
          <version>3.7.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-project-info-reports-plugin</artifactId>
          <version>3.0.0</version>
        </plugin>
      </plugins>
    </pluginManagement>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass></mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

  Once everything is ready, build the jar with Maven (e.g. mvn clean package; for the IDEA specifics, see the post IntelliJ IDEA将代码打成Jar包的办法). Important: make sure the packaging log is clean and error-free before trusting the jar, rather than assuming the build succeeded just because a jar file was produced. If you skip the log and copy out a broken jar, Hive will still error when it references the package, and you will have to fix it and repackage anyway. Then upload the jar to HDFS, e.g. with hdfs dfs -put; I put mine at hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar.

  With that in place, open the Hive client and create the permanent UDF; in the Hive CLI:

hive> create function  rowyet.picratio as 'org.example.PicRatio_UDF' using jar 'hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar';
Added [/opt/hive/log/c0017eab-0d8b-49bd-8157-09237071085c_resources/udf-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar]
OK
Time taken: 6.03 seconds
--test
hive> select rowyet.picratio("4:3")
    > ;
OK
4:3
Time taken: 0.902 seconds, Fetched: 1 row(s)
hive> select rowyet.picratio("3:4");
OK
4:3
Time taken: 0.064 seconds, Fetched: 1 row(s)
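  With the function verified, the actual cleanup is a single rewrite of the device table; the table and column names below are hypothetical:

insert overwrite table rowyet.device_info
select device_id,
       rowyet.picratio(pic_ratio) as pic_ratio
from rowyet.device_info;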

  You can also add the function temporarily; the HiveQL is below. This is rarely used in practice, since permanent functions are the norm. Note that a temporary function is session-scoped and cannot be created with a database-qualified name:

create temporary function picratio as 'org.example.PicRatio_UDF' using jar 'hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar';
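  A temporary function disappears on its own when the CLI session ends; to remove it earlier, drop it explicitly:

drop temporary function if exists picratio;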

==Troubleshooting==

  1. If even loading the jar fails, i.e. the statement errors out before the "Added ... to class path" log shown below ever appears, with something like FAILED: SemanticException java.lang.IllegalArgumentException: java.net.UnknownHostException: hive, then the Hive client never found the jar; check whether your jar path is correct, e.g. an extra or missing /.

Added [/opt/hive/log/c0017eab-0d8b-49bd-8157-09237071085c_resources/udf-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar]
OK

  2. If the jar-loading log does appear but you then get an error like Failed to register rowyet.picratio using class com.ruoyin.hiveudf.PicRatio_UDF, the jar itself was packaged badly; go back to the packaging log, look for errors, and fix them.
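  As a quick sanity check after (re)creating the function, the following statements should confirm the registration; syntax as in Hive 2.x:

show functions like "rowyet*";               -- should list rowyet.picratio
describe function extended rowyet.picratio;  -- prints what Hive has recorded for the function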

==Dropping a Custom UDF==

  Dropping a custom UDF is also supported; the HiveQL is as follows:

hive> drop function rowyet.picratio;
Added [/opt/hive/log/ce6a5159-2063-4796-b4cb-5a6762b0ad15_resources/udf-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar]
OK
Time taken: 0.752 seconds
hive> select rowyet.picratio("4:3")
    > ;
FAILED: SemanticException [Error 10011]: Invalid function rowyet.picratio