赞
踩
在Hive中,所有的表生成函数,包括用户自定义的和内置的,都统称为用户自定义表生成函数(user defined table generating functions),简称udtf。本文只介绍Hive自带的内置表生成函数。
功能:返回n行,每行对应数组中的一个元素。
- spark-sql> select explode(array(1,2,3,4));
-
- col
- 1
- 2
- 3
- 4
- Time taken: 0.048 seconds, Fetched 4 row(s)
功能:返回n行两列,每行对应每个map键-值,第一列是map的键,第二列是map的值。(不常用)
- spark-sql> select explode(map(1,2,3,4));
- key value
- 1 2
- 3 4
- Time taken: 0.039 seconds, Fetched 2 row(s)
功能:与explode类似,但除了数组元素本身(第二列),还返回各元素在数组中的位置(从0开始,第一列)。(不常用)
- spark-sql> select posexplode(array(2,4,6,8));
- pos col
- 0 2
- 1 4
- 2 6
- 3 8
- Time taken: 0.034 seconds, Fetched 4 row(s)
功能:把k列数据转换成n行,k/n列,其中n必须是正整数,后面的v_1到v_k必须是元素,不能是列名。(不常用)
select stack(3,1,2,3,4,5,6);
- spark-sql> select stack(3,1,2,3,4,5,6);
- col0 col1
- 1 2
- 3 4
- 5 6
- Time taken: 0.032 seconds, Fetched 3 row(s)
select stack(2,1,2,3,4,5,6);
- spark-sql> select stack(2,1,2,3,4,5,6);col0 col1 col2
- 1 2 3
- 4 5 6
- Time taken: 0.039 seconds, Fetched 2 row(s)
select stack(3,1,2,3,4,5,6,7);
- spark-sql> select stack(3,1,2,3,4,5,6,7);
- col0 col1 col2
- 1 2 3
- 4 5 6
- 7 NULL NULL
- Time taken: 0.035 seconds, Fetched 3 row(s)
select stack(6,1,2,3,4,5,6);
- spark-sql> select stack(6,1,2,3,4,5,6);
- col0
- 1
- 2
- 3
- 4
- 5
- 6
- Time taken: 0.03 seconds, Fetched 6 row(s)
udtf有一个很大的限制,在使用udtf时,select后面只能跟udtf,不能跟其他任何字段,否则会报错,如下:
hive 报错,spark-sql不报错,如下所示:
- hive> select 1 as flag,explode(array(1,2,3,4));
- FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
-
- spark-sql> select 1 as flag,explode(array(1,2,3,4));
- flag col
- 1 1
- 1 2
- 1 3
- 1 4
- Time taken: 0.044 seconds, Fetched 4 row(s)
- 2022-11-07 21:15:01,476 INFO thriftserver.SparkSQLCLIDriver: Time taken: 0.044 seconds, Fetched 4 row(s)
1)语法:
select
列别名1[ ,列别名2,列别名3……]
from 表名 lateral view udtf(expression) 虚拟表别名 as 列别名1[ ,列别名2,列别名3……]
lateral view跟在from后面,然后跟要使用的udtf,为生成的虚拟表起一个表别名,不写会报错。然后跟as 列别名,有些udtf会产生多个列,所以有时要跟多个列别名。
2)应用1:行转列
原始数据:
- select *
- from
- (
- select "a" as shop,"1,2,4" as uid_array
- union
- select "b" as shop,"4,5,6" as uid_array
- )tb_final
-
- a 1,2,4
- b 4,5,6
- Time taken: 5.345 seconds, Fetched 2 row(s)
首先使用split函数对uid_array进行切割,返回一个数组,然后使用lateral view explode进行行转列
- select shop
- ,uid --这里是下面生成的列别名
- from
- ( --准备原始数据
- select *
- from
- (
- select "a" as shop,"1,2,4" as uid_array
- union
- select "b" as shop,"4,5,6" as uid_array
- )tb_final
- )temp_test6
- lateral view explode(split(uid_array, ',')) tb_a as uid;
-
- a 1
- a 2
- a 4
- b 4
- b 5
- b 6
- Time taken: 1.479 seconds, Fetched 6 row(s)

3)应用2:求总聚合结果
举例:
此时想一句hql求出每个商店的来客数,以及两个商店去重后的来客数。由于uid为4的用户同时出现在两家商店,所以统计total维度时用户数为5。最终要得到如下结果:
a 3
b 3
总 5
explode实现,关键点在于构造一个array数组,将原本的聚合维度字段放入,然后任意自定义一个词例如‘total’作为总聚合的维度名称,再自定义一个别名(这里使用total_shop),代表包含total的字段名,聚合时使用新自定义的别名进行聚合,如下:
- select total_shop
- ,count(distinct uid) as uid_num
- from
- ( --准备原始数据
- select shop
- ,uid --这里是下面生成的列别名
- from
- ( --准备原始数据
- select *
- from
- (
- select "a" as shop,"1,2,4" as uid_array
- union
- select "b" as shop,"4,5,6" as uid_array
- )tb_final
- )temp_test6
- lateral view explode(split(uid_array, ',')) tb_a as uid
- )temp_test7
- lateral view explode(array(shop, '总')) tb_a as total_shop
- group by total_shop
- order by total_shop;
-
-
-
- a 3
- b 3
- 总 5
- Time taken: 2.14 seconds, Fetched 3 row(s)

这样的写法等价于:
- select shop
- ,count(distinct uid) as uid_num
- from
- ( --准备原始数据
- select shop
- ,uid --这里是下面生成的列别名
- from
- ( --准备原始数据
- select *
- from
- (
- select "a" as shop,"1,2,4" as uid_array
- union
- select "b" as shop,"4,5,6" as uid_array
- )tb_final
- )temp_test6
- lateral view explode(split(uid_array, ',')) tb_a as uid
- )temp_test7
-
- group by shop
-
- union all
-
- select '总' as shop
- ,count(distinct uid) as uid_num
- from
- ( --准备原始数据
- select shop
- ,uid --这里是下面生成的列别名
- from
- ( --准备原始数据
- select *
- from
- (
- select "a" as shop,"1,2,4" as uid_array
- union
- select "b" as shop,"4,5,6" as uid_array
- )tb_final
- )temp_test6
- lateral view explode(split(uid_array, ',')) tb_a as uid
- )temp_test7 ;
-
- a 3
- b 3
- 总 5
- Time taken: 2.715 seconds, Fetched 3 row(s)

功能:从一个json字符串中获取多个key对应的value并作为一个元组返回。
举例:具体案例见
功能:返回从url中抽取指定n部分的内容并作为一个元组返回,参数url是url字符串,而参数p1,p2,....是要抽取的部分。
举例:具体案例见
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。