当前位置:   article > 正文

hive表生成函数explode、stack、lateral view和json_tuple、parse_url_tuple示例_hive stack

hive stack

在Hive中,所有的表生成函数,包括用户自定义的和内置的,都统称为用户自定义表生成函数(user defined table generating functions),简称udtf。本文只介绍Hive自带的内置表生成函数。

1、explode(array)

功能:返回n行,每行对应数组中的一个元素。

  1. spark-sql> select explode(array(1,2,3,4));
  2. col
  3. 1
  4. 2
  5. 3
  6. 4
  7. Time taken: 0.048 seconds, Fetched 4 row(s)

2、explode(map)

功能:返回n行两列,每行对应每个map键-值,第一列是map的键,第二列是map的值。(不常用)

  1. spark-sql> select explode(map(1,2,3,4));
  2. key value
  3. 1 2
  4. 3 4
  5. Time taken: 0.039 seconds, Fetched 2 row(s)

3、posexplode(array)

功能:与explode类似,但除了数组元素本身(第二列),还返回各元素在数组中的位置(从0开始,第一列)。(不常用)

  1. spark-sql> select posexplode(array(2,4,6,8));
  2. pos col
  3. 0 2
  4. 1 4
  5. 2 6
  6. 3 8
  7. Time taken: 0.034 seconds, Fetched 4 row(s)

4、stack(int n, v_1, v_2, ..., v_k)

功能:把k列数据转换成n行,k/n列,其中n必须是正整数,后面的v_1到v_k必须是元素,不能是列名。(不常用)

4.1、n设为3,将后面6个元素按顺序分为3行2列

select stack(3,1,2,3,4,5,6);

  1. spark-sql> select stack(3,1,2,3,4,5,6);
  2. col0 col1
  3. 1 2
  4. 3 4
  5. 5 6
  6. Time taken: 0.032 seconds, Fetched 3 row(s)

4.2、n设为2,将后面6个元素按顺序分为2行3列

select stack(2,1,2,3,4,5,6);

  1. spark-sql> select stack(2,1,2,3,4,5,6);col0 col1 col2
  2. 1 2 3
  3. 4 5 6
  4. Time taken: 0.039 seconds, Fetched 2 row(s)

4.3、n设为3,将后面7个元素按顺序分为3行3列

select stack(3,1,2,3,4,5,6,7);

  1. spark-sql> select stack(3,1,2,3,4,5,6,7);
  2. col0 col1 col2
  3. 1 2 3
  4. 4 5 6
  5. 7 NULL NULL
  6. Time taken: 0.035 seconds, Fetched 3 row(s)

4.4、n设为6,将后面6个元素转为为6行1列

select stack(6,1,2,3,4,5,6);

  1. spark-sql> select stack(6,1,2,3,4,5,6);
  2. col0
  3. 1
  4. 2
  5. 3
  6. 4
  7. 5
  8. 6
  9. Time taken: 0.03 seconds, Fetched 6 row(s)

5、表生成函数必备:lateral view

udtf有一个很大的限制,在使用udtf时,select后面只能跟udtf,不能跟其他任何字段,否则会报错,如下:

hive 报错,spark-sql不报错,如下所示:

  1. hive> select 1 as flag,explode(array(1,2,3,4));
  2. FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
  3. spark-sql> select 1 as flag,explode(array(1,2,3,4));
  4. flag col
  5. 1 1
  6. 1 2
  7. 1 3
  8. 1 4
  9. Time taken: 0.044 seconds, Fetched 4 row(s)
  10. 2022-11-07 21:15:01,476 INFO thriftserver.SparkSQLCLIDriver: Time taken: 0.044 seconds, Fetched 4 row(s)

1)语法:

select 
列别名1[ ,列别名2,列别名3……] 
from 表名 lateral view udtf(expression) 虚拟表别名 as 列别名1[ ,列别名2,列别名3……]

lateral view跟在from后面,然后跟要使用的udtf,为生成的虚拟表起一个表别名,不写会报错。然后跟as 列别名,有些udtf会产生多个列,所以有时要跟多个列别名。


2)应用1:行转列

原始数据:

  1. select *
  2. from
  3. (
  4. select "a" as shop,"1,2,4" as uid_array
  5. union
  6. select "b" as shop,"4,5,6" as uid_array
  7. )tb_final
  8. a 1,2,4
  9. b 4,5,6
  10. Time taken: 5.345 seconds, Fetched 2 row(s)

首先使用split函数对uid_array进行切割,返回一个数组,然后使用lateral view explode进行行转列

  1. select shop
  2. ,uid --这里是下面生成的列别名
  3. from
  4. ( --准备原始数据
  5. select *
  6. from
  7. (
  8. select "a" as shop,"1,2,4" as uid_array
  9. union
  10. select "b" as shop,"4,5,6" as uid_array
  11. )tb_final
  12. )temp_test6
  13. lateral view explode(split(uid_array, ',')) tb_a as uid;
  14. a 1
  15. a 2
  16. a 4
  17. b 4
  18. b 5
  19. b 6
  20. Time taken: 1.479 seconds, Fetched 6 row(s)

3)应用2:求总聚合结果

举例:

此时想一句hql求出每个商店的来客数,以及两个商店去重后的来客数。由于uid为4的用户同时出现在两家商店,所以统计total维度时用户数为5。最终要得到如下结果:
a    3
b    3
总    5

explode实现,关键点在于构造一个array数组,将原本的聚合维度字段放入,然后任意自定义一个词例如‘total’作为总聚合的维度名称,再自定义一个别名(这里使用total_shop),代表包含total的字段名,聚合时使用新自定义的别名进行聚合,如下:

  1. select total_shop
  2. ,count(distinct uid) as uid_num
  3. from
  4. ( --准备原始数据
  5. select shop
  6. ,uid --这里是下面生成的列别名
  7. from
  8. ( --准备原始数据
  9. select *
  10. from
  11. (
  12. select "a" as shop,"1,2,4" as uid_array
  13. union
  14. select "b" as shop,"4,5,6" as uid_array
  15. )tb_final
  16. )temp_test6
  17. lateral view explode(split(uid_array, ',')) tb_a as uid
  18. )temp_test7
  19. lateral view explode(array(shop, '总')) tb_a as total_shop
  20. group by total_shop
  21. order by total_shop;
  22. a 3
  23. b 3
  24. 5
  25. Time taken: 2.14 seconds, Fetched 3 row(s)

这样的写法等价于:

  1. select shop
  2. ,count(distinct uid) as uid_num
  3. from
  4. ( --准备原始数据
  5. select shop
  6. ,uid --这里是下面生成的列别名
  7. from
  8. ( --准备原始数据
  9. select *
  10. from
  11. (
  12. select "a" as shop,"1,2,4" as uid_array
  13. union
  14. select "b" as shop,"4,5,6" as uid_array
  15. )tb_final
  16. )temp_test6
  17. lateral view explode(split(uid_array, ',')) tb_a as uid
  18. )temp_test7
  19. group by shop
  20. union all
  21. select '总' as shop
  22. ,count(distinct uid) as uid_num
  23. from
  24. ( --准备原始数据
  25. select shop
  26. ,uid --这里是下面生成的列别名
  27. from
  28. ( --准备原始数据
  29. select *
  30. from
  31. (
  32. select "a" as shop,"1,2,4" as uid_array
  33. union
  34. select "b" as shop,"4,5,6" as uid_array
  35. )tb_final
  36. )temp_test6
  37. lateral view explode(split(uid_array, ',')) tb_a as uid
  38. )temp_test7 ;
  39. a 3
  40. b 3
  41. 5
  42. Time taken: 2.715 seconds, Fetched 3 row(s)

6、json_tuple(jsonstr, k1, k2, ...)

功能:从一个json字符串中获取多个key对应的value并作为一个元组返回。

举例:具体案例见

7、parse_url_tuple(url, p1, p2, ...)

功能:返回从url中抽取指定n部分的内容并作为一个元组返回,参数url是url字符串,而参数p1,p2,....是要抽取的部分。
举例:具体案例见

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/IT小白/article/detail/761876
推荐阅读
相关标签
  

闽ICP备14008679号