赞
踩
参考链接:https://blog.csdn.net/lfq1532632051/article/details/63262519
直接上例子,下面是我要处理的 json 数据格式
- {
- "IP": "192.168.1.1",
- "appName": "sichuan_yunyingyong",
- "customEvent": [
- {
- "eventName": "xx1",
- "du": "xx",
- "timestamp": "1480521763049",
- "eventParams": {
- "ContentID": "yixiuge",
- "account": "13856976635",
- "networkType": "WIFI",
- "result": "0",
- "type": "11"
- }
- },
- {
- "eventName": "xx2",
- "du": "xx",
- "timestamp": "1480521763049",
- "eventParams": {
- "ContentID": "yixiuge",
- "account": "13856976636",
- "networkType": "WIFI",
- "result": "0",
- "type": "11"
- }
- }
- ]
- }

这里面有json对象,还有json array,json对象好解析,json array不好解析,接下来两者都讲解下,需要使用 lateral view
- select
-
- j1.j1_ip,
-
- j1.j1_appName,
-
- j2.j2_customEvent_json
-
- FROM tab_json s
-
- lateral view json_tuple(s.json, 'IP', 'appName', 'customEvent') j1 as j1_ip, j1_appName, j1_customEvent
-
- lateral view posexplode(split(regexp_replace(regexp_replace(j1.j1_customEvent,'\\}\\,\\{','\\}\\|\\|\\{'),'\\[|\\]',''), '\\|\\|')) j2 as j2_customEvents_pos, j2_customEvent_json
这个表 tab_json是包含json数据的表,json是json数据的字段,结果为
- 192.168.1.1 sichuan_yunyingyong {"eventName":"xx1","du":"xx","timestamp":"1480521763049","eventParams":{"ContentID":"yixiuge","account":"13856976635","networkType":"WIFI","result":"0","type":"11"}}
-
- 192.168.1.1 sichuan_yunyingyong {"eventName":"xx2","du":"xx","timestamp":"1480521763049","eventParams":{"ContentID":"yixiuge","account":"13856976636","networkType":"WIFI","result":"0","type":"11"}}
json数据就是直接把key当成字段,直接把key作为json_tuple方法的参数即可,这种解析json对象比较简单,如json_tuple(s.json, 'IP') 就是取json中字段IP的值,
但是,为什么原来是一条记录怎么解析成2条记录了呢,问题如下
posexplode(split(regexp_replace(regexp_replace(j1.j1_customEvent,'\\}\\,\\{','\\}\\|\\|\\{'),'\\[|\\]',''), '\\|\\|')) j2 as j2_customEvents_pos, j2_customEvent_json
我把json array的格式通过替换变成了 {json1} || {json2} , 这种格式再根据 || 来拆开,形成了一个有两个元素的数组
注意:hive转义符需要写两个 \
接着 posexplode 在把数组变成(pos, json) 的键值对,pos记录了元素的位置,json就是实际的json数据,这样一条数据就变成了两条了,这点要注意,数据量因为这种操作,成倍的增加
如果想获取array中的eventName和timestamp字段怎么办呢,在上一部的基础上愉快的使用get_json_object就好了
- select
-
- j1.j1_ip,
-
- j1.j1_appName,
-
- j2.j2_customEvent_json,
-
- get_json_object(j2.j2_customEvent_json, '$.eventName') as eventName,
- get_json_object(j2.j2_customEvent_json, '$.timestamp') as timestamp,
-
- FROM tab_json s
-
- lateral view json_tuple(s.json, 'IP', 'appName', 'customEvent') j1 as j1_ip, j1_appName, j1_customEvent
-
- lateral view posexplode(split(regexp_replace(regexp_replace(j1.j1_customEvent,'\\}\\,\\{','\\}\\|\\|\\{'),'\\[|\\]',''), '\\|\\|')) j2 as j2_customEvents_pos, j2_customEvent_json

结果如下:
- 192.168.1.1 sichuan_yunyingyong xx1 1480521763049 {"eventName":"xx1","du":"xx","timestamp":"1480521763049","eventParams":{"ContentID":"yixiuge","account":"13856976635","networkType":"WIFI","result":"0","type":"11"}}
-
- 192.168.1.1 sichuan_yunyingyong xx2 1480521763049 {"eventName":"xx2","du":"xx","timestamp":"1480521763049","eventParams":{"ContentID":"yixiuge","account":"13856976636","networkType":"WIFI","result":"0","type":"11"}}
解释了这点,那么我想获取IP , appName , account 字段怎么办呢,我直接给出sql语句了
- select
-
- j1.j1_ip,
-
- j1.j1_appName,
-
- j4.j4_account
-
- FROM tab_json s
-
- lateral view json_tuple(s.json, 'IP', 'appName', 'customEvent') j1 as j1_ip, j1_appName, j1_customEvent
-
- lateral view posexplode(split(regexp_replace(regexp_replace(j1.j1_customEvent,'\\}\\,\\{','\\}\\|\\|\\{'),'\\[|\\]',''), '\\|\\|')) j2 as j2_customEvents_pos, j2_customEvent_json
-
- lateral view json_tuple(j2.j2_customEvent_json, 'eventParams') j3 as j3_eventParams
-
- lateral view json_tuple(j3.j3_eventParams, 'account') j4 as j4_account

结果如下:
- 192.168.1.1 sichuan_yunyingyong 13856976635
-
- 192.168.1.1 sichuan_yunyingyong 13856976636
上面的例子json array有两个元素,如果你只关注其中一个元素,那么可以如下操作
lateral view posexplode(array(split(regexp_replace(regexp_replace(j1.j1_customEvent,'\\}\\,\\{','\\}\\|\\|\\{'),'\\[|\\]',''), '\\|\\|')[1])) j2 as j2_customEvents_pos, j2_customEvent_json
split 跟上数组下标,就能取出某个元素,由于posexplode只接受 array类型的参数,可以使用array函数转换成对应的数组,这样就只有一条数据了,结果如下
192.168.1.1 sichuan_yunyingyong 13856976636
总结下:
在现在的hive版本中,hive 2还没有试用,不知道是不是已经引入了json array的解析函数了,目前的版本是不能通过方法解析的,
思路是,通过给json array替换字符,由原来的 [ {} , {} ] 变成 {} || {} 这样,在转换成数组用 posexplode 函数,这样就可以了
当然实际使用时数据放大也要注意,如果json array只有一个元素就不会放大
在只有一个元素的情况下,直接把 [ ] 去掉,用array 转成数组即可
以后可以自己写一个UDF来解析json array,欲知详情,请听下回分解
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。