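Using Hive complex data types in a data warehouse (array, map, struct, and their combinations)

Updated: 2023-07-02 18:45:12

Context: when building a wide table, you may want each row to hold more information, which is a reason to model some columns with complex data types.

Complex data types: array, map, struct

1. array: all elements must be of the same type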

Sample data (the name and the city list are separated by a tab):

ruoze beijing,shanghai,tianjin,hangzhou

jepson changchun,chengdu,wuhan,beijing

Create the table

create table hive_array(name string, work_locations array<string>)
row format delimited fields terminated by '\t'
collection items terminated by ',';

hive> desc formatted hive_array;

#col_name data_type comment

name string

work_locations array<string>

# Load a local file

load data local inpath '/home/hadoop/data/' overwrite into table hive_array;

# Query the data

hive> select * from hive_array;

ruoze ["beijing","shanghai","tianjin","hangzhou"]

jepson ["changchun","chengdu","wuhan","beijing"]

hive> select name, size(work_locations) from hive_array;

ruoze 4

jepson 4

hive> select name, work_locations[0] from hive_array;

ruoze beijing

jepson changchun

hive> select * from hive_array where array_contains(work_locations, 'tianjin');

ruoze ["beijing","shanghai","tianjin","hangzhou"]
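
The array operations above can also be tried without a table or a data file. The following is a minimal sketch that builds an array inline with Hive's array() constructor. It assumes a Hive version that accepts SELECT without a FROM clause, and the column aliases are made up for illustration.

```sql
-- An array built inline with array(); no table or data file is needed.
-- first_city:   index 0 is the first element
-- missing_city: an out-of-range index yields NULL instead of an error
-- city_count:   size() returns the number of elements
-- has_tianjin:  array_contains() returns a boolean
select
  cities[0]                         as first_city,
  cities[5]                         as missing_city,
  size(cities)                      as city_count,
  array_contains(cities, 'tianjin') as has_tianjin
from (
  select array('beijing', 'shanghai') as cities
) t;
```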

2. map: key-value pairs, e.g. Map('a'#1,'b'#2)

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28

2,lisi,father:mayun#mother:huangyi#brother:guanyu,22

3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29

4,mayun,father:mayongzhen#mother:angelababy,26

# Create the table

create table hive_map(id int, name string, family map<string,string>, age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

# Describe the table

hive> desc formatted hive_map;

#col_name data_type comment

id int

name string

family map<string,string>

age int

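# Load local data

load data local inpath '/home/hadoop/data/'
overwrite into table hive_map;

Querying

Using a map: a map holds key-value pairs, i.e. (key, value). Keys within one map must be unique; if a key is repeated, the later value overwrites the earlier one.

Map layout: fieldName(k1:v1,k2:v2,…)

Access syntax: fieldName['key'], using square brackets ([])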
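
A quick way to see the bracket syntax in action is to build a map inline with Hive's map() constructor. This sketch does not use the sample data set; the aliases are illustrative, and it assumes SELECT without FROM is available in your Hive version.

```sql
-- map(k1, v1, k2, v2, ...) builds a map<string,string> inline.
-- father:  an existing key returns its value
-- sister:  a missing key returns NULL rather than failing
-- entries: size() counts the key-value pairs
select
  m['father'] as father,
  m['sister'] as sister,
  size(m)     as entries
from (
  select map('father', 'xiaoming', 'mother', 'xiaohuang') as m
) t;
```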

hive> select * from hive_map;

1 zhangsan {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} 28

2 lisi {"father":"mayun","mother":"huangyi","brother":"guanyu"} 22

3 wangwu {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"} 29

4 mayun {"father":"mayongzhen","mother":"angelababy"} 26

hive> select id,name,family['father'] as father, family['sister'] from hive_map;

1 zhangsan xiaoming NULL

2 lisi mayun NULL

3 wangwu wangjianlin jingtian

4 mayun mayongzhen NULL

hive> select id,name,map_keys(family) from hive_map;

1 zhangsan ["father","mother","brother"]

2 lisi ["father","mother","brother"]

3 wangwu ["father","mother","sister"]

4 mayun ["father","mother"]

hive> select id,name,map_values(family) from hive_map;

1 zhangsan ["xiaoming","xiaohuang","xiaoxu"]

2 lisi ["mayun","huangyi","guanyu"]

3 wangwu ["wangjianlin","ruhua","jingtian"]

4 mayun ["mayongzhen","angelababy"]

hive> select id,name,size(family) from hive_map;

1 zhangsan 3

2 lisi 3

3 wangwu 3

4 mayun 2

hive> select id,name,family['brother'] from hive_map where array_contains(map_keys(family),'brother');

OK

1 zhangsan xiaoxu

2 lisi guanyu
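
Because a lookup on a missing key returns NULL (as the family['sister'] column earlier shows), the same filter can usually be written as a NULL check. This is a sketch of an equivalent form, assuming no row stores an explicit NULL value under the key brother; if such rows could exist, the array_contains version is the safer one.

```sql
-- Keep only the rows whose family map has a non-NULL value for 'brother'.
select id, name, family['brother'] as brother
from hive_map
where family['brother'] is not null;
```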

3. struct

// Raw data

cat

192.168.1.1#zhangsan:40

192.168.1.2#lisi:50

192.168.1.3#wangwu:60

192.168.1.4#zhaoliu:70

// Create the table and load the data

create table hive_struct(ip string, userinfo struct<name:string,age:int>)
row format delimited fields terminated by '#'
collection items terminated by ':';

# Load the data

load data local inpath '/home/hadoop/data/'
overwrite into table hive_struct;

# Query the data

hive> select * from hive_struct;

192.168.1.1 {"name":"zhangsan","age":40}

192.168.1.2 {"name":"lisi","age":50}

192.168.1.3 {"name":"wangwu","age":60}

192.168.1.4 {"name":"zhaoliu","age":70}

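// Accessing fields

Using a struct: a struct is a record type, declared as fieldName struct<field1:type1,field2:type2,…>, meaning the column is composed of several named fields.

To read one field, use fieldName.field1, i.e. dot (.) access.

hive> select ip,userinfo.name,userinfo.age from hive_struct;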

192.168.1.1 zhangsan 40

192.168.1.2 lisi 50

192.168.1.3 wangwu 60

192.168.1.4 zhaoliu 70
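
Dot access is not limited to the SELECT list; struct fields can also be used for filtering and sorting. The following sketch builds two structs inline with Hive's named_struct() so that it runs without the data file. The aliases info, name and age are illustrative, and SELECT without FROM is assumed to be available.

```sql
-- named_struct(name1, value1, name2, value2, ...) builds a struct inline.
-- The outer query filters on one struct field and sorts by its alias.
select t.info.name as name, t.info.age as age
from (
  select named_struct('name', 'zhangsan', 'age', 40) as info
  union all
  select named_struct('name', 'zhaoliu', 'age', 70) as info
) t
where t.info.age > 50
order by age desc;
```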

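4. Combining map and struct

Table definition

Take borrow_repay_record as an example: its key is phaseNumber and its value is a struct.

map<string,struct<…>>: the value type can itself be a complex data type, which gives nested complex data types. The syntax still follows the rules of each individual type. For example, to read duedate from the value stored under the map key load:

The syntax is

select borrow_repay_record['load'].duedate from dw_kuanbiao where dt='2019-02-12'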
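
The article does not show the table definition itself, so the following is only a hypothetical sketch of what a column of this shape could look like. The table name demo_loan, the choice of struct fields, their types, and the string type of the dt partition column are all assumptions made for illustration.

```sql
-- Hypothetical wide table; names and types are illustrative only.
-- Map key: phase number. Map value: one repayment record for that phase.
create table demo_loan (
  user_id string,
  borrow_repay_record map<string, struct<duedate:string, principal:double>>
)
partitioned by (dt string);

-- Bracket access picks the map value, dot access picks the struct field.
select borrow_repay_record['19'].duedate
from demo_loan
where dt = '2019-02-12';
```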

When the key is unknown (or irrelevant), how do we pull out the values we need? This is where expanding the map comes in (turning one row into many rows).

Take the borrow_repay_record of the record with user_id 100000 (note that a user_id can match more than one row).

The result looks roughly like this:

{"19":{"duedate":"2015-04-23 14:51:42","repayoverduemgmtfee":null,},

"18":{"duedate":"2015-03-23 14:51:42","repayoverduemgmtfee":null,},

"15":{"duedate":"2014-12-23 14:51:42","repayoverduemgmtfee":null,******},

"14":{"duedate":"2014-11-23 14:51:42","repayoverduemgmtfee":null,*****}}

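As you can see, this single row actually holds a lot of information.

To reach any field of any entry without looking it up by key, we need to break the row apart into (key, value) pairs, one per row, and still be able to combine them with the table's other columns. Hive provides functions for this.

explode() breaks one row into many rows, but on its own it cannot combine the generated rows with the table's other columns.

LATERAL VIEW makes up for this, so the two are usually used together.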
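
To see what explode() produces by itself, it can be run on a map built inline. This is a sketch with made-up aliases (phase, duedate) and made-up values, assuming SELECT without FROM is available:

```sql
-- Exploding a map yields one row per entry, with two columns (key, value).
select explode(map('18', '2015-03-23', '19', '2015-04-23')) as (phase, duedate);

-- Hive rejects any other expression placed next to explode() in the same
-- SELECT list; that is the limitation LATERAL VIEW works around.
```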

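For example:

SELECT user_id,phaseNumber,value from dw_loan LATERAL VIEW explode(borrow_repay_record) adTable AS phaseNumber,value where dt = '2019-01-29' and user_id = '100000'

LATERAL VIEW explode(borrow_repay_record) adTable AS phaseNumber,value splits the map into one row per entry and joins each of those rows back to the source row, producing multiple rows.

You can treat what follows from as an ordinary table; it behaves no differently from the tables you normally query.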
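
The row-joining behaviour can be checked without the warehouse table. The sketch below feeds a single inline row into LATERAL VIEW explode(); the aliases t, r, repay, phase and duedate and all values are hypothetical.

```sql
-- One source row with an inline map of two entries.
-- The result has two rows, and user_id is repeated on each of them.
select t.user_id, r.phase, r.duedate
from (
  select '100000' as user_id,
         map('18', '2015-03-23', '19', '2015-04-23') as repay
) t
lateral view explode(t.repay) r as phase, duedate;
```

If source rows with an empty or NULL map must be kept, LATERAL VIEW OUTER returns them with NULL in the exploded columns instead of dropping them.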

So if you need the sum of a field, you can aggregate on it directly.

For example, to sum the principal, which is a field of the map value:

select sum(value.principal) from dw_kuanbiao LATERAL VIEW explode(borrow_repay_record) adTable AS phaseNumber,value where user_id ='100000'
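
The same pattern extends to grouped aggregates. As a sketch, assume a hypothetical table demo_loan(user_id string, borrow_repay_record map<string, struct<duedate:string, principal:double>>) partitioned by dt; none of these names or types are confirmed by the article.

```sql
-- Number of phases and total principal per user for one partition.
-- demo_loan and its column types are assumptions for illustration.
select user_id,
       count(*)           as phases,
       sum(rec.principal) as total_principal
from demo_loan
lateral view explode(borrow_repay_record) r as phase, rec
where dt = '2019-01-29'
group by user_id;
```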

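Note: the wide table uses Hive complex data types such as map and struct, as well as nested ones such as map<string, struct>.

Complex data types let a single row hold more information, but they also make loading more complicated. To simplify loading tables that contain them, an intermediate table is used: the data is first loaded into the intermediate table, whose columns match the final table, and an MR job then cleans the intermediate table so that the data meets the requirements of the complex data types.

(That is, the data is first loaded into temp_kuanbiao and then moved into dw_kuanbiao.)

The map<string,struct> column replaces what was originally a string column.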
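
When the map values are plain strings rather than structs, the string-to-map conversion does not necessarily need a separate job: Hive's built-in str_to_map() can do it inside an INSERT ... SELECT. This is a different and simpler technique than the MR step described above, shown only as a sketch; the tables demo_stage and demo_final and their columns are hypothetical.

```sql
-- Hypothetical staging table: the map is still one delimited string,
-- e.g. 'father:xiaoming#mother:xiaohuang'.
create table demo_stage (id int, family_raw string);

-- Hypothetical final table with a real map column.
create table demo_final (id int, family map<string,string>);

-- str_to_map(text, entry_delimiter, key_value_delimiter)
insert overwrite table demo_final
select id, str_to_map(family_raw, '#', ':')
from demo_stage;
```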

temp_kuanbiao table definition

…

COMMENT 'Listing information table'
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')

…

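dw_kuanbiao table definition

…

ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'colelction.delim'='|',
'field.delim'=',',
'mapkey.delim'=':',
'serialization.format'=','
)

(The spelling colelction.delim is how older Hive releases name this property in their SerDe constants, so it is kept as is.)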

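The original string column is split on | into key:struct entries (the delimiter inside each struct is then changed from the original @ to \004; this is done by a separate MR job).

Note: with the syntax Hive currently provides, the delimiter between map entries and the delimiter between struct fields cannot both be changed; only one of them can. The remaining delimiters are assigned in order from ASCII codes 1 to 8. Specifying '|' as the collection delimiter actually sets the delimiter between map entries, so the struct field delimiter defaults to '\004'. Therefore the only thing left to do is change the struct delimiter in the data to '\004'.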
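
The delimiter levels described in the note can be laid out as a DDL sketch. The table demo_nested, its columns and the sample line are hypothetical; the delimiters are the ones named above.

```sql
-- Delimiters by nesting level for a map<string, struct<...>> column:
--   level 1 (between columns):            ','  set by FIELDS TERMINATED BY
--   level 2 (between map entries):        '|'  set by COLLECTION ITEMS TERMINATED BY
--   level 3 (between a key and its value): ':' set by MAP KEYS TERMINATED BY
--   level 4 (between struct fields inside a map value): not settable here,
--           so it falls back to '\004' as described in the note.
create table demo_nested (
  user_id string,
  repay map<string, struct<duedate:string, principal:double>>
)
row format delimited
  fields terminated by ','
  collection items terminated by '|'
  map keys terminated by ':';

-- A raw line for this table would look like this (^D stands for byte \004):
-- 100000,18:2015-03-23^D100.0|19:2015-04-23^D150.0
```

The sample values avoid ':' inside the struct fields on purpose, since ':' is already used as the key-value delimiter.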

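Note: no statement was found for changing the map's delimiters. This was handled by modifying the metastore metadata and then reloading the Hive partitions.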
