Hive Basic Syntax Practice (3): Partition Operations and Bucket Operations - 白红宇's Personal Blog

Published: 2021-10-09 07:57:17 Category: Technical articles


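(1) Partition operations

Hive partitioning is enabled by adding a PARTITIONED BY clause when the table is created. The partitioning dimension is not a column of the actual data; the partition value is supplied when rows are inserted. To query a particular partition, filter on the partition key in a WHERE clause, for example "WHERE tablename.partition_key > a". The syntax for creating a partitioned table is as follows.

CREATE TABLE table_name(...) PARTITIONED BY (dt STRING, country STRING)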
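To make the effect of a WHERE filter on a partition key more concrete, here is a minimal Python sketch of partition pruning. It is only a simulation under stated assumptions, not Hive's planner: the directory names in PARTITIONS and the helpers parse_spec and prune are hypothetical, and the partition values are compared as plain strings.

```python
# Hypothetical partition directories for a table partitioned by (dt, country).
PARTITIONS = [
    "dt=2015-01-17/country=cn",
    "dt=2015-01-18/country=cn",
    "dt=2015-01-18/country=us",
    "dt=2015-01-19/country=us",
]


def parse_spec(path):
    """Turn 'dt=2015-01-18/country=cn' into {'dt': '2015-01-18', 'country': 'cn'}."""
    return dict(part.split("=", 1) for part in path.split("/"))


def prune(partitions, predicate):
    """Keep only the directories whose partition values satisfy the predicate."""
    return [p for p in partitions if predicate(parse_spec(p))]


# Similar in spirit to: WHERE dt > '2015-01-17' AND country = 'us'
selected = prune(PARTITIONS, lambda s: s["dt"] > "2015-01-17" and s["country"] == "us")
print(selected)
```

Only the directories that survive the filter would need to be read, which is why filtering on partition keys tends to be cheap.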

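1. Creating partitions

Hive partitioned tables do not have elaborate partition types (range, list, hash, composite, and so on). A partition column is also not an actual field of the table; it is one or more pseudo-columns. In other words, the table's data files do not store the partition columns or their values.

Create a simple partitioned table.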

hive> create table partition_test(member_id string,name string) partitioned by (stat_date string,province string) row format delimited fields terminated by ',';


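This example defines stat_date and province as the partition columns. Normally a partition has to be created before it can be used. For example:

hive> alter table partition_test add partition (stat_date='2015-01-18',province='beijing');

This creates a partition, and Hive creates a corresponding directory in HDFS. (The listing below was captured from a run that used stat_date=2018-05-18.)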

$ hadoop fs -ls /user/hive/warehouse/partition_test/stat_date=2018-05-18
18/05/18 18:18:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-05-18 18:10 /user/hive/warehouse/partition_test/stat_date=2018-05-18/province=beijing
---- shows the partition that was just created

Each partition gets its own directory. In the example above, stat_date is the top level and province is the second level.
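The directory layout and the "pseudo-column" idea can be sketched in a few lines of Python. This is an illustration only: partition_dir and data_line are hypothetical helpers, the sample row is made up, and the warehouse root is simply the one visible in the listing above.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"  # warehouse root as it appears in the listing above


def partition_dir(table, spec):
    """Build a partition directory; spec is an ordered list of (column, value)."""
    parts = [f"{column}={value}" for column, value in spec]
    return posixpath.join(WAREHOUSE, table, *parts)


def data_line(row, data_columns, sep=","):
    """Serialize only the data columns; partition values live in the path."""
    return sep.join(row[c] for c in data_columns)


# Made-up sample row, for illustration only.
row = {"member_id": "1001", "name": "alice",
       "stat_date": "2018-05-18", "province": "beijing"}

path = partition_dir("partition_test",
                     [("stat_date", row["stat_date"]), ("province", row["province"])])
line = data_line(row, ["member_id", "name"])
print(path)
print(line)
```

The order of the (column, value) pairs decides the nesting, which matches stat_date being the outer directory and province the inner one.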

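2. Inserting data

Use a helper non-partitioned table, partition_test_input, as the source for inserting data into partition_test. The steps are as follows.

1) Inspect the structure of partition_test_input with the following command.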

hive> desc partition_test_input;
OK
member_id               string
name                    string
stat_date               string
province                string

# Partition Information
# col_name              data_type               comment

stat_date               string
province                string
Time taken: 0.142 seconds, Fetched: 10 row(s)

2) View the data in partition_test_input with the following command.

hive> select * from partition_test_input;

3) Insert data into a partition of partition_test with the following command.

hive> insert overwrite table partition_test partition(stat_date='2015-01-18',province='jiangsu') select member_id,name from partition_test_input where stat_date='2015-01-18' and province='jiangsu';
Query ID = hadoop_20180518182626_53ea7084-acb5-421f-ae66-4f3e2898cc2a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1526636465246_0001, Tracking URL = http://master:8088/proxy/application_1526636465246_0001/
Kill Command = /opt/modules/hadoop-2.6.0/bin/hadoop job  -kill job_1526636465246_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-05-18 18:26:23,547 Stage-1 map = 0%,  reduce = 0%
2018-05-18 18:26:33,030 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.98 sec
MapReduce Total cumulative CPU time: 1 seconds 980 msec
Ended Job = job_1526636465246_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://ns/tmp/hive/hadoop/ed837b62-1bd7-4569-96ea-637844deb0cb/hive_2018-05-18_18-26-08_605_2975853227273346012-1/-ext-10000
Loading data to table default.partition_test partition (stat_date=2018-05-21, province=sichuan)
Partition default.partition_test{stat_date=2018-05-21, province=sichuan} stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.98 sec   HDFS Read: 290 HDFS Write: 86 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 980 msec
OK
Time taken: 26.543 seconds

To insert into several partitions in one statement, put the FROM clause first and follow it with one INSERT per target partition. Each SELECT then omits its own FROM clause.

hive> from partition_test_input
    > insert overwrite table partition_test partition(stat_date='2015-01-18',province='jiangsu')
    >   select member_id,name where stat_date='2015-01-18' and province='jiangsu'
    > insert overwrite table partition_test partition(stat_date='2015-01-28',province='sichuan')
    >   select member_id,name where stat_date='2015-01-28' and province='sichuan'
    > insert overwrite table partition_test partition(stat_date='2015-01-28',province='beijing')
    >   select member_id,name where stat_date='2015-01-28' and province='beijing';
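The point of the multi-insert form is that the source table is read once and each row is routed to whichever branches it satisfies. The Python sketch below imitates that idea; it is a simplified model rather than Hive's execution plan, the rows in SOURCE are made up, and multi_insert is a hypothetical helper in which each target partition doubles as the branch's WHERE filter.

```python
from collections import defaultdict

# Made-up source rows: (member_id, name, stat_date, province).
SOURCE = [
    ("1", "alice", "2015-01-18", "jiangsu"),
    ("2", "bob", "2015-01-28", "sichuan"),
    ("3", "carol", "2015-01-28", "beijing"),
    ("4", "dave", "2015-01-18", "shanghai"),  # matches no branch, so it is dropped
]

# One entry per INSERT branch: the target partition is also the filter.
BRANCHES = [
    ("2015-01-18", "jiangsu"),
    ("2015-01-28", "sichuan"),
    ("2015-01-28", "beijing"),
]


def multi_insert(source, branches):
    """Scan the source once and route each row to every branch it satisfies."""
    targets = defaultdict(list)
    for member_id, name, stat_date, province in source:  # single pass over the input
        for branch in branches:
            if (stat_date, province) == branch:
                targets[branch].append((member_id, name))
    return dict(targets)


result = multi_insert(SOURCE, BRANCHES)
for partition, rows in sorted(result.items()):
    print(partition, rows)
```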

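3. Dynamic partitioning

With the approach above, a large data source needs one INSERT per partition, which quickly becomes tedious. Dynamic partitioning solves this: rows are routed to the matching partition automatically, based on the values returned by the query.

Dynamic partitioning is turned on with the following settings:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;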

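Using dynamic partitioning is straightforward. Suppose data is to be inserted under the partition stat_date='2015-01-18', and Hive is left to decide which province sub-partition each row goes to. Here stat_date is the static partition column and province is the dynamic partition column. The dynamic partition column must appear last in the SELECT list.

hive> insert overwrite table partition_test partition(stat_date='2015-01-18',province)
    >   select member_id,name,province from partition_test_input where stat_date='2015-01-18';

Note that dynamic partitioning does not allow the top-level partition column to be dynamic while a lower-level column is static, because every top-level partition would then have to create the sub-partition defined by the static column. Static partition columns must come before dynamic ones in the PARTITION clause.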
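As a rough model of the two rules just described (the trailing SELECT column picks the sub-partition, and static columns come before dynamic ones), consider this Python sketch. check_spec and dynamic_insert are hypothetical helpers and the rows are made up; the code mimics the behavior described here and is not Hive code.

```python
from collections import defaultdict


def check_spec(spec):
    """spec is an ordered list of (column, value); value None marks a dynamic column."""
    seen_dynamic = False
    for column, value in spec:
        if value is None:
            seen_dynamic = True
        elif seen_dynamic:
            raise ValueError(f"static column {column!r} follows a dynamic column")


def dynamic_insert(rows, stat_date):
    """rows are (member_id, name, province); the trailing column picks the partition."""
    partitions = defaultdict(list)
    for member_id, name, province in rows:
        key = f"stat_date={stat_date}/province={province}"
        partitions[key].append((member_id, name))
    return dict(partitions)


# partition(stat_date='2015-01-18', province): static first, dynamic last.
check_spec([("stat_date", "2015-01-18"), ("province", None)])

# Made-up rows, for illustration only.
rows = [("1", "alice", "jiangsu"), ("2", "bob", "sichuan"), ("3", "carol", "jiangsu")]
result = dynamic_insert(rows, "2015-01-18")
for key in sorted(result):
    print(key, result[key])
```

Calling check_spec with the dynamic column first, for example [("stat_date", None), ("province", "beijing")], raises ValueError in this sketch.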

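hive.exec.max.dynamic.partitions.pernode: the maximum number of dynamic partitions that each mapper or reducer node may create; exceeding it raises an error (default 100).

hive.exec.max.dynamic.partitions: the maximum total number of dynamic partitions that a single DML statement may create (default 1000).

hive.exec.max.created.files: the maximum number of HDFS files that all mappers and reducers of a MapReduce job may create (default 100000).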
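The difference between the per-node limit and the per-statement limit is easier to see with a small model. The Python sketch below is a simplification, not Hive's actual accounting: check_limits is a hypothetical helper, the limits are passed in explicitly, and the tiny values are chosen only so the example is easy to follow.

```python
def check_limits(partitions_by_task, max_per_node, max_total):
    """partitions_by_task holds one set of partition keys per mapper/reducer task."""
    for task_id, partitions in enumerate(partitions_by_task):
        if len(partitions) > max_per_node:
            raise RuntimeError(
                f"task {task_id} created {len(partitions)} partitions, "
                f"limit per node is {max_per_node}"
            )
    total = len(set().union(*partitions_by_task))  # distinct partitions overall
    if total > max_total:
        raise RuntimeError(f"statement created {total} partitions, limit is {max_total}")
    return total


# Two tasks that both touch province=jiangsu: 2 partitions each, 3 distinct overall.
tasks = [
    {"province=beijing", "province=jiangsu"},
    {"province=jiangsu", "province=sichuan"},
]
print(check_limits(tasks, max_per_node=2, max_total=3))
```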

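Try to keep rows that share the same partition column values in the same MapReduce task, so that each task creates as few new directories as possible. DISTRIBUTE BY can be used to group rows with the same partition column values together, as follows.

hive> insert overwrite table partition_test partition(stat_date,province)
    >   select member_id,name,stat_date,province from partition_test_input distribute by stat_date,province;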
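A small simulation shows why grouping rows by partition key reduces the number of output files. This is a sketch under simplifying assumptions: every (task, partition) pair is taken to produce one file, files_created is a hypothetical helper, and zlib.crc32 merely stands in for whatever hash Hive actually uses to distribute rows.

```python
import zlib


def files_created(partition_keys, num_tasks, assign):
    """Count distinct (task, partition) pairs; each pair is assumed to yield one file."""
    return len({(assign(i, key) % num_tasks, key) for i, key in enumerate(partition_keys)})


# 12 made-up rows spread evenly over 3 partitions.
keys = [f"province={p}" for p in ("beijing", "jiangsu", "sichuan")] * 4


def round_robin(i, key):
    """No DISTRIBUTE BY: rows land on tasks regardless of their partition."""
    return i


def by_partition_key(i, key):
    """DISTRIBUTE BY the partition column: equal keys always reach the same task."""
    return zlib.crc32(key.encode())


print(files_created(keys, 4, round_robin))
print(files_created(keys, 4, by_partition_key))
```

In this toy setup the scattered assignment touches every partition from every task, while distributing by the partition key yields one file per partition.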

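(2) Bucket operations

A Hive table can be split into partitions, and a table or partition can be split further into buckets (BUCKET). Bucketing is declared with CLUSTERED BY, and the data inside each bucket can be kept sorted with SORTED BY in the table definition (SORT BY in the query that loads it).

The main uses of buckets are as follows.

1) Data sampling;

2) Speeding up certain queries, for example map-side joins.

Pay particular attention to the fact that CLUSTERED BY and SORTED BY do not affect how data is loaded. This means the user is responsible for loading the data correctly, including bucketing and sorting it. Setting 'set hive.enforce.bucketing=true' makes Hive automatically adjust the number of reducers in the final stage to match the number of buckets. Alternatively, the user can set mapred.reduce.tasks manually to match the bucket count. The recommended option is:

hive> set hive.enforce.bucketing=true;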

A worked example follows.

1) Create a temporary table student_tmp and load data into it.

hive> desc student_tmp;
hive> select * from student_tmp;

2) Create the student table.

hive> create table student(id int,age int,name string)
    > partitioned by (stat_date string)
    > clustered by (id) sorted by (age) into 2 buckets
    > row format delimited fields terminated by ',';

3) Set the configuration property.

hive> set hive.enforce.bucketing=true;

4) Insert the data.

hive> from student_tmp
    > insert overwrite table student partition(stat_date='2015-01-19')
    >   select id,age,name where stat_date='2015-01-18' sort by age;

5) View the files in the partition directory.

$ hadoop fs -ls /user/hive/warehouse/student/stat_date=2015-01-19/

6) View the sampled data.
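To see what ends up in the bucket files, here is a minimal Python sketch of clustering by id into 2 buckets and sorting each bucket by age. It rests on an assumption that should be checked against your Hive version: the bucket is taken to be (hash & 0x7fffffff) % num_buckets, with an integer hashing to its own value. bucket_for and write_buckets are hypothetical helpers and the student rows are made up.

```python
def bucket_for(id_value, num_buckets):
    """Assumed rule: (hash & 0x7fffffff) % num_buckets, an int hashing to itself."""
    return (id_value & 0x7FFFFFFF) % num_buckets


def write_buckets(rows, num_buckets):
    """Group rows (id, age, name) by bucket, then sort each bucket by age."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[0], num_buckets)].append(row)
    return [sorted(bucket, key=lambda r: r[1]) for bucket in buckets]


# Made-up student rows: (id, age, name).
students = [(1, 20, "a"), (2, 19, "b"), (3, 18, "c"), (4, 21, "d")]
buckets = write_buckets(students, 2)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Under this assumption, even ids share one bucket file and odd ids share the other, and each file is ordered by age.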

hive> select * from student tablesample(bucket 1 out of 2 on id);

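tablesample is the sampling clause. Its syntax is as follows.

tablesample(bucket x out of y [on colname])

Here x is the number of the bucket to start sampling from, and y must be a multiple or a factor of the total number of buckets in the table.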
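The relationship between x, y and the table's bucket count can be sketched as follows. This follows the commonly described behavior (when y divides the bucket count, whole buckets x, x+y, ... are read; when y is a multiple, a fraction of a single bucket is read) and should be treated as an approximation rather than a statement of Hive's exact implementation. sampled_buckets is a hypothetical helper, and buckets are numbered from 1.

```python
from fractions import Fraction


def sampled_buckets(x, y, num_buckets):
    """Return (bucket_number, fraction_read) pairs for 'bucket x out of y'."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if num_buckets % y == 0:
        # y is a factor of the bucket count: read every y-th whole bucket from x.
        return [(b, Fraction(1)) for b in range(x, num_buckets + 1, y)]
    if y % num_buckets == 0:
        # y is a multiple of the bucket count: read part of a single bucket.
        return [((x - 1) % num_buckets + 1, Fraction(num_buckets, y))]
    raise ValueError("y must be a multiple or a factor of the bucket count")


print(sampled_buckets(1, 2, 2))  # the query above: table with 2 buckets
print(sampled_buckets(1, 1, 2))  # y is a factor: both buckets are read
print(sampled_buckets(3, 4, 2))  # y is a multiple: part of one bucket is read
```

For the student table with 2 buckets, "bucket 1 out of 2" corresponds to reading the first bucket in full in this model.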

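That covers the main content of this part. These are notes from my own learning process, and I hope they offer some guidance. If you found them useful, a like is appreciated; if not, thanks for bearing with me, and please point out any mistakes. Follow the blog to get updates as soon as they are posted. Thank you!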

Reposted from: https://blog.csdn.net/py_123456/article/details/80411509
