Hadoop 2.6 + Hive 1.2.1 + spark-1.4.1(3)

jopen, 10 years ago

1. Creating tables

1) Define a new table schema

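create table user_table(
  id int,
  userid bigint,
  name string,
  -- describe is a reserved word in Hive 1.2, so the column name is quoted with backticks
  `describe` string comment 'description of the user'
)
comment 'user information table'
partitioned by (country string, city string) -- partitions: each partition is simply a folder
clustered by (id) sorted by (userid) into 32 buckets
-- rows are assigned to buckets by hashing id, and rows within a bucket are sorted by userid
-- bucketing lets the useful data fit into limited memory (a performance optimization that also helps join, group by and distinct)
row format delimited -- parse the data with the delimiters given below
fields terminated by '\001' -- delimiter between fields
collection items terminated by '\002' -- delimiter between the elements of an array field
map keys terminated by '\003' -- delimiter between a key and its value inside a map field
-- the delimiters are only used to parse the data: Hive does not transform the raw data that is loaded
-- storage format (rcfile / textfile / sequencefile): textfile is fine for raw data
stored as textfile;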

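Summary:

Compared with textfile and SequenceFile, rcfile is more expensive at load time because of its columnar layout, but it offers a better compression ratio and faster query response. A data warehouse is typically written once and read many times, so overall rcfile has a clear advantage over the other two formats.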
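To make the three delimiters in the CREATE TABLE statement above more concrete, here is a minimal Python sketch of how one line of a delimited text file could be split. It is only an illustration, not Hive code: the row layout (id, name, an array column tags, a map column props) and the function name parse_row are hypothetical, since user_table itself has no array or map columns.

```python
# Delimiters as declared in the CREATE TABLE statement.
FIELD_SEP = "\x01"  # fields terminated by '\001'
ITEM_SEP = "\x02"   # collection items terminated by '\002'
KEY_SEP = "\x03"    # map keys terminated by '\003'


def parse_row(line):
    """Split one raw text line into id, name, tags (array) and props (map).

    The four-column layout is a hypothetical example.
    """
    id_text, name, tags_text, props_text = line.rstrip("\n").split(FIELD_SEP)
    tags = tags_text.split(ITEM_SEP) if tags_text else []
    props = {}
    if props_text:
        # Map entries are separated like array items; key and value by KEY_SEP.
        for entry in props_text.split(ITEM_SEP):
            key, value = entry.split(KEY_SEP, 1)
            props[key] = value
    return {"id": int(id_text), "name": name, "tags": tags, "props": props}


raw = "7\x01alice\x01red\x02blue\x01lang\x03en\x02tz\x03utc\n"
row = parse_row(raw)
print(row)
```

The raw line itself is left untouched; the delimiters only decide how it is interpreted at read time.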

a) Table: managed (internal) table (keywords are not case-sensitive)

Create:

create table t1(id string);

create table t2(id string, name string) row format delimited fields terminated by '\t';

Load:

load data local inpath '/root/Downloads/seq100w.txt' into table t1;

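load data inpath '/seq100w.txt' into table t1; (the file already in HDFS is moved into the table's folder, /hive/t1)

(For the same reason, moving a file in HDFS directly into the folder that belongs to the table also makes its data readable.)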

load data local inpath '/root/Downloads/seq100w.txt' overwrite into table t1;

b) Partition: partitioned table

Create:

create table t3(id string) partitioned by (province string);

Load:

load data local inpath '/root/Downloads/seq100w.txt' into table t3 partition(province ='beijing');

List all partitions of a table:

hive> show partitions table_name;
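Since a partition is just a folder, the location of a partition's data can be sketched as a path built from the partition columns. The Python snippet below is an illustration only: the helper partition_dir is hypothetical, the /hive root is taken from the /hive/t1 example above, the value 'CN' is made up, and any escaping Hive applies to special characters in partition values is ignored.

```python
import posixpath


def partition_dir(table_dir, partition_spec):
    """Return the folder that holds one partition.

    partition_spec is an ordered list of (column, value) pairs, in the
    same order as the partitioned by (...) clause.
    """
    parts = ["{}={}".format(column, value) for column, value in partition_spec]
    return posixpath.join(table_dir, *parts)


# One partition column, as in table t3.
print(partition_dir("/hive/t3", [("province", "beijing")]))
# /hive/t3/province=beijing

# Two partition columns nest as sub-folders, as in user_table.
print(partition_dir("/hive/user_table", [("country", "CN"), ("city", "beijing")]))
# /hive/user_table/country=CN/city=beijing
```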

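c) Bucket Table: bucketed table

Create: create table t4(id string) clustered by (id) into 4 buckets; -- bucket the rows by id

create table t4(id string) clustered by (id) sorted by (id asc) into 4 buckets; -- alternative definition: sort the rows within each bucket in ascending order, so that joining two buckets becomes an efficient merge-sort, which further improves map-side join efficiency

Enforce bucketing on insert, so that rows are spread across the declared buckets: set hive.enforce.bucketing = true;

Load: insert into table t4 select id from t3 where province='beijing';

Overwrite: insert overwrite table bucket_table select name from stu;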

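Sampling queries (assuming bucket_table is bucketed on id into 4 buckets):

select * from bucket_table tablesample(bucket 1 out of 4 on id); -- returns the data of one bucket, namely the first of the 4

select * from bucket_table tablesample(bucket 1 out of 2 on id); -- returns half of the table: with 4 buckets, that is buckets 1 and 3

select * from bucket_table tablesample(bucket 1 out of 4 on rand()); -- returns a random sample of roughly a quarter of the rows (because it does not sample on the bucketing column, it scans the whole data set of the table)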
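The relation between bucket x out of y and the physical buckets can be sketched in a few lines of Python. This is an illustration of the rule as described in the Hive documentation for sampling on the bucketing column, not Hive's own code; the function name sampled_buckets is hypothetical, and the sketch only covers the cases where y divides the bucket count or is a multiple of it.

```python
def sampled_buckets(x, y, num_buckets):
    """Return (buckets, fraction) for TABLESAMPLE(BUCKET x OUT OF y ON col).

    buckets is the list of 1-based bucket numbers that are read, and
    fraction is the share of each listed bucket that is returned.
    Assumes the table is clustered on col into num_buckets buckets.
    """
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if y <= num_buckets:
        if num_buckets % y != 0:
            raise ValueError("this sketch needs y to divide num_buckets")
        # Every y-th bucket, starting at bucket x, is read in full.
        return list(range(x, num_buckets + 1, y)), 1.0
    if y % num_buckets != 0:
        raise ValueError("this sketch needs y to be a multiple of num_buckets")
    # Only a part of a single bucket is read.
    return [(x - 1) % num_buckets + 1], num_buckets / y


print(sampled_buckets(1, 4, 4))    # ([1], 1.0)
print(sampled_buckets(1, 2, 4))    # ([1, 3], 1.0)
print(sampled_buckets(3, 64, 32))  # ([3], 0.5)
```

Sampling on rand() does not fit this scheme, because the rows are not bucketed on rand(); that is why such a query has to scan the whole table.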

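When data is loaded into a bucketed table, the value of the bucketing column is hashed and the hash is taken modulo the number of buckets; the row is then written to the corresponding file. Any single bucket therefore holds an effectively random subset of the users.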
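The hash-then-modulo step can be sketched in Python as follows. This is a rough illustration under stated assumptions, not Hive's implementation: the sketch uses the integer itself as the hash of an int and a Java-style 31-multiplier hash over the UTF-8 bytes of a string, and the names java_style_hash and bucket_for are hypothetical. The hash Hive actually applies depends on the column type and the Hive version.

```python
INT_MAX = 0x7FFFFFFF


def java_style_hash(value):
    """Stand-in hash: ints hash to themselves, strings use h = h * 31 + byte."""
    if isinstance(value, int):
        return value
    h = 0
    for byte in str(value).encode("utf-8"):
        signed = byte - 256 if byte > 127 else byte  # bytes are signed in Java
        h = (h * 31 + signed) & 0xFFFFFFFF           # keep 32-bit wrap-around
    return h


def bucket_for(value, num_buckets):
    """0-based bucket index: non-negative hash modulo the number of buckets."""
    return (java_style_hash(value) & INT_MAX) % num_buckets


# Distribute a few ids over 4 buckets, as in table t4.
buckets = {}
for user_id in ["1", "2", "3", "4", "5", "6", "7", "8"]:
    buckets.setdefault(bucket_for(user_id, 4), []).append(user_id)
print(buckets)
```

Rows with the same id always land in the same bucket, which is what makes bucket-wise joins and sampling on the bucketing column possible.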

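d) External Table: external table

(t5 does not have to live in the warehouse; its storage location can be chosen freely, here /wlan)

Create: create external table t5(id string) location '/wlan'; -- /wlan is a folder

The EXTERNAL keyword creates an external table. The data is controlled by the external location, not by Hive; Hive manages only the metadata (that is, the table schema). The data is therefore not moved into Hive's warehouse directory but is kept in the external location. When you run drop table table_name, the metadata (table schema) is deleted, but the data stays in the external location and is not deleted by Hive.

hive> create external table t1(id string) row format delimited fields terminated by '\t' location '/wlan';

Specifying the field delimiter ('\t' is the separator used in the data) lets Hive parse the files, so that queries do not return NULL. /wlan is a folder; it is best to give the folder the same name as the table, which makes it easier to inspect and manage.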

create external table hadoop_1(id int,name string) row format delimited fields terminated by '\t' location '/wenjianjia';

load data inpath '/wenjianjia/hello' into table hadoop_1 ;

2) Copy the schema of an existing table

-- create new_table with the same schema as user_table

create table new_table like user_table;

3) Rename a table

hive> alter table new_table rename to new_table_1;

4) Create a partitioned table

Create:

create table t3(id string) partitioned by (province string);

Load:

load data local inpath '/root/Downloads/seq100w.txt' into table t3 partition(province ='beijing');

2. Deleting tables

1) Empty a table's data

hadoop fs -rmr /… (simply delete the data files that the table stores in HDFS)

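If the table's own folder is accidentally deleted from HDFS as well, the table definition still remains in the Hive metastore.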

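2) Drop a table

drop table test1;

3) Drop a table partition (removes the partition and the data in it)

hive> alter table dm_newuser_active_month drop partition (batch_date="201404");

When dropping a partition, the batch_date value must be enclosed in quotes.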

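3. Altering tables

1) Add a column to a table

hive> alter table test1 add columns(name string);

2) Change one column of a table

Note: change replaces the definition of the column being modified. It changes the table schema, not the data.

alter table table_name change old_column_name new_column_name new_type comment 'remark';

3) Replace all columns of a table

Note: replace columns replaces all existing columns of the table. It changes the table schema, not the data.

alter table table_name replace columns(age int comment 'only keep the first column');

4) Add a table partition

hive> alter table ods_smail_mx_201404 add partition (day=20140401); -- add a single partition on its own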

A partition is also created implicitly when data is inserted into it:

create table user_table_2(
  id int,
  name string
)
comment 'user information table'
partitioned by (dt string)
stored as textfile;

insert overwrite table user_table_2
partition(dt='2015-11-01')
select id, col2 name
from table_4;

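4. Viewing tables

1) Show the CREATE TABLE statement

show create table tmp_jzl_20150310_diff;

2) Show the table schema

desc tmp_jzl_20150310_diff;

3) Show the table's partitions

show partitions tmp_jzl_20150310_diff;

4) List the tables in a database

hive> use tmp;

List all tables in the tmp database:

hive> show tables;

List the tables in the tmp database whose names start with tmp_jzl_20150504: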

hive> show tables 'tmp_jzl_20150504*';

tmp_jzl_20150504_1

tmp_jzl_20150504_2

tmp_jzl_20150504_3

tmp_jzl_20150504_4

Source: http://my.oschina.net/repine/blog/552428