赞
踩
一、下载TPCH生成工具
官方网址:http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp
本文中安装的是:
TPC-H 2.17.1 pdf Download TPCH_Tools.zip
可通过wget、curl来获取安装包。
【注】获取包地址,需要填写Email信息进行申请,无伤大雅,照做。请确保,您的邮箱可以收到自动发送的包含连接地址的邮件,
这个地址才是真正的下载地址。
二、安装TPCH生成工具
下载完成后,您可能看到这样的文件:
<span style="color:#CC0000;">c81a6b51-9f9c-4a5b-8979-9bf85688767b-tpc-h-tool.zip</span>
unzip c81a6b51-9f9c-4a5b-8979-9bf85688767b-tpc-h-tool.zip
第一个,无用,忽略之。进入第二个。
- <span style="font-size:12px;">[root@node20 tpch_2_17_0]# cd dbgen/
- [root@node20 dbgen]# ls
- answers bm_utils.c build.o dbgen driver.o dsstypes.h makefile permute.o qgen queries rnd.c rng64.h speed_seed.o tpcd.h update_release.sh
- bcd2.c bm_utils.o check_answers dbgen.dsp dss.ddl HISTORY makefile.suite PORTING.NOTES qgen.c README rnd.h rng64.o tests tpch.dsw variants
- bcd2.h BUGS column_split.sh dists.dss dss.h load_stub.c permute.c print.c qgen.o reference rnd.o shared.h text.c tpch.sln varsub.c
- bcd2.o build.c config.h driver.c dss.ri load_stub.o permute.h print.o qgen.vcproj release.h rng64.c speed_seed.c text.o tpch.vcproj varsub.o
- </span>
[root@node20 dbgen]# ./dbgen -help
你会看到相应的参数信息。详情,自己看。
三、生成指定的数据库文件
1. 编辑Makefile文件,具体的参数配置可参考如下示例。
<span style="font-size:12px;">[root@node20 dbgen]# cp makefile.suite makefile</span>
- <span style="font-size:12px;">[root@node20 dbgen]# vim makefile
- </span>CC = gcc
- DATABASE= SQLSERVER
- MACHINE = LINUX
- WORKLOAD = TPCH
- CFLAGS = -g -DDBNAME=\"dss\" -D$(MACHINE) -D$(DATABASE) -D$(WORKLOAD) -DRNG_TEST -D_FILE_OFFSET_BITS=64
- LDFLAGS = -O
执行make。
2.执行dbgen,生成orders数据表,大小为5G。
- <span style="font-size:12px;">[root@node20 dbgen]# ./dbgen -vf -s 5 -T O
- TPC-H Population Generator (Version 2.17.0)
- Copyright Transaction Processing Performance Council 1994 - 2010
- Generating data for order table/
- Preloading text ... 100%
- done.
- </span>
- <span style="font-size:12px;">[root@node20 dbgen]# ls -lh *.tbl
- -rw-r--r-- 1 root root 830M May 11 14:01 orders.tbl</span><span style="font-size:14px;">
- </span>
实际上就是,将tpch-tool生成的
*.tbl
文件加载到hadoop系统中,然后,通过impala-shell来创建相应的表。
1.clone脚本工程,用于自动完成加载工作。
参考网址:https://github.com/kj-ki/tpc-h-impala
2.将利用TPCH工具生成的数据表文件,移动到该工程目录的data/文件夹下。
- <span style="font-size:12px;">[root@node20 data]# mv ../../../tpch_2_17_0/dbgen/*.tbl ./
- [root@node20 data]# ls
- lineitem.tbl orders.tbl tpch_prepare_data.sh</span>
3.加载数据到Hadoop系统中。
- <span style="font-size:12px;">[root@node20 data]# ./tpch_prepare_data.sh
- mkdir: `/tpch': File exists
- mkdir: `/tpch/customer': File exists
- mkdir: `/tpch/lineitem': File exists
- mkdir: `/tpch/nation': File exists
- mkdir: `/tpch/orders': File exists
- mkdir: `/tpch/part': File exists
- mkdir: `/tpch/partsupp': File exists
- mkdir: `/tpch/region': File exists
- mkdir: `/tpch/supplier': File exists
- copyFromLocal: `customer.tbl': No such file or directory
- copyFromLocal: `/tpch/lineitem/lineitem.tbl': File exists
- copyFromLocal: `nation.tbl': No such file or directory
- copyFromLocal: `part.tbl': No such file or directory
- copyFromLocal: `partsupp.tbl': No such file or directory
- copyFromLocal: `region.tbl': No such file or directory
- copyFromLocal: `supplier.tbl': No such file or directory
- </span>

查看导入情况:
- <span style="font-size:12px;">[root@node20 data]# hadoop fs -ls /tpch
- Found 8 items
- drwxr-xr-x - root supergroup 0 2016-05-06 12:10 /tpch/customer
- drwxr-xr-x - root supergroup 0 2016-05-06 15:42 /tpch/lineitem
- drwxr-xr-x - root supergroup 0 2016-05-06 12:10 /tpch/nation
- drwxr-xr-x - root supergroup 0 2016-05-11 14:14 /tpch/orders
- drwxr-xr-x - root supergroup 0 2016-05-06 12:10 /tpch/part
- drwxr-xr-x - root supergroup 0 2016-05-06 12:11 /tpch/partsupp
- drwxr-xr-x - root supergroup 0 2016-05-06 12:11 /tpch/region
- drwxr-xr-x - root supergroup 0 2016-05-06 12:11 /tpch/supplier
- [root@node20 data]# hadoop fs -ls /tpch/orders
- Found 1 items
- -rw-r--r-- 3 root supergroup 870187306 2016-05-11 14:14 /tpch/orders/orders.tbl
- [root@node20 data]# hadoop fs -ls /tpch/lineitem
- Found 2 items
- drwxr-xr-x - root supergroup 0 2016-05-06 15:42 /tpch/lineitem/lineitem
- -rw-r--r-- 3 root supergroup 7775727688 2016-05-06 12:12 /tpch/lineitem/lineitem.tbl
- </span>

【注】如果你遇到如下异常情况,请确保impala所需的环境变量已经导入。执行source ${IMPALA_HOME}/bin/impala-config.sh
- <span style="font-size:12px;">[root@node20 data]# ./tpch_prepare_data.sh
- -mkdir: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -mkdir [-p] <path> ...
- -mkdir: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -mkdir [-p] <path> ...
- -mkdir: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -mkdir [-p] <path> ...
- -mkdir: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -mkdir [-p] <path> ...
- -mkdir: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -mkdir [-p] <path> ...
- -mkdir: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -mkdir [-p] <path> ...
- -mkdir: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -mkdir [-p] <path> ...
- -mkdir: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -mkdir [-p] <path> ...
- -mkdir: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -mkdir [-p] <path> ...
- ^C-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>
- -copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>
- -copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>
- -copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>
- -copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>
- -copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst>
- -copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
- Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] <localsrc> ... <dst></span>

各类表的参考描述如下:
The official TPC-H specification: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
直接执行如下语句,会一直卡住。
- <span style="font-size:12px;">[root@node20 tpc-h-impala]# ./tpch_benchmark.sh
-
- ***********************************************
- * TPC-H benchmark on Impala *
- ***********************************************
-
- See benchmark.log for more details of query errors.
-
- Executing Trial #1 of 1 trial(s)...
- Running query: q1_pricing_summary_report</span>
SO,直接查看该文件,可知自动导入表到impala是通过tpch_prepare/文件夹中的.hive文件来完成的。
分析可知,直接打开impala-shell,手动创建外部表。
- <span style="font-size:12px;">[node20:21000] > create external table orders (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE STRING, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders';
- Query: create external table orders (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE STRING, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders'
-
- Fetched 0 row(s) in 0.04s
- [node20:21000] > select count(*) from orders;
- Query: select count(*) from orders
- +----------+
- | count(*) |
- +----------+
- | 7500000 |
- +----------+
- Fetched 1 row(s) in 5.03s
- </span>
5.生成parquet格式的数据表。
- <span style="font-size:12px;">[node20:21000] > create database tpch_parquet;
- Query: create database tpch_parquet
-
- Fetched 0 row(s) in 0.10s
- [node20:21000] > show databases;
- Query: show databases
- +------------------+----------------------------------------------+
- | name | comment |
- +------------------+----------------------------------------------+
- | _impala_builtins | System database for Impala builtin functions |
- | default | Default Hive database |
- | tpch_parquet | |
- +------------------+----------------------------------------------+
- Fetched 3 row(s) in 0.01s
- [node20:21000] > create table tpch_parquet.orders_pq stored as parquet as select * from orders;
- Query: create table tpch_parquet.orders_pq stored as parquet as select * from orders
- +-------------------------+
- | summary |
- +-------------------------+
- | Inserted 7500000 row(s) |
- +-------------------------+
- Fetched 1 row(s) in 10.62s
- [node20:21000] > desc tpch_parquet.orders_pq;
- Query: describe tpch_parquet.orders_pq
- +-----------------+--------+---------+
- | name | type | comment |
- +-----------------+--------+---------+
- | o_orderkey | int | |
- | o_custkey | int | |
- | o_orderstatus | string | |
- | o_totalprice | double | |
- | o_orderdate | string | |
- | o_orderpriority | string | |
- | o_clerk | string | |
- | o_shippriority | int | |
- | o_comment | string | |
- +-----------------+--------+---------+
- Fetched 9 row(s) in 0.01s
- [node20:21000] > show table stats tpch_parquet.orders_pq;
- Query: show table stats tpch_parquet.orders_pq
- +-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
- | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
- +-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
- | -1 | 3 | 269.18MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://localhost:20500/test-warehouse/tpch_parquet.db/orders_pq |
- +-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
- Fetched 1 row(s) in 0.01s</span>

Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。