利用SQOOP將數(shù)據(jù)從數(shù)據(jù)庫導(dǎo)入到HDFS

作者：王森豐 2016-12-21 14:14:51

本文程序?qū)氲紿DFS中的數(shù)據(jù)是文本格式，所以在創(chuàng)建Hive外部表的時候，不需要指定文件的格式為RCFile,而使用默認的TextFile即可。數(shù)據(jù)間的分隔符為'\001'.如果多次導(dǎo)入同一個表中的數(shù)據(jù)，數(shù)據(jù)以append的形式插入到HDFS目錄中。

基本使用

如下面這個shell腳本：

#Oracle的連接字符串，其中包含了Oracle的地址，SID,和端口號 
CONNECTURL=jdbc:oracle:thin:@20.135.60.21:1521:DWRAC2 
#使用的用戶名 
ORACLENAME=kkaa 
#使用的密碼 
ORACLEPASSWORD=kkaa123 
#需要從Oracle中導(dǎo)入的表名 
oralceTableName=tt 
#需要從Oracle中導(dǎo)入的表中的字段名 
columns=AREA_ID,TEAM_NAME 
#將Oracle中的數(shù)據(jù)導(dǎo)入到HDFS后的存放路徑 
hdfsPath=apps/as/hive/$oralceTableName 
#執(zhí)行導(dǎo)入邏輯。將Oracle中的數(shù)據(jù)導(dǎo)入到HDFS中 
sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath  --num-mappers 1 --table $oralceTableName --columns $columns --fields-terminated-by '\001'

執(zhí)行這個腳本之后，導(dǎo)入程序就完成了。

接下來，用戶可以自己創(chuàng)建外部表，將外部表的路徑和HDFS中存放Oracle數(shù)據(jù)的路徑對應(yīng)上即可。

注意：這個程序?qū)氲紿DFS中的數(shù)據(jù)是文本格式，所以在創(chuàng)建Hive外部表的時候，不需要指定文件的格式為RCFile,而使用默認的TextFile即可。數(shù)據(jù)間的分隔符為'\001'.如果多次導(dǎo)入同一個表中的數(shù)據(jù)，數(shù)據(jù)以append的形式插入到HDFS目錄中。

并行導(dǎo)入

假設(shè)有這樣這個sqoop命令，需要將Oracle中的數(shù)據(jù)導(dǎo)入到HDFS中：

sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath  --m 1 --table $oralceTableName --columns $columns --fields-terminated-by '\001'  --where "data_desc='2011-02-26'"

請注意，在這個命令中，有一個參數(shù)"-m",代表的含義是使用多少個并行，這個參數(shù)的值是1,說明沒有開啟并行功能。

現(xiàn)在，我們可以將"-m"參數(shù)的值調(diào)大，使用并行導(dǎo)入的功能，如下面這個命令：

sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath  --m 4 --table $oralceTableName --columns $columns --fields-terminated-by '\001'  --where "data_desc='2011-02-26'"

一般來說，Sqoop就會開啟4個進程，同時進行數(shù)據(jù)的導(dǎo)入操作。

但是，如果從Oracle中導(dǎo)入的表沒有主鍵，那么會出現(xiàn)如下的錯誤提示：

ERROR tool.ImportTool: Error during import: No primary key could be found for table creater_user.popt_cas_redirect_his. Please specify one with --split-by or perform a sequential import with '-m 1'.

在這種情況下，為了更好的使用Sqoop的并行導(dǎo)入功能，我們就需要從原理上理解Sqoop并行導(dǎo)入的實現(xiàn)機制。

如果需要并行導(dǎo)入的Oracle表的主鍵是id,并行的數(shù)量是4,那么Sqoop首先會執(zhí)行如下一個查詢：

select max（id） as max, select min（id） as min from table [where 如果指定了where子句];

通過這個查詢，獲取到需要拆分字段(id)的***值和最小值，假設(shè)分別是1和1000.

然后，Sqoop會根據(jù)需要并行導(dǎo)入的數(shù)量，進行拆分查詢，比如上面的這個例子，并行導(dǎo)入將拆分為如下4條SQL同時執(zhí)行：

select * from table where 0 <= id < 250; 
select * from table where 250 <= id < 500; 
select * from table where 500 <= id < 750; 
select * from table where 750 <= id < 1000;

注意，這個拆分的字段需要是整數(shù)。

從上面的例子可以看出，如果需要導(dǎo)入的表沒有主鍵，我們應(yīng)該如何手動選取一個合適的拆分字段，以及選擇合適的并行數(shù)。

再舉一個實際的例子來說明：

我們要從Oracle中導(dǎo)入creater_user.popt_cas_redirect_his.

這個表沒有主鍵，所以我們需要手動選取一個合適的拆分字段。

首先看看這個表都有哪些字段：

然后，我假設(shè)ds_name字段是一個可以選取的拆分字段，然后執(zhí)行下面的sql去驗證我的想法：

select min（ds_name）， max（ds_name） from creater_user.popt_cas_redirect_his where data_desc='2011-02-26'

發(fā)現(xiàn)結(jié)果不理想，min和max的值都是相等的。所以這個字段不合適作為拆分字段。

再測試一下另一個字段：CLIENTIP

select min（CLIENTIP）， max（CLIENTIP） from creater_user.popt_cas_redirect_his where data_desc='2011-02-26'

這個結(jié)果還是不錯的。所以我們使用CLIENTIP字段作為拆分字段。

所以，我們使用如下命令并行導(dǎo)入：

sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath  --m 12 --split-by CLIENTIP --table $oralceTableName --columns $columns --fields-terminated-by '\001'  --where "data_desc='2011-02-26'"

這次執(zhí)行這個命令，可以看到，消耗的時間為：20mins, 35sec,導(dǎo)入了33,222,896條數(shù)據(jù)。

另外，如果覺得這種拆分不能很好滿足我們的需求，可以同時執(zhí)行多個Sqoop命令，然后在where的參數(shù)后面指定拆分的規(guī)則。如：

sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath  --m 1 --table $oralceTableName --columns $columns --fields-terminated-by '\001'  --where "data_desc='2011-02-26' logtime<10:00:00" 
sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath  --m 1 --table $oralceTableName --columns $columns --fields-terminated-by '\001'  --where "data_desc='2011-02-26' logtime>=10:00:00"

從而達到并行導(dǎo)入的目的。

【本文為51CTO專欄作者“王森豐”的原創(chuàng)稿件，轉(zhuǎn)載請注明出處】

責任編輯：龐桂玉來源：神算子