[toc]
# Hive incremental analysis
Background: logs uploaded to the server every day go through a daily incremental analysis that produces that day's results; these are merged into the overall result set, and only the updated rows are exported to the MongoDB result database. The analysis is launched with:

```bash
sh portal_use file_month day 2015-09-07
```
1. Initialization: create a result-set table, res_portal_use, partitioned by user ID and dimension (SCOP). LAST_UPDATE records when each row was last updated.
```sql
CREATE TABLE IF NOT EXISTS RES_PORTAL_USE(
    FDID        STRING,
    COUNT       BIGINT,
    LAST_UPDATE DATE
)
PARTITIONED BY (ID STRING, SCOP STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```
2. The result table for the current day: tmp_portal_use_20151104, whose LAST_UPDATE is set to today's date. Today's logs, uploaded to HDFS by Flume, are analyzed to produce this day's result set (a hedged sketch of this step follows below).
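The original notes do not show how the daily table is built, so the following is only a minimal sketch: the column names and the aggregation over request_log are assumptions, and ID/SCOP are kept as plain columns here for simplicity.

```bash
# Hypothetical sketch of the daily analysis: column names (fdid, id, scop) and the
# aggregation over request_log are assumptions, not taken from the original notes.
hive -e "
CREATE TABLE IF NOT EXISTS tmp_portal_use_20151104(
    FDID  STRING,
    COUNT BIGINT,
    ID    STRING,
    SCOP  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

INSERT OVERWRITE TABLE tmp_portal_use_20151104
SELECT fdid, COUNT(1) AS count, id, scop
FROM request_log
WHERE day = '2015-11-04'
GROUP BY id, scop, fdid;
"
```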
3. The two tables have the same columns, so after a UNION ALL, SUM(count) and MAX(LAST_UPDATE) per group give the merged totals and the latest update time; the rows whose LAST_UPDATE is today are exactly the ones that must be exported to MongoDB. The merged result is written into the tmp_res_portal_use table:
```sql
INSERT OVERWRITE TABLE TMP_RES_portal_use PARTITION(ID, SCOP)
SELECT sub.FDID             AS FDID,
       SUM(sub.COUNT)       AS COUNT,
       MAX(sub.LAST_UPDATE) AS LAST_UPDATE,
       sub.ID               AS ID,
       sub.SCOP             AS SCOP
FROM (
    SELECT fdid, count, to_date('2015-11-05') AS LAST_UPDATE, id, scop
    FROM tmp_portal_use_20151105
    UNION ALL
    SELECT * FROM RES_portal_use
) sub
GROUP BY sub.ID, sub.SCOP, sub.FDID;
```
Note: dynamic partitioning is required here; the session settings this needs are sketched below.
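Hive rejects fully dynamic-partition inserts in its default strict mode, so, as a small sketch, the session typically needs these settings before the INSERT above runs:

```bash
hive -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- ... followed by the INSERT OVERWRITE ... PARTITION(ID,SCOP) statement above
"
```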
4. After that, drop the old res_portal_use table and rename tmp_res_portal_use to res_portal_use, so that res_portal_use always holds the latest result set.
```bash
# Remove the old result set
hive -e "DROP TABLE tmp$0${date}"
hive -e "DROP TABLE RES$0"
# Rename the new result set
hive -e "ALTER TABLE TMP_RES$0 RENAME TO RES$0"
# Export the updated results
```
5. Export the rows of res_portal_use whose LAST_UPDATE is today into MongoDB (one possible way of doing this is sketched below).
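The notes do not show the export step itself. A hedged sketch, assuming a TSV dump plus mongoimport and an assumed target database/collection (behavior.portal_use):

```bash
#!/bin/bash
# Hedged sketch of the export: the target db/collection and the temporary file path
# are assumptions; only rows updated today are selected from res_portal_use.
TODAY=$(date +%Y-%m-%d)

hive -e "SELECT fdid, count, last_update, id, scop
         FROM res_portal_use
         WHERE last_update = to_date('${TODAY}');" > /tmp/portal_use_${TODAY}.tsv

mongoimport --host master --db behavior --collection portal_use \
            --type tsv --fields "fdid,count,last_update,id,scop" \
            --file /tmp/portal_use_${TODAY}.tsv
```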
# Hive performance analysis
## Data generation script
```bash
#!/bin/bash
if [ $# -ne 1 ]
then
    echo "Please supply exactly one argument"
    exit 1
fi
FILE_NAME="$1.txt"
if [ ! -f $FILE_NAME ]
then
    echo "Generating $FILE_NAME ..."
    sh build.sh $1 $FILE_NAME
    echo "Generation finished"
fi
echo "$FILE_NAME exists; running the analysis directly ..."
echo "Step 1: clean the old raw data ..."
hadoop fs -rmr /output/logs/request_transform/2015-11/2015-11-10/*
echo "Step 1 done"
echo "Step 2: drop last run's result table, recreate the table and partition ..."
hive -e 'drop table res_portal_use'
hive -f ../create_table.sql
hive -e "ALTER TABLE request_log ADD PARTITION(month='2015-11',day='2015-11-10') LOCATION '/output/logs/request_transform/2015-11/2015-11-10'"
echo "Step 2: table rebuilt"
echo "Step 3: upload the file to Hive ..."
hadoop fs -copyFromLocal ./$FILE_NAME /output/logs/request_transform/2015-11/2015-11-10
echo "Step 3: upload finished"
echo "Step 4: run the analysis ..."
sh ../request/portal_use file_month day 2015-11-10
```
## Issues encountered
### Issue 1: with a large test dataset, the job stayed stuck in a pending (ACCEPTED) state
Checking with jps showed that the NodeManager processes had all died, so YARN was restarted: stop-yarn.sh followed by start-yarn.sh.
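For reference, the check and restart boil down to:

```bash
jps              # the NodeManager / ResourceManager daemons should show up here
stop-yarn.sh     # both scripts live in $HADOOP_HOME/sbin
start-yarn.sh
```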
### Issue 2: not enough memory
Hadoop allocates 1024 MB to each map task and each reduce task by default, so problems appear when the total memory requested exceeds the machine's physical memory.
Fix: adjust the relevant settings in mapred-site.xml and yarn-site.xml (a sketch follows after the list below):
```
mapreduce.map.memory.mb       # memory per map task (mapred-site.xml)
mapreduce.reduce.memory.mb    # memory per reduce task (mapred-site.xml)
set mapreduce.input.fileinputformat.split.maxsize=1024000;
set yarn.timeline-service.handler-thread-count=2;
```
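As a sketch (the values are illustrative, not taken from the original test), the per-task memory can also be lowered for a single Hive session:

```bash
# Illustrative values only: cap each map/reduce container at 512 MB for this session
hive -e "
SET mapreduce.map.memory.mb=512;
SET mapreduce.reduce.memory.mb=512;
SET mapreduce.map.java.opts=-Xmx410m;
SET mapreduce.reduce.java.opts=-Xmx410m;
SELECT COUNT(1) FROM request_log;
"
```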
### Issue 3: how to control the number of map tasks
==Set the split size==
```
mapreduce.input.fileinputformat.split.minsize   # default: 1
mapreduce.input.fileinputformat.split.maxsize   # default: 256 MB
dfs.block.size                                  # HDFS block size
```
```java
// Formula used to compute the split size
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
```
In Hadoop the block size is 128 MB, so why does Hive cut the input into 256 MB splits when deciding the number of maps? Because of the parameter `set mapred.max.split.size = 256000000;` (the maximum split size), which is the parameter Hive uses when splitting the input on the map side.
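A hedged example of steering the map count by shrinking the maximum split size (the 64 MB value is illustrative):

```bash
# Smaller max split size -> more splits -> more map tasks (values are illustrative)
hive -e "
SET mapred.max.split.size=67108864;                           -- 64 MB
SET mapreduce.input.fileinputformat.split.maxsize=67108864;
SELECT COUNT(1) FROM request_log WHERE month='2015-11';
"
```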
# MongoDB performance analysis
## Performance test script
```bash
#!/bin/bash
if [ $# -ne 1 ]
then
    echo "Please pass the amount of test data to generate, in units of 10,000 rows"
    exit 1
fi
FILE_NAME="$1.txt"
if [ ! -f $FILE_NAME ]
then
    echo "Generating $FILE_NAME ..."
    count=$1
    while [ $count -gt 0 ]
    do
        count=$(($count-1))
        cat 1W.log >> $FILE_NAME
    done
    echo "Generation finished"
else
    echo "The file already exists, no need to generate it"
fi
echo "Step 1: clean old result data ..."
#rm -fr /home/nemo/mongodb/data/request*123456
~/mongodb/bin/mongo master/behavior --eval "db.file_process.drop();"
~/mongodb/bin/mongo master/request_log_123456 --eval "db.dropDatabase();"
~/mongodb/bin/mongo master/request_res_123456 --eval "db.dropDatabase();"
echo "Step 1 done"
echo "Step 2: clean last run's processed logs ..."
rm ./logs/123456/*
mv $FILE_NAME ./logs/123456/request.log.2015-05-21
rm -fr ./logs_bal/*
echo "Step 2 done"
```
## MongoDB scripts
This relies on the mongo shell's ability to run JavaScript; see `mongo --help` for details.

```bash
./mongo master/behavior --eval "jscode"   # run a snippet of JS code
./mongo master/behavior ./*.js            # run a JS file
```
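For example, the cleanup from the script above could also be driven from a JS file (database names taken from that script):

```bash
# Write a small cleanup script and run it with the mongo shell
cat > cleanup.js <<'EOF'
db.getSiblingDB("request_log_123456").dropDatabase();
db.getSiblingDB("request_res_123456").dropDatabase();
EOF
~/mongodb/bin/mongo master/behavior ./cleanup.js
```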
## Results
Rows (×10,000) | Size (MB) | Maps-Reduces : time (s) | Total time (s) | Import time : Mongo MR time
---|---|---|---|---
50 | 242 | 2-1 : 16 | 120 | 459 s : 83 ms
100 | 484 | 2-1 : 38 | 168 | 973 s : 478 ms
100 | 484 | 4-1 : 34 | 156 |
500 | 2418 | 4-1 : 65 | 176 |
1000 | 4836 | 4-1 : 86 | 206 |
Conclusion: Hive handles large-file analysis well, while importing the data into MongoDB is very time-consuming. For highly concurrent analysis of many small files, MongoDB performs better; but if a single customer produces a multi-GB file every day, MongoDB becomes very slow.
# Appendix
## Linux commands used
```bash
# head -n 10000 log.txt >> 1W.txt          # take the first 10,000 records
# more 1W.txt | wc -l                      # count the lines
# du -h                                    # show file sizes
# du -m --max-depth=1 /etc | sort -nr      # show directory sizes
# du --max-depth=1 -h                      # current directory
# ls -l --block-size=M                     # show file sizes in MB
```
## Hadoop commands used
```bash
$ hadoop fs -ls /output/logs/request_transform                                            # list
$ hadoop fs -mkdir /output/logs/request_transform/2015-09/2015-09-08                      # create a directory
$ hadoop fs -copyFromLocal ./100W.txt /output/logs/request_transform/2015-09/2015-09-08   # upload to HDFS
$ hadoop fs -copyToLocal                                                                   # download to local
$ hadoop fs -rmr /output/logs/request_transform/2015-09/2015-09-08                         # delete a directory
$ mapred job -list                                                                         # list current jobs
$ mapred job -kill jobid                                                                   # kill a job
```