Getting straight to the point: this article comes from LawsonAbs!
MapReduce was designed to handle large volumes of data more effectively. Before using MapReduce to process a large dataset, let us first operate on some data in Linux with common text-processing tools. The code is as follows:
[root@server4 hadoop]# cat findMax.sh
# For each year, print the year and the highest valid temperature in its records.
# (Given a bare name like 1901, gunzip locates the file 1901.gz.)
for year in {1901,1902}
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
  awk '{ temp = substr($0,88,5) + 0;   # temperature field, coerced to a number
         q = substr($0,93,1);          # quality code
         if ( temp != 9999 && q ~ /[01459]/ && temp > max ) max = temp }
       END { print max }'
done
Now that you have seen the full code, let me explain what this script means, piece by piece:
- for year in {1901,1902}: a for loop that iterates over the values inside the braces.
- do and done: delimit the loop body.
- echo -ne: -n suppresses the trailing newline, and -e enables escape sequences (such as the tab \t).
- basename $year .gz: formats the string $year by stripping the .gz suffix.
- gunzip -c $year: writes the contents of $year to standard output; even though it is a compressed file, the file itself is not affected in any way.
- In awk, temp = substr($0,88,5) extracts the 5 characters starting at position 88 of each record; in the sample record below this finds -0078. Adding 0 coerces the string to a number, which also strips the leading zeros.
- q = substr($0,93,1) extracts the single character starting at position 93 of each line; for the sample record below its value is 1.
0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
- awk then tests temp and q: if temp != 9999, q is one of 0, 1, 4, 5, 9 (quality codes marking a trustworthy reading), and temp > max, it replaces max with temp. The END block prints the maximum.
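To see the +0 coercion in isolation, you can try the following (a minimal sketch; -0078 is the temperature field from the sample record above, +0012 is just a made-up value):

echo "-0078" | awk '{ print $0 + 0 }'    # prints -78: the string is coerced to a number
echo "+0012" | awk '{ print $0 + 0 }'    # prints 12: sign and leading zeros are handled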
Usage of the gunzip command:
[root@server4 hadoop]# gunzip -c 1901.gz | head -1
0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
Usage of the awk command (for details, see my blog post Linux命令详解之awk):
[root@server4 hadoop]# echo 1 | awk '{q=$0; if(q !~ /[03459]/) print "a"}'
a
Usage of the gzip command:
gzip -c target > destination
[root@server4 thumbs]# ll
total 12
-rw-r--r--. 1 root root 1394 Dec 24 13:38 20181224.tar.gz
-rw-r--r--. 1 root root 36 Dec 24 10:28 A.txt
-rw-r--r--. 1 root root 2531 Dec 21 21:02 baidu.txt
[root@server4 thumbs]# gzip -c A.txt baidu.txt > 20190103.gz
[root@server4 thumbs]# ll
total 16
-rw-r--r--. 1 root root 1394 Dec 24 13:38 20181224.tar.gz
-rw-r--r--. 1 root root 1277 Jan 3 17:24 20190103.gz
-rw-r--r--. 1 root root 36 Dec 24 10:28 A.txt
-rw-r--r--. 1 root root 2531 Dec 21 21:02 baidu.txt
Note, however, that this command merges the compressed files into a single output rather than compressing each file separately! So the usual practice is to bundle the files with tar first, and then compress the archive with gzip.
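For example (a minimal sketch, reusing the two files from the listing above):

tar -cf 20190103.tar A.txt baidu.txt    # bundle the files into one archive
gzip 20190103.tar                       # compress it, producing 20190103.tar.gz
tar -xzf 20190103.tar.gz                # later, restore both files individually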
What problems does Hadoop solve?
As data grows explosively, ever higher demands are placed on data storage and analysis, and more and more problems appear. The main ones are:
Speed of data access
Although the storage capacities of hard drives have increased massively over the years, access speeds — the rate at which data can be read from drives — have not kept up.
Solution: reduce read time by reading from multiple disks at once.
It is precisely because a single disk cannot access data fast enough that parallel data access was proposed.
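As a toy illustration of the idea (a sketch only: it runs the per-year pipelines from findMax.sh concurrently as separate processes; on a real cluster the parallelism is across disks and machines, and the interleaving of the two jobs' output would have to be handled):

for year in {1901,1902}
do
  ( echo -ne `basename $year .gz`"\t"
    gunzip -c $year | awk '{ temp = substr($0,88,5) + 0; q = substr($0,93,1);
        if ( temp != 9999 && q ~ /[01459]/ && temp > max ) max = temp }
      END { print max }' ) &    # run each year in a background subshell
done
wait    # block until all background jobs finish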
Other problems
Besides the slow data access discussed above, there are further problems such as hardware failure.
There’s more to being able to read and write data in parallel to or from multiple disks, though.
Problem list:
- hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. => redundant copies of the data
- most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with data from any of the other 99 disks.
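The second problem, combining partial results, is essentially what MapReduce's reduce step does. A crude shell analogy (the part-* files are hypothetical; imagine each holds "year<TAB>max" lines produced from a different disk):

cat part-* | awk 'BEGIN { FS = "\t" }
  { if (!($1 in max) || $2 + 0 > max[$1] + 0) max[$1] = $2 }
  END { for (y in max) print y "\t" max[y] }'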
What is Hadoop for?
Naturally, we all want to ask: everyone talks about big data, but what exactly is this Hadoop used for?
In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis.
Querying All Your Data
With data this large, how do you query for the data you need?
The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset — or at least a good portion of it — can be processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.
ad hoc query: a one-off, improvised query
transformative: bringing about fundamental change
Beyond Batch
For all its strengths, MapReduce is fundamentally a batch processing system, and is not suitable for interactive analysis.
The reason Hadoop goes beyond batch processing is that Hadoop is an ecosystem.
Indeed, the term “Hadoop” is sometimes used to refer to a larger ecosystem of projects, not just HDFS and MapReduce, that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.
What is the Hadoop ecosystem?
Despite the emergence of different processing frameworks on Hadoop, MapReduce still has a place for batch processing, and it is useful to understand how it works since it introduces several concepts that apply more generally (like the idea of input formats, or how a dataset is split into pieces).
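The idea of splitting a dataset into pieces can be imitated in the shell (a rough sketch; split and the part- name prefix are ordinary coreutils conventions, not Hadoop):

gunzip -c 1901.gz > 1901.txt      # decompress the year's records
split -l 100000 1901.txt part-    # cut them into 100,000-line pieces: part-aa, part-ab, ...
# each piece could now be processed independently, like a MapReduce input split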
Comparison with Other Systems
Relational Database Management Systems
Why can’t we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed?
MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
Why does Hadoop suit write-once, read-many applications, and why do relational databases suit continually updated datasets? In short: updating a small fraction of records favors a seek-based structure such as a database's B-tree, while reading or rewriting most of a dataset favors streaming through it at the disk's transfer rate, which is exactly what MapReduce does.
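A back-of-the-envelope calculation makes this concrete (illustrative figures, assuming a 10 ms seek time and a 100 MB/s transfer rate, not measurements of any particular drive):

Streaming: reading a whole 1 TB dataset sequentially takes 10^6 MB / 100 MB/s = 10^4 s, roughly 2.8 hours.
Seeking: updating 1% of 10^9 records means 10^7 random accesses; at 10 ms per seek that is 10^5 s, roughly 28 hours.

So once more than a small fraction of the data is touched, streaming the whole dataset beats random access.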
MapReduce — and the other processing models in Hadoop — scales linearly with the size of the data.
Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance.
The difference between MPI and the Hadoop API, in brief:
MPI: gives the programmer great control, but requires explicitly handling the mechanics of data flow and fault tolerance.
Hadoop: operates at a higher level; data locality and failure handling are taken care of by the framework, as discussed below.
Volunteer Computing
Volunteer computing also works by breaking a problem into independent pieces to be worked on in parallel.
Differences:
Unlike volunteer-computing tasks, MapReduce tasks can take advantage of data locality.
Can MapReduce tasks fail?
Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure — when you don’t know whether or not a remote process has failed — and still making progress with the overall computation.
Distributed processing frameworks like MapReduce spare the programmer from having to think about failure, since the implementation detects failed tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one another.
In other words, a distributed framework like MapReduce spares the programmer (spare somebody from doing something) from having to consider task failure, because Hadoop detects failed tasks and reschedules them on healthy machines. MapReduce can do this because it has a shared-nothing architecture, which means tasks have no dependence on one another.
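As a toy, single-machine analogy of this rescheduling (a sketch only; real Hadoop retries tasks on other nodes, and the run_task function here is a hypothetical stand-in for a task attempt):

run_task() {
  gunzip -c $1 | awk 'END { print NR }'   # stand-in workload: count records
}
for attempt in 1 2 3
do
  run_task 1901 && break                  # stop after the first successful attempt
done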