Getting straight to the point: this article comes from LawsonAbs!
MapReduce was designed to handle large volumes of data more effectively. Before using MapReduce to process a large dataset, let us first operate on some data in Linux with common text-processing tools. The code is as follows:
[root@server4 hadoop]# cat findMax.sh
# For each year, print the year and the highest valid temperature in its records.
# (Given a bare name like 1901, gunzip locates the file 1901.gz.)
for year in {1901,1902}
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
  awk '{ temp = substr($0,88,5) + 0;   # temperature field, coerced to a number
         q = substr($0,93,1);          # quality code
         if ( temp != 9999 && q ~ /[01459]/ && temp > max ) max = temp }
       END { print max }'
done
Now that you have seen the full code, let me explain what this script means, piece by piece:
- for year in {1901,1902}: a for loop that iterates over the values inside the braces.
- do and done: delimit the loop body.
- echo -ne: -n suppresses the trailing newline, and -e enables escape sequences (such as the tab \t).
- basename $year .gz: formats the string $year by stripping the .gz suffix.
- gunzip -c $year: writes the contents of $year to standard output; even though it is a compressed file, the file itself is not affected in any way.
- In awk, temp = substr($0,88,5) extracts the 5 characters starting at position 88 of each record; in the sample record below this finds -0078. Adding 0 coerces the string to a number, which also strips the leading zeros.
- q = substr($0,93,1) extracts the single character starting at position 93 of each line; for the sample record below its value is 1.
0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
- awk then tests temp and q: if temp != 9999, q is one of 0, 1, 4, 5, 9 (quality codes marking a trustworthy reading), and temp > max, it replaces max with temp. The END block prints the maximum.
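To see the +0 coercion in isolation, you can try the following (a minimal sketch; -0078 is the temperature field from the sample record above, +0012 is just a made-up value):

echo "-0078" | awk '{ print $0 + 0 }'    # prints -78: the string is coerced to a number
echo "+0012" | awk '{ print $0 + 0 }'    # prints 12: sign and leading zeros are handled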
Usage of the gunzip command:
[root@server4 hadoop]# gunzip -c 1901.gz | head -1
0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
Usage of the awk command (for details, see my blog post Linux命令详解之awk):
[root@server4 hadoop]# echo 1 | awk '{q=$0; if(q !~ /[03459]/) print "a"}'
a
Usage of the gzip command:
gzip -c target > destination
[root@server4 thumbs]# ll
total 12
-rw-r--r--. 1 root root 1394 Dec 24 13:38 20181224.tar.gz
-rw-r--r--. 1 root root 36 Dec 24 10:28 A.txt
-rw-r--r--. 1 root root 2531 Dec 21 21:02 baidu.txt
[root@server4 thumbs]# gzip -c A.txt baidu.txt > 20190103.gz
[root@server4 thumbs]# ll
total 16
-rw-r--r--. 1 root root 1394 Dec 24 13:38 20181224.tar.gz
-rw-r--r--. 1 root root 1277 Jan 3 17:24 20190103.gz
-rw-r--r--. 1 root root 36 Dec 24 10:28 A.txt
-rw-r--r--. 1 root root 2531 Dec 21 21:02 baidu.txt
Note, however, that this command merges the compressed files into a single output rather than compressing each file separately! So the usual practice is to bundle the files with tar first, and then compress the archive with gzip.
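For example (a minimal sketch, reusing the two files from the listing above):

tar -cf 20190103.tar A.txt baidu.txt    # bundle the files into one archive
gzip 20190103.tar                       # compress it, producing 20190103.tar.gz
tar -xzf 20190103.tar.gz                # later, restore both files individually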
What problems does Hadoop solve?
As data grows explosively, ever higher demands are placed on data storage and analysis, and more and more problems appear. The main ones are:
Speed of data access
Although the storage capacities of hard drives have increased massively over the years, access speeds — the rate at which data can be read from drives — have not kept up.
Solution: reduce read time by reading from multiple disks at once.
It is precisely because a single disk cannot access data fast enough that parallel data access was proposed.
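As a toy illustration of the idea (a sketch only: it runs the per-year pipelines from findMax.sh concurrently as separate processes; on a real cluster the parallelism is across disks and machines, and the interleaving of the two jobs' output would have to be handled):

for year in {1901,1902}
do
  ( echo -ne `basename $year .gz`"\t"
    gunzip -c $year | awk '{ temp = substr($0,88,5) + 0; q = substr($0,93,1);
        if ( temp != 9999 && q ~ /[01459]/ && temp > max ) max = temp }
      END { print max }' ) &    # run each year in a background subshell
done
wait    # block until all background jobs finish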
Other problems
Besides the slow data access discussed above, there are further problems such as hardware failure.
There’s more to being able to read and write data in parallel to or from multiple disks, though.
Problem list:
- hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. => redundant copies of the data
- most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with data from any of the other 99 disks.
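The second problem, combining partial results, is essentially what MapReduce's reduce step does. A crude shell analogy (the part-* files are hypothetical; imagine each holds "year<TAB>max" lines produced from a different disk):

cat part-* | awk 'BEGIN { FS = "\t" }
  { if (!($1 in max) || $2 + 0 > max[$1] + 0) max[$1] = $2 }
  END { for (y in max) print y "\t" max[y] }'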
What is Hadoop for?
Naturally, we all want to ask: everyone talks about big data, but what exactly is this Hadoop used for?
In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis.
Querying All Your Data
With data this large, how do you query for the data you need?
The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset — or at least a good portion of it — can be processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.
ad hoc query: a one-off, improvised query
transformative: bringing about fundamental change
Beyond Batch
For all its strengths, MapReduce is fundamentally a batch processing system, and is not suitable for interactive analysis.
The reason Hadoop goes beyond batch processing is that Hadoop is an ecosystem.
Indeed, the term “Hadoop” is sometimes used to refer to a larger ecosystem of projects, not just HDFS and MapReduce, that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.
What is the Hadoop ecosystem?
Despite the emergence of different processing frameworks on Hadoop, MapReduce still has a place for batch processing, and it is useful to understand how it works since it introduces several concepts that apply more generally (like the idea of input formats, or how a dataset is split into pieces).
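The idea of splitting a dataset into pieces can be imitated in the shell (a rough sketch; split and the part- name prefix are ordinary coreutils conventions, not Hadoop):

gunzip -c 1901.gz > 1901.txt      # decompress the year's records
split -l 100000 1901.txt part-    # cut them into 100,000-line pieces: part-aa, part-ab, ...
# each piece could now be processed independently, like a MapReduce input split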
Comparison with Other Systems
Relational Database Management Systems
Why can’t we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed?
MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
Why does Hadoop suit write-once, read-many applications, and why do relational databases suit continually updated datasets? In short: updating a small fraction of records favors a seek-based structure such as a database's B-tree, while reading or rewriting most of a dataset favors streaming through it at the disk's transfer rate, which is exactly what MapReduce does.
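A back-of-the-envelope calculation makes this concrete (illustrative figures, assuming a 10 ms seek time and a 100 MB/s transfer rate, not measurements of any particular drive):

Streaming: reading a whole 1 TB dataset sequentially takes 10^6 MB / 100 MB/s = 10^4 s, roughly 2.8 hours.
Seeking: updating 1% of 10^9 records means 10^7 random accesses; at 10 ms per seek that is 10^5 s, roughly 28 hours.

So once more than a small fraction of the data is touched, streaming the whole dataset beats random access.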
MapReduce — and the other processing models in Hadoop — scales linearly with the size of the data.
Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance.
The difference between MPI and the Hadoop API, in brief:
MPI: gives the programmer great control, but requires explicitly handling the mechanics of data flow and fault tolerance.
Hadoop: operates at a higher level; data locality and failure handling are taken care of by the framework, as discussed below.
Volunteer Computing
Volunteer computing also works by breaking a problem into independent pieces to be worked on in parallel.
Differences:
Unlike volunteer-computing tasks, MapReduce tasks can take advantage of data locality.
Can MapReduce tasks fail?
Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure — when you don’t know whether or not a remote process has failed — and still making progress with the overall computation.
Distributed processing frameworks like MapReduce spare the programmer from having to think about failure, since the implementation detects failed tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one another.
In other words, a distributed framework like MapReduce spares the programmer (spare somebody from doing something) from having to consider task failure, because Hadoop detects failed tasks and reschedules them on healthy machines. MapReduce can do this because it has a shared-nothing architecture, which means tasks have no dependence on one another.
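As a toy, single-machine analogy of this rescheduling (a sketch only; real Hadoop retries tasks on other nodes, and the run_task function here is a hypothetical stand-in for a task attempt):

run_task() {
  gunzip -c $1 | awk 'END { print NR }'   # stand-in workload: count records
}
for attempt in 1 2 3
do
  run_task 1901 && break                  # stop after the first successful attempt
done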