Previous posts in this series are here:
You will become an engineer in 100 days - Day 76 - Programming - About machine learning
You will become an engineer in 100 days - Day 70 - Programming - About scraping
You will become an engineer in 100 days - Day 66 - Programming - About natural language processing
You will become an engineer in 100 days - Day 63 - Programming - Probability 1
You will become an engineer in 100 days - Day 59 - Programming - Algorithms
You will become an engineer in 100 days - Day 53 - Git - About Git
You will become an engineer in 100 days - Day 42 - Cloud - About cloud services
You will become an engineer in 100 days - Day 36 - Database - About the database
You will become an engineer in 100 days - Day 24 - Python - Basics of Python language 1
You will become an engineer in 100 days - Day 18 - Javascript - JavaScript basics 1
You will become an engineer in 100 days - Day 14 - CSS - CSS Basics 1
You will become an engineer in 100 days - Day 6 - HTML - HTML basics 1
This time, continuing from the database topic, I will talk about Hadoop.
I already covered databases in
You will become an engineer in 100 days - Day 36 - Database - About the database
but this post is more of a story about data itself.
First, there is the term "big data".
The definition of big data is rather vague; if data is somewhat large, some media seem to call it big data.
According to a 2012 definition, data in the range of tens of terabytes to several petabytes counts as big data.
Data of around 1 TB is not called big data; a single machine is enough to handle that.
In recent years, data itself has grown so large that it exceeds what a single machine can handle; this is where big data comes in.
In that case, a single database installed on one machine is not enough, and you need a number of machines.
What emerged to solve this problem is a platform for processing large-scale big data.
Hadoop is an open-source distributed processing platform that stores data such as text, images, and logs and processes it at high speed.
Hadoop is characterized by distributed processing across multiple machines.
By distributing the data to multiple machines, you can easily scale out,
and by processing large-scale data on many machines at once, it also achieves high speed.
Hadoop is an open-source implementation of technology that Google uses internally and originally published as papers.
It is based on papers describing the technical systems Google uses to handle its huge amounts of data:
Google File System (GFS): Google's distributed file system
Google MapReduce: Google's distributed processing technology
Hadoop itself is developed in Java, is still actively developed today, and its logo is an elephant.
Hadoop is mainly composed of the following two elements.
HDFS: Hadoop Distributed File System
Hadoop MapReduce: a distributed processing framework
** HDFS (Hadoop Distributed File System) **
HDFS consists of two types of nodes, a master and slaves.
Master node: NameNode
Slave node: DataNode
The NameNode manages the file system's metadata, and the DataNodes store the actual data.
When a file is stored, it is split into fixed-size blocks that are saved on the DataNodes.
To avoid losing data when some DataNode fails, a replica of each block is saved on multiple DataNodes.
This is the mechanism that prevents data loss.
By default the number of replicas is 3, so the data is not lost even if two servers fail.
Reference: What is NTT DATA / distributed processing technology "Hadoop"? https://oss.nttdata.com/hadoop/hadoop.html
As for the NameNode, if it is a single node, the cluster becomes unusable when it fails;
redundancy can be achieved with an Active-Standby configuration.
By adding servers, you can also increase the amount of data that can be stored.
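As a rough illustration of the block splitting and replication described above, here is a minimal Python sketch. The block size, node names, and round-robin placement are simplified assumptions for illustration, not Hadoop's actual implementation (real HDFS placement also considers racks and node load):

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # assume 128 MB blocks here
REPLICATION = 3                 # default number of replicas

DATANODES = ["datanode1", "datanode2", "datanode3", "datanode4", "datanode5"]

def split_into_blocks(file_size_bytes):
    """Return the number of fixed-size blocks a file would be split into."""
    return -(-file_size_bytes // BLOCK_SIZE)  # ceiling division

def place_replicas(num_blocks):
    """Assign each block to REPLICATION distinct DataNodes (simple round-robin)."""
    placement = {}
    nodes = itertools.cycle(DATANODES)
    for block_id in range(num_blocks):
        placement[block_id] = [next(nodes) for _ in range(REPLICATION)]
    return placement

if __name__ == "__main__":
    file_size = 1 * 1024 ** 4              # a 1 TB file
    blocks = split_into_blocks(file_size)
    print(f"{blocks} blocks")              # 8192 blocks
    print(place_replicas(blocks)[0])       # e.g. ['datanode1', 'datanode2', 'datanode3']
```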
** Hadoop MapReduce (Distributed Processing Framework) **
The other component is MapReduce.
This is a mechanism for dividing data up and processing it in parallel to obtain results more quickly.
MapReduce also consists of two types of nodes, a master and slaves.
Master node: JobTracker
Slave node: TaskTracker
The JobTracker manages MapReduce jobs, assigns tasks to the TaskTrackers, and manages resources.
The TaskTrackers execute the tasks.
Even if a TaskTracker fails, the JobTracker assigns the same task to another TaskTracker,
so processing can continue without starting over from scratch.
Reference: What is NTT DATA / distributed processing technology "Hadoop"? https://oss.nttdata.com/hadoop/hadoop.html
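The task reassignment described above can be sketched conceptually like this. The worker names, failure rate, and retry logic are my own assumptions for illustration, not Hadoop's actual behavior or API:

```python
# Conceptual sketch: a coordinator (like the JobTracker) reassigns a
# failed task to another worker (like a TaskTracker).
import random

WORKERS = ["tasktracker1", "tasktracker2", "tasktracker3"]

def run_on_worker(worker, task):
    """Pretend to run a task; randomly fail to simulate a broken node."""
    if random.random() < 0.3:
        raise RuntimeError(f"{worker} failed while running {task}")
    return f"{task} finished on {worker}"

def run_with_retries(task):
    """Try the task on each worker in turn until one succeeds."""
    for worker in WORKERS:
        try:
            return run_on_worker(worker, task)
        except RuntimeError as error:
            print(f"reassigning: {error}")
    raise RuntimeError(f"{task} could not be completed on any worker")

if __name__ == "__main__":
    print(run_with_retries("map task 1"))
```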
A MapReduce application does the following:
Job definition: defines the information for the map and reduce processing
map processing: gives meaning to the input data as key-value pairs
reduce processing: processes the data aggregated for each key produced by the map step
When the MapReduce processing finishes, the final result is output.
The more servers take part in the processing, the faster it completes, so having a large number of compute nodes is an advantage.
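As a concrete example of the map and reduce steps above, here is a minimal word-count sketch in Python. In a real Hadoop Streaming job the mapper and reducer would be two separate scripts reading standard input and emitting tab-separated key-value lines; here they are chained in one process just to show the data flow, and the sample input is my own:

```python
# A local, single-process imitation of the MapReduce word-count flow.
from itertools import groupby

def mapper(lines):
    """map step: give each input line meaning as (key, value) pairs."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reducer(pairs):
    """reduce step: process the values aggregated for each key."""
    # Hadoop sorts and groups the mapper output by key between the two steps
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["hadoop is a distributed platform",
              "hadoop is open source"]
    for word, total in reducer(mapper(sample)):
        print(f"{word}\t{total}")
    # a            1
    # distributed  1
    # hadoop       2
    # ...
```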
I think the following three points are the main features of Hadoop.
** Server scalability **
With HDFS, if you run out of resources you can add servers to secure more storage capacity and computational resources.
** Fault tolerance **
You can build a Hadoop environment by installing the software on ordinary commodity servers, so no dedicated hardware is required.
In addition, thanks to mechanisms such as HDFS replication and TaskTracker task reassignment, the system keeps running even if a few machines break down.
** Processing flexibility **
The difference from conventional databases is that no schema definition is required when storing data in HDFS.
Because you only have to give the data meaning when you retrieve it, you can simply store the data for the time being.
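Here is a small sketch of this schema-on-read idea, using a made-up log format of my own: the raw lines are stored as-is, and the columns are given meaning only when the data is read:

```python
# Schema-on-read sketch: raw lines are stored untouched, and a schema
# is applied only at read time (the log format here is a made-up example).
raw_logs = [
    "2023-01-01 12:00:00 user1 /index.html 200",
    "2023-01-01 12:00:05 user2 /about.html 404",
]

def read_with_schema(lines):
    """Give meaning to each stored line at read time."""
    for line in lines:
        date, time, user, path, status = line.split()
        yield {"timestamp": f"{date} {time}", "user": user,
               "path": path, "status": int(status)}

for record in read_with_schema(raw_logs):
    print(record)
```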
I first came across Hadoop in 2012.
At that time there were almost no companies presenting at seminars that had adopted it, and we had already introduced it, so I think we were quite rare.
As for the number of machines, I believe there were 3 master NameNodes and 20 DataNodes.
The data volume was about 5 billion rows of log data per month, with about 100 columns per row.
Data of that scale came to several TB; it does not fit on one machine, and there is not enough memory for aggregation.
The most I have ever handled on my own PC is a total of about 200 GB of files, so as a rough guide, once the data exceeds several TB you start thinking about distributed processing.
Places that handle that much data are, I think, using a distributed processing platform.
However, the organizations that have that much data, can actually make full use of Hadoop, and have the human and financial resources to make it worthwhile may be only a small minority.
Unless you are a company with both money and data, it is difficult to introduce and does not make much sense.
Companies that had presented on it back when I attended such events included:
Yahoo
CyberAgent
NTT DATA
Hitachi Solutions
Recruit Technologies
They are all big companies. If you want to acquire distributed processing technology such as Hadoop, you may not be able to learn it unless you work at a company like these.
By the way, I don't use it at all these days lol
Distributed processing is now commonplace and indispensable for handling big data.
I think Hadoop played a very large role in spreading distributed processing to the wider world, so let's at least keep it in mind as knowledge.
14 days until you become an engineer
Otsu py's HP: http://www.otupy.net/
Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw
Twitter: https://twitter.com/otupython