You are all familiar with Big Data and its framework, Hadoop.
Limitations of Hadoop:
Hadoop can perform only batch processing, and data is accessed sequentially. That means one has to scan the entire dataset even for the simplest of jobs.
Now let's talk about the solution:
Hadoop random access databases
Databases such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB store huge amounts of data and allow that data to be accessed randomly, that is, by key rather than by scanning everything.
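To make the difference concrete, here is a minimal Python sketch contrasting a sequential scan with a keyed (random access) lookup. The records and the function names are made up for illustration; real systems work on files and on-disk indexes, not on small in-memory lists.

```python
# Hypothetical records: (row_key, value) pairs, as a batch job would read them.
records = [("user1", "Alice"), ("user2", "Bob"), ("user3", "Carol")]


def sequential_lookup(rows, key):
    """Scan records in order; return (value, number of records examined)."""
    examined = 0
    for row_key, value in rows:
        examined += 1
        if row_key == key:
            return value, examined
    return None, examined


# A keyed index lets us jump straight to one record (random access).
index = dict(records)


def random_lookup(idx, key):
    """Look the key up directly, without touching the other records."""
    return idx.get(key)


print(sequential_lookup(records, "user3"))
print(random_lookup(index, "user3"))
```

The sequential version has to walk past every earlier record, while the keyed version goes directly to the one it needs.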
HBase:
HBase is a distributed, column-oriented database built on top of the Hadoop Distributed File System (HDFS). It is an open-source project and is horizontally scalable.
Column Oriented and Row Oriented
Column-oriented databases store data tables as sections of columns of data, rather than as rows of data. In short, they group columns into column families.
Row-Oriented Database | Column-Oriented Database
Suitable for Online Transaction Processing (OLTP). | Suitable for Online Analytical Processing (OLAP).
Designed for a small number of rows and columns. | Designed for huge tables.
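The two layouts can be sketched in a few lines of Python. This is only an illustration with made-up data (the `rows` list and the `to_columns` helper are hypothetical), not how any particular database stores bytes on disk.

```python
# Row-oriented layout: each record is kept together.
rows = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Carol", "age": 35},
]


def to_columns(row_list):
    """Regroup the same data so that each column's values are kept together."""
    columns = {}
    for row in row_list:
        for col, val in row.items():
            columns.setdefault(col, []).append(val)
    return columns


columns = to_columns(rows)

# OLAP-style query: an aggregate only needs to read one column.
average_age = sum(columns["age"]) / len(columns["age"])

# OLTP-style query: fetching one whole record is natural in the row layout.
record = rows[1]

print(average_age, record)
```

An aggregate over one column touches a single list in the column layout, while fetching a complete record is a single step in the row layout.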
[Image: column families in a column-oriented database]
HBase and RDBMS
HBase | RDBMS
HBase is schema-less; it has no concept of a fixed column schema and defines only column families. | An RDBMS is governed by its schema, which describes the whole structure of its tables.
It is built for wide tables and is horizontally scalable. | It is thin, built for small tables, and hard to scale.
HBase has no multi-row transactions; operations are atomic only at the row level. | An RDBMS is transactional.
It holds de-normalized data. | It holds normalized data.
It is good for semi-structured as well as structured data. | It is good for structured data.
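The "only column families are fixed" idea can be sketched as a nested dictionary in Python. `SketchTable` and its methods are hypothetical names for this illustration and are not the HBase client API; real HBase also keeps timestamped versions of each cell, which is left out here.

```python
class SketchTable:
    """Toy table: column families are fixed, column qualifiers are free-form."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}  # row_key -> {family -> {qualifier -> value}}

    def put(self, row_key, family, qualifier, value):
        # Only the column family is checked against the table definition.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, family, qualifier):
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)


table = SketchTable(["info", "contact"])
table.put("row1", "info", "name", "Alice")
table.put("row1", "contact", "email", "alice@example.com")
# A different row may use a qualifier that no other row has.
table.put("row2", "info", "nickname", "Bobby")

print(table.get("row1", "info", "name"))
print(table.get("row2", "info", "name"))
```

Two rows in the same table can carry completely different columns, as long as those columns belong to a declared family.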
HBase Architecture:
HBase has three major components: the client library, a master server, and region servers. Region servers can be added or removed as per requirement.
MasterServer:
The master server -
1. Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
2. Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
3. Maintains the state of the cluster by negotiating the load balancing.
4. Is responsible for schema changes and other metadata operations such as creation of tables and column families.
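The assignment and load-balancing duties above can be sketched roughly in Python. This is a deliberately naive strategy chosen for illustration (count regions per server and even them out); it is not HBase's actual balancer, and the function names and server names are hypothetical.

```python
def assign_regions(regions, servers):
    """Give each region to the server that currently holds the fewest regions."""
    assignment = {server: [] for server in servers}
    for region in regions:
        target = min(servers, key=lambda s: len(assignment[s]))
        assignment[target].append(region)
    return assignment


def rebalance(assignment):
    """Move regions off the busiest server until counts differ by at most one."""
    while True:
        busiest = max(assignment, key=lambda s: len(assignment[s]))
        idlest = min(assignment, key=lambda s: len(assignment[s]))
        if len(assignment[busiest]) - len(assignment[idlest]) <= 1:
            return assignment
        assignment[idlest].append(assignment[busiest].pop())


cluster = assign_regions(["r1", "r2", "r3", "r4"], ["rs1", "rs2"])
# A new, empty region server joins the cluster; the master evens out the load.
cluster["rs3"] = []
rebalance(cluster)
print(cluster)
```

Adding an empty server and rebalancing mirrors the point that region servers can be added as required, with the master shifting regions onto the less occupied ones.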
Regions
A table is split up by row key into regions, and the regions are spread across the region servers. Each region holds a contiguous range of the table's rows.
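As a rough sketch of how a row key maps to a region, the Python snippet below keeps a sorted list of region start keys and uses binary search to find the owning region. The split points and server names are invented for the example.

```python
import bisect

# Hypothetical split points: region i covers [start_keys[i], start_keys[i + 1]).
region_start_keys = ["", "g", "n", "t"]
# Hypothetical hosting: region i lives on region_servers[i].
region_servers = ["rs1", "rs2", "rs3", "rs1"]


def find_region(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(region_start_keys, row_key) - 1


for key in ["apple", "g", "melon", "zebra"]:
    idx = find_region(key)
    print(key, "-> region", idx, "on", region_servers[idx])
```

Because each region covers a sorted key range, finding the right region never requires scanning the table itself.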
Region server
The region servers host the regions, and they -
Communicate with the client and handle data-related operations.
Handle read and write requests for all the regions under them.
Decide the size of each region by following the region size thresholds.
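The size-threshold behaviour can be sketched in Python as "split a region in two once it grows too large". The threshold here counts rows and the value 4 is arbitrary; this is an assumption made to keep the example small, since a real region server works with data sizes on disk rather than row counts.

```python
import bisect

MAX_ROWS_PER_REGION = 4  # hypothetical threshold, chosen only for the example


def insert_row(regions, row_key):
    """Insert row_key into its region; split that region if it gets too big.

    `regions` is a list of sorted lists of row keys, ordered by key range.
    """
    # Pick the last region whose first key is <= row_key (default: the first).
    target = 0
    for i, region in enumerate(regions):
        if region and region[0] <= row_key:
            target = i
    bisect.insort(regions[target], row_key)

    # Split at the midpoint once the threshold is exceeded.
    if len(regions[target]) > MAX_ROWS_PER_REGION:
        mid = len(regions[target]) // 2
        left, right = regions[target][:mid], regions[target][mid:]
        regions[target:target + 1] = [left, right]
    return regions


regions = [[]]
for key in ["a", "b", "c", "d", "e"]:
    insert_row(regions, key)
print(regions)
```

The table starts as a single region and gains a second one only when the fifth row pushes it over the threshold.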
ZooKeeper
ZooKeeper is an open-source project that provides services such as maintaining configuration information, naming, and distributed synchronization.
ZooKeeper has ephemeral nodes representing the different region servers. The master server uses these nodes to discover available servers.
In addition to availability, the nodes are also used to track server failures and network partitions.
Clients use ZooKeeper to locate region servers and then communicate with those servers directly.
In pseudo-distributed and standalone modes, HBase itself manages ZooKeeper.
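The ephemeral-node idea can be imitated with a small in-memory Python class. `SketchCoordinator`, its methods, and the `/hbase/rs/` path prefix are stand-ins invented for this sketch; it does not talk to a real ZooKeeper and is not the ZooKeeper client API.

```python
class SketchCoordinator:
    """In-memory stand-in for a coordination service with ephemeral nodes."""

    def __init__(self):
        self.ephemeral = {}  # node path -> owning session id

    def register(self, session_id, path):
        """Create an ephemeral node tied to the given session."""
        self.ephemeral[path] = session_id

    def expire_session(self, session_id):
        """Drop every node owned by the session, as when a server dies."""
        self.ephemeral = {
            p: s for p, s in self.ephemeral.items() if s != session_id
        }

    def live_servers(self, prefix="/hbase/rs/"):
        """List the servers that still have a node under the prefix."""
        return sorted(p[len(prefix):] for p in self.ephemeral if p.startswith(prefix))


zk = SketchCoordinator()
zk.register("session-1", "/hbase/rs/rs1")
zk.register("session-2", "/hbase/rs/rs2")
print(zk.live_servers())

# rs2 crashes or is cut off by a network partition: its session expires.
zk.expire_session("session-2")
print(zk.live_servers())
```

The master does not need to poll each server. A server's node disappears together with its session, so listing the nodes is enough to see which servers are alive.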
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Note: HBase can be configured in three modes, like Hadoop:
1.) Standalone Mode
2.) Pseudo Distributed Mode
3.) Fully Distributed Mode
If you want to take Big Data Hadoop training, please visit http://www.bigdatahadoop.info/