Monday, 26 December 2016

What is HDFS

Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability against failure and high availability to parallel applications.
It is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes and the name node.

Where to use HDFS

  • Very Large Files: Files should be of hundreds of megabytes, gigabytes or more.
  • Streaming Data Access: The time to read the whole dataset is more important than the latency in reading the first record. HDFS is built on a write-once, read-many-times pattern.
  • Commodity Hardware: It works on low-cost hardware.

Where not to use HDFS

  • Low Latency Data Access: Applications that require very fast access to the first record should not use HDFS, as it gives importance to reading the whole dataset rather than the time to fetch the first record.
  • Lots of Small Files: The name node holds the metadata of files in memory, and if the files are small in size this consumes a lot of the name node's memory, which is not feasible.
  • Multiple Writers: It should not be used when we need multiple writers or arbitrary file modifications.

HDFS Concepts

  1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike in an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; i.e. a 5 MB file stored in HDFS with a block size of 128 MB takes 5 MB of space only. The HDFS block size is large simply to minimize the cost of seeks.
  2. Name Node: HDFS works in a master-worker pattern where the name node acts as master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata being file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to it. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. File system operations like opening, closing, renaming etc. are executed by it.
  3. Data Node: Data nodes store and retrieve blocks when they are told to, by the client or the name node. They report back to the name node periodically with the list of blocks that they are storing. The data node, being commodity hardware, also does the work of block creation, deletion and replication as directed by the name node.
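The block-splitting arithmetic described above can be sketched in a few lines of Python. This is an illustration only, not HDFS code; the function name is made up:

```python
# A minimal sketch (not HDFS code) of how a file is split into
# block-sized chunks, and why a small file does not occupy a full block.

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(remaining, block_size))
        remaining -= block_size
    return blocks

five_mb = 5 * 1024 * 1024
blocks = split_into_blocks(five_mb)
print(len(blocks), sum(blocks) == five_mb)   # 1 True: one block, occupying only 5 MB

big = 300 * 1024 * 1024                      # a 300 MB file
print([b // (1024 * 1024) for b in split_into_blocks(big)])  # [128, 128, 44]
```

Note that the last chunk of the 300 MB file is only 44 MB, and on disk it takes only 44 MB, not a full 128 MB block.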
[Image: HDFS DataNode and NameNode]
[Image: HDFS Read]
[Image: HDFS Write]
Since all the metadata is stored in the name node, it is very important. If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present in the data nodes. To overcome this, the concept of the secondary name node arises.

Secondary Name Node: It is a separate physical machine which acts as a helper to the name node. It performs periodic checkpoints: it communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data.

Friday, 11 November 2016

When to use Pig Latin versus Hive SQL?

Image credit: gerard79 at stock.xchng


Once your big data is loaded into Hadoop, what’s the best way to use that data?  You’ll need some way to filter and aggregate the data, and then apply the results for something useful.  Collecting terabytes and petabytes of web traffic data is not useful until you have a way to extract meaningful data insights out of it.

That’s where MapReduce comes in.  MapReduce permits you to filter and aggregate data from HDFS so that you can gain insights from the big data.  However, writing MapReduce code in plain Java is laborious, requiring many lines of code, with additional time needed for code review and QA.

So instead of writing plain Java code to use MapReduce, you now have the options of using either the Pig Latin or Hive SQL languages to construct MapReduce programs.  (There’s also another option to use the Hadoop Streaming protocol with STDIN and STDOUT with any language such as Python or even BASH shell script, but we’ll explore that option more on another day.)  The benefit is that you need to write far fewer lines of code, thus reducing overall development and testing time.  The rule of thumb is that writing Pig scripts takes 5% of the time compared to writing MapReduce programs in Java, while reducing runtime performance by only 50%.  Although Pig and Hive scripts generally don’t run as fast as native Java MapReduce programs, they are vastly superior in boosting productivity for data engineers and analysts.
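As a taste of what MapReduce-style filtering and aggregation looks like, here is a minimal word-count sketch in plain Python. It is illustrative only and not tied to any Hadoop API; with the Streaming option mentioned above, the `mapper` and `reducer` functions below would instead be two separate scripts reading STDIN and writing STDOUT:

```python
# A minimal word-count sketch in the MapReduce style (illustrative, not
# real Hadoop code). The shuffle/sort phase between map and reduce is
# simulated here with a plain sorted() call.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit (word, 1) for every word on every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key, so we can sum the
    # counts for each word in a single pass.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

data = ["big data big insights", "big wins"]
shuffled = sorted(mapper(data))          # simulate the shuffle/sort phase
print(dict(reducer(shuffled)))           # {'big': 3, 'data': 1, 'insights': 1, 'wins': 1}
```

Even this toy version shows why Pig and Hive exist: the real Java equivalent needs mapper and reducer classes, a job driver, and build configuration, while a Pig or Hive script expresses the same idea in a handful of lines.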

When should you use Pig Latin and when should you use Hive?

Depending on where you work, you may need to simply use whatever standards your company has established.

For example, Hive is commonly used at Facebook for analytical purposes.  Facebook promotes the Hive language and their employees frequently speak about Hive at Big Data and Hadoop conferences.
However, Yahoo! is a big advocate for Pig Latin.  Yahoo! has one of the biggest Hadoop clusters in the world.  Their data engineers use Pig for data processing on their Hadoop clusters.

Alternatively, you may have a choice of Pig or Hive at your organization, especially if no standards have yet been established, or perhaps multiple standards have been set up.

If you know SQL, then Hive will be very familiar to you.  Since Hive uses SQL, you will feel at home with all the familiar select, where, group by, and order by clauses similar to SQL for relational databases.  You do, however, lose some ability to optimize the query, by relying on the Hive optimizer.  This seems to be the case for any implementation of SQL on any platform, Hadoop or traditional RDBMS, where hints are sometimes ironically needed to teach the automatic optimizer how to optimize properly.

However, compared to Hive, Pig needs some mental adjustment for SQL users to learn.  Pig Latin has many of the usual data processing concepts that SQL has, such as filtering, selecting, grouping, and ordering, but the syntax is a little different from SQL (particularly the group by and flatten statements!).  Pig requires more verbose coding, although it’s still a fraction of what straight Java MapReduce programs require.  Pig also gives you more control and optimization over the flow of the data than Hive does.

Personally, I use both Pig Latin and Hive, although for different purposes.  I learned Pig Latin first, and have used it to construct dataflows, where I typically have a scheduled job to periodically crunch the massive data from HDFS and to transfer the summarized data into a relational database for reporting, dashboarding, and ad-hoc analyses.  I also use Hive for some simple ad-hoc analytical queries into the data in HDFS, as Hive queries are a lot faster to write for those types of queries.  However, I don’t use Hive for the automated batch jobs that move data between HDFS and other systems.  I find that I can tune the dataflow process better using Pig than with Hive.  Additionally, some of the datasets that I need in Hadoop have not yet been structured with metadata schemas for use with Hive.  In those cases, Pig is much more flexible in reading those datasets than Hive is.

Hadoop expert Alan Gates has an excellent write-up comparing the differences between Pig Latin and Hive and when to use each of them.

If you are a data engineer, then you’ll likely feel like you’ll have better control over the dataflow (ETL) processes when you use Pig Latin, especially if you come from a procedural language background.  If you are a data analyst, however, you will likely find that you can ramp up on Hadoop faster by using Hive, especially if your previous experience was more with SQL than with a procedural programming language.  If you really want to become a Hadoop expert, then you should learn both Pig Latin and Hive for the ultimate flexibility.

Tuesday, 11 October 2016

10 Reasons Why Big Data Analytics is the Best Career Move

The Great White is considered to be the King of the Ocean. This is because the great White is on top of its game. Imagine if you could be on top of the game in the ocean of Big Data!

Big Data is everywhere and there is almost an urgent need to collect and preserve whatever data is being generated, for fear of missing out on something important. There is a huge amount of data floating around. What we do with it is all that matters right now. This is why Big Data Analytics is at the frontier of IT. Big Data Analytics has become crucial as it aids in improving business, decision making and providing the biggest edge over competitors. This applies to organizations as well as professionals in the Analytics domain. For professionals who are skilled in Big Data Analytics, there is an ocean of opportunities out there.

Why Big Data Analytics is the Best Career move

If you are still not convinced by the fact that Big Data Analytics is one of the hottest skills, here are 10 more reasons for you to see the big picture.

1. Soaring Demand for Analytics Professionals:

Jeanne Harris, senior executive at Accenture Institute for High Performance, has stressed the significance of analytics professionals by saying, “…data is useless without the skill to analyze it.” There are more job opportunities in Big Data management and Analytics than there were last year and many IT professionals are prepared to invest time and money for the training.
The job trend graph for Big Data Analytics, from Indeed.com, proves that there is a growing trend for it and as a result there is a steady increase in the number of job opportunities.
[Image: Big Data Analytics job trend graph. Source: Indeed.com]
The current demand for qualified data professionals is just the beginning. Srikanth Velamakanni, the Bangalore-based cofounder and CEO of CA-headquartered Fractal Analytics, states: “In the next few years, the size of the analytics market will evolve to at least one-third of the global IT market from the current one-tenth”.

Technology professionals who are experienced in Analytics are in high demand as organizations are looking for ways to exploit the power of Big Data. The number of job postings related to Analytics in Indeed and Dice has increased substantially over the last 12 months. Other job sites are showing similar patterns as well. This apparent surge is due to the increased number of organizations implementing Analytics and thereby looking for Analytics professionals.

In a study by QuinStreet Inc., it was found that the trend of implementing Big Data Analytics is zooming and is considered to be a high priority among U.S. businesses. A majority of the organizations are in the process of implementing it or actively planning to add this feature within the next two years.


2. Huge Job Opportunities & Meeting the Skill Gap:

The demand for Analytics skills is going up steadily but there is a huge deficit on the supply side. This is happening globally and is not restricted to any particular geography. In spite of Big Data Analytics being a ‘Hot’ job, there is still a large number of unfilled jobs across the globe due to a shortage of the required skills. A McKinsey Global Institute study states that by 2018 the US will face a shortage of about 190,000 data scientists and 1.5 million managers and analysts who can understand and make decisions using Big Data.
India currently has the highest concentration of analytics professionals globally. In spite of this, the scarcity of data analytics talent is particularly acute, and demand for talent is expected to be on the higher side as more global organizations outsource their work.

According to Srikanth Velamakanni, co-founder and CEO of Fractal Analytics, there are two types of talent deficits: Data Scientists, who can perform analytics, and Analytics Consultants, who can understand and use data. The talent supply for these job titles, especially Data Scientists, is extremely scarce and the demand is huge.

3. Salary Aspects:

Strong demand for Data Analytics skills is boosting the wages for qualified professionals and making Big Data pay big bucks for the right skill. This phenomenon is being seen globally where countries like Australia and the U.K are witnessing this ‘Moolah Marathon’.

According to the 2015 Skills and Salary Survey Report published by the Institute of Analytics Professionals of Australia (IAPA), the annual median salary for data analysts is $130,000, up four per cent from last year. Continuing the trend set in 2013 and 2014, the median respondent earns 184% of the Australian full-time median salary. The rising demand for analytics professionals is also reflected in IAPA’s membership, which has grown to more than 5000 members in Australia since its formation in 2006.

Randstad states that the annual pay hikes for Analytics professionals in India is on an average 50% more than other IT professionals. According to The Indian Analytics Industry Salary Trend Report by Great Lakes Institute of Management, the average salaries for analytics professionals in India was up by 21% in 2015 as compared to 2014. The report also states that 14% of all analytics professionals get a salary of more than Rs. 15 lakh per annum.

A look at the salary trend for Big Data Analytics in the UK also indicates a positive and exponential growth. A quick search on Itjobswatch.co.uk shows a median salary of £62,500 in early 2016 for Big Data Analytics jobs, as compared to £55,000 in the same period in 2015. Also, a year-on-year median salary change of +13.63% is observed.

The table below looks at the statistics for Big Data Analytics skills in IT jobs advertised across the UK. Included is a guide to the salaries offered in IT jobs that have cited Big Data Analytics over the 3 months to 23 June 2016 with a comparison to the same period over the previous 2 years.

4. Big Data Analytics: A Top Priority in a lot of Organizations

According to the ‘Peer Research – Big Data Analytics’ survey, Big Data Analytics is one of the top priorities of the organizations participating in the survey, as they believe that it improves the performance of their organizations.
[Image: Big Data Analytics – A Top Priority in Organizations]
Based on the responses, it was found that approximately 45% of those surveyed believe that Big Data Analytics will enable much more precise business insights, while 38% are looking to use Analytics to recognize sales and market opportunities. More than 60% of the respondents are depending on Big Data Analytics to boost their organization’s social media marketing abilities. The QuinStreet research based on their survey also backs the fact that Analytics is the need of the hour, with 77% of the respondents considering Big Data Analytics a top priority.

A survey by Deloitte, Technology in the Mid-Market; Perspectives and Priorities, reports that executives clearly see the value of analytics. Based on the survey, 65.2% of respondents are using some form of analytics that is helping their business needs. The image below clearly depicts their attitude and belief towards Big Data Analytics.
[Image: Big Data Analytics – A Top Priority in Organizations]

5. Adoption of Big Data Analytics is Growing:

New technologies are now making it easier to perform increasingly sophisticated data analytics on very large and diverse datasets. This is evident from the report by The Data Warehousing Institute (TDWI). According to this report, more than a third of the respondents are currently using some form of advanced analytics on Big Data for Business Intelligence, Predictive Analytics and Data Mining tasks.

With Big Data Analytics providing an edge over the competition, the rate of implementation of the necessary Analytics tools has increased exponentially. In fact, most of the respondents of the ‘Peer Research – Big Data Analytics’ survey reported that they already have a strategy set up for dealing with Big Data Analytics. And those who are yet to come up with a strategy are in the process of planning for one.
[Image: Formal Strategy for Big Data Analytics]
When it comes to Big Data Analytics tools, the adoption of the Apache Hadoop framework continues to be the popular choice. There are various commercial and open-source frameworks to choose from, and organizations are making the appropriate choice based on their requirements. Over half of the respondents have already deployed or are currently implementing a Hadoop distribution. Of them, a quarter of the respondents have deployed an open-source framework, which is twice the number of organizations that have deployed a commercial distribution of the Hadoop framework.
[Image: Adoption of Big Data Analytics Tools]

6. Analytics: A Key Factor in Decision Making

Analytics is a key competitive resource for many companies. There is no doubt about that. According to the ‘Analytics Advantage’ survey overseen by Tom Davenport, ninety-six percent of respondents feel that analytics will become more important to their organizations in the next three years. This is because there is a huge amount of data that is not being used and, at this point, only rudimentary analytics is being done. About forty-nine percent of the respondents strongly believe that analytics is a key factor in better decision-making capabilities. Another sixteen percent value it most for driving key strategic initiatives.

Even though there is a fight for the title of ‘Greatest Benefit of Big Data Analytics’, one thing is undeniable and stands out the most: Analytics play an important role in driving business strategy and making effective business decisions.
[Image: Key Benefits of Big Data Analytics]
Seventy-four percent of the respondents of the ‘Peer Research – Big Data Analytics’ survey agreed that Big Data Analytics adds value to their organization and provides vital information for making timely and effective business decisions of great importance. This is a clear indicator that Big Data Analytics is here to stay and that a career in it is one of the wisest decisions one can make.

7. The Rise of Unstructured and Semistructured Data Analytics:

The ‘Peer Research – Big Data Analytics’ survey clearly reports that there is huge growth when it comes to unstructured and semistructured data analytics. Eighty-four percent of the respondents mentioned that the organization they work for is currently processing and analyzing unstructured data sources, including weblogs, social media, e-mail, photos, and video. The remaining respondents indicated that steps are being taken to implement this in the next 12 to 18 months.
[Image: Rise of Unstructured Data]

8. Big Data Analytics is Used Everywhere!

It is a given that there is a huge demand for Big Data Analytics owing to its awesome features. The tremendous growth is also due to the varied domains across which Analytics is being utilized. The image below depicts the job opportunities across various domains.
[Image: Big Data Analytics Across Domains]

9. Surpassing Market Forecast / Predictions for Big Data Analytics:

Big Data Analytics topped a survey carried out by Nimbus Ninety as the most disruptive technology that will have the biggest influence in three years’ time. Added to this, there are more market forecasts that support it:
  • According to IDC, the Big Data Analytics market will reach $125 billion worldwide in 2015.
  • IIA states that Big Data Analytics tools will be the first line of defense, combining machine learning, text mining and ontology modeling to provide holistic and integrated security threat prediction, detection, and deterrence and prevention programs.

  • According to the survey ‘The Future of Big Data Analytics – Global Market and Technologies Forecast – 2015-2020’, Big Data Analytics Global Market will grow by 14.4% CAGR over this period.

  • The Big Data Analytics Global Market for Apps and Analytics Technology will grow by 28.2% CAGR, for  Cloud Technology will grow by 16.1% CAGR, for Computing Technology will grow by 7.1% CAGR, for NoSQL Technology will grow by 18.9% CAGR  over the entire 2015-2020 period.
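For readers unfamiliar with the term, CAGR (compound annual growth rate) figures compound year over year. A quick arithmetic sketch in plain Python (illustrative only) shows what the 14.4% CAGR forecast above implies over the 2015-2020 window:

```python
# CAGR arithmetic: a market growing at annual rate r for n years is
# multiplied by (1 + r) ** n overall.
def compound_growth(start, rate, years):
    return start * (1 + rate) ** years

# The 14.4% CAGR figure over the 5-year 2015-2020 window:
multiplier = compound_growth(1.0, 0.144, 5)
print(round(multiplier, 2))  # 1.96, i.e. the market nearly doubles

# The faster-growing Apps and Analytics Technology segment at 28.2% CAGR:
print(round(compound_growth(1.0, 0.282, 5), 2))  # 3.46, well over triple
```

In other words, even the slowest segment cited above implies substantial growth over five years, which is the point the forecasts are making.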

10. Numerous Choices in Job Titles and Types of Analytics:

From a career point of view, there are many options available, in terms of domain as well as the nature of the job. Since Analytics is utilized in varied fields, there are numerous job titles for one to choose from.
  • Big Data Analytics Business Consultant
  • Big Data Analytics Architect
  • Big Data Engineer
  • Big Data Solution Architect
  • Big Data Analyst
  • Analytics Associate
  • Business Intelligence and Analytics Consultant
  • Metrics and Analytics Specialist
A Big Data Analytics career runs deep, and one can choose from the 3 types of data analytics depending on the Big Data environment.
  • Prescriptive Analytics
  • Predictive Analytics
  • Descriptive Analytics.
A huge array of organizations like Ayata, IBM, Alteryx, Teradata, TIBCO, Microsoft, Platfora, ITrend, Karmasphere, Oracle, Opera, Datameer, Pentaho, Centrofuge, FICO, Domo, Quid, Saffron, Jaspersoft, GoodData, Bluefin Labs, Tracx, Panaroma Software, and countless more are utilizing Big Data Analytics for their business needs, and huge job opportunities are possible with them.

Saturday, 17 September 2016

Getting Started with Hbase

By now you all know about Big Data and its framework, Hadoop.

Limitations of Hadoop:

Hadoop can perform only batch processing, and data is accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs.

Now let's talk about the solution:

Random-access databases for Hadoop

Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and access the data in a random manner.


HBase:

HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.


Column Oriented and Row Oriented
Column-oriented databases are those that store data tables as sections of columns of data, rather than as rows of data. In short, they have column families.

  • Row-Oriented Database: Suitable for Online Transaction Processing (OLTP). Such databases are designed for a small number of rows and columns.
  • Column-Oriented Database: Suitable for Online Analytical Processing (OLAP). Such databases are designed for huge tables.

[Image: column families in a column-oriented database]
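The row/column distinction can be sketched with a toy example in plain Python (not HBase code; the records and field names are made up):

```python
# A toy sketch of the same records stored row-oriented vs column-oriented.

# Row-oriented: one record per entry, all fields together.
rows = [
    {"id": 1, "name": "alice", "city": "pune"},
    {"id": 2, "name": "bob",   "city": "delhi"},
]

# Column-oriented: one array per column, so an analytical scan over a
# single column touches only that column's data.
columns = {
    "id":   [1, 2],
    "name": ["alice", "bob"],
    "city": ["pune", "delhi"],
}

# OLTP-style access (fetch one whole record) is natural with rows:
print(rows[0])               # {'id': 1, 'name': 'alice', 'city': 'pune'}

# OLAP-style access (aggregate one column over all records) is natural
# with columns; the other columns are never touched:
print(len(columns["name"]))  # 2
```

This is why the table above pairs row orientation with OLTP and column orientation with OLAP: each layout makes its typical access pattern cheap.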


HBase and RDBMS
  • HBase is schema-less; it doesn't have the concept of a fixed-column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
  • HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
  • There are no transactions in HBase. An RDBMS is transactional.
  • HBase has de-normalized data. An RDBMS has normalized data.
  • HBase is good for semi-structured as well as structured data. An RDBMS is good for structured data.

HBase Architecture:


HBase has three major components: the client library, a master server, and region servers. Region servers can be added or removed as per requirement.

MasterServer:

The master server -

1. Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.

2. Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.

3. Maintains the state of the cluster by negotiating the load balancing.

4. Is responsible for schema changes and other metadata operations such as the creation of tables and column families.
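The load-balancing duty in point 2 can be sketched with a toy rebalancer in plain Python. This is not HBase's actual balancer (which weighs many more factors); it only illustrates the idea of shifting regions from busy servers to less occupied ones:

```python
# A toy sketch of the master's load-balancing idea: move regions from
# the busiest region server to the least busy one until the region
# counts are as even as possible.
def rebalance(assignment):
    """assignment: {server: [region, ...]}; mutated toward even load."""
    while True:
        busiest = max(assignment, key=lambda s: len(assignment[s]))
        idlest = min(assignment, key=lambda s: len(assignment[s]))
        if len(assignment[busiest]) - len(assignment[idlest]) <= 1:
            return assignment
        # shift one region from the busy server to the idle one
        assignment[idlest].append(assignment[busiest].pop())

servers = {"rs1": ["r1", "r2", "r3", "r4"], "rs2": ["r5"], "rs3": []}
rebalance(servers)
print(sorted(len(r) for r in servers.values()))  # [1, 2, 2]
```

After rebalancing, no server holds more than one region above any other, which is the even spread the master is negotiating in point 3.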

Regions

Regions are nothing but tables that are split up and spread across the region servers.

Region server

The region servers -

   Communicate with the clients and handle data-related operations.
   Handle read and write requests for all the regions under them.
   Decide the size of the regions by following the region size thresholds.

Zookeeper

    ZooKeeper is an open-source project that provides services like maintaining configuration information, naming, providing distributed synchronization, etc.

    ZooKeeper has ephemeral nodes representing different region servers. Master servers use these nodes to discover available servers.

    In addition to availability, the nodes are also used to track server failures or network partitions.

    Clients communicate with region servers via ZooKeeper.

    In pseudo-distributed and standalone modes, HBase itself will take care of ZooKeeper.
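The ephemeral-node idea can be simulated in a few lines of plain Python. This is a toy stand-in, not the real ZooKeeper API, and all class and path names are made up; it only shows why a node that vanishes with its session is useful for tracking live servers:

```python
# A toy simulation (no real ZooKeeper) of how ephemeral nodes let the
# master track live region servers: a server's node exists only while
# its session is alive.
class ToyZooKeeper:
    def __init__(self):
        self.ephemeral = {}          # path -> owning session id

    def register(self, session_id, path):
        # A region server creates an ephemeral node when it starts.
        self.ephemeral[path] = session_id

    def session_expired(self, session_id):
        # When a session dies (crash or network partition), its
        # ephemeral nodes vanish automatically.
        self.ephemeral = {p: s for p, s in self.ephemeral.items()
                          if s != session_id}

    def live_servers(self):
        # The master discovers available servers by listing the nodes.
        return sorted(self.ephemeral)

zk = ToyZooKeeper()
zk.register(1, "/hbase/rs/rs1")
zk.register(2, "/hbase/rs/rs2")
zk.session_expired(1)                # rs1 crashes or is partitioned away
print(zk.live_servers())             # ['/hbase/rs/rs2']
```

The key property is that failure detection is automatic: nobody has to delete rs1's node, it disappears with the session, and the master sees the updated list.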


 
Note: HBase can be configured in three modes, like Hadoop:

1.) Standalone Mode
2.) Pseudo-Distributed Mode
3.) Fully Distributed Mode

If anyone wants to do Big Data Hadoop training, please visit http://www.bigdatahadoop.info/