BIG DATA
Definition
Big Data is the term used to describe the exponential growth, availability and usage of structured and unstructured data. Big Data is becoming a global phenomenon across sectors, including government, business and society, because high volumes of data and efficient, affordable processing technologies are now available, enabling more accurate, fact-based analyses. The more accurate the analyses, the better the understanding of business problems, opportunities and threats, and the better the decision making for operational excellence in any business.
The relevance and emergence of Big Data have increased because of the 3Vs of data: Volume, Velocity and Variety. Volume refers to the large amount of data being generated through a variety of sources. This data may be structured, such as financial transactions, invoices and personal details, or unstructured, such as Twitter/Facebook messages, reviews, videos and images. Velocity indicates the speed at which new data is being generated; social media messages and sensor data are examples of high-volume data generated at high velocity. Variety refers to the different types of data available for storage, retrieval and analysis.
Difference between Big Data and Traditional Data
Traditional data includes data such as documents, finances, stocks and personal files, which are more structured and mostly human-generated. Big Data, by contrast, mainly refers to the large volume of data being generated through a variety of sources, not just humans but also machines, processes etc. This data can include social media content, sensor data, RFID data and scientific research data, running into millions or billions of records and very large volumes.
Why Big Data Matters
In the last few years businesses have witnessed rapid growth of data, whether in the retail, logistics, financial or health industries. Many factors are contributing to this growth. The total cost of application ownership is falling as application providers offer cloud/subscription-based solutions. The Internet is becoming accessible to more and more end customers thanks to affordable smartphones and higher bandwidth (4G). Social media has become mainstream in daily life, business and politics. More and more objects/things are being connected to the Internet (popularly known as the Internet of Things), generating data at each stage of the supply chain / value chain of businesses. Data is generated not only by humans but also by machines, in the form of RFID feeds, medical records, sensor data, scientific data, service logs, web content, maps, GPS logs etc. Some data is generated so fast that there is not even time to store it before applying analytics to it.
This exponential growth of structured and unstructured data at ever higher speed is forcing businesses to explore it rather than ignore it. To remain competitive, businesses have to leverage the data available within and outside the organization and use it in the decision-making process, whether to understand customers, suppliers, employees or internal/external processes.
How to Get a Handle on Analyzing Big Data
Getting a handle on Big Data analytics starts with analyzing the company's business environment and strategies and understanding the sources of data and their relevance to the business. It is also important to know how data-driven companies and other analytical competitors are exploiting Big Data to achieve strategic advantage. Understanding Big Data analytics involves not only the technical aspects of Big Data infrastructure, such as Hadoop, but also the logical aspects of analytics, such as data modeling and mining and their application in business to support better decisions. Literature reviews and research alone will not give a company confidence that Big Data analytics is the way forward. Companies can instead start a pilot project focusing on one strategic aspect of the business where Big Data analytics could add value to decision making.
Hadoop and MapReduce
Hadoop is an open source software framework that enables the processing of large data sets on distributed clusters of commodity servers. The main advantage of Hadoop is its scalability: a cluster can grow from one to thousands of servers, with the possibility of parallel processing (computing). Hadoop has two main components: a) HDFS and b) MapReduce. HDFS, the Hadoop Distributed File System, spans all the nodes of the cluster within the Hadoop architecture. Unlike an RDBMS, it has a schema-less architecture, so it can easily store different types of structured and unstructured data. Hadoop is also fault tolerant: if one node fails, replicated copies of the data on other nodes allow processing to recover easily.
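As a concrete illustration of this schema-less, replicated storage, here is a minimal sketch using Hadoop's Java FileSystem API to write a file into HDFS. The NameNode address, file path and sample record are placeholder assumptions for illustration, not details from this post.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Point the client at the cluster's NameNode
            // (hdfs://namenode:9000 is a placeholder address).
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // HDFS is schema-less: it stores raw bytes, so structured
            // records and unstructured content (JSON, logs, images) are
            // all written the same way, as files.
            Path dest = new Path("/data/tweets/sample.json");
            try (FSDataOutputStream out = fs.create(dest)) {
                out.writeBytes("{\"user\":\"alice\",\"text\":\"I love it\"}\n");
            }

            // Each block of the file is replicated across nodes (3 copies
            // by default), which is what lets the cluster tolerate the
            // failure of an individual node.
            System.out.println("Replication factor: "
                    + fs.getFileStatus(dest).getReplication());
        }
    }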
MapReduce is at the core of Hadoop. It is a highly scalable, cluster-based data processing technology consisting of two main tasks, a) Map and b) Reduce, executed in sequence. The "map" task takes a set of data from the source and converts it into key/value pairs. The "reduce" task then takes the key/value pairs output by the "map" task and combines (reduces) those tuples (rows) into a smaller number of tuples. For example, suppose there is a requirement to collect Twitter data about a newly released song for sentiment analysis, and the chosen key phrases are "I love it", "I like it" and "I hate it". The map task finds these keys and computes a value (count) in each data set stored on a node of HDFS, and the reduce task combines these computations across all the nodes to produce the final result (e.g. the final count per key). In this way it becomes possible to process vast volumes of data, scaling with the size of the cluster.
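The following is a minimal sketch of that job using the standard Hadoop MapReduce Java API. The class names (PhraseCount, PhraseMapper, SumReducer) and the simple substring matching are illustrative assumptions, not code from this post.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PhraseCount {

        // Map: scan each tweet (one line of input) and emit (phrase, 1)
        // for every chosen phrase the tweet contains.
        public static class PhraseMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final String[] PHRASES =
                    {"i love it", "i like it", "i hate it"};
            private static final IntWritable ONE = new IntWritable(1);
            private final Text phrase = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String tweet = value.toString().toLowerCase();
                for (String p : PHRASES) {
                    if (tweet.contains(p)) {
                        phrase.set(p);
                        context.write(phrase, ONE);
                    }
                }
            }
        }

        // Reduce: sum the 1s emitted for each phrase across all mappers,
        // yielding the final count per phrase.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable total = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                total.set(sum);
                context.write(key, total);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "phrase count");
            job.setJarByClass(PhraseCount.class);
            job.setMapperClass(PhraseMapper.class);
            job.setCombinerClass(SumReducer.class); // pre-aggregate per node
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // tweets in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // results
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Each mapper works on the HDFS blocks stored on its own node; Hadoop then shuffles all pairs with the same key to a reducer, which sums them into the final per-phrase counts, exactly as described above.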
Reference: A Blog from Javra Software Nepal.