Monday, June 15, 2015

BIG DATA


The traditional approach was to extract data from distributed systems, store it in a centralized data warehouse, and build analytics on top of that warehouse. These central data warehouses were relational database management systems, used mainly to process structured data such as financial transactions, shipping data, and employee information. However, with technological advances on various fronts, it has become possible to generate large volumes of data, both structured and unstructured, at high frequency from a wide variety of sources. This led to a tipping point at which the traditional architecture was no longer scalable or efficient enough to store, process, and retrieve such large data sets.

Definition
Big Data is the term used to describe the exponential growth, availability, and use of structured and unstructured data. It is becoming a global phenomenon across all sectors, including government, business, and society, because large volumes of data are now available alongside efficient and affordable processing technologies, which together make more accurate, fact-based analyses possible. The more accurate the analyses, the better the understanding of business problems, opportunities, and threats, and the better the decisions made in pursuit of operational excellence.
The relevance of Big Data has grown because of the 3Vs of data: Volume, Velocity, and Variety. Volume refers to the large amount of data being generated from a variety of sources. These data may be structured, such as financial transactions, invoices, and personal details, or unstructured, such as Twitter and Facebook messages, reviews, videos, and images. Velocity indicates the speed at which new data is generated; social media messages and sensor data are examples of large volumes of data generated at high velocity. Variety refers to the different types of data available for storage, retrieval, and analysis.

Difference between Big Data and Traditional Data
Traditional data includes documents, financial records, stock data, and personal files, which are mostly structured and mostly human generated. Big Data, by contrast, refers mainly to the large volume of data generated from a variety of sources: not just humans but also machines and processes. Examples include social media content, sensor data, RFID data, and scientific research data, with records numbering in the millions or billions and a correspondingly large storage footprint.

Why Big Data Matters
In the last few years, businesses have witnessed rapid growth of data, whether in retail, logistics, finance, or health care. Many factors are contributing to this growth. The total cost of application ownership is falling as application providers offer cloud and subscription-based solutions. The Internet is becoming accessible to more and more end customers thanks to affordable smartphones and higher bandwidth (4G). Social media is becoming a mainstream part of daily life, business, and politics. More and more objects are being connected to the Internet (popularly known as the Internet of Things), generating data at each stage of a business's supply chain and value chain. Data is being generated not only by humans but also by machines, in the form of RFID feeds, medical records, sensor data, scientific data, service logs, web content, maps, GPS logs, and so on. Some data is generated so fast that there is not even time to store it before applying analytics to it.
This exponential, high-speed growth of structured and unstructured data is forcing businesses to explore it rather than ignore it. To remain competitive, businesses have to leverage the data available inside and outside the organization and use it in decision making, whether the goal is to understand customers, suppliers, employees, or internal and external processes.

How to Get a Handle on Analyzing Big Data
Getting a handle on analyzing Big Data starts with analyzing the company's business environment and strategies, and with understanding its sources of data and their relevance to the business. It is also important to know how data-driven companies and other analytical competitors are exploiting Big Data to gain strategic advantage. Understanding Big Data analytics involves not only the technical aspects of Big Data infrastructure such as Hadoop, but also the logical aspects of analytics, such as data modeling, data mining, and their application to business decisions. Literature reviews and research alone will not give a company confidence that Big Data analytics is the way forward. Instead, a company can start with a pilot project focused on one strategic aspect of the business where Big Data analytics could add value to the decision-making process.

Hadoop and MapReduce
Hadoop is an open source software infrastructure that enables the processing of large data sets across distributed clusters of commodity servers. Its main advantage is scalability: a cluster can grow from a single server to thousands of servers that process data in parallel. Hadoop has two main components: a) HDFS and b) MapReduce. HDFS, the Hadoop Distributed File System, spans all the nodes of a Hadoop cluster. Unlike an RDBMS, it does not impose a schema on the data it stores, so it can easily hold different types of structured and unstructured data. Hadoop is also fault tolerant: data is replicated across nodes, so if one node fails, other nodes holding copies of the data take over and the cluster recovers easily.
MapReduce is at the core of Hadoop. It is a highly scalable, cluster-based data processing technology that consists of two main tasks, a) Map and b) Reduce, executed in that order. The "map" task takes a set of data from the source and converts it into key/value pairs. The "reduce" task then takes the key/value pairs output by the "map" task and combines (reduces) those tuples (rows) into a smaller number of tuples. For example, suppose there is a requirement to collect Twitter data about a newly released song for sentiment analysis. The chosen key phrases are "I love it", "I like it", and "I hate it". The map task finds these keys and computes a value (count) for each one within the data stored on each HDFS node, and the reduce task combines the results from all nodes into a final set of values (e.g. the total count for each key). In this way, the volume of data that can be processed grows with the size of the cluster.
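The song-sentiment example above can be sketched in a few lines of plain Python. This is only an in-process illustration of the map and reduce steps, not Hadoop's actual API: the function names map_partition and reduce_counts, the sample tweets, and the use of Python lists to stand in for data blocks stored on separate HDFS nodes are all assumptions made for the sketch.

```python
from collections import Counter

# Key phrases from the example; matching is case-insensitive.
PHRASES = ("i love it", "i like it", "i hate it")


def map_partition(tweets):
    """Map step (hypothetical helper): count each key phrase within
    one partition and emit the result as (phrase, count) pairs."""
    counts = Counter()
    for tweet in tweets:
        text = tweet.lower()
        for phrase in PHRASES:
            if phrase in text:
                counts[phrase] += 1
    return list(counts.items())


def reduce_counts(mapped_outputs):
    """Reduce step (hypothetical helper): merge the per-partition
    pairs into one total count per phrase."""
    totals = Counter()
    for pairs in mapped_outputs:
        for phrase, count in pairs:
            totals[phrase] += count
    return dict(totals)


# Each inner list stands in for the tweets stored on one node (made-up data).
partitions = [
    ["I love it! Best song this year", "I hate it, sorry", "no opinion"],
    ["I LOVE IT", "i like it a lot", "I love it so much"],
]

# On a real cluster the map calls would run in parallel, one per node.
mapped = [map_partition(p) for p in partitions]
totals = reduce_counts(mapped)
print(totals)
```

With the made-up tweets above, the totals come to three for "i love it" and one each for "i like it" and "i hate it". On a real cluster the map step would run on the nodes that hold the data, and only the small per-node counts would travel over the network to the reduce step.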

Conclusion
Data is becoming the energy of the 21st century, a very important resource for every organization. The most important challenge will be to implement a sustainable, affordable, and stable solution that can draw insights and knowledge out of the mountains of data generated by humans and machines every second. Big Data solutions like Hadoop are being increasingly adopted by industries to exploit their structured and unstructured data, creating value through the cycle of data, information, insight, knowledge, and intelligence, and developing strategic capability as analytical competitors. Despite the challenges, more and more organizations are expected to join this bandwagon in the near future.


Reference: A Blog from Javra Software Nepal.