Unit - 3
Introduction to Big Data and the Hadoop Ecosystem
3.1 Introduction:
- Now we are living in the Big Data Era.
- A few years ago, systems, organizations, and applications dealt almost exclusively with structured data (data organized in rows and columns). It was easy to use relational databases (RDBMS) and traditional tools to store, manage, process, and report on this data.
- Recently, however, the nature of data has changed. Systems, organizations, and applications now generate huge amounts of data in a variety of formats at a very fast rate.
- Much of this data is no longer simple structured data (it is not in the form of rows and columns); it is raw data without any fixed format. It is very difficult, or even impossible, to use traditional relational databases and tools to store, manage, process, and report on this data.
- How can this problem be solved? This is where big data solutions come into the picture.
- Big data solutions are designed to handle exactly this kind of data.
- Big data is a collection of data that is huge in volume and keeps growing exponentially over time. Its size and complexity are so great that traditional data management tools cannot store or process it efficiently.
- Big data refers to the large, diverse sets of information that grow at ever-increasing rates. It encompasses the volume of information, the velocity or speed at which it is created and collected, and the variety or scope of the data points being covered (known as the "three v's" of big data). Big data often comes from data mining and arrives in multiple formats.
Key Takeaways:-
- Big data is a collection of data that is huge in volume and keeps growing exponentially over time. Its size and complexity are so great that traditional data management tools cannot store or process it efficiently.
3.2 Elements of Big Data
3.2.1 Culture:
The ability to make data-driven decisions and implement enduring change across the organization is arguably the most important benefit of using big data and advanced analytics. But in many instances, brilliant analytical insights cannot be transformed into business improvement because the organization is unable to incorporate them into process and behavior changes.
3.2.2 Expectations:-
The organization must have consistent expectations for its use of analytics to explain performance and facilitate decision-making. There must be a shared vision of the insights that big data and analytics will yield and how they will impact decision-making.
3.2.3 Process:-
The organization must have processes for analysis and reporting, as well as an engagement model for translating business problems into analytics efforts and managing those efforts so that they benefit the business.
3.2.4 Skills and tools:-
With the proliferation of systems and applications, organizations can easily acquire new technology before they are ready to use it. In addition to technology infrastructure, the organization must have employees with adequate skills to use the tools and interpret the results. In the world of big data, these skills are not likely to reside in the marketing organization; they may exist in pockets across the company or in a centralized analytics function.
3.2.5 Data:-
Big or small, structured or unstructured, data fuels analytics efforts, and it requires an underlying infrastructure to support it. Developing this infrastructure is often a significant challenge because advanced analysis requires internal data from repositories across marketing, sales, services, training, and finance, as well as external data from social media and other sources. IT and marketing must work together with the same set of priorities to ensure that the data infrastructure is in place.
The promise of big data and analytics, and of the insights they bring to light, is compelling. But it takes more than technology to transform insights into true business performance improvement. For an organization to benefit from the much-hyped big data, it must have realistic expectations, obtain the resources to make it work, and be prepared to implement the changes that the insights indicate are needed.
Key Takeaways:-
The five elements of big data are:
- Culture
- Expectations
- Process
- Skills and tools
- Data
3.3 Big Data Analytics
3.3.1 Definition:
Big data analytics is the often complex process of examining big data to uncover information -- such as hidden patterns, correlations, market trends, and customer preferences -- that can help organizations make informed business decisions.
On a broad scale, data analytics technologies and techniques provide a means to analyze data sets and draw new conclusions from them, which can help organizations make informed business decisions. By comparison, business intelligence (BI) queries answer more basic questions about business operations and performance.
Big data analytics is a form of advanced analytics, which involves complex applications with elements such as predictive models, statistical algorithms, and what-if analysis powered by analytics systems.
3.3.2 The importance of big data analytics:-
- Big data analytics, carried out through specialized systems and software, can lead to positive business outcomes such as:
- New revenue opportunities
- More effective marketing
- Better customer service
- Improved operational efficiency
- Competitive advantages over rivals
- Big data analytics applications allow data analysts, data scientists, predictive modelers, statisticians, and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data that are often left untapped by conventional BI and analytics programs.
- This includes a mix of semi-structured and unstructured data: for example, internet clickstream data, web server logs, social media content, text from customer emails and survey responses, mobile phone records, and machine data captured by sensors connected to the internet of things (IoT).
3.3.3 How big data analytics works:
- In some cases, Hadoop clusters and NoSQL systems are used primarily as landing pads and staging areas for data before it is loaded into a data warehouse or analytical database for analysis, usually in a summarized form that is more conducive to relational structures.
- More frequently, however, big data analytics users are adopting the concept of a Hadoop data lake that serves as the primary repository for incoming streams of raw data. In such architectures, data can be analyzed directly in a Hadoop cluster or run through a processing engine like Spark. As in data warehousing, sound data management is a crucial first step in the big data analytics process. Data stored in HDFS must be organized, configured, and partitioned properly to get good performance out of both extract, transform, and load (ETL) integration jobs and analytical queries.
- Once the data is ready, it can be analyzed with the software commonly used for advanced analytics processes (a minimal Spark sketch follows this list). That includes tools for:
- data mining, which sifts through data sets in search of patterns and relationships;
- predictive analytics, which builds models to forecast customer behavior and other future developments;
- machine learning, which taps algorithms to analyze large data sets; and
- deep learning, a more advanced offshoot of machine learning.
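As mentioned above, data in a Hadoop data lake is often analyzed through a processing engine like Spark. The following is a minimal sketch of that idea using Spark's Java API; the HDFS path and the column names are hypothetical placeholders, and Spark is assumed to be installed alongside the Hadoop cluster.

```java
// A minimal sketch (hypothetical paths and columns) of analyzing raw data in a
// Hadoop data lake with Spark's Java API.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ClickstreamSummary {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ClickstreamSummary")
                .getOrCreate();

        // Read raw clickstream records directly from the data lake (HDFS).
        Dataset<Row> clicks = spark.read().json("hdfs:///datalake/raw/clickstream/");

        // A simple aggregation: count events per page, a typical first step
        // before feeding results into predictive models or BI dashboards.
        clicks.groupBy("page").count().orderBy("count").show(20);

        spark.stop();
    }
}
```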
3.3.4 Big data analytics uses and challenges:-
- Big data analytics applications often include data from both internal systems and external sources, such as weather data or demographic data on consumers compiled by third-party information services providers.
- In addition, streaming analytics applications are becoming common in big data environments as users look to perform real-time analytics on data fed into Hadoop systems through stream processing engines such as Spark, Flink, and Storm.
- Early big data systems were mostly deployed on-premises, particularly in large organizations that collected, organized, and analyzed massive amounts of data.
- But cloud platform vendors, such as Amazon Web Services (AWS) and Microsoft, have made it easier to set up and manage Hadoop clusters in the cloud.
- The same goes for Hadoop suppliers such as Cloudera and Hortonworks (now merged), which support deploying the big data framework on the AWS and Microsoft Azure clouds.
Key Takeaways:-
- Big data analytics is the often complex process of examining big data to uncover information -- such as hidden patterns, correlations, market trends, and customer preferences -- that can help organizations make informed business decisions
- Big data analytics applications allow data analysts, data scientists, predictive modelers, statisticians, and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data that are often left untapped by conventional BI and analytics programs
- Big data analytics applications often include data from both internal systems and external sources, such as weather data or demographic data on consumers compiled by third-party information services providers.
3.4 How Big Data Works
- The first step in the process is getting the data. We need to ingest big data and then store it in data stores (SQL or NoSQL).
- Once data has been ingested, after noise reduction and cleansing, big data is stored for processing.
- There are two types of data processing: MapReduce (batch) and real-time processing.
- Scripting languages are needed to access data or to start the processing of data. After processing, the data can be used in various fields. It may be used for analysis, machine learning, and can be presented in graphs and charts.
- Earlier approach: when this problem first arose, Google tried to solve it by introducing the Google File System (GFS) and the MapReduce processing model.
- These are based on a distributed file system and parallel processing, and the framework was very successful. Hadoop is an open-source implementation of the MapReduce framework.
Key Takeaways:-
- The first step in the process is getting the data. We need to ingest big data and then store it in data stores (SQL or NoSQL).
- Once data has been ingested, after noise reduction and cleansing, big data is stored for processing.
- There are two types of data processing: MapReduce (batch) and real-time processing.
3.5 Virtualization and Big Data
Virtualization has three characteristics that support the scalability and operating efficiency required for big data environments:
- Partitioning: In virtualization, many applications and operating systems are supported in a single physical system by partitioning the available resources.
- Isolation: Each virtual machine is isolated from its host physical system and from other virtualized machines. Because of this isolation, if one virtual instance crashes, the other virtual machines and the host system are not affected. In addition, data is not shared between one virtual instance and another.
- Encapsulation: A virtual machine can be represented as a single file, so you can identify it easily based on the services it provides.
3.5.1 Big data server virtualization:-
- In server virtualization, one physical server is partitioned into multiple virtual servers.
- The hardware and resources of a machine, including the random access memory (RAM), CPU, hard drive, and network controller, can be virtualized into a series of virtual machines, each of which runs its own applications and operating system.
- A virtual machine (VM) is a software representation of a physical machine that can execute or perform the same functions as the physical machine.
- A thin layer of software containing a virtual machine monitor, or hypervisor, is inserted onto the hardware.
3.5.2 Big data application virtualization:-
- Application infrastructure virtualization provides an efficient way to manage applications in context with customer demand.
- The application is encapsulated in a way that removes its dependencies from the underlying physical computer system. This helps to improve the overall manageability and portability of the application.
- In addition, the application infrastructure virtualization software typically allows for codifying business and technical usage policies to make sure that each of your applications predictably leverages virtual and physical resources.
- Efficiencies are gained because you can more easily distribute IT resources according to the relative business value of your applications.
- Application infrastructure virtualization used in combination with server virtualization can help to ensure that business service-level agreements are met. Server virtualization monitors CPU and memory usage but does not account for variations in business priority when allocating resources.
3.5.3 Big data network virtualization:-
- Network virtualization provides an efficient way to use networking as a pool of connection resources. Instead of relying on the physical network for managing traffic, you can create multiple virtual networks all utilizing the same physical implementation.
- This can be useful if you need to define a network for data gathering with a certain set of performance characteristics and capacity and another network for applications with different performance and capacity.
- Virtualizing the network helps reduce network bottlenecks and improves the capability to manage the large distributed data required for big data analysis.
3.5.4 Big data processor and memory virtualization:-
- Processor virtualization helps to optimize the processor and maximize performance. Memory virtualization decouples memory from the servers.
- In big data analysis, you may have repeated queries of large data sets and the creation of advanced analytic algorithms, all designed to look for patterns and trends that are not yet understood.
- These advanced analytics can require lots of processing power (CPU) and memory (RAM). Without sufficient CPU and memory resources, some of these computations can take a long time.
3.5.5 Big data and storage virtualization:-
- Data virtualization can be used to create a platform for dynamically linked data services. This allows data to be easily searched and linked through a unified reference source.
- As a result, data virtualization provides an abstract service that delivers data in a consistent form regardless of the underlying physical database. In addition, data virtualization exposes cached data to all applications to improve performance.
- Storage virtualization combines physical storage resources so that they are more effectively shared. This reduces the cost of storage and makes it easier to manage data stores required for big data analysis.
Key Takeaways:-
- Virtualization has three characteristics that support the scalability and operating efficiency required for big data environments: partitioning, isolation, and encapsulation.
- In server virtualization, one physical server is partitioned into multiple virtual servers.
- The hardware and resources of a machine, including the random access memory (RAM), CPU, hard drive, and network controller, can be virtualized into a series of virtual machines, each of which runs its own applications and operating system.
- The application is encapsulated in a way that removes its dependencies from the underlying physical computer system. This helps to improve the overall manageability and portability of the application.
- Network virtualization provides an efficient way to use networking as a pool of connection resources. Instead of relying on the physical network for managing traffic, you can create multiple virtual networks all utilizing the same physical implementation.
3.6 Virtualization Approaches
- While virtualization has been a part of the IT landscape for decades, it was only in 1998 that VMware delivered the benefits of virtualization to industry-standard x86-based platforms, which now form the majority of desktop, laptop, and server shipments.
- A key benefit of virtualization is the ability to run multiple operating systems on a single physical system and share the underlying hardware resources – known as partitioning.
- Today, virtualization can apply to a range of system layers, including hardware-level virtualization, operating system-level virtualization, and high-level language virtual machines.
- Hardware-level virtualization was pioneered on IBM mainframes in the 1970s, and then more recently Unix/RISC system vendors began with hardware-based partitioning capabilities before moving on to software-based partitioning.
- For Unix/RISC and industry-standard x86 systems, the two approaches typically used with software-based partitioning are hosted and hypervisor architectures.
- A hosted approach provides partitioning services on top of a standard operating system and supports the broadest range of hardware configurations.
- In contrast, a hypervisor architecture is the first layer of software installed on a clean x86-based system (hence it is often referred to as a “bare-metal” approach). Since it has direct access to the hardware resources, a hypervisor is more efficient than hosted architectures, enabling greater scalability.
Figure 1
Key Takeaways:-
- While virtualization has been a part of the IT landscape for decades, it was only in 1998 that VMware delivered the benefits of virtualization to industry-standard x86-based platforms, which now form the majority of desktop, laptop, and server shipments.
- A key benefit of virtualization is the ability to run multiple operating systems on a single physical system and share the underlying hardware resources – known as partitioning.
3.7 Hadoop Ecosystem
- Hadoop Ecosystem is a platform or a suite that provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions.
- There are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of the tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as absorption, analysis, storage, and maintenance of data, etc.
- Following are the components that collectively form a Hadoop ecosystem:
- HDFS: Hadoop Distributed File System
- YARN: Yet Another Resource Negotiator
- MapReduce: Programming based Data Processing
- Spark: In-Memory data processing
- PIG, HIVE: Query-based processing of data services
- HBase: NoSQL Database
- Mahout, Spark MLLib: Machine Learning algorithm libraries
- Solr, Lucene: Searching and Indexing
- Zookeeper: Managing cluster
- Oozie: Job Scheduling
Key Takeaways:-
- Hadoop Ecosystem is a platform or a suite that provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions.
- There are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop Common.
3.8 HDFS (Hadoop Distributed File System)
- HDFS is a storage layer for Hadoop.
- HDFS is suitable for distributed storage and processing, that is, while the data is being stored, it first gets distributed, and then it is processed.
- HDFS provides Streaming access to file system data.
- HDFS provides file permission and authentication.
- HDFS provides a command-line interface for interacting with Hadoop.
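As a minimal sketch, the program below performs the same kind of file operations through Hadoop's Java FileSystem API rather than the command line; the file paths are hypothetical, and the equivalent hdfs dfs commands are noted in the comments.

```java
// A minimal sketch of interacting with HDFS through Hadoop's Java FileSystem
// API; the paths below are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Equivalent to: hdfs dfs -put /tmp/sales.csv /data/raw/sales.csv
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/raw/sales.csv"));

        // Equivalent to: hdfs dfs -ls /data/raw
        for (FileStatus status : fs.listStatus(new Path("/data/raw"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}
```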
Key Takeaways:-
- HDFS is a storage layer for Hadoop.
- HDFS is suitable for distributed storage and processing, that is, while the data is being stored, it first gets distributed, and then it is processed.
3.9 MapReduce
- Hadoop MapReduce is the framework that processes data in Hadoop.
- It is the original Hadoop processing engine, which is primarily Java-based.
- It is based on the map and reduce programming model.
- Many tools such as Hive and Pig are built on a map-reduce model.
- It has an extensive and mature fault tolerance built into the framework.
- It is still very commonly used but losing ground to Spark.
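The classic word-count job below is a minimal, self-contained illustration of the map and reduce model using Hadoop's Java MapReduce API; the input and output HDFS paths are passed on the command line.

```java
// A minimal sketch of the map and reduce programming model: word count.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```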
Key Takeaways:-
- Hadoop MapReduce is the framework that processes data in Hadoop.
- It is based on the map and reduce programming model.
3.10 YARN
- Yet Another Resource Negotiator (YARN), as the name implies, helps to manage resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
- YARN consists of three major components:
- Resource Manager
- Node Manager
- Application Manager
- The Resource Manager allocates resources to the applications in the system, while Node Managers manage resources such as CPU, memory, and bandwidth on each machine and report back to the Resource Manager.
- The Application Manager acts as an interface between the Resource Manager and the Node Managers and performs negotiations between the two as required.
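As a minimal sketch, assuming a configured cluster whose settings are on the classpath, the program below uses the YarnClient API to ask the Resource Manager for its registered Node Managers and the applications it is tracking.

```java
// A minimal sketch (assuming a configured cluster) of querying the YARN
// Resource Manager through the YarnClient API.
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnInfo {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Node Managers currently registered with the Resource Manager.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + "  capacity: " + node.getCapability());
        }

        // Applications the Resource Manager knows about.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + "  " + app.getName());
        }

        yarn.stop();
    }
}
```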
Key Takeaways:-
- Yet Another Resource Negotiator (YARN), as the name implies, helps to manage resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
3.11 HBase
- HBase is a NoSQL, or non-relational, database.
- HBase is important and is mainly used when you need random, real-time read or write access to your Big Data.
- It provides support to a high volume of data and high throughput.
- In HBase, a table can have thousands of columns.
3.11.1 Storage mechanism in HBase:
HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column families, which are the key-value pairs. A table has multiple column families and each column family can have any number of columns. Subsequent column values are stored contiguously on the disk. Each cell value of the table has a timestamp. In short, in HBase:
- The table is a collection of rows.
- The row is a collection of column families.
- The Column family is a collection of columns.
- The column is a collection of key-value pairs.
Figure 2
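To make the row / column-family / column model concrete, here is a minimal sketch using the HBase Java client API; the 'users' table and its 'profile' column family are hypothetical and assumed to already exist.

```java
// A minimal sketch of HBase's storage model using the HBase Java client API.
// The table, column family, and qualifiers are hypothetical placeholders.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Row key "user42"; column family "profile"; columns "name" and "city".
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Random, real-time read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            String name = Bytes.toString(result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name")));
            System.out.println("name = " + name);
        }
    }
}
```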
3.11.2 Features of HBase:
- HBase is linearly scalable.
- It has automatic failure support.
- It provides consistent reads and writes.
- It integrates with Hadoop, both as a source and a destination.
- It has an easy Java API for clients.
- It provides data replication across clusters.
3.11.3 Applications of HBase:
- It is used whenever there is a need for write-heavy applications.
- HBase is used whenever we need to provide fast random access to available data.
- Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Key Takeaways:-
- HBase is a NoSQL database or non-relational database.
- HBase is important and is mainly used when you need random, real-time read or write access to your Big Data.
- HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column families, which are the key-value pairs
3.12 Hive
- Hive executes queries using MapReduce; however, a user need not write any low-level MapReduce code.
- Hive is suitable for structured data. After the data is analyzed, it is ready for the users to access.
- Now that we know what Hive does, let us discuss what supports the search of data. Data search is done using Cloudera Search.
3.12.1 Operators in Hive:
There are four types of operators in Hive (a small query sketch follows this list):
- Relational Operators
- Arithmetic Operators
- Logical Operators
- Complex Operators
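These operators typically appear inside HiveQL queries. As a minimal sketch (the table, columns, and connection URL are hypothetical, and HiveServer2 is assumed to be running locally on its default port), the Java program below submits a query that uses relational (>), arithmetic (*), and logical (AND) operators through Hive's JDBC driver; Hive compiles the query into MapReduce jobs behind the scenes.

```java
// A minimal sketch of running a HiveQL query via Hive's JDBC driver.
// Table, columns, and the connection URL are hypothetical placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (hive-jdbc must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT customer_id, quantity * unit_price AS total " +
                     "FROM sales WHERE quantity > 10 AND region = 'WEST'")) {
            while (rs.next()) {
                System.out.println(rs.getString("customer_id") + "  " + rs.getDouble("total"));
            }
        }
    }
}
```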
Key Takeaways:-
- Hive executes queries using MapReduce; however, a user need not write any low-level MapReduce code.
- Hive is suitable for structured data. After the data is analyzed, it is ready for the users to access.
3.13 Pig and Pig Latin
- Pig converts its scripts to Map and Reduce code, thereby saving the user from writing complex MapReduce programs.
- Ad-hoc queries like Filter and Join, which are difficult to perform in MapReduce, can be easily done using Pig.
- You can also use Impala to analyze data.
- It is an open-source high-performance SQL engine, which runs on the Hadoop cluster.
- It is ideal for interactive analysis and has very low latency which can be measured in milliseconds.
3.13.1 Pig Execution Environment:
- The Pig execution environment has two modes:
- Local mode: All scripts are run on a single machine. Hadoop MapReduce and HDFS are not required.
- Hadoop: Also called MapReduce mode, all scripts are run on a given Hadoop cluster.
- Under the covers, Pig creates a set of map and reduce jobs. The user is absolved from the concerns of writing low-level MapReduce code, compiling, packaging, and submitting jobs, and retrieving the results. In many respects, Pig is analogous to SQL in the RDBMS world.
3.13.2 Pig execution:
- Pig programs can be run in three different ways, all of them compatible with local and Hadoop mode:
- Script: Simply a file containing Pig Latin commands, identified by the .pig suffix (for example, file.pig or myscript.pig). The commands are interpreted by Pig and executed in sequential order.
- Grunt: Grunt is a command interpreter. You can type Pig Latin on the grunt command line and Grunt will execute the command on your behalf. This is very useful for prototyping and “what if” scenarios.
- Embedded: Pig programs can be executed as part of a Java program.
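The following is a minimal sketch of the embedded mode mentioned above: a Java program registers a few Pig Latin statements through PigServer and lets Pig compile them into map and reduce jobs. The input file, schema, and output path are hypothetical, and local mode is used so no Hadoop cluster is required.

```java
// A minimal sketch of embedding Pig Latin in a Java program via PigServer.
// File paths and the record schema are hypothetical placeholders.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);   // or ExecType.MAPREDUCE for a cluster

        // Load, filter, group -- the ad-hoc operations that are tedious in raw MapReduce.
        pig.registerQuery("logs = LOAD 'input/access_log.txt' USING PigStorage(' ') "
                + "AS (ip:chararray, url:chararray, status:int);");
        pig.registerQuery("errors = FILTER logs BY status >= 500;");
        pig.registerQuery("by_url = GROUP errors BY url;");
        pig.registerQuery("counts = FOREACH by_url GENERATE group AS url, COUNT(errors) AS hits;");

        // Under the covers, Pig compiles these statements into map and reduce jobs.
        pig.store("counts", "output/error_counts");
        pig.shutdown();
    }
}
```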
3.13.3 Pig Latin supported operations:
Pig Latin has a very rich syntax. It supports operators for the following operations:
- Loading and storing of data
- Streaming data
- Filtering data
- Grouping and joining data
- Sorting data
- Combining and splitting data
Pig Latin also supports a wide variety of types, expressions, functions, diagnostic operators, macros, and file system commands.
Key Takeaways:-
- Pig converts its scripts to Map and Reduce code, thereby saving the user from writing complex MapReduce programs.
- Ad-hoc queries like Filter and Join, which are difficult to perform in MapReduce, can be easily done using Pig.
- The Pig execution environment has two modes
- Pig programs can be run in three different ways, all of them compatible with local and Hadoop mode:
3.14 Sqoop
- Sqoop is a tool designed to transfer data between Hadoop and relational database servers.
- It is used to import data from relational databases (such as Oracle and MySQL) to HDFS and export data from HDFS to relational databases.
- If you want to ingest event data such as streaming data, sensor data, or log files, then you can use Flume.
3.14.1 Key Features of Sqoop:
- Four key features are found in Sqoop:
- Bulk import: Sqoop can import individual tables or entire databases into HDFS. The data is stored in the native directories and files in the HDFS file system.
- Direct input: Sqoop can import and map SQL (relational) databases directly into Hive and HBase.
- Data interaction: Sqoop can generate Java classes so that you can interact with the data programmatically.
- Data export: Sqoop can export data directly from HDFS into a relational database using a target table definition based on the specifics of the target database.
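Sqoop is normally driven from the command line. As a rough sketch of the bulk-import feature above, the Java program below simply launches the sqoop CLI (assumed to be on the PATH); the JDBC URL, credentials, table name, and HDFS target directory are hypothetical placeholders.

```java
// A minimal sketch: launching a Sqoop bulk import from Java via the sqoop CLI.
// All connection details below are hypothetical placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class SqoopImportExample {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost:3306/sales",
                "--username", "etl_user",
                "--password", "secret",
                "--table", "orders",              // relational table to import
                "--target-dir", "/data/orders",   // destination directory in HDFS
                "--num-mappers", "4");            // parallel map tasks
        pb.redirectErrorStream(true);

        Process proc = pb.start();
        try (BufferedReader out = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line);
            }
        }
        System.exit(proc.waitFor());
    }
}
```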
Key Takeaways:-
- Sqoop is a tool designed to transfer data between Hadoop and relational database servers.
- It is used to import data from relational databases (such as Oracle and MySQL) to HDFS and export data from HDFS to relational databases.
- Four key features are found in Sqoop
3.15 Zookeeper
- Coordinating and synchronizing the resources and components of Hadoop was a major challenge and often resulted in inconsistency.
- Zookeeper overcame these problems by providing synchronization, inter-component communication, grouping, and maintenance services.
Figure 3
3.15.1 Benefits of Zookeeper:
Zookeeper provides a very simple interface and services. Zookeeper brings these key benefits:
- Fast. Zookeeper is especially fast with workloads where reads to the data are more common than writes. The ideal read/write ratio is about 10:1.
- Reliable. Zookeeper is replicated over a set of hosts (called an ensemble) and the servers are aware of each other. As long as a critical mass of servers is available, the Zookeeper service will also be available. There is no single point of failure.
- Simple. Zookeeper maintains a standard hierarchical namespace, similar to files and directories.
- Ordered. The service maintains a record of all transactions, which can be used for higher-level abstractions, like synchronization primitives.
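As a minimal sketch of that hierarchical namespace, the program below uses the ZooKeeper Java client to create a znode and read it back; the znode path and value are made up, the paths are assumed not to exist yet, and a server is assumed to be running on localhost:2181.

```java
// A minimal sketch of ZooKeeper's file-system-like namespace via the Java client.
// The znode path and data are hypothetical placeholders.
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble (here a single local server) and wait for the session.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create znodes, much like creating directories and files in a tree.
        if (zk.exists("/config", false) == null) {
            zk.create("/config", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        String path = "/config/feature-flag";
        zk.create(path, "enabled".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Reads like this are the cheap, common case for ZooKeeper workloads.
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " = " + new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```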
Key Takeaways:-
- Coordinating and synchronizing the resources and components of Hadoop was a major challenge and often resulted in inconsistency.
- Zookeeper provides a very simple interface and services, and brings key benefits: it is fast, reliable, simple, and ordered.
3.16 Flume
- A company may have millions of services running on multiple servers, and these services produce a large volume of logs.
- To gain insights and understand customer behavior, the company needs to analyze these logs together. To process logs, a company requires an extensible, scalable, and reliable distributed data collection service.
- That service must be capable of carrying the flow of unstructured data, such as logs, from the sources to the system where they will be processed (such as the Hadoop Distributed File System).
- Flume is an open-source distributed data collection service used for transferring data from source to destination.
3.16.1 Some Important features of Flume:
- Flume has a flexible design based on streaming data flows. It is fault-tolerant and robust, with multiple failover and recovery mechanisms. Flume offers different levels of reliability, including 'best-effort delivery' and 'end-to-end delivery'.
- Best-effort delivery does not tolerate any Flume node failure whereas 'end-to-end delivery' mode guarantees delivery even in the event of multiple node failures.
- Flume carries data between sources and sinks. This gathering of data can either be scheduled or event-driven. Flume has its query processing engine which makes it easy to transform each new batch of data before it is moved to the intended sink.
- Possible Flume sinks include HDFS and HBase. Flume Hadoop can also be used to transport event data including but not limited to network traffic data, data generated by social media websites, and email messages.
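As a minimal sketch, the program below hands a single event to a Flume agent using Flume's Java RPC client API; the agent is assumed to be running an Avro source on localhost:41414, and the log line is a made-up placeholder.

```java
// A minimal sketch of sending one event to a Flume agent via the RPC client API.
// Host, port, and the event body are hypothetical placeholders.
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws Exception {
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // Each log line becomes a Flume event flowing from source to sink
            // (for example, an HDFS or HBase sink).
            Event event = EventBuilder.withBody("user=42 action=login status=ok",
                    StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}
```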
Key Takeaways:-
- A company may have millions of services running on multiple servers, and these services produce a large volume of logs.
- Flume has a flexible design based upon streaming data flows
- Best-effort delivery does not tolerate any Flume node failure whereas 'end-to-end delivery' mode guarantees delivery even in the event of multiple node failures
- Flume carries data between sources and sinks
3.17 Oozie
- Oozie simply performs the task of a scheduler: it schedules jobs and binds them together as a single unit.
- There are two kinds of Oozie jobs: Oozie workflow jobs and Oozie coordinator jobs.
- Oozie workflow jobs are executed in a sequentially ordered manner, whereas Oozie coordinator jobs are triggered when data or an external stimulus becomes available.
3.17.1 Working of Oozie:
- Oozie runs as a service in the cluster and clients submit workflow definitions for immediate or later processing.
- Oozie workflow consists of action nodes and control-flow nodes.
- An action node represents a workflow task, e.g., moving files into HDFS, running a MapReduce, Pig, or Hive job, importing data using Sqoop, or running a shell script or a program written in Java.
- A control-flow node controls the workflow execution between actions by allowing constructs like conditional logic wherein different branches may be followed depending on the result of an earlier action node.
- Start Node, End Node, and Error Node fall under this category of nodes.
- The Start node designates the start of the workflow job.
- The End node signals the end of the job.
- The Error node designates the occurrence of an error and the corresponding error message to be printed.
- At the end of the execution of a workflow, an HTTP callback is used by Oozie to update the client with the workflow status. Entry-to or exit from an action node may also trigger the callback.
Figure 4
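As a minimal sketch of how a client submits a workflow to the Oozie service, the program below uses the Oozie Java client; the Oozie server URL, the HDFS application path holding workflow.xml, and the extra properties are hypothetical placeholders.

```java
// A minimal sketch of submitting a workflow job through the Oozie Java client.
// Server URL, HDFS paths, and property values are hypothetical placeholders.
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties; APP_PATH points to the HDFS directory holding workflow.xml.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/workflows/daily-load");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("queueName", "default");

        // Submit and start the workflow; Oozie returns a job id immediately.
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow job " + jobId);

        // Poll until the workflow leaves the RUNNING state.
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Final status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```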
Key Takeaways:-
- Oozie simply performs the task of a scheduler, thus scheduling jobs and binding them together as a single unit.
- There are two kinds of Oozie jobs: Oozie workflow jobs and Oozie coordinator jobs.