Unit - 3
Big Data Analytics Life Cycle
Q1) Introduce Big data?
A1) Big Data, as the name implies, refers to data sets that are both huge in volume and complex. Traditional data processing software cannot handle Big Data because of its vast volume and growing complexity. Simply put, Big Data refers to datasets that contain vast amounts of both structured and unstructured data.
Companies can use Big Data Analytics to address the problems they encounter in their business and fix them efficiently. They aim to find patterns in this sea of data and extract insights that can be applied to the problem(s) at hand.
Although businesses have been gathering massive amounts of data for decades, the term "Big Data" only became popular in the early to mid-2000s. Corporations recognised the vast amount of data produced daily and the need to utilise it successfully.
Big Data is a massive collection of data that continues to grow dramatically over time. It is a data set that is so huge and complicated that no typical data management technologies can effectively store or process it. Big data is similar to regular data, but it is much larger.
Q2) What are the sources of big data?
A2) Big data is mostly derived from three sources: social data, machine data, and transactional data. Furthermore, businesses must distinguish between data generated internally, that is, data that resides behind a company's firewall, and data generated outside that must be imported into a system.
It's also vital to consider if the data is unstructured or structured. Because unstructured data lacks a pre-defined data model, it necessitates additional resources to comprehend.
The three primary sources of Big Data
Social data
Likes, Tweets & Retweets, Comments, Video Uploads, and general media are all sources of social data on the world's most popular social media platforms. This type of data may be quite useful in marketing analytics because it provides essential insights into consumer behaviour and sentiment. Another good source of social data is the public web, and tools like Google Trends can help enhance the volume of big data.
Machine data
Machine data includes information generated by industrial machinery, sensors installed in machinery, and even web logs that track user behaviour. As the Internet of Things becomes more prevalent and spreads across the world, this type of data is predicted to grow rapidly. In the not-too-distant future, sources such as medical devices, smart meters, road cameras, satellites, games, and the rapidly expanding Internet of Things will deliver data of high velocity, value, volume, and variety.
Transactional data
Transactional data is derived from all of the online and offline transactions that occur daily. Invoices, payment orders, storage records, and delivery receipts are all considered transactional data. On its own, however, this data is nearly worthless, and most businesses struggle to make sense of the data they generate and to use it effectively.
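As a hedged illustration of that point, the following Python sketch aggregates raw transactional records (the column names and values are hypothetical) into per-customer figures, which is the kind of step that turns raw transactions into something actionable:

```python
# A minimal sketch of turning raw transactional records into insight.
# The column names and values below are hypothetical illustrations.
import pandas as pd

transactions = pd.DataFrame({
    "invoice_id": [101, 102, 103, 104, 105],
    "customer":   ["A", "B", "A", "C", "B"],
    "amount":     [250.0, 120.5, 75.0, 310.0, 42.0],
})

# Individual invoices say little on their own; aggregating them
# reveals per-customer revenue, which is actionable.
revenue_per_customer = (
    transactions.groupby("customer")["amount"]
    .agg(total="sum", orders="count")
    .sort_values("total", ascending=False)
)
print(revenue_per_customer)
```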
Other sources
Media as a Big data source
The most common source of big data is the media, which offers significant insights into consumer preferences and evolving trends. Because it is self-broadcast and crosses all physical and demographic borders, it is the quickest way for businesses to acquire an in-depth overview of their target audience, identify patterns, draw conclusions, and improve their decision-making. Social media and interactive platforms such as Google, Facebook, Twitter, YouTube, and Instagram, as well as generic media such as photographs, videos, audio, and podcasts, provide quantitative and qualitative insights into all aspects of user engagement.
Cloud as a big data source
Companies have migrated their data to the cloud to get ahead of traditional data sources. Cloud storage handles both structured and unstructured data and gives businesses real-time data and on-demand insights. Cloud computing's key features are its scalability and adaptability. The cloud is an efficient and cost-effective data source because big data can be stored and accessed on public or private clouds via networks and computers.
Databases as a big data source
Businesses nowadays prefer to collect relevant big data by combining traditional and digital databases. This combination paves the way for a hybrid data model while requiring minimal capital and IT infrastructure. These databases are also used for a variety of business intelligence purposes, and the information extracted from them can be used to boost business earnings. MS Access, DB2, Oracle, SQL Server, and Amazon SimpleDB are just a few examples of popular databases.
Extracting and interpreting data from a large number of big data sources is a time-consuming and difficult task. These issues can be avoided if companies take into account all of the important aspects of big data, assess relevant data sources, and deploy them in a way that is well aligned with their objectives.
Q3) Introduce data analytics life cycle?
A3) The data analytics lifecycle was created for Big Data problems and data science projects. The cycle is iterative, to reflect a real project. A step-by-step methodology is needed to organise the activities and processes involved in gathering, processing, analysing, and repurposing data to meet the particular requirements of a Big Data analysis.
The Data Analytics Lifecycle is a six-stage cyclic process that illustrates how data is created, collected, processed, implemented, and analysed for various goals.
Fig 1: Data analytics life cycle
Q4) What is phase1?
A4) Phase 1 - Discovery
This is the first step in defining your project's goals and determining how to complete the data analytics lifecycle. Begin by identifying your business area and ensuring that you have sufficient resources (time, technology, data, and people) to meet your objectives.
The most difficult aspect of this step is gathering enough data. You'll need to create an analysis plan, which will take some time and effort.
Accumulate resources
To begin, you must examine the models you wish to create. Then figure out how much domain knowledge you'll need to complete those models.
The next step is to determine whether you have the necessary skills and resources to complete your projects.
Frame the issue
Problems most often arise from failing to meet your client's expectations. You must therefore identify the project's challenges and explain them to your clients. This is referred to as "framing". You must write a problem statement that explains the current situation as well as potential future issues, and you must identify the project's goal along with its success and failure criteria.
Formulate initial hypothesis
After you've gathered all of the client's needs, you'll need to construct early hypotheses based on the data you've gathered.
Q5) What do you mean by phase 2 - data preparations?
A5) Phase 2 - Data preparations
Before moving on to the model building process, the data preparation and processing phase involves gathering, processing, and conditioning data.
Identify data sources
You must identify numerous data sources and assess how much and what type of data you can get in a given amount of time. Evaluate the data structures, investigate their attributes, and gather all of the necessary tools.
Collection of data
There are three ways to collect data (a minimal sketch of signal reception follows the list):
● Data collection: You can obtain existing data from a variety of external sources.
● Data entry: You can prepare data points manually or with digital technology.
● Signal reception: You can accumulate data from digital devices, such as IoT devices and control systems.
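As an illustration of signal reception, here is a minimal Python sketch that accumulates readings from a simulated IoT sensor; the device id and record schema are hypothetical, and a real system would read from a sensor gateway or message broker instead:

```python
# A minimal sketch of "signal reception": accumulating readings
# from a simulated IoT device.
import json
import random
from datetime import datetime, timezone

def read_sensor():
    """Simulate one reading from a hypothetical temperature sensor."""
    return {
        "device": "sensor-42",  # hypothetical device id
        "ts": datetime.now(timezone.utc).isoformat(),
        "temp_c": round(random.uniform(18.0, 26.0), 2),
    }

# Append each reading as one JSON line, a common raw-collection format.
with open("readings.jsonl", "a") as f:
    for _ in range(5):
        f.write(json.dumps(read_sensor()) + "\n")
```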
Q6) Write about phase 3?
A6) Phase 3 - model planning
This is the stage in which you must assess the quality of the data and select a model that is appropriate for your project.
Loading Data in Analytics Sandbox
A data lake design includes an analytics sandbox that allows you to store and handle enormous amounts of data. It can handle a wide range of data types, including big data, transactional data, social media data, web data, and so on. It's a setting that lets your analysts schedule and process data assets using the data tools of their choosing. The adaptability of the analytics sandbox is its best feature. Analysts can process data in real time and obtain critical information in a short amount of time.
There are three ways to load data into the sandbox (a minimal ETL sketch follows the list):
● ETL - Before loading the data into the sandbox, the ETL Team professionals ensure that it complies with the business standards.
● ELT - Data is fed into the sandbox and then transformed according to business standards.
● ETLT - ETL and ELT are both part of ETLT, which consists of two levels of data transformation.
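To make the ETL route concrete, here is a minimal Python sketch, assuming a CSV extract and a SQLite database standing in for the sandbox; the table name, columns, and business rules are illustrative only:

```python
# A minimal ETL sketch: extract from a CSV, transform to meet
# (illustrative) business standards, load into a SQLite "sandbox".
import io
import sqlite3
import pandas as pd

raw_csv = io.StringIO("order_id,amount,country\n1,100,us\n2,,uk\n3,55,US\n")

# Extract: pull the raw data from its source.
df = pd.read_csv(raw_csv)

# Transform: enforce simple business standards before loading.
df["country"] = df["country"].str.upper()   # standardise country codes
df = df.dropna(subset=["amount"])           # drop incomplete orders

# Load: write the conformed data into the analytics sandbox.
conn = sqlite3.connect("sandbox.db")
df.to_sql("orders", conn, if_exists="replace", index=False)
conn.close()
```

Under ELT, the same steps simply run in the other order: the raw rows are loaded into the sandbox first, and the transformations are applied there.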
The data you have collected may contain unnecessary attributes or null values, or arrive in a form that is too difficult to model directly. This is where data exploration can help you uncover hidden trends in the data.
Steps involved in data exploration (a pandas sketch of these steps follows the list):
● Data identification
● Univariate Analysis
● Multivariate Analysis
● Filling Null values
● Feature engineering
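A minimal pandas sketch of these exploration steps, on illustrative data (the columns and the derived feature are hypothetical):

```python
# Data exploration steps on a tiny illustrative dataset.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, None, 41, 29],
    "income": [30000, 52000, 47000, None, 39000],
})

print(df.dtypes)             # data identification: types and structure
print(df["age"].describe())  # univariate analysis: one variable at a time
print(df.corr())             # multivariate analysis: relationships
df = df.fillna(df.median(numeric_only=True))          # filling null values
df["income_per_year_of_age"] = df["income"] / df["age"]  # feature engineering
print(df)
```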
Data analysts frequently use regression approaches, decision trees, neural networks, and other techniques for model planning. R, PL/R, WEKA, Octave, Statistica, and MATLAB are some of the most commonly used tools for model preparation and execution.
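As a hedged sketch of model planning, the following Python code (using scikit-learn as a stand-in for the tools listed above) fits two of the candidate techniques, a regression and a decision tree, on synthetic data:

```python
# Trying two candidate modelling techniques on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1.0, size=200)  # synthetic target

for model in (LinearRegression(), DecisionTreeRegressor(max_depth=3)):
    model.fit(X, y)
    print(type(model).__name__, "R^2 =", round(model.score(X, y), 3))
```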
Q7) Define model building?
A7) Phase 4 - model building
Model building is the process of deploying the proposed model in a real-time environment. It gives analysts in-depth analytical knowledge that helps consolidate their decision-making process. This can be a time-consuming procedure, because you must continually add new features as your clients request them.
Here, your goal is to predict corporate decisions, customise market strategies, and develop custom-tailored customer interests. This is accomplished by incorporating the model into your current production domain.
In certain circumstances a single model perfectly matches the business objectives and data, while in others multiple attempts are required. As you begin to explore the data, you will need to run certain algorithms and compare the results against your goals. In some circumstances you may even need to run several variants of a model at the same time until you get the required results.
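A minimal sketch of that trial-and-error loop, assuming cross-validated R² as the success criterion and 0.9 as the required threshold (both are illustrative assumptions, not fixed rules):

```python
# Running several model variants until one meets the goal.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() * 5 + rng.normal(0, 0.5, size=300)

for depth in (2, 4, 6, 8):                      # candidate variants
    model = DecisionTreeRegressor(max_depth=depth)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"depth={depth}: mean R^2={score:.3f}")
    if score >= 0.9:                            # required result reached
        print("variant accepted")
        break
```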
Q8) What do you mean by communication results?
A8) Phase 5 - communication result
This is the stage in which you must present the results of your data analysis to your clients. It necessitates a number of sophisticated processes in which you must deliver information to them in a clear and concise manner. Your clients don't have enough time to figure out which information is crucial. As a result, you must perform flawlessly in order to capture your clients' attention.
Check the data accuracy
Is the information provided by the data accurate? If not, you will need to run additional operations to fix the problem. You must make certain that the data you process is consistent; this will help you build a persuasive case when describing your findings.
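A minimal consistency-check sketch in Python; the rules applied here (no nulls, non-negative values, no duplicates) are illustrative assumptions rather than fixed standards:

```python
# Simple consistency checks before presenting results.
import pandas as pd

results = pd.DataFrame({
    "segment": ["A", "B", "C"],
    "revenue": [1200.0, 860.0, 430.0],
})

assert results["revenue"].notna().all(), "null values found"
assert (results["revenue"] >= 0).all(), "negative revenue found"
assert not results["segment"].duplicated().any(), "duplicate segments"
print("consistency checks passed")
```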
Highlight important findings
Every piece of information plays a part in the development of a successful project, but some data carries more potent information that can genuinely benefit your readers. Try to organise the data into a few essential points when describing your findings.
Determine the most appropriate communication format
The way you present your findings says a lot about you as a professional. We advise you to use visual presentations and animations because they convey information much more quickly. However, there are times when you need to go back to basics: your clients may, for example, need to carry the findings in physical form, or need to collect and exchange certain information.
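As a small illustration of a visual presentation, this matplotlib sketch charts hypothetical findings and saves them in a form that can be shared digitally or printed for physical handover:

```python
# Charting illustrative findings for a client presentation.
import matplotlib.pyplot as plt

segments = ["A", "B", "C"]
revenue = [1200.0, 860.0, 430.0]   # hypothetical figures

plt.bar(segments, revenue)
plt.title("Revenue by customer segment")
plt.ylabel("Revenue")
plt.savefig("findings.png")        # hand the chart to stakeholders
```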
Q9) What is phase 6?
A9) Phase 6 - operationalize
Your data analytics life cycle is practically complete once you generate a full report that includes your major results, papers, and briefings. Before providing the final reports to your stakeholders, you must assess the success of your analysis.
During this procedure you must migrate the sandbox data and execute it in a live environment. Then you must keep a close eye on the results to ensure they are in line with your expectations. If the findings align precisely with your goal, you can finish the report; otherwise, you will have to go back and make modifications in your data analytics lifecycle.
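A minimal monitoring sketch for this phase; the conversion figures and the 10% tolerance are assumed values for illustration only:

```python
# Compare a live metric against the expectation set during discovery.
expected_conversion = 0.12   # target agreed in Phase 1 (assumed)
live_conversion = 0.10       # measured in the live environment (assumed)

drift = abs(live_conversion - expected_conversion) / expected_conversion
if drift <= 0.10:            # assumed tolerance
    print("results in line with expectations; finish the report")
else:
    print(f"drift of {drift:.0%}; revisit earlier lifecycle phases")
```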
Q10) What are the advantages of Big data?
A10) The following are some of the advantages or benefits of Big Data:
● Big data analysis generates novel solutions.
● Big data analysis aids in customer knowledge and targeting.
● It assists in the optimization of corporate processes.
● It contributes to the advancement of science and research.
● With the availability of patient records, it enhances healthcare and public health.
● It is used in financial trading, sports, polls, and security/law enforcement, among other things.
● Anyone can use surveys to gain access to a wealth of information and provide answers to any question.
● New data is added every second.
● A single platform can store a virtually unlimited amount of data.
Q11) Write the disadvantages of Big data?
A11) The following are some of Big Data's problems or disadvantages:
● Traditional storage can be very expensive when it comes to storing large amounts of data.
● Unstructured data makes up a large portion of big data.
● Big data analysis can violate privacy norms.
● It can be used to manipulate client information.
● It has the potential to exacerbate social stratification.
● Big data analysis yields little value in the short term; the data must be analysed over a longer period to reap the benefits.
● The results of big data analysis can sometimes be misleading.
● Rapid updates to big data can cause real-time figures to be out of sync.