UNIT-2
Data Analysis and Exploration
In the first part of this unit, a mathematical model for decision making which takes into account the whole decision-making process (DMP) is developed. The DMP itself is composed of six consecutive stages. These six stages have been selected according to the “Méthode de Raisonnement Tactique” (MRT, the Tactical Reasoning Method).
2.1.1. Structure of mathematical models:-
- Mathematical models have been developed and used in many application domains, ranging from physics to architecture, from engineering to economics.
- The models adopted in the various contexts differ substantially in terms of their mathematical structure. However, it is possible to identify a few fundamental features shared by most models.
- Generally speaking, a model is a selective abstraction of a real system. In other words, a model is designed to analyze and understand from an abstract point of view the operating behavior of a real system, of which it includes only those elements deemed relevant for the investigation being carried out.
- In this respect, it is worth quoting Einstein’s remark on the development of a model: ‘everything should be made as simple as possible, but not simpler.’ Scientific and technological development has turned to mathematical models of various types for the abstract representation of real systems.
- As an example, consider the thought experiment (Gedanken experiment) popularized in physics at the beginning of the twentieth century, which involved building a mental model of a given phenomenon and verifying its validity by imagining the consequences caused by hypothetical modifications in the model itself.
- The analogy between this conceptual paradigm and what-if analysis is readily apparent: such analyses can be easily performed using a simple spreadsheet to answer a question like: given a model for calculating the budget of a company, how are cash flows affected by a change in the payment terms, such as 90 days vs. 60 days, of invoices issued in favor of the main customers?
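As a minimal illustration of such a what-if analysis, the following Python sketch compares monthly cash inflows under 60-day and 90-day payment terms. The invoice figures and the one-month granularity are hypothetical assumptions, not values from the text:

```python
# What-if sketch: how do 60-day vs. 90-day payment terms shift the cash
# actually collected each month? (All figures are illustrative.)

monthly_invoices = [100_000, 120_000, 90_000, 110_000]  # invoiced per month

def cash_inflows(invoices, delay_months):
    """Shift each month's invoiced amount forward by the payment delay."""
    inflows = [0.0] * (len(invoices) + delay_months)
    for month, amount in enumerate(invoices):
        inflows[month + delay_months] += amount
    return inflows

print("60-day terms:", cash_inflows(monthly_invoices, 2))
print("90-day terms:", cash_inflows(monthly_invoices, 3))
```

Changing a single parameter (the payment delay) and re-running the model is exactly the kind of question a spreadsheet what-if analysis answers.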
- According to their characteristics, models can be divided into iconic, analogical, and symbolic:-
- Iconic:-
An iconic model is a material representation of a real system, whose behavior is imitated for the purpose of the analysis. A miniaturized model of a new city neighborhood is an example of an iconic model.
- Analogical:-
An analogical model is also a material representation, although it imitates the real behavior by analogy rather than by replication. A wind tunnel built to investigate the aerodynamic properties of a motor vehicle is an example of an analogical model intended to represent the actual progression of a vehicle on the road.
- Symbolic:-
A symbolic model, such as a mathematical model, is an abstract representation of a real system. It is intended to describe the behavior of the system through a series of symbolic variables, numerical parameters, and mathematical relationships. Business intelligence systems, and consequently the models presented, are exclusively based on symbolic models. A further relevant distinction concerns the probabilistic nature of models, which can be either stochastic or deterministic.
- Stochastic:-
In a stochastic model, some input information represents random events and is therefore characterized by a probability distribution, which in turn can be assigned or unknown. Predictive models, which will be thoroughly described in the following chapters, as well as waiting line models, briefly mentioned below in this chapter, are examples of stochastic models.
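As a minimal sketch of a stochastic model, the following Python simulation of a single-server waiting line draws random inter-arrival and service times from exponential distributions; the two rates are illustrative assumptions:

```python
import random

random.seed(42)
ARRIVAL_RATE = 0.8   # customers per minute (assumed)
SERVICE_RATE = 1.0   # customers per minute (assumed)

def average_wait(n_customers):
    """Simulate a FIFO single-server queue and return the mean wait."""
    clock = server_free_at = 0.0
    total_wait = 0.0
    for _ in range(n_customers):
        clock += random.expovariate(ARRIVAL_RATE)   # next random arrival
        start = max(clock, server_free_at)          # wait if server is busy
        total_wait += start - clock
        server_free_at = start + random.expovariate(SERVICE_RATE)
    return total_wait / n_customers

print(f"Average wait: {average_wait(10_000):.2f} minutes")
```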
- Deterministic:-
A model is called deterministic when all input data are supposed to be known a priori and with certainty. Since this assumption is rarely fulfilled in real systems, one resorts to deterministic models when the problem at hand is sufficiently complex and any stochastic elements are of limited relevance. Notice, however, that even for deterministic models the hypothesis of knowing the data with certainty may be relaxed. Sensitivity and scenario analyses, as well as what-if analysis, allow one to assess the robustness of optimal decisions to variations in the input parameters. A further distinction concerns the temporal dimension in a mathematical model, which can be either static or dynamic.
- Static:-
Static models consider a given system and the related decision-making process within one single temporal stage.
- Dynamic:-
Dynamic models consider a given system through several temporal stages, corresponding to a sequence of decisions. In many instances the temporal dimension is subdivided into discrete intervals of a previously fixed span: minutes, hours, days, weeks, months and years are examples of discrete subdivisions of the time axis. Discrete-time dynamic models, which largely prevail in business intelligence applications, observe the status of a system only at the beginning or at the end of discrete intervals. Continuous-time dynamic models consider a continuous sequence of periods on the time axis.
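The following minimal Python sketch illustrates a discrete-time dynamic model: the state of a toy inventory system is observed only at the end of each weekly interval (the demand and order figures are illustrative assumptions):

```python
weekly_demand = [30, 45, 25, 50]   # assumed demand per week
weekly_order = 40                  # assumed fixed replenishment per week
stock = 100                        # initial state of the system

for week, demand in enumerate(weekly_demand, start=1):
    stock = max(stock + weekly_order - demand, 0)  # state transition rule
    print(f"End of week {week}: stock = {stock}")
```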
Key Takeaways:
- The DMP itself is composed of six consecutive stages. These six stages have been selected according to the “Méthode de Raisonnement Tactique” (MRT, the Tactical Reasoning Method).
- Mathematical models have been developed and used in many application domains, ranging from physics to architecture, from engineering to economics.
- A model is a selective abstraction of a real system. In other words, a model is designed to analyze and understand from an abstract point of view the operating behavior of a real system, regarding which it only includes those elements deemed relevant for the investigation carried out.
- According to their characteristics, models can be divided into iconic, analogical, and symbolic
2.2. Data Mining:
- Data Mining is a process of finding potentially useful patterns from huge data sets. It is a multi-disciplinary skill that uses machine learning, statistics, and AI to extract information and evaluate the probability of future events.
- The insights derived from Data Mining are used for marketing, fraud detection, scientific discovery, etc.
- Data Mining is all about discovering hidden, unsuspected, and previously unknown yet valid relationships amongst the data.
- Data mining is also called Knowledge Discovery in Data (KDD), Knowledge extraction, data/pattern analysis, information harvesting, etc.
2.2.1. Types of Data:
Data mining can be performed on the following types of data
• Relational databases
• Data warehouses
• Advanced DB and information repositories
• Object-oriented and object-relational databases
• Transactional and Spatial databases
• Heterogeneous and legacy databases
• Multimedia and streaming databases
• Text databases
• Text mining and Web mining
2.2.2. Data Mining Implementation Process:
Let's study the Data Mining implementation process in detail:
1. Business understanding:
- In this phase, business and data-mining goals are established.
- First, you need to understand the business and client objectives. You need to define what your client wants (which often even they do not know themselves).
- Take stock of the current data mining scenario. Factor in resources, assumptions, constraints, and other significant factors into your assessment.
- Using business objectives and current scenarios, define your data mining goals.
- A good data mining plan is very detailed and should be developed to accomplish both business and data mining goals.
2. Data understanding:
- In this phase, a sanity check on data is performed to check whether it's appropriate for the data mining goals.
- First, data is collected from multiple data sources available in the organization.
- These data sources may include multiple databases, flat files, or data cubes. Issues like object matching and schema integration can arise during the data integration process. It is a quite complex and tricky process, as data from various sources is unlikely to match easily. For example, table A contains an entity named cust_no whereas table B contains an entity named cust-id.
- Therefore, it is quite difficult to ascertain whether both of these objects refer to the same entity. Metadata should be used to reduce errors in the data integration process (see the sketch after this list).
- The next step is to explore the properties of the acquired data. A good way to explore the data is to answer the data mining questions (decided in the business phase) using query, reporting, and visualization tools.
- Based on the results of a query, the data quality should be ascertained. Missing data if any should be acquired.
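As a minimal sketch of the object-matching problem described above (a pandas example with toy rows; the metadata mapping is a hypothetical construct), the cust_no and cust-id columns can be reconciled before the tables are merged:

```python
import pandas as pd

# Table A and table B name the same customer key differently.
table_a = pd.DataFrame({"cust_no": [1, 2], "city": ["Pune", "Mumbai"]})
table_b = pd.DataFrame({"cust-id": [1, 2], "spend": [500, 750]})

# A small metadata mapping records that both columns mean the same thing.
column_metadata = {"cust_no": "customer_id", "cust-id": "customer_id"}

table_a = table_a.rename(columns=column_metadata)
table_b = table_b.rename(columns=column_metadata)
print(table_a.merge(table_b, on="customer_id"))  # keys now line up
```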
3. Data preparation:
- In this phase, data is made production-ready.
- The data preparation process consumes about 90% of the time of the project.
- The data from different sources should be selected, cleaned, transformed, formatted, anonymized, and constructed (if required).
- Data cleaning is a process to "clean" the data by smoothing noisy data and filling in missing values.
- For example, for a customer demographics profile, age data is missing. The data is incomplete and should be filled in. In some cases, there could be data outliers. For instance, age has a value of 300. Data could be inconsistent. For instance, the name of the customer is different on different tables.
- Data transformation operations change the data to make it useful in data mining. The following transformations can be applied.
4. Data transformation:
Data transformation operations would contribute to the success of the mining process.
- Smoothing: It helps to remove noise from the data.
- Aggregation: Summary or aggregation operations are applied to the data. For example, weekly sales data is aggregated to calculate monthly and yearly totals.
- Generalization: In this step, low-level data is replaced by higher-level concepts with the help of concept hierarchies. For example, the city is replaced by the county.
- Normalization: Normalization is performed when the attribute data are scaled up or down so that they fall within a specified range. Example: data should fall in the range -2.0 to 2.0 post-normalization (see the sketch below).
- Attribute construction: New attributes are constructed from the given set of attributes and included where they are helpful for data mining.
The result of this process is the final data set that can be used in modeling.
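As a minimal sketch of cleaning and normalization (a pandas example on a toy customer table; the values, the age cap, and the median-fill rule are all illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 300, 41],
                   "income": [30_000, 52_000, 48_000, 80_000]})

df["age"] = df["age"].fillna(df["age"].median())  # fill the missing value
df["age"] = df["age"].clip(upper=100)             # cap the outlier age of 300

# Min-max scale income into the range [-2.0, 2.0] mentioned above.
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = -2.0 + 4.0 * (df["income"] - lo) / (hi - lo)
print(df)
```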
5. Modeling:
- In this phase, mathematical models are used to determine data patterns.
- Based on the business objectives, suitable modeling techniques should be selected for the prepared dataset.
- Create a scenario to test the quality and validity of the model.
- Run the model on the prepared dataset.
- Results should be assessed by all stakeholders to make sure that the model can meet data mining objectives.
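As a minimal sketch of this modeling phase (using scikit-learn and the public Iris dataset as stand-ins; the text does not prescribe any particular tool, technique, or dataset), a technique is selected, run on prepared data, and assessed on held-out records:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out part of the prepared dataset to test the model's validity.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```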
6. Evaluation:
- In this phase, patterns identified are evaluated against the business objectives.
- Results generated by the data mining model should be evaluated against the business objectives.
- Gaining business understanding is an iterative process: while evaluating the results, new business requirements may be raised because of what data mining has uncovered.
- A go or no-go decision is taken to move the model into the deployment phase.
7. Deployment:
- In the deployment phase, you ship your data mining discoveries to everyday business operations.
- The knowledge or information discovered during the data mining process should be made easy to understand for non-technical stakeholders.
- A detailed deployment plan, for shipping, maintenance, and monitoring of data mining discoveries is created.
- A final project report is created with lessons learned and key experiences during the project. This helps to improve the organization's business policy.
2.2.3. Data Mining Techniques:
1. Classification:
This analysis is used to retrieve important and relevant information about data and metadata. This data mining method helps to classify data into different classes.
2. Clustering:
Clustering analysis is a data mining technique used to identify data items that are similar to each other. This process helps in understanding the differences and similarities between the data.
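As a minimal clustering sketch (scikit-learn's k-means on toy customer records; all numbers are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is a customer: [age, annual spend].
customers = np.array([[25, 500], [27, 520], [45, 2400], [48, 2550], [33, 900]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print(labels)  # similar customers receive the same cluster label
```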
3. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship between variables. It is used to estimate the likely value of a specific variable, given the values of the other variables.
4. Association Rules:
This data mining technique helps to find associations between two or more items. It discovers hidden patterns in the data set.
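As a minimal association-rule sketch (pure Python over toy market baskets; the data and the bread/butter rule are illustrative assumptions), support and confidence can be computed by counting co-occurrences:

```python
from collections import Counter
from itertools import combinations

baskets = [{"bread", "butter"}, {"bread", "milk"},
           {"bread", "butter", "milk"}, {"milk"}]

item_counts, pair_counts = Counter(), Counter()
for basket in baskets:
    item_counts.update(basket)                          # single-item counts
    pair_counts.update(combinations(sorted(basket), 2)) # pair co-occurrences

# Rule "bread -> butter": how often the pair occurs in all baskets, and
# how often butter appears given that bread does.
support = pair_counts[("bread", "butter")] / len(baskets)
confidence = pair_counts[("bread", "butter")] / item_counts["bread"]
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```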
5. Outlier detection:
This type of data mining technique refers to the observation of data items in the dataset which do not match an expected pattern or expected behavior. It can be used in a variety of domains, such as intrusion detection, fraud detection, fault detection, etc. Outlier detection is also called outlier analysis or outlier mining.
6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in transaction data for a certain period.
7. Prediction:
Prediction uses a combination of the other data mining techniques, such as trend analysis, sequential patterns, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.
2.2.4. Challenges of Implementation of Data mining:
- Skilled experts are needed to formulate the data mining queries.
- Overfitting: due to a small training database, a model may not fit future states.
- Data mining needs large databases, which can be difficult to manage.
- Business practices may need to be modified to make use of the information uncovered.
- If the data set is not diverse, data mining results may not be accurate.
- Integrating information from heterogeneous databases and global information systems can be complex.
2.2.5. Data mining Examples:
Let's now look at data mining in practice with an example:
Example:
Consider the marketing head of a telecom service provider who wants to increase the revenues of long-distance services. For a high ROI on his sales and marketing efforts, customer profiling is important. He has a vast pool of customer information such as age, gender, income, and credit history, but it is impossible to determine the characteristics of people who prefer long-distance calls through manual analysis. Using data mining techniques, he may uncover patterns between heavy long-distance users and their characteristics.
For example, he might learn that his best customers are married females between the ages of 45 and 54 who make more than $80,000 per year. Marketing efforts can then be targeted at this demographic segment, as sketched below.
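As a minimal profiling sketch for this example (a pandas grouping over fabricated records; the columns and values are illustrative assumptions, not real customer data):

```python
import pandas as pd

customers = pd.DataFrame({
    "gender":     ["F", "F", "M", "F", "M", "M"],
    "age_band":   ["45-54", "45-54", "25-34", "35-44", "45-54", "25-34"],
    "income":     [85_000, 92_000, 40_000, 61_000, 78_000, 35_000],
    "ld_minutes": [420, 510, 60, 150, 200, 45],   # long-distance usage
})

# Average long-distance minutes per demographic segment.
profile = customers.groupby(["gender", "age_band"])["ld_minutes"].mean()
print(profile.sort_values(ascending=False))
```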
2.2.6. Data Mining Tools:
Following are two popular data mining tools widely used in industry:
1. R language:
R is an open-source language and environment for statistical computing and graphics. It offers a wide variety of statistical techniques, including classical statistical tests, time-series analysis, classification, and graphical methods. It also provides effective data handling and storage facilities.
2. Oracle Data Mining:-
Oracle Data Mining, popularly known as ODM, is a module of the Oracle Advanced Analytics Database. This data mining tool allows data analysts to generate detailed insights and make predictions. It helps predict customer behavior, develop customer profiles, and identify cross-selling opportunities.
2.2.7. Benefits of Data Mining:
- Data mining techniques help companies to get knowledge-based information.
- Data mining helps organizations to make profitable adjustments in operation and production.
- Data mining is a cost-effective and efficient solution compared to other statistical data applications.
- Data mining helps with the decision-making process.
- Facilitates automated prediction of trends and behaviors as well as the automated discovery of hidden patterns.
- It can be implemented in new systems as well as existing platforms
- It is a speedy process that makes it easy for users to analyze a huge amount of data in less time.
2.2.8. Disadvantages of Data Mining:
- There is a chance that companies may sell useful information about their customers to other companies for money. For example, American Express has sold credit card purchase data of its customers to other companies.
- Much data mining analytics software is difficult to operate and requires advanced training to work with.
- Different data mining tools work in different manners due to the different algorithms employed in their design. Therefore, the selection of the right data mining tool is a difficult task.
- Data mining techniques are not perfectly accurate, which can cause serious consequences in certain conditions.
2.2.9. Data Mining Applications:
- Communications:-
Data mining techniques are used in the communication sector to predict customer behavior to offer highly targeted and relevant campaigns.
- Insurance:-
Data mining helps insurance companies to price their products profitably and promote new offers to their new or existing customers.
- Education:-
Data mining helps educators to access student data, predict achievement levels, and find students or groups of students who need extra attention, for example, students who are weak in mathematics.
- Manufacturing:-
With the help of data mining, manufacturers can predict the wear and tear of production assets. They can anticipate maintenance needs, which helps them to minimize downtime.
- Banking:-
Data mining helps the finance sector to get a view of market risks and manage regulatory compliance. It helps banks to identify probable defaulters to decide whether to issue credit cards, loans, etc.
- Retail:-
Data mining techniques help retail malls and grocery stores identify and arrange the most sellable items in the most attention-grabbing positions. It also helps store owners to come up with offers that encourage customers to increase their spending.
- Service Providers:-
Service providers, such as those in the mobile phone and utility industries, use data mining to predict why and when a customer is likely to leave. They analyze billing details, customer service interactions, and complaints made to the company to assign each customer a probability score and offer incentives.
- E-Commerce:-
E-commerce websites use Data Mining to offer cross-sells and up-sells through their websites. One of the most famous names is Amazon, which uses Data mining techniques to get more customers into their e-commerce store.
- Super Markets:-
Data mining allows supermarkets to develop rules to predict whether their shoppers are likely to be expecting. By evaluating buying patterns, they can find female customers who are most likely pregnant and start targeting products like baby powder, baby soap, diapers, and so on.
- Crime Investigation:-
Data mining helps crime investigation agencies to deploy the police workforce (where is a crime most likely to happen, and when?), decide whom to search at a border crossing, etc.
- Bioinformatics:-
Data Mining helps to mine biological data from massive datasets gathered in biology and medicine.
Key Takeaways:-
- Data mining definition: Data Mining is all about explaining the past and predicting the future via Data analysis.
- Data mining helps to extract information from huge sets of data. It is the procedure of mining knowledge from data.
- The data mining process includes business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
- Important data mining techniques are classification, clustering, regression, association rules, outlier detection, sequential patterns, and prediction.
- R and Oracle Data Mining are prominent data mining tools.
- Data mining techniques help companies to get knowledge-based information.
- The main drawback of data mining is that much analytics software is difficult to operate and requires advanced training to work with.
- Data mining is used in diverse industries such as communications, insurance, education, manufacturing, banking, retail, service providers, e-commerce, supermarkets, and bioinformatics.
2.3. Data Preparation:
- Data preparation is the process of cleaning and transforming raw data before processing and analysis. It is an important step before processing and often involves reformatting data, making corrections to data, and combining data sets to enrich the data.
- Data preparation is often a lengthy undertaking for data professionals or business users, but it is essential as a prerequisite to put data in context to turn it into insights and eliminate bias resulting from poor data quality.
2.3.1. Benefits of data preparation + the cloud:-
76% of data scientists say that data preparation is the worst part of their job, but efficient, accurate business decisions can only be made with clean data. Data preparation helps:
- Fix errors quickly: - Data preparation helps catch errors before processing. After data has been removed from its source, these errors become more difficult to understand and correct.
- Produce top-quality data: - Cleaning and reformatting datasets ensure that all data used in the analysis will be high quality.
- Make better business decisions: - Higher quality data that can be processed and analyzed more quickly and efficiently leads to more timely, efficient, and high-quality business decisions.
Additionally, as data and data processes move to the cloud, data preparation moves with it for even greater benefits, such as:
- Superior scalability: - Cloud data preparation can grow at the pace of the business. Enterprises do not have to worry about the underlying infrastructure or try to anticipate its evolution.
- Future proof: - Cloud data preparation upgrades automatically so that new capabilities or problem fixes can be turned on as soon as they are released. This allows organizations to stay ahead of the innovation curve without delays and added costs.
- Accelerated data usage and collaboration: - Doing data prep in the cloud means it is always on, doesn’t require any technical installation, and lets teams collaborate on the work for faster results.
Additionally, a good, cloud-native data preparation tool will offer other benefits (such as an intuitive, simple-to-use GUI) for easier and more efficient preparation.
2.3.2. Data Preparation Steps:-
The specifics of the data preparation process vary by industry, organization, and need, but the framework remains largely the same.
1. Gather data:
The data preparation process begins with finding the right data. This can come from an existing data catalog or can be added ad-hoc.
2. Discover and assess data:
After collecting the data, it is important to discover each dataset. This step is about getting to know the data and understanding what has to be done before the data becomes useful in a particular context.
Discovery is a big task, but Talend’s data preparation platform offers visualization tools that help users profile and browse their data.
3. Cleanse and validate data:
Cleaning up the data is traditionally the most time-consuming part of the data preparation process, but it’s crucial for removing faulty data and filling in gaps. Important tasks here include:
- Removing extraneous data and outliers.
- Filling in missing values.
- Conforming data to a standardized pattern.
- Masking private or sensitive data entries.
Once data has been cleansed, it must be validated by testing for errors in the data preparation process up to this point. Often, an error in the system will become apparent during this step and will need to be resolved before moving forward.
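As a minimal sketch of these cleansing-and-validation tasks (a pandas example on fabricated records; the 120-year age rule and the masking format are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [34, None, 310, 29],
    "city": ["pune", "PUNE ", "Mumbai", " mumbai"],
    "ssn":  ["111-22-3333", "222-33-4444", "333-44-5555", "444-55-6666"],
})

df = df[df["age"].isna() | (df["age"] < 120)]     # remove the outlier row
df["age"] = df["age"].fillna(df["age"].median())  # fill the missing value
df["city"] = df["city"].str.strip().str.title()   # conform to one pattern
df["ssn"] = "***-**-" + df["ssn"].str[-4:]        # mask sensitive entries

# Validate: the preparation steps above must leave no missing ages.
assert df["age"].notna().all(), "validation failed: missing ages remain"
print(df)
```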
4. Transform and enrich data:
Transforming data is the process of updating the format or value entries to reach a well-defined outcome, or to make the data more easily understood by a wider audience. Enriching data refers to adding and connecting data with other related information to provide deeper insights.
5. Store data:
Once prepared, the data can be stored or channeled into a third-party application—such as a business intelligence tool—clearing the way for processing and analysis to take place.
Key Takeaways:-
- Data preparation is the process of cleaning and transforming raw data before processing and analysis. It is an important step before processing and often involves reformatting data, making corrections to data, and combining data sets to enrich the data.
- 76% of data scientists say that data preparation is the worst part of their job, but efficient, accurate business decisions can only be made with clean data.
- The specifics of the data preparation process vary by industry, organization, and need, but the framework remains largely the same.
2.4. Data Exploration:
Data exploration is the first step in data analysis and typically involves summarizing the main characteristics of a data set, including its size, accuracy, initial patterns in the data, and other attributes. It is commonly conducted by data analysts using visual analytics tools, but it can also be done in more advanced statistical software.
2.4.1. The role of data exploration
- Before it can analyze data collected by multiple data sources and stored in data warehouses, an organization must know how many cases are in a data set, what variables are included, how many missing values there are, and what general hypotheses the data is likely to support. An initial exploration of the data set can help answer these questions by familiarizing analysts with the data with which they are working.
- Once data exploration has uncovered the relationships between the different variables, organizations can continue the data mining process by creating and deploying data models to take action.
- Companies can conduct data exploration via a combination of automated and manual methods.
- Analysts commonly use automated tools such as data visualization software for data exploration because these tools allow users to quickly and simply view most of the relevant features of a data set. From this step, users can identify variables that are likely to have interesting observations.
- By displaying data graphically -- for example, through scatter plots, density plots, or bar charts -- users can see if two or more variables correlate and determine if they are good candidates for further analysis (a short sketch follows the list below), which may include:
- Univariate analysis: The analysis of one variable.
- Bivariate analysis: The analysis of two variables to determine their relationship.
- Multivariate analysis: The analysis of multiple outcome variables.
- Principal components analysis: The analysis and conversion of possibly correlated variables into a smaller number of uncorrelated variables.
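As a minimal exploration sketch (NumPy and scikit-learn on synthetic data; the generated columns are illustrative assumptions), a bivariate correlation check is followed by principal components analysis:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x,
                        2 * x + rng.normal(scale=0.5, size=200),  # correlated with x
                        rng.normal(size=200)])                    # independent noise

# Bivariate analysis: how strongly do the first two columns correlate?
print("corr(x1, x2):", round(np.corrcoef(data[:, 0], data[:, 1])[0, 1], 3))

# PCA: convert correlated columns into fewer uncorrelated components.
pca = PCA(n_components=2).fit(data)
print("variance explained:", pca.explained_variance_ratio_)
```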
- Manual data exploration methods may include filtering and drilling down into data in Excel spreadsheets or writing scripts to analyze raw data sets.
- After the data exploration is complete, analysts can move on to the data discovery phase to answer specific questions about a business issue. The data discovery process involves using business intelligence tools to examine trends, sequences, and events and creating visualizations to present to business leaders.
2.4.2. Data exploration tools and vendors:
- Analysts can explore data using features in business intelligence tools and data visualization software, such as MapR, Microsoft Power BI, Qlik, and Tableau.
- Data profiling and preparation software from vendors including Trifacta and Paxata can help organizations blend disparate data sources to enable faster data exploration by analysts.
- There are also free, open-source data exploration tools, such as MIT's DIVE, which include visualization features and regression capabilities.
Key Takeaways:-
- Data exploration is the first step in data analysis and typically involves summarizing the main characteristics of a data set, including its size, accuracy, initial patterns in the data, and other attributes.
- Before it can analyze data collected by multiple data sources and stored in data warehouses, an organization must know how many cases are in a data set, what variables are included, how many missing values there are, and what general hypotheses the data is likely to support.
- Once data exploration has uncovered the relationships between the different variables, organizations can continue the data mining process by creating and deploying data models to take action.
- By displaying data graphically -- for example, through scatter plots, density plots, or bar charts -- users can see if two or more variables correlate and determine if they are good candidates for further analysis.
References:
- Business Intelligence: Data Mining and Optimization for Decision Making – Carlo Vercellis – Wiley.
- Big Data and Analytics – Seema Acharya and Subhashini Chellappan – Wiley.
- Data Mining: Concepts and Techniques, Second Edition – Jiawei Han and Micheline Kamber – Morgan Kaufmann.
- Data Mining and Analysis: Fundamental Concepts and Algorithms – Mohammed J. Zaki and Wagner Meira Jr. – Cambridge University Press.