Unit – 3
Data Preprocessing
Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. Such data is usually not helpful for analysis because it can hinder the process or produce inaccurate results. There are several methods for cleaning data, depending on how the data is stored and the questions being asked.
Data cleaning is not simply about erasing information to make space for new data, but rather finding a way to maximize a data set’s accuracy without necessarily deleting information.
Data cleaning involves more than removing data: it also includes fixing spelling and syntax errors, standardizing data sets, correcting mistakes such as empty fields and missing codes, and identifying duplicate data points. Data cleaning is considered a foundational element of data science, as it plays an important role in the analytical process and in uncovering reliable answers.
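As a small illustration, the sketch below applies a few of these cleaning steps with the pandas library to a made-up customer table; the column names, placeholder values, and standardization rules are assumptions for illustration, not part of any particular data set.

```python
import pandas as pd
import numpy as np

# Hypothetical raw customer data with common quality problems
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "name": ["  Alice ", "BOB", "BOB", "carol", None],
    "country": ["us", "USA", "USA", "U.S.", "us"],
    "age": [34, -1, -1, 29, 41],   # -1 used as a bogus placeholder
})

cleaned = (
    raw
    .drop_duplicates()                                       # remove duplicated rows
    .assign(
        name=lambda d: d["name"].str.strip().str.title(),    # fix spacing and casing
        country=lambda d: d["country"].replace(              # standardize country codes
            {"us": "US", "USA": "US", "U.S.": "US"}),
        age=lambda d: d["age"].replace(-1, np.nan),          # flag invalid values as missing
    )
    .dropna(subset=["customer_id"])                          # key field must not be null
)

print(cleaned)
```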
Data transformation is the process of converting data from one format or structure into another format or structure. Data transformation is critical to activities such as data integration and data management. Data transformation can include a range of activities: you might convert data types, cleanse data by removing nulls or duplicate data, enrich the data, or perform aggregations, depending on the needs of your project.
Typically, the process involves two stages.
In the first stage, you:
- Perform data discovery where you identify the sources and data types.
- Determine the structure and data transformations that need to occur.
- Perform data mapping to define how individual fields are mapped, modified, joined, filtered, and aggregated.
In the second stage, you:
- Extract data from the original source. The range of sources can vary, including structured sources, like databases, or streaming sources, such as telemetry from connected devices, or log files from customers using your web applications.
- Perform transformations. You transform the data, for example by aggregating sales data, converting date formats, editing text strings, or joining rows and columns (see the sketch after this list).
- Send the data to the target store. The target might be a database or a data warehouse that handles structured and unstructured data.
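Putting the two stages together, the following sketch extracts data from a hypothetical CSV source, applies a couple of transformations (date conversion and aggregation), and loads the result into a SQLite table acting as the target store. The file name, column names, and table name are illustrative assumptions.

```python
import sqlite3
import pandas as pd

# Extract: read from the original source (an assumed CSV export with
# columns order_date, region, amount)
orders = pd.read_csv("orders.csv")

# Transform: convert date formats and aggregate sales per region and month
orders["order_date"] = pd.to_datetime(orders["order_date"])
monthly_sales = (
    orders
    .assign(month=orders["order_date"].dt.to_period("M").astype(str))
    .groupby(["region", "month"], as_index=False)["amount"].sum()
)

# Load: send the transformed data to the target store (here a SQLite database)
with sqlite3.connect("warehouse.db") as conn:
    monthly_sales.to_sql("monthly_sales", conn, if_exists="replace", index=False)
```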
Why transform data?
You might want to transform your data for a number of reasons. Generally, businesses want to transform data to make it compatible with other data, move it to another system, join it with other data, or aggregate information in the data.
For example, consider the following scenario: your company has purchased a smaller company, and you need to combine information for the Human Resources departments. The purchased company uses a different database than the parent company, so you’ll need to do some work to ensure that these records match. Each of the new employees has been issued an employee ID, so this can serve as a key. But, you’ll need to change the formatting for the dates, you’ll need to remove any duplicate rows, and you’ll have to ensure that there are no null values for the Employee ID field so that all employees are accounted for. All these critical functions are performed in a staging area before you load the data to the final target.
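A rough sketch of that staging work in pandas might look like the following; the column names (EmployeeID, HireDate), the sample records, and the two date formats are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical extracts from the parent and the acquired company's HR systems
parent = pd.DataFrame({
    "EmployeeID": [101, 102],
    "HireDate": ["2015-03-01", "2018-07-15"],                              # ISO dates
})
acquired = pd.DataFrame({
    "EmployeeID": [201, 202, 202, None],
    "HireDate": ["03/01/2020", "11/30/2021", "11/30/2021", "01/15/2022"],  # US-style dates
})

# Standardize the date formats so both systems agree
parent["HireDate"] = pd.to_datetime(parent["HireDate"], format="%Y-%m-%d")
acquired["HireDate"] = pd.to_datetime(acquired["HireDate"], format="%m/%d/%Y")

# Remove duplicate rows from the acquired system's extract
acquired = acquired.drop_duplicates()

# Flag records with a missing Employee ID so every employee is accounted for
missing_key = acquired["EmployeeID"].isna()
if missing_key.any():
    print(f"{missing_key.sum()} record(s) need a manual Employee ID before loading")

# Combine both departments in the staging area, keeping only keyed records for now
combined = pd.concat([parent, acquired[~missing_key]], ignore_index=True)
print(combined)
```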
Other common reasons to transform data include:
- You are moving your data to a new data store; for example, you are moving to a cloud data warehouse and you need to change the data types.
- You want to join unstructured data or streaming data with structured data so you can analyze the data together.
- You want to add information to your data to enrich it, such as performing lookups, adding geolocation data, or adding timestamps.
- You want to perform aggregations, such as comparing sales data from different regions or totaling sales from different regions.
How is data transformed?
There are a few different ways to transform data:
- Scripting. Some companies perform data transformation via scripts using SQL or Python to write the code to extract and transform the data.
- On-premise ETL tools. ETL (Extract, Transform, Load) tools can take much of the pain out of scripting the transformations by automating the process. These tools are typically hosted on your company's site and may require extensive expertise and significant infrastructure investment.
- Cloud-based ETL tools. These ETL tools are hosted in the cloud, where you can leverage the expertise and infrastructure of the vendor.
Data transformation challenges
Data transformation can be difficult for a number of reasons:
- Time-consuming. You may need to extensively cleanse the data so you can transform or migrate it. This can be extremely time-consuming, and is a common complaint amongst data scientists working with unstructured data.
- Costly. Depending on your infrastructure, transforming your data may require a team of experts and substantial infrastructure costs.
- Slow. Because the process of extracting and transforming data can be a burden on your system, it is often done in batches, which means you may have to wait up to 24 hours for the next batch to be processed. This can cost you time in making business decisions.
Data Reduction
Data reduction techniques produce a condensed description of the original data that is much smaller in volume yet preserves the quality of the original data.
Methods of data reduction:
The main techniques are explained below.
1. Data Cube Aggregation:
This technique is used to aggregate data into a simpler form. For example, suppose you gathered quarterly revenue figures for your company for the years 2012 to 2014, but your analysis only requires annual sales rather than quarterly values. You can aggregate the data so that the result summarizes total sales per year instead of per quarter, giving a much more compact data set.
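A minimal sketch of this kind of roll-up, assuming a small quarterly sales table with made-up figures, could look like this in pandas:

```python
import pandas as pd

# Quarterly revenue for 2012-2014 (values are made up for illustration)
quarterly = pd.DataFrame({
    "year":    [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "revenue": [120, 135, 150, 160, 170, 165, 180, 190, 200, 210, 220, 240],
})

# Roll up from quarterly to yearly totals (the data cube aggregation step)
yearly = quarterly.groupby("year", as_index=False)["revenue"].sum()
print(yearly)
```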
2. Dimension reduction:
Whenever some attributes are only weakly relevant to the analysis, we keep just the attributes that are actually required. This reduces the data size by eliminating outdated or redundant features.
- Step-wise Forward Selection –
The selection begins with an empty set of attributes. At each step, the best of the remaining original attributes is added to the set, judged by its relevance (in statistics, this relevance is often measured with a significance test or p-value).
Suppose the data set contains the following attributes, a few of which are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
- Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and, at each step, eliminates the worst remaining attribute from the set.
Consider the same attribute set as above, in which a few attributes are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }
Step-1: {X1, X2, X3, X4, X5}
Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
- Combination of Forward and Backward Selection –
This combines the two approaches: at each step the procedure can add the best remaining attribute and remove the worst one, saving time and making the process faster (a scikit-learn sketch of sequential selection follows below).
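In practice, a greedy stepwise search like the one traced above can be run with scikit-learn's SequentialFeatureSelector; the synthetic data set and the choice of keeping three attributes below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data set with 6 attributes, only some of which are informative
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=2, random_state=0)

model = LogisticRegression(max_iter=1000)

# Forward selection: start from an empty set and add the best attribute each step
forward = SequentialFeatureSelector(model, n_features_to_select=3,
                                    direction="forward").fit(X, y)

# Backward selection: start from all attributes and remove the worst each step
backward = SequentialFeatureSelector(model, n_features_to_select=3,
                                     direction="backward").fit(X, y)

print("Forward selection keeps attributes:", forward.get_support(indices=True))
print("Backward selection keeps attributes:", backward.get_support(indices=True))
```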
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms (for example, Huffman encoding and run-length encoding). Compression methods can be divided into two types:
- Lossless Compression –
Encoding techniques such as run-length encoding allow a simple but exact reduction in data size. Lossless compression uses algorithms that restore the precise original data from the compressed data.
- Lossy Compression –
Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this kind of compression. For example, the JPEG image format uses lossy compression, yet the decompressed image still conveys essentially the same content as the original. In lossy compression, the decompressed data may differ from the original data but is still useful enough to retrieve information from.
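As a concrete example of a lossless technique, here is a minimal run-length encoder and decoder; decoding the encoded output restores the original string exactly, which is what distinguishes lossless from lossy compression.

```python
from itertools import groupby

def rle_encode(text: str) -> list[tuple[str, int]]:
    """Run-length encoding: store each character once together with its repeat count."""
    return [(char, len(list(run))) for char, run in groupby(text)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Rebuild the exact original string from the (character, count) pairs."""
    return "".join(char * count for char, count in pairs)

data = "AAAABBBCCDAA"
encoded = rle_encode(data)          # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
assert rle_decode(encoded) == data  # lossless: the original is restored exactly
print(encoded)
```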
4. Numerosity Reduction:
In this technique, the actual data is replaced either with a mathematical model, so that only the model parameters need to be stored (a parametric method), or with a smaller representation of the data produced by a non-parametric method such as clustering, histograms, or sampling.
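The sketch below illustrates both flavours on a made-up numeric column: a parametric reduction that keeps only the parameters of a fitted line, and non-parametric reductions that keep a histogram or a random sample instead of the full data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10_000)
y = 2.5 * x + rng.normal(0, 50, size=x.size)   # made-up data with a linear trend

# Parametric: keep only the fitted model parameters instead of all the points
slope, intercept = np.polyfit(x, y, deg=1)

# Non-parametric: keep a histogram (bin counts and edges) ...
counts, bin_edges = np.histogram(y, bins=20)

# ... or a small random sample of the original values
sample = rng.choice(y, size=100, replace=False)

print(f"Stored 2 parameters instead of {y.size} values: "
      f"slope={slope:.2f}, intercept={intercept:.2f}")
```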
Data Discretization
Data discretization techniques divide continuous attributes into intervals. Many distinct values of an attribute are replaced by labels of small intervals, so that mining results can be presented in a concise and easily understandable way.
- Top-down discretization –
If you first choose one or a few points (so-called breakpoints or split points) to divide the whole range of the attribute and then repeat this recursively on the resulting intervals, the process is known as top-down discretization, also called splitting.
- Bottom-up discretization –
If you first treat all the distinct values as split points and then merge neighbouring values into intervals, discarding split points along the way, the process is called bottom-up discretization, also known as merging.
Concept Hierarchies:
A concept hierarchy reduces the data size by collecting low-level concepts (such as the numeric age 43) and replacing them with high-level concepts (categorical values such as "middle-aged" or "senior").
For numeric data, the following techniques can be used; a short code sketch follows this list.
- Binning –
Binning is the process of converting numerical variables into categorical counterparts. The number of categories depends on the number of bins specified by the user.
- Histogram analysis –
Like binning, a histogram partitions the values of an attribute X into disjoint ranges called buckets. There are several partitioning rules:
- Equal-frequency partitioning: partition the values based on their number of occurrences, so that each bucket holds roughly the same number of values.
- Equal-width partitioning: partition the values into ranges of a fixed width determined by the number of bins (for example, ranges such as 0-20).
- Clustering: grouping similar data together.
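A short sketch of these numeric techniques on made-up age values: pandas' cut gives equal-width bins, qcut gives equal-frequency bins, and a simple k-means clustering groups similar values together. The bin counts and labels are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 25, 30,
                  33, 35, 35, 36, 40, 45, 46, 52, 70])

# Equal-width binning: intervals of fixed width across the value range
equal_width = pd.cut(ages, bins=3, labels=["young", "middle-aged", "senior"])

# Equal-frequency binning: each interval holds roughly the same number of values
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Clustering: group similar ages together and use the cluster id as the label
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(ages.to_frame())

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_freq": equal_freq, "cluster": clusters}))
```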