Unit - 5
Big Data Analytics and Model Evaluation
Q1) What is a k-means algorithm?
A1) K-Means clustering is one of the most commonly used prototype-based clustering algorithms for numeric data. In k-means, the prototype of a cluster is the centroid, or mean, of all the data points in that cluster. As a consequence, the algorithm works best with continuous numeric data and is less suitable for data that includes categorical variables or a mixture of quantitative and categorical variables.
K-Means Clustering is an unsupervised learning approach used in machine learning and data science to solve clustering problems. K specifies the number of predefined clusters that must be produced during the process; for example, if K=2, two clusters will be created, and if K=3, three clusters will be created, and so on.
It allows us to cluster data into different groups and provides a simple technique to determine the categories of groups in an unlabeled dataset without any training.
It's a centroid-based approach, which means that each cluster has its own centroid. The main goal of this technique is to reduce the sum of distances between data points and the clusters that they belong to.
The technique takes an unlabeled dataset as input, separates it into a k-number of clusters, and continues the procedure until no better clusters are found. In this algorithm, the value of k should be predetermined.
The k-means clustering algorithm primarily accomplishes two goals:
● Iteratively determines the optimal value for K centre points or centroids.
● Each data point is assigned to the k-center that is closest to it. A cluster is formed by data points that are close to a specific k-center.
As a result, each cluster contains datapoints with certain commonality and is isolated from the others.
Fig 1: Working of the K-means Clustering Algorithm
Pseudo Algorithm
- Choose an appropriate value of K (number of clusters we want)
- Generate K random points as initial cluster centroids
- Until convergence (Algorithm converges when centroids remain the same between iterations):
● Assign each point to a cluster whose centroid is nearest to it ("Nearness" is measured as the Euclidean distance between two points)
● Calculate new values of centroids of each cluster as the mean of all points assigned to that cluster
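As a minimal illustration of the pseudo algorithm above, the following Python sketch implements k-means from scratch with NumPy (the small two-dimensional dataset and K = 2 are assumptions made purely for demonstration):
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1-2: choose K of the points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (assumes no cluster becomes empty, which holds for this toy data)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: centroids no longer change between iterations
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Illustrative data containing two obvious groups
X = np.array([[1, 1], [1.5, 2], [1, 2], [8, 8], [9, 8], [8, 9]], dtype=float)
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)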
K-Means as an optimization problem
Any learning algorithm has the aim of minimizing a cost function. We'll see how, by using centroid as a cluster prototype, we can minimize a cost function called "sum of squared error".
The sum of squared error (SSE) is the sum of the squared distances of all points from their respective cluster prototypes:
SSE = Σ (i = 1 to K) Σ (x ∈ Ci) ‖x − ci‖²
Taking the partial derivative of the cost function with respect to each cluster prototype ci and equating it to 0, we get
ci = (1 / |Ci|) Σ (x ∈ Ci) x
i.e., the centroid (mean) of the points in cluster Ci.
Cluster centroids are thus the prototypes that minimize the cost function.
Q2) What is hierarchical clustering?
A2) Hierarchical clustering, also known as hierarchical cluster analysis or HCA, is another unsupervised machine learning approach for grouping unlabeled datasets into clusters.
The hierarchy of clusters is developed in the form of a tree in this technique, and this tree-shaped structure is known as the dendrogram.
Although the results of K-means clustering and hierarchical clustering may appear to be comparable at times, their methods differ. As opposed to the K-Means algorithm, there is no need to predetermine the number of clusters.
There are two approaches to hierarchical clustering:
● Agglomerative - Agglomerative is a bottom-up strategy in which the algorithm begins by treating each data point as its own cluster and keeps merging the closest clusters until only one cluster remains.
● Divisive - Because it is a top-down method, the divisive algorithm is the inverse of the agglomerative algorithm.
Why do we need hierarchical clustering?
Why do we need hierarchical clustering when we already have alternative clustering methods like K-Means Clustering? So, as we've seen with K-means clustering, this algorithm has some limitations, such as a set number of clusters and a constant attempt to construct clusters of the same size. We can utilise the hierarchical clustering algorithm to tackle these two problems because we don't need to know the specified number of clusters in this algorithm.
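As a brief sketch of how agglomerative clustering and a dendrogram look in code, the example below uses SciPy on a small made-up dataset (the data points and the Ward linkage are illustrative assumptions):
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Small illustrative dataset
X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 9]])

# Agglomerative (bottom-up) clustering: repeatedly merge the closest clusters
Z = linkage(X, method='ward')

# The dendrogram shows the tree of merges; cutting it yields any number of clusters
dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E'])
plt.title('Dendrogram')
plt.show()

# For example, cut the tree to obtain 2 clusters
print(fcluster(Z, t=2, criterion='maxclust'))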
Q3) Write the difference between k-means and hierarchical clustering?
A3) Difference between k-means and hierarchical clustering are -
k-means Clustering | Hierarchical Clustering |
In k-means, using a pre-specified number of clusters, the method assigns records to clusters to find mutually exclusive clusters of spherical shape based on distance. | Hierarchical methods can be either divisive or agglomerative. |
K-means clustering needs advance knowledge of K, i.e., the number of clusters one wants to divide the data into. | In hierarchical clustering one can stop at any number of clusters that one finds appropriate by interpreting the dendrogram. |
One can use the median or mean as a cluster centre to represent each cluster. | Agglomerative methods begin with ‘n’ clusters and sequentially combine similar clusters until only one cluster is obtained. |
The methods used are normally less computationally intensive and are suited to very large datasets. | Divisive methods work in the opposite direction, beginning with one cluster that includes all the records. Hierarchical methods are especially useful when the target is to arrange the clusters into a natural hierarchy. |
In k-means clustering, since one starts with a random choice of centroids, the results produced by running the algorithm many times may differ. | In hierarchical clustering, the results are reproducible. |
K-means clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. | A hierarchical clustering is a set of nested clusters that are arranged as a tree. |
K-means clustering is found to work well when the structure of the clusters is hyper-spherical (like a circle in 2D or a sphere in 3D). | Hierarchical clustering does not work as well as k-means when the shape of the clusters is hyper-spherical. |
Advantages: 1. Convergence is guaranteed. 2. Simple to implement and computationally efficient. | Advantages: 1. Ease of handling any form of similarity or distance. 2. Consequently, applicability to any attribute type. |
Disadvantages: 1. The K-value is difficult to predict. 2. Does not work well with clusters of varying sizes, densities, or non-spherical shapes. | Disadvantages: 1. Hierarchical clustering requires the computation and storage of an n×n distance matrix. For very large datasets, this can be expensive and slow. |
Q4) Explain time series analysis?
A4) A time series is a collection of observations of category or numerical variables that are linked together by a date or timestamp. The time series of a stock price is a good example of time series data. The basic structure of time series data is shown in the table below. The observations are recorded every hour in this scenario.
Timestamp | Stock - Price |
2015-10-11 09:00:00 | 100 |
2015-10-11 10:00:00 | 110 |
2015-10-11 11:00:00 | 105 |
2015-10-11 12:00:00 | 90 |
2015-10-11 13:00:00 | 120 |
Plotting the series is usually the initial stage in time series analysis, and this is usually done with a line chart.
The most typical use of time series analysis is to anticipate future values of a numeric value based on the data's temporal structure. This indicates that the current observations are utilised to forecast future values.
Traditional regression approaches are ineffective due to the data's temporal ordering. We need models that take into consideration the temporal ordering of the data in order to produce reliable forecasts.
The Autoregressive Moving Average model is the most extensively used Time Series Analysis model (ARMA). The model is divided into two sections: an autoregressive (AR) component and a moving average (MA) component. The model is then known as the ARMA(p, q) model, with p denoting the autoregressive part's order and q denoting the moving average part's order.
Autoregressive Model
AR(p) stands for an autoregressive model of order p. Mathematically it is written as:
Xt = c + φ1 Xt−1 + φ2 Xt−2 + … + φp Xt−p + εt
Where {φ1, …, φp} are the parameters to be estimated, c is a constant, and εt is the white noise random variable. The values of the parameters must be constrained in order for the model to remain stationary.
Moving Average
The moving average model of order q, denoted MA(q), is written as:
Xt = μ + εt + θ1 εt−1 + θ2 εt−2 + … + θq εt−q
Where θ1, …, θq are the parameters of the model, μ is the expectation of Xt, and εt, εt−1, … are white noise error terms.
Autoregressive Moving Average
The ARMA(p, q) model is a combination of the AR(p) and MA(q) models:
Xt = c + εt + φ1 Xt−1 + … + φp Xt−p + θ1 εt−1 + … + θq εt−q
Consider the AR component of the equation, which estimates parameters φi for the Xt−i observations in order to predict the value of the variable at time t; in the end, it is a weighted average of previous values. The MA part follows the same procedure, but uses the errors of prior observations, εt−i. As a result, the model's final outcome is a weighted average of past values and past errors.
The following code snippet shows how to use R to create an ARMA(p, q).
# install.packages("forecast")
library("forecast")
# Read the data
data <- scan('fancy.dat')
ts_data <- ts(data, frequency = 12, start = c(1987, 1))
ts_data
plot.ts(ts_data)
Plotting the data is usually the first step in determining whether the data has a temporal structure. The graphic shows that there are large increases at the end of each year.
The code below applies an ARMA model to the data. It runs multiple model combinations and chooses the one with the least amount of inaccuracy.
# Fit the ARMA model
fit <- auto.arima(ts_data)
summary(fit)
# Series: ts_data
# ARIMA(1,1,1)(0,1,1)[12]
# Coefficients:
# ar1 ma1 sma1
# 0.2401 -0.9013 0.7499
# s.e. 0.1427 0.0709 0.1790
#
# sigma^2 estimated as 15464184: log likelihood = -693.69
# AIC = 1395.38 AICc = 1395.98 BIC = 1404.43
# Training set error measures:
# ME RMSE MAE MPE MAPE MASE ACF1
# Training set 328.301 3615.374 2171.002 -2.481166 15.97302 0.4905797 -0.02521172
Q5) Introduce text analysis?
A5) The data includes text describing freelancer profiles as well as the hourly rate they charge in USD. The goal of this section is to fit a model that can forecast a freelancer's hourly rate based on their skills.
The code below demonstrates how to turn raw text, which in this case has user skills, into a bag of words matrix. We utilise the tm R library to accomplish this. This means that we establish a variable for each word in the corpus with the number of occurrences of each variable.
library(tm)
library(data.table)
source('text_analytics/text_analytics_functions.R')
data <- fread('text_analytics/data/profiles.txt')
rate <- as.numeric(data$rate)
keep <- !is.na(rate)
rate <- rate[keep]
### Make bag of words of title and body
X_all <- bag_words(data$user_skills[keep])
X_all <- removeSparseTerms(X_all, 0.999)
X_all
# <<DocumentTermMatrix (documents: 389, terms: 1422)>>
# Non-/sparse entries: 4057/549101
# Sparsity : 99%
# Maximal term length: 80
# Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
### Make a sparse matrix with all the data
X_all <- as_sparseMatrix(X_all)
We can fit a model that will give us a sparse solution now that we have the text represented as a sparse matrix. The LASSO is a good alternative in this instance (least absolute shrinkage and selection operator). This is a regression model that may pick the most important features to predict the outcome.
train_inx <- 1:200
X_train <- X_all[train_inx, ]
y_train <- rate[train_inx]
X_test <- X_all[-train_inx, ]
y_test <- rate[-train_inx]
# Train a regression model
library(glmnet)
fit <- cv.glmnet(x = X_train, y = y_train,
                 family = 'gaussian', alpha = 1,
                 nfolds = 3, type.measure = 'mae')
plot(fit)
# Make predictions
predictions <- predict(fit, newx = X_test)
predictions <- as.vector(predictions[, 1])
head(predictions)
# 36.23598 36.43046 51.69786 26.06811 35.13185 37.66367
# We can compute the mean absolute error for the test data
mean(abs(y_test - predictions))
# 15.02175
We now have a model that can forecast a freelancer's hourly rate based on a set of skills. The model's accuracy will improve as more data is collected, but the code to implement this pipeline will remain the same.
Q6) Define a bag of words?
A6) When using machine learning methods to model text, the bag-of-words model is a way of encoding text data.
The bag-of-words approach is easy to learn and use, and it's proven to be effective in tasks like language modelling and document classification.
The bag-of-words model is used for feature extraction in natural language processing.
A bag-of-words model, or BoW for short, is a method of extracting text attributes for use in modelling, such as machine learning techniques.
The method is straightforward and adaptable, and it may be used to extract information from documents in a variety of ways.
A bag-of-words is a text representation that describes the frequency with which words appear in a document. It entails two things:
● A vocabulary of known words.
● A measure of the presence of those known words.
Because any information about the sequence or structure of words in the document is deleted, it is referred to as a "bag" of words. The model simply cares about whether or not recognised terms appear in the document, not where they appear.
The assumption is that documents with comparable content are similar. Furthermore, we can deduce something about the document's significance solely from its content.
You can make the bag-of-words as simple or as complex as you want. The difficulty arises from deciding how to create a vocabulary of known words (or tokens) as well as how to rate the existence of known terms.
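As a small sketch, the scikit-learn snippet below builds a bag-of-words representation for a few invented sentences (the documents and the use of CountVectorizer are assumptions for illustration):
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is interesting", "is this interesting", "the old bike is for sale"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # one row per document, one column per known word

print(vectorizer.get_feature_names_out())   # the vocabulary of known words
print(X.toarray())                          # word counts; word order is discarded
Because word order is discarded, the first two documents receive identical vectors.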
Limitations
The bag-of-words paradigm is simple to learn and use, and it provides a lot of customization options for your individual text data.
It's been used to solve challenges like language modelling and documentation classification with tremendous success.
Nonetheless, it has some flaws, including the following:
● Vocabulary: The vocabulary must be carefully designed, especially in order to manage its size, which has an impact on the sparsity of the document representations.
● Sparsity: Sparse representations are more difficult to model, both in terms of computational complexity (space and time complexity) and in terms of information difficulty (models must harness so little information in such a huge representational space).
● Meaning: When you disregard word order, you're ignoring the context and, as a result, the meaning of the words in the document (semantics). Context and meaning can determine the difference between the same words organised differently ("this is fascinating" vs "is this interesting"), synonyms ("old bike" vs "used bike"), and much more if they are modelled.
Q7) What is TF-IDF?
A7) A difficulty with scoring word frequency is that it causes very common words to dominate the document (e.g., a higher score), yet they may not carry as much "informational content" for the model as rarer but maybe domain specialised words.
One method is to rescale the frequency of words based on how frequently they appear in all texts, penalising frequent words like "the" that appear frequently across all publications.
Term Frequency – Inverse Document Frequency, or TF-IDF for short, is a scoring method in which:
● Term Frequency is a metric that measures how frequently a term appears in the current document.
● Inverse Document Frequency (IDF) is a metric for determining how uncommon a word is across documents.
The scores are a weighting, which means that not all words are treated as equally relevant or interesting.
The ratings have the effect of highlighting words in a document that are unique (contain useful information).
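A minimal TF-IDF sketch with scikit-learn, using invented documents (TfidfVectorizer and the sample texts are illustrative assumptions):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "the telescope observed a distant galaxy"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# "the" appears in every document and therefore gets the lowest possible IDF,
# while "galaxy" appears in only one document and gets the highest IDF;
# the final scores combine term frequency with this IDF weighting
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))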
Q8) Introduce social network analysis?
A8) The technique of investigating social structures using networks and graph theory is known as social network analysis (SNA). It characterises networked systems in terms of nodes (individual actors, persons, or items in the network) and the ties, edges, or links (relationships or interactions) that connect them. Social media networks, meme spread, information circulation, friendship and acquaintance networks, business networks, knowledge networks, difficult working relationships, collaboration graphs, kinship, and disease transmission are examples of social structures commonly visualised through social network analysis.
Sociograms, in which nodes are represented as points and links are depicted as lines, are frequently used to display these networks. These visualisations allow you to examine networks qualitatively by changing the visual depiction of their nodes and edges to reflect different qualities.
In modern sociology, social network analysis has become a crucial technique. It is now widely available as a consumer tool in the fields of anthropology, biology, demography, communication studies, economics, geography, history, information science, organisational studies, political science, public health, social psychology, development studies, sociolinguistics, and computer science (see the list of SNA software).
SNA has two distinct advantages. First, it can process a vast amount of relational data and describe the overall topology of a relational network, and parameters such as in-degree and out-degree centrality can then be selected to confirm the network's influential nodes. Second, one can interpret the SNA context and choose which parameters to use to identify the "centre" based on the network's characteristics. The communication structure and position of individuals may be precisely characterised by examining nodes, clusters, and relations.
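As a small illustration of nodes, edges, and centrality measures, the sketch below uses the NetworkX library on an invented acquaintance network (the names and edges are made up):
import networkx as nx

# Nodes are people; edges are relationships (an invented example network)
G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Ann", "Cara"), ("Ann", "Dev"),
                  ("Bob", "Cara"), ("Dev", "Eli")])

# Degree centrality highlights the most connected ("influential") nodes
print(nx.degree_centrality(G))
# Betweenness centrality highlights nodes that bridge different parts of the network
print(nx.betweenness_centrality(G))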
History
Early sociologists such as Georg Simmel and Émile Durkheim spoke on the necessity of analysing patterns of relationships that connect social actors, which laid the groundwork for social network analysis. Since the early twentieth century, social scientists have used the term "social networks" to describe complex sets of relationships between members of social systems at various dimensions, from interpersonal to worldwide.
Jacob Moreno and Helen Jennings created basic analytical procedures in the 1930s. In 1954, John Arundel Barnes began using the phrase systematically to define patterns of relationships, embracing both public and social science notions such as bounded groupings (e.g., tribes, families) and social categories (e.g., gender, ethnicity). Ronald Burt, Kathleen Carley, Mark Granovetter, David Krackhardt, Edward Laumann, Anatol Rapoport, Barry Wellman, Douglas R. White, and Harrison White were among the first to apply systematic social network analysis to their research.
SNA has been frequently employed in research on second language learning when studying abroad. Anheier, Gerhards and Romo, Wouter De Nooy, and Burgert Senekal have all used network analysis in the study of literature. Indeed, social network analysis has been used in a variety of academic fields as well as in practical applications such as money laundering and terrorist prevention.
Q9) Introduce business analysis?
A9) "Business analyst" is the term used to describe someone who specialises in business analytics. Business analysts are in great demand because business decisions are more complex than ever before, and analysts help ensure that those decisions are based on the most accurate, valid, and reliable data available.
Analysts are experts in data analytics, particularly in the corporate world. Analysts are familiar with the primary business processes and how they are influenced by trends both outside and inside the company.
Analysts may concentrate their knowledge on stakeholder analysis, marketing, finances, risk, and information technology, depending on the business's current strategic priorities.
Business analytics is a data management solution and a subset of business intelligence that involves using methodologies like data mining, predictive analytics, and statistical analysis to analyse and transform data into useful information, identify and predict trends and outcomes, and make better, data-driven business decisions.
The following are the primary components of a typical business analytics dashboard:
● Data Aggregation: Data must first be obtained, sorted, and filtered, either through volunteered data or transactional records, before it can be analysed.
● Data Mining: To detect trends and establish links, data mining for business analytics filters through enormous datasets using databases, statistics, and machine learning.
● Association and Sequence Identification: the discovery of predictable activities that are carried out in conjunction with other acts or in a sequential order.
● Text Mining: For qualitative and quantitative analysis, it examines and organises big, unstructured text databases.
● Forecasting: Analyses historical data from a given time period in order to make educated predictions about future occurrences or behaviours.
● Predictive Analytics: Predictive business analytics employs a number of statistical techniques to build predictive models that extract data from datasets, discover patterns, and provide a score for a variety of organisational outcomes.
● Optimisation: Once patterns have been discovered and predictions have been made, businesses can use simulation tools to test out best-case scenarios.
● Data Visualisation: For easy and quick data interpretation, visual representations such as charts and graphs are provided.
The essentials of business analytics are typically classified as descriptive analytics, which analyses historical data to understand what has happened and how a unit responded to a set of variables; predictive analytics, which examines historical data to determine the likelihood of specific future outcomes; and prescriptive analytics, which combines descriptive and predictive analytics to recommend what should be done next.
The operation and management of clinical information systems in the healthcare industry, the tracking of player spending and development of retention efforts in casinos, and the streamlining of fast food restaurants by monitoring peak customer hours and identifying when certain food items should be prepared based on assembly time are just a few examples of business analytics applications.
Modern, high-quality business analytics software solutions and platforms are designed to ingest and process the massive datasets that businesses confront and can use to run their operations more efficiently.
Q10) What are the Types of model selection?
A10) Types of model selection
Resampling methods
Resampling methods are basic strategies for rearranging data samples to see if the model works well on data samples that haven't been trained on. To put it another way, resampling allows us to see if the model will generalise effectively.
Random Split
Random Splits are used to sample a percentage of data at random and divide it into training, testing, and, ideally, validation sets. The advantage of this strategy is that the original population is likely to be well represented in all three groupings. Random splitting, to put it another way, prevents biassed data sampling.
It's crucial to remember that the validation set is used in model selection. The validation set is the second test set, and it's understandable to wonder why there are two test sets.
The test set is used to evaluate the model during the feature selection and tuning phase. This signifies that the model parameters and feature set have been chosen to produce the best results on the test set. As a result, the validation set is used for the final evaluation, which contains wholly unseen data points (not used in the tuning and feature selection modules).
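A minimal sketch of a random split with scikit-learn; the 60/20/20 proportions, the synthetic data, and the variable names are assumptions for illustration:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.random.RandomState(0).randint(0, 2, size=1000)

# First set aside 20% as the validation set (kept completely unseen for the final evaluation),
# then split the remainder 75/25 into training and test sets used for tuning
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_test), len(X_val))   # 600 200 200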
Time-Based Split
There are several forms of data that cannot be split randomly. For example, if we need to train a weather forecasting model, we can't divide the data into training and testing sets at random. The seasonal pattern will be thrown off! Time Series is a phrase used to describe this type of data.
A time-wise split is employed in such instances. The training set can include data from the previous three years as well as the first ten months of this year. The testing or validation set can be saved for the last two months.
There's also the concept of window sets, in which the model is trained until a specific date and then tested on future dates iteratively, with the training window changing by one day each time (consequently, the test set also reduces by a day). The advantage of this method is that it stabilizes the model and prevents overfitting when the test set is very small (say, 3 to 7 days).
Time-series data, on the other hand, has the disadvantage that the occurrences or data points are not mutually independent. A single occurrence could have an impact on all subsequent data inputs.
A change in the ruling party, for example, could have a significant impact on population numbers in the years ahead. Alternatively, the infamous coronavirus pandemic will have a significant impact on economic data in the next years.
In this situation, no machine learning model can learn from previous data because the data points before and after the occurrence are vastly different.
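A sketch of a time-wise split using scikit-learn's TimeSeriesSplit, which always keeps the training window earlier than the test window (the 24 monthly observations are invented):
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Imagine 24 months of observations already sorted in chronological order
X = np.arange(24).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Training data always precedes test data; the training window grows each iteration
    print("train:", train_idx, "test:", test_idx)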
K-Fold Cross-Validation
The cross-validation technique shuffles the dataset at random and then divides it into k groups. Following that, when iterating over each group, the group should be considered a test set, while the rest of the groups should be combined into a training set. The model is then tested on the test group, and the process is repeated for the remaining k groups.
As a result, at the end of the process, one will have k different test group findings. The best model can then be readily chosen by selecting the model with the highest score.
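A minimal k-fold cross-validation sketch with scikit-learn (the synthetic dataset, logistic regression model, and k = 5 are illustrative assumptions):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)          # shuffle, then split into 5 groups
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

# One score per fold; the model with the best average score would be selected
print(scores, scores.mean())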
Stratified K-Fold
The technique for stratified K-Fold is similar to that of K-Fold cross-validation with one major difference: unlike k-fold cross-validation, stratified k-fold considers the values of the target variable.
If the target variable is a categorical variable with two classes, for example, stratified k-fold assures that each test fold has the same ratio of the two classes as the training set.
This makes the model evaluation more accurate and the model training less biassed.
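The only change needed for the stratified variant is the splitter. The short sketch below, with an invented imbalanced target, shows that every test fold preserves the original class ratio:
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)      # imbalanced target: 90% class 0, 10% class 1

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Every test fold keeps roughly the 90/10 class ratio of the full dataset
    print(np.bincount(y[test_idx]))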
Bootstrap
One of the most powerful methods for obtaining a stable model is to use Bootstrap. Because it is based on the concept of random sampling, it is similar to the random splitting technique.
The first step is to decide how big the sample will be (usually equal to the size of the original dataset). After that, a random data point from the original dataset is chosen and added to the bootstrap sample. Crucially, the chosen point is then returned to the original dataset so that it can be drawn again. This procedure is repeated N times, with N denoting the sample size.
As a result, the bootstrap sample is created by sampling data points from the original dataset with replacement, which is a resampling technique. This indicates that numerous instances of the same data point can be found in the bootstrap sample.
The model is trained on the bootstrapped sample and then tested on any data points that were not included in the bootstrapped sample. Out-of-bag samples are what they're called.
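A small sketch of bootstrap sampling with out-of-bag evaluation (the dataset, the logistic regression model, and the 20 repetitions are illustrative assumptions):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

X, y = make_classification(n_samples=300, random_state=0)
n = len(X)
scores = []

for i in range(20):
    # Draw n indices WITH replacement: this is the bootstrap sample
    boot_idx = resample(np.arange(n), replace=True, n_samples=n, random_state=i)
    oob_idx = np.setdiff1d(np.arange(n), boot_idx)     # out-of-bag points were never drawn

    model = LogisticRegression(max_iter=1000).fit(X[boot_idx], y[boot_idx])
    scores.append(accuracy_score(y[oob_idx], model.predict(X[oob_idx])))

print(np.mean(scores))                                  # average out-of-bag accuracy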
Q11) How to evaluate models?
A11) Multiple measures can be used to evaluate models. The correct evaluation metric, on the other hand, is critical and often depends on the problem being solved. A thorough understanding of a variety of metrics can aid the evaluator in finding a good match between the problem description and a metric.
Classification metrics
A matrix called the confusion matrix can be generated for any classification model, displaying the number of test instances that were classified correctly and incorrectly.
It looks like this (assuming that the goal classes are 1 - Positive and 0 - Negative):
| Actual 0 | Actual 1 |
Predicted 0 | True Negatives (TN) | False Negatives (FN) |
Predicted 1 | False Positives (FP) | True Positives (TP) |
TN: Number of accurately classified negative cases
TP: Total number of accurately classified positive cases
FN: The number of positive cases that were mistakenly labelled as negative.
FP: The number of negative cases that were incorrectly labelled as positive.
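A short scikit-learn sketch computing these four quantities for a set of invented true and predicted labels (note that scikit-learn places actual classes on the rows and predicted classes on the columns, the transpose of the table above):
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]    # actual classes (illustrative)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]    # model predictions (illustrative)

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                # 3 1 1 3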
Q12) Define holdout method?
A12) The Holdout Method is the most basic way to evaluate a classifier. The data set (a collection of data items or examples) is divided into two sets using this method: the Training set and the Test set.
A classifier is a programme that assigns data objects in a collection to one of several target categories or classes.
Example
Spam and non-spam e-mails are separated in our inbox.
The accuracy, error rate, and error estimates of the classifier should all be determined. It can be done in a variety of ways. The 'Holdout Method' is one of the most basic approaches for classifier evaluation.
In the holdout technique, the data set is partitioned so that the larger portion of the data goes to the training set and the remaining data belongs to the test set.
The hold-out approach for training a machine learning model is to split the data into multiple splits and use one split for training and the other splits for validating and testing the models. Both model evaluation and model selection are done using the hold-out method.
When all of the data is used to train the model using various algorithms, the problem of evaluating the models and choosing the best one remains. The main goal is to figure out which model has the lowest generalisation error out of all the others. In other words, which model outperforms all others in predicting future or unknown datasets? This necessitates the use of a technique that allows the model to be trained on one dataset and tested on another. Here's where the hold-out strategy comes into play.
Hold-out method for Model Evaluation
The hold-out approach for model evaluation is a methodology for dividing a dataset into training and test datasets and then analysing model performance to find the best model. The hold-out method for model evaluation is depicted here.
Fig 2: Hold-out method for model evaluation
You'll notice that the data set is divided into two pieces in the diagram above. One split is set aside or reserved for the model's training. Another set is set aside or held back for model testing and evaluation. The split % is determined by the amount of data that is available for training. In most cases, a 70-30 percent split is utilised to split the dataset, with 70% of the dataset being used for training and 30% being used for testing the model.
If the goal is to compare models based on model accuracy on the test dataset and select the best model, this technique is ideal. However, there's a chance that attempting to employ this strategy will result in the model fitting well to the test dataset. In other words, the models are trained on the test dataset to enhance model accuracy, assuming the test dataset reflects the population. As a result, the test error becomes a generalisation error estimate that is optimistically biassed. That, however, is not desirable. Because it was trained to fit well (or overfit) the test data, the final model fails to generalise well to unknown or future datasets.
The hold-out approach is used to evaluate models in the following way:
● Divide the data into two sections (preferably based on 70-30 percent split; However, the percentage split will vary).
● Train the model on the training dataset; pick a fixed set of hyper parameters while training the model.
● On the held-out test dataset, test or evaluate the model.
● To get a model that can generalise better on the unknown or future dataset, train the final model on the full dataset.
Note that this method is used to evaluate models employing a fixed number of hyper parameters and partitioning the dataset into training and test datasets. Another method is to divide the data into three sets and use these three sets to select a model or tune hyperparameters.
Q13) What is Random Sub-sampling Method?
A13) Let's take a closer look at how random subsampling works:
● Random Subsampling executes 'k' iterations over the entire dataset; in each iteration we produce a fresh train-test split (a 'replica') of the provided data.
● A fixed number of observations is picked without replacement for each iteration [for each replica] and set aside as a test set.
● Each iteration fits the model to the training set, and each test set yields an estimate of prediction error.
● Let Ei denote the estimated PE (prediction error) in the ith test set. The average of the individual estimations Ei yields the genuine error estimate.
Pros - For sparse datasets, it is a better strategy than the Hold out method.
Cons - There's a probability that the identical record in the test set will be selected for subsequent iterations.
Example
As previously stated,
Let's say we choose k = 4 as the number of iterations;
Then we make k copies of the provided data set and divide them into train and test sets. (To form the train-test split, we apply the without replacement approach for each iteration — the same as the hold out method).
The model is then fitted to each train set before being evaluated on the test set. We now have four errors, and the final error is the average of the four.
This method is preferable to the Hold out method since we may receive a different test-train set each time. As a result, if all records from class B are present in the training set, the model will be able to learn B patterns more effectively.
In terms of drawbacks, there's a potential that the same record may be selected repeatedly in test sets.
Suppose, for instance, that a majority of class B records end up in the test set in iterations 2, 3 and 4. We then run into the same issue, in which our model is unable to learn the B class and fails validation.
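In practice, random subsampling corresponds closely to scikit-learn's ShuffleSplit; a brief sketch with illustrative settings (k = 4 iterations and a 25% test size) is shown below:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# k = 4 iterations, each with a freshly drawn random test set
ss = ShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=ss)

# The average over the k iterations is the final error/score estimate
print(scores, scores.mean())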
Another effective way for resolving this issue is K-fold Cross Validation.
Q14) Write short notes on the confusion matrix?
A14) The Confusion Matrix is a graphic depiction of actual vs predicted values. It is a table-like structure that gauges the performance of our Machine Learning classification model.
Looking at the confusion matrix is a much better technique to assess a classifier's performance. The basic concept is to keep track of how many times examples of class A are categorised as class B. For example, you may check at the 5th row and 3rd column of the confusion matrix to see how many times the classifier confused images of 5s with images of 3s.
A binary classification problem's Confusion Matrix looks like this:
 | Actual 0 | Actual 1 |
Predicted 0 | True Negatives (TN) | False Negatives (FN) |
Predicted 1 | False Positives (FP) | True Positives (TP) |
Elements of Confusion Matrix
It depicts various combinations of Actual vs. Predicted values. Let's take a look at each one individually.
True Positive: Values that were both truly positive and projected to be positive.
FP stands for False Positive, which refers to results that were truly negative but were incorrectly forecasted as positive. Type I Error is another name for it.
FN stands for False Negative, which refers to values that were actually positive but were incorrectly forecasted as negative. Type II Error is another name for it.
TN: True Negative: Values that were both negative and projected to be negative.
The confusion matrix provides a lot of information, but you might prefer a more simple metric at times.
Precision
Precision = (TP) / (TP+FP)
Here TP is the number of true positives and FP is the number of false positives.
Making one single positive prediction and ensuring it is correct (precision = 1/1 = 100 percent) is a simple technique to achieve perfect precision. This would be ineffective since the classifier will discard all positive instances except one.
Recall
Recall = (TP) / (TP+FN)
For example, consider a classifier built to detect images of the digit 5 that achieves a precision of 72.9 percent and a recall of 75.6 percent: it is correct only 72.9 percent of the time when it believes an image represents a 5, and it only detects 75.6 percent of the actual 5s.
When you need a quick way to compare two classifiers, it's typically easier to combine precision and recall into a single metric called the F1 score.
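A brief sketch computing precision, recall, and the F1 score (the harmonic mean of precision and recall) with scikit-learn, using the same kind of invented labels as above:
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # 2 * precision * recall / (precision + recall)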
Q15) Define AUC - ROC curve?
A15) The Area Under the Curve – Receiver Operating Characteristic (AUC-ROC) curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve, and AUC represents the degree or measure of separability; it indicates how well the model can distinguish between classes. The higher the AUC, the better the model predicts 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model distinguishes between people who have the condition and those who do not.
The ROC curve is plotted with the True Positive Rate (TPR) on the y-axis and the False Positive Rate (FPR) on the x-axis.
How Does the AUC-ROC Curve Work?
In a ROC curve, a greater X-axis value indicates a higher number of False Positives relative to True Negatives, while a greater Y-axis value indicates a higher number of True Positives relative to False Negatives. As a result, the choice of threshold depends on the ability to balance False Positives and False Negatives.
Let's delve a little deeper to see how our ROC curve would appear for various threshold values, as well as how the specificity and sensitivity would change.
We can try to comprehend this graph by creating a confusion matrix for each point that corresponds to a threshold and discussing our classifier's performance:
The maximum sensitivity and lowest specificity are found at point A. This means that all Positive class points are correctly classified, while all Negative class points are wrongly classified.
Any point on the blue line represents a circumstance in which True Positive Rate equals False Positive Rate.
All points above this line represent a situation in which the proportion of correctly categorised points in the Positive class exceeds the proportion of mistakenly identified points in the Negative class.
Point B has a higher Specificity than Point A, although having the same Sensitivity. In other words, compared to the prior threshold, the amount of wrongly Negative class points is fewer. This means that this threshold is superior to the one before it.
For the same Specificity, the Sensitivity at point C is higher than the Sensitivity at point D. This signifies that the classifier predicted a higher number of Positive class points for the same number of erroneously categorised Negative class points. As a result, the point C threshold is superior to point D.
Now, depending on how many wrongly classified points we are willing to tolerate for our classifier, we would choose either point B or point C as the threshold.
The maximum level of specificity is at point E. There are no False Positives in the model's classification. All of the Negative class points are accurately classified by the model! If our problem was to provide ideal song recommendations to our users, we would choose this location.
Can you figure out where the point on the graph that corresponds to a perfect classifier would be based on this logic?
Yes! It would be in the top-left corner of the ROC graph, which corresponds to the cartesian coordinate (0, 1). The classifier would accurately classify all of the Positive and Negative class points since both the Sensitivity and Specificity would be at their highest.
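A minimal sketch of plotting a ROC curve and computing the AUC with scikit-learn (the synthetic dataset and the logistic regression classifier are assumptions made for illustration):
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]        # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, probs))

plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], '--', label='TPR = FPR')  # the diagonal reference line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend()
plt.show()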
Q16) Describe Elbow plot?
A16) The optimal number of clusters into which the data can be grouped is a crucial stage in any unsupervised technique. One of the most prominent approaches for determining the ideal value of k is the Elbow Method.
The provided method is now demonstrated utilising the K-Means clustering methodology and the Python Sklearn module.
Step 1: Importing the essential libraries.
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt
Step 2: Data Creation and Visualization
# Creating the data
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)
# Visualizing the data
plt.plot()
plt.xlim([0, 10])
plt.ylim([0, 10])
plt.title('Dataset')
plt.scatter(x1, x2)
plt.show()
We can observe from the above plot that the ideal number of clusters is around 3. However, simply visualising the data does not always yield the correct answer, so we demonstrate the elbow method in the following steps.
Now we'll define the following terms:
● Distortion: The average of the squared distances from the cluster centres of the respective clusters is used to compute distortion. The Euclidean distance measure is commonly employed.
● Inertia: Inertia is defined as the total of the squared distances between samples and the cluster centre.
We cycle through the values of k from 1 to 9 and calculate the distortion and inertia for each value of k in the specified range.
Step 3: Create the clustering model and calculate the Distortion and Inertia values:
distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 10)

for k in K:
    # Building and fitting the model
    kmeanModel = KMeans(n_clusters=k).fit(X)

    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_,
                                        'euclidean'), axis=1)) / X.shape[0])
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = sum(np.min(cdist(X, kmeanModel.cluster_centers_,
                                   'euclidean'), axis=1)) / X.shape[0]
    mapping2[k] = kmeanModel.inertia_
Step 4: Tabulating and Visualizing the results
a) Using the different values of Distortion:
for key, val in mapping1.items():
    print(f'{key} : {val}')

plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()
b) Using the different values of Inertia:
for key, val in mapping2.items():
    print(f'{key} : {val}')

plt.plot(K, inertias, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()
To find the optimal number of clusters, we must find the value of k at the "elbow," that is, the point at which the distortion/inertia begins to decrease linearly. As a result, we find that the best number of clusters for the given data is three.
The clustered data points for varying values of k:
1. k = 1
2. k = 2
3. k = 3
4. k = 4
Q17) What are some Stopping Criteria for k-Means Clustering?
A17) Common stopping conditions for k-means include:
● Convergence. No further changes, points stay in the same cluster.
● The maximum number of iterations. When the maximum number of iterations has been reached, the algorithm will be stopped. This is done to limit the runtime of the algorithm.
● Variance did not improve by at least x
● Variance did not improve by at least x * initial variance
If you use MiniBatch k-means, it will not converge, so you need one of the other criteria. The usual one is the number of iterations.
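In scikit-learn's KMeans these stopping criteria map onto the max_iter and tol parameters; a brief illustrative sketch (the blob dataset and the parameter values are assumptions):
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# max_iter caps the number of iterations; tol stops early once the centroids barely move
km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10, random_state=0).fit(X)
print(km.n_iter_)    # iterations actually used before a stopping criterion was met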
Q18) How to determine k using the Elbow Method?
A18) Calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of k, and choose the k at which the decrease in WSS first starts to diminish. In the plot of WSS versus k, this is visible as an elbow.
Implementation
from sklearn.cluster import KMeans

# function returns WSS score for k values from 1 to kmax
def calculate_WSS(points, kmax):
    sse = []
    for k in range(1, kmax + 1):
        kmeans = KMeans(n_clusters=k).fit(points)
        centroids = kmeans.cluster_centers_
        pred_clusters = kmeans.predict(points)
        curr_sse = 0
        # calculate square of Euclidean distance of each point from its cluster center and add to current WSS
        for i in range(len(points)):
            curr_center = centroids[pred_clusters[i]]
            curr_sse += (points[i, 0] - curr_center[0]) ** 2 + (points[i, 1] - curr_center[1]) ** 2
        sse.append(curr_sse)
    return sse
Q19) Use the K-Means algorithm to create two clusters for the points A(2, 2), B(3, 2), C(1, 1), D(3, 1) and E(1.5, 0.5)?
A19) We follow the above discussed K-Means Clustering Algorithm.
Assume A(2, 2) and C(1, 1) are centers of the two clusters.
Iteration-01:
● We calculate the distance of each point from each of the center of the two clusters.
● The distance is calculated by using the euclidean distance formula.
The following illustration shows the calculation of distance between point A(2, 2) and each of the center of the two clusters-
Calculating Distance Between A(2, 2) and C1(2, 2)-
ρ(A, C1)
= sqrt[ (x2 – x1)² + (y2 – y1)² ]
= sqrt[ (2 – 2)² + (2 – 2)² ]
= sqrt[ 0 + 0 ]
= 0
Calculating Distance Between A(2, 2) and C2(1, 1)-
ρ(A, C2)
= sqrt[ (x2 – x1)² + (y2 – y1)² ]
= sqrt[ (1 – 2)² + (1 – 2)² ]
= sqrt[ 1 + 1 ]
= sqrt[ 2 ]
= 1.41
In the similar manner, we calculate the distance of other points from each of the center of the two clusters.
Next,
● We draw a table showing all the results.
● Using the table, we decide which point belongs to which cluster.
● The given point belongs to that cluster whose center is nearest to it.
Given Points | Distance from center (2, 2) of Cluster-01 | Distance from center (1, 1) of Cluster-02 | Point belongs to Cluster |
A(2, 2) | 0 | 1.41 | C1 |
B(3, 2) | 1 | 2.24 | C1 |
C(1, 1) | 1.41 | 0 | C2 |
D(3, 1) | 1.41 | 2 | C1 |
E(1.5, 0.5) | 1.58 | 0.71 | C2 |
From here, New clusters are-
Cluster-01:
First cluster contains points-
● A(2, 2)
● B(3, 2)
● D(3, 1)
Cluster-02:
Second cluster contains points-
● C(1, 1)
● E(1.5, 0.5)
Now,
● We re-compute the new cluster centers.
● The new cluster center is computed by taking mean of all the points contained in that cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 3 + 3)/3, (2 + 2 + 1)/3)
= (2.67, 1.67)
For Cluster-02:
Center of Cluster-02
= ((1 + 1.5)/2, (1 + 0.5)/2)
= (1.25, 0.75)
This is completion of Iteration-01.
Next, we go to iteration-02, iteration-03 and so on until the centers do not change anymore.
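For reference, the short scikit-learn sketch below runs KMeans on the same five points, initialised at A(2, 2) and C(1, 1); it is only a verification aid (not part of the original exercise) and reproduces the clusters and centroids derived above:
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])   # A, B, C, D, E
init_centers = np.array([[2, 2], [1, 1]])                          # start from A and C

km = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(points)
print(km.labels_)            # cluster assignment of A, B, C, D, E
print(km.cluster_centers_)   # final centroids after convergence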