Unit - 3
Clustering
Q1) Define clustering?
A1) Clustering
Clustering is the process of grouping data. Based on some similarity measure, the resulting groups should be such that data points within the same group are similar to one another and data points in different groups are dissimilar. A good clustering algorithm produces groups that maximize similarity within each group while minimizing similarity between different groups.
Application of clustering
Classes are conceptually meaningful groups of objects that share common characteristics. Data described in terms of classes gives a clearer picture than raw data.
An image, for example, may be defined as "a picture of houses, vehicles, and people" or by specifying the pixel value. The first approach is more useful for finding out what the picture is about.
Clustering is also seen as a prelude to more sophisticated data processing techniques.
As an example,
● A music streaming service could divide users into groups based on their musical preferences. Instead of estimating music recommendations for each individual person, these classes can be measured. It's also simpler for a new consumer to find out who their closest neighbors are.
● Time is proportional to n² in linear regression, where n is the number of samples. As a result, nearby data points may be grouped together as a single representative point in large datasets. On this smaller dataset, regression can be used.
● There may be millions of colors in a single picture. To make the color palette smaller, identical RGB values could be grouped together.
Q2) What are the types of clusters?
A2) Depending on the issue at hand, clusters may be of various forms. Clusters can be anything from
● Well separated clusters
Each data point in a well-separated cluster is closer to all points within the cluster than to any point outside of it. When clusters are well apart, this is more likely to occur.
● Prototype based clusters
In prototype-based clustering, each cluster is represented by a single prototype point. Data points are allocated to clusters so that each point is closer to its own cluster's prototype than to any other cluster's prototype.
● Graph based clusters
If the data is described as a graph, with data points as nodes and connections as edges, a cluster is a group of connected data points that have no connection to any point outside the cluster. Depending on the problem, the word "connected" can mean various things.
● Density based clusters
A density-based cluster is a dense set of data points surrounded by a low-density area.
Q3) Explain K - means clustering?
A3) K-Means Clustering is an unsupervised learning approach used in machine learning and data science to solve clustering problems. K specifies the number of predefined clusters that must be produced during the process; for example, if K=2, two clusters will be created, and if K=3, three clusters will be created, and so on.
It allows us to cluster data into different groups and provides a simple technique to determine the categories of groups in an unlabeled dataset without any training.
It's a centroid-based approach, which means that each cluster has its own centroid. The main goal of this technique is to reduce the sum of distances between data points and the clusters that they belong to.
For numeric data, K-Means clustering is one of the most commonly used prototype-based clustering algorithms. In k-means, the prototype of a cluster is the centroid, or mean, of all the data points in that cluster. As a consequence, the algorithm works best with continuous numeric data; when dealing with data that includes categorical variables, or a mixture of quantitative and categorical variables, the algorithm is less suitable.
The technique takes an unlabeled dataset as input, separates it into a k-number of clusters, and continues the procedure until no better clusters are found. In this algorithm, the value of k should be predetermined.
The k-means clustering algorithm primarily accomplishes two goals:
● Iteratively determines the optimal value for K centre points or centroids.
● Each data point is assigned to the k-center that is closest to it. A cluster is formed by data points that are close to a specific k-center.
As a result, each cluster contains datapoints with certain commonality and is isolated from the others.
Fig 1: Working of the K-means Clustering Algorithm
Pseudo Algorithm
- Choose an appropriate value of K (number of clusters we want)
- Generate K random points as initial cluster centroids
- Until convergence (Algorithm converges when centroids remain the same between iterations):
● Assign each point to a cluster whose centroid is nearest to it ("Nearness" is measured as the Euclidean distance between two points)
● Calculate new values of centroids of each cluster as the mean of all points assigned to that cluster
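Below is a minimal NumPy sketch of this pseudo algorithm; the synthetic two-blob data, the choice of K=2, and the iteration cap are illustrative assumptions, not part of the original text.

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Minimal K-Means: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged when the centroids stop changing between iterations
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and the sum of squared error (SSE) of the clustering
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sse = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, sse

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids, sse = k_means(X, k=2)
print(centroids, sse)
```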
K-Means as an optimization problem
Any learning algorithm aims to minimize a cost function. We'll see how, by using the centroid as the cluster prototype, K-Means minimizes a cost function called the "sum of squared error" (SSE):

SSE = Σ_{k=1}^{K} Σ_{x ∈ C_k} ||x − μ_k||²

where C_k is the set of points in cluster k, n_k is the number of points in cluster k, and μ_k = (1/n_k) Σ_{x ∈ C_k} x is its centroid.

Cluster centroids are thus the prototypes that minimize this cost function.
Q4) What is hierarchical clustering?
A4) Hierarchical clustering, also known as hierarchical cluster analysis or HCA, is another unsupervised machine learning approach for grouping unlabeled datasets into clusters.
The hierarchy of clusters is developed in the form of a tree in this technique, and this tree-shaped structure is known as the dendrogram.
Although the results of K-means clustering and hierarchical clustering may appear to be comparable at times, their methods differ. As opposed to the K-Means algorithm, there is no need to predetermine the number of clusters.
There are two ways to hierarchical clustering:
● Agglomerative - Agglomerative clustering is a bottom-up strategy in which the algorithm starts by treating each data point as its own cluster and repeatedly merges the closest clusters until only one remains (see the sketch after this list).
● Divisive - The divisive algorithm is the inverse of the agglomerative algorithm: it is a top-down approach that starts with all points in a single cluster and recursively splits it.
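A minimal agglomerative (bottom-up) sketch using SciPy; the two-blob data, the Ward linkage, and the cut into two clusters are illustrative choices, not prescribed by the text above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D data: two loose groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# Agglomerative clustering with Ward linkage; Z encodes the merge history (the dendrogram tree)
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree with matplotlib, if desired
```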
Q5) Why do we need hierarchical clustering?
A5) Why do we need hierarchical clustering when we already have alternative clustering methods like K-Means Clustering? As we've seen, the K-means algorithm has some limitations, such as requiring a predetermined number of clusters and always attempting to construct clusters of the same size. We can utilise the hierarchical clustering algorithm to tackle these problems because, in this algorithm, we don't need to specify the number of clusters in advance.
Q6) Describe dimensionality reduction?
A6) Dimensionality refers to the amount of input characteristics, variables, or columns in a dataset, and dimensionality reduction refers to the process of reducing these features.
In some circumstances, a dataset has a large number of input features, making predictive modelling more difficult. Because it is difficult to visualise or forecast a training dataset with a large number of characteristics, dimensionality reduction techniques must be used in such circumstances.
"It is a strategy of turning the higher dimensions dataset into fewer dimensions dataset while guaranteeing that it gives similar information," says one definition.
These methods are commonly utilised in machine learning to develop a more accurate predictive model while solving classification and regression challenges.
Speech recognition, signal processing, bioinformatics, and other fields that deal with high-dimensional data use it frequently. It can also be used to visualise data, reduce noise, and do cluster analysis, among other things.
Common techniques of Dimensionality Reduction
● Principal Component Analysis
● Backward Elimination
● Forward Selection
● Score comparison
● Missing Value Ratio
● Low Variance Filter
● High Correlation Filter
● Random Forest
● Factor Analysis
● Auto-Encoder
Q7) Write the Approaches of Dimension Reduction?
A7) The dimension reduction approach can be used in two ways, as shown below:
Feature selection
To develop a high-accuracy model, feature selection is the process of picking a subset of important characteristics and excluding irrelevant features from a dataset. In other words, it's a method for choosing the best characteristics from a dataset.
The feature selection is done in three ways:
- Filters method
The dataset is filtered in this manner, and only the relevant features are taken as a subset. The following are some examples of filtering techniques (a small sketch follows the list):
● Correlation
● Chi-Square Test
● ANOVA
● Information Gain, etc.
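As an illustration of a filter-style method, here is a short sketch using scikit-learn's SelectKBest with the chi-square test; the iris dataset and k=2 are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)             # 4 numeric, non-negative features
selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 most relevant features
X_reduced = selector.fit_transform(X, y)
print(selector.get_support())                 # boolean mask of the selected columns
```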
2. Wrappers Methods
The wrapper method has the same aim as the filter method, but it uses a machine learning model for the evaluation. In this procedure, different subsets of features are fed into the machine learning model and the performance is evaluated; based on that performance, features are added or removed to improve the model's accuracy. This method is more accurate than filtering, but it is more complex to work with.
The following are some examples of wrapper methods:
● Forward Selection
● Backward Selection
● Bi-directional Elimination
3. Embedded Methods
Embedded methods examine the machine learning model's successive training iterations and rank the importance of each feature. Examples of embedded methods include (a brief sketch follows the list):
● LASSO
● Elastic Net
● Ridge Regression, etc.
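A brief embedded-method sketch using LASSO from scikit-learn: features whose coefficients are driven to zero are effectively dropped. The synthetic regression data and the alpha value are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Illustrative data: 10 features, only 3 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
# Features whose coefficients survive (non-zero) are the ones the model keeps
selected = np.flatnonzero(lasso.coef_)
print("kept feature indices:", selected)
```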
Feature Extraction
The process of changing a space with many dimensions into a space with fewer dimensions is known as feature extraction. This method is handy when we want to keep all of the information while processing it with fewer resources.
The following are some examples of common feature extraction techniques (a short sketch follows the list):
● Principal Component Analysis
● Linear Discriminant Analysis
● Kernel PCA
● Quadratic Discriminant Analysis
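A short feature-extraction sketch with scikit-learn, projecting the data onto two components with plain PCA and with Kernel PCA; the iris dataset, the RBF kernel, and the number of components are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA

X, _ = load_iris(return_X_y=True)

# Linear PCA: project the 4-D data onto 2 principal components
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA: the same idea with a nonlinear (RBF) kernel
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)

print(X_pca.shape, X_kpca.shape)   # both (150, 2)
```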
Q8) Write the pros and cons of dimensionality reduction?
A8) Pros to applying dimensionality reduction
The following are some of the advantages of using a dimensionality reduction technique on a given dataset:
● Reducing the dimensionality of the features lowers the space required to store the dataset.
● Fewer feature dimensions require less computation and less training time.
● The reduced dimensionality of the dataset's features makes it easier to visualise the data quickly.
● It removes redundant features (if any are present) by taking care of multicollinearity.
Cons of Dimensionality reduction
There are a few drawbacks to using dimensionality reduction, which are listed below:
● Due to the reduction in dimensionality, some data may be lost.
● The principal components that must be considered in the PCA dimensionality reduction technique are occasionally unknown.
Q9) Explain principal component analysis?
A9) Principal Component Analysis is an unsupervised learning approach used in machine learning to reduce dimensionality. It is a statistical technique that, with the help of an orthogonal transformation, turns observations of correlated features into a set of linearly uncorrelated features. These newly transformed features are the Principal Components. It's one of the most widely used tools for exploratory data analysis and predictive modelling, and a method for extracting strong patterns from a dataset by reducing its dimensionality.
PCA seeks out the lowest-dimensional surface on which to project the high-dimensional data.
PCA takes the variance of each feature into account, because a feature with high variance indicates a good separation between classes, and this is what guides the reduction in dimensionality. Image processing, movie recommendation systems, and optimising power allocation in multiple communication channels are some of the real-world uses of PCA. Because it is a feature extraction technique, it keeps the significant variables while discarding the less important ones.
The PCA algorithm is based on the following mathematical concepts:
● Variance and Covariance
● Eigenvalues and Eigenvectors
Principal components in PCA
The Principal Components are the converted new features or the result of PCA, as stated above. The number of PCs in this dataset is either equal to or less than the number of original features in the dataset. The following are some of the properties of these main components:
● Each principal component must be a linear combination of the original features.
● These components are orthogonal, which means there is no correlation between any pair of components.
● The importance of each component decreases when going from 1 to n, meaning the 1st PC is the most important and the nth PC is the least important.
Applications
● PCA is mostly utilised as a dimensionality reduction approach in AI applications like computer vision and picture compression.
● If the data includes a lot of dimensions, it can also be utilised to find hidden patterns. Finance, data mining, psychology, and other fields employ PCA.
Q10) Write the Steps for PCA algorithm?
A10) Steps for PCA algorithm
● Getting the dataset
To begin, we must divide the input dataset into two halves, X and Y, with X being the training set and Y being the validation set.
● Representing data into a structure
Now we'll represent our data in a structure. For example, the independent variable X will be represented as a two-dimensional matrix, where each column corresponds to a feature and each row corresponds to a data item. The number of columns gives the dimensionality of the dataset.
● Standardizing the data
We'll standardise our data in this step. In a given column, features with a higher variance would otherwise be treated as more important than features with a lower variance. If the importance of a feature should not depend on its variance, we divide each data item in a column by the column's standard deviation. The resulting matrix will be referred to as Z.
● Calculating the Covariance of Z
To obtain the covariance of Z, we take the transpose of matrix Z and multiply it by Z. The resulting matrix is the covariance matrix of Z.
● Calculating the Eigen Values and Eigen Vectors
The eigenvalues and eigenvectors of the resulting covariance matrix of Z must now be calculated. The eigenvectors of the covariance matrix are the directions of the axes carrying the most information, and the eigenvalues are the coefficients (magnitudes) associated with these eigenvectors.
● Sorting the EigenVectors
We'll take all of the eigenvalues and arrange them in decreasing order, from largest to smallest, in this phase. The corresponding eigenvectors are sorted in the same order in the matrix P of eigenvectors; the resulting sorted matrix will be called P*.
● Calculating the new features Or Principal Components
We'll calculate the new features here by multiplying the P* matrix by Z. Each observation in the resulting matrix Z* is a linear combination of the original features, and the columns of Z* are independent of one another.
● Remove less or unimportant features from the new dataset
Now that the new feature set has been obtained, we'll decide what to keep and what to discard: only the relevant or important features are kept in the new dataset, while the unimportant ones are dropped (a NumPy sketch of these steps follows).
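The steps above can be followed almost line by line with NumPy. The sketch below uses made-up random data and keeps two components purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # illustrative data: 100 rows, 5 features

# Standardise: zero mean, unit standard deviation per column -> matrix Z
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance of Z (transpose of Z times Z, scaled by n - 1)
cov = (Z.T @ Z) / (len(Z) - 1)           # same as np.cov(Z, rowvar=False)

# Eigenvalues and eigenvectors of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Sort eigenvalues (and matching eigenvectors) in decreasing order -> P*
order = np.argsort(eig_vals)[::-1]
eig_vals, P_star = eig_vals[order], eig_vecs[:, order]

# New features (principal components): project Z onto the sorted eigenvectors
Z_star = Z @ P_star

# Keep only the most important components, discard the rest
n_components = 2
Z_reduced = Z_star[:, :n_components]
print(Z_reduced.shape)                   # (100, 2)
```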
Q11) What do you mean by anomaly detection?
A11) Anomaly Detection is a technique for finding uncommon occurrences or observations that are statistically distinct from the rest of the observations and can raise suspicions. Such "abnormal" behaviour usually indicates a problem, such as credit card fraud, a failed server system, a cyber assault, and so on.
Anomaly detection (also known as outlier detection) is the process of identifying unusual things, occurrences, or observations that raise suspicion by deviating greatly from the rest of the data. Typically, the anomalous things will point to a problem such as bank fraud, a structural flaw, medical issues, or textual flaws. Outliers, novelties, noise, deviations, and exceptions are all terms used to describe anomalies.
In the context of abuse and network intrusion detection, the intriguing objects are frequently sudden bursts of activity rather than infrequent items. Many outlier detection methods (in particular unsupervised approaches) may fail on such data unless it has been aggregated adequately, as this pattern does not adhere to the usual statistical definition of an outlier as an uncommon object. A cluster analysis method, on the other hand, might be able to recognise the micro clusters generated by these patterns.
Anomalies can be classified into one of three groups:
● Point Anomaly: If a tuple in a dataset deviates significantly from the rest of the data, it is referred to as a Point Anomaly.
● Contextual Anomaly: An observation is a Contextual Anomaly if it is anomalous due to the observation's context.
● Collective Anomaly: A group of data instances can aid in the detection of an abnormality.
The concepts of Machine Learning can be used to discover anomalies. It can be done in a variety of ways:
● Supervised Anomaly Detection: To build a prediction model that classifies future data points, this method requires a labelled dataset containing both normal and anomalous samples. Supervised neural networks, Support Vector Machines, the K-Nearest Neighbours classifier, and other techniques are often employed for this purpose.
● Unsupervised Anomaly Detection: This method does not require any training data and instead relies on two assumptions about the data: only a small fraction of the data is anomalous, and any anomaly is statistically distinct from the normal samples. The data is then clustered using a similarity measure based on these assumptions, and data points that lie far from the clusters are flagged as anomalies (a small sketch follows this list).
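A small unsupervised sketch using scikit-learn's IsolationForest; the synthetic data and the contamination fraction (the assumed share of anomalies) are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (200, 2))          # the bulk of the data
outliers = rng.uniform(-6, 6, (5, 2))        # a few points far from the rest
X = np.vstack([normal, outliers])

# Assume roughly 3% of the data is anomalous
model = IsolationForest(contamination=0.03, random_state=0)
labels = model.fit_predict(X)                # -1 = anomaly, 1 = normal
print(np.flatnonzero(labels == -1))          # indices flagged as anomalies
```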
Application
Intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, detecting ecological changes, and defect identification in images using machine vision are all examples of anomaly detection. It's frequently used in preprocessing to get rid of erroneous data from a dataset. Removing anomalous data from a dataset often results in a statistically significant gain in accuracy in supervised learning.
Q12) What is reinforcement learning?
A12) Reinforcement Learning is a feedback-based Machine Learning technique in which an agent learns how to behave in a given environment by executing actions and seeing the outcomes of those actions. The agent receives positive feedback for each excellent action, and negative feedback or a penalty for each bad action.
Unlike supervised learning, the agent learns autonomously utilising feedback and no labelled data in Reinforcement Learning. Because there is no labelled data, the agent must rely only on its own experience to learn.
RL is used to tackle a certain sort of problem in which sequential decision-making is required and the aim is long-term, such as game-playing, robotics, and so on. The agent interacts with and explores the world on its own. In reinforcement learning, an agent's primary goal is to increase performance by obtaining the most positive rewards.
The agent learns through trial and error, and as a result of its experience, it improves its ability to complete the task. As a result, "reinforcement learning" can be defined as "a form of machine learning method in which an intelligent agent (computer programme) interacts with the environment and learns how to function within it." Reinforcement learning is demonstrated by the way a robotic dog learns the movement of its arms.
Reinforcement learning is a fundamental idea in artificial intelligence, and it is used by all AI agents. We don't need to programme the agent ahead of time because it learns from its own experiences without the need for human assistance.
Q13) Write the feature of reinforcement learning?
A13) Features of Reinforcement learning
● The agent is not given prior instructions about the environment or the actions that must be taken.
● It is based on trial and error (the hit-and-trial method).
● The agent takes the next action and changes state based on the feedback from the previous action.
● The agent may receive a delayed reward.
● Because the environment is stochastic, the agent must explore it in order to maximise the positive rewards.
Q14) What are the approaches to implement Reinforcement Learning?
A14) In machine learning, there are primarily three methods for implementing reinforcement learning:
Value-based:
The value-based approach is concerned with determining the optimal value function, i.e., the maximum value at a given state under any policy. Accordingly, the agent expects a long-term return in any state(s) under the policy π (a tiny value-based sketch follows the three approaches).
Policy-based:
Without employing the value function, a policy-based approach is used to identify the best policy for the best future rewards. In this technique, the agent strives to apply a strategy in which each step's action contributes to the future reward being maximised.
There are primarily two types of policies in the policy-based approach:
● Deterministic: For any given state, the policy π always produces the same action.
● Stochastic: In this policy, the resulting action is determined by a probability distribution.
Model-based:
In the model-based method, a virtual model of the environment is generated, and the agent explores it to learn about it. Because the model representation differs depending on the context, there is no specific solution or algorithm for this approach.
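As an illustration of the value-based approach, here is a tiny tabular Q-learning sketch on a made-up one-dimensional corridor environment; the environment, rewards, and hyperparameters are all invented for the example.

```python
import numpy as np

# Toy environment: 5 states in a row, actions 0 = left, 1 = right.
# Reaching the rightmost state gives reward +1 and ends the episode.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))          # value estimates for (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.1        # learning rate, discount, exploration

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit current values, occasionally explore
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(Q[state].argmax())
        next_state, reward, done = step(state, action)
        # Q-learning update toward reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

# Greedy policy for the non-terminal states: expect all 1s (move right)
print(Q.argmax(axis=1)[:GOAL])
```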
Q15) What is the type of reinforcement learning?
A15) Reinforcement learning can be divided into two categories:
● Positive Reinforcement
● Negative Reinforcement
Positive Reinforcement:
Positive reinforcement learning entails adding something to increase the likelihood that the desired behaviour will occur again. It has a beneficial effect on the agent's behaviour and increases the strength of that behaviour.
This form of reinforcement can last a long period, but too much positive reinforcement might result in an overload of states, which can lessen the consequences.
Negative Reinforcement:
Negative reinforcement learning is the opposite of positive reinforcement learning: it increases the likelihood that the specific behaviour will occur again by avoiding the negative condition.
Depending on the situation and conduct, it may be more successful than positive reinforcement, although it only offers reinforcement for the bare minimum of activity.
Q16) Is Feature Scaling required for the K-means Algorithm?
A16) Yes, K-Means typically needs to have some form of normalization done on the datasets to work properly since it is sensitive to both the mean and variance of the datasets.
For performing feature scaling, generally, StandardScaler is recommended, but depending on the specific use cases, other techniques might be more suitable as well.
For example, consider two variables, age and salary, where age is in the range of 20 to 60 and salary is in the range of 100K to 150K. Since the scales of these variables are different, when they are substituted into the Euclidean distance formula, the variable on the larger scale suppresses the variable on the smaller scale, so the impact of age will not be captured clearly. Hence, you have to scale the variables to the same range using a Standard Scaler, Min-Max Scaler, etc.
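A short sketch of scaling before K-Means with scikit-learn's StandardScaler; the synthetic age/salary data and the choice of three clusters are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Illustrative data: age (20-60) and salary (100K-150K) on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(20, 60, 100), rng.uniform(100_000, 150_000, 100)])

X_scaled = StandardScaler().fit_transform(X)   # mean 0, standard deviation 1 per column

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_[:10])
```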
Q17) What do you understand by Cluster Sampling?
A17) Cluster Sampling is a process of randomly selecting intact groups, sharing similar characteristics, within a defined population. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.
For example, if we are clustering the total number of managers in a set of companies, in that case, managers (sample) will represent elements and companies will represent clusters.
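A tiny sketch of cluster sampling in plain Python; the companies and managers are made-up names used only to mirror the example above.

```python
import random

# Made-up population: companies (clusters) and the managers (elements) in each
companies = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2"],
    "C": ["c1", "c2", "c3", "c4"],
    "D": ["d1", "d2"],
}

# Cluster sampling: randomly pick whole clusters, then keep every element in them
random.seed(0)
chosen = random.sample(list(companies), k=2)          # e.g. 2 companies
sample = [m for c in chosen for m in companies[c]]    # all managers in those companies
print(chosen, sample)
```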
Q18) What are the advantages and disadvantages of the K means Algorithm?
A18) Advantages:
● Easy to understand and implement.
● Computationally efficient for both training and prediction.
● Guaranteed convergence.
Disadvantages:
● We need to provide the number of clusters as an input variable to the algorithm.
● It is very sensitive to the initialization process.
● Good at clustering when we are dealing with spherical cluster shapes, but it will perform poorly when dealing with more complicated shapes.
● Because it relies on the Euclidean distance function, it is sensitive to outliers.
Q19) What are the challenges associated with K means Clustering?
A19) The major challenge associated with k means clustering is its initialization sensitivity.
While finding the initial centroids for K-Means clustering using Lloyd's algorithm, we use randomization, i.e., the initial k centroids are picked randomly from the data points.
This randomization in picking the k centroids creates the problem of initialization sensitivity, so the final clusters depend on how the initial centroids were picked. A common remedy, sketched below, is the k-means++ initialization combined with multiple restarts.
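A brief scikit-learn sketch contrasting a single random initialization with k-means++ plus multiple restarts; the blob data and parameter values are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Plain random initialization, a single run: the result depends on the initial pick
km_random = KMeans(n_clusters=4, init="random", n_init=1, random_state=0).fit(X)

# k-means++ spreads the initial centroids out, and n_init=10 keeps the best of 10 runs
km_pp = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

print(km_random.inertia_, km_pp.inertia_)   # lower SSE (inertia) is better
```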
Q20) How to decide the optimal number of K in the K means Algorithm?
A20) Most people answer this question directly with the Elbow Method; however, that answer is only partially correct.
In order to find the optimal value of k, we need to study the business problem carefully, analyse the business inputs, and consult the people who work with the data, so that a reasonable idea of the optimal number of clusters can be formed.
For Example, If we consider the data of a shopkeeper selling a product in which he will observe that some people buy things in summer, some in winter while some in between these two. So, the shopkeeper divides the customers into three categories. Therefore, K=3.
In cases where we do not get inference from the data directly we often use the following mentioned techniques:
Elbow Method – This method plots the percentage of variance explained (equivalently, the within-cluster sum of squared errors) against the number of clusters K and picks the elbow point, where adding further clusters yields diminishing returns.
Silhouette method – For each data point, the silhouette method calculates a similarity/dissimilarity score between its assigned cluster and the next best (i.e., nearest) cluster; averaging these scores gives a quality measure for each candidate K.
Moreover, there are other techniques, besides the ones mentioned above, for finding the optimal value of k; a short sketch of the elbow and silhouette computations follows.
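A short scikit-learn sketch of both techniques: it prints the within-cluster SSE (inertia, used for the elbow plot) and the average silhouette score for several candidate values of K. The blob data and the range of K are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the within-cluster SSE (look for the elbow as k grows);
    # the silhouette score is highest for the best-separated clustering
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```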