Unit - 1
Introduction to Data Science and Big Data
Q1) What are Basics of Data Science and Big Data?
A1) Basics of Data science and Big data
● Is data science only the stuff going on in companies like Google, Facebook, and other tech companies? Why do many people refer to Big Data as crossing disciplines (astronomy, finance, tech, etc.) and to data science as taking place only in tech? Just how big is big? Or is it just a relative term? These terms are so ambiguous, they’re well-nigh meaningless.
● There’s a distinct lack of respect for the researchers in academia and industry labs who have been working on this kind of stuff for years, and whose work is based on decades (in some cases, centuries) of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all types. From the way the media describes it, machine learning algorithms were just invented last week and data was never “big” until Google came along.
● This is simply not the case. Many of the methods and techniques we’re using—and the challenges we’re facing now—are part of the evolution of everything that’s come before. This doesn’t mean that there’s not new and exciting stuff going on, but we think it’s important to show some basic respect for everything that came before.
● The hype is crazy: people throw around tired phrases straight out of the height of the pre-financial crisis era like “Masters of the Universe” to describe data scientists, and that doesn’t bode well. In general, hype masks reality and increases the noise-to-signal ratio. The longer the hype goes on, the more many of us will get turned off by it, and the harder it will be to see what’s good underneath it all, if anything.
● Statisticians already feel that they are studying and working on the “Science of Data.” That’s their bread and butter. Maybe you, dear reader, are not a statistician and don’t care, but imagine that for the statistician, this feels a little bit like how identity theft might feel for you.
● Although we will make the case that data science is not just a rebranding of statistics or machine learning but rather a field unto itself, the media often describes data science in a way that makes it sound as if it’s simply statistics or machine learning in the context of the tech industry.
● People have said to us, “Anything that has to call itself a science isn’t.” Although there might be truth in there, that doesn’t mean that the term “data science” itself represents nothing, but of course what it represents may not be science but more of a craft.
Fig 1: Architecture of big data and data science
● There are many debates as to whether data science is a new field. Many argue that similar practices have been used and branded as statistics, analytics, business intelligence, and so forth. In either case, data science is a very popular and prominent term used to describe many different data-related processes and techniques that will be discussed here.
● Big data on the other hand is relatively new in the sense that the amount of data collected and the associated challenges continues to require new and innovative hardware and techniques for handling it.
● This article is meant to give the non-data scientist a solid overview of the many concepts and terms behind data science and big data. While related terms will be mentioned at a very high level, the reader is encouraged to explore the references and other resources for additional detail. Another post will follow as well that will explore related technologies, algorithms, and methodologies in much greater detail.
Q2) What is the need for data science and big data?
A2) Need for data science
The main goal of Data Science is to discover patterns in data. It analyses and draws conclusions from the data using a variety of statistical approaches. A Data Scientist must evaluate the data extensively from data extraction through wrangling and pre-processing.
Then it is up to them to make forecasts based on the data. A data scientist's mission is to draw conclusions from data, and those conclusions help businesses make better decisions.
Data is essential to drive progress in everything from business to the health industry, from science to our daily lives, and from marketing to research. Computer science and information technology have taken over our lives, and they are progressing at such a rapid and diverse rate that operational procedures used just a few years ago are now obsolete.
Challenges and issues have evolved in the same way. The challenges and concerns of the past for a particular theme, ailment, or deficiency may not be the same, in terms of complexity, as those of today.
To stay up with the difficulties of today and tomorrow, as well as to find answers to unresolved issues, every field of science and study, as well as every company, requires an updated set of operational systems and technologies.
Need for big data
The value of big data isn't solely determined by the amount of data available. Its worth is determined by how you use it. You can get answers that 1) streamline resource management, 2) increase operational efficiencies, 3) optimise product development, 4) drive new revenue and growth prospects, and 5) enable smart decision making by evaluating data from any source. When big data and high-performance analytics are combined, you can do business-related tasks such as:
● Determining the root causes of failures, issues, and defects in near-real time.
● Detecting anomalies faster and more accurately than the human eye.
● Improving patient outcomes by turning medical image data into insights as quickly as possible.
● Recalculating entire risk portfolios in minutes.
● Sharpening the ability of deep learning models to accurately classify and respond to changing variables.
● Detecting fraudulent activity before it has a negative impact on your company.
Q3) Write the applications of data science?
A3) Data science applications did not take on this role overnight. Thanks to faster computing and cheaper storage, we can now forecast outcomes in minutes that would once have taken many hours of human effort to process.
A Data Scientist earns a remarkable $124,000 per year, thanks to a scarcity of qualified workers in this industry. Python for Data Science Certifications are at an all-time high because of this!
Here are ten applications that build on data science concepts and span a variety of domains:
Fraud and Risk Detection
Finance was one of the first industries to use data science. Every year, companies were frustrated by bad loans and losses. However, they had a great deal of data that was collected during the initial paperwork for loan approvals, so they decided to bring in data scientists to help them recover from those losses.
Banking businesses have learned to divide and conquer data over time using consumer profiling, historical spending, and other critical indicators to assess risk and default possibilities. Furthermore, it aided them in promoting their banking products depending on the purchasing power of their customers.
Healthcare
Data science applications are very beneficial to the healthcare industry.
1. Medical Image Analysis
To identify suitable parameters for tasks such as lung texture classification, procedures such as tumour detection, artery stenosis detection, and organ delineation use a variety of approaches and frameworks, such as MapReduce. For solid texture classification, they use machine learning techniques such as support vector machines (SVM), content-based medical image indexing, and wavelet analysis.
2. Drug Development
The drug discovery process is quite complex and involves a wide range of disciplines. The best ideas are frequently constrained by billions of dollars in testing and significant commitments of money and time. A formal submission takes an average of twelve years.
From the first screening of therapeutic compounds through the prediction of the success rate based on biological parameters, data science applications and machine learning algorithms simplify and shorten this process, bringing a new viewpoint to each step. Instead of "lab experiments," these algorithms can predict how the substance will operate in the body using extensive mathematical modelling and simulations. The goal of computational drug discovery is to develop computer model simulations in the form of a physiologically appropriate network, which makes it easier to anticipate future outcomes with high accuracy.
3. Genetics & Genomics
Through genetics and genomics research, Data Science applications also provide a higher level of therapy customisation. The goal is to discover specific biological linkages between genetics, illnesses, and treatment response in order to better understand the impact of DNA on our health. Data science tools enable the integration of various types of data with genomic data in illness research, allowing for a better understanding of genetic concerns in medication and disease reactions. We will have a better grasp of human DNA as soon as we have solid personal genome data. Advanced genetic risk prediction will be a significant step toward more personalised care.
Internet Search
When you think about Data Science Applications, this is usually the first thing that comes to mind.
When we think of search, we immediately think of Google. Right? However, there are other search engines as well, such as Yahoo, Bing, Ask, AOL, and others. All of these search engines (including Google) use data science techniques to deliver the best result for a searched query in a matter of seconds. Consider that Google alone processes over 20 petabytes of data per day.
Targeted Advertising
If you thought Search was the most important data science use, consider this: the full digital marketing spectrum. Data science algorithms are used to determine practically anything, from display banners on various websites to digital billboards at airports.
This is why digital advertisements have a far higher CTR (click-through rate) than traditional advertisements. They can be tailored to a user's previous actions.
This is why you may see advertisements for data science training programs while someone else sees an advertisement for apparel in the same spot at the same time.
Website Recommendations
Aren't we all used to Amazon's suggestions for similar products? They not only assist you in locating suitable products from the billions of products accessible, but they also enhance the user experience.
Many businesses have aggressively employed this engine to promote their products in line with users' interests and the relevance of information. Internet companies such as Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDb, and many others use this technique to improve the user experience. The recommendations are based on a user's previous search results.
Advanced Image Recognition
You share a photograph on Facebook with your friends, and you start receiving suggestions to tag them. This automatic tag suggestion feature uses a face recognition algorithm.
Facebook's recent post details the extra progress they've achieved in this area, highlighting their improvements in image recognition accuracy and capacity.
Speech Recognition
Google Voice, Siri, Cortana, and other speech recognition products are some of the best examples. Even if you are unable to compose a message, your life will not come to a halt if you use the speech-recognition option. Simply say the message out loud, and it will be transformed to text. However, you will notice that voice recognition is not always correct.
Airline Route Planning
The airline industry has been known to suffer significant losses all over the world. Companies are fighting to retain their occupancy ratios and operational earnings, with the exception of a few aviation service providers. The issue has worsened due to the huge rise in air-fuel prices and the requirement to give significant discounts to clients. It wasn't long before airlines began to use data science to pinpoint important areas for development. Airlines can now, thanks to data science, do the following:
● Calculate the likelihood of a flight delay.
● Choose the type of plane you want to buy.
● Decide whether to fly directly to the destination or make a stop along the way (for example, a flight from New Delhi to New York can take a direct route, or it can choose to stop over in another country).
● Drive consumer loyalty programmes effectively.
Southwest Airlines and Alaska Airlines are two of the most well-known firms that have used data science to transform their business practices.
Gaming
Machine learning algorithms are increasingly used to create games that develop and upgrade as the player progresses through the levels. In motion gaming, your opponent (computer) also studies your previous moves and adjusts its game accordingly. EA Sports, Zynga, Sony, Nintendo, and Activision-Blizzard have all used data science to take gaming to the next level.
Augmented Reality
This is the last of these data science applications, and it is the one that appears to have the most potential for the future. Augmented reality refers to technology that overlays computer-generated content on a user's view of the real world.
Data science and virtual or augmented reality are linked because a headset combines computing expertise, algorithms, and data to give you the best possible viewing experience. The popular game Pokemon GO is a small step in that direction: you can wander around and see Pokemon superimposed on walls, streets, and other real-world places. The creators of this game used the location data from Ingress, the company's previous app, to choose the locations of the Pokemon and gyms.
Q4) Define data explosion?
A4) Data explosion
● Parallel to the expansion in IT companies' service offerings, there is growth in another environment - the data environment. The volume of data is practically exploding by the day. Not only that, the data that is available now is becoming increasingly unstructured. Statistics from IDC projected that global data would grow by up to 44 times, amounting to a massive 35.2 zettabytes (ZB - a billion terabytes).
● These factors, coupled with the need for real-time data, constitute the “Big Data” environment. How can organizations stay afloat in the big data environment? How can they manage this copious amount of data?
● I believe a three-tier approach to managing big data would be the key - the first tier to handle structured data, the second involving appliances for real-time processing and the third for analyzing unstructured content. Can this structure be tailored for your organization?
● No matter what the approach might be, organizations need to create a cost effective method that provides a structure to big data. According to a report by McKinsey & Company, accurate interpretation of Big Data can improve retail operating margins by as much as 60%. This is where information management comes in.
● Information management is vital to be able to summarise the data into a manageable and understandable form. It is also needed to extract useful and relevant data from the large pool that is available and to standardize the data. With information management, data can be standardized in a fixed form. Standardized data can be used to find underlying patterns and trends.
● All trends indicate that organizations have caught on to the importance of navigating the big data environment. They are maturing and modernizing their existing technologies to accommodate those that will help manage the influx of data. One worrying trend, though, is the lack of the talent pool necessary to capitalize on Big Data.
● Statistics say that the United States alone could face a shortage of 140,000 to 190,000 people with the requisite analytic and decision-making skills by 2018. Organizations are now looking for partners for effective information management, to form mutually beneficial, far-sighted arrangements.
● The challenge before the armed forces is to develop tools that enable extraction of relevant information from the data for mission planning and intelligence gathering. And for that, armed forces require data scientists like never before.
● Big Data describes a massive volume of both structured and unstructured data. This data is so large that it is difficult to process using traditional database and software techniques. While the term refers to the volume of data, it includes technology, tools and processes required to handle the large amounts of data and storage facilities.
Q5) What do you mean by V’s of big data?
A5) In recent years, the "3Vs" of Big Data have been replaced by the "5Vs," which are also known as the characteristics of Big Data and are as follows:
1. Volume
● Volume refers to the amount of data generated through websites, portals and online applications. Especially for B2C companies, Volume encompasses the available data that are out there and need to be assessed for relevance.
● Volume defines the data infrastructure capability of an organization’s storage, management and delivery of data to end users and applications. Volume focuses on planning current and future storage capacity - particularly as it relates to velocity - but also in reaping the optimal benefits of effectively utilizing a current storage infrastructure.
● Volume is the V most associated with big data because, well, volume can be big. What we’re talking about here is quantities of data that reach almost incomprehensible proportions.
● Facebook, for example, stores photographs. That statement doesn’t begin to boggle the mind until you start to realize that Facebook has more users than China has people. Each of those users has stored a whole lot of photographs. Facebook is storing roughly 250 billion images.
● Try to wrap your head around 250 billion images. Try this one. As far back as 2016, Facebook had 2.5 trillion posts. Seriously, that’s a number so big it’s pretty much impossible to picture.
● So, in the world of big data, when we start talking about volume, we're talking about insanely large amounts of data. As we move forward, we're going to have more and more huge collections. For example, as we add connected sensors to pretty much everything, all that telemetry data will add up.
● How much will it add up? Consider this: Gartner, Cisco, and Intel estimate there will be between 20 billion and 200 billion connected IoT devices (no, they don’t agree, surprise!), and the numbers are huge no matter what. But it's not just the quantity of devices.
● Consider how much data is coming off of each one. I have a temperature sensor in my garage. Even with a one-minute level of granularity (one measurement a minute), that’s still 525,950 data points in a year, and that’s just one sensor. If you have a factory with a thousand sensors, you’re looking at more than half a billion data points a year, just for the temperature alone.
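As a rough check of that arithmetic, here is a tiny Python sketch (the one-reading-per-minute rate and the 1,000-sensor factory are the same illustrative assumptions used above):

# Back-of-the-envelope check of the sensor arithmetic above
# (illustrative assumptions: one reading per minute, 1,000 sensors).
MINUTES_PER_YEAR = 60 * 24 * 365          # 525,600 readings from a single sensor
SENSORS = 1_000

readings_per_year = MINUTES_PER_YEAR * SENSORS
print(f"One sensor:    {MINUTES_PER_YEAR:,} data points per year")
print(f"{SENSORS:,} sensors: {readings_per_year:,} data points per year")   # roughly half a billion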
2. Velocity
● With Velocity we refer to the speed with which data are being generated. Staying with our social media example, every day 900 million photos are uploaded on Facebook, 500 million tweets are posted on Twitter, 0.4 million hours of video are uploaded on YouTube and 3.5 billion searches are performed in Google.
● This is like a nuclear data explosion. Big Data helps the company to hold this explosion, accept the incoming flow of data and at the same time process it fast so that it does not create bottlenecks.
● 250 billion images may seem like a lot. But if you want your mind blown, consider this: Facebook users upload more than 900 million photos a day. A day. So that 250 billion number from last year will seem like a drop in the bucket in a few months.
● Velocity is the measure of how fast the data is coming in. Facebook has to handle a tsunami of photographs every day. It has to ingest it all, process it, file it, and somehow, later, be able to retrieve it.
● Here’s another example. Let’s say you’re running a marketing campaign and you want to know how the folks “out there” are feeling about your brand right now. How would you do it? One way would be to license some Twitter data from Gnip (acquired by Twitter) to grab a constant stream of tweets, and subject them to sentiment analysis.
● That feed of Twitter data is often called “the firehose” because so much data is being produced, it feels like being at the business end of a firehose.
● Here’s another velocity example: packet analysis for cyber security. The Internet sends a vast amount of information across the world every second. For an enterprise IT team, a portion of that flood has to travel through firewalls into a corporate network.
3. Variety
● It refers to the structured, semi-structured, and unstructured data types.
● It can also refer to a variety of sources.
● Variety refers to the influx of data from new sources both inside and outside of an organisation. It might be organised, semi-organized, or unorganised.
● Structured data - is simply data that has been arranged. It usually refers to data that has been defined in terms of length and format.
● Semi-structured data - is a type of data that is semi-organized. It's a type of data that doesn't follow the traditional data structure. This type of data is represented by log files.
● Unstructured data - is just data that has not been arranged. It usually refers to data that doesn't fit cleanly into a relational database's standard row and column structure. Texts, pictures, videos etc. are examples of unstructured data which can’t be stored in the form of rows and columns.
4. Veracity
● It refers to data inconsistencies and uncertainty, i.e., available data can become untidy at times, and quality and accuracy are difficult to control.
● Because of the numerous data dimensions originating from multiple distinct data kinds and sources, Big Data is also volatile.
● For example, a large amount of data can cause confusion, yet a smaller amount of data can only convey half or incomplete information.
5. Value
● After considering the other four V's, there is one more V to account for: value. Most data has no value on its own and is useless to the organisation until it is converted into something beneficial.
● Data is of no utility or relevance in and of itself; it must be turned into something useful in order to extract information. As a result, value can be considered the most essential of the five V's.
Q6) What is the relationship between data science and information science?
A6) The finding of knowledge or actionable information in data is what data science is all about.
The design of procedures for storing and retrieving information is known as information science.
Data science Vs Information science
Data science and information science are two separate but related fields.
Data science is rooted in computer science and mathematics, whereas information science draws on library science, cognitive science, and communications.
Business tasks such as strategy formation, decision making, and operational processes all rely on data science. It encompasses artificial intelligence, analytics, predictive analytics, and algorithm design, among other topics.
Knowledge management, data management, and interaction design are all domains where information science is employed.
| Data science | Information science |
Definitions | The finding of knowledge or actionable information in data is what data science is all about. | The design of procedures for storing and retrieving information is known as information science. |
Q7) Compare business intelligence and data science?
A7) Business intelligence Vs Data science
Data science
Data science is a field in which data is mined for information and knowledge using a variety of scientific methods, algorithms, and processes. It can thus be characterised as a collection of mathematical tools, algorithms, statistics, and machine learning techniques that are used to uncover hidden patterns and insights in data to aid decision-making. Both organised and unstructured data are dealt with in data science. It has to do with data mining as well as big data. Data science is researching historical trends and then applying the findings to reshape current trends and forecast future trends.
Business intelligence
Business intelligence (BI) is a combination of technology, tools, and processes that businesses utilise to analyse business data. It is mostly used to transform raw data into useful information that can then be used to make business decisions and take profitable actions. It is concerned with the analysis of organised and unstructured data in order to open up new and profitable business opportunities. It favours fact-based decision-making over assumption-based decision-making. As a result, it has a direct impact on a company's business decisions. Business intelligence tools improve a company's prospects of entering a new market and aid in the analysis of marketing activities.
The following table compares and contrasts Data Science with Business Intelligence:
Factor | Data Science | Business Intelligence |
Concept | It is a discipline that employs mathematics, statistics, and other methods to uncover hidden patterns in data. | It is a collection of technology, applications, and processes that businesses employ to analyse business data. |
Focus | It is centred on the future. | It concentrates on both the past and the present. |
Data | It can handle both structured and unstructured data. | It primarily works with structured data. |
Flexibility | Data science is more adaptable since data sources can be added as needed. | It is less flexible because data sources for business intelligence must be planned ahead of time. |
Method | It employs the scientific process. | It employs the analytic method. |
Complexity | In comparison to business intelligence, it is more sophisticated. | When compared to data science, it is a lot easier. |
Expertise | Data scientists are its primary practitioners. | Business users are its primary audience. |
Questions | It addresses the questions of what will happen and what might happen. | It is concerned with the question of what occurred. |
Tools | SAS, BigML, MATLAB, Excel, and other programmes are among its tools. | InsightSquared Sales Analytics, Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire, and more solutions are among them. |
Q8) Explain data science life cycle?
A8) A data science life cycle is a series of data science steps that you go through to complete a project or analysis. Because each data science project and team is unique, each data science life cycle is also unique. Most data science projects, on the other hand, follow a similar generic data science life cycle.
Fig 1: A simplified Data Science Life Cycle
A General Data Science Life Cycle
Some data science life cycles concentrate just on the data, modelling, and evaluation stages. Others are more complete, beginning with an understanding of the company and ending with deployment.
And the one we'll go over is considerably bigger, as it includes operations. It also places a greater emphasis on agility than other life cycles.
There are five stages to this life cycle:
● Problem Definition
● Data Investigation and Cleaning
● Minimal Viable Model
● Deployment and Enhancements
● Data Science Ops
These data science steps do not follow a strictly straight line. Step one is completed first, followed by step two, but after that you should flow naturally between the steps as needed.
It is preferable to do several minor incremental steps rather than a few big comprehensive ones.
Fig 2: General Data Science Life Cycle
Q9) Describe data types?
A9) Data are quantities, characters, or symbols on which a computer performs operations, and which can be stored and communicated as electrical signals and recorded on magnetic, optical, or mechanical media.
Structured data
Structured data is any data that can be stored, accessed, and processed in a fixed format. Over time, software engineering has made significant progress in developing techniques for working with this kind of data and deriving value from it. Nonetheless, we foresee challenges as the size of such data grows to huge proportions, with typical quantities approaching zettabytes.
Structured data is the most straightforward type of big data to work with. It is data that is organised into measurements defined by set parameters.
It’s all your quantitative data:
● Address
● Debit/credit card numbers
● Age
● Expenses
● Contact
● Billing
Example
An ‘Employee’ table in a database is an example of structured data.
Employee_ID | Employee_Name | Gender | Department | Salary_In_Lacs |
1865 | Meg Lanning | Female | Finance | 6,30,000 |
2145 | Virat Kohli | Male | HR | 6,30,000 |
4500 | Ellyse Perry | Female | Finance | 4,00,000 |
5475 | Alyssa Healy | Female | HR | 4,00,000 |
6570 | Rohit Sharma | Male | HR | 5,30,000 |
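As a small illustrative sketch (assuming the pandas library is available), the same table could be held as structured data in Python, where the fixed schema makes filtering and aggregation straightforward:

import pandas as pd

# The Employee table above, represented with a fixed schema
employees = pd.DataFrame({
    "Employee_ID": [1865, 2145, 4500, 5475, 6570],
    "Employee_Name": ["Meg Lanning", "Virat Kohli", "Ellyse Perry", "Alyssa Healy", "Rohit Sharma"],
    "Gender": ["Female", "Male", "Female", "Female", "Male"],
    "Department": ["Finance", "HR", "Finance", "HR", "HR"],
    "Salary_In_Lacs": [630000, 630000, 400000, 400000, 530000],
})

# Because every row follows the same structure, queries are simple
print(employees[employees["Department"] == "HR"])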
Unstructured data
This is the type of big data that covers the formats of a large number of unstructured files, such as image files, audio files, log files, and video files. Unstructured data refers to data whose structure or model is not known in advance. Because of its size, unstructured big data presents unique challenges when it comes to processing it to derive value from it.
A complex data source containing a mix of photos, videos, and text files is an example of this. Many organisations have a wealth of such information at their disposal, but because the data is in its raw form, they are unable to derive value from it.
Semi structured data
Semi-structured data is the type of big data that combines features of both unstructured and structured data formats. More specifically, it refers to data that, although not organised under a particular database schema, carries essential tags or markers that separate individual elements within the data. With that, we have covered the main types of big data.
Examples
Personal data stored in an XML file-
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
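A minimal sketch of how such semi-structured records could be read in Python with the standard-library XML parser (here the <rec> elements are wrapped in a single root element, an assumption about how the file would be packaged):

import xml.etree.ElementTree as ET

# The tags inside each <rec> are the markers that separate individual elements
xml_data = """
<records>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</records>
"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    print(rec.find("name").text, rec.find("sex").text, rec.find("age").text)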
Q10) What are the needs of data wrangling?
A10) Wrangling the data is critical; indeed, it is regarded as the backbone of the entire analysis process. Data wrangling's fundamental goal is to make raw data usable; in other words, to get data into a workable shape. It comes as no surprise that data scientists spend, on average, about 75% of their time wrangling data. The following are some of the most important reasons why data wrangling is needed (a short code sketch after this list illustrates a typical wrangling pass):
● The data's quality is guaranteed.
● Supports rapid decision-making and increases the speed with which data insights are gained.
● Data that is noisy, faulty, or missing gets cleaned.
● It makes the resulting dataset meaningful, since it gathers the data that will actually be used in the data mining process.
● Cleaning and arranging raw data into the desired format supports concrete decision-making.
● The raw data is reassembled into the desired format.
● The ideal way for creating a transparent and effective data management system is to have all data in a single location where it can be used to improve compliance.
● Wrangling allows the data wrangler to make quick judgements by cleaning, enriching, and transforming the data into a clear picture.
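As a hedged illustration of these points (the tiny in-memory dataset and column names are invented), a typical wrangling pass in Python with pandas might clean, convert, and reshape raw records like this:

import pandas as pd

# Hypothetical raw data: a duplicated row, amounts stored as text, one missing value
raw = pd.DataFrame({
    "customer": ["Asha", "Asha", "Ravi", "Meena"],
    "amount":   ["1200", "1200", None, "450"],
    "date":     ["2021-01-05", "2021-01-05", "2021-01-07", "2021-01-09"],
})

wrangled = (
    raw.drop_duplicates()                                # clean: remove duplicate rows
       .assign(
           amount=lambda d: pd.to_numeric(d["amount"]),  # fix types so the column is usable
           date=lambda d: pd.to_datetime(d["date"]),     # standardise dates into datetimes
       )
       .fillna({"amount": 0})                            # handle the missing value explicitly
)
print(wrangled)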
Q11) Explain data cleaning?
A11) When working with data, your analysis and insights are only as good as the data you use. If you do data analysis on stale data, your company will be unable to make efficient and productive judgments. Data cleaning is an important element of data management since it allows you to ensure that your data is of excellent quality.
Data cleaning entails more than just correcting spelling and grammatical mistakes. It's a key machine learning approach and a vital part of data science analytics. Today, we'll learn more about data cleaning, its advantages, data concerns that can develop, and next steps in your learning.
Data cleaning, also known as data cleansing, is the act of eliminating or correcting inaccurate, incomplete, or duplicate data from a dataset. The initial stage in your workflow should be data cleaning. There's a good chance you'll duplicate or mislabel data while working with large datasets and merging several data sources. If your data is faulty or incorrect, it loses its value, and your algorithms and results become untrustworthy.
Data cleaning differs from data transformation in that it involves deleting data from your dataset that doesn't belong there. Data transformation is the process of transforming data into a different format or organisation. Data wrangling and data munging are terms used to describe data transformation procedures. Today, we'll concentrate on the data cleaning procedure.
How do you clean data?
While data cleaning processes differ depending on the sorts of data your firm stores, you can use these fundamental steps to create a foundation for your company; a short code sketch after the steps ties them together.
Step 1: Remove duplicate or irrelevant observations
Remove any unwanted observations, such as duplicates or irrelevant observations, from your dataset. Duplicate observations are most likely to arise during data collection. Duplicate data can be created when you integrate data sets from numerous sources, scrape data, or receive data from clients or multiple departments. Deduplication is one of the most important parts of this step. Irrelevant observations are those that do not relate to the specific problem you are trying to analyse. For example, if you want to study data about millennial customers but your dataset includes observations from older generations, you may wish to remove those observations.
Step 2: Fix structural errors
When you measure or transfer data and find unusual naming conventions, typos, or wrong capitalization, you have structural issues. Mislabeled categories or classes can result from these inconsistencies. "N/A" and "Not Applicable," for example, may both exist, but they should be examined as one category.
Step 3: Filter unwanted outliers
There will frequently be one-off observations that do not appear to fit into the data you are studying at first sight. If you have a good cause to delete an outlier, such as incorrect data entry, doing so will make the data you're working with perform better. The advent of an outlier, on the other hand, can sometimes prove a theory you're working on. It's important to remember that just because an outlier exists doesn't mean it's wrong. This step is required to determine the number's legitimacy. Consider deleting an outlier if it appears to be unimportant for analysis or is a mistake.
Step 4: Handle missing data
Many algorithms will not accept missing values, so you cannot simply ignore them. There are a few options for dealing with missing data. None of them is ideal, but each can be considered.
● As a first option, you can drop observations with missing values, but be aware that doing so also drops information.
● As a second alternative, you can fill in missing numbers based on other observations; however, you risk losing data integrity because you're working with assumptions rather than actual observations.
● As a third solution, you may change the way the data is used to navigate null values more efficiently.
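A hedged sketch of the four steps above in pandas (the survey-style columns, values, and the 0-120 age range are invented purely for illustration):

import pandas as pd

# Invented data with the kinds of problems described above
df = pd.DataFrame({
    "respondent": [1, 1, 2, 3, 4, 5],
    "status": ["N/A", "N/A", "Not Applicable", "employed", "Employed", "student"],
    "age": [34, 34, 29, 41, 500, None],     # 500 is an obvious data-entry outlier
})

df = df.drop_duplicates()                                        # Step 1: remove duplicates
df["status"] = df["status"].str.strip().str.lower().replace(    # Step 2: fix structural errors
    {"n/a": "not applicable"}
)
df = df[(df["age"].isna()) | (df["age"].between(0, 120))]        # Step 3: filter implausible outliers
df["age"] = df["age"].fillna(df["age"].median())                 # Step 4: one option - impute missing values
print(df)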
Q12) Why is data cleaning necessary?
A12) Although data cleansing may appear to be a tedious and monotonous activity, it is one of the most crucial tasks a data scientist must complete. Data that is inaccurate or of poor quality can sabotage your operations and analyses. A brilliant algorithm can be ruined by bad data.
High-quality data, on the other hand, can make a simple algorithm provide excellent results. You should become familiar with a variety of data cleaning processes in order to improve the quality of your data. Not every piece of information is valuable. As a result, another important element affecting the quality of your data is your location.
Q13) Define data integration?
A13) The technical and business methods used to combine data from many sources into a unified, single view of the data are referred to as data integration.
Fig 3: Data integration
Data integration is the process of combining data from various sources into a single dataset with the goal of providing users with consistent data access and delivery across a wide range of subjects and structure types, as well as meeting the information requirements of all applications and business processes. The data integration process is one of the most important parts of the total data management process, and it's becoming more common as big data integration and the need to share existing data become more important.
Data integration architects develop data integration tools and platforms that enable an automated data integration process, connecting and routing data from source systems to target systems. This can be accomplished through a variety of data integration methods, such as the following (a toy ETL sketch appears after the list):
● Extract, Transform, and Load (ETL) - copies of datasets from various sources are combined, harmonised, and loaded into a data warehouse or database.
● Extract, Load, and Transform - data is fed into a big data system in its raw form, then processed afterwards for specific analytical purposes.
● Change Data Capture - identifies real-time data changes in databases and applies them to a data warehouse or other repositories.
● Data Replication - data from one database is duplicated to other databases in order to keep the information synced for operational and backup purposes.
● Data Virtualization - instead of importing data into a new repository, data from disparate systems is virtualized and integrated to create a unified view.
● Streaming Data Integration - a real-time data integration method that continuously integrates and feeds multiple streams of data into analytics systems and data repositories.
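To make the ETL pattern above concrete, here is a toy sketch (the source tables, column names, and the local SQLite file standing in for a warehouse are all hypothetical):

import sqlite3
import pandas as pd

# Extract: read copies of the source datasets (here, in-memory stand-ins for real sources)
crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
billing = pd.DataFrame({"customer": [1, 2], "total": [1200.0, 450.0]})

# Transform: harmonise the key names and combine the sources into one view
combined = crm.rename(columns={"cust_id": "customer_id"}).merge(
    billing.rename(columns={"customer": "customer_id"}), on="customer_id"
)

# Load: write the unified dataset into the target store (a local SQLite "warehouse" here)
with sqlite3.connect("warehouse.db") as conn:
    combined.to_sql("customers", conn, if_exists="replace", index=False)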
Q14) Why data integration is important?
A14) Big data, with all of its benefits and challenges, is being embraced by businesses that want to stay competitive and relevant. Data integration enables searches in these massive databases, with benefits ranging from corporate intelligence and consumer data analytics to data enrichment and real-time data delivery.
The management of company and customer data is one of the most common use cases for data integration services and solutions. To provide corporate reporting, business intelligence (BI data integration), and advanced analytics, enterprise data integration feeds integrated data into data warehouses or virtual data integration architecture.
Customer data integration gives a holistic picture of key performance indicators (KPIs), financial risks, customers, manufacturing and supply chain operations, regulatory compliance activities, and other areas of business processes to business managers and data analysts.
In the healthcare industry, data integration is extremely vital. By arranging data from several systems into a single perspective of relevant information from which helpful insights can be gained, integrated data from various patient records and clinics aids clinicians in identifying medical ailments and diseases. Medical insurers benefit from effective data gathering and integration because it assures a consistent and accurate record of patient names and contact information. Interoperability is the term used to describe the exchange of data across various systems.
Q15) Describe data reduction?
A15) Because the vast gathering of large data streams introduces the 'curse of dimensionality' with millions of features (variables and dimensions), big data reduction is primarily thought of as a dimension reduction problem. This raises the storage and computational complexity of big data systems.
"Data reduction is the transition of numerical or alphabetical digital information derived empirically or experimentally into a rectified, ordered, and simpler form," according to a formal definition. Simply said, it means that enormous amounts of data are cleansed, sorted, and categorised based on predetermined criteria to aid in the making of business choices.
Dimensionality Reduction
Dimensionality Reduction and Numerosity Reduction are the two main approaches of data reduction.
The technique of lowering the number of dimensions across which data is distributed is known as dimensionality reduction. As the number of dimensions grows, the attributes or features in the data set become increasingly sparse, which makes methods such as clustering and outlier analysis less meaningful. Reducing the number of dimensions makes the data easier to visualise and handle. Dimensionality reduction techniques fall into three categories.
Fig 4: Dimensionality reduction
● Wavelet Transform
The wavelet transform is a lossy dimensionality reduction approach in which a data vector X is transformed into another vector X' of the same length. Unlike the original, the wavelet-transformed result can be truncated, which yields the dimensionality reduction. Wavelet transforms work well with data cubes, sparse data, and severely skewed data, and they are frequently used in image compression.
● Principal Component Analysis
This strategy searches for a small number of orthogonal vectors (the principal components) that can best be used to represent the complete data set, allowing the original attributes to be projected onto a much smaller space. It can be applied to skewed or sparse data.
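A minimal scikit-learn sketch of principal component analysis (the random 10-attribute data set and the choice of keeping two components are illustrative assumptions):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 tuples described by 10 attributes

pca = PCA(n_components=2)             # keep only the 2 strongest principal components
X_reduced = pca.fit_transform(X)      # each tuple is now represented in 2 dimensions

print(X_reduced.shape)                   # (200, 2)
print(pca.explained_variance_ratio_)     # how much variance the kept components retain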
● Attribute Subset Selection
A core attribute subset excludes attributes that aren't useful to data mining or are redundant. The selection of the core attribute subset decreases the data volume and dimensionality.
Numerosity Reduction
This strategy reduces data volume by using alternate, compact forms of data representation. Parametric and Non-Parametric Numerosity Reduction are the two forms of Numerosity Reduction.
Parametric
This method assumes that the data fits into a model. The parameters of the data model are estimated, and only those parameters are saved, with the remainder of the data being destroyed. If the data fits the Linear Regression model, for example, a regression model can be utilised to achieve parametric reduction.
Linear regression models a linear relationship between two features of the data set. Suppose we need to fit a linear regression model between two variables, x and y, where y is the dependent variable and x is the independent variable. The model can be expressed by the equation y = wx + b, where w and b are the regression coefficients. Using a multiple linear regression model, we can express the variable y in terms of several predictor attributes.
The Log-Linear model is another way for determining the relationship between two or more discrete characteristics. Assume we have a collection of tuples in n-dimensional space; the log-linear model may be used to calculate the probability of each tuple in this space.
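As a small sketch of the parametric idea (the synthetic x and y values are assumed to roughly follow a line), only the two regression coefficients need to be kept in place of the raw observations:

import numpy as np

# Synthetic data that roughly fits y = wx + b
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 1_000)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.size)

# Parametric reduction: estimate w and b, then keep only these two numbers
w, b = np.polyfit(x, y, deg=1)
print(f"stored parameters: w={w:.2f}, b={b:.2f}")   # the 1,000 raw points can be discarded

# Any value of y can later be approximated from the model alone
print("estimated y at x=4:", w * 4 + b)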
Non-Parametric
There is no model in a non-parametric numerosity reduction strategy. The non-Parametric technique produces a more uniform reduction regardless of data size, but it does not accomplish the same large volume of data reduction as the Parametric technique. Non-parametric data reduction techniques include Histogram, Clustering, Sampling, Data Cube Aggregation, and Data Compression.
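A short sketch of non-parametric reduction with a histogram (the 10,000 synthetic values and the choice of 10 bins are arbitrary); the raw values are replaced by bin edges and counts:

import numpy as np

rng = np.random.default_rng(2)
values = rng.exponential(scale=5.0, size=10_000)    # 10,000 raw observations

counts, bin_edges = np.histogram(values, bins=10)   # keep only 10 counts and 11 edges
print(counts)
print(np.round(bin_edges, 2))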
Q16) Explain data transformation?
A16) The process of modifying the format, structure, or values of data is known as data transformation. For data analytics projects, data can be transformed at two stages of the data pipeline. On-premises data warehouses typically use an ETL (extract, transform, load) process, with data transformation serving as the middle step. Most businesses today use cloud-based data warehouses, which can scale compute and storage resources in seconds or minutes. The cloud platform's scalability lets enterprises skip preload transformations and load raw data into the data warehouse, transforming it at query time instead, a paradigm known as ELT (extract, load, transform).
Data transformation can be used in a variety of processes, including data integration, data migration, data warehousing, and data wrangling.
Data transformation can be positive (adding, copying, and replicating data), negative (deleting fields and records), aesthetic (standardising salutations or street names), or structural (renaming, moving, and combining columns in a database).
An organisation can choose from a number of ETL technologies to automate the data transformation process. Data analysts, data engineers, and data scientists use scripting languages like Python or domain-specific languages like SQL to alter data.
Benefits and challenges of data transformation
There are various advantages to transforming data:
● Data is transformed to make it better organised. Transformed data may be easier for both humans and computers to use.
● Null values, unexpected duplicates, wrong indexing, and incompatible formats can all be avoided with properly structured and verified data, which enhances data quality and protects programmes from potential landmines.
● Data transformation makes it easier for applications, systems, and different types of data to work together. Data that is utilised for several purposes may require different transformations.
However, there are some difficulties in effectively transforming data:
● It is possible that data transformation will be costly. The price is determined by the infrastructure, software, and tools that are utilised to process data. Licensing, computing resources, and recruiting appropriate employees are all possible expenses.
● Data transformations can be time-consuming and resource-intensive. Performing transformations after loading data into an on-premises data warehouse, or altering data before feeding it into apps, might impose a strain on other operations. Because the platform can scale up to meet demand, you can conduct the changes after loading if you employ a cloud-based data warehouse.
● A lack of expertise or carelessness during transformation can cause problems. Data analysts without appropriate subject matter expertise are less likely to spot typos or incorrect data, because they are unfamiliar with the range of valid and acceptable values. Someone working with medical data who is unfamiliar with the relevant terminology, for example, may misspell disease names or fail to flag disease names that should be mapped to a single value.
● Enterprises have the ability to undertake conversions that do not meet their requirements. For one application, a company may alter information to a specific format, only to restore the information to its previous format for another.
Q17) Why Transform Data?
A17) For a variety of reasons, you might want to modify your data. Businesses typically seek to convert data in order to make it compatible with other data, move it to another system, combine it with other data, or aggregate information in the data.
Consider the following scenario: your company has acquired a smaller company, and you need to merge the Human Resources departments' records. Because the purchased company's database differs from the parent company's, you'll have to perform some legwork to guarantee that the records match. Each new employee has been given an employee ID number, which can be used as a key. However, you'll need to update the date formatting, delete any duplicate rows, and make sure that the Employee ID field doesn't contain any null values to ensure that all employees are tallied. Before you load the data to the final target, you perform all of these crucial activities in a staging area.
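A hedged sketch of that staging-area work in pandas (the two in-memory tables, column names, and date formats are invented for illustration):

import pandas as pd

# Hypothetical HR extracts from the parent and the acquired company
parent = pd.DataFrame({"employee_id": [101, 102], "hired": ["2019-03-01", "2020-07-15"]})
acquired = pd.DataFrame({"employee_id": [201, 201, None], "hired": ["01/02/2021", "01/02/2021", "05/09/2021"]})

# Standardise date formatting so both sources agree
parent["hired"] = pd.to_datetime(parent["hired"], format="%Y-%m-%d")
acquired["hired"] = pd.to_datetime(acquired["hired"], format="%d/%m/%Y")

# Remove duplicate rows and reject records with a null Employee ID
acquired = acquired.drop_duplicates().dropna(subset=["employee_id"])

# Merge into one staging table before loading to the final target
staged = pd.concat([parent, acquired], ignore_index=True)
print(staged)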
Other causes for data transformation include:
● You're migrating your data to a new data store, such as a cloud data warehouse, and you need to modify the data types.
● You'd like to combine unstructured or streaming data with structured data in order to examine the data as a whole.
● You wish to enrich your data by doing lookups, adding geographical data, or adding timestamps, for example.
● You want to compare sales statistics from different regions or total sales from multiple regions.
Q18) Write about data discretization?
A18) The technique of transforming continuous data into discrete buckets by grouping it is known as data discretization. Discretization is also known for making data easier to maintain. When a model is trained using discrete data, it is faster and more effective than when it is trained with continuous data. Despite the fact that continuous-valued data includes more information, large volumes of data can cause the model to slow down. Discretization can assist us in striking a balance between the two. Binning and employing a histogram are two well-known data discretization techniques. Although data discretization is beneficial, we must carefully select the range of each bucket, which is a difficult task.
The most difficult part of discretization is deciding on the number of intervals or bins to use and how to determine their breadth.
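A small sketch of the two common techniques mentioned above, equal-width and equal-frequency binning, using pandas (the ages and the choice of three buckets are illustrative):

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Discretize the continuous ages into three equal-width buckets
buckets = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
print(buckets.value_counts())

# Equal-frequency (quantile) binning is a common alternative
quantile_buckets = pd.qcut(ages, q=3, labels=["low", "mid", "high"])
print(quantile_buckets.value_counts())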
In recent years, the discretization procedure has piqued public interest and has shown to be one of the most effective data pre-processing approaches in DM.
Discretization, to put it another way, converts quantitative data into qualitative data, resulting in a non-overlapping partition of a continuous domain. It also ensures that each numerical value is associated with a certain interval. Because it reduces data from a vast domain of numeric values to a subset of categorical values, discretization is considered a data reduction process.
Many DM methods that can only deal with discrete attributes require the use of discretized data. Three of the approaches listed among the top ten in DM, for example, need data discretization in some form. One of discretization's key benefits is that it brings significant gains in learning speed and accuracy. Furthermore, when discrete values are used, some decision-tree-based algorithms produce shorter, more compact, and more accurate outputs.
A large number of discretization proposals can be found in the specialised literature. In reality, numerous surveys have been created in an attempt to systematise the strategies that are now available. When working with a new real-world situation or data collection, it's critical to figure out which discretizer is the greatest fit. In terms of correctness and simplicity of the solution obtained, this will determine the success and applicability of the upcoming learning phase.
Despite the effort put into categorising the entire family of discretizers, the best-known and most effective ones have been organised into updated taxonomies in the specialised literature.