Unit - 3
Data Collection and Processing
Data collection is the process of gathering, measuring and analyzing accurate information for research using standard validated techniques. Based on the collected data, a researcher can evaluate their hypothesis. In most cases, data collection is the first and most important step of research, regardless of the field of study; the approach to data collection, however, differs from field to field depending on the information required.
The most critical objective of data collection is to ensure that information-rich and reliable data is collected for statistical analysis so that data-driven decisions can be made for research.
- Data is a set of values of subjects with respect to qualitative or quantitative variables.
- Data is raw, unorganized facts that need to be processed. Until it is organized, data can appear simple, random and useless.
- If data in a given context is processed, organized, structured or presented in order to make it useful, it is called information.
- The information necessary for research activities is available in different forms.
- The main forms of the information available are:
1. Primary data
2. Secondary data
3. Cross-sectional data
4. Categorical data
5. Time series data
6. Spatial data
7. Ordered data
Key Takeaways:
- Data collection is defined as the procedure of collecting, measuring and analyzing accurate insights for research using standard validated techniques
- The most critical objective of data collection is ensuring that information-rich and reliable data is collected for statistical analysis so that data-driven decisions can be made for research.
- Data is raw, unorganized facts that need to be processed. Data can be something simple and seemingly random and useless until it is organized.
Primary Data:
- Primary data is original and unique data that is collected directly by the researcher from a source, according to the requirements of the study.
- It is information collected for a particular purpose by the investigator himself or herself.
- Examples of primary data are data collected by identifying a community's attitudes to health services first-hand, determining a community's health needs, evaluating a social program, determining the job satisfaction of an organization's employees, and determining the quality of service provided by a worker.
Observation:
Observation, when it serves a defined research purpose, becomes a scientific instrument and the data collection method is systematically planned and recorded and subject to validity and reliability checks and controls.
Observation is characterized by a careful definition of the units to be observed, the style of recording the observed information, standardized conditions of observation and the selection of pertinent data of observation.
Participant observation: the observer takes part in the activities of the group being observed; the respondent's role in providing information is therefore very valuable.
Non-participant observation: observation that takes place without the observer taking part in the activities of the group.
Experiments
An experiment is a structured study in which the researchers try to understand the causes, impacts, and processes involved in a specific process. The researcher, who determines which subject is used, how they are grouped and the treatment they receive, usually controls this data collection method.
During the first stage of the experiment, the researcher selects the subjects that will be considered. Certain actions are then carried out on these subjects, while the researcher records the primary data consisting of the actions and reactions.
The data are analyzed after that, and a conclusion is drawn from the outcome of the analysis. Although experiments can be used to gather various types of primary data, they are mostly used for data collection in the laboratory.
Pros:
- It is usually objective since the data recorded are results of a process.
- Non-response bias is eliminated.
Cons:
- Incorrect data may be recorded due to human error.
- It is expensive.
Interviews:
An interview is a data collection method that involves two parties: the interviewer (the researcher or researchers asking questions and collecting data) and the interviewee (the subject or respondent who is being asked the questions). As the case may be, the questions and answers during an interview may be oral or written.
Interviews can be conducted in two ways: in-person and telephone interviews. An in-person interview requires an interviewer or a group of interviewers to ask the interviewee questions face to face.
It can be direct or indirect, structured or unstructured, focused or unfocused, etc. The tools used to conduct in-person interviews include a notepad or a recording device for taking note of the conversation, which is very important because of human forgetfulness.
On the other hand, telephone interviews are carried out over the phone via ordinary voice calls or video calls. The 2 participating parties may decide to use video calls such as Skype to conduct interviews.
A mobile phone, laptop, tablet or desktop computer that has an Internet connection is required for this.
Pros:
- In-depth information can be collected.
- Non-response and response bias can be detected.
- The samples can be controlled.
Cons:
- It is more time-consuming.
- It is expensive.
- The interviewer may be biased.
Schedules:
A schedule is the tool or instrument used to collect data from the respondents while the interview is being conducted. A schedule contains questions, statements (on which opinions are elicited) and blank spaces/tables for recording the respondents' answers. The features of schedules are:
- The schedule is presented by the interviewer. The questions are asked and the answers are noted down by him.
- The list of questions is a more formal document; it need not be attractive.
- The schedule can be used in a very narrow sphere of social research.
- The main purposes of schedule are three fold:
1. To provide a standardized tool for observation or interview in order to attain objectivity,
2. To act as memory tickler i.e., the schedule keeps the memory of the interviewer/ observer refreshed and keeps him reminded of the different aspects that are to be particularly observed, and
3. To facilitate the work of tabulation and analysis
Surveys & Questionnaires
Two similar instruments used in the collection of primary data are surveys and questionnaires. These are a group of questions typed or written down and sent to the study sample to provide answers.
After the required answers have been given, the survey is returned to the researcher for recording. It is advisable to carry out a pilot study in which experts fill in the questionnaires to assess the weaknesses of the questions or techniques used.
There are 2 main types of data collection surveys used, namely, online and offline surveys. Internet-enabled devices, such as mobile phones, PCs, tablets, etc., are used for online surveys.
They can be shared through email, websites, or social media with respondents. On the other hand, offline surveys do not require an internet connection in order to be carried out.
Paper-based surveys are the most common type of offline survey. Nevertheless, there are also offline surveys like Formplus that can be completed without access to an internet connection with a mobile device.
This kind of survey is called online-offline surveys because they can be filled offline but require an internet connection to be submitted.
Pros
- Respondents have adequate time to give responses.
- It is free from the bias of the interviewer.
- They are cheaper compared to interviews.
Cons
- A high rate of non-response bias.
- It is inflexible and can't be changed once sent.
- It is a slow process
Limitations of Primary data:
- Expensive:
The data gathering process for primary data is very costly compared to secondary data. No matter how little the research is, to carry out the research, at least one professional researcher will need to be employed. The research process itself may also cost a certain amount of money. The method used in conducting the study will determine how costly it is.
- Time-consuming
From the point of deciding to perform the research to the point of generating the data, the time required is much longer than the time it takes to acquire secondary data. Each stage of the primary data collection process requires a great deal of time to execute.
- Feasibility
Due to the volume of work and the resources that may be required, it is not always feasible to perform primary research. For example, it would be unrealistic for a business to conduct a census of the people living in a community just to measure the size of its target market.
In this case, a more sensible thing to do is to use the recorded census data to understand the demographics of individuals in that community.
Key Takeaways:
- Primary data is original and unique data that is collected directly by the researcher from a source according to his or her requirements.
- An experiment is a structured study where the researchers attempt to understand the causes, effects, and processes involved in a particular process.
- This data collection method is usually controlled by the researcher, who determines which subject is used, how they are grouped and the treatment they receive.
- Surveys and questionnaires are 2 similar tools used in collecting primary data. They are a group of questions typed or written down and sent to the sample of study to give responses.
Secondary Data:
- Secondary data refers to data that has already been collected and documented somewhere else for a certain purpose.
- Secondary data is data collected by someone else for some other purpose (but used for another purpose by the investigator).
- Examples include using census data to obtain information on a population's age-sex structure, using hospital records to determine a community's morbidity and mortality patterns, using the records of an organization to determine its activities, and collecting data from sources such as articles, journals, magazines, books and periodicals to obtain historical information.
Sources of Secondary Data:
Secondary data sources include books, private sources, journals, newspapers, websites, government records, etc. In comparison to primary data, secondary data is known to be readily available. Very little research and manpower is required to use these sources.
With the advent of electronic media and the internet, it has become easier to access secondary data sources. Below are some of these sources highlighted.
- Books
Books are one of the most traditional methods of collecting information. Today, there are books available on every subject you can think of. All you have to do when conducting research is look for a book on the subject being researched and then select from the available repository of books in that area. When carefully chosen, books are an authentic source of data and can be useful in preparing a literature review.
- Published Sources
There are a variety of published sources available for different research topics. The authenticity of the data generated from these sources depends majorly on the writer and publishing company.
Published sources may be printed or electronic as the case may be. They may be paid or free depending on the writer and publishing company's decision.
- Unpublished Personal Sources
Compared to published sources, these may not be readily available and easily accessible. They become available only when their owner shares them with the researcher, who is then not permitted to share them with a third party.
For instance, an organization's product management team may need customer feedback data to assess what customers think about their product and suggestions for enhancement. In order to improve customer service, they will need to collect data from the customer service department, which primarily collects data.
- Journals
These days, when data collection is concerned, journals are gradually becoming more significant than books. This is because journals are regularly updated on a periodic basis with new publications, thus providing up-to-date data.
Also, when it comes to research, journals are usually more specific. For example, we can have a journal on "Secondary data collection for quantitative data" while a book will simply be titled "Secondary data collection"
- Newspapers
In most cases, the information passed through a newspaper is very reliable, which makes it one of the most authentic sources of secondary data.
The kind of data commonly shared in newspapers is usually more political, economic, and educational than scientific. Therefore, newspapers may not be the best source for scientific data collection.
- Websites
The information shared on websites is mostly not regulated and as such may not be as trustworthy as other sources. However, there are some regulated websites that only share authentic data and can be trusted by researchers.
Most of these are government websites or the websites of private organizations that are paid data collectors.
- Blogs
Blogs are one of the most common online sources for data and may even be less authentic than websites. These days, practically everyone owns a blog and a lot of people use these blogs to drive traffic to their website or make money through paid ads.
Therefore, they cannot always be trusted. For example, a blogger may write good things about a product because he or she was paid to do so by the manufacturer even though these things are not true.
- Diaries
Diaries are personal records and are, as such, rarely used by researchers for data collection. They are generally private, although these days some people share public diaries containing specific events in their lives.
A common example is Anne Frank's diary, which contains a first-hand record of life during the Nazi era.
- Government Records
A very important and authentic source of secondary data is government records. They contain data that is useful in research in marketing, management, humanities, and social sciences.
Some of these records include information from the census, health records, records of educational institutions, etc. Usually, they are collected to aid proper planning, allocation of funds, and project prioritization.
- Podcasts
Podcasts are gradually becoming very common these days, and a lot of people listen to them as an alternative to radio. They are more or less like online radio stations and are gaining increasing popularity.
Information is usually shared during podcasts, and listeners can use it as a source of data collection.
Some other sources of data collection include:
- Letters
- Radio stations
- Public sector records.
Limitations:
- Data Quality:
The data collected through secondary sources may not be as authentic as when collected directly from the source. This is a very common disadvantage with online sources due to a lack of regulatory bodies to monitor the kind of content that is being shared.
Therefore, working with this kind of data may have negative effects on the research being carried out.
- Irrelevant Data:
Researchers spend so much time surfing through a pool of irrelevant data before finally getting the one they need. This is because the data was not collected mainly for the researcher.
In some cases, a researcher may not even find the exact data he or she needs, but have to settle for the next best alternative.
- Exaggerated Data
Some data sources are known to exaggerate the information they share. This bias may arise from a desire to maintain a good public image or from a paid advert.
This is very common with many online blogs, which may even go ahead and share false information just to gain web traffic. For example, a FinTech startup may exaggerate the amount of money it has processed just to attract more customers.
A researcher gathering this data to investigate the total amount of money processed by FinTech startups in the US for the quarter may have to use this exaggerated data.
- Outdated Information
Some of the data sources are outdated and there are no new available data to replace the old ones. For example, the national census is not usually updated yearly.
Even though there have been changes in the country's population since the last census, someone working with the country's population data will have to settle for the previously recorded figure, outdated as it is.
Key Takeaways:
- Secondary data refers to data that has already been collected and documented somewhere else for a certain purpose.
- Secondary data is data collected by someone else for some other purpose (but used for another purpose by the investigator).
- Books, personal sources, journals, newspapers, websites, government records, etc. are secondary data sources. In comparison to primary data, secondary data is known to be readily available. To use these sources, very little research and the need for manpower is required
- The data collected from secondary sources may not be as authentic as data collected directly from the source.
Several factors should be considered when choosing the research methodology and data collection method:
- Research goal: think of your research goals.
- Statistical significance: another essential factor to consider while choosing the research methodology is the statistical significance required of the results.
- Quantitative vs. Qualitative data.
- Sample size.
- Timing.
Significance and Methods
There are two sampling techniques, namely, probability and non-probability sampling. The "chance" of being included in the sample is commonly referred to as probability. On the basis of probability theory, the probability of an element being included in a sample can be determined.
The essential characteristic of probability sampling is that the chance of being included in the sample can be specified for each element of the population. In the simplest case, each of the elements is equally likely to be included, but this is not a necessary condition; what is needed is that there is some specifiable chance of inclusion for each element. In non-probability sampling, there is no way to estimate the probability that each element has of being included in the sample, and no assurance that every element has some chance of being included.
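As a minimal illustration of the probability idea, the sketch below draws a simple random sample in Python; the population list and the sample size of 100 are assumptions for the example, and every element has the same known chance of selection.

import random

# Hypothetical sampling frame: 1,000 numbered population elements.
population = list(range(1, 1001))

# Simple random sampling: each element has the same known chance
# (100 / 1000 = 0.10) of being included in the sample.
sample = random.sample(population, k=100)

print(len(sample), sample[:5])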
Factors determining sample size:
We have found that there are three key components that can help establish your target sample size.
1. Know how variable the population is that you want to measure.
People often incorrectly think that sample size is related to population size, so they assume that you would need to measure many, many individuals for a very large population. But think of it this way, if China's population were all exactly the same, you would only need to measure one person! Instead, variability is the critical problem for sample size.
For a random sample, individuals will show up in approximately the same proportions as they occur in the population to be measured. As a result, individuals with typical measurements will show up more often, and more unusual individuals (for example, very short or very tall) will show up less often. So, if the population is quite variable, you will need to measure more of its members to ensure that both the atypical and the typical individuals are captured. For design purposes, we are often more interested in those ends of the distribution, the small end and the large end of the bell curve, because if we can accommodate the extremes, we will also accommodate the folks in the middle. If you haven't captured those more extreme individuals, you risk making your doorway too short, or your airplane seat too narrow.
2. Know how precise the population statistics need to be.
The reason we evaluate a number of individuals is that we will calculate a set of statistics to characterize the population. Our clients then use those statistics in their designs. The product design could be anything from a chemical protective mask to a T-shirt to a jet aircraft cockpit.
For each type of design, the level of accuracy necessary differs. Safety-critical products typically require higher statistical accuracy than, for instance, a clothing item. For a gas mask, for instance, we would like to estimate population statistics to the nearest millimeter, since a poorly fitting design can have dire consequences. A half-inch might be close enough for that T-shirt, however. Knowing how accurate you need to be will guarantee that time and resources are wisely spent.
3. Know exactly how confident you must be in the results.
Just as you need to know how precise your measurements must be, knowing how much trust you can place in the outcomes is also helpful. Do you have to be 99% confident that you have the correct average sleeve length, or do you have to be 80% confident? A higher vote of confidence is usually needed for those working on safety-relevant projects. On the other hand, less trust may be required for projects where safety is not a problem, or where consumers have many marketplace options.
Keeping these key factors in mind, sample sizes that are large enough to capture the required variability can be calculated successfully, with appropriate accuracy and confidence, but without wasteful oversampling.
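These factors come together in the familiar formula for the sample size needed to estimate a population mean, n = (z·σ / E)², where σ is the expected variability, E the required precision, and z the value corresponding to the desired confidence level. The Python sketch below is a minimal illustration of that formula; the stature figures used are assumed values, not measurements.

import math

def sample_size_for_mean(sigma, margin_of_error, z):
    """Minimum sample size n = (z * sigma / E)^2, rounded up."""
    return math.ceil((z * sigma / margin_of_error) ** 2)

# Assumed example: stature with sigma = 70 mm, required precision of
# +/- 5 mm, and 95% confidence (z is approximately 1.96).
print(sample_size_for_mean(sigma=70, margin_of_error=5, z=1.96))  # 753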
Key takeaways:
- There are two sampling techniques, namely, probability and non-probability. The "chance" of being included in the sample is generally referred to as probability.
- There is no way to estimate the probability that each element has of being included in the sample in non-probability sampling, and no assurance that every element has some chance of being included.
Data Presentation:
Data presentation refers to the organization of data into tables, graphs or charts, so that logical and statistical conclusions can be derived from the collected measurements.
Tabular representation: a method of presenting data using a statistical table; a systematic organization of data in rows and columns.
Text, tables, and graphs are very powerful communication tools for presenting data and information. They can make an article easy to understand, attract and sustain the interest of readers, and present large quantities of complex information efficiently.
EDITING OF DATA:
Data editing is the process of examining raw data in order to detect and correct errors and omissions, where possible, so as to ensure legibility, completeness, consistency and accuracy. The data recorded must be legible so that it can later be coded. An unreadable response can be corrected by contacting the people who recorded it, or it can alternatively be inferred from other parts of the question. Completeness means that all the items in the questionnaire must be fully answered.
If answers to certain questions are missing, the interviewers may be contacted to find out whether they failed to record the answer or whether the respondent refused to answer the question. In the former case, it is quite likely that the interviewer will not remember the answer. The respondent may be contacted again in such a case, or this particular piece of data may alternatively be treated as missing data.
Checking whether or not the respondent is consistent in answering the questions is very important. For instance, a respondent may claim to make credit card purchases even though he has stated elsewhere that he does not own a credit card. Inaccuracy in the survey data may also be due to bias or cheating by the interviewer; one way to spot this is to look for a common pattern of responses across a particular interviewer's completed instruments. In addition to ensuring quality data, editing also makes the coding and tabulation of the information easier. In fact, editing involves a thorough examination of the completed questionnaires. Editing can be done in two stages:
1. Field Editing, and
2. Central Editing.
Field Editing: Field editing consists of the investigator reviewing the reporting forms to complete or translate what was written in abbreviated form at the time of the respondent's interview. This form of editing is necessary because handwriting varies from individual to individual and is sometimes difficult for the tabulator to understand. It should be done as soon as possible after the interview, since memory may have to be relied upon. While doing so, care should be taken that the investigator does not correct errors of omission by simply guessing what the respondent would have answered if he had been asked the question.
Central Editing: Central editing should be carried out when all the forms or schedules have been completed and returned to the headquarters. This type of editing implies that all forms are thoroughly edited by a single person (editor) in the case of a small field study, or by a small group of editors in the case of a large field study. The editor can correct obvious errors, such as an entry in the wrong place or an entry recorded in daily terms when it should have been recorded in weeks or months. Sometimes, by reviewing the other information recorded in the schedule, the editor can also supply missing or inappropriate replies. The respondent may, if necessary, be contacted for clarification. All incorrect replies that are quite obvious must be deleted from the schedules.
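As an illustration of the completeness and consistency checks described above, the sketch below uses pandas (an assumed tool choice) on a small set of invented responses, flagging unanswered items and the credit-card inconsistency mentioned earlier.

import pandas as pd

# Invented responses from a small survey.
responses = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4],
    "owns_credit_card": ["no", "yes", "yes", None],
    "made_card_purchase": ["yes", "yes", "no", "yes"],
})

# Completeness check: flag questionnaires with unanswered items.
incomplete = responses[responses.isna().any(axis=1)]

# Consistency check: card purchases reported by respondents who say
# they do not own a credit card.
inconsistent = responses[
    (responses["owns_credit_card"] == "no")
    & (responses["made_card_purchase"] == "yes")
]

print(incomplete)
print(inconsistent)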
CODING OF DATA:
Coding is the process of assigning symbols (alphabetical, numerical, or both) to the answers so that the responses can be recorded in a limited number of classes or categories. The classes should be appropriate to the research problem being studied. They must be exhaustive and mutually exclusive so that each answer can be placed in one and only one cell of a given category.
In addition, every class must be defined in terms of only one concept. Coding is necessary for efficient data analysis. Coding decisions should generally be taken at the design stage of the questionnaire itself, so that the likely answers to questions are pre-coded. This simplifies the computer tabulation of the data for further analysis. It should be noted that coding errors should be eliminated altogether or at least reduced to the minimum possible level.
It is more tedious to codify an open-ended question than a closed-ended question. The coding scheme is very simple for a closed ended or structured question and designed prior to the field work.
The same approach could also be used for coding numeric data that either are not coded into categories or have had their relevant categories specified. For example,
What is your monthly income?
Here the respondent would indicate his monthly income which may be entered in the relevant column. The same question may also be asked like this:
What is your monthly income?
- Less than Rs. 5000
- Rs. 5000 - 8999
- Rs. 9000 - 12999
- Rs. 13000 or above
We may code the class 'less than Rs. 5000' as 1, 'Rs. 5000 - 8999' as 2, 'Rs. 9000 - 12999' as 3, and 'Rs. 13000 or above' as 4.
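A minimal sketch of how these pre-coded income classes could be assigned automatically is shown below; pandas is an assumed tool choice and the incomes are invented for illustration.

import pandas as pd

# Invented monthly incomes reported by respondents (Rs.).
incomes = pd.Series([3200, 7500, 12000, 15000, 8999])

# Pre-coded classes: 1 = below 5000, 2 = 5000-8999,
# 3 = 9000-12999, 4 = 13000 or above.
codes = pd.cut(
    incomes,
    bins=[0, 5000, 9000, 13000, float("inf")],
    labels=[1, 2, 3, 4],
    right=False,  # each class includes its lower bound
)

print(codes.tolist())  # [1, 2, 3, 4, 2]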
Coding of open-ended questions is a more complex task, as the verbatim responses of the respondents are recorded by the interviewer. In which categories should these answers be placed? The researcher may select and list 60-70 of the answers to a question at random. After reviewing the list, a decision is taken on which categories are appropriate to summarize the data, and the coding scheme discussed above is then applied to the categorized data. A word of caution: when classifying the data into different categories, we should keep a provision for "any other" to include responses that may not fall into our designated categories.
Data Classification/distribution:
Sarantakos (1998: 343) defines distribution of data as a form of classification of the scores obtained for the various categories of a particular variable. There are four types of distributions:
1. Frequency distribution
2. Percentage distribution
3. Cumulative distribution
4. Statistical distributions
Frequency distribution:
In social science research, frequency distribution is very common. It presents the frequency of occurrences of certain categories. This distribution appears in two forms:
Ungrouped: Here, the scores are not collapsed into categories. For example, in a distribution of the ages of the students of a BJ (MC) class, each age value (18, 19, 20, and so on) will be presented separately in the distribution.
Grouped: Here, the scores are collapsed into categories, so that two or three scores are presented together as a group. For example, in the above age distribution, groups like 18-20, 21-22, etc., can be formed.
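A small Python sketch of both forms is given below; the ages and the class intervals are invented for illustration.

from collections import Counter

ages = [18, 19, 19, 20, 21, 21, 21, 22, 23, 23]

# Ungrouped frequency distribution: each age value is counted separately.
ungrouped = Counter(ages)
print(ungrouped)  # e.g. Counter({21: 3, 19: 2, 23: 2, 18: 1, 20: 1, 22: 1})

# Grouped frequency distribution: ages collapsed into class intervals.
groups = {"18-20": 0, "21-23": 0}
for age in ages:
    groups["18-20" if age <= 20 else "21-23"] += 1
print(groups)  # {'18-20': 4, '21-23': 6}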
Percentage distribution:
It is also possible to give frequencies not in absolute numbers but in percentages. For instance, instead of saying that 200 respondents out of a total of 2000 had a monthly income of less than Rs. 500, we can say that 10% of the respondents had a monthly income of less than Rs. 500.
Cumulative distribution:
It tells how often the value of the random variable is less than or equal to a particular reference value.
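Both the percentage distribution and the cumulative distribution can be derived directly from a frequency distribution, as in the short sketch below; the income classes and counts are invented, echoing the 10% example above.

# Invented frequency distribution of monthly income classes (Rs.).
frequencies = {"< 500": 200, "500-999": 800, "1000-1499": 600, "1500+": 400}
total = sum(frequencies.values())  # 2000 respondents

# Percentage distribution: each frequency as a share of the total.
percentages = {k: 100 * v / total for k, v in frequencies.items()}
print(percentages)  # {'< 500': 10.0, '500-999': 40.0, ...}

# Cumulative distribution: respondents at or below each class boundary.
running = 0
cumulative = {}
for k, v in frequencies.items():
    running += v
    cumulative[k] = running
print(cumulative)   # {'< 500': 200, '500-999': 1000, ...}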
Statistical data distribution:
In this type of data distribution, some measure of average is found out of a sample of respondents. Several kinds of averages are available (mean, median, mode) and the researcher must decide which is most suitable to his purpose. Once the average has been calculated, the question arises: how representative a figure it is, i.e., how closely the answers are bunched around it. Are most of them very close to it or is there a wide range of variation?
Tabulation of data:
After editing, which ensures that the information on the schedule is accurate and categorized in a suitable form, the data are put together in some kinds of tables and may also undergo some other forms of statistical analysis.
Tables can be prepared manually and/or by computers. For a small study of 100 to 200 persons, there may be little point in tabulating by computer since this necessitates putting the data on punched cards. But for a survey analysis involving a large number of respondents and requiring cross tabulation involving more than two variables, hand tabulation will be inappropriate and time consuming.
Usefulness of tables:
Tables are useful to the researchers and the readers in three ways:
1. They present an overall view of findings in a simpler way.
2. They identify trends.
3. They display relationships in a comparable way between parts of the findings.
By convention, the dependent variable is presented in the rows and the independent variable in the columns.
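For larger studies, the cross tabulation described above is usually produced by software; the sketch below shows a minimal version with pandas (an assumed tool choice and invented data), keeping the dependent variable in the rows and the independent variable in the columns.

import pandas as pd

# Invented responses: job satisfaction (dependent) by gender (independent).
data = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "F", "M"],
    "satisfaction": ["high", "high", "low", "low", "high", "high", "low", "high"],
})

# Dependent variable in the rows, independent variable in the columns.
table = pd.crosstab(data["satisfaction"], data["gender"])
print(table)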
Graphic Presentation:
Graphical representation is a way of analyzing numerical data. It shows the relationship between data, ideas, information and concepts in a diagram. It is easy to understand and is one of the most important learning strategies. The choice of graph always depends on the type of information in a particular domain. Various types of graphical representation exist; the following are some of them:
- Line Graphs-The line graph or linear graph is used to display continuous information and is useful over time to predict future events.
- Bar Graphs-The Bar Graph is used to display the data category and compare the data to represent the quantities using solid bars.
- Histograms-The chart that uses bars to represent the frequency of numerical data divided into intervals. Since the intervals are all equal and continuous, the width of all the bars is the same.
- Line Plot-The frequency of data on a given number line is displayed. An 'X' is placed above the number line each time a data value occurs.
- Frequency Table-The table shows the number of data values that fall within each given interval.
- Circle Graph-Also referred to as a pie chart, it shows the relationship of the parts to the whole. The circle is considered to be 100 percent, and the categories it contains are represented by their particular percentages, such as 15 percent, 56 percent, etc.
- Stem and Leaf Plot-The data is organized from the lowest value to the highest value. The digits with the smallest place value form the leaves, and the digits with the next place value form the stems.
- Box and Whisker Plot-The plot summarizes the data by dividing it into four parts. The box and whiskers show the range (spread) and the middle (median) of the data.
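A minimal matplotlib sketch (the library choice and the data are assumptions) of two of the graph types listed above, a bar graph and a histogram, is shown below.

import matplotlib.pyplot as plt

# Invented data for illustration.
categories = ["A", "B", "C", "D"]
counts = [12, 7, 15, 9]
scores = [55, 62, 58, 70, 73, 68, 75, 80, 77, 85]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Bar graph: comparing quantities across categories with solid bars.
ax1.bar(categories, counts)
ax1.set_title("Bar graph")

# Histogram: frequency of numerical data divided into equal intervals.
ax2.hist(scores, bins=5)
ax2.set_title("Histogram")

plt.tight_layout()
plt.show()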
General Rules for Graphical Representation of Data:
There are certain rules to effectively present the information in the graphical representation. They are:
- Suitable Title: Make sure that the appropriate title is given to the graph which indicates the subject of the presentation.
- Measurement Unit: Mention the measurement unit in the graph.
- Proper Scale: To represent the data in an accurate manner, choose a proper scale.
- Index: Index the appropriate colors, shades, lines, and design in the graphs for better understanding.
- Data Sources: Include the source of information wherever it is necessary at the bottom of the graph.
- Keep it Simple: Construct a graph in an easy way that everyone can understand.
- Neat: Choose the correct size, fonts, colors etc. in such a way that the graph should be a visual aid for the presentation of information.
Key Takeaways:
- The editing of data is a process of examining the raw data to detect errors and omissions and to correct them, if possible, so as to ensure legibility, completeness, consistency and accuracy
- In fact, the editing involves a careful scrutiny of the completed questionnaires. The editing can be done at two stages: Field Editing, and Central Editing.
- Graphical Representation is a way of analyzing numerical data. It exhibits the relation between data, ideas, information and concepts in a diagram.
- Coding is the process of assigning symbols (alphabetical, numerical, or both) to the answers so that the responses can be recorded in a limited number of classes or categories.
Statistics involves data collection, interpretation, and validation. Statistical analysis is the technique of performing several statistical operations to quantify the data and apply statistical analysis. Quantitative data involves descriptive data such as survey and observational data; this is also referred to as descriptive analysis. Various tools are available to perform statistical data analysis, such as SAS (Statistical Analysis System), SPSS (Statistical Package for the Social Sciences), Statsoft, and more.
Data Analysis Techniques
Depending on the issue at hand, the type of data, and the amount of data collected, there are different techniques for data analysis. In order to transform facts and figures into decision-making parameters, each focuses on strategies for taking on new data, mining insights, and drilling into information. Accordingly, the various data analysis methods can be classified as follows:
1. Techniques based on Mathematics and Statistics
- Descriptive Analysis: Descriptive Analysis takes historical data, Key Performance Indicators, into account and describes performance based on a benchmark chosen. It takes past trends into account and how they might affect future performance.
- Dispersion Analysis: Dispersion is the area over which a data set is spread. This technique enables data analysts to determine the variability of the factors under study.
- Regression Analysis: This method works by modeling a dependent variable's relationship with one or more independent variables. A regression model can be linear, multiple, logistic, ridge, non-linear, life data, and more.
- Factor Analysis: This method helps to determine whether a set of variables have any relationship. In this process it reveals other factors or variables that describe the patterns in the relationships among the initial variables. Factor analysis leads on to useful procedures for clustering and classification.
- Discriminant Analysis: It is a data mining classification technique. It distinguishes the data points in different groups on the basis of variable measurements. In simple terms, it defines what distinguishes two groups from each other; this helps to classify new items.
- Time Series Analysis: Measurements are taken over time in this type of analysis, which gives us a collection of organized data known as a time series.
2. Techniques based on Artificial Intelligence and Machine Learning
- Artificial Neural Networks: A neural network is a paradigm of programming that is biologically inspired and presents a brain metaphor for information processing. An Artificial Neural Network is a system based on information that flows through the network, which changes its structure. ANN is highly precise and can accept noisy data. In business classification and forecasting applications, they can be considered highly dependable.
- Decision Trees: As the name suggests, it is a tree-shaped model that represents a classification or regression model. It divides a data set into smaller subsets while simultaneously developing the related decision tree.
- Evolutionary programming: This technique uses evolutionary algorithms to combine different types of data analysis. It is a domain-independent technique that can very effectively explore ample search space and manage attribute interaction.
- Fuzzy Logic: It is a probability-based data analysis method that helps to deal with the uncertainties in data mining techniques.
3. Techniques based on Visualization and Graphs
- Column Chart, Bar Chart: Both of these charts are used to present numerical differences between categories. The column chart uses the height of the columns to reflect the differences; in the case of the bar chart, the axes are interchanged.
- Line Chart: This chart is used over a continuous interval of time to represent the change in data.
- Area Chart: The line chart is the basis of this concept. In addition, it fills the area with color between the polyline and the axis, representing better trend data.
- Pie Chart: It is used to represent the proportion of various classifications. It is only suitable for just one data series. However, to represent the proportion of data in various categories, it can be made multi-layered.
- Funnel Chart: This chart shows the percentage of each phase and represents the size of each module. It helps in comparing rankings.
- Word Cloud Chart: It is a text data visual representation. It requires a large amount of information, and for users to perceive the most prominent one, the degree of discrimination needs to be high. It is not a very precise technique for analytics.
- Gantt chart: In comparison to the requirements, it shows the actual timing and activity progress.
- Radar Chart: It is used to compare multiple quantified variables. It reflects which variables in the data have greater values and which have lower values. A radar chart is used for comparing classifications and series together with proportional representation.
- Scatter Plot: This shows the distribution of variables over a rectangular coordinate system in the form of points. The distribution can reveal the correlation between the variables in the data points.
- Bubble Chart: This is a variation of the dispersion chart. Here, the area of the bubble represents the 3rd value in addition to the x and y coordinates.
- Gauge: This is a kind of materialized graph. The scale represents the metric here, and the dimension is represented by the pointer. It is an appropriate technique for representing comparisons of intervals.
- Frame Diagram: It is a visual representation, in the form of an inverted tree structure, of a hierarchy.
- Rectangular Tree Diagram: This method is used to represent hierarchical relationships, but at the same level. It makes effective use of space, and the area of each rectangle represents its proportion.
- Regional map: Color is used to represent the distribution of values over a map partition.
- Point Map: The geographical distribution of data is represented in the form of points on a geographical background. If the points are all the same size, it conveys little for individual data values, but if the points are drawn as bubbles, the size of the data in each region is additionally represented.
- Flow Map: It reflects the relationship between an area of inflow and an area of outflow. A line connecting the geometric centers of gravity of the spatial elements is represented. To reduce visual clutter, the use of dynamic flow lines helps.
- Heat Map: This represents the weight of the geographical area of each point. The density is represented by the color here.
Data Analysis Tools
There are several tools available on the market for data analysis, each with its own set of functions. Tools should always be selected on the basis of the type of analysis performed and the type of data processed. Here is a list of a few compelling Data Analysis instruments.
1. Excel
It has a variety of compelling features, and with extra plug-ins installed it can handle a massive amount of data. So, as long as the volume of data does not exceed what Excel can handle, it can be a very versatile data analysis tool.
2. Tableau
It falls under the category of BI Tool, made for the sole purpose of analyzing data. The Pivot Table and Pivot Chart are the essence of Tableau and work towards the most user-friendly representation of data. In addition, along with brilliant analytical functions, it has a data cleaning feature.
3. Power BI
It originally started as an Excel plugin but was later separated from it and developed into one of the most popular data analytics tools. It is available in three versions: Free, Pro, and Premium. Its PowerPivot and DAX language can be used to perform sophisticated advanced analytics similar to writing Excel formulas.
4. Fine Report
Fine Report comes with a simple drag and drop operation, which helps to design different report styles and create a system for analyzing data decisions. It can connect to all kinds of databases directly, and its format is similar to Excel. In addition, a range of dashboard templates and several self-developed libraries of visual plug-ins are also provided.
5. R & Python
These are very powerful and flexible programming languages. R is best for statistical analysis, such as normal distributions, algorithms for cluster classification, and regression analysis. It also supports individual predictive analysis of customer behavior based on a customer's browsing history, spending, preferred items, and more. It also covers machine learning and artificial intelligence concepts.
6. SAS
It is a data analytics and data manipulation programming language which can easily access data from any source. SAS has introduced a broad set of customer profiling products for web, social media, and marketing analytics. It can predict, manage, and optimize customers' communication behavior.
Measures of Central Tendency
The central tendency is a descriptive summary of a dataset through a single value that reflects the center of the data distribution. Along with the variability (dispersion) of a dataset, central tendency is a branch of descriptive statistics.
The central tendency is one of the most quintessential concepts in statistics. While it does not provide information about the individual values in the dataset, it provides a comprehensive summary of the entire dataset.
Generally, the central tendency of a dataset can be described using the following measures:
• Mean (Average): Represents the sum of all values divided by the total number of values in a dataset.
• Median: The middle value in a dataset organized in ascending order (from the smallest value to the largest value). If a dataset contains an even number of values, the mean of the two middle values will be the median of the dataset.
• Mode: Defines the most frequent value that occurs in a dataset. A dataset may contain multiple modes in some instances, while some datasets may not have a mode at all.
Although the above measures are most frequently used to describe the central tendency, there are other measures, including, but not limited to, the geometric mean, the harmonic mean, the midrange, and the geometric median.
The choice of a measure of central tendency depends on the properties of the dataset. For example, the mode is the only central tendency measure for categorical data, while a median works best with ordinal data.
Although the mean is regarded as the best measure of central tendency for quantitative data, that is not always the case. For instance, the mean may not work well with quantitative datasets that contain extremely large or extremely small values, since extreme values can distort it. You may, therefore, consider other measures.
The measures of central tendency can be found using a formula or definition. They can also be identified from a frequency distribution graph. Note that for datasets that follow a normal distribution, the mean, median, and mode are located at the same point on the graph.
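The three measures can be computed with Python's standard statistics module, as in the minimal sketch below; the values are invented, and the extreme value illustrates how the mean can be distorted.

import statistics

# Invented dataset with one extreme value that pulls the mean upward.
values = [12, 15, 15, 18, 20, 22, 95]

print(statistics.mean(values))    # about 28.14, distorted by the extreme value 95
print(statistics.median(values))  # 18
print(statistics.mode(values))    # 15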
Measures of Dispersion, Correlation Analysis and Regression Analysis:
Dispersion is the state of being dispersed or spread out. Statistical dispersion means the extent to which numerical data are likely to vary about an average value. In other words, dispersion helps to understand the distribution of the data.
In statistics, dispersion measures help to understand data variability, i.e., how homogeneous or heterogeneous the data is. It shows how squeezed or scattered the variable is, in simple terms.
There are two main types of dispersion methods in statistics which are:
· Absolute Measure of Dispersion
· Relative Measure of Dispersion
Absolute Measure of Dispersion
The absolute dispersion measure contains the same unit as the original set of data. The technique of absolute dispersion expresses the variations in terms of the average of observation deviations such as standard or mean deviations. Range, standard deviation, quartile deviation, etc. are included.
The types of absolute measures of dispersion are:
1. Range: It is simply the difference in a data set between the maximum value and the minimum value that is given. Example: 1, 3,5, 6, 7 => Range = 7 -1= 6
2. Variance: Subtract the mean from each value in the data set, square each of these deviations, add up the squares, and finally divide the sum by the total number of values in the data set. Variance: σ² = Σ(X − μ)² / N
3. Standard Deviation: The square root of the variance is known as the standard deviation, i.e. S.D. = √(σ²) = σ.
4. Quartiles and Quartile Deviation: Quartiles are values that split a number list into quarters. Half of the distance between the third and the first quartile is the quartile deviation.
5. Mean and Mean Deviation: The average of the numbers is referred to as the mean, and the arithmetic mean of the absolute deviations of the observations from a measure of central tendency is referred to as the mean deviation (also called the mean absolute deviation).
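A short sketch of these absolute measures using Python's standard library is given below; the values are invented, and pvariance/pstdev give the population variance σ² and standard deviation σ.

import statistics

values = [1, 3, 5, 6, 7]

# Range: maximum minus minimum.
print(max(values) - min(values))        # 6

# Population variance and standard deviation.
print(statistics.pvariance(values))     # 4.64
print(statistics.pstdev(values))        # about 2.154

# Quartile deviation: half the distance between Q3 and Q1.
q1, _, q3 = statistics.quantiles(values, n=4)
print((q3 - q1) / 2)                    # 2.25 with the default method

# Mean deviation: average absolute deviation from the mean.
mean = statistics.mean(values)
print(sum(abs(x - mean) for x in values) / len(values))  # 1.92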
Relative Measure of Dispersion
The relative measures of dispersion are used to compare the distribution of two or more data sets. This measure compares values without units.
Common relative dispersion methods include:
1. Coefficient of Range
2. Coefficient of Variation
3. Coefficient of Standard Deviation
4. Coefficient of Quartile Deviation
5. Coefficient of Mean Deviation
Coefficient of Dispersion
Coefficients of dispersion are calculated (along with the measure of dispersion) when two series that differ widely in their averages are compared. The coefficient of dispersion is also used when two sets with different measurement units are compared. It is denoted as C.D.
The common coefficients of dispersion correspond to the relative measures listed above.
Correlation Analysis:
Correlation is used for testing relationships between quantitative variables or categorical variables. In other words, it is a measure of how things are related. The study of how variables are correlated is called correlation analysis.
Some examples of data with high correlation:
● Your caloric intake and your weight.
● Your eye color and your relatives’ eye colors.
● The amount of time you study and your GPA.
Some examples of data that have a low correlation (or none at all):
● Your sexual preference and the type of cereal you eat.
● A dog’s name and the type of dog biscuit they prefer.
● The cost of a car wash and how long it takes to buy a soda inside the station.
Correlations are useful because you can make predictions about future behavior if you can discover what relationship variables have. In social sciences, such as government and healthcare, knowing what the future holds is very important. For budgets and business plans, enterprises also use these statistics.
The Correlation Coefficient
A coefficient of correlation is a way to place a value on the relationship. The coefficients of correlation have a value of -1 to 1. A '0' means that the variables have no relationship at all, whereas -1 or 1 means that there is a perfect negative or positive correlation (negative or positive correlation here refers to the type of graph the relationship will produce).
The Pearson correlation coefficient is the most common correlation coefficient. It is used to check for linear relationships in data. The Pearson is probably the only one you will work with in AP Statistics or elementary stats. However, depending on the type of data you are working with, you may come across others. Goodman and Kruskal's lambda coefficient, for instance, is a fairly common coefficient. It can be symmetric, if you do not have to specify which variable is dependent, or asymmetric, if the dependent variable is specified.
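A minimal sketch of the Pearson coefficient computed from its definition, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²), is given below; the study-time and GPA figures are invented.

import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Invented example: weekly study hours and GPA for six students.
hours = [2, 4, 6, 8, 10, 12]
gpa = [2.1, 2.4, 2.9, 3.2, 3.6, 3.8]

print(pearson_r(hours, gpa))  # close to +1: a strong positive correlation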
Regression Analysis:
Regression analysis is a group of statistical techniques used to estimate the relationship between a dependent variable and one or more independent variables. An independent variable is an input, inference, or driver that is altered in order to determine its effect on a dependent variable.
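A minimal sketch of simple (one-predictor) regression fitted by ordinary least squares, with slope b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept a = ȳ − b·x̄, is shown below; the advertising-spend and sales figures are invented.

def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    intercept = my - slope * mx
    return intercept, slope

# Invented example: advertising spend (independent) and sales (dependent).
spend = [10, 20, 30, 40, 50]
sales = [25, 45, 61, 82, 100]

intercept, slope = fit_simple_regression(spend, sales)
print(intercept, slope)        # fitted coefficients
print(intercept + slope * 60)  # predicted sales at a spend of 60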
Key Takeaways:
- Statistics include gathering information, analyzing, and validating. Statistical analysis is the method of conducting many statistical operations in order to measure the data and apply statistical analysis.
- There are different methods for data analysis depending on the issue at hand, the form of data, and the amount of data obtained. Each focuses on strategies for taking on new data, mining knowledge, and digging into information in order to turn facts and figures into decision-making criteria.
- Central tendency is a descriptive summary of a dataset through a single value that represents the center of the data distribution. Along with the variability (dispersion) of a dataset, central tendency is a branch of descriptive statistics.
- Analysis of regression is a group of statistical techniques used to estimate the relationship between an independent and a dependent variable.