Unit – 3
Statistics
Statistics is the branch of science dealing with collecting, organizing, summarizing, presenting and analyzing data, drawing valid conclusions, and making reasonable decisions on the basis of such analysis.
Collection of data
The collection of data is the starting point of any statistical investigation. Data may be collected for each and every unit of the whole lot (population), which would ensure greater accuracy, but complete enumeration is prohibitively expensive and time consuming. As such, out of a very large number of items a few (a sample) are selected, and conclusions drawn on the basis of this sample are taken to hold for the population.
Classification of data
The data collected in the course of an enquiry are not in an easily assimilable form. As such, their proper classification is necessary for making intelligent inferences. The classification is done by dividing the raw data into a convenient number of groups according to the values of the variable and finding the frequency of the variable in each group.
Let us, for example, consider the raw data relating to marks obtained in Mechanics by a group of 64 students:
79 | 88 | 75 | 60 | 93 | 71 | 59 | 85 |
84 | 75 | 82 | 68 | 90 | 62 | 88 | 76 |
65 | 75 | 87 | 74 | 62 | 95 | 78 | 63 |
78 | 82 | 75 | 91 | 77 | 69 | 74 | 68 |
67 | 73 | 81 | 72 | 63 | 76 | 75 | 85 |
80 | 73 | 57 | 88 | 78 | 62 | 76 | 52 |
62 | 67 | 97 | 78 | 85 | 76 | 65 | 71 |
78 | 89 | 61 | 75 | 95 | 60 | 79 | 83 |
This data can conveniently be grouped and shown in a tabular form as follows
Class | Frequency | Cumulative frequency |
50-54 | 1 | 1 |
55-59 | 2 | 3 |
60-64 | 9 | 12 |
65-69 | 7 | 19 |
70-74 | 8 | 27 |
75-79 | 17 | 44 |
80-84 | 6 | 50 |
85-89 | 8 | 58 |
90-94 | 3 | 61 |
95-99 | 3 | 64 |
Total | 64 | |
It will be seen from the above table that one student scored between 50 and 54 marks, two students between 55 and 59, nine students between 60 and 64, and so on. Thus the 64 figures have been put into only ten groups, called classes. The width of a class is called the class interval and the number of items in that interval is called the frequency. The midpoint or mid-value of a class is called the class mark. The above table, showing the classes and the corresponding frequencies, is called a frequency table. A set of raw data summarised by distributing it into a number of classes along with their frequencies is known as a frequency distribution. While forming a frequency distribution, the number of classes should not ordinarily exceed 20 and should not in general be less than 10. As far as possible, the class intervals should be of equal width.
Cumulative frequency
In some investigations we require the number of items less than a certain value. We add up the frequencies of the classes up to that value and call this number the cumulative frequency. In the above table the third column shows the cumulative frequency, i.e. the number of students getting 54 marks or fewer, 59 marks or fewer, and so on.
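The grouping and cumulative counts described above can be sketched in Python. This is a minimal illustration, not a prescribed method: the function name `frequency_table` and its parameters are assumptions, and the input is just the first row of the marks listed earlier.

```python
def frequency_table(values, start, width, n_classes):
    """Group integer data into inclusive classes such as 50-54.

    Returns one row per class: (lower, upper, frequency, cumulative frequency).
    """
    rows = []
    cumulative = 0
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1  # inclusive upper limit, e.g. 54
        freq = sum(1 for v in values if lower <= v <= upper)
        cumulative += freq
        rows.append((lower, upper, freq, cumulative))
    return rows


# Illustrative input only: the first row of the marks data.
marks = [79, 88, 75, 60, 93, 71, 59, 85]
table = frequency_table(marks, start=50, width=5, n_classes=10)
for lower, upper, freq, cum in table:
    print(f"{lower}-{upper} | {freq} | {cum}")
```

Passing all 64 marks instead of the first row gives the full ten-class table in the same layout.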
Graphical representation
A convenient way of representing a sample frequency distribution is by means of graphs. It gives to the eyes the general run of the observations and at the same time makes the raw data readily intelligible. We give below the important types of graphs in use:
1) Histogram. A histogram is drawn by erecting rectangles over the class intervals such that the areas of the rectangles are proportional to the class frequencies. If the class intervals are of equal size, the heights of the rectangles will be proportional to the class frequencies themselves (Figure 25.1).
2) Frequency polygon. A frequency polygon for ungrouped data can be obtained by joining points plotted with the variable values as the abscissae and the frequencies as the ordinates. For a grouped distribution the abscissae of the points will be the mid-values of the class intervals. In case the intervals are equal, the frequency polygon can be obtained by joining the middle points of the upper sides of the rectangles of the histogram by straight lines (shown by dotted lines in Figure 25.1). If the class intervals become very small, the frequency polygon takes the form of a smooth curve called the frequency curve.
3) Cumulative frequency curve (ogive). Very often it is desired to show in diagrammatic form not the relative frequencies in the various intervals but the cumulative frequencies above or below a given value. For example, we may wish to read off from a diagram the number or proportion of people whose income is not less than any given amount, or the proportion of people whose height does not exceed any stated value. Diagrams of this type are known as cumulative frequency curves or ogives. These are of two kinds, 'more than' and 'less than', and typically look somewhat like a long-drawn S (Figure 25.2).
Example. Draw the histogram, frequency polygon, frequency curve and the ogives 'less than' and 'more than' from the following distribution of marks obtained by 49 students.
Class (marks group) | Frequency (number of students) | Cumulative frequency (less than) | Cumulative frequency (more than) |
5-10 | 5 | 5 | 49 |
10-15 | 6 | 11 | 44 |
15-20 | 15 | 26 | 38 |
20-25 | 10 | 36 | 23 |
25-30 | 5 | 41 | 13 |
30-35 | 4 | 45 | 8 |
35-40 | 2 | 47 | 4 |
40-45 | 2 | 49 | 2 |
Solution. In Figure 25.1 the rectangles show the histogram, the dotted polygon represents the frequency polygon and the smooth curve is the frequency curve.
The ogives 'less than' and 'more than' are shown in figure 25.2.
Average or measures of Central tendency
An average is a value which is representative of a set of data. Average values are also termed measures of central tendency. There are five types of averages in common use:
(i) Arithmetic average or mean
(ii) Median
(iii) Mode
(iv) Geometric mean
(v) Harmonic mean
Arithmetic mean
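(a) Direct method
If $x_1, x_2, \ldots, x_n$ are n numbers, then their arithmetic mean (A.M.) is defined by
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum x}{n}$
If the number $x_1$ occurs $f_1$ times, $x_2$ occurs $f_2$ times and so on, then
$\bar{x} = \dfrac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n} = \dfrac{\sum fx}{\sum f}$
This is known as the direct method.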
Example 1. Find the mean of 20, 22, 25, 28, 30.
Solution. A.M. $= \dfrac{20 + 22 + 25 + 28 + 30}{5} = \dfrac{125}{5} = 25$
Example 2. Find the mean of the following:
Numbers | 8 | 10 | 15 | 20 |
Frequency | 5 | 8 | 8 | 4 |
Solution. $\sum fx$ = 8×5 + 10×8 + 15×8 + 20×4 = 40 + 80 + 120 + 80 = 320
$\sum f$ = 5 + 8 + 8 + 4 = 25
A.M. $= \dfrac{\sum fx}{\sum f} = \dfrac{320}{25} = 12.8$
(b) Short cut method
Let a be the assumed mean and d the deviation of the variate x from a, i.e. $d = x - a$. Then
$\bar{x} = a + \dfrac{\sum fd}{\sum f}$
Example 3. Find the arithmetic mean for the following distribution
Class | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 |
Frequency | 7 | 8 | 20 | 10 | 5 |
Solution. Let assumed mean (a) = 25
Class | Mid-value (x) | Frequency (f) | d = x - 25 | fd |
0-10 | 5 | 7 | -20 | -140 |
10-20 | 15 | 8 | -10 | -80 |
20-30 | 25 | 20 | 0 | 0 |
30-40 | 35 | 10 | +10 | +100 |
40-50 | 45 | 5 | +20 | +100 |
Total | | 50 | | -20 |
A.M. $= a + \dfrac{\sum fd}{\sum f} = 25 + \dfrac{-20}{50} = 24.6$
(c) Step deviation method
Let a be the assumed mean, i the width of the class interval and $D = \dfrac{x - a}{i}$. Then
$\bar{x} = a + i\,\dfrac{\sum fD}{\sum f}$
Example 4. Find the arithmetic mean of the data given in example 3 by step deviation method.
Solution. Let a =25
Class | Mid-value x | Frequency f | D = (x - 25)/10 | fD |
0-10 | 5 | 7 | -2 | -14 |
10-20 | 15 | 8 | -1 | -8 |
20-30 | 25 | 20 | 0 | 0 |
30-40 | 35 | 10 | +1 | +10 |
40-50 | 45 | 5 | +2 | +10 |
Total | | 50 | | -2 |
A.M. $= a + i\,\dfrac{\sum fD}{\sum f} = 25 + 10 \times \dfrac{-2}{50} = 24.6$
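As a cross-check, the three methods can be compared on the distribution of Example 3. The sketch below is illustrative; the function names are assumptions chosen for this example.

```python
mid_values = [5, 15, 25, 35, 45]   # class marks of 0-10, ..., 40-50
frequencies = [7, 8, 20, 10, 5]


def mean_direct(xs, fs):
    """Direct method: sum(f*x) / sum(f)."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)


def mean_short_cut(xs, fs, a):
    """Short cut method: a + sum(f*d) / sum(f), with d = x - a."""
    return a + sum(f * (x - a) for x, f in zip(xs, fs)) / sum(fs)


def mean_step_deviation(xs, fs, a, i):
    """Step deviation method: a + i * sum(f*D) / sum(f), with D = (x - a) / i."""
    return a + i * sum(f * (x - a) / i for x, f in zip(xs, fs)) / sum(fs)


direct = mean_direct(mid_values, frequencies)
short_cut = mean_short_cut(mid_values, frequencies, a=25)
step = mean_step_deviation(mid_values, frequencies, a=25, i=10)
print(direct, short_cut, step)
```

All three calls should agree up to floating-point rounding, since the assumed mean and the class width only shift and rescale the arithmetic.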
Median
Median is defined as the measure of the central item when the items are arranged in ascending or descending order of magnitude.
When the total number of items is odd and equal to, say, n, then the value of the $\frac{1}{2}(n+1)$th item gives the median.
When the total number of items is even, say n, then there are two middle items, and so the mean of the values of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th items is the median.
Example 5. Find the median of 6, 8, 9, 10, 11, 12, 13.
Solution. Total number of items =7
The middle item $= \dfrac{7+1}{2} = 4$th item
Median= value of the 4th item = 10
For grouped data, Median $= l + \dfrac{\frac{N}{2} - F}{f} \times i$
where l is the lower limit of the median class, f is the frequency of the median class, i is the width of the class interval, F is the total of all the frequencies preceding the median class and N is the total frequency of the data.
Example 6. Find the value of median from the following data
Number of days for which absent (less than) | 5 | 10 | 15 | 20 | 25 | 30 | 35 | 40 | 45 |
Number of students | 29 | 224 | 465 | 582 | 634 | 644 | 650 | 653 | 655 |
Solution. The given cumulative frequency distribution will first be converted into ordinary frequency as under:
Class interval | Cumulative frequency | Ordinary frequency |
0-5 | 29 | 29 |
5-10 | 224 | 224-29=195 |
10-15 | 465 | 465-224=241 |
15-20 | 582 | 582-465=117 |
20-25 | 634 | 634-582=52 |
25-30 | 644 | 644-634=10 |
30-35 | 650 | 650-644=6 |
35-40 | 653 | 653-650=3 |
40-45 | 655 | 655-653=2 |
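Median = size of the $\frac{N}{2}$th item = $\frac{655}{2}$ = 327.5th item
The 327.5th item lies in 10-15, which is the median class.
Median $= l + \dfrac{\frac{N}{2} - C}{f} \times i = 10 + \dfrac{327.5 - 224}{241} \times 5 = 10 + 2.15 = 12.15$ (approx.)
where l stands for the lower limit of the median class,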
N stands for the total frequency
C stands for cumulative frequency just preceding the median class
i stands for class interval
f stands for frequency for the median class
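The grouped-median formula lends itself to a short function. The following is a sketch under the assumption of equal class widths; `grouped_median` is a name introduced here for illustration. It is applied to the ordinary frequencies recovered from the cumulative data of Example 6.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median = l + (N/2 - C) / f * i for a grouped distribution."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("empty distribution")
    half = n / 2
    cumulative = 0  # C: cumulative frequency before the current class
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= half:      # this is the median class
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")


cumulative_freq = [29, 224, 465, 582, 634, 644, 650, 653, 655]
# Ordinary frequencies are successive differences of the cumulative ones.
ordinary = [cumulative_freq[0]] + [
    b - a for a, b in zip(cumulative_freq, cumulative_freq[1:])
]
lower_limits = list(range(0, 45, 5))    # 0, 5, ..., 40
median = grouped_median(lower_limits, 5, ordinary)
print(round(median, 2))
```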
Mode
Mode is defined to be the size of the variable which occurs most frequently.
Example 7. Find the mode of the following items
0,1,6,7,2,3,7,6,6,2,6,0,5,6,0.
Solution. 6 occurs 5 times and no other item occurs 5 or more than 5 times, hence the mode is 6.
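For grouped data, Mode $= l + \dfrac{f - f_{-1}}{2f - f_{-1} - f_1} \times i$
where l is the lower limit of the modal class, f is the frequency of the modal class, i is the width of the class, $f_{-1}$ is the frequency of the class before the modal class and $f_1$ is the frequency of the class after the modal class.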
Empirical formula
Mean – Mode =3 [Mean – Median]
Example 8. Find the mode from the following data
Age | 0-6 | 6-12 | 12-18 | 18-24 | 24-30 | 30-36 | 36-42 |
Frequency | 6 | 11 | 25 | 35 | 18 | 12 | 6 |
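Solution.
Age | Frequency | Cumulative frequency |
0-6 | 6 | 6 |
6-12 | 11 | 17 |
12-18 | 25 = $f_{-1}$ | 42 |
18-24 | 35 = f | 77 |
24-30 | 18 = $f_1$ | 95 |
30-36 | 12 | 107 |
36-42 | 6 | 113 |
The modal class is 18-24, so l = 18 and i = 6.
Mode $= l + \dfrac{f - f_{-1}}{2f - f_{-1} - f_1} \times i = 18 + \dfrac{35 - 25}{70 - 25 - 18} \times 6 = 18 + \dfrac{60}{27} = 20.22$ (approx.)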
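The grouped-mode formula can be sketched as a small function and checked against Example 8. The name `grouped_mode` is an assumption, and the sketch assumes equal class widths and a single modal class.

```python
def grouped_mode(lower_limits, width, frequencies):
    """Mode = l + (f - f_before) / (2f - f_before - f_after) * i."""
    k = max(range(len(frequencies)), key=lambda j: frequencies[j])
    f = frequencies[k]
    # A missing neighbouring class is treated as having zero frequency.
    f_before = frequencies[k - 1] if k > 0 else 0
    f_after = frequencies[k + 1] if k + 1 < len(frequencies) else 0
    return lower_limits[k] + (f - f_before) / (2 * f - f_before - f_after) * width


ages = [0, 6, 12, 18, 24, 30, 36]          # lower class limits
freqs = [6, 11, 25, 35, 18, 12, 6]
mode = grouped_mode(ages, 6, freqs)
print(round(mode, 2))
```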
Geometric mean
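If $x_1, x_2, \ldots, x_n$ be n values of the variate x, then the geometric mean
$G = (x_1 x_2 \cdots x_n)^{1/n}$
Example 10. Calculate the harmonic mean of 4, 8, 16.
Solution. The harmonic mean H of n values is given by $\dfrac{1}{H} = \dfrac{1}{n}\sum \dfrac{1}{x}$, so
$\dfrac{1}{H} = \dfrac{1}{3}\left(\dfrac{1}{4} + \dfrac{1}{8} + \dfrac{1}{16}\right) = \dfrac{7}{48}$, i.e. $H = \dfrac{48}{7} = 6.86$ (approx.)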
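A brief sketch of the geometric and harmonic means, using the values 4, 8, 16 from Example 10. The function names are assumptions for illustration; both functions assume strictly positive inputs.

```python
from math import prod


def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return prod(values) ** (1 / len(values))


def harmonic_mean(values):
    """Reciprocal of the mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)


data = [4, 8, 16]
g = geometric_mean(data)
h = harmonic_mean(data)
print(g, h)
```

For these values the harmonic mean is 48/7 and the geometric mean is 8, up to floating-point rounding.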
Average deviation or mean deviation
It is the mean of the absolute values of the deviations of a given set of numbers from their arithmetic mean.
If $x_1, x_2, \ldots, x_n$ be a set of numbers with frequencies $f_1, f_2, \ldots, f_n$ respectively, let $\bar{x}$ be the arithmetic mean of the numbers. Then
Mean deviation $= \dfrac{\sum f_i\,|x_i - \bar{x}|}{\sum f_i}$
Example 11. Find the mean deviation of the following frequency distribution
Class | 0-6 | 6-12 | 12-18 | 18-24 | 24-30 |
Frequency | 8 | 10 | 12 | 9 | 5 |
Solution. Let a = 15
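Class | Mid-value x | Frequency f | d = x-a | fd | |x-14| | f|x-14| |
0-6 | 3 | 8 | -12 | -96 | 11 | 88 |
6-12 | 9 | 10 | -6 | -60 | 5 | 50 |
12-18 | 15 | 12 | 0 | 0 | 1 | 12 |
18-24 | 21 | 9 | +6 | 54 | 7 | 63 |
24-30 | 27 | 5 | +12 | 60 | 13 | 65 |
Total | | 44 | | -42 | | 278 |
Mean $= a + \dfrac{\sum fd}{\sum f} = 15 + \dfrac{-42}{44} = 14.05$, which is taken as 14 (approx.) in the last two columns.
Average deviation $= \dfrac{\sum f\,|x - 14|}{\sum f} = \dfrac{278}{44} = 6.32$ (approx.)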
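The effect of rounding the mean to 14 can be checked numerically. The sketch below (the function name and the `about` parameter are assumptions) computes the mean deviation of Example 11 both about the rounded mean and about the unrounded mean.

```python
def mean_deviation(mid_values, frequencies, about=None):
    """Mean of |x - about| weighted by frequency.

    If `about` is omitted, the arithmetic mean of the data is used.
    """
    n = sum(frequencies)
    if about is None:
        about = sum(f * x for x, f in zip(mid_values, frequencies)) / n
    return sum(f * abs(x - about) for x, f in zip(mid_values, frequencies)) / n


mids = [3, 9, 15, 21, 27]
freqs = [8, 10, 12, 9, 5]
md_rounded = mean_deviation(mids, freqs, about=14)   # as in the worked table
md_exact = mean_deviation(mids, freqs)               # about the unrounded mean
print(round(md_rounded, 2), round(md_exact, 2))
```

The two figures differ only in the second decimal place, which is why the rounded mean is adequate for hand calculation.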
MOMENTS
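The rth moment of a variable x about the mean $\bar{x}$ is usually denoted by $\mu_r$ and is given by
$\mu_r = \dfrac{1}{N}\sum f_i (x_i - \bar{x})^r, \quad N = \sum f_i$
The rth moment of a variable x about any point a is defined by
$\mu_r' = \dfrac{1}{N}\sum f_i (x_i - a)^r$
Relation between moments about the mean and moments about any point:
$\mu_r = \dfrac{1}{N}\sum f_i (x_i - \bar{x})^r = \dfrac{1}{N}\sum f_i (X_i - d)^r$
where $X_i = x_i - a$ and $d = \bar{x} - a = \mu_1'$. Expanding by the binomial theorem,
$\mu_r = \mu_r' - {}^rC_1\,\mu_{r-1}'\,\mu_1' + {}^rC_2\,\mu_{r-2}'\,\mu_1'^2 - \cdots + (-1)^r \mu_1'^r$
In particular
$\mu_2 = \mu_2' - \mu_1'^2$
$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3$
$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^2 - 3\mu_1'^4$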
Note. 1. The sum of the coefficients of the various terms on the right‐hand side is zero.
2. The dimension of each term on right‐hand side is the same as that of terms on the left.
MOMENT GENERATING FUNCTION
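The moment generating function of the variate x about $x = a$ is defined as the expected value of $e^{t(x-a)}$ and is denoted by $M_a(t)$.
$M_a(t) = \sum p_i e^{t(x_i - a)} = 1 + t\mu_1' + \dfrac{t^2}{2!}\mu_2' + \cdots + \dfrac{t^r}{r!}\mu_r' + \cdots$
where $\mu_r'$ is the moment of order r about a.
Hence $\mu_r'$ = coefficient of $\dfrac{t^r}{r!}$, or $\mu_r' = \left[\dfrac{d^r}{dt^r} M_a(t)\right]_{t=0}$
Again, $M_a(t) = \sum p_i e^{t(x_i - a)} = e^{-at}\sum p_i e^{t x_i} = e^{-at} M_0(t)$
Thus the moment generating function about the point a = $e^{-at}$ × (moment generating function about the origin).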
SKEWNESS:
Skewness denotes the opposite of symmetry. It is lack of symmetry. In a symmetrical series, the mode, the median, and the arithmetic average are identical.
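Coefficient of skewness $= \dfrac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$
KURTOSIS: It measures the degree of peakedness of a distribution and is given by
Measure of kurtosis $\beta_2 = \dfrac{\mu_4}{\mu_2^2}$
[Figure: curves showing negative skewness and positive skewness; kurtosis curves A: mesokurtic, B: leptokurtic, C: platykurtic]
If $\beta_2 = 3$, the curve is normal or mesokurtic.
If $\beta_2 > 3$, the curve is peaked or leptokurtic.
If $\beta_2 < 3$, the curve is flat topped or platykurtic.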
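Example. The first four moments about the working mean 28.5 of a distribution are 0.294, 7.144, 42.409 and 454.98. Calculate the moments about the mean. Also evaluate $\beta_1$ and $\beta_2$, and comment upon the skewness and kurtosis of the distribution.
Solution. The first four moments about the arbitrary origin 28.5 are
$\mu_1' = 0.294,\ \mu_2' = 7.144,\ \mu_3' = 42.409,\ \mu_4' = 454.98$
$\mu_2 = \mu_2' - \mu_1'^2 = 7.144 - (0.294)^2 = 7.058$
$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3 = 42.409 - 6.301 + 0.051 = 36.159$
$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^2 - 3\mu_1'^4 = 454.98 - 49.873 + 3.705 - 0.022 = 408.790$
$\beta_1 = \dfrac{\mu_3^2}{\mu_2^3} = 3.72$ (approx.), which indicates considerable skewness of the distribution.
$\beta_2 = \dfrac{\mu_4}{\mu_2^2} = 8.21$ (approx.), which shows that the distribution is leptokurtic.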
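The conversion from moments about a working mean to moments about the mean is mechanical, so a short sketch may help. It uses the standard relations $\mu_2 = \mu_2' - \mu_1'^2$, $\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3$, $\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^2 - 3\mu_1'^4$ and the data of the example above; the function name is an assumption.

```python
def central_moments(m1, m2, m3, m4):
    """Convert the first four moments about an arbitrary point
    into the moments mu2, mu3, mu4 about the mean."""
    mu2 = m2 - m1**2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1**3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1**2 - 3 * m1**4
    return mu2, mu3, mu4


mu2, mu3, mu4 = central_moments(0.294, 7.144, 42.409, 454.98)
beta1 = mu3**2 / mu2**3   # measure of skewness
beta2 = mu4 / mu2**2      # measure of kurtosis; 3 for a normal curve
print(round(mu2, 3), round(mu3, 3), round(mu4, 3))
print(round(beta1, 2), round(beta2, 2))
```

Since $\beta_2$ comes out well above 3, the computation agrees with the conclusion that the distribution is leptokurtic.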
Example. Calculate the median, quartiles and the quartile coefficient of skewness from the following data:
Weight (lbs) | 70-80 | 80-90 | 90-100 | 100-110 | 110-120 | 120-130 | 130-140 | 140-150 |
No. of persons | 12 | 18 | 35 | 42 | 50 | 45 | 20 | 8 |
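Solution. Here total frequency $N = \sum f = 230$
The cumulative frequency table is
Weight (lbs) | 70-80 | 80-90 | 90-100 | 100-110 | 110-120 | 120-130 | 130-140 | 140-150 |
Frequency | 12 | 18 | 35 | 42 | 50 | 45 | 20 | 8 |
Cumulative Frequency | 12 | 30 | 65 | 107 | 157 | 202 | 222 | 230 |
Now, N/2 = 230/2 = 115th item, which lies in the 110-120 group.
Median or $Q_2 = l + \dfrac{\frac{N}{2} - C}{f} \times i = 110 + \dfrac{115 - 107}{50} \times 10 = 111.6$
Also, $\frac{N}{4} = 57.5$, i.e. $Q_1$ is the 57.5th or 58th item, which lies in the 90-100 group.
$Q_1 = 90 + \dfrac{57.5 - 30}{35} \times 10 = 97.86$
Similarly 3N/4 = 172.5, i.e. $Q_3$ is the 173rd item, which lies in the 120-130 group.
$Q_3 = 120 + \dfrac{172.5 - 157}{45} \times 10 = 123.44$
Hence quartile coefficient of skewness $= \dfrac{Q_1 + Q_3 - 2Q_2}{Q_3 - Q_1} = \dfrac{97.86 + 123.44 - 2(111.6)}{123.44 - 97.86} = -0.07$ (approx.)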
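The quartiles use the same interpolation as the grouped median, with N/4 and 3N/4 in place of N/2. The sketch below generalises this to any fraction; `grouped_quantile` is a name introduced here, equal class widths are assumed, and the quartile coefficient of skewness is taken as $(Q_1 + Q_3 - 2Q_2)/(Q_3 - Q_1)$.

```python
def grouped_quantile(lower_limits, width, frequencies, fraction):
    """Value below which `fraction` of a grouped distribution lies,
    by linear interpolation inside the class that contains it."""
    target = fraction * sum(frequencies)
    cumulative = 0
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("quantile class not found")


weights = list(range(70, 150, 10))              # lower limits 70, 80, ..., 140
persons = [12, 18, 35, 42, 50, 45, 20, 8]

q1 = grouped_quantile(weights, 10, persons, 0.25)
q2 = grouped_quantile(weights, 10, persons, 0.50)
q3 = grouped_quantile(weights, 10, persons, 0.75)
quartile_skewness = (q1 + q3 - 2 * q2) / (q3 - q1)
print(round(q1, 2), round(q2, 2), round(q3, 2), round(quartile_skewness, 2))
```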
A probability distribution is a mathematical function that gives the possible values of a random variable and the probabilities with which it takes them. The range of values is bounded by the minimum and maximum possible values; where a value is likely to fall within that range depends on a number of factors, including the distribution's mean, standard deviation, skewness and kurtosis.
Binomial Distribution:
To find the probability of the happening of an event once, twice, thrice, …, r times, … exactly in n trials.
Let the probability of the happening of an event A in one trial be p and its probability of not happening be 1 – p = q.
We assume that there are n trials and the happening of the event A is r times and its not happening is n – r times.
This may be shown as follows
$\underbrace{A\,A \ldots A}_{r \text{ times}}\ \underbrace{\bar{A}\,\bar{A} \ldots \bar{A}}_{n-r \text{ times}}$ (1)
A indicates its happening, $\bar{A}$ its failure, and P(A) = p and P($\bar{A}$) = q.
We see that (1) has the probability
$\underbrace{p\,p \ldots p}_{r \text{ times}}\ \underbrace{q\,q \ldots q}_{n-r \text{ times}} = p^r q^{n-r}$ (2)
Clearly (1) is merely one order of arranging r A's and (n – r) $\bar{A}$'s.
The probability of the event happening r times = (number of different arrangements of r A's and (n – r) $\bar{A}$'s) × (probability of (1)).
The number of different arrangements of r A's and (n – r) $\bar{A}$'s $= {}^nC_r$
Probability of the happening of an event r times: $P(r) = {}^nC_r\, p^r q^{n-r}$
If r = 0, probability of happening of an event 0 times $= q^n$
If r = 1, probability of happening of an event 1 time $= {}^nC_1\, p\, q^{n-1}$
If r = 2, probability of happening of an event 2 times $= {}^nC_2\, p^2 q^{n-2}$
If r = 3, probability of happening of an event 3 times $= {}^nC_3\, p^3 q^{n-3}$, and so on.
These terms are clearly the successive terms in the expansion of $(q + p)^n$.
Hence it is called Binomial Distribution.
Example. On an average, one ship in every ten is wrecked. Find the probability that out of 5 ships expected to arrive, at least 4 will arrive safely.
Solution. Out of 10 ships, one ship is wrecked,
i.e. nine ships out of 10 are safe, so P(safety) $= p = \frac{9}{10}$ and $q = \frac{1}{10}$.
P(at least 4 ships out of 5 are safe) = P(4 or 5) = P(4) + P(5)
$= {}^5C_4 (0.9)^4 (0.1) + (0.9)^5 = 0.32805 + 0.59049 = 0.91854$
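Example. The overall percentage of failures in a certain examination is 20. If 6 candidates appear in the examination, what is the probability that at least five pass the examination?
Solution. Probability of failure = 20% $= \frac{1}{5}$
Probability of passing (p) $= 1 - \frac{1}{5} = \frac{4}{5}$
Probability that at least 5 pass = P(5 or 6) = P(5) + P(6)
$= {}^6C_5 \left(\frac{4}{5}\right)^5 \left(\frac{1}{5}\right) + \left(\frac{4}{5}\right)^6 = 0.393216 + 0.262144 = 0.65536$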
Example. The probability that a man aged 60 will live to be 70 is 0.65. What is the probability that out of 10 men, now 60, at least seven will live to be 70?
Solution. The probability that a man aged 60 will live to be 70 = p = 0.65, so q = 0.35.
Number of men = n = 10
Probability that at least 7 men will live to 70 = P(7 or 8 or 9 or 10)
= P(7) + P(8) + P(9) + P(10) $= \sum_{r=7}^{10} {}^{10}C_r (0.65)^r (0.35)^{10-r}$ = 0.25222 + 0.17565 + 0.07249 + 0.01346 = 0.5138 (approx.)
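Example. Assuming that 20% of the population of a city are literate, so that the chance of an individual being literate is $\frac{1}{5}$, and assuming that a hundred investigators each take 10 individuals to see whether they are literate, how many investigators would you expect to report that 3 or fewer were literate?
Solution. Here $p = \frac{1}{5}$, $q = \frac{4}{5}$, n = 10.
P(3 or fewer literate) = P(0) + P(1) + P(2) + P(3) $= \sum_{r=0}^{3} {}^{10}C_r (0.2)^r (0.8)^{10-r} = 0.879126118$
Required number of investigators = 0.879126118 × 100 = 87.9126118
= 88 (approximately)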
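The four worked examples above all evaluate sums of binomial terms, which can be sketched with `math.comb`. The helper names `binomial_pmf` and `prob_between` are assumptions for illustration.

```python
from math import comb


def binomial_pmf(n, r, p):
    """P(r) = nCr * p^r * q^(n-r), with q = 1 - p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def prob_between(n, lo, hi, p):
    """P(lo <= successes <= hi) in n independent trials."""
    return sum(binomial_pmf(n, r, p) for r in range(lo, hi + 1))


ships = prob_between(5, 4, 5, 0.9)        # at least 4 of 5 ships safe
exam = prob_between(6, 5, 6, 0.8)         # at least 5 of 6 candidates pass
men = prob_between(10, 7, 10, 0.65)       # at least 7 of 10 men live to 70
literate = prob_between(10, 0, 3, 0.2)    # 3 or fewer literate out of 10
print(round(ships, 5), round(exam, 5), round(men, 4), round(100 * literate))
```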
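Mean of binomial distribution
Successes r | Frequency f | rf |
0 | $q^n$ | 0 |
1 | $n q^{n-1} p$ | $n q^{n-1} p$ |
2 | $\frac{n(n-1)}{2} q^{n-2} p^2$ | $n(n-1)\, q^{n-2} p^2$ |
3 | $\frac{n(n-1)(n-2)}{6} q^{n-3} p^3$ | $\frac{n(n-1)(n-2)}{2} q^{n-3} p^3$ |
….. | …… | …. |
n | $p^n$ | $n p^n$ |
$\sum fr = np\left[q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{2} q^{n-3} p^2 + \cdots + p^{n-1}\right] = np\,(q+p)^{n-1} = np$
Since $p + q = 1$ and $\sum f = (q+p)^n = 1$, Mean $= \dfrac{\sum fr}{\sum f} = np$.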
STANDARD DEVIATION OF BINOMIAL DISTRIBUTION
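Successes r | Frequency f | $r^2 f$ |
0 | $q^n$ | 0 |
1 | $n q^{n-1} p$ | $n q^{n-1} p$ |
2 | $\frac{n(n-1)}{2} q^{n-2} p^2$ | $2n(n-1)\, q^{n-2} p^2$ |
3 | $\frac{n(n-1)(n-2)}{6} q^{n-3} p^3$ | $\frac{3n(n-1)(n-2)}{2} q^{n-3} p^3$ |
….. | …… | …. |
n | $p^n$ | $n^2 p^n$ |
We know that $\sigma^2 = \dfrac{\sum f r^2}{\sum f} - \left(\dfrac{\sum fr}{\sum f}\right)^2$ (1)
r is the deviation of items (successes) from 0.
$\sum f = 1$, $\sum fr = np$
$\sum f r^2 = np\left[q^{n-1} + 2(n-1)q^{n-2}p + \frac{3(n-1)(n-2)}{2} q^{n-3}p^2 + \cdots + n p^{n-1}\right] = np\left[(q+p)^{n-1} + (n-1)p\,(q+p)^{n-2}\right] = np\,[1 + (n-1)p]$
Putting these values in (1) we have
$\sigma^2 = np\,[1 + (n-1)p] - n^2p^2 = np(1 - p) = npq$, so that $\sigma = \sqrt{npq}$
Hence for the binomial distribution, Mean $= np$ and S.D. $\sigma = \sqrt{npq}$.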
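Example. A die is tossed thrice. A success is getting 1 or 6 on a toss. Find the mean and variance of the number of successes.
Solution. Here n = 3, $p = \frac{2}{6} = \frac{1}{3}$, $q = \frac{2}{3}$
Mean $= np = 3 \times \frac{1}{3} = 1$, Variance $= npq = 3 \times \frac{1}{3} \times \frac{2}{3} = \frac{2}{3}$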
RECURRENCE RELATION FOR THE BINOMIAL DISTRIBUTION
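By Binomial Distribution
$P(r) = {}^nC_r\, p^r q^{n-r}$ (1)
$P(r+1) = {}^nC_{r+1}\, p^{r+1} q^{n-r-1}$ (2)
On dividing (2) by (1), we get
$\dfrac{P(r+1)}{P(r)} = \dfrac{n-r}{r+1}\cdot\dfrac{p}{q}$, i.e. $P(r+1) = \dfrac{n-r}{r+1}\cdot\dfrac{p}{q}\,P(r)$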
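The recurrence $P(r+1) = \frac{n-r}{r+1}\cdot\frac{p}{q}\,P(r)$, starting from $P(0) = q^n$, gives a convenient way to generate all the terms. The sketch below (function name assumed, and requiring 0 < p < 1) applies it to the die example with n = 3, p = 1/3 and checks the mean np and variance npq numerically.

```python
def binomial_by_recurrence(n, p):
    """Return [P(0), P(1), ..., P(n)] using
    P(r+1) = (n - r) / (r + 1) * (p / q) * P(r), starting from P(0) = q^n."""
    q = 1 - p
    probs = [q**n]
    for r in range(n):
        probs.append(probs[-1] * (n - r) / (r + 1) * p / q)
    return probs


probs = binomial_by_recurrence(3, 1 / 3)
mean = sum(r * pr for r, pr in enumerate(probs))
variance = sum(r * r * pr for r, pr in enumerate(probs)) - mean**2
print(probs, mean, variance)
```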
Poisson Distribution:
Poisson distribution is a particular limiting form of the Binomial distribution when p (or q) is very small and n is large enough.
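Poisson distribution is
$P(r) = \dfrac{e^{-m} m^r}{r!}$
where m is the mean of the distribution.
Proof. In Binomial Distribution
$P(r) = {}^nC_r\, p^r q^{n-r} = \dfrac{n(n-1)\cdots(n-r+1)}{r!}\left(\dfrac{m}{n}\right)^r\left(1 - \dfrac{m}{n}\right)^{n-r}$, where $m = np$
$= \dfrac{m^r}{r!}\left(1 - \dfrac{1}{n}\right)\left(1 - \dfrac{2}{n}\right)\cdots\left(1 - \dfrac{r-1}{n}\right)\dfrac{(1 - m/n)^n}{(1 - m/n)^r}$
Taking limits when n tends to infinity
$P(r) = \dfrac{m^r}{r!}\,e^{-m}$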
MEAN OF POISSON DISTRIBUTION
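Successes r | Frequency f | f.r |
0 | $e^{-m}$ | 0 |
1 | $m e^{-m}$ | $m e^{-m}$ |
2 | $\frac{m^2 e^{-m}}{2!}$ | $m^2 e^{-m}$ |
3 | $\frac{m^3 e^{-m}}{3!}$ | $\frac{m^3 e^{-m}}{2!}$ |
… | … | … |
r | $\frac{m^r e^{-m}}{r!}$ | $\frac{m^r e^{-m}}{(r-1)!}$ |
… | … | … |
Since $\sum f = 1$, Mean $= \sum fr = m e^{-m}\left[1 + m + \dfrac{m^2}{2!} + \cdots\right] = m e^{-m} e^{m} = m$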
STANDARD DEVIATION OF POISSON DISTRIBUTION
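Successes r | Frequency f | Product rf | Product $r^2 f$ |
0 | $e^{-m}$ | 0 | 0 |
1 | $m e^{-m}$ | $m e^{-m}$ | $m e^{-m}$ |
2 | $\frac{m^2 e^{-m}}{2!}$ | $m^2 e^{-m}$ | $2 m^2 e^{-m}$ |
3 | $\frac{m^3 e^{-m}}{3!}$ | $\frac{m^3 e^{-m}}{2!}$ | $\frac{3 m^3 e^{-m}}{2!}$ |
……. | …….. | …….. | …….. |
r | $\frac{m^r e^{-m}}{r!}$ | $\frac{m^r e^{-m}}{(r-1)!}$ | $\frac{r\, m^r e^{-m}}{(r-1)!}$ |
…….. | ……. | …….. | ……. |
$\sum r^2 f = m e^{-m}\left[1 + 2m + \dfrac{3m^2}{2!} + \dfrac{4m^3}{3!} + \cdots\right] = m e^{-m}\left[e^{m} + m e^{m}\right] = m + m^2$
$\sigma^2 = \dfrac{\sum r^2 f}{\sum f} - (\text{mean})^2 = m + m^2 - m^2 = m$, so that $\sigma = \sqrt{m}$
Hence the mean and variance of a Poisson distribution are each equal to m. Similarly we can obtain $\mu_3 = m$ and $\mu_4 = 3m^2 + m$.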
MEAN DEVIATION
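Show that in a Poisson distribution with unit mean, the mean deviation about the mean is 2/e times the standard deviation.
Solution. $P(r) = \dfrac{e^{-m} m^r}{r!}$. But mean = 1, i.e. m = 1, and S.D. $= \sqrt{m} = 1$, so $P(r) = \dfrac{1}{e\,r!}$
r | P (r) | |r-1| | P(r)|r-1| |
0 | $\frac{1}{e}$ | 1 | $\frac{1}{e}$ |
1 | $\frac{1}{e}$ | 0 | 0 |
2 | $\frac{1}{2!\,e}$ | 1 | $\frac{1}{2!\,e}$ |
3 | $\frac{1}{3!\,e}$ | 2 | $\frac{2}{3!\,e}$ |
4 | $\frac{1}{4!\,e}$ | 3 | $\frac{3}{4!\,e}$ |
….. | ….. | ….. | ….. |
r | $\frac{1}{r!\,e}$ | r-1 | $\frac{r-1}{r!\,e}$ |
Mean deviation $= \sum P(r)\,|r-1| = \dfrac{1}{e}\left[1 + \dfrac{1}{2!} + \dfrac{2}{3!} + \dfrac{3}{4!} + \cdots\right]$
$= \dfrac{1}{e}\left[1 + \left(1 - \dfrac{1}{2!}\right) + \left(\dfrac{1}{2!} - \dfrac{1}{3!}\right) + \left(\dfrac{1}{3!} - \dfrac{1}{4!}\right) + \cdots\right] = \dfrac{1}{e}(1 + 1) = \dfrac{2}{e} = \dfrac{2}{e} \times \text{S.D.}$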
MOMENT GENERATING FUNCTION OF POISSON DISTRIBUTION
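Solution.
Let $M_x(t)$ be the moment generating function, then
$M_x(t) = \sum_{r=0}^{\infty} e^{tr}\,\dfrac{e^{-m} m^r}{r!} = e^{-m}\sum_{r=0}^{\infty}\dfrac{(m e^{t})^r}{r!} = e^{-m}\,e^{m e^{t}} = e^{m(e^{t} - 1)}$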
CUMULANTS
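The cumulant generating function K(t) is given by
$K(t) = \log_e M_x(t) = \log_e e^{m(e^{t} - 1)} = m(e^{t} - 1) = m\left[t + \dfrac{t^2}{2!} + \cdots + \dfrac{t^r}{r!} + \cdots\right]$
Now the rth cumulant $\kappa_r$ = coefficient of $\dfrac{t^r}{r!}$ in K(t) = m
i.e. $\kappa_r = m$, where r = 1, 2, 3, …
Mean $= \kappa_1 = m$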
RECURRENCE FORMULA FOR POISSON DISTRIBUTION
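SOLUTION. By Poisson distribution
$P(r) = \dfrac{e^{-m} m^r}{r!}$ (1)
$P(r+1) = \dfrac{e^{-m} m^{r+1}}{(r+1)!}$ (2)
On dividing (2) by (1) we get
$\dfrac{P(r+1)}{P(r)} = \dfrac{m}{r+1}$, i.e. $P(r+1) = \dfrac{m}{r+1}\,P(r)$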
Example. Assume that the probability of an individual coal miner being killed in a mine accident during a year is . Use appropriate statistical distribution to calculate the probability that in a mine employing 200 miners, there will be at least one fatal accident in a year.
Solution.
Example. Suppose 3% of bolts made by a machine are defective, the defects occurring at random during production. If bolts are packaged 50 per box, find
(a) Exact probability and
(b) Poisson approximation to it, that a given box will contain 5 defectives.
Solution. Here p = 3% = 0.03, q = 0.97, n = 50.
(a) Hence the exact (binomial) probability for 5 defective bolts in a lot of 50 $= {}^{50}C_5 (0.03)^5 (0.97)^{45} = 0.0131$ (approx.)
(b) To get the Poisson approximation, m = np = 50 × 0.03 = 1.5
Required Poisson approximation $= \dfrac{e^{-1.5}(1.5)^5}{5!} = 0.0141$ (approx.)
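Example. In a certain factory producing cycle tyres, there is a small chance of 1 in 500 for a tyre to be defective. The tyres are supplied in lots of 10. Using the Poisson distribution, calculate the approximate number of lots containing no defective, one defective and two defective tyres, respectively, in a consignment of 10,000 lots.
Solution. Here $p = \frac{1}{500}$, n = 10, m = np = 0.02, so that $P(r) = \dfrac{e^{-0.02}(0.02)^r}{r!}$
S.No. | Probability of defective | Number of lots containing defective |
1. | $P(0) = e^{-0.02} = 0.9802$ | 10,000 × 0.9802 = 9802 |
2. | $P(1) = 0.02\,e^{-0.02} = 0.0196$ | 10,000 × 0.0196 = 196 |
3. | $P(2) = \frac{(0.02)^2}{2!}\,e^{-0.02} = 0.000196$ | 10,000 × 0.000196 = 2 (approx.) |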
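The bolt and tyre examples can be checked numerically with a few lines. This is a sketch: `poisson_pmf` is a helper name introduced here, and the inputs are the figures stated in the two examples (3% defective in boxes of 50; 1 in 500 defective in lots of 10, over 10,000 lots).

```python
from math import comb, exp, factorial


def poisson_pmf(r, m):
    """P(r) = e^(-m) * m^r / r! for a Poisson distribution with mean m."""
    return exp(-m) * m**r / factorial(r)


# Bolts: exact binomial probability of 5 defectives versus Poisson with m = np.
exact = comb(50, 5) * 0.03**5 * 0.97**45
approx = poisson_pmf(5, 50 * 0.03)

# Tyres: expected number of lots with 0, 1 and 2 defectives.
m_tyres = 10 * (1 / 500)
lots = [10_000 * poisson_pmf(r, m_tyres) for r in range(3)]

print(round(exact, 4), round(approx, 4))
print([round(x) for x in lots])
```

With m = 1.5 the Poisson value is close to, but not equal to, the exact binomial value; the approximation improves as n grows and p shrinks with np held fixed.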
Normal Distribution
Normal distribution is a continuous distribution. It is derived as the limiting form of the Binomial distribution for large values of n when p and q are not very small.
The normal distribution is given by the equation
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ (1)
where $\mu$ = mean, $\sigma$ = standard deviation, $\pi$ = 3.14159…, e = 2.71828…
On substituting $z = \dfrac{x - \mu}{\sigma}$ in (1) we get $f(z) = \dfrac{1}{\sqrt{2\pi}}\,e^{-z^2/2}$ (2)
Here mean = 0, standard deviation = 1.
(2) is known as the standard form of the normal distribution.
MEAN FOR NORMAL DISTRIBUTION
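Mean $= \displaystyle\int_{-\infty}^{\infty} x f(x)\,dx = \dfrac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} x\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx$ [Putting $\dfrac{x-\mu}{\sigma} = z$]
$= \dfrac{1}{\sqrt{2\pi}}\displaystyle\int_{-\infty}^{\infty} (\mu + \sigma z)\,e^{-z^2/2}\,dz = \mu \cdot 1 + \sigma \cdot 0 = \mu$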
STANDARD DEVIATION FOR NORMAL DISTRIBUTION
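Variance $= \displaystyle\int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx$. Put $\dfrac{x-\mu}{\sigma} = z$, then
Variance $= \dfrac{\sigma^2}{\sqrt{2\pi}}\displaystyle\int_{-\infty}^{\infty} z^2 e^{-z^2/2}\,dz = \sigma^2$, so that the standard deviation is $\sigma$.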
MEDIAN OF THE NORMAL DISTRIBUTION
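If a is the median, then it divides the total area into two equal halves so that
$\displaystyle\int_{-\infty}^{a} f(x)\,dx = \dfrac{1}{2} = \int_{a}^{\infty} f(x)\,dx$
where $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Suppose a > mean $\mu$, then $\displaystyle\int_{-\infty}^{\mu} f(x)\,dx + \int_{\mu}^{a} f(x)\,dx = \dfrac{1}{2}$. But $\displaystyle\int_{-\infty}^{\mu} f(x)\,dx = \dfrac{1}{2}$, so $\displaystyle\int_{\mu}^{a} f(x)\,dx = 0$.
Thus, $a = \mu$.
Similarly, when a < mean, we have a = $\mu$.
Thus, median = mean = $\mu$.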
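MEAN DEVIATION ABOUT THE MEAN
Mean deviation $= \displaystyle\int_{-\infty}^{\infty} |x - \mu|\,f(x)\,dx = \sigma\sqrt{\dfrac{2}{\pi}} \approx 0.7979\,\sigma \approx \dfrac{4}{5}\sigma$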
MODE OF THE NORMAL DISTRIBUTION
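We know that mode is the value of the variate x for which f(x) is maximum. Thus by differential calculus f(x) is maximum if $f'(x) = 0$ and $f''(x) < 0$,
where $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Thus the mode is $x = \mu$ and the modal ordinate $= \dfrac{1}{\sigma\sqrt{2\pi}}$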
NORMAL CURVE
Let us show binomial distribution graphically. The probabilities of heads in 1 tosses are
. it is shown in the given figure.
If the variates (head here) are treated as if they were continuous, the required probability curve will be a normal curve as shown in the above figure by dotted lines.
Properties of the normal curve
AREA UNDER THE NORMAL CURVE
By taking $z = \dfrac{x - \mu}{\sigma}$, the standard normal curve is formed.
The total area under this curve is 1. The area under the curve is divided into two equal parts by z = 0, so the area to the left of z = 0 and the area to the right of z = 0 are each 0.5. The area between the ordinate z = 0 and any other ordinate can be read from the table of areas under the standard normal curve.
Example. On a final examination in mathematics, the mean was 72 and the standard deviation was 15. Determine the standard scores of students receiving the grades
(a) 60
(b) 93
(c) 72
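Solution. (a) $z = \dfrac{x - \mu}{\sigma} = \dfrac{60 - 72}{15} = -0.8$
(b) $z = \dfrac{93 - 72}{15} = 1.4$
(c) $z = \dfrac{72 - 72}{15} = 0$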
Example. Find the area under the normal curve in each of the cases
(a) Z = 0 and z = 1.2
(b) Z = -0.68 and z = 0
(c) Z = -0.46 and z = 2.21
(d) Z = 0.81 and z = 1.94
(e) To the left of z = -0.6
(f) Right of z = -1.28
Solution. (a) Area between Z = 0 and z = 1.2 =0.3849
(b)Area between z = 0 and z = -0.68 = 0.2518
(c) Required area = (Area between z = 0 and z = 2.21) + (Area between z = 0 and z = -0.46)
= (Area between z = 0 and z = 2.21) + (Area between z = 0 and z = 0.46)
= 0.4865 + 0.1772 = 0.6637
(d) Required area = (Area between z = 0 and z = 1.94) - (Area between z = 0 and z = 0.81)
= 0.4738-0.2910=0.1828
(e) Required area = 0.5-(Area between z = 0 and z = 0.6)
= 0.5-0.2257=0.2743
(f)Required area = (Area between z = 0 and z = -1.28)+0.5
= 0.3997+0.5
= 0.8997
Example. The mean inside diameter of a sample of 200 washers produced by a machine is 0.502 cm and the standard deviation is 0.005 cm. The purpose for which these washers are intended allows a maximum tolerance in the diameter of 0.496 to 0.508 cm; otherwise the washers are considered defective. Determine the percentage of defective washers produced by the machine, assuming the diameters are normally distributed.
Solution. With $z = \dfrac{x - \mu}{\sigma}$: $z_1 = \dfrac{0.496 - 0.502}{0.005} = -1.2$ and $z_2 = \dfrac{0.508 - 0.502}{0.005} = +1.2$
Area for non-defective washers = Area between z = -1.2
and z = +1.2
= 2 × (Area between z = 0 and z = 1.2)
= 2 (0.3849) = 0.7698 = 76.98%
Percentage of defective washers = 100-76.98=23.02%
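The table look-ups used in the area examples above can be reproduced with the error function, since the area under the standard normal curve from 0 to z equals $\frac{1}{2}\,\mathrm{erf}(z/\sqrt{2})$. The sketch below is illustrative; the helper names are assumptions, and small differences in the fourth decimal place from printed tables are to be expected.

```python
from math import erf, sqrt


def area_0_to_z(z):
    """Area under the standard normal curve between 0 and z (sign ignored)."""
    return 0.5 * erf(abs(z) / sqrt(2))


def area_between(z1, z2):
    """Area under the standard normal curve between z1 and z2."""
    def cdf(z):
        return 0.5 * (1 + erf(z / sqrt(2)))
    return abs(cdf(z2) - cdf(z1))


a = area_0_to_z(1.2)                    # part (a)
c = area_between(-0.46, 2.21)           # part (c)
d = area_between(0.81, 1.94)            # part (d)
f = 0.5 + area_0_to_z(-1.28)            # part (f): right of z = -1.28
defective = 1 - area_between(-1.2, 1.2) # washers outside the tolerance
print(round(a, 4), round(c, 4), round(d, 4), round(f, 4), round(defective, 4))
```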
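Example. A manufacturer knows from experience that the resistance of the resistors he produces is normal with mean $\mu = 100$ ohms and standard deviation $\sigma = 2$ ohms. What percentage of resistors will have resistance between 98 ohms and 102 ohms?
Solution. $\mu = 100$ ohms, $\sigma = 2$ ohms, so $z_1 = \dfrac{98 - 100}{2} = -1$ and $z_2 = \dfrac{102 - 100}{2} = +1$
Area between $z = -1$ and $z = +1$
= (Area between z = -1 and z = 0) + (Area between z = 0 and z = +1)
= 2 (Area between z = 0 and z = +1) = 2 × 0.3413 = 0.6826
Percentage of resistors having resistance between 98 ohms and 102 ohms = 68.26%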
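Example. In a normal distribution, 31% of the items are under 45 and 8% are over 64. Find the mean and standard deviation of the distribution.
Solution. Let $\mu$ be the mean and $\sigma$ the S.D.
If x = 45, $z = \dfrac{45 - \mu}{\sigma}$
If x = 64, $z = \dfrac{64 - \mu}{\sigma}$
Area between z = 0 and $z = \dfrac{45 - \mu}{\sigma}$ is 0.50 - 0.31 = 0.19
[From the table, for the area 0.19, z = 0.496]
$\therefore \dfrac{45 - \mu}{\sigma} = -0.496$ (1)
Area between z = 0 and $z = \dfrac{64 - \mu}{\sigma}$ is 0.50 - 0.08 = 0.42
(From the table, for the area 0.42, z = 1.405)
$\therefore \dfrac{64 - \mu}{\sigma} = 1.405$ (2)
Solving (1) and (2) we get $\mu = 50$, $\sigma = 10$ (approx.)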
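The same problem can be solved without a printed table by using the inverse of the standard normal distribution function. The sketch below relies on `statistics.NormalDist.inv_cdf` from the Python standard library (Python 3.8 or later); the variable names are assumptions.

```python
from statistics import NormalDist

standard = NormalDist()              # mean 0, standard deviation 1

# 31% of items lie below 45 and 8% lie above 64 (so 92% lie below 64).
z_low = standard.inv_cdf(0.31)       # z-score of x = 45, negative
z_high = standard.inv_cdf(1 - 0.08)  # z-score of x = 64, positive

# Two equations: 45 = mu + z_low * sigma and 64 = mu + z_high * sigma.
sigma = (64 - 45) / (z_high - z_low)
mu = 45 - z_low * sigma
print(round(mu, 1), round(sigma, 1))
```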
Correlation
So far we have confined our attention to the analysis of observations on a single variable. There are, however, many phenomena where the changes in one variable are related to the changes in another variable. For instance, the yield of a crop varies with the amount of rainfall, the price of a commodity increases with a reduction in its supply, and so on. Such data connecting two variables are called a bivariate population.
To obtain a measure of the relationship between the two variables, we plot their corresponding values on a graph, taking one of the variables along the x-axis and the other along the y-axis (Figure 25.6).
Let the origin be shifted to $(\bar{x}, \bar{y})$, where $\bar{x}, \bar{y}$ are the means of the x's and y's, so that the new coordinates are given by $X = x - \bar{x}$, $Y = y - \bar{y}$.
Now the points (X, Y) are so distributed over the four quadrants of the XY-plane that the product XY is positive in the first and third quadrants but negative in the second and fourth quadrants. The algebraic sum of the products, $\sum XY$, can be taken as describing the trend of the dots in all the quadrants.
(i) If $\sum XY$ is positive, the trend of the dots is through the first and third quadrants;
(ii) if $\sum XY$ is negative, the trend of the dots is in the second and fourth quadrants; and
(iii) if $\sum XY$ is zero, the points indicate no trend, i.e. the points are evenly distributed over the quadrants.
The sum $\sum XY$, or better still $\frac{1}{n}\sum XY$, i.e. the average of the n products, may be taken as a measure of correlation. If we put X and Y in their units, i.e. taking $\sigma_x$ as the unit for x and $\sigma_y$ for y, then
$\dfrac{1}{n}\sum \dfrac{X}{\sigma_x}\cdot\dfrac{Y}{\sigma_y} = \dfrac{\sum XY}{n\,\sigma_x\sigma_y}$
is the measure of correlation.
Coefficient of correlation
The numerical measure of correlation is called the coefficient of correlation and is defined by the relation
$r = \dfrac{\sum XY}{n\,\sigma_x\sigma_y}$
where X = deviation from the mean $\bar{x}$ = $x - \bar{x}$, Y = deviation from the mean $\bar{y}$ = $y - \bar{y}$,
$\sigma_x$ = standard deviation of the x series, $\sigma_y$ = standard deviation of the y series and n = number of values of the two variables.
Methods of calculation
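(a) Direct method. Substituting the values of $\sigma_x = \sqrt{\sum X^2 / n}$ and $\sigma_y = \sqrt{\sum Y^2 / n}$ in the above formula, we get
$r = \dfrac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}}$ (1)
Another form of the formula (1) which is quite handy for calculation is
$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$ (2)
(b) Step deviation method. The direct method becomes very lengthy and tedious if the means of the two series are not integers. In such cases, use is made of assumed means. If $d_x, d_y$ are step deviations from the assumed means, then
$r = \dfrac{n\sum d_x d_y - \sum d_x \sum d_y}{\sqrt{\left[n\sum d_x^2 - (\sum d_x)^2\right]\left[n\sum d_y^2 - (\sum d_y)^2\right]}}$ (3)
(c) Coefficient of correlation for grouped data. When the x and y series are both given as frequency distributions, these can be represented by a two-way table known as the correlation table. The coefficient of correlation for such a bivariate frequency distribution is calculated by the formula
$r = \dfrac{n\sum f d_x d_y - \sum f d_x \sum f d_y}{\sqrt{\left[n\sum f d_x^2 - (\sum f d_x)^2\right]\left[n\sum f d_y^2 - (\sum f d_y)^2\right]}}$ (4)
where $d_x$ = deviation of the central values from the assumed mean of the x series,
$d_y$ = deviation of the central values from the assumed mean of the y series,
f is the frequency corresponding to the pair (x, y),
$n = \sum f$ is the total number of frequencies.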
Example. Psychological test of the intelligence and of Engineering ability were applied to 10 students. Here is a record of ungrouped data showing intelligence ratio (I.R) and Engineering ratio (E.R). Calculate the coefficient of correlation.
Student | A | B | C | D | E | F | G | H | I | J |
I.R. | 105 | 104 | 102 | 101 | 100 | 99 | 98 | 96 | 93 | 92 |
E.R. | 101 | 103 | 100 | 98 | 95 | 96 | 104 | 92 | 97 | 94 |
Solution. We construct the following table
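Student | Intelligence ratio x | $X = x - \bar{x}$ | Engineering ratio y | $Y = y - \bar{y}$ | $X^2$ | $Y^2$ | XY |
A | 105 | 6 | 101 | 3 | 36 | 9 | 18 |
B | 104 | 5 | 103 | 5 | 25 | 25 | 25 |
C | 102 | 3 | 100 | 2 | 9 | 4 | 6 |
D | 101 | 2 | 98 | 0 | 4 | 0 | 0 |
E | 100 | 1 | 95 | -3 | 1 | 9 | -3 |
F | 99 | 0 | 96 | -2 | 0 | 4 | 0 |
G | 98 | -1 | 104 | 6 | 1 | 36 | -6 |
H | 96 | -3 | 92 | -6 | 9 | 36 | 18 |
I | 93 | -6 | 97 | -1 | 36 | 1 | 6 |
J | 92 | -7 | 94 | -4 | 49 | 16 | 28 |
Total | 990 | 0 | 980 | 0 | 170 | 140 | 92 |
From this table, mean of x, i.e. $\bar{x} = \frac{990}{10} = 99$ and mean of y, i.e. $\bar{y} = \frac{980}{10} = 98$; also $\sum X^2 = 170$, $\sum Y^2 = 140$ and $\sum XY = 92$.
Substituting these values in the direct-method formula $r = \dfrac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}}$, we have $r = \dfrac{92}{\sqrt{170 \times 140}} = \dfrac{92}{154.3} = 0.596$ (approx.)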
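The direct method can be sketched in a few lines of Python using the I.R. and E.R. values of the ten students. The function name `pearson_r` is an assumption for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """r = sum(XY) / sqrt(sum(X^2) * sum(Y^2)), with X, Y deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


intelligence = [105, 104, 102, 101, 100, 99, 98, 96, 93, 92]
engineering = [101, 103, 100, 98, 95, 96, 104, 92, 97, 94]
r = pearson_r(intelligence, engineering)
print(round(r, 3))
```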
Example. The correlation table given below shows the ages of husband and wife of 53 married couples living together on the census night of 1991. Calculate the coefficient of correlation between the age of the husband and that of the wife.
Age of husband \ Age of wife | 15-25 | 25-35 | 35-45 | 45-55 | 55-65 | 65-75 | Total |
15-25 | 1 | 1 | - | - | - | - | 2 |
25-35 | 2 | 12 | 1 | - | - | - | 15 |
35-45 | - | 4 | 10 | 1 | - | - | 15 |
45-55 | - | - | 3 | 6 | 1 | - | 10 |
55-65 | - | - | - | 2 | 4 | 2 | 8 |
65-75 | - | - | - | - | 1 | 2 | 3 |
Total | 3 | 17 | 14 | 9 | 6 | 4 | 53 |
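Solution. Let x denote the age of the wife and y the age of the husband, both taken at the class midpoints, and suppose $d_x = \dfrac{x - 40}{10}$ and $d_y = \dfrac{y - 40}{10}$. In each cell of the table below the first figure is the frequency f and the figure in brackets is the product $f d_x d_y$.
Age of husband | Midpoint y | $d_y$ | Wife x = 20 ($d_x$ = -2) | x = 30 ($d_x$ = -1) | x = 40 ($d_x$ = 0) | x = 50 ($d_x$ = 1) | x = 60 ($d_x$ = 2) | x = 70 ($d_x$ = 3) | Total f | $f d_y$ | $f d_y^2$ | $f d_x d_y$ |
15-25 | 20 | -2 | 1 (4) | 1 (2) | - | - | - | - | 2 | -4 | 8 | 6 |
25-35 | 30 | -1 | 2 (4) | 12 (12) | 1 (0) | - | - | - | 15 | -15 | 15 | 16 |
35-45 | 40 | 0 | - | 4 (0) | 10 (0) | 1 (0) | - | - | 15 | 0 | 0 | 0 |
45-55 | 50 | 1 | - | - | 3 (0) | 6 (6) | 1 (2) | - | 10 | 10 | 10 | 8 |
55-65 | 60 | 2 | - | - | - | 2 (4) | 4 (16) | 2 (12) | 8 | 16 | 32 | 32 |
65-75 | 70 | 3 | - | - | - | - | 1 (6) | 2 (18) | 3 | 9 | 27 | 24 |
Total f | | | 3 | 17 | 14 | 9 | 6 | 4 | 53 = n | 16 | 92 | 86 |
$f d_x$ | | | -6 | -17 | 0 | 9 | 12 | 12 | 10 | | | |
$f d_x^2$ | | | 12 | 17 | 0 | 9 | 24 | 36 | 98 | | | |
$f d_x d_y$ | | | 8 | 14 | 0 | 10 | 24 | 30 | 86 | | | |
Check: the total 86 of $f d_x d_y$ is obtained both from the rows and from the columns.
With the help of the above correlation table, we have
$r = \dfrac{n\sum f d_x d_y - \sum f d_x \sum f d_y}{\sqrt{\left[n\sum f d_x^2 - (\sum f d_x)^2\right]\left[n\sum f d_y^2 - (\sum f d_y)^2\right]}} = \dfrac{53 \times 86 - 10 \times 16}{\sqrt{(53 \times 98 - 100)(53 \times 92 - 256)}} = \dfrac{4398}{\sqrt{5094 \times 4620}} = 0.91$ (approx.)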
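The sums needed for the grouped formula can be accumulated directly from the two-way table of frequencies. The sketch below uses the 53 couples of the example with step deviations -2, …, 3 for both series (deviations of the class midpoints from 40, divided by 10); the variable names are assumptions.

```python
from math import sqrt

# Rows: age of husband 15-25, ..., 65-75; columns: age of wife 15-25, ..., 65-75.
table = [
    [1, 1, 0, 0, 0, 0],
    [2, 12, 1, 0, 0, 0],
    [0, 4, 10, 1, 0, 0],
    [0, 0, 3, 6, 1, 0],
    [0, 0, 0, 2, 4, 2],
    [0, 0, 0, 0, 1, 2],
]
steps = [-2, -1, 0, 1, 2, 3]   # (midpoint - 40) / 10 for both series

n = sum(sum(row) for row in table)
sum_fdx = sum(f * dx for row in table for f, dx in zip(row, steps))
sum_fdy = sum(f * dy for row, dy in zip(table, steps) for f in row)
sum_fdx2 = sum(f * dx * dx for row in table for f, dx in zip(row, steps))
sum_fdy2 = sum(f * dy * dy for row, dy in zip(table, steps) for f in row)
sum_fdxdy = sum(
    f * dx * dy for row, dy in zip(table, steps) for f, dx in zip(row, steps)
)

r = (n * sum_fdxdy - sum_fdx * sum_fdy) / sqrt(
    (n * sum_fdx2 - sum_fdx**2) * (n * sum_fdy2 - sum_fdy**2)
)
print(n, sum_fdx, sum_fdy, sum_fdx2, sum_fdy2, sum_fdxdy, round(r, 2))
```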
Lines of Regression
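It frequently happens that the dots of the scatter diagram tend to cluster along a well-defined direction, which suggests a linear relationship between the variables x and y. Such a line of best fit for the given distribution of dots is called the line of regression (Figure 25.6). In fact there are two such lines, one giving the best possible mean values of y for each specified value of x and the other giving the best possible mean values of x for given values of y. The former is known as the line of regression of y on x and the latter as the line of regression of x on y.
Consider first the line of regression of y on x. Let the straight line satisfying the general trend of n dots in a scatter diagram be
$y = a + bx$ (1)
We have to determine the constants a and b so that (1) gives, for each value of x, the best estimate for the average value of y in accordance with the principle of least squares. Therefore, the normal equations for a and b are
$\sum y = na + b\sum x$ (2) and $\sum xy = a\sum x + b\sum x^2$ (3)
Dividing (2) by n, i.e. $\bar{y} = a + b\bar{x}$.
This shows that $(\bar{x}, \bar{y})$, i.e. the means of x and y, lie on (1).
Shifting the origin to $(\bar{x}, \bar{y})$, (3) takes the form $\sum (x - \bar{x})(y - \bar{y}) = a\sum (x - \bar{x}) + b\sum (x - \bar{x})^2$; since $\sum (x - \bar{x}) = 0$, this gives $b = \dfrac{\sum XY}{\sum X^2} = \dfrac{\sum XY}{n\sigma_x^2} = r\,\dfrac{\sigma_y}{\sigma_x}$.
Thus the line of regression of y on x is $y - \bar{y} = r\,\dfrac{\sigma_y}{\sigma_x}(x - \bar{x})$ and, interchanging x and y, the line of regression of x on y is $x - \bar{x} = r\,\dfrac{\sigma_x}{\sigma_y}(y - \bar{y})$. The slopes $r\sigma_y/\sigma_x$ and $r\sigma_x/\sigma_y$ are called the regression coefficients of y on x and of x on y respectively.
Cor. The correlation coefficient r is the geometric mean between the two regression coefficients.
For $r\,\dfrac{\sigma_y}{\sigma_x} \times r\,\dfrac{\sigma_x}{\sigma_y} = r^2$.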
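Example. The two regression equations of the variables x and y are x = 19.13 – 0.87 y and y = 11.64 – 0.50 x. Find (i) mean of x's (ii) mean of y's and (iii) the correlation coefficient between x and y.
Solution. Since the mean of x's and the mean of y's lie on the two regression lines, we have
$\bar{x} = 19.13 - 0.87\,\bar{y}$ (i)
$\bar{y} = 11.64 - 0.50\,\bar{x}$ (ii)
Multiplying (ii) by 0.87 and subtracting from (i), we have
$[1 - (0.87)(0.50)]\,\bar{x} = 19.13 - (0.87)(11.64)$, i.e. $0.565\,\bar{x} = 9.00$, so $\bar{x} = 15.93$ and, from (ii), $\bar{y} = 11.64 - 0.50\,(15.93) = 3.67$ (approx.)
Regression coefficient of y on x is -0.50 and that of x on y is -0.87.
Now, since the coefficient of correlation is the geometric mean between the two regression coefficients,
$r = -\sqrt{(0.50)(0.87)} = -\sqrt{0.435} = -0.66$
[-ve sign is taken since both the regression coefficients are –ve]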
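The means and the correlation coefficient follow from the two regression lines by solving a pair of linear equations. The sketch below assumes the lines are written as x = c_x + b_xy·y and y = c_y + b_yx·x, with the regression coefficients -0.87 and -0.50 stated in the solution above; the function and parameter names are assumptions.

```python
from math import copysign, sqrt


def regression_summary(c_x, b_xy, c_y, b_yx):
    """For the lines x = c_x + b_xy * y and y = c_y + b_yx * x,
    return their point of intersection (x_bar, y_bar) and r."""
    # Substitute the second line into the first and solve for x_bar.
    x_bar = (c_x + b_xy * c_y) / (1 - b_xy * b_yx)
    y_bar = c_y + b_yx * x_bar
    # r is the geometric mean of the coefficients, carrying their common sign.
    r = copysign(sqrt(b_xy * b_yx), b_yx)
    return x_bar, y_bar, r


x_bar, y_bar, r = regression_summary(19.13, -0.87, 11.64, -0.50)
print(round(x_bar, 2), round(y_bar, 2), round(r, 2))
```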
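Example. If $\theta$ is the angle between the two regression lines, show that
$\tan\theta = \dfrac{1 - r^2}{r}\cdot\dfrac{\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2}$
Explain the significance when r = 0 and r = ±1.
Solution. The equations of the lines of regression of y on x and x on y are
$y - \bar{y} = r\,\dfrac{\sigma_y}{\sigma_x}(x - \bar{x})$ and $x - \bar{x} = r\,\dfrac{\sigma_x}{\sigma_y}(y - \bar{y})$
Their slopes are $m_1 = \dfrac{r\sigma_y}{\sigma_x}$ and $m_2 = \dfrac{\sigma_y}{r\sigma_x}$
Thus, $\tan\theta = \dfrac{m_2 - m_1}{1 + m_1 m_2} = \dfrac{\dfrac{\sigma_y}{r\sigma_x} - \dfrac{r\sigma_y}{\sigma_x}}{1 + \dfrac{\sigma_y^2}{\sigma_x^2}} = \dfrac{1 - r^2}{r}\cdot\dfrac{\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2}$
When r = 0, $\tan\theta \to \infty$, i.e. $\theta = \pi/2$: when the variables are uncorrelated, the two lines of regression are perpendicular to each other.
When r = ±1, $\tan\theta = 0$, i.e. $\theta = 0$ or $\pi$. Thus the lines of regression coincide, i.e. there is perfect correlation between the two variables.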
Example. While calculating the correlation coefficient between two variables x and y from 25 pairs of observations, the following results were obtained: n = 25, Later it was discovered at the time of checking that the pairs of values x = 8, 6 and y = 12, 8 were copied down as x = 6, 8 and y = 14, 6. Obtain the correct value of the correlation coefficient.
Solution. To get the correct results, we subtract the incorrect values and add the corresponding correct values.
The correct results would be
RANK CORRELATION
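A group of n individuals may be arranged in order of merit with respect to some characteristic. The same group would give different orders for different characteristics. Considering the orders corresponding to two characteristics A and B, the correlation between these n pairs of ranks is called the rank correlation in the characteristics A and B for that group of individuals.
Let $x_i, y_i$ be the ranks of the ith individual in A and B respectively. Assuming that no two individuals are bracketed equal in either case, each of the variables takes the values 1, 2, 3, …, n, and we have
$\bar{x} = \bar{y} = \dfrac{1 + 2 + \cdots + n}{n} = \dfrac{n+1}{2}$
If X, Y be the deviations of x, y from their means, then
$\sum X_i^2 = \sum x_i^2 - n\bar{x}^2 = \dfrac{n(n+1)(2n+1)}{6} - \dfrac{n(n+1)^2}{4} = \dfrac{n(n^2 - 1)}{12} = \sum Y_i^2$
Now let $d_i = x_i - y_i$, so that $d_i = X_i - Y_i$ and
$\sum d_i^2 = \sum X_i^2 + \sum Y_i^2 - 2\sum X_i Y_i$, i.e. $\sum X_i Y_i = \dfrac{n(n^2 - 1)}{12} - \dfrac{1}{2}\sum d_i^2$
Hence the correlation coefficient between these variables is
$r = \dfrac{\sum X_i Y_i}{\sqrt{\sum X_i^2 \sum Y_i^2}} = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)}$
This is called the rank correlation coefficient and is denoted by $\rho$.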
Example. Ten participants in a contest are ranked by two judges as follows:
x | 1 | 6 | 5 | 10 | 3 | 2 | 4 | 9 | 7 | 8 |
y | 6 | 4 | 9 | 8 | 1 | 2 | 3 | 10 | 5 | 7 |
Calculate the rank correlation coefficient
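Solution. If $d_i = x_i - y_i$, then $d_i$ = -5, 2, -4, 2, 2, 0, 1, -1, 2, 1 and $\sum d_i^2$ = 25 + 4 + 16 + 4 + 4 + 0 + 1 + 1 + 4 + 1 = 60.
Hence, $\rho = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 60}{10 \times 99} = 0.64$ (approx.)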
Example. Three judges A,B,C give the following ranks. Find which pair of judges has common approach
A | 1 | 6 | 5 | 10 | 3 | 2 | 4 | 9 | 7 | 8 |
B | 3 | 5 | 8 | 4 | 7 | 10 | 2 | 1 | 6 | 9 |
C | 6 | 4 | 9 | 8 | 1 | 2 | 3 | 10 | 5 | 7 |
Solution. Here n = 10
Ranks by A (= x) | Ranks by B (= y) | Ranks by C (= z) | $d_1 = x-y$ | $d_2 = y-z$ | $d_3 = z-x$ | $d_1^2$ | $d_2^2$ | $d_3^2$ |
1 | 3 | 6 | -2 | -3 | 5 | 4 | 9 | 25 |
6 | 5 | 4 | 1 | 1 | -2 | 1 | 1 | 4 |
5 | 8 | 9 | -3 | -1 | 4 | 9 | 1 | 16 |
10 | 4 | 8 | 6 | -4 | -2 | 36 | 16 | 4 |
3 | 7 | 1 | -4 | 6 | -2 | 16 | 36 | 4 |
2 | 10 | 2 | -8 | 8 | 0 | 64 | 64 | 0 |
4 | 2 | 3 | 2 | -1 | -1 | 4 | 1 | 1 |
9 | 1 | 10 | 8 | -9 | 1 | 64 | 81 | 1 |
7 | 6 | 5 | 1 | 1 | -2 | 1 | 1 | 4 |
8 | 9 | 7 | -1 | 2 | -1 | 1 | 4 | 1 |
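Total | | | 0 | 0 | 0 | 200 | 214 | 60 |
$\rho(x, y) = 1 - \dfrac{6\times 200}{990} = -0.21$, $\rho(y, z) = 1 - \dfrac{6\times 214}{990} = -0.30$, $\rho(z, x) = 1 - \dfrac{6\times 60}{990} = 0.64$.
Since $\rho(z, x)$ is maximum, the pair of judges A and C have the nearest common approach.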
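The comparison of the three pairs of judges can be checked with a short Python sketch of the formula $\rho = 1 - 6\sum d^2/(n^3-n)$. It assumes there are no tied ranks, and the names spearman_rho and best_pair are illustrative only.

```python
from itertools import combinations

def spearman_rho(x, y):
    """Rank correlation 1 - 6*sum(d^2)/(n^3 - n); assumes no tied ranks."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))

ranks = {
    "A": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "B": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "C": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

# Rank correlation for every pair of judges.
rho = {
    pair: spearman_rho(ranks[pair[0]], ranks[pair[1]])
    for pair in combinations(ranks, 2)
}
best_pair = max(rho, key=rho.get)

for pair, value in rho.items():
    print(pair, round(value, 3))
print("nearest approach:", best_pair)
```

The ranks of A and C are the same as x and y in the previous example, so the value for that pair reproduces the coefficient obtained there.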
Method of Least Squares
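Let $y = a + bx$ (1)
be the straight line to be fitted to the given data points $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.
Let $y_{t_i} = a + bx_i$ be the theoretical value for $x_i$.
Then the error is $e_i = y_i - y_{t_i} = y_i - a - bx_i$, and the sum of the squares of the errors is
$S = \sum e_i^2 = \sum (y_i - a - bx_i)^2$.
For S to be minimum,
$\dfrac{\partial S}{\partial a} = -2\sum (y_i - a - bx_i) = 0$ (2)
$\dfrac{\partial S}{\partial b} = -2\sum x_i(y_i - a - bx_i) = 0$ (3)
On simplification, equations (2) and (3) become
$\sum y = na + b\sum x$ (4)
$\sum xy = a\sum x + b\sum x^2$ (5)
The equations (4) and (5) are known as the normal equations.
On solving (4) and (5) we get the values of a and b.
(b) To fit the parabola $y = a + bx + cx^2$
The normal equations are
$\sum y = na + b\sum x + c\sum x^2$,
$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$,
$\sum x^2y = a\sum x^2 + b\sum x^3 + c\sum x^4$.
On solving the three normal equations we get the values of a, b and c.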
Example. Find the best values of a and b so that y = a + bx fits the data given in the table
x | 0 | 1 | 2 | 3 | 4 |
y | 1.0 | 2.9 | 4.8 | 6.7 | 8.6 |
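Solution. Let the straight line of best fit be
$y = a + bx$ (1)
x | y | xy | $x^2$ |
0 | 1.0 | 0 | 0 |
1 | 2.9 | 2.9 | 1 |
2 | 4.8 | 9.6 | 4 |
3 | 6.7 | 20.1 | 9 |
4 | 8.6 | 34.4 | 16 |
$\sum x = 10$ | $\sum y = 24.0$ | $\sum xy = 67.0$ | $\sum x^2 = 30$ |
Normal equations are
$\sum y = na + b\sum x$ (2)
$\sum xy = a\sum x + b\sum x^2$ (3)
On putting the values of $\sum x$, $\sum y$, $\sum xy$ and $\sum x^2$ in (2) and (3) we have
$24 = 5a + 10b$ (4)
$67 = 10a + 30b$ (5)
On solving (4) and (5) we get a = 1, b = 1.9.
On substituting the values of a and b in (1) we get $y = 1 + 1.9x$.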
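As a check on the arithmetic, the two normal equations of a straight-line fit can be solved in closed form. The following Python sketch does this for the data of the example; the function name fit_line is illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x obtained from the two normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Eliminate a between  sy = n*a + b*sx  and  sxy = a*sx + b*sxx.
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

a, b = fit_line([0, 1, 2, 3, 4], [1.0, 2.9, 4.8, 6.7, 8.6])
print(round(a, 6), round(b, 6))
```

The same function applies to any straight-line fit, provided the x values are not all equal (otherwise the denominator vanishes).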
Example. By the method of least squares, find the straight line that best fits the following data :
x | 1 | 2 | 3 | 4 | 5 |
y | 14 | 27 | 40 | 55 | 68 |
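Solution. Let the equation of the straight line of best fit be $y = a + bx$. (1)
x | y | xy | $x^2$ |
1 | 14 | 14 | 1 |
2 | 27 | 54 | 4 |
3 | 40 | 120 | 9 |
4 | 55 | 220 | 16 |
5 | 68 | 340 | 25 |
$\sum x = 15$ | $\sum y = 204$ | $\sum xy = 748$ | $\sum x^2 = 55$ |
Normal equations are
$\sum y = na + b\sum x$ (2)
$\sum xy = a\sum x + b\sum x^2$ (3)
On putting the values of $\sum x$, $\sum y$, $\sum xy$ and $\sum x^2$ in (2) and (3) we have
$204 = 5a + 15b$ (4)
$748 = 15a + 55b$ (5)
On solving equations (4) and (5) we get a = 0, b = 13.6.
On substituting the values of a and b in (1) we get $y = 13.6x$.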
Example. Find the least squares approximation of second degree for the discrete data
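x | -2 | -1 | 0 | 1 | 2 |
y | 15 | 1 | 1 | 3 | 19 |
Solution. Let the equation of the second degree polynomial be
$y = a + bx + cx^2$ (1)
x | y | xy | $x^2$ | $x^2y$ | $x^3$ | $x^4$ |
-2 | 15 | -30 | 4 | 60 | -8 | 16 |
-1 | 1 | -1 | 1 | 1 | -1 | 1 |
0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 3 | 3 | 1 | 3 | 1 | 1 |
2 | 19 | 38 | 4 | 76 | 8 | 16 |
$\sum x = 0$ | $\sum y = 39$ | $\sum xy = 10$ | $\sum x^2 = 10$ | $\sum x^2y = 140$ | $\sum x^3 = 0$ | $\sum x^4 = 34$ |
Normal equations are
$\sum y = na + b\sum x + c\sum x^2$ (2)
$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$ (3)
$\sum x^2y = a\sum x^2 + b\sum x^3 + c\sum x^4$ (4)
On putting the values of $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, $\sum x^2y$, $\sum x^3$ and $\sum x^4$ in (2), (3) and (4) we have
$39 = 5a + 10c$ (5)
$10 = 10b$ (6)
$140 = 10a + 34c$ (7)
On solving (5), (6), (7), we get $a = -\dfrac{37}{35}$, $b = 1$, $c = \dfrac{31}{7}$.
The required polynomial of second degree is $y = -\dfrac{37}{35} + x + \dfrac{31}{7}x^2$.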
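The three normal equations of a parabolic fit can also be formed and solved mechanically. The sketch below is one possible way of doing so in Python, using exact fractions and Gauss-Jordan elimination; the helper names solve3 and fit_parabola are illustrative, and the solver assumes the system is non-singular.

```python
from fractions import Fraction

def solve3(m, rhs):
    """Solve a non-singular 3x3 linear system by Gauss-Jordan elimination."""
    a = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(m, rhs)]
    for col in range(3):
        # Bring a row with a non-zero entry in this column to the pivot position.
        pivot = next(r for r in range(col, 3) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for r in range(3):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [row[3] for row in a]

def fit_parabola(xs, ys):
    """Least-squares y = a + b*x + c*x^2 from the three normal equations."""
    n = len(xs)
    sx = lambda p: sum(x ** p for x in xs)                     # sum of x^p
    sxy = lambda p: sum(x ** p * y for x, y in zip(xs, ys))    # sum of x^p * y
    matrix = [[n, sx(1), sx(2)],
              [sx(1), sx(2), sx(3)],
              [sx(2), sx(3), sx(4)]]
    return solve3(matrix, [sxy(0), sxy(1), sxy(2)])

a, b, c = fit_parabola([-2, -1, 0, 1, 2], [15, 1, 1, 3, 19])
print(a, b, c)
```

Working in fractions avoids rounding, so the coefficients can be compared directly with the hand solution.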
Change of scale
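If the data are given at equal intervals and the numbers are large, then we change the origin and scale by the substitution $u = \dfrac{x - x_0}{h}$, where $x_0$ is a value near the middle of the x values and h is the common interval (and similarly $v = y - y_0$ for the y values).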
Example. Fit a second degree parabola to the following data by least square method:
x | 1929 | 1930 | 1931 | 1932 | 1933 | 1934 | 1935 | 1936 | 1937 |
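y | 352 | 356 | 357 | 358 | 360 | 361 | 361 | 360 | 359 |
Solution. Taking $u = x - 1933$ and $v = y - 357$, the equation $y = a' + b'x + c'x^2$ is transformed to
$v = a + bu + cu^2$ (1)
x | $u = x - 1933$ | y | $v = y - 357$ | uv | $u^2$ | $u^2v$ | $u^3$ | $u^4$ |
1929 | -4 | 352 | -5 | 20 | 16 | -80 | -64 | 256 |
1930 | -3 | 356 | -1 | 3 | 9 | -9 | -27 | 81 |
1931 | -2 | 357 | 0 | 0 | 4 | 0 | -8 | 16 |
1932 | -1 | 358 | 1 | -1 | 1 | 1 | -1 | 1 |
1933 | 0 | 360 | 3 | 0 | 0 | 0 | 0 | 0 |
1934 | 1 | 361 | 4 | 4 | 1 | 4 | 1 | 1 |
1935 | 2 | 361 | 4 | 8 | 4 | 16 | 8 | 16 |
1936 | 3 | 360 | 3 | 9 | 9 | 27 | 27 | 81 |
1937 | 4 | 359 | 2 | 8 | 16 | 32 | 64 | 256 |
Total | $\sum u = 0$ | | $\sum v = 11$ | $\sum uv = 51$ | $\sum u^2 = 60$ | $\sum u^2v = -9$ | $\sum u^3 = 0$ | $\sum u^4 = 708$ |
Normal equations are
$\sum v = 9a + b\sum u + c\sum u^2$, i.e. $11 = 9a + 60c$,
$\sum uv = a\sum u + b\sum u^2 + c\sum u^3$, i.e. $51 = 60b$,
$\sum u^2v = a\sum u^2 + b\sum u^3 + c\sum u^4$, i.e. $-9 = 60a + 708c$.
On solving these equations we get a = 3.00, b = 0.85, c = -0.267 (approximately), so that $v = 3.00 + 0.85u - 0.267u^2$, i.e.
$y = 360.00 + 0.85(x - 1933) - 0.267(x - 1933)^2$.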
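When the shifted values u are symmetric about zero, $\sum u = \sum u^3 = 0$ and the normal equations decouple: b comes from one equation and a, c from a 2x2 system. The following Python sketch illustrates this shortcut for the table above; the function name fit_parabola_centered is illustrative, and the symmetry of u is an assumption that the code checks explicitly.

```python
def fit_parabola_centered(us, vs):
    """Fit v = a + b*u + c*u^2 when sum(u) = sum(u^3) = 0 (u symmetric about 0)."""
    if sum(us) != 0 or sum(u ** 3 for u in us) != 0:
        raise ValueError("u values must be symmetric about zero")
    n = len(us)
    s2 = sum(u ** 2 for u in us)
    s4 = sum(u ** 4 for u in us)
    sv = sum(vs)
    suv = sum(u * v for u, v in zip(us, vs))
    su2v = sum(u * u * v for u, v in zip(us, vs))
    b = suv / s2                      # from  sum(uv) = b * sum(u^2)
    det = n * s4 - s2 * s2            # 2x2 system for a and c
    a = (sv * s4 - s2 * su2v) / det
    c = (n * su2v - s2 * sv) / det
    return a, b, c

years = list(range(1929, 1938))
y = [352, 356, 357, 358, 360, 361, 361, 360, 359]
u = [x - 1933 for x in years]         # change of origin for x
v = [val - 357 for val in y]          # change of origin for y
a, b, c = fit_parabola_centered(u, v)
print(round(a, 3), round(b, 3), round(c, 3))
```

The change of origin keeps the sums small, which is the point of the substitution when the work is done by hand.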
Example. Fit a second degree parabola to the following data.
x | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 |
y | 1.1 | 1.3 | 1.6 | 2.0 | 2.7 | 3.4 | 4.1 |
Solution. We shift the origin to (2.5, 0) and take 0.5 as the new unit. This amounts to changing the variable x to X by the relation X = 2x - 5.
Let the parabola of fit be $y = a + bX + cX^2$. The values of $\sum X$ etc. are calculated as below:
x | X | y | Xy | $X^2$ | $X^2y$ | $X^3$ | $X^4$ |
1.0 | -3 | 1.1 | -3.3 | 9 | 9.9 | -27 | 81 |
1.5 | -2 | 1.3 | -2.6 | 4 | 5.2 | -8 | 16 |
2.0 | -1 | 1.6 | -1.6 | 1 | 1.6 | -1 | 1 |
2.5 | 0 | 2.0 | 0.0 | 0 | 0.0 | 0 | 0 |
3.0 | 1 | 2.7 | 2.7 | 1 | 2.7 | 1 | 1 |
3.5 | 2 | 3.4 | 6.8 | 4 | 13.6 | 8 | 16 |
4.0 | 3 | 4.1 | 12.3 | 9 | 36.9 | 27 | 81 |
Total | 0 | 16.2 | 14.3 | 28 | 69.9 | 0 | 196 |
The normal equations are
7a + 28c = 16.2; 28b = 14.3; 28a + 196c = 69.9.
Solving these as simultaneous equations we get a = 2.071, b = 0.511, c = 0.0607, so that
$y = 2.071 + 0.511X + 0.0607X^2$.
Replacing X by 2x - 5 in the above equation we get
$y = 2.071 + 0.511(2x - 5) + 0.0607(2x - 5)^2$,
which simplifies to $y = 1.04 - 0.193x + 0.243x^2$. This is the required parabola of best fit.
Comparison of large samples
Two large samples of sizes $n_1$ and $n_2$ are taken from two populations, giving proportions of the attribute A as $p_1$ and $p_2$ respectively.
(a) On the hypothesis that the populations are similar as regards the attribute A, we combine the two samples to find an estimate of the common value of the proportion of A's in the populations, which is given by
$p = \dfrac{n_1p_1 + n_2p_2}{n_1 + n_2}$, with $q = 1 - p$.
If $e_1, e_2$ be the standard errors in the two samples, then $e_1^2 = \dfrac{pq}{n_1}$ and $e_2^2 = \dfrac{pq}{n_2}$.
If e be the standard error of the difference between $p_1$ and $p_2$, then $e^2 = e_1^2 + e_2^2 = pq\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)$ and $z = \dfrac{|p_1 - p_2|}{e}$.
If z > 3, the difference between $p_1$ and $p_2$ is a real one.
If z < 2, the difference may be due to fluctuations of simple sampling.
But if z lies between 2 and 3, then the difference is significant at the 5% level of significance.
(b) If the proportions of A's are not the same in the two populations from which the samples are drawn, but $p_1$ and $p_2$ are the true values of the proportions, then the S.E. e of the difference $p_1 - p_2$ is given by
$e^2 = \dfrac{p_1q_1}{n_1} + \dfrac{p_2q_2}{n_2}$.
If $|p_1 - p_2| < 3e$, the difference could have arisen due to fluctuations of simple sampling.
Example. In two large populations there are 30% and 25% respectively of fair haired people. Is this difference likely to be hidden in samples of 1200 and 900 respectively from the two populations?
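Solution. Here $p_1 = 0.30$, $q_1 = 0.70$, $n_1 = 1200$ and $p_2 = 0.25$, $q_2 = 0.75$, $n_2 = 900$, so that
$e^2 = \dfrac{p_1q_1}{n_1} + \dfrac{p_2q_2}{n_2} = \dfrac{0.30\times 0.70}{1200} + \dfrac{0.25\times 0.75}{900} = 0.000383$, i.e. e = 0.0196.
So that $p_1 - p_2 = 0.05$ is about 2.55 times its standard error e, i.e. more than 2e.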
Hence it is unlikely that the real difference will be hidden.
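Example. One type of aircraft is found to develop engine trouble in 5 flights out of a total of 100, and another type in 7 flights out of a total of 200 flights. Is there a significant difference between the two types of aircraft so far as engine defects are concerned?
Solution. For the first type, $n_1 = 100$ flights, number of troubled flights = 5, so $p_1 = 5/100 = 0.05$ and $q_1 = 0.95$.
For the second type, $n_2 = 200$ flights, number of troubled flights = 7, so $p_2 = 7/200 = 0.035$ and $q_2 = 0.965$.
$e^2 = \dfrac{p_1q_1}{n_1} + \dfrac{p_2q_2}{n_2} = \dfrac{0.05\times 0.95}{100} + \dfrac{0.035\times 0.965}{200} = 0.000644$, so that e = 0.0254 and
$z = \dfrac{0.05 - 0.035}{0.0254} = 0.59$.
Since z < 1, the difference is not significant.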
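Example. In a sample of 600 men from a certain city, 450 are found to be smokers. In another sample of 900 men from another city, 450 are smokers. Do the data indicate that the cities are significantly different with respect to the habit of smoking among men?
Solution. For the first city, $n_1 = 600$ men, number of smokers = 450, so $p_1 = 450/600 = 0.75$.
For the second city, $n_2 = 900$ men, number of smokers = 450, so $p_2 = 450/900 = 0.5$.
On the hypothesis that the cities are similar, $p = \dfrac{450 + 450}{600 + 900} = 0.6$, $q = 0.4$, and
$e^2 = pq\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right) = 0.6\times 0.4\left(\dfrac{1}{600} + \dfrac{1}{900}\right) = 0.000667$, so that e = 0.0258 and $z = \dfrac{0.75 - 0.5}{0.0258} = 9.7$.
Since z > 3, the difference is significant.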
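The two standard-error formulas used in the examples above can be compared with a short Python sketch. The function names z_pooled and z_unpooled are illustrative; the first follows case (a) (common proportion p), the second follows case (b) (separate proportions).

```python
from math import sqrt

def z_pooled(x1, n1, x2, n2):
    """z for the difference of proportions using the combined estimate p (case a)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    e = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return abs(p1 - p2) / e

def z_unpooled(x1, n1, x2, n2):
    """z using e^2 = p1*q1/n1 + p2*q2/n2 (case b)."""
    p1, p2 = x1 / n1, x2 / n2
    e = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return abs(p1 - p2) / e

z_smokers = z_pooled(450, 600, 450, 900)    # two cities
z_aircraft = z_unpooled(5, 100, 7, 200)     # two types of aircraft
print(round(z_smokers, 2), round(z_aircraft, 2))
```

For samples as large as these the two formulas lead to the same conclusion; they differ noticeably only when the two proportions are far apart and the sample sizes are very unequal.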
Significance test of a sample mean
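Given a random small sample $x_1, x_2, \dots, x_n$ from a normal population, we have to test the hypothesis that the mean of the population is μ. For this we first calculate
$t = \dfrac{(\bar{x} - \mu)\sqrt{n}}{s}$, where $\bar{x} = \dfrac{1}{n}\sum x_i$ and $s^2 = \dfrac{1}{n-1}\sum (x_i - \bar{x})^2$.
Equivalently, $t = \dfrac{(\bar{x} - \mu)\sqrt{n-1}}{\sigma_s}$ when the sample standard deviation $\sigma_s$ is computed with divisor n.
Then find the value of P for the given df ($\nu = n - 1$) from the table.
If the calculated value of $|t| > t_{0.05}$, the difference between $\bar{x}$ and μ is said to be significant at the 5% level of significance.
If $|t| > t_{0.01}$, the difference is said to be significant at the 1% level of significance.
If $|t| < t_{0.05}$, the data are said to be consistent with the hypothesis that μ is the mean of the population.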
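Example. A certain stimulus administered to each of 12 patients resulted in the following increases of blood pressure: 5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6. Can it be concluded that the stimulus will, in general, be accompanied by an increase in blood pressure?
Solution. Let us assume (as the null hypothesis) that the stimulus is not accompanied by any increase in blood pressure, i.e. that the observed increases come from a normal population with mean μ = 0 and S.D. σ.
Here n = 12, $\bar{x} = \dfrac{31}{12} = 2.583$, $\sum (x - \bar{x})^2 = 104.92$, $s = \sqrt{104.92/11} = 3.088$, so that
$t = \dfrac{(\bar{x} - \mu)\sqrt{n}}{s} = \dfrac{2.583\times\sqrt{12}}{3.088} = 2.90$.
For ν = 11, $t_{0.05} = 2.20$ from table IV.
Since $|t| > t_{0.05}$, our assumption is rejected, i.e. the stimulus is, in general, accompanied by an increase in blood pressure.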
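The value of t for a small sample can be checked with the Python standard library. This is a minimal sketch, assuming the form $t = (\bar{x} - \mu)\sqrt{n}/s$ with s computed using divisor n - 1 (which is what statistics.stdev returns); the function name one_sample_t is illustrative, and the critical value must still be read from the t table.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu):
    """Return (t, degrees of freedom) for testing the population mean mu."""
    n = len(sample)
    # stdev() uses the divisor n - 1.
    t = (mean(sample) - mu) * sqrt(n) / stdev(sample)
    return t, n - 1

increases = [5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6]
t, dof = one_sample_t(increases, 0)
print(round(t, 3), dof)
```

The computed t is then compared with the tabulated $t_{0.05}$ for the same number of degrees of freedom.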
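Example. The 9 items of a sample have the following values: 45, 47, 50, 52, 48, 47, 49, 53, 51. Does the mean of these differ significantly from the assumed mean of 47.5?
Solution. We find the mean and the standard deviation of the sample as follows
x | $d = x - 48$ | $d^2$ |
45 | -3 | 9 |
47 | -1 | 1 |
50 | 2 | 4 |
52 | 4 | 16 |
48 | 0 | 0 |
47 | -1 | 1 |
49 | 1 | 1 |
53 | 5 | 25 |
51 | 3 | 9 |
Total | 10 | 66 |
Hence, $\bar{x} = 48 + \dfrac{\sum d}{n} = 48 + \dfrac{10}{9} = 49.11$ and
$\sigma_s^2 = \dfrac{\sum d^2}{n} - \left(\dfrac{\sum d}{n}\right)^2 = \dfrac{66}{9} - \left(\dfrac{10}{9}\right)^2 = 6.10$, so that $\sigma_s = 2.47$.
Here, $t = \dfrac{(\bar{x} - \mu)\sqrt{n-1}}{\sigma_s} = \dfrac{(49.11 - 47.5)\sqrt{8}}{2.47} = 1.85$.
For ν = 8, we get from table IV $t_{0.05} = 2.31$.
As the calculated value of $t < t_{0.05}$, the value of t is not significant at the 5% level of significance, which implies that there is no significant difference between $\bar{x}$ and μ. Thus the test provides no evidence against the population mean being 47.5.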
Example. A machinist is making engine parts with an axle diameter of 0.7 inch. A random sample of 10 parts shows a mean diameter of 0.742 inch with a standard deviation of 0.04 inch. On the basis of this sample, would you say that the work is inferior?
Solution. Here we have μ = 0.700, $\bar{x} = 0.742$, $\sigma_s = 0.040$, n = 10.
Taking the hypothesis that the product is not inferior, i.e. there is no significant difference between $\bar{x}$ and μ,
$t = \dfrac{(\bar{x} - \mu)\sqrt{n-1}}{\sigma_s} = \dfrac{0.042\times 3}{0.040} = 3.15$.
Degrees of freedom = 10 - 1 = 9.
For ν = 9 we get from table IV $t_{0.05} = 2.262$.
As the calculated value of $t > t_{0.05}$, the value of t is significant at the 5% level of significance. This implies that $\bar{x}$ differs significantly from μ and the hypothesis is rejected. Hence the work is inferior. In fact the work is inferior even at the 2% level of significance, since $t_{0.02} = 2.82$ for ν = 9.
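Significance test of difference between sample means
Given two independent samples $x_1, x_2, \dots, x_{n_1}$ and $y_1, y_2, \dots, y_{n_2}$ with means $\bar{x}$ and $\bar{y}$ and standard deviations $\sigma_x$ and $\sigma_y$, drawn from normal populations with the same variance, we have to test the hypothesis that the population means $\mu_1$ and $\mu_2$ are the same.
For this, we calculate
$t = \dfrac{\bar{x} - \bar{y}}{\sigma_s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$ (1)
where $\sigma_s^2 = \dfrac{\sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2}{n_1 + n_2 - 2}$.
It can be shown that the variate t defined by (1) follows the t distribution with $n_1 + n_2 - 2$ degrees of freedom.
If the calculated value of $|t| > t_{0.05}$, the difference between the sample means is said to be significant at the 5% level of significance.
If $|t| > t_{0.01}$, the difference is said to be significant at the 1% level of significance.
If $|t| < t_{0.05}$, the data are said to be consistent with the hypothesis that $\mu_1 = \mu_2$.
Cor. If the two samples are of the same size n and the data are paired, then t is defined by
$t = \dfrac{\bar{d}\sqrt{n}}{\sigma_s}$, where $d_i = x_i - y_i$, $\bar{d}$ is the mean of the $d_i$ and $\sigma_s^2 = \dfrac{1}{n-1}\sum (d_i - \bar{d})^2$, with n - 1 degrees of freedom.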
Example. For a random sample of 10 pigs fed on diet A, the increases in weight in a certain period were 10, 6, 16, 17, 13, 12, 8, 14, 15, 9 lbs. For another random sample of 12 pigs fed on diet B, the increases in the same period were 7, 13, 22, 15, 12, 14, 18, 8, 21, 23, 10, 17 lbs. Test whether diets A and B differ significantly as regards their effects on increase in weight.
Solution. We calculate the means and standard deviations of the samples as follows
Diet A | | | Diet B | | |
x | $x - \bar{x}$ | $(x - \bar{x})^2$ | y | $y - \bar{y}$ | $(y - \bar{y})^2$ |
10 | -2 | 4 | 7 | -8 | 64 |
6 | -6 | 36 | 13 | -2 | 4 |
16 | 4 | 16 | 22 | 7 | 49 |
17 | 5 | 25 | 15 | 0 | 0 |
13 | 1 | 1 | 12 | -3 | 9 |
12 | 0 | 0 | 14 | -1 | 1 |
8 | -4 | 16 | 18 | 3 | 9 |
14 | 2 | 4 | 8 | -7 | 49 |
15 | 3 | 9 | 21 | 6 | 36 |
9 | -3 | 9 | 23 | 8 | 64 |
 | | | 10 | -5 | 25 |
 | | | 17 | 2 | 4 |
120 | 0 | 120 | 180 | 0 | 314 |
Here $\bar{x} = 120/10 = 12$ and $\bar{y} = 180/12 = 15$.
Assuming that the samples do not differ in weight so far as the two diets are concerned, i.e. $\mu_1 = \mu_2$,
$\sigma_s^2 = \dfrac{\sum (x - \bar{x})^2 + \sum (y - \bar{y})^2}{n_1 + n_2 - 2} = \dfrac{120 + 314}{20} = 21.7$.
Hence, $\sigma_s = 4.66$.
Here, $t = \dfrac{\bar{x} - \bar{y}}{\sigma_s\sqrt{1/n_1 + 1/n_2}} = \dfrac{12 - 15}{4.66\sqrt{1/10 + 1/12}} = -1.50$.
For ν = 20, $t_{0.05} = 2.09$.
The calculated value of $|t| = 1.50 < t_{0.05}$.
Hence the difference between the sample means is not significant, i.e. the two diets do not differ significantly as regards their effects on increase in weight.
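The pooled two-sample calculation can be reproduced with the following Python sketch. It assumes, as in the text, that both samples come from normal populations with a common variance; the function name pooled_t is illustrative, and the critical value is still taken from the t table.

```python
from math import sqrt

def pooled_t(xs, ys):
    """Return (t, degrees of freedom) for two independent samples with a common variance."""
    n1, n2 = len(xs), len(ys)
    m1, m2 = sum(xs) / n1, sum(ys) / n2
    # Pooled sum of squared deviations about each sample mean.
    ss = sum((x - m1) ** 2 for x in xs) + sum((y - m2) ** 2 for y in ys)
    dof = n1 + n2 - 2
    s = sqrt(ss / dof)
    return (m1 - m2) / (s * sqrt(1 / n1 + 1 / n2)), dof

diet_a = [10, 6, 16, 17, 13, 12, 8, 14, 15, 9]
diet_b = [7, 13, 22, 15, 12, 14, 18, 8, 21, 23, 10, 17]
t, dof = pooled_t(diet_a, diet_b)
print(round(t, 3), dof)
```

The sign of t only records which sample mean is larger; the comparison with the table uses $|t|$.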
CHI SQUARE TEST
When a fair coin is tossed 80 times, we expect from theoretical considerations that heads will appear 40 times and tails 40 times. But this never happens in practice, i.e. the results obtained in an experiment do not agree exactly with the theoretical results. The magnitude of the discrepancy between observation and theory is given by the quantity $\chi^2$ (pronounced chi-square). If $\chi^2 = 0$, the observed and theoretical frequencies completely agree. As the value of $\chi^2$ increases, the discrepancy between the observed and theoretical frequencies increases.
(1) Definition. If $f_1, f_2, \dots, f_n$ be a set of observed frequencies and $e_1, e_2, \dots, e_n$ be the corresponding set of expected (theoretical) frequencies, then $\chi^2$ is defined by the relation
$\chi^2 = \sum_{i=1}^{n} \dfrac{(f_i - e_i)^2}{e_i}$.
(2) Chi-square distribution
If $x_1, x_2, \dots, x_n$ be n independent normal variates with mean zero and s.d. unity, then it can be shown that $x_1^2 + x_2^2 + \dots + x_n^2$ is a random variate having the $\chi^2$ distribution with n df.
The equation of the $\chi^2$ curve for ν degrees of freedom is
$y = y_0\,(\chi^2)^{\nu/2 - 1}e^{-\chi^2/2}$, where $y_0 = \dfrac{1}{2^{\nu/2}\,\Gamma(\nu/2)}$.
(3) Properties of the $\chi^2$ distribution
The values of $\chi^2$ have been tabulated for various values of P and for values of ν from 1 to 30 (Table V, Appendix 2).
For ν > 30, the $\chi^2$ curve approximates to the normal curve and we should refer to normal distribution tables for significant values of $\chi^2$.
IV. Since the equation of the $\chi^2$ curve does not involve any parameters of the population, this distribution does not depend on the form of the population.
V. Mean = ν and variance = 2ν.
Goodness of fit
The value of $\chi^2$ is used to test whether the deviations of the observed frequencies from the expected frequencies are significant or not. It is also used to test how well a set of observations fits a given distribution. $\chi^2$ therefore provides a test of goodness of fit and may be used to examine the validity of some hypothesis about an observed frequency distribution. As a test of goodness of fit, it can be used to study the correspondence between theory and fact.
This is a nonparametric, distribution-free test, since in this we make no assumptions about the distribution of the parent population.
Procedure to test significance and goodness of fit
(i) Set up a null hypothesis and calculate $\chi^2 = \sum \dfrac{(f_i - e_i)^2}{e_i}$.
(ii) Find the df and read the corresponding value of $\chi^2$ at a prescribed significance level from table V.
(iii) From the table, we can also find the probability P corresponding to the calculated value of $\chi^2$ for the given d.f.
(iv) If P < 0.05, the observed value of $\chi^2$ is significant at the 5% level of significance.
If P < 0.01, the value is significant at the 1% level.
If P > 0.05, it is a good fit and the value is not significant.
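Example. A set of five similar coins is tossed 320 times and the result is as follows. Test the hypothesis that the data follow a binomial distribution.
Number of heads | 0 | 1 | 2 | 3 | 4 | 5 |
Frequency | 6 | 27 | 72 | 112 | 71 | 32 |
Solution. For ν = 5, we have $\chi^2_{0.05} = 11.07$.
p, the probability of getting a head = 1/2; q, the probability of getting a tail = 1/2.
Hence the theoretical frequencies of getting 0, 1, 2, 3, 4, 5 heads are the successive terms of the binomial expansion $320\left(q + p\right)^5 = 320\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^5$.
Thus the theoretical frequencies are 10, 50, 100, 100, 50, 10.
Hence,
$\chi^2 = \dfrac{(6-10)^2}{10} + \dfrac{(27-50)^2}{50} + \dfrac{(72-100)^2}{100} + \dfrac{(112-100)^2}{100} + \dfrac{(71-50)^2}{50} + \dfrac{(32-10)^2}{10} = 78.68$.
Since the calculated value of $\chi^2$ is much greater than $\chi^2_{0.05}$, the hypothesis that the data follow the binomial law is rejected.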
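The statistic for the coin data can be verified with a few lines of Python. This is a minimal sketch; chi_square is an illustrative helper implementing $\sum (f_i - e_i)^2/e_i$, and the expected frequencies are generated from the binomial coefficients (math.comb requires Python 3.8 or later).

```python
from math import comb

def chi_square(observed, expected):
    """Sum of (f - e)^2 / e over all classes."""
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

observed = [6, 27, 72, 112, 71, 32]
tosses, coins = 320, 5

# Terms of 320 * (1/2 + 1/2)^5: frequencies expected for 0..5 heads.
expected = [tosses * comb(coins, k) / 2 ** coins for k in range(coins + 1)]
chi2 = chi_square(observed, expected)
print(expected, round(chi2, 2))
```

The last class alone contributes more than half of the total, which shows where the data depart most from the binomial law.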
Example. Fit a Poisson distribution to the following data and test for its goodness of fit at level of significance 0.05.
x | 0 | 1 | 2 | 3 | 4 |
f | 419 | 352 | 154 | 56 | 19 |
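Solution. Mean $m = \dfrac{\sum fx}{\sum f} = \dfrac{0 + 352 + 308 + 168 + 76}{1000} = 0.904$.
Hence, the theoretical frequencies $1000\,e^{-m}\dfrac{m^x}{x!}$ are
x | 0 | 1 | 2 | 3 | 4 | Total |
f | 404.9 (406.2) | 366 | 165.4 | 49.8 | 11.3 (12.6) | 997.4 |
In order that the total observed and expected frequencies may agree, the first and last theoretical frequencies are taken as 406.2 and 12.6 (the values shown in brackets).
Hence,
$\chi^2 = \dfrac{(419-406.2)^2}{406.2} + \dfrac{(352-366)^2}{366} + \dfrac{(154-165.4)^2}{165.4} + \dfrac{(56-49.8)^2}{49.8} + \dfrac{(19-12.6)^2}{12.6} = 5.75$.
Since the mean of the theoretical distribution has been estimated from the given data and the totals have been made to agree, there are two constraints, so that the number of degrees of freedom is ν = 5 - 2 = 3.
For ν = 3, we have $\chi^2_{0.05} = 7.82$.
Since the calculated value of $\chi^2 < \chi^2_{0.05}$, the agreement between fact and theory is good and hence the Poisson distribution can be fitted to the data.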
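The Poisson fit can be sketched in Python as follows. The code estimates m from the data, forms the theoretical frequencies and, as in the hand calculation, adds half of the shortfall to the first and last classes so that the totals agree. It works with unrounded frequencies, so the statistic differs slightly in the second decimal place from the hand value; the variable names are illustrative.

```python
from math import exp, factorial

def chi_square(observed, expected):
    """Sum of (f - e)^2 / e over all classes."""
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

observed = [419, 352, 154, 56, 19]            # frequencies for x = 0, 1, 2, 3, 4
total = sum(observed)
m = sum(x * f for x, f in enumerate(observed)) / total   # sample mean

# Theoretical Poisson frequencies  N * e^(-m) * m^x / x!
expected = [total * exp(-m) * m ** x / factorial(x) for x in range(len(observed))]

# Make the totals agree by sharing the shortfall between the end classes.
shortfall = total - sum(expected)
expected[0] += shortfall / 2
expected[-1] += shortfall / 2

chi2 = chi_square(observed, expected)
dof = len(observed) - 2                        # two constraints: mean and total
print(round(m, 3), [round(e, 1) for e in expected], round(chi2, 2), dof)
```

A common alternative, not used in the text above, is to treat the last class as "4 or more" and give it the whole remainder; the conclusion at the 5% level is the same for these data.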
Example. In experiments of pea breeding, the following frequencies of seeds were obtained
Round and yellow | Wrinkled and yellow | Round and green | Wrinkled and green | Total |
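315 | 101 | 108 | 32 | 556 |
Theory predicts that the frequencies should be in the proportions 9 : 3 : 3 : 1. Examine the correspondence between theory and experiment.
Solution. The corresponding (expected) frequencies are $\tfrac{9}{16}\times 556$, $\tfrac{3}{16}\times 556$, $\tfrac{3}{16}\times 556$, $\tfrac{1}{16}\times 556$, i.e. 312.75, 104.25, 104.25, 34.75.
Hence,
$\chi^2 = \dfrac{(315-312.75)^2}{312.75} + \dfrac{(101-104.25)^2}{104.25} + \dfrac{(108-104.25)^2}{104.25} + \dfrac{(32-34.75)^2}{34.75} = 0.47$.
For ν = 3, we have $\chi^2_{0.05} = 7.815$.
Since the calculated value of $\chi^2$ is much less than $\chi^2_{0.05}$, there is a very high degree of agreement between theory and experiment.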
References:
1. Erwin Kreyszig, Advanced Engineering Mathematics, 9th Edition, John Wiley & Sons, 2006.
2. N. P. Bali and Manish Goyal, A Text Book of Engineering Mathematics, Laxmi Publications.
3. P. G. Hoel, S. C. Port and C. J. Stone, Introduction to Probability Theory, Universal Book Stall.
4. S. Ross, A First Course in Probability, 6th Edition, Pearson Education India, 2002.