The mean is the average value: it is obtained by summing all the values in the dataset of interest and dividing by the number of values in the dataset. The mean is a measure of the centre of the data; for a symmetric distribution it coincides with the central value (the median), which divides the distribution into two equal halves.
Suppose you have the following dataset: Dataset = \(\{1, 2, 3, 4, 5\}\).
The mean for this dataset is
Mean, \(\bar{x} = \frac{1}{N}\sum_{i=1}^{N}x_{i} = \frac{1 + 2 + 3 + 4 + 5}{N} = \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3\)
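As a quick check, the calculation above can be reproduced in a few lines of Python. This is only a minimal sketch; the variable names `dataset` and `mean` are illustrative choices, not part of any fixed convention.

```python
from statistics import fmean

dataset = [1, 2, 3, 4, 5]

# Sum all the values and divide by how many there are.
mean = sum(dataset) / len(dataset)

# The standard library computes the same quantity.
assert mean == fmean(dataset)
print(mean)  # 3.0
```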
Standard Deviation
Standard deviation is simply an average distance to the mean. From our Dataset above, we can first think of individual 'deviations', i.e., how far each data value is from the mean (from \(\bar{x} = 3\)):
- 1 is two units from the mean (in other words, to get to the mean you have to walk two steps from 1);
- 2 is one unit from the mean;
- 3 is zero units from the mean;
- 4 is one unit from the mean; and
- 5 is two units from the mean.
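The distances listed above can be computed directly. The sketch below assumes the same Dataset and uses the absolute value so that only the size of each step counts, not its direction; the name `distances` is illustrative.

```python
dataset = [1, 2, 3, 4, 5]
mean = sum(dataset) / len(dataset)

# Distance of each value from the mean, ignoring direction.
distances = [abs(x - mean) for x in dataset]
print(distances)  # [2.0, 1.0, 0.0, 1.0, 2.0]
```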
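According to our definition and the calculation above, a first attempt would simply average the signed deviations:
standard deviation, \(\sigma = \frac{1}{N}\sum_{i=1}^{N}(x_{i} - \bar{x}) = \frac{(1 - 3) + (2 - 3) + (3 - 3) + (4 - 3) + (5 - 3)}{5} = \frac{-2 - 1 + 0 + 1 + 2}{5} = 0\)
Hey! There is a problem with the formula above: it tells us that the standard deviation for the Dataset is \(0\), because the negative and positive deviations cancel out.
Remember our definition: standard deviation is simply a distance to the mean, so we must find a way to avoid negative distances.
To do so, we square the distances, sum them up, divide by \(N\), and take the square root of the result:
standard deviation, \(\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i} - \bar{x})^{2}} = \sqrt{\frac{4 + 1 + 0 + 1 + 4}{5}} = \sqrt{2} \approx 1.41\)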
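The two steps above (the naive average that fails, then the squared version) can be sketched as follows. The sketch assumes the sum of squares is divided by \(N\) (the population form) and that a square root is taken at the end to return to the original units; the names `signed`, `naive` and `std` are illustrative.

```python
import math
from statistics import pstdev

dataset = [1, 2, 3, 4, 5]
n = len(dataset)
mean = sum(dataset) / n

# Signed deviations cancel out, so their average is zero.
signed = [x - mean for x in dataset]
naive = sum(signed) / n

# Squaring removes the signs; the square root undoes the squaring of units.
std = math.sqrt(sum(d ** 2 for d in signed) / n)

# The standard library's population standard deviation agrees.
assert math.isclose(std, pstdev(dataset))
print(naive)          # 0.0
print(round(std, 3))  # 1.414
```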
So, what does the standard deviation tell us?
Standard deviation is a measure of dispersion: it tells us how far, on average, each data point is from the mean. The further the data spreads, the greater the standard deviation!
If you give a test to your students, compute the standard deviation of their scores, and find that it is small, that means most students scored close to the average, with few scoring much higher or lower.
If the standard deviation is large, that means the students' scores are widely dispersed (spread out) from the mean, with individuals achieving very low or very high scores on the test (scores are far away from the mean)!
Standard deviation is very useful, especially in comparing performances.
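To make the comparison concrete, here is a small sketch with two made-up classes. The scores are hypothetical and chosen so that both classes have the same average; only the spread differs.

```python
from statistics import fmean, pstdev

# Hypothetical test scores for two classes with the same average (70).
class_a = [68, 70, 71, 69, 72]   # scores bunched around the mean
class_b = [40, 95, 70, 55, 90]   # scores far from the mean

for name, scores in (("A", class_a), ("B", class_b)):
    print(name, fmean(scores), round(pstdev(scores), 2))
```

Both classes report the same mean, but class B has a much larger standard deviation, which is exactly the information the mean alone hides.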
Variance
Variance gives a rough idea of how much, on average, the data varies from the overall average value (mean) of the dataset. If the variance is large, the data values fluctuate a lot; if the variance is very small, most of the data values are very close to the overall average. Variance, like standard deviation, measures the dispersion of the data, that is, how far the data is from the overall average.
Mathematically, variance is the square of the standard deviation \(\sigma\), given by
variance, \(\sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}(x_{i} - \bar{x})^{2}\)
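The relationship between the two quantities can be checked numerically. This sketch uses the population versions from Python's `statistics` module (`pvariance` and `pstdev`), which divide by \(N\); that choice is an assumption made here for illustration.

```python
import math
from statistics import pstdev, pvariance

dataset = [1, 2, 3, 4, 5]

variance = pvariance(dataset)   # mean of the squared deviations
std = pstdev(dataset)           # square root of the variance

# Squaring the standard deviation recovers the variance.
assert math.isclose(std ** 2, variance)
print(variance, std)
```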
Sample Variance
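A sample is a representative dataset taken out of a population. The sample variance is usually given by the formula
sample variance, \(s^{2} = \frac{1}{n - 1}\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}\), where \(n\) is the number of values in the sample and \(\bar{x}\) is the sample mean.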
Population Variance
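If you know each and every data point in your population, you can compute the population variance using the formula
population variance, \(\sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}(x_{i} - \mu)^{2}\), where \(N\) is the size of the population and \(\mu\) is the population mean.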
Note the different conventions used for the sample and population variance: the sample formula divides by \(n - 1\), while the population formula divides by \(N\).
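The difference between the two conventions is easy to see on a small dataset. The sketch below computes both by hand and compares them with `statistics.pvariance` (divides by \(N\)) and `statistics.variance` (divides by \(n - 1\)); treating the same five numbers first as a whole population and then as a sample is purely illustrative.

```python
from statistics import pvariance, variance

dataset = [1, 2, 3, 4, 5]
n = len(dataset)
mean = sum(dataset) / n
squared = sum((x - mean) ** 2 for x in dataset)  # sum of squared deviations

population_var = squared / n        # divide by N
sample_var = squared / (n - 1)      # divide by n - 1

assert population_var == pvariance(dataset)
assert sample_var == variance(dataset)
print(population_var, sample_var)  # 2.0 2.5
```

The sample variance is always the larger of the two, and the gap shrinks as the number of values grows.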
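Performance tests are usually visualized using what is called the 'normal distribution curve'. This curve may be more or less spread out depending on the variability of the data around the mean. The curve can be generated from data that are Gaussian, i.e., a random variable with the Gaussian distribution
\(f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}\), where \(\mu\) is the mean and \(\sigma\) is the standard deviation.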
The simplest case of a normal distribution is known as the standard normal distribution. This is the special case when the mean \(\mu = 0\) and the standard deviation \(\sigma = 1\).
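The total area under the curve is \(1\), i.e., \(\int_{-\infty}^{\infty} f(x)\,dx = 1\), where \(f(x)\) is the height of the curve at \(x\).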
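The area property can be checked numerically. The sketch below uses the Gaussian density (the same expression as the function `f(x, m, s)` in the Python listing in this section) and a simple midpoint rule; the helper names `normal_pdf` and `area`, the integration width of 8 standard deviations, and the step count are all assumptions made for this illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def area(mu, sigma, width=8.0, steps=10_000):
    """Midpoint-rule estimate of the area over mu +/- width * sigma."""
    a = mu - width * sigma
    h = 2 * width * sigma / steps
    return sum(normal_pdf(a + (i + 0.5) * h, mu, sigma) for i in range(steps)) * h

print(round(area(0, 1), 6))    # standard normal
print(round(area(3, 0.5), 6))  # a narrower curve centred at 3
```

Whatever the mean and standard deviation, the estimated area stays at 1 to within the numerical error; only the shape of the curve changes.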
Here is a simple Python script (using the `scitools` package) that shows how the normal distribution curve changes as the standard deviation shrinks from 2 to 0.2:
""" Gaussianity """
from numpy import *
from scitools.std import *
import time
import glob, os

# remove frames left over from a previous run
for filename in glob.glob('tmp*.png'):
    os.remove(filename)

def f(x, m, s):
    """Normal density with mean m and standard deviation s."""
    return (1.0/(sqrt(2*pi)*s))*exp(-0.5*((x-m)/s)**2)

m = 0
s_start = 2
s_stop = 0.2
s_values = linspace(s_start, s_stop, 30)
x = linspace(m - 3*s_start, m + 3*s_start, 1000)
# f is max for x=m; smaller s gives larger max value
max_f = f(m, m, s_stop)

# show the movie on the screen
# and make hardcopies of frames simultaneously:
counter = 0
for s in s_values:
    y = f(x, m, s)
    plot(x, y, axis=[x[0], x[-1], -0.1, max_f],
         xlabel='x', ylabel='f', legend='s=%4.2f' % s,
         hardcopy='tmp%04d.png' % counter)
    counter += 1
    #time.sleep(0.2) # can insert a pause to control movie speed

# make movie file the simplest possible way:
movie('tmp*.png')
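For readers who cannot run the animation, the following sketch prints the key effect it shows: the peak of the curve grows as `s` shrinks. It re-implements `f` with the standard `math` module instead of `numpy`, and the four values in `s_values` are an illustrative subset of the range 2 to 0.2 used above.

```python
import math

def f(x, m, s):
    """Same density as in the listing above, written with the math module."""
    return (1.0 / (math.sqrt(2 * math.pi) * s)) * math.exp(-0.5 * ((x - m) / s) ** 2)

m = 0
s_values = [2.0, 1.0, 0.5, 0.2]   # illustrative subset of the range 2 ... 0.2
peaks = [f(m, m, s) for s in s_values]  # the curve is highest at x = m

for s, peak in zip(s_values, peaks):
    print(f"s={s:4.2f}  peak={peak:.3f}")
```

Because the total area stays fixed, a narrower curve has to be taller, which is why the listing above sizes the plot axis using `f(m, m, s_stop)`.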