Mean, Standard Deviation and Variance


Mean

The mean is the average value: it is obtained by summing all the values in the dataset of interest and dividing by the number of values in the dataset. The mean is a measure of the central value of the data. For a symmetric distribution it coincides with the median, the point that divides the distribution into two equal halves.

Suppose you have the following dataset: Dataset = $\{1, 2, 3, 4, 5\}$.

The mean of this dataset is $3$, calculated as follows:

Mean, $\bar{x} = \frac{1}{N}\sum_{i=1}^{N}x_{i} = \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$
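
As a quick check, the same calculation can be sketched in Python. The variable names `data` and `mean` are illustrative choices, not something taken from the post.

```python
# Mean of the example dataset: sum the values, then divide by how many there are.
data = [1, 2, 3, 4, 5]
mean = sum(data) / len(data)
print(mean)  # 3.0
```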

Standard Deviation

Standard deviation is, roughly speaking, a typical distance of the data values from the mean. From our Dataset above, we can first think of individual 'deviations', i.e., how far each data value is from the mean (from $3$):

1 is two units from the mean (in other words, to get to the mean you have to walk two steps from 1);
2 is one unit from the mean;
3 is zero units from the mean;
4 is one unit from the mean; and
5 is two units from the mean.

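As a first attempt, we can simply average these distances:

Average distance

$= \frac{\text{Sum of the individual deviations}}{\text{Total number of values in the Dataset}}$ $= \frac{2+ 1 + 0 + 1 + 2}{5} = \frac{6}{5} = 1.2$ .

Strictly speaking, this quantity is called the mean absolute deviation. The standard deviation proper is a closely related measure that uses squared distances.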

If we try to write this as a formula using the signed differences $x_{i} - \bar{x}$, we get

$\frac{\sum_{i=1}^{N}(x_{i} - \bar{x})}{N} = \frac{(1-3) + (2-3) + (3-3) + (4-3) + (5-3)}{5}$ .

Hey! There is a problem: the formula above gives $0$ for the Dataset, because the negative and positive differences cancel each other out!
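
The cancellation can be checked directly. The following is a minimal sketch (all variable names are illustrative) that computes both the average of the signed differences and the average of the distances, i.e. the absolute differences, for the Dataset.

```python
data = [1, 2, 3, 4, 5]
mean = sum(data) / len(data)

# Signed differences from the mean: negatives and positives cancel.
signed = [x - mean for x in data]
signed_avg = sum(signed) / len(data)

# Distances from the mean (absolute values): nothing cancels.
distances = [abs(d) for d in signed]
mean_abs_dev = sum(distances) / len(data)

print(signed)        # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(signed_avg)    # 0.0
print(mean_abs_dev)  # 1.2
```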

Remember that, by our definition, standard deviation measures distance from the mean, so we must find a way to avoid negative values.

To do so, we square the differences, sum them up, divide by $N$, and then take the square root of the answer. Thus

$\sigma = \sqrt{\dfrac{\sum_{i=1}^{N}(x_i -\bar{x})^{2}}{N}}$ .
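
For the Dataset, the squared differences are $4, 1, 0, 1, 4$, so $\sigma = \sqrt{10/5} = \sqrt{2} \approx 1.41$. A minimal sketch of this calculation follows; the variable names are illustrative, and `statistics.pstdev` from the Python standard library is used only as a cross-check.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]
mean = sum(data) / len(data)

# Square each difference, average the squares, then take the square root.
squared = [(x - mean) ** 2 for x in data]
sigma = math.sqrt(sum(squared) / len(data))

print(round(sigma, 4))                    # 1.4142
print(round(statistics.pstdev(data), 4))  # 1.4142, same population formula
```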

So, what does the standard deviation tell us?

Standard deviation is a measure of dispersion: it tells us how far, on average, the data points lie from the mean. The further the data spreads, the greater the standard deviation!

If you give a test to your students and find that the standard deviation of their scores is small, that means most students scored close to the average, with few scoring much higher or lower.

If the standard deviation is large, that means the students' scores are widely dispersed (spread out) around the mean, with some individuals achieving very low or very high scores on the test (scores are far away from the mean)!
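
To make the comparison concrete, here is a small sketch with two made-up classes. The score lists are hypothetical and chosen so that both classes have the same mean but very different spreads.

```python
import statistics

# Hypothetical test scores, invented for illustration only.
class_a = [68, 69, 70, 71, 72]    # scores bunched around the mean
class_b = [40, 55, 70, 85, 100]   # scores widely spread

for name, scores in (("A", class_a), ("B", class_b)):
    mean = statistics.mean(scores)
    sigma = statistics.pstdev(scores)  # population standard deviation
    print("class %s: mean = %.1f, standard deviation = %.2f" % (name, mean, sigma))
```

Both classes have a mean of 70, but the standard deviation of class B is many times larger, which is exactly the difference the mean alone cannot show.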

Standard deviation is very useful, especially in comparing performances.

Variance

Variance gives a rough idea of how much, on average, the data varies from the overall average value (mean) of the dataset. If the variance is large, the data values fluctuate a lot; if the variance is very small, most of the data values are very close to the overall average. Variance, like standard deviation, measures data dispersion, that is, how far the data lies from the overall average.

Mathematically, variance is the square of standard deviation, given by

$\sigma^{2} = \dfrac{\sum_{i=1}^{N}(x_i -\bar{x})^{2}}{N}$ .

Sample variance

A sample is a representative dataset drawn from a population. The sample variance is usually given by the formula

$s^{2}= \dfrac{\sum_{i=1}^{N}(x_i -\bar{x})^{2}}{N-1}$ .

Population Variance

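If you know each and every data point in your population, you can compute the population variance using the formula

$\sigma^{2}= \dfrac{\sum_{i=1}^{N}(x_i -\mu)^{2}}{N}$ .

Note the different conventions used for sample and population variance: the sample variance uses the sample mean $\bar{x}$ and divides by $N-1$, while the population variance uses the population mean $\mu$ and divides by $N$.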
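
The two conventions differ only in the divisor. A minimal sketch, treating the Dataset $\{1, 2, 3, 4, 5\}$ first as a whole population and then as a sample; the variable names are illustrative, and `statistics.pvariance` and `statistics.variance` from the Python standard library serve as cross-checks.

```python
import statistics

data = [1, 2, 3, 4, 5]
n = len(data)
mean = sum(data) / n
sum_sq = sum((x - mean) ** 2 for x in data)  # sum of squared differences

population_variance = sum_sq / n        # divide by N
sample_variance = sum_sq / (n - 1)      # divide by N - 1

print(population_variance)  # 2.0
print(sample_variance)      # 2.5

# The standard library implements the same two conventions.
assert population_variance == statistics.pvariance(data)
assert sample_variance == statistics.variance(data)
```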

Performance tests are often visualized using what is called the 'normal distribution curve'. This curve may be more or less spread out depending on the variability of the data around the mean. The curve can be generated from data that are Gaussian, i.e., from a random variable with the Gaussian distribution

$f(x \mid \mu, \sigma^{2}) = \left(\sqrt{2\pi \sigma^{2}}\right)^{-1}e^{-\dfrac{(x-\mu)^{2}}{2\sigma^{2}}}$ (the probability density of the normal distribution).
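
The density can be evaluated directly from this formula. In the sketch below, `normal_pdf` is a hypothetical helper name; `statistics.NormalDist` (available in the standard library from Python 3.8) is used only to cross-check it. The peak of the curve is at $x = \mu$ and grows as $\sigma$ shrinks.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Height of the peak (at x = mu) for a wide and a narrow curve.
peak_wide = normal_pdf(0.0, 0.0, 2.0)
peak_narrow = normal_pdf(0.0, 0.0, 0.2)
print(round(peak_wide, 3), round(peak_narrow, 3))  # 0.199 1.995

# Cross-check one value against the standard library.
assert abs(normal_pdf(1.0, 0.0, 2.0) - NormalDist(0.0, 2.0).pdf(1.0)) < 1e-12
```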

The simplest case of a normal distribution is known as the standard normal distribution. This is the special case $\mu = 0$, $\sigma = 1$, and it is described by the probability density function:
$\phi(x) = \left(\sqrt{2\pi}\right)^{-1}e^{-\dfrac{x^{2}}{2}}$ .

The total area under the curve $\phi(x)$ is $1$.
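
The unit area can be verified numerically. This is only a sketch: it uses a simple trapezoidal rule on the interval $[-8, 8]$ (an assumed cutoff; the tails beyond it contribute a negligible amount), and the helper names are hypothetical.

```python
import math

def standard_normal_pdf(x):
    """Density of the standard normal distribution (mu = 0, sigma = 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def trapezoid_area(func, a, b, n):
    """Approximate the integral of func over [a, b] using n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * h)
    return total * h

area = trapezoid_area(standard_normal_pdf, -8.0, 8.0, 10000)
print(round(area, 6))  # 1.0
```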

Here is a simple Python script that shows how the normal distribution curve changes as the standard deviation decreases:

Code:

""" Gaussianity """
from numpy import exp, linspace, pi, sqrt
from scitools.std import plot, movie  # plotting and movie tools from SciTools (Easyviz)
import time
import glob, os

# remove frames left over from a previous run
for filename in glob.glob('tmp*.png'):
    os.remove(filename)

def f(x, m, s):
    # normal density with mean m and standard deviation s
    return (1.0/(sqrt(2*pi)*s))*exp(-0.5*((x-m)/s)**2)

m = 0
s_start = 2
s_stop = 0.2
s_values = linspace(s_start, s_stop, 30)
x = linspace(m - 3*s_start, m + 3*s_start, 1000)
# f is max for x=m; smaller s gives larger max value
max_f = f(m, m, s_stop)
# show the movie on the screen
# and make hardcopies of frames simultaneously:
counter = 0
for s in s_values:
    y = f(x, m, s)
    plot(x, y, axis=[x[0], x[-1], -0.1, max_f],
         xlabel='x', ylabel='f', legend='s=%4.2f' % s,
         hardcopy='tmp%04d.png' % counter)
    counter += 1
    #time.sleep(0.2) # can insert a pause to control movie speed
# make movie file the simplest possible way:
movie('tmp*.png')

The code produces an animation in which the curve becomes narrower and taller as s decreases from 2 to 0.2.
