**Probability**

Many occurrences cannot be explained with certainty, either in how or in why they happen. The best we can do with such events is to reason about their probable causes and their likely occurrence, and that is what probability is all about. Probability refers to the measure of the likelihood that an event will take place. It is a number from zero to one: zero indicates impossibility, while one indicates certainty. If the probability is high, the event has a higher chance of taking place. Conversely, the lower the probability, the less likely the event is to happen (Wikipedia, n.d.).

The classic example of probability is tossing a coin. The coin has only two outcomes when it is tossed – it can show either heads or tails. The two outcomes are described as equally probable: the probability of getting heads equals the probability of getting tails. Since no other outcome is possible, the probability of heads (or of tails) is 1/2, sometimes given as a percentage (50%) or as a decimal number (0.5).
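As a quick illustration, here is a minimal Python sketch (not part of the original example) that estimates this probability by simulating a large number of fair coin tosses:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# Simulate fair coin tosses and estimate P(heads) from the
# observed frequency; it should land close to 0.5.
tosses = 100_000
heads = sum(random.random() < 0.5 for _ in range(tosses))
estimate = heads / tosses

print(f"Estimated P(heads) = {estimate:.3f}")
```

By the law of large numbers (discussed below), the estimate approaches 1/2 as the number of tosses grows.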

Probability theory was developed to formalize these concepts. Although there are various interpretations of probability, the theory treats probability rigorously in mathematical terms, expressed as a set of axioms. These axioms define the idea of a probability space, in which probabilities range from 0 to 1.

Probability theory covers key subjects including probability distributions, random variables (continuous and discrete), and stochastic processes.

It may be tough to predict random events perfectly, but there is a lot that can be said about their behavior. The central limit theorem and the law of large numbers are central ideas in probability theory that describe this behavior.

The development of statistics is deeply rooted in probability theory, because the theory underpins many human activities that involve analyzing data quantitatively. Concepts from probability theory are also applied to complex systems, for example in statistical mechanics.

Probability also has everyday applications. Markets and the insurance industry use actuarial science to 'see' the likely changes that will occur in pricing, and such information ends up affecting trading decisions. Governments also use probability when dealing with matters of environmental regulation, financial regulation, and entitlement analysis.

Reliability theory is another key application of probability in everyday life. Manufacturers of consumer products such as electronics and automobiles use the theory to guide product development and to estimate the probability of failure. When that probability is low, the manufacturer can offer a longer product warranty (Wikipedia, n.d.).

**And vs. Or**

When dealing with probability, you will often encounter the concepts of *And* and *Or*. A clear distinction between the two will help you tackle probabilistic problems.

In probability, **And** refers to the fact that the outcome must satisfy both conditions at the same time. **Or** implies that the outcome should satisfy either one condition, *or* the other condition, *or* both conditions at the same time (Shmoop, n.d.).

**Consider this probabilistic problem**

Say John has been given a die to roll. What is the probability that he will get a number that is odd and less than 5?

To solve this problem, you first have to gather all the facts.

We know that a die has a total of 6 faces, but we have to focus on the conditions given, both of which **must** be satisfied because of the use of **And** in the problem. The first condition is that the number must be odd, and the second is that it must be less than 5. Only two faces of a die meet both requirements: 1 and 3.

Therefore, the probability (p) is:

p = 2/6 = 1/3 = 33.3%
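The same count can be checked with a few lines of Python (a sketch, simply enumerating the die faces):

```python
from fractions import Fraction

# Enumerate the faces of a die and keep those satisfying BOTH
# conditions: odd AND less than 5.
faces = range(1, 7)
favourable = [f for f in faces if f % 2 == 1 and f < 5]

p = Fraction(len(favourable), len(faces))
print(favourable, p)  # [1, 3] 1/3
```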

**Consider the second probabilistic problem**

Mary is given a standard deck of cards and asked to draw one card. What is the probability that the card drawn is a face card or a black card?

Similar to the case above, you first need all the facts. There are a total of 12 face cards in a deck of 52. Half of the cards are black (26), but since 6 of those are black face cards we have already counted among the face cards, only 20 additional black cards remain. This gives a total of 12 + 20 = 32 cards that are a face card, or a black card, or both. The probability is thus: 32/52 = 8/13.
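This inclusion–exclusion count can be verified in Python by enumerating a standard 52-card deck (a sketch; the rank and suit labels are just illustrative names):

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
face = {"J", "Q", "K"}
black = {"clubs", "spades"}

# A card counts if it is a face card OR a black card (inclusive or).
deck = list(product(ranks, suits))
hits = [(r, s) for r, s in deck if r in face or s in black]

p = Fraction(len(hits), len(deck))
print(len(hits), p)  # 32 8/13
```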

That is the basic difference between And and Or. To recap: when dealing with “And”, all conditions must be met; when dealing with “Or”, at least one of the conditions must be met.

**Bayes’ Theorem**

Bayes’ Theorem is named after Reverend Thomas Bayes, who was the first to devise an equation that makes it possible to update beliefs, in a publication of 1763 (Wikipedia, n.d.). Pierre-Simon Laplace refined the theorem in 1812, before Sir Harold Jeffreys put the work of the two iconic figures on an axiomatic basis (Wikipedia, n.d.). Jeffreys argued that Bayes’ theorem is to probability what the Pythagorean theorem is to geometry.

When talking about Bayes’ Theorem, some people bring up related terms such as Bayes’ Rule or Bayesian reasoning; these refer to essentially the same idea. Bayes’ Theorem falls under probability and gives you an equation which many students seem to be enthusiastic about.

The equation is represented as shown below:

Pr(A|X) = Pr(X|A) Pr(A) / [Pr(X|A) Pr(A) + Pr(X|not A) Pr(not A)]

Before you get all mixed up with the various symbols in this equation, let us decode it:

**Pr (A|X)** = (conditional probability). Represents the probability of event A occurring given another event X.

**Pr (X|A)** = (conditional probability). In linguistic terms, you could say this is the converse of the symbol given above. It is basically the probability of event X happening given event A.

**Pr (A)** = (marginal probability). The probability of event A occurring.

**Pr (not A)** = The probability of event A not occurring.

**Pr (X|not A)** = The probability of event X happening given that event A does not occur.

But the above equation seems more complex than what you are used to, doesn’t it? Well, it can be simplified as shown below, with the explanations above still holding:

Pr(A|X) = Pr(X|A) Pr(A) / Pr(X)

In the first equation, Pr(X) did not appear on its own; the simplified version of Bayes’ Theorem brings out this value. Pr(X) is a normalizing constant that scales the equation: it is the total probability of X, combining the case where A occurs with the case where it does not.
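A small Python sketch of the expanded form, using hypothetical numbers for a diagnostic-test scenario (the 90% / 1% / 5% figures are made up for illustration):

```python
def bayes(pr_x_given_a, pr_a, pr_x_given_not_a):
    """Pr(A|X) computed via the expanded form of Bayes' Theorem."""
    pr_not_a = 1 - pr_a
    # Pr(X): the normalizing constant, covering both the case where
    # A occurs and the case where it does not.
    pr_x = pr_x_given_a * pr_a + pr_x_given_not_a * pr_not_a
    return pr_x_given_a * pr_a / pr_x

# Hypothetical inputs: Pr(X|A) = 0.9, Pr(A) = 0.01, Pr(X|not A) = 0.05.
posterior = bayes(0.9, 0.01, 0.05)
print(f"Pr(A|X) = {posterior:.3f}")  # Pr(A|X) = 0.154
```

Note how small the posterior is despite the 90% figure: the normalizing constant Pr(X) is dominated by the many false positives from the 99% of cases where A does not occur.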

Bayes’ Theorem has a wide range of applications. Typically, it is used in Bayesian Inference, one of the most important approaches in statistical inference.

**Random Variables**

The symbol X is typically used to represent a random variable.

So what is a random variable? It is a concept found in both probability and statistics, and it refers to a variable whose possible values are determined by a random phenomenon (Yale, n.d.).

To be more specific, a random variable can be viewed as a procedure that assigns a numerical quantity to each physical outcome. Do not be deceived by the name: the procedure itself is not random, nor is it a variable in the usual sense. Rather, the underlying process that feeds the procedure generates random outcomes, which are then mapped to real-number values.

The possible values of a random variable give the likely outcomes of an experiment that has not yet been performed. They may also represent the possible outcomes of a past experiment whose result remains uncertain.

Given that it is a function, a random variable is expected to be measurable. Its outcomes depend on some physical variable that is not fully known. For instance, when a fair coin is tossed, uncertain physics determines whether the final outcome is heads or tails; there is no certainty about which will be observed. (The coin could in principle get caught in a crack, but that possibility is excluded from consideration.)

There are two types of random variables: continuous random variables and discrete random variables. The continuous random variables take an infinite number of possible values (Yale, n.d.). Typical examples are weight, height, time needed to run 1 kilometer, and the amount of salt in food.

Discrete random variables take a countable number of distinct values such as 2, 8, 25, 30, … A discrete random variable is often a count, though not always. They are called ‘discrete’ because they take only distinct, separated values.

Consider a situation where you have a variable X that can take the values 6, 7, 8, or 9, with Pr(X = 7) = 0.3 and Pr(X = 8) = 0.4. Since the two events are mutually exclusive, the probability that X equals 7 or 8 is the sum of the two probabilities.

Pr(X = 7 or X = 8):

= Pr(X = 7) + Pr(X = 8)

= 0.3 + 0.4 = 0.7
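In code, the same addition looks like this (a sketch; the probabilities for 6 and 9 are assumed here so that the distribution sums to 1):

```python
# Hypothetical pmf for X over {6, 7, 8, 9}; the values for 7 and 8
# match the worked example above.
pmf = {6: 0.2, 7: 0.3, 8: 0.4, 9: 0.1}
assert abs(sum(pmf.values()) - 1.0) < 1e-9  # a valid pmf sums to 1

# X = 7 and X = 8 are disjoint events, so their probabilities add.
p_7_or_8 = pmf[7] + pmf[8]
print(p_7_or_8)  # 0.7
```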

**Variance**

In order to perform many statistics and probability operations well, you ought to be well versed in the concept of variance. It is closely tied to standard deviation (discussed in the next section).

From a simplified perspective, variance measures how far given numbers are spread from their average value. It is obtained by squaring the standard deviation.

The symbol for variance is σ^{2} (sometimes σ^{2}_{x}) or Var(X).

The definition of variance covers random variables that are either discrete or continuous. Variance can also be viewed as the covariance of a random variable with itself.

This is to say that: Var(X) = Cov(X, X).

The formal (mathematical) definition for variance is:

Var(X) = E[X^{2}] – E[X]^{2}

where,

Var(X) = Variance of X

E[X^{2}] = mean of the square of X

E[X]^{2} = square of the mean of X

That is, to get the variance of X, you find the mean of the square of X and the square of the mean of X, and then take the difference between the two values, as shown above.
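The identity can be checked numerically; here is a sketch using a small hypothetical distribution:

```python
# Hypothetical pmf; both sides of Var(X) = E[X^2] - E[X]^2 are computed.
pmf = {1: 0.25, 2: 0.5, 3: 0.25}

e_x = sum(x * p for x, p in pmf.items())        # E[X], the mean
e_x2 = sum(x ** 2 * p for x, p in pmf.items())  # E[X^2], mean of the square
variance = e_x2 - e_x ** 2

print(e_x, variance)  # 2.0 0.5
```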

**Key properties**

Variance is non-negative, because squares are positive or zero. Thus, Var(X) is always greater than or equal to zero (Wikipedia, n.d.).

Variance is invariant with respect to changes in a location parameter. That is to say, variance remains unchanged after the addition of a constant to all values of a variable:

Var(X + a) = Var(X).

Scaling all values by a constant scales the variance by the square of that constant:

Var(aX) = a^{2}Var(X)

The variance of a weighted sum or difference of two random variables is calculated as shown below:

Var(aX + bY) = a^{2}Var(X) + b^{2}Var(Y) + 2ab Cov(X,Y)

Var(aX – bY) = a^{2}Var(X) + b^{2}Var(Y) – 2ab Cov(X,Y)

Where Cov stands for covariance.
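These shift and scaling properties can be verified numerically with Python's standard library (the sample data are arbitrary):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # arbitrary sample
var = statistics.pvariance(data)  # population variance

# Adding a constant leaves the variance unchanged;
# scaling every value by 3 multiplies it by 3^2 = 9.
shifted = [x + 10 for x in data]
scaled = [3 * x for x in data]

print(var, statistics.pvariance(shifted), statistics.pvariance(scaled))
# 4.0 4.0 36.0
```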

Consider the following example in which we calculate variance.

Apple Inc. asks an analyst to conduct research; the analyst then writes a report to the board presenting the following probabilities for next year’s iPhone sales.

| Case | Probability (Pr) | Projected sales in millions ($) |
|------|------------------|---------------------------------|
| A    | 0.10             | $16                             |
| B    | 0.30             | $15                             |
| C    | 0.30             | $14                             |
| D    | 0.30             | $13                             |

The expected iPhone sales for next year are:

(0.1)*(16) + (0.3)*(15) + (0.3)*(14) + (0.3)*(13) = $14.2 million

To get the variance, we find the difference between each likely sales outcome and $14.2 million, and then square the values obtained.

| Case | Probability (Pr) | Deviation   | Squared deviation |
|------|------------------|-------------|-------------------|
| A    | 0.10             | $16 – $14.2 | 3.24              |
| B    | 0.30             | $15 – $14.2 | 0.64              |
| C    | 0.30             | $14 – $14.2 | 0.04              |
| D    | 0.30             | $13 – $14.2 | 1.44              |

The final step of the variance calculation is to weight each squared deviation by its probability and sum the results (MathIsFun, n.d.).

| Probability (Pr) | Squared deviation | Weight    |
|------------------|-------------------|-----------|
| 0.10             | 3.24              | 0.10*3.24 |
| 0.30             | 0.64              | 0.30*0.64 |
| 0.30             | 0.04              | 0.30*0.04 |
| 0.30             | 1.44              | 0.30*1.44 |

Hence, variance = 0.96.
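The whole worked example can be reproduced in a few lines of Python (sales figures in millions of dollars, as in the tables above):

```python
# Map each projected sales figure ($M) to its probability.
cases = {16: 0.10, 15: 0.30, 14: 0.30, 13: 0.30}

mean = sum(x * p for x, p in cases.items())                    # expected sales
variance = sum(p * (x - mean) ** 2 for x, p in cases.items())  # weighted squared deviations

print(round(mean, 2), round(variance, 2))  # 14.2 0.96
```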

**Standard Deviation (SD)**

Standard Deviation is a vital concept in statistics that measures the dispersion of a set of data with respect to its mean. It is calculated as the square root of the variance.

Most calculations of standard deviation first determine the variance and then take its square root. Each data point's deviation from the mean is considered: the further the data points lie from the mean, the more spread out the data set is and the higher its standard deviation (Khan Academy, n.d.).
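A minimal sketch of that computation, using Python's standard library on an arbitrary sample:

```python
import math
import statistics

data = [4.0, 8.0, 6.0, 5.0, 3.0, 2.0, 8.0, 9.0, 2.0, 5.0]  # arbitrary sample

variance = statistics.pvariance(data)  # population variance
sd = math.sqrt(variance)               # SD = square root of the variance

# The library's own pstdev agrees with taking the root ourselves.
assert math.isclose(sd, statistics.pstdev(data))
print(round(sd, 6))  # 2.4
```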

Investors use SD as a statistical measure of their financial stakes; it finds application in calculating the annual rate of return of an investment. The standard deviation value gives a clear picture of how volatile the investment has been over time. A higher standard deviation indicates a larger price range, the result of a greater variance between price and mean. For instance, highly volatile stocks have a higher standard deviation than stable blue-chip stocks, whose standard deviation tends to be low.

But SD is not applied only in finance. It has other real-world applications, from industrial settings to experimental design and hypothesis testing. In industry, manufacturers may be required to satisfy certain weight requirements for products coming off a production line. A sample of the products can be weighed to determine the average weight, and the standard deviation then used to set minimum and maximum acceptable values around that average. Values are expected to fall within a certain range; if one falls outside it, corrections may be necessary. Such a statistical approach is typically relied on when testing every item is deemed too expensive (Investopedia, n.d.).

Even meteorology has found standard deviation quite useful. Say you have the daily maximum humidity for two countries, one with more forest cover and one with less. The one with dense forest cover tends to have a higher daily maximum humidity than the one with scant cover. Even if the two countries had the same average maximum humidity, the standard deviation of daily maximum humidity would be greater for the country with more forest cover.

**Coefficient of Determination (R-Squared)**

The Coefficient of Determination is normally symbolized by r^{2}. It refers to the proportion of the variance in the dependent variable that is predictable from an independent variable or a set of independent variables (Wikipedia, n.d.).

Within statistics, it is used with models whose main goal is to test hypotheses or to predict future outcomes from the available information. It measures how effectively the model replicates observed outcomes. Different definitions of R-squared exist; in one of them, R-squared is regarded as a measure of how closely the data fit the regression line.

To calculate R-squared, you divide the explained variation by the total variation. The value of the coefficient of determination ranges from 0 to 100% (MinitabBlog, 2013).

A value of 0% indicates that the model explains none of the variability of the data around its mean. At the other extreme, 100% indicates that the model accounts for all of the variability of the data around its mean.

Generally speaking, the higher the R-squared value, the better the model fits the data.
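A sketch of the calculation for a simple least-squares line on hypothetical points. Here R-squared is computed as 1 minus residual variation over total variation, which for an ordinary least-squares fit with an intercept equals the explained-over-total ratio described above:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]  # hypothetical, roughly linear data

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Ordinary least-squares slope and intercept.
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

preds = [slope * x + intercept for x in xs]
ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, preds))  # residual variation
ss_tot = sum((y - my) ** 2 for y in ys)                  # total variation

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # 0.998
```

The data here lie very close to a straight line, so the model explains nearly all the variability around the mean.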

**Mean Squared Error**

Mean squared error indicates how close a regression line is to a given set of points. To compute it, the distance from each point to the regression line is calculated and then squared; these values are the errors. Squaring is important because it removes any negative values and gives larger differences more weight. The name *mean squared error* is used because you take the average of this set of squared errors. No matter how well refined your data are, they will always contain some error, which enters in different ways, including during data collection and during analysis. The mean squared error gives the researcher an opportunity to quantify that error and gain deeper insight into it.

The formula for mean squared error is: MSE(θ_{1}) = E[(θ_{1} – θ)^{2}], where θ is the unknown parameter and θ_{1} is an estimator. It is important to understand that the mean squared error is not a random variable; being an expectation, it is a fixed number.
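In practice one usually works with the empirical mean squared error of a set of predictions; a sketch with made-up numbers:

```python
# Hypothetical observed values and model predictions.
observed = [3.0, 5.0, 7.5, 10.0]
predicted = [2.8, 5.4, 7.0, 10.2]

# Square each error so negatives vanish and large misses weigh more,
# then take the average.
squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
mse = sum(squared_errors) / len(squared_errors)

print(mse)
```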