In this series we are learning about statistics for data science. In part one we discussed topics like mean, median, population, sample, random variables, discrete random variables, continuous random variables, and more. You can read the previous article here: Statistics for Data Science (Part 1). In this article we will move on to intermediate topics like:
Standard Normal Distribution
Z Score
Probability Density Function
Cumulative Distribution Function
Different Plotting Graphs
Kernel Density Estimation
Skewness of Data
Covariance
Pearson Correlation Coefficient
Spearman Rank Correlation
Before moving to the topics mentioned above, we will look at the central limit theorem:
The expected value of the sample mean is equal to the population mean:
µ(sample means) = µ(population)
If we repeatedly draw samples of size n ≥ 30 (number of data objects in one sample), the sample means follow a Gaussian distribution(µ, σ^2/n). For example, take 100 samples of 30 observations each:
s1 (sample) = x1, x2, …, x30 → x̅1
s2 (sample) = x1, x2, …, x30 → x̅2
s3 (sample) = x1, x2, …, x30 → x̅3
…
s100 (sample) = x1, x2, …, x30 → x̅100
x̅ (mean of all sample means) = (x̅1 + x̅2 + … + x̅100)/100, and the x̅i follow a Gaussian distribution(µ, σ^2/n)
The number of data points in any sample should always be greater than or equal to 30; population mean = µ, population standard deviation = σ. The standard deviation of the sample mean (the standard error) is:
σ(sample mean) = σ(population)/√n
If the population data follows a normal (Gaussian) distribution, then the sample means will also follow a normal distribution, independent of sample size.
If the population is not normally distributed but the sample size is n ≥ 30, then the sample means will approximately follow a normal distribution, whatever the shape of the population distribution.
import numpy
import matplotlib.pyplot as plt

# Sample sizes to compare
num = [1, 10, 50, 100]
means = []
for j in num:
    numpy.random.seed(1)
    # 1000 sample means, each computed from j random integers in [-40, 40)
    x = [numpy.mean(numpy.random.randint(-40, 40, j)) for _i in range(1000)]
    means.append(x)

k = 0
fig, ax = plt.subplots(2, 2, figsize=(8, 8))
for i in range(0, 2):
    for j in range(0, 2):
        # Histogram of the 1000 sample means for each sample size
        ax[i, j].hist(means[k], 10, density=True)
        ax[i, j].set_title(label=num[k])
        k = k + 1
plt.show()
Each histogram above shows the probability density of the sample means: the y-axis gives the relative likelihood of observing a given sample mean. As the sample size grows, the distribution of the sample means approaches a bell-shaped curve, just as the central limit theorem predicts.
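The standard-error relationship σ(sample mean) = σ/√n can also be checked numerically. A minimal sketch, where the population, seed, and number of samples are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility
# Non-normal population: uniform integers, matching the demo above
population = rng.integers(-40, 40, size=100_000)
sigma = population.std()

n = 30  # sample size
# Draw 5,000 samples of size n and take each sample's mean
sample_means = rng.choice(population, size=(5000, n)).mean(axis=1)

observed = sample_means.std()
expected = sigma / np.sqrt(n)  # standard error predicted by the formula
print(round(observed, 2), round(expected, 2))
```

The two printed values come out very close, even though the population itself is far from normal.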
In part one of this series we learned that the normal distribution is the bell-shaped curve that is symmetrical about the mean (µ). Now we will define the z-score and the standard normal distribution.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

mu = 0
std = 1
snd = stats.norm(mu, std)  # standard normal distribution
x = np.linspace(-5, 5, 100)
plt.figure(figsize=(10, 5))
plt.plot(x, snd.pdf(x))
plt.xlim(-5, 5)
plt.title('Standard Normal Distribution', fontsize='15')
plt.xlabel('Values of Random Variable X', fontsize='15')
plt.ylabel('Probability', fontsize='15')
plt.savefig('sta.jpg')
plt.show()
Below is the curve produced by the above code.
A normal distribution with mean µ and standard deviation σ has the probability density function
f(x) = (1/(σ√(2π))) · e^(−(x−µ)²/(2σ²))
In the standard normal distribution the mean is µ = 0 and the standard deviation is σ = 1. When we put the values µ = 0 and σ = 1 into the equation, it simplifies to
f(x) = (1/√(2π)) · e^(−x²/2)
For any point x we can compute
z = (x − µ)/σ
which is called the z-score. The z-score measures how many standard deviations x lies from the mean, so it converts any normal distribution into the standard normal distribution. It is also called the standard score.
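The z-score transformation can be sketched directly in NumPy; the data values below are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical data values, for illustration only
data = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0, 9.0, 6.0])
mu = data.mean()
sigma = data.std()  # population standard deviation (ddof=0)

# z-score: number of standard deviations each value lies from the mean
z = (data - mu) / sigma
print(np.round(z, 2))

# Standardized data always has mean 0 and standard deviation 1
print(round(z.mean(), 6), round(z.std(), 6))
```

Whatever the original units, the standardized values have mean 0 and standard deviation 1, which is what makes z-scores comparable across different datasets.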
The cumulative distribution function (CDF) at any point x is the area under the probability density curve over the interval (−∞, z], i.e., the probability that the random variable takes a value less than or equal to x.
In the CDF plot the maximum value on the y-axis is 1, which means the total area under the standard normal distribution is 1.
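Both facts can be checked with scipy's `stats.norm`, the same object used in the code above; a minimal sketch:

```python
from scipy import stats

snd = stats.norm(0, 1)  # standard normal distribution

# CDF at z = area under the PDF over (-inf, z]
print(snd.cdf(0))               # 0.5: half the area lies below the mean
print(round(snd.cdf(1.96), 3))  # 0.975: basis of the common 95% interval

# The CDF approaches 1, so the total area under the curve is 1
print(round(snd.cdf(10), 3))
```

The last value confirms that far in the right tail the CDF has accumulated essentially all of the probability mass.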
According to Chebyshev’s theorem, for any distribution (normal or not) the fraction of data that lies within k standard deviations of the mean is at least 1 − 1/k², for any k > 1. For example, at least 1 − 1/4 = 75% of the data lies within 2 standard deviations of the mean.
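Chebyshev’s bound can be verified even on deliberately non-normal data; a minimal sketch with an arbitrary exponential sample:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
# Heavily skewed, non-normal data: Chebyshev's bound still holds
data = rng.exponential(scale=2.0, size=10_000)
mu, sigma = data.mean(), data.std()

for k in (2, 3):
    # Fraction of points within k standard deviations of the mean
    within = np.mean(np.abs(data - mu) <= k * sigma)
    bound = 1 - 1 / k**2
    print(f"k={k}: observed {within:.3f} >= bound {bound:.3f}")
```

The observed fractions comfortably exceed the theoretical lower bounds of 0.75 (k = 2) and 0.889 (k = 3), showing the bound is conservative but universal.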
In the first part of this series we discussed the basics of statistics, like the two branches of statistics, descriptive and inferential, and the difference between parametric and non-parametric methods. Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a population from sample data, so it lets us draw conclusions without assuming any particular distribution. KDE also smooths our data curve, avoiding the jagged bins of a histogram. It works by placing a kernel function on every data point and averaging them:
f̂(x) = (1/(n·h)) · Σᵢ K((x − xᵢ)/h)
where K is the kernel and h is the bandwidth.
For more details about kernels visit the link kernel density estimation.
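As a sketch, scipy’s `gaussian_kde` implements this idea with a Gaussian kernel and an automatically chosen bandwidth; the bimodal sample below is made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
# Bimodal sample: a mixture of two normal distributions
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

# gaussian_kde places a Gaussian kernel on each point and averages them
kde = stats.gaussian_kde(data)
x = np.linspace(-6, 7, 200)
density = kde(x)  # smooth density estimate, no binning required

# A valid density estimate integrates to 1 over its support
print(round(kde.integrate_box_1d(-np.inf, np.inf), 3))
```

Plotting `density` against `x` would show two smooth peaks where a histogram would show jagged bars, which is exactly the smoothing described above.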
The next topic, plotting graphs in statistics, is a long one and will get its own article. Stay tuned for the next part of the series.