In this series we are learning about statistics for data science. In part one we discussed topics like mean, median, population, sample, random variables, discrete random variables, continuous random variables, and more. You can read the previous article here: Statistics for Data Science (Part 1). In this article we will move on to intermediate topics like:
Standard Normal Distribution
Z Score
Probability Density Function
Cumulative Distribution Function
Different Plotting Graphs
Kernel Density Estimation
Skewness of Data
Covariance
Pearson Correlation Coefficient
Spearman Rank Correlation
Before moving to the topics mentioned above, we will look at the central limit theorem:
The expected value of the sample mean is equal to the population mean:
µ(sample means) = µ(population)
If we repeatedly draw samples of size n ≥ 30 (number of data objects in one sample), the sample means follow a Gaussian distribution(µ, σ^2/n). For example, take 100 samples of 30 observations each:
s1 (sample) = x1, x2, …, x30 → x̅1
s2 (sample) = x1, x2, …, x30 → x̅2
s3 (sample) = x1, x2, …, x30 → x̅3
…
s100 (sample) = x1, x2, …, x30 → x̅100
x̅ (mean of all sample means) = (x̅1 + x̅2 + … + x̅100)/100, and the x̅i follow a Gaussian distribution(µ, σ^2/n)
The number of data points in any sample should always be greater than or equal to 30; population mean = µ, population standard deviation = σ. The standard deviation of the sample mean (the standard error) is:
σ(sample mean) = σ(population)/√n
If the population data follows a normal (Gaussian) distribution, then the sample means will also follow a normal distribution, independent of sample size.
If the population is not normally distributed but the sample size is n ≥ 30, then the sample means will approximately follow a normal distribution, whatever the shape of the population distribution.
import numpy
import matplotlib.pyplot as plt

# Sample sizes to compare
num = [1, 10, 50, 100]
means = []
for j in num:
    numpy.random.seed(1)
    # 1000 sample means, each computed from j random integers in [-40, 40)
    x = [numpy.mean(numpy.random.randint(-40, 40, j)) for _i in range(1000)]
    means.append(x)

k = 0
fig, ax = plt.subplots(2, 2, figsize=(8, 8))
for i in range(0, 2):
    for j in range(0, 2):
        # Histogram of the 1000 sample means for each sample size
        ax[i, j].hist(means[k], 10, density=True)
        ax[i, j].set_title(label=num[k])
        k = k + 1
plt.show()
Each histogram above shows the probability density of the sample means: the y-axis gives the relative likelihood of observing a given sample mean. As the sample size grows, the distribution of the sample means approaches a bell-shaped curve, just as the central limit theorem predicts.
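The standard-error relationship σ(sample mean) = σ/√n can also be checked numerically. A minimal sketch, where the population, seed, and number of samples are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility
# Non-normal population: uniform integers, matching the demo above
population = rng.integers(-40, 40, size=100_000)
sigma = population.std()

n = 30  # sample size
# Draw 5,000 samples of size n and take each sample's mean
sample_means = rng.choice(population, size=(5000, n)).mean(axis=1)

observed = sample_means.std()
expected = sigma / np.sqrt(n)  # standard error predicted by the formula
print(round(observed, 2), round(expected, 2))
```

The two printed values come out very close, even though the population itself is far from normal.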
In part one of this series we learned that the normal distribution is the bell-shaped curve that is symmetrical about the mean (µ). Now we will define the z-score and the standard normal distribution.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

mu = 0
std = 1
snd = stats.norm(mu, std)  # standard normal distribution
x = np.linspace(-5, 5, 100)
plt.figure(figsize=(10, 5))
plt.plot(x, snd.pdf(x))
plt.xlim(-5, 5)
plt.title('Standard Normal Distribution', fontsize='15')
plt.xlabel('Values of Random Variable X', fontsize='15')
plt.ylabel('Probability', fontsize='15')
plt.savefig('sta.jpg')
plt.show()
Below is the curve produced by the above code.
A normal distribution with mean µ and standard deviation σ has the probability density function
f(x) = (1/(σ√(2π))) · e^(−(x−µ)²/(2σ²))
In the standard normal distribution the mean is µ = 0 and the standard deviation is σ = 1. When we put the values µ = 0 and σ = 1 into the equation, it simplifies to
f(x) = (1/√(2π)) · e^(−x²/2)
For any point x we can compute
z = (x − µ)/σ
which is called the z-score. The z-score measures how many standard deviations x lies from the mean, so it converts any normal distribution into the standard normal distribution. It is also called the standard score.
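The z-score transformation can be sketched directly in NumPy; the data values below are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical data values, for illustration only
data = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0, 9.0, 6.0])
mu = data.mean()
sigma = data.std()  # population standard deviation (ddof=0)

# z-score: number of standard deviations each value lies from the mean
z = (data - mu) / sigma
print(np.round(z, 2))

# Standardized data always has mean 0 and standard deviation 1
print(round(z.mean(), 6), round(z.std(), 6))
```

Whatever the original units, the standardized values have mean 0 and standard deviation 1, which is what makes z-scores comparable across different datasets.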
The cumulative distribution function (CDF) at any point x is the area under the probability density curve over the interval (−∞, z], i.e., the probability that the random variable takes a value less than or equal to x.
In the CDF plot the maximum value on the y-axis is 1, which means the total area under the standard normal distribution is 1.
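Both facts can be checked with scipy's `stats.norm`, the same object used in the code above; a minimal sketch:

```python
from scipy import stats

snd = stats.norm(0, 1)  # standard normal distribution

# CDF at z = area under the PDF over (-inf, z]
print(snd.cdf(0))               # 0.5: half the area lies below the mean
print(round(snd.cdf(1.96), 3))  # 0.975: basis of the common 95% interval

# The CDF approaches 1, so the total area under the curve is 1
print(round(snd.cdf(10), 3))
```

The last value confirms that far in the right tail the CDF has accumulated essentially all of the probability mass.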
According to Chebyshev’s theorem, for any distribution (normal or not) the fraction of data that lies within k standard deviations of the mean is at least 1 − 1/k², for any k > 1. For example, at least 1 − 1/4 = 75% of the data lies within 2 standard deviations of the mean.
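Chebyshev’s bound can be verified even on deliberately non-normal data; a minimal sketch with an arbitrary exponential sample:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
# Heavily skewed, non-normal data: Chebyshev's bound still holds
data = rng.exponential(scale=2.0, size=10_000)
mu, sigma = data.mean(), data.std()

for k in (2, 3):
    # Fraction of points within k standard deviations of the mean
    within = np.mean(np.abs(data - mu) <= k * sigma)
    bound = 1 - 1 / k**2
    print(f"k={k}: observed {within:.3f} >= bound {bound:.3f}")
```

The observed fractions comfortably exceed the theoretical lower bounds of 0.75 (k = 2) and 0.889 (k = 3), showing the bound is conservative but universal.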
In the first part of this series we discussed the basics of statistics, like the two branches of statistics, descriptive and inferential, and the difference between parametric and non-parametric methods. Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a population from sample data, so it lets us draw conclusions without assuming any particular distribution. KDE also smooths our data curve, avoiding the jagged bins of a histogram. It works by placing a kernel function on every data point and averaging them:
f̂(x) = (1/(n·h)) · Σᵢ K((x − xᵢ)/h)
where K is the kernel and h is the bandwidth.
For more details about kernels visit the link kernel density estimation.
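As a sketch, scipy’s `gaussian_kde` implements this idea with a Gaussian kernel and an automatically chosen bandwidth; the bimodal sample below is made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
# Bimodal sample: a mixture of two normal distributions
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

# gaussian_kde places a Gaussian kernel on each point and averages them
kde = stats.gaussian_kde(data)
x = np.linspace(-6, 7, 200)
density = kde(x)  # smooth density estimate, no binning required

# A valid density estimate integrates to 1 over its support
print(round(kde.integrate_box_1d(-np.inf, np.inf), 3))
```

Plotting `density` against `x` would show two smooth peaks where a histogram would show jagged bars, which is exactly the smoothing described above.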
The next topic, plotting graphs in statistics, is a long one and will get its own article. Stay tuned for the next part of the series.