In this project, we examine the distribution of real-world data: the daily stock prices of Google (ticker GOOG) from 2014 to 2018. We compare how well the normal and t-distributions fit the observed values. Finally, we fit a mixture of Gaussians to the data and note its scope and limitations.

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

`data = pd.read_csv('stock_prices.csv', parse_dates=['date'])`

```
goog = data[data['Name'] == 'GOOG'].copy()
goog.head()
```

| | date | open | high | low | close | volume | Name |
|---|---|---|---|---|---|---|---|
| 251567 | 2014-03-27 | 568.000 | 568.00 | 552.92 | 558.46 | 13052 | GOOG |
| 251568 | 2014-03-28 | 561.200 | 566.43 | 558.67 | 559.99 | 41003 | GOOG |
| 251569 | 2014-03-31 | 566.890 | 567.00 | 556.93 | 556.97 | 10772 | GOOG |
| 251570 | 2014-04-01 | 558.710 | 568.45 | 558.71 | 567.16 | 7932 | GOOG |
| 251571 | 2014-04-02 | 565.106 | 604.83 | 562.19 | 567.00 | 146697 | GOOG |

`goog.set_index('date')['close'].plot()`


Prices depend on the level of the stock, whereas returns are relative changes (scale-free and centered near zero), so we will work with daily returns instead of prices.

```
goog['prev_close'] = goog['close'].shift(1)
goog['return'] = goog['close'] / goog['prev_close'] - 1
goog['return'].hist(bins=100);
```
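The shift-and-divide computation above should be equivalent to pandas' built-in `pct_change`. The sketch below checks this on the five closing prices shown in the table earlier; the variable names `close`, `manual` and `builtin` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Closing prices copied from the first five GOOG rows shown earlier.
close = pd.Series([558.46, 559.99, 556.97, 567.16, 567.00])

# Same formula as in the notebook: today's close over yesterday's, minus 1.
manual = close / close.shift(1) - 1

# pandas shortcut for the same quantity.
builtin = close.pct_change()

# The first entry is NaN in both (there is no previous close), so compare the rest.
same = np.allclose(manual.dropna(), builtin.dropna())
print(same)
```

On the full data, `goog['close'].pct_change()` could replace the two assignment lines, at the cost of no longer keeping the `prev_close` column.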

`goog['return'].mean()`

```
0.000744587445980615
```

`goog['return'].std()`

```
0.014068710504926713
```

### Normal Distribution

`from scipy.stats import norm`

```
x_list = np.linspace(goog['return'].min(),goog['return'].max(),100)
y_list = norm.pdf(x_list, loc=goog['return'].mean(), scale=goog['return'].std())
```

```
plt.plot(x_list, y_list);
goog['return'].hist(bins=100, density=True);
```

We note that the normal distribution does not fit the returns well: the empirical distribution has higher kurtosis (a sharper peak and heavier tails) than a Gaussian.
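To put a number on "higher kurtosis", one option is `scipy.stats.kurtosis`, which by default reports excess kurtosis (0 for a normal distribution). Since the CSV is not reproduced here, the sketch below uses synthetic samples as stand-ins; the sample sizes, scales and degrees of freedom are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import kurtosis, t as tdist

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions, not the GOOG data):
# a normal sample and a heavy-tailed Student-t sample.
normal_sample = rng.normal(loc=0.0, scale=0.014, size=50_000)
heavy_sample = tdist.rvs(df=5, scale=0.009, size=50_000, random_state=rng)

# fisher=True is the default, so these are excess kurtosis values.
k_normal = kurtosis(normal_sample)
k_heavy = kurtosis(heavy_sample)

print("normal sample excess kurtosis:", k_normal)
print("heavy-tailed sample excess kurtosis:", k_heavy)
```

On the real data the equivalent call would be `kurtosis(goog['return'].dropna())`; a value well above 0 would be consistent with the visual mismatch in the histogram.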

### t-Distribution

`from scipy.stats import t as tdist`

`params = tdist.fit(goog['return'].dropna())`

`params`

```
(3.4870263950708473, 0.000824161133877212, 0.009156583689241837)
```

`df, loc, scale = params`

`y_list = tdist.pdf(x_list, df, loc, scale)`

```
plt.plot(x_list, y_list);
goog['return'].hist(bins=100, density=True);
```

The t-distribution fits the data quite well. However, it has one more parameter to estimate than the normal distribution: the degrees of freedom `df`, in addition to location and scale.
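One common way to weigh a better fit against an extra parameter is the Akaike information criterion, AIC = 2k - 2 log L, where lower is better. The sketch below is not part of the original analysis: it uses a synthetic heavy-tailed sample as a stand-in for `goog['return'].dropna()`, and the helper `aic` is a hypothetical name introduced here.

```python
import numpy as np
from scipy.stats import norm, t as tdist

rng = np.random.default_rng(0)

# Synthetic stand-in for the returns (an assumption, not the GOOG data).
returns = tdist.rvs(df=3.5, loc=0.0008, scale=0.009, size=2000, random_state=rng)

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Normal fit: 2 parameters (location, scale).
mu, sigma = norm.fit(returns)
ll_norm = norm.logpdf(returns, mu, sigma).sum()

# Student-t fit: 3 parameters (degrees of freedom, location, scale).
df, loc, scale = tdist.fit(returns)
ll_t = tdist.logpdf(returns, df, loc, scale).sum()

aic_norm = aic(ll_norm, 2)
aic_t = aic(ll_t, 3)

print("AIC normal:", aic_norm)
print("AIC t:     ", aic_t)
```

If the t-distribution still has the lower AIC after paying for its third parameter, the extra degree of freedom is arguably earning its keep.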

### Mixture of Gaussians

```
from sklearn.mixture import GaussianMixture
data = np.array(goog['return'].dropna()).reshape(-1, 1)
model = GaussianMixture(n_components=2)
model.fit(data)
weights = model.weights_
means = model.means_
cov = model.covariances_
print("weights:", weights)
print("means:", means)
print("variances:", cov)
```

```
weights: [0.29533565 0.70466435]
means: [[-0.00027777]
[ 0.00117307]]
variances: [[[5.05105403e-04]]
[[6.96951828e-05]]]
```

```
means = means.flatten()
var = cov.flatten()
```

```
x_list = np.linspace(data.min(), data.max(), 100)
fx0 = norm.pdf(x_list, means[0], np.sqrt(var[0]))
fx1 = norm.pdf(x_list, means[1], np.sqrt(var[1]))
fx = weights[0] * fx0 + weights[1] * fx1
```

```
goog['return'].hist(bins=100, density=True)
plt.plot(x_list, fx, label='mixture model')
plt.legend();
```
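The weighted sum of component densities built by hand above can be cross-checked against scikit-learn's own `score_samples`, which returns the log-density of the fitted mixture. A minimal sketch on synthetic data follows; the two-volatility sample and the names `sample`, `x_grid`, `manual_pdf` and `builtin_pdf` are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for the returns: a calm and a volatile regime (assumption).
sample = np.concatenate([
    rng.normal(0.0, 0.022, size=600),
    rng.normal(0.001, 0.008, size=1400),
]).reshape(-1, 1)

model = GaussianMixture(n_components=2, random_state=0).fit(sample)

x_grid = np.linspace(sample.min(), sample.max(), 100)
weights = model.weights_
means = model.means_.flatten()
variances = model.covariances_.flatten()

# Manual mixture density: weighted sum of the component normal densities.
manual_pdf = sum(
    w * norm.pdf(x_grid, m, np.sqrt(v))
    for w, m, v in zip(weights, means, variances)
)

# Built-in: score_samples gives log-density, so exponentiate it.
builtin_pdf = np.exp(model.score_samples(x_grid.reshape(-1, 1)))

print("max abs difference:", np.max(np.abs(manual_pdf - builtin_pdf)))
```

Using `score_samples` also avoids writing a separate loop when the number of components changes.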

We note the performance of the GMM with n_components=2. The fit is similar to that of the t-distribution. The number of parameters, however, rises from 3 to 5: a one-dimensional GMM with K components has K means, K variances and K-1 independent weights (the weights sum to 1), i.e. 3K-1 free parameters.

GMMs are universal approximators of densities: given enough components, a mixture can approximate any smooth density arbitrarily closely. Hence, we can fit the distribution better with more Gaussians in the mixture. We try it with 10 components.

```
num_components = 10
model = GaussianMixture(n_components=num_components)
model.fit(data)
weights = model.weights_
means = model.means_
cov = model.covariances_
means = means.flatten()
var = cov.flatten()
fx = 0
x_list = np.linspace(data.min(), data.max(), 100)
# Accumulate the weighted density of each component on the grid
for i in range(num_components):
    fx += weights[i] * norm.pdf(x_list, means[i], np.sqrt(var[i]))
```

```
goog['return'].hist(bins=100, density=True)
plt.plot(x_list, fx, label='mixture model')
plt.legend();
```

We note that the fit is much better with n_components=10. However, a one-dimensional mixture with K components has 3K-1 free parameters (K means, K variances and K-1 independent weights), so this model has 29 parameters, 24 more than the two-component model. We are also at risk of overfitting the data. Finally, there does not seem to be a simple hypothesis-testing routine for mixtures of Gaussians. The best I have seen are Bayes factor methods on the posterior distribution. I plan to discuss Bayesian methods for hypothesis testing in my next project.
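Short of a formal hypothesis test, one common heuristic against overfitting is to compare the Bayesian information criterion (BIC) across candidate component counts; `GaussianMixture` exposes this as `bic`. The sketch below is not a substitute for the Bayes factor approach mentioned above. It runs on a synthetic two-regime sample, and the names `sample`, `candidate_ks` and `bic_by_k` are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for the returns: a calm and a volatile regime (assumption).
sample = np.concatenate([
    rng.normal(0.0, 0.022, size=600),
    rng.normal(0.001, 0.008, size=1400),
]).reshape(-1, 1)

candidate_ks = range(1, 7)
bic_by_k = {}
for k in candidate_ks:
    # n_init > 1 reruns EM from several starts to reduce sensitivity to initialization.
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(sample)
    bic_by_k[k] = gmm.bic(sample)  # lower is better

best_k = min(bic_by_k, key=bic_by_k.get)

for k, value in bic_by_k.items():
    print(k, value)
print("lowest BIC at n_components =", best_k)
```

On the real data, `sample` would be the `data` array passed to `model.fit`. BIC penalizes each extra parameter by log(n), so a 10-component model has to improve the likelihood substantially to be preferred.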