Hypothesis Testing II – Gaussian Mixtures

In this project, we examine the distribution of real-world data: Google (GOOG) stock prices from 2014 to 2018. We compare normal and t-distributions as fits to the observed returns, and finally try a mixture of Gaussians, noting its scope and limitations.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('stock_prices.csv', parse_dates=['date'])  # parse the date column; parse_dates=True would only parse the index
goog = data[data['Name'] == 'GOOG'].copy()
goog.head()
date open high low close volume Name
251567 2014-03-27 568.000 568.00 552.92 558.46 13052 GOOG
251568 2014-03-28 561.200 566.43 558.67 559.99 41003 GOOG
251569 2014-03-31 566.890 567.00 556.93 556.97 10772 GOOG
251570 2014-04-01 558.710 568.45 558.71 567.16 7932 GOOG
251571 2014-04-02 565.106 604.83 562.19 567.00 146697 GOOG
goog.set_index('date')['close'].plot()

[Figure: GOOG closing price over time]

Prices trend over time, while daily returns fluctuate around zero on a comparable scale, so we will be working with returns instead of prices.

goog['prev_close'] = goog['close'].shift(1)               # previous trading day's close
goog['return'] = goog['close'] / goog['prev_close'] - 1   # daily simple return
goog['return'].hist(bins=100);

[Figure: histogram of GOOG daily returns]

goog['return'].mean()
0.000744587445980615
goog['return'].std()
0.014068710504926713

Normal Distribution

from scipy.stats import norm
x_list = np.linspace(goog['return'].min(), goog['return'].max(), 100)
y_list = norm.pdf(x_list, loc=goog['return'].mean(), scale=goog['return'].std())
plt.plot(x_list, y_list);
goog['return'].hist(bins=100, density=True);

[Figure: returns histogram with fitted normal density]

We note that the data does not quite match the Gaussian distribution: it shows excess kurtosis, with more mass in the tails (and near the center) than the fitted normal.
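To put a number on the tail heaviness, here is a quick sketch using scipy.stats (the name returns_clean is introduced here for convenience): it computes the sample excess kurtosis and runs a Jarque-Bera normality test. A normal distribution has excess kurtosis 0, and a small Jarque-Bera p-value rejects normality.

from scipy.stats import kurtosis, jarque_bera

returns_clean = goog['return'].dropna()

# Fisher definition: excess kurtosis is 0 for a normal distribution
print("excess kurtosis:", kurtosis(returns_clean))

# Jarque-Bera jointly tests skewness and kurtosis against normal values
stat, pvalue = jarque_bera(returns_clean)
print("Jarque-Bera statistic:", stat, "p-value:", pvalue)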

t-Distribution

from scipy.stats import t as tdist
params = tdist.fit(goog['return'].dropna())
params
(3.4870263950708473, 0.000824161133877212, 0.009156583689241837)
df, loc, scale = params
y_list = tdist.pdf(x_list, df, loc, scale)
plt.plot(x_list, y_list);
goog['return'].hist(bins=100, density=True);

[Figure: returns histogram with fitted t density]

The t-distribution fits the data quite well. However, it has one more parameter to deal with: the degrees of freedom, in addition to location and scale.
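To compare the two candidate fits more formally, here is a rough sketch using one-sample Kolmogorov-Smirnov tests against each fitted distribution. A caveat: the parameters were estimated from the same data, so the standard KS p-values are optimistic; treat this as a relative comparison only.

from scipy.stats import kstest

returns_clean = goog['return'].dropna()

# KS test against the fitted normal and t distributions
ks_norm = kstest(returns_clean, 'norm',
                 args=(returns_clean.mean(), returns_clean.std()))
ks_t = kstest(returns_clean, 't', args=(df, loc, scale))
print("normal:", ks_norm)
print("t:", ks_t)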

Mixture of Gaussians

from sklearn.mixture import GaussianMixture

# reshape the returns into an (n, 1) array for scikit-learn; we use a new
# name so the `data` DataFrame loaded earlier is not shadowed
returns = np.array(goog['return'].dropna()).reshape(-1, 1)
model = GaussianMixture(n_components=2)
model.fit(returns)

weights = model.weights_
means = model.means_
cov = model.covariances_
print("weights:", weights)
print("means:", means)
print("variances:", cov)
weights: [0.29533565 0.70466435]
means: [[-0.00027777]
 [ 0.00117307]]
variances: [[[5.05105403e-04]]

 [[6.96951828e-05]]]
means = means.flatten()
var = cov.flatten()
x_list = np.linspace(returns.min(), returns.max(), 100)
fx0 = norm.pdf(x_list, means[0], np.sqrt(var[0]))
fx1 = norm.pdf(x_list, means[1], np.sqrt(var[1]))
fx = weights[0] * fx0 + weights[1] * fx1
goog['return'].hist(bins=100, density=True)
plt.plot(x_list, fx, label='mixture model')
plt.legend();

[Figure: returns histogram with two-component Gaussian mixture density]

We note the performance of the GMM with n_components=2: the fit is similar to the t-distribution. The parameter count grows only modestly, since a K-component univariate GMM has 3K - 1 free parameters (K means, K variances, and K - 1 independent weights), so K = 2 gives 5 parameters versus 3 for the t-distribution.
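As a quick sanity check on this trade-off, a small sketch below compares the average log-likelihood per observation under the three fits; in scikit-learn, model.score already reports the mean log-likelihood per sample.

# Average log-likelihood per observation under each fitted model
ll_norm = norm.logpdf(returns, goog['return'].mean(), goog['return'].std()).mean()
ll_t = tdist.logpdf(returns, *params).mean()
ll_gmm = model.score(returns)  # mean log-likelihood under the 2-component GMM

print("normal:", ll_norm, "(2 params)")
print("t:", ll_t, "(3 params)")
print("GMM-2:", ll_gmm, "(3*2 - 1 = 5 params)")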

We note that GMMs are universal density approximators, so adding more Gaussians to the mixture lets us fit the distribution more closely. We try it with 10 components.

num_components = 10
model = GaussianMixture(n_components=num_components)
model.fit(returns)

weights = model.weights_
means = model.means_
cov = model.covariances_

means = means.flatten()
var = cov.flatten()

fx = 0
x_list = np.linspace(returns.min(), returns.max(), 100)
for i in range(num_components):
    fx += weights[i] * norm.pdf(x_list, means[i], np.sqrt(var[i]))
goog['return'].hist(bins=100, density=True)
plt.plot(x_list, fx, label='mixture model')
plt.legend();

[Figure: returns histogram with ten-component Gaussian mixture density]

We note the fit is much better with n_components = 10. However, it comes at a price: at 3K - 1 free parameters, K = 10 means 29 parameters, 24 more than the two-component model, and we are likely overfitting the data. Finally, there does not seem to be a simple hypothesis-testing routine for mixtures of Gaussians; the best I have seen are Bayes-factor methods on the posterior distribution. I plan to discuss Bayesian methods for hypothesis testing in my next project.
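One way to guard against that overfitting, sketched below as one possible criterion rather than a definitive procedure, is to let BIC pick the number of components; scikit-learn exposes this through GaussianMixture.bic, and the selected value can vary with the EM initialization.

# Choose n_components by BIC (lower is better); BIC penalizes extra parameters
bics = []
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(returns)
    bics.append(gmm.bic(returns))

plt.plot(range(1, 11), bics, marker='o')
plt.xlabel('n_components')
plt.ylabel('BIC');
print("BIC-preferred n_components:", int(np.argmin(bics)) + 1)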
