Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
738 views
in Technique[技术] by (71.8m points)

statistics - normality test of a distribution in python

I have some data I have sampled from a radar satellite image and wanted to perform some statistical tests on. Before this I wanted to conduct a normality test so I could be sure my data was normally distributed. My data appears to be normally distributed but when I perform the test Im getting a Pvalue of 0, suggesting my data is not normally distributed.

I have attached my code along with the output and a histogram of the distribution (Im relatively new to python so apologies if my code is clunky in any way). Can anyone tell me if Im doing something wrong - I find it hard to believe from my histogram that my data is not normally distributed?

values = 'inputfile.h5'
f = h5py.File(values,'r')
dset = f['/DATA/DATA']
array = dset[...,0]
print('normality =', scipy.stats.normaltest(array))
max = np.amax(array)
min = np.amin(array)

histo = np.histogram(array, bins=100, range=(min, max))
freqs = histo[0]
rangebins = (max - min)
numberbins = (len(histo[1])-1)
interval = (rangebins/numberbins)
newbins = np.arange((min), (max), interval)
histogram = bar(newbins, freqs, width=0.2, color='gray')
plt.show()

This prints this: (41099.095955202931, 0.0). the first element is a chi-square value and the second is a pvalue.

I have made a graph of the data which I have attached. I thought that maybe as Im dealing with negative values it was causing a problem so I normalised the values but the problem persists.

histogram of values in array

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This question explains why you're getting such a small p-value. Essentially, normality tests almost always reject the null on very large sample sizes (in yours, for example, you can see just some skew in the left side, which at your enormous sample size is way more than enough).

What would be much more practically useful in your case is to plot a normal curve fit to your data. Then you can see how the normal curve actually differs (for example, you can see whether the tail on the left side does indeed go too long). For example:

from matplotlib import pyplot as plt
import matplotlib.mlab as mlab

n, bins, patches = plt.hist(array, 50, normed=1)
mu = np.mean(array)
sigma = np.std(array)
plt.plot(bins, mlab.normpdf(bins, mu, sigma))

(Note the normed=1 argument: this ensures that the histogram is normalized to have a total area of 1, which makes it comparable to a density like the normal distribution).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...