Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
380 views
in Technique[技术] by (71.8m points)

python - NumPy or SciPy to calculate weighted median

I'm trying to automate a process that JMP does (Analyze->Distribution, entering column A as the "Y value", using subsequent columns as the "weight" value). In JMP you have to do this one column at a time - I'd like to use Python to loop through all of the columns and create an array showing, say, the median of each column.

For example, if the mass array is [0, 10, 20, 30], and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.

So far I've

  1. imported the csv showing the weights as an array, masking values of 0, and
  2. created an array of the "Y value" the same shape and size as the weights array (113x32). I'm not entirely sure I need to do this, but thought it would be easier than a for loop for the purpose of weighting.

I'm not sure exactly where to go from here. Basically the "Y value" is a range of masses, and all of the columns in the array represent the number of data points found for each mass. I need to find the median mass, based on the frequency with which they were reported.

I'm not an expert in Python or statistics, so if I've omitted any details that would be useful let me know!

Update: here's some code for what I've done so far:

#Boilerplate & Import files
import csv
import scipy as sp
from scipy import stats
from scipy.stats import norm
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt

inputFile = '/Users/cl/prov.csv'
origArray = genfromtxt(inputFile, delimiter = ",")
nArray = np.array(origArray)
dimensions = nArray.shape
shape = np.asarray(dimensions)

#Mask values ==0
maTest = np.ma.masked_equal(nArray,0)

#Create array of masses the same shape as the weights (nArray)
fieldLength = shape[0]
rowLength = shape[1]

for i in range (rowLength):
    createArr = np.arange(0, fieldLength*10, 10)
    nCreateArr = np.array(createArr)
    massArr.append(nCreateArr)
    nCreateArr = np.array(massArr)
nmassArr = nCreateArr.transpose()
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

What we can do, if i understood your problem correctly. Is to sum up the observations, dividing by 2 would give us the observation number corresponding to the median. From there we need to figure out what observation this number was.

One trick here, is to calculate the observation sums with np.cumsum. Which gives us a running cumulative sum.

Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previously elements and itself. We have 10 observations here. so the mean would be the 5th observation. (We get 5 by dividing the last element by 2).
Now looking at the cumsum result, we can easily see that that must be the observation between the second and third elements (observation 3 and 6).

So all we need to do, is figure out the index of where the median (5) will fit.
np.searchsorted does exactly what we need. It will find the index to insert an elements into an array, so that it stays sorted.

The code to do it like so:

import numpy as np
#my test data
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10], [10,20,30,40], [100,10,10,10], [1,1,1,100]])

c = np.cumsum(freq_count, axis=1) 
indices = [np.searchsorted(row, row[-1]/2.0) for row in c]
masses = [i * 10 for i in indices] #Correct if the masses are indeed 0, 10, 20,...

#This is just for explanation.
print "median masses is:",  masses
print freq_count
print np.hstack((c, c[:, -1, np.newaxis]/2.0))

Output will be:

median masses is: [10 20 20  0 30]  
[[ 30 191   9   0]  <- The test data
 [ 10  20 300  10]  
 [ 10  20  30  40]  
 [100  10  10  10]  
 [  1   1   1 100]]  
[[  30.   221.   230.   230.   115. ]  <- cumsum results with median added to the end.
 [  10.    30.   330.   340.   170. ]     you can see from this where they fit in.
 [  10.    30.    60.   100.    50. ]  
 [ 100.   110.   120.   130.    65. ]  
 [   1.     2.     3.   103.    51.5]]  

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...