VERY IMPORTANT EDIT: All Ai are unique.
The Question
I have a list A of n unique objects. Each object Ai has a variable percentage Pi.
I want to create an algorithm that generates a new list B of k objects (k < n/2 and in most cases k is significantly less than n/2. E.g. n=231 , k=21). List B should have no duplicates and will be populated with objects originating from list A with the following restriction:
The probability that an object Ai appears in B is Pi.
What I Have Tried
(These snipits are in PHP simply for the purposes of testing)
I first made list A
$list = [
"A" => 2.5,
"B" => 2.5,
"C" => 2.5,
"D" => 2.5,
"E" => 2.5,
"F" => 2.5,
"G" => 2.5,
"H" => 2.5,
"I" => 5,
"J" => 5,
"K" => 2.5,
"L" => 2.5,
"M" => 2.5,
"N" => 2.5,
"O" => 2.5,
"P" => 2.5,
"Q" => 2.5,
"R" => 2.5,
"S" => 2.5,
"T" => 2.5,
"U" => 5,
"V" => 5,
"W" => 5,
"X" => 5,
"Y" => 5,
"Z" => 20
];
At first I tried the following two algorthms (These are in PHP simply for the purposes of testing):
$result = [];
while (count($result) < 10) {
$rnd = rand(0,10000000) / 100000;
$sum = 0;
foreach ($list as $key => $value) {
$sum += $value;
if ($rnd <= $sum) {
if (in_array($key,$result)) {
break;
} else {
$result[] = $key;
break;
}
}
}
}
AND
$result = [];
while (count($result) < 10) {
$sum = 0;
foreach ($list as $key => $value) {
$sum += $value;
}
$rnd = rand(0,$sum * 100000) / 100000;
$sum = 0;
foreach ($list as $key => $value) {
$sum += $value;
if ($rnd <= $sum) {
$result[] = $key;
unset($list[$key]);
break;
}
}
}
The only differences between the two algorithms is that one tries again when it encounters a duplicate, and one removes the object form list A when it is picked. As it turns out, these two algorithms have the same probability outputs.
I ran the second algorithm 100,000 times and kept track of how many times each letter was picked. The following array contians the percentage chance that a letter is picked in any list B based off of the 100,000 tests.
[A] => 30.213
[B] => 29.865
[C] => 30.357
[D] => 30.198
[E] => 30.152
[F] => 30.472
[G] => 30.343
[H] => 30.011
[I] => 51.367
[J] => 51.683
[K] => 30.271
[L] => 30.197
[M] => 30.341
[N] => 30.15
[O] => 30.225
[P] => 30.135
[Q] => 30.406
[R] => 30.083
[S] => 30.251
[T] => 30.369
[U] => 51.671
[V] => 52.098
[W] => 51.772
[X] => 51.739
[Y] => 51.891
[Z] => 93.74
When looking back at the algorithm this makes sense. The algorithm incorrectly interpreted the original percentages to be the percentage chance that an object is picked for any given location, not any list B. So for example, in reality, the chance that Z is picked in a list B is 93%, but the chance that Z is picked for an index Bn is 20%. This is NOT what I want. I want the chance that Z is picked in a list B to be 20%.
Is this even possible? How can it be done?
EDIT 1
I tried simply having the sum of all Pi = k, this worked if all Pi are equal, but after modifying their values, it started to get more and more wrong.
Initial Probabilities
$list= [
"A" => 8.4615,
"B" => 68.4615,
"C" => 13.4615,
"D" => 63.4615,
"E" => 18.4615,
"F" => 58.4615,
"G" => 23.4615,
"H" => 53.4615,
"I" => 28.4615,
"J" => 48.4615,
"K" => 33.4615,
"L" => 43.4615,
"M" => 38.4615,
"N" => 38.4615,
"O" => 38.4615,
"P" => 38.4615,
"Q" => 38.4615,
"R" => 38.4615,
"S" => 38.4615,
"T" => 38.4615,
"U" => 38.4615,
"V" => 38.4615,
"W" => 38.4615,
"X" => 38.4615,
"Y" =>38.4615,
"Z" => 38.4615
];
Results after 10,000 runs
Array
(
[A] => 10.324
[B] => 59.298
[C] => 15.902
[D] => 56.299
[E] => 21.16
[F] => 53.621
[G] => 25.907
[H] => 50.163
[I] => 30.932
[J] => 47.114
[K] => 35.344
[L] => 43.175
[M] => 39.141
[N] => 39.127
[O] => 39.346
[P] => 39.364
[Q] => 39.501
[R] => 39.05
[S] => 39.555
[T] => 39.239
[U] => 39.283
[V] => 39.408
[W] => 39.317
[X] => 39.339
[Y] => 39.569
[Z] => 39.522
)
See Question&Answers more detail:
os