Can we approximate the random AP probability? #84
In current tests I actually found that calculating the exact value is faster for low total_pairs, especially when compared to sampling a large number of permutations. For M=100, n=10:
"""
%timeit (timing(random_ap)(100000,n,M,0)).mean()
110 ms ± 755 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit random_exact_ap(M,n)
15.1 ms ± 75.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
""" |
Indeed it is too computationally expensive, even with threading and optimisations.

#!/usr/bin/env jupyter
"""Time the exact random AP calculation against sampling.

Note that I tried the Stirling and Gosper approximations for factorials, but that
turned out to be slower. Surprisingly, `math.comb` seemed to be the
best way to get the combinatorials.
"""
from timeit import Timer

setup = """
from functools import partial
from math import comb
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

from copairs.compute import random_ap
from copairs.timing import timing


def get_hypergeom_prob(M, n, N, k):
    # Probability of exactly k positives in the top N, for n positives among M items.
    return comb(n, k) * comb(M - n, N - k) / comb(M, N)


def random_exact_ap(M, n):
    # Enumerate every valid (N, k) pair: for each k in 1..n, N runs over k..k+M-n.
    window = sliding_window_view(np.arange(1, M + 1), M - n + 1)
    xs, ys = np.where(window)
    ks = xs + 1
    Ns = window[xs, ys]
    p_at_N = ks / Ns
    partial_fn = partial(get_hypergeom_prob, M, n)
    with ThreadPoolExecutor() as executor:
        prob = list(executor.map(partial_fn, Ns, ks))
    aps = prob * p_at_N * p_at_N
    return aps.sum() / n


M = {M}
n = {n}
null_size = {null_size}
"""
Ms = (100, 1000, 10000, 100000)
ns = (10, 100, 1000)
nsamples = (1000, 10000, 100000)
n_times = 3
n_repeats = 5

results = dict()
for M, n in zip(Ms, ns):
    for fn in (
        "random_exact_ap(M,n)",
        "timing(random_ap)(null_size,n,M,0)",
    ):
        for null_size in nsamples:
            key = f"{M}_{n}_{null_size}"
            times = Timer(fn, setup=setup.format(M=M, n=n, null_size=null_size)).repeat(
                n_times, n_repeats
            )
            avg = sum(times) / n_repeats
            print(f"{fn.split('(')[0]},{M=},{n=},{null_size=} has an avg time of {avg:0.3f} secs.")
            results[key] = avg
            # The exact calculation does not depend on null_size, so run it only once.
            if "exact" in fn:
                break
"""
random_exact_ap,M=100,n=10,null_size=1000 has an avg time of 0.034 secs.
timing,M=100,n=10,null_size=1000 has an avg time of 0.003 secs.
timing,M=100,n=10,null_size=10000 has an avg time of 0.033 secs.
timing,M=100,n=10,null_size=100000 has an avg time of 0.316 secs.
random_exact_ap,M=1000,n=100,null_size=1000 has an avg time of 13.404 secs.
timing,M=1000,n=100,null_size=1000 has an avg time of 0.029 secs.
timing,M=1000,n=100,null_size=10000 has an avg time of 0.294 secs.
timing,M=1000,n=100,null_size=100000 has an avg time of 3.007 secs.
""" |
Relevant notebook made by Shantanu that I found when researching this.
I did a little bit more digging and managed to get a better implementation by vectorising the inputs of scipy.stats.hypergeom. Comparing it against the sampling of random_ap:

#!/usr/bin/env jupyter
"""Time the exact random_ap calculation vs the simulations.

Note that I tried the Stirling and Gosper approximations for factorials, but that
turned out to be slower. Surprisingly, `math.comb` seemed to be the
best way to get the combinatorials.
"""
from timeit import Timer

setup = """
import numpy as np
from scipy.stats import hypergeom

from copairs.compute import random_ap
from copairs.timing import timing


def hypergeom_vectorised(M, n):
    # Grids of positive counts k (rows) and rank positions N (columns), both 1-based.
    k, N = np.indices((n, M)) + 1
    p_at_N = k / N
    # pmf evaluated over the whole grid; the reversed indices yield the same values
    # as pmf(k, M, n, N), via the symmetry P(k in top N) = P(n-k in the other M-N).
    result = hypergeom.pmf(
        np.arange(n)[::-1, np.newaxis], M, n, np.arange(M)[np.newaxis, ::-1]
    )
    norm = result * p_at_N * p_at_N
    added = norm.sum()
    divided = added / n
    return divided


M = {M}
n = {n}
null_size = {null_size}
"""
from itertools import product

Ms = (1000, 5000, 10000)
ns = (10, 100, 500)
nsamples = (50000, 500000)
n_times = 3
n_repeats = 5

results = dict()
for M, n in product(Ms, ns):
    for fn in (
        "hypergeom_vectorised(M,n)",
        "timing(random_ap)(null_size,n,M,0)",
    ):
        # The exact calculation ignores null_size, so only run it once.
        if "random_ap" not in fn:
            iters = (0,)
        else:
            iters = nsamples
        for null_size in iters:
            times = Timer(fn, setup=setup.format(M=M, n=n, null_size=null_size)).repeat(
                n_times, n_repeats
            )
            avg = sum(times) / n_repeats
            print(f"{fn.split('(')[0]},{M=},{n=},{null_size=} has an avg time of {avg:0.3f} secs.")
"""
hypergeom_vectorised,M=1000,n=10,null_size=0 has an avg time of 0.081 secs.
hypergeom_vectorised,M=1000,n=100,null_size=0 has an avg time of 0.744 secs.
hypergeom_vectorised,M=1000,n=500,null_size=0 has an avg time of 2.078 secs.
hypergeom_vectorised,M=5000,n=10,null_size=0 has an avg time of 1.458 secs.
hypergeom_vectorised,M=5000,n=100,null_size=0 has an avg time of 14.402 secs.
hypergeom_vectorised,M=5000,n=500,null_size=0 has an avg time of 66.143 secs.
hypergeom_vectorised,M=10000,n=10,null_size=0 has an avg time of 5.199 secs.
hypergeom_vectorised,M=10000,n=100,null_size=0 has an avg time of 52.056 secs.
timing,M=1000,n=10,null_size=5000 has an avg time of 0.142 secs.
timing,M=1000,n=100,null_size=5000 has an avg time of 0.148 secs.
timing,M=1000,n=500,null_size=5000 has an avg time of 0.185 secs.
timing,M=5000,n=10,null_size=5000 has an avg time of 0.781 secs.
timing,M=5000,n=100,null_size=5000 has an avg time of 0.794 secs.
timing,M=10000,n=10,null_size=5000 has an avg time of 1.593 secs.
timing,M=10000,n=100,null_size=5000 has an avg time of 1.606 secs.
timing,M=1000,n=10,null_size=50000 has an avg time of 1.449 secs.
timing,M=1000,n=100,null_size=50000 has an avg time of 1.528 secs.
timing,M=1000,n=500,null_size=50000 has an avg time of 1.954 secs.
timing,M=5000,n=10,null_size=50000 has an avg time of 7.943 secs.
timing,M=5000,n=100,null_size=50000 has an avg time of 8.040 secs.
timing,M=10000,n=10,null_size=50000 has an avg time of 16.092 secs.
timing,M=10000,n=100,null_size=50000 has an avg time of 16.173 secs.
""" |
My optimization of matching pushed me to dig a bit more, and I found this paper where they show very little difference between the approximated AP baseline and the expected one when total_pairs >> 0 and pos_pairs is not too close to 0 or to total_pairs (their R snippet has a bug, though). I'll elaborate further, but I needed a place to add notes for future reference. Hence this issue.