
Can we approximate the random AP probability? #84

Closed
afermg opened this issue Feb 21, 2025 · 4 comments



afermg commented Feb 21, 2025

My optimization of matching pushed me to dig a bit, and I found this paper, where they show very little difference between the approximated AP baseline and the expected one when total_pairs >> 0 and pos_pairs is not too close to 0 or to total_pairs (their R snippet has a bug, though). I'll elaborate further, but I needed a place to add notes for future reference; hence this issue.
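For context, the sampled baseline in question can be sketched as a plain Monte Carlo estimate over random rankings. This is a minimal sketch with a hypothetical name (`sampled_random_ap`), not copairs' `random_ap`:

```python
import numpy as np


def sampled_random_ap(null_size, n, M, seed=0):
    """Monte Carlo estimate of the expected AP of n positives
    placed uniformly at random among M ranked items."""
    rng = np.random.default_rng(seed)
    aps = np.empty(null_size)
    for i in range(null_size):
        # 1-based ranks of the positives in a random permutation
        ranks = np.sort(rng.choice(M, size=n, replace=False)) + 1
        # precision at each positive's rank: hits-so-far / rank
        precision_at_hits = np.arange(1, n + 1) / ranks
        aps[i] = precision_at_hits.mean()
    return aps.mean()
```

The estimate converges at the usual 1/sqrt(null_size) Monte Carlo rate, which is why the sample size comparisons below matter.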


afermg commented Feb 21, 2025

In current tests I actually found it is faster to compute the exact value for low total_pairs, especially when compared to sampling a high number of null APs:

M=100, n=10
"""
%timeit (timing(random_ap)(100000,n,M,0)).mean()
110 ms ± 755 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit random_exact_ap(M,n)
15.1 ms ± 75.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
"""
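What `random_exact_ap` computes can be written down directly: E[AP] = (1/n) Σ_{N=1}^{M} Σ_k P(K = k) · (k/N)², where K is the hypergeometric count of positives in the top N. A minimal pure-Python sketch (hypothetical name, not the benchmarked implementation):

```python
from math import comb


def expected_random_ap(M, n):
    """Exact expected AP of n positives among M randomly ranked items."""
    total = 0.0
    for N in range(1, M + 1):
        # k hits in the top N only range over the hypergeometric support
        for k in range(max(1, n - (M - N)), min(n, N) + 1):
            p = comb(n, k) * comb(M - n, N - k) / comb(M, N)
            # k/N enters squared: once as P(position N is a hit | k hits in top N),
            # once as the precision at depth N
            total += p * (k / N) ** 2
    return total / n
```

As a sanity check, with M == n every item is a positive and the expectation is exactly 1.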


afermg commented Feb 21, 2025

Indeed it is too computationally expensive, even with threading and optimisations.

#!/usr/bin/env jupyter
"""Compare the Stirling approximation to the binomial distribution.

Note that I tried the Stirling and Gosper approximations for factorials, but
that turned out to be slower. Surprisingly, `math.comb` seemed to be the
best way to get the combinatorials.
"""
from timeit import Timer

setup="""
from functools import partial
from math import comb
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from copairs.compute import random_ap
from copairs.timing import timing

def get_hypergeom_prob(M,n,N,k):
    return comb(n,k) * comb(M-n,N-k) / comb(M,N)

def random_exact_ap(M, n):
    # Every (depth N, hit count k) pair in the hypergeometric support,
    # laid out as a sliding window over 1..M
    window = sliding_window_view(np.arange(1, M + 1), M - n + 1)
    xs, ys = np.where(window)
    ks = xs + 1
    Ns = window[xs, ys]

    p_at_N = ks / Ns
    partial_fn = partial(get_hypergeom_prob, M, n)
    with ThreadPoolExecutor() as executor:
        prob = list(executor.map(partial_fn, Ns, ks))
    aps = np.asarray(prob) * p_at_N * p_at_N

    return aps.sum() / n

M={M}
n={n}
null_size={null_size}
"""

Ms = (100,1000,10000, 100000)
ns = (10,100,1000)
nsamples = (1000,10000,100000)

n_times = 3
n_repeats = 5
results = dict()
for M,n in zip(Ms,ns):
    for fn in (
        "random_exact_ap(M,n)",
        "timing(random_ap)(null_size,n,M,0)",
    ):
        for null_size in nsamples:
            key = f"{M}_{n}_{null_size}"
            times = Timer(fn, setup=setup.format(M=M,n=n,null_size=null_size)).repeat(n_times, n_repeats)
            avg = sum(times) / n_repeats
            print(f"{fn.split('(')[0]},{M=},{n=},{null_size=} has an avg time of {avg:0.3f} secs.")
            results[key] = avg
            if "exact" in fn:
                break
        
"""
random_exact_ap,M=100,n=10,null_size=1000 has an avg time of 0.034 secs.
timing,M=100,n=10,null_size=1000 has an avg time of 0.003 secs.
timing,M=100,n=10,null_size=10000 has an avg time of 0.033 secs.
timing,M=100,n=10,null_size=100000 has an avg time of 0.316 secs.
random_exact_ap,M=1000,n=100,null_size=1000 has an avg time of 13.404 secs.
timing,M=1000,n=100,null_size=1000 has an avg time of 0.029 secs.
timing,M=1000,n=100,null_size=10000 has an avg time of 0.294 secs.
timing,M=1000,n=100,null_size=100000 has an avg time of 3.007 secs.
"""

afermg closed this as completed Feb 21, 2025

afermg commented Feb 21, 2025

Relevant notebook made by Shantanu that I found while researching this.


afermg commented Feb 22, 2025

I did a little more digging and managed to get a better implementation by vectorising the inputs of scipy.stats.hypergeom.

If we compare it against sampling 5e5 null APs, it performs well until either M=5000, n=10 or M=1000, n=500; that is when sampling starts to outperform it. I think I could get some performance improvements by using the CDF and/or some window views, but I don't know how much better we can make it while still depending on scipy.stats.

#!/usr/bin/env jupyter
"""Timing the exact random_ap calculation vs the simulations.

Note that I tried the Stirling and Gosper approximations for factorials, but
that turned out to be slower. Surprisingly, `math.comb` seemed to be the
best way to get the combinatorials.
"""
from timeit import Timer

setup="""
import numpy as np
from copairs.compute import random_ap
from copairs.timing import timing
from scipy.stats import hypergeom

def hypergeom_vectorised(M, n):
    # 1-based grids of hit counts k and ranking depths N
    k, N = np.indices((n, M)) + 1
    p_at_N = k / N
    # The reversed indices exploit the symmetry
    # pmf(k; M, n, N) == pmf(n - k; M, n, M - N),
    # so result[i, j] equals pmf(i + 1, M, n, j + 1)
    result = hypergeom.pmf(
        np.arange(n)[::-1, np.newaxis], M, n, np.arange(M)[np.newaxis, ::-1]
    )
    aps = result * p_at_N * p_at_N

    return aps.sum() / n

M={M}
n={n}
null_size={null_size}
"""

from itertools import product

Ms = (1000, 5000, 10000)
ns = (10, 100,500)
nsamples = (50000, 500000)

n_times = 3
n_repeats = 5
results = dict()
for M,n in product(Ms,ns):
    for fn in (
        "hypergeom_vectorised(M,n)",
        "timing(random_ap)(null_size,n,M,0)",
    ):
        if "random_ap" not in fn:
            iters = (0,)
        else:
            iters = nsamples
        for null_size in iters:
            times = Timer(fn, setup=setup.format(M=M,n=n,null_size=null_size)).repeat(n_times, n_repeats)
            avg = sum(times) / n_repeats
            print(f"{fn.split('(')[0]},{M=},{n=},{null_size=} has an avg time of {avg:0.3f} secs.")

"""
hypergeom_vectorised,M=1000,n=10,null_size=0 has an avg time of 0.081 secs.
hypergeom_vectorised,M=1000,n=100,null_size=0 has an avg time of 0.744 secs.
hypergeom_vectorised,M=1000,n=500,null_size=0 has an avg time of 2.078 secs.
hypergeom_vectorised,M=5000,n=10,null_size=0 has an avg time of 1.458 secs.
hypergeom_vectorised,M=5000,n=100,null_size=0 has an avg time of 14.402 secs.
hypergeom_vectorised,M=5000,n=500,null_size=0 has an avg time of 66.143 secs.
hypergeom_vectorised,M=10000,n=10,null_size=0 has an avg time of 5.199 secs.
hypergeom_vectorised,M=10000,n=100,null_size=0 has an avg time of 52.056 secs.
timing,M=1000,n=10,null_size=5000 has an avg time of 0.142 secs.
timing,M=1000,n=100,null_size=5000 has an avg time of 0.148 secs.
timing,M=1000,n=500,null_size=5000 has an avg time of 0.185 secs.
timing,M=5000,n=10,null_size=5000 has an avg time of 0.781 secs.
timing,M=5000,n=100,null_size=5000 has an avg time of 0.794 secs.
timing,M=10000,n=10,null_size=5000 has an avg time of 1.593 secs.
timing,M=10000,n=100,null_size=5000 has an avg time of 1.606 secs.
timing,M=1000,n=10,null_size=50000 has an avg time of 1.449 secs.
timing,M=1000,n=100,null_size=50000 has an avg time of 1.528 secs.
timing,M=1000,n=500,null_size=50000 has an avg time of 1.954 secs.
timing,M=5000,n=10,null_size=50000 has an avg time of 7.943 secs.
timing,M=5000,n=100,null_size=50000 has an avg time of 8.040 secs.
timing,M=10000,n=10,null_size=50000 has an avg time of 16.092 secs.
timing,M=10000,n=100,null_size=50000 has an avg time of 16.173 secs.
"""
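One possible avenue for the scipy overhead (an untested assumption on my part, with a hypothetical name `expected_ap_gammaln`): evaluate the pmf in log-space with `scipy.special.gammaln`, restricted to the hypergeometric support, which may sidestep some of `hypergeom.pmf`'s per-call machinery:

```python
import numpy as np
from scipy.special import gammaln


def log_comb(a, b):
    """log(C(a, b)) via log-gamma, valid for 0 <= b <= a."""
    return gammaln(a + 1) - gammaln(b + 1) - gammaln(a - b + 1)


def expected_ap_gammaln(M, n):
    """Exact expected random AP via a vectorised log-space hypergeometric pmf."""
    k, N = np.indices((n, M)) + 1  # 1-based hit counts and depths
    valid = (k <= N) & (n - k <= M - N)  # hypergeometric support
    logp = np.full((n, M), -np.inf)
    logp[valid] = (
        log_comb(n, k[valid])
        + log_comb(M - n, (N - k)[valid])
        - log_comb(M, N[valid])
    )
    p_at_N = k / N
    return (np.exp(logp) * p_at_N**2).sum() / n
```

Accuracy is limited by log-gamma round-off rather than exact integer arithmetic, but that should be well below the Monte Carlo noise of the sampled baseline.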
