Can we approximate the random AP probability? #84
In current tests I actually found that calculating the exact value is faster for low total_pairs, especially when compared to sampling a large number of permutations. For M=100, n=10:
"""
%timeit (timing(random_ap)(100000,n,M,0)).mean()
110 ms ± 755 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit random_exact_ap(M,n)
15.1 ms ± 75.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
""" |
Indeed it is too computationally expensive, even with threading and optimisations.

#!/usr/bin/env jupyter
"""Time the exact random AP calculation against sampling.

Note that I tried the Stirling and Gosper approximations for factorials, but that
turned out to be slower. Surprisingly, `math.comb` seemed to be the
best way to get the combinatorials.
"""
from timeit import Timer

setup = """
from functools import partial
from math import comb
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

from copairs.compute import random_ap
from copairs.timing import timing


def get_hypergeom_prob(M, n, N, k):
    # Probability of exactly k positives in the top N, for n positives among M items.
    return comb(n, k) * comb(M - n, N - k) / comb(M, N)


def random_exact_ap(M, n):
    # Enumerate every valid (N, k) pair: for each k in 1..n, N runs over k..k+M-n.
    window = sliding_window_view(np.arange(1, M + 1), M - n + 1)
    xs, ys = np.where(window)
    ks = xs + 1
    Ns = window[xs, ys]
    p_at_N = ks / Ns
    partial_fn = partial(get_hypergeom_prob, M, n)
    with ThreadPoolExecutor() as executor:
        prob = list(executor.map(partial_fn, Ns, ks))
    aps = prob * p_at_N * p_at_N
    return aps.sum() / n


M = {M}
n = {n}
null_size = {null_size}
"""
Ms = (100, 1000, 10000, 100000)
ns = (10, 100, 1000)
nsamples = (1000, 10000, 100000)
n_times = 3
n_repeats = 5

results = dict()
for M, n in zip(Ms, ns):
    for fn in (
        "random_exact_ap(M,n)",
        "timing(random_ap)(null_size,n,M,0)",
    ):
        for null_size in nsamples:
            key = f"{M}_{n}_{null_size}"
            times = Timer(fn, setup=setup.format(M=M, n=n, null_size=null_size)).repeat(
                n_times, n_repeats
            )
            avg = sum(times) / n_repeats
            print(f"{fn.split('(')[0]},{M=},{n=},{null_size=} has an avg time of {avg:0.3f} secs.")
            results[key] = avg
            # The exact calculation does not depend on null_size, so run it only once.
            if "exact" in fn:
                break
"""
random_exact_ap,M=100,n=10,null_size=1000 has an avg time of 0.034 secs.
timing,M=100,n=10,null_size=1000 has an avg time of 0.003 secs.
timing,M=100,n=10,null_size=10000 has an avg time of 0.033 secs.
timing,M=100,n=10,null_size=100000 has an avg time of 0.316 secs.
random_exact_ap,M=1000,n=100,null_size=1000 has an avg time of 13.404 secs.
timing,M=1000,n=100,null_size=1000 has an avg time of 0.029 secs.
timing,M=1000,n=100,null_size=10000 has an avg time of 0.294 secs.
timing,M=1000,n=100,null_size=100000 has an avg time of 3.007 secs.
""" |
Relevant notebook made by Shantanu that I found when researching this.
I did a little bit more digging and managed to get a better implementation by vectorising the inputs of scipy.stats.hypergeom. Comparing it against the sampling of random_ap:

#!/usr/bin/env jupyter
"""Time the exact random_ap calculation vs the simulations.

Note that I tried the Stirling and Gosper approximations for factorials, but that
turned out to be slower. Surprisingly, `math.comb` seemed to be the
best way to get the combinatorials.
"""
from timeit import Timer

setup = """
import numpy as np
from scipy.stats import hypergeom

from copairs.compute import random_ap
from copairs.timing import timing


def hypergeom_vectorised(M, n):
    # Grids of positive counts k (rows) and rank positions N (columns), both 1-based.
    k, N = np.indices((n, M)) + 1
    p_at_N = k / N
    # pmf evaluated over the whole grid; the reversed indices yield the same values
    # as pmf(k, M, n, N), via the symmetry P(k in top N) = P(n-k in the other M-N).
    result = hypergeom.pmf(
        np.arange(n)[::-1, np.newaxis], M, n, np.arange(M)[np.newaxis, ::-1]
    )
    norm = result * p_at_N * p_at_N
    added = norm.sum()
    divided = added / n
    return divided


M = {M}
n = {n}
null_size = {null_size}
"""
from itertools import product

Ms = (1000, 5000, 10000)
ns = (10, 100, 500)
nsamples = (50000, 500000)
n_times = 3
n_repeats = 5

results = dict()
for M, n in product(Ms, ns):
    for fn in (
        "hypergeom_vectorised(M,n)",
        "timing(random_ap)(null_size,n,M,0)",
    ):
        # The exact calculation ignores null_size, so only run it once.
        if "random_ap" not in fn:
            iters = (0,)
        else:
            iters = nsamples
        for null_size in iters:
            times = Timer(fn, setup=setup.format(M=M, n=n, null_size=null_size)).repeat(
                n_times, n_repeats
            )
            avg = sum(times) / n_repeats
            print(f"{fn.split('(')[0]},{M=},{n=},{null_size=} has an avg time of {avg:0.3f} secs.")
"""
hypergeom_vectorised,M=1000,n=10,null_size=0 has an avg time of 0.081 secs.
hypergeom_vectorised,M=1000,n=100,null_size=0 has an avg time of 0.744 secs.
hypergeom_vectorised,M=1000,n=500,null_size=0 has an avg time of 2.078 secs.
hypergeom_vectorised,M=5000,n=10,null_size=0 has an avg time of 1.458 secs.
hypergeom_vectorised,M=5000,n=100,null_size=0 has an avg time of 14.402 secs.
hypergeom_vectorised,M=5000,n=500,null_size=0 has an avg time of 66.143 secs.
hypergeom_vectorised,M=10000,n=10,null_size=0 has an avg time of 5.199 secs.
hypergeom_vectorised,M=10000,n=100,null_size=0 has an avg time of 52.056 secs.
timing,M=1000,n=10,null_size=5000 has an avg time of 0.142 secs.
timing,M=1000,n=100,null_size=5000 has an avg time of 0.148 secs.
timing,M=1000,n=500,null_size=5000 has an avg time of 0.185 secs.
timing,M=5000,n=10,null_size=5000 has an avg time of 0.781 secs.
timing,M=5000,n=100,null_size=5000 has an avg time of 0.794 secs.
timing,M=10000,n=10,null_size=5000 has an avg time of 1.593 secs.
timing,M=10000,n=100,null_size=5000 has an avg time of 1.606 secs.
timing,M=1000,n=10,null_size=50000 has an avg time of 1.449 secs.
timing,M=1000,n=100,null_size=50000 has an avg time of 1.528 secs.
timing,M=1000,n=500,null_size=50000 has an avg time of 1.954 secs.
timing,M=5000,n=10,null_size=50000 has an avg time of 7.943 secs.
timing,M=5000,n=100,null_size=50000 has an avg time of 8.040 secs.
timing,M=10000,n=10,null_size=50000 has an avg time of 16.092 secs.
timing,M=10000,n=100,null_size=50000 has an avg time of 16.173 secs.
""" |
My optimization of matching pushed me to dig a bit more, and I found this paper where they show very little difference between the approximated AP baseline and the expected one when total_pairs >> 0 and pos_pairs is not too close to 0 or to total_pairs (their R snippet has a bug, though). I'll elaborate further, but I needed a place to add notes for future reference. Hence this issue.