Parallelization through OhMyThreads.jl #219
Conversation
I understand that this is not a proper benchmark, but I just compared the above code (executed on my desktop) with the non-parallel code (executed on a cluster) and got:

On the cluster:

So this is a tremendous speedup! Furthermore, from monitoring htop I see that the non-parallel code spends ±25% of its time running on very few cores. With the above code this is reduced to ±5% of the time.
These benchmarks don't really mean a lot like this: you are comparing different machines with a completely different setup of threads... A better comparison could be:

- a single-threaded run (one Julia thread, one BLAS thread) as a baseline,
- a BLAS-threaded run (one Julia thread, several BLAS threads),
- a Julia-threaded run (several Julia threads, one BLAS thread).
Using the first as a baseline, you would be able to deduce the efficiency of the multi-threading, i.e. the performance gain per thread, and compare it with how well BLAS does. As a sidenote: I think we discussed this before, but I would much rather migrate to using OhMyThreads.jl.
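Concretely, the per-thread efficiency could be computed from two trials like this (a sketch; `serial_bm` and `dynamic_bm` are the trial names used in the benchmark script further down):

```julia
using BenchmarkTools  # assumes `serial_bm` and `dynamic_bm` from the script below

speedup    = median(serial_bm).time / median(dynamic_bm).time
efficiency = speedup / Threads.nthreads()  # 1.0 would be ideal linear scaling
```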
I'll do a better benchmark in the near future. Apart from that, I don't think you ever mentioned wanting to change to OhMyThreads.jl. That being said, if this is something that's relevant now, I could look into it. With schedulers you mean the underlying CPU scheduler, right? Or are these some settings in OhMyThreads.jl?
I would also be interested in seeing a breakdown of the time spent. I would assume that almost all time is spent calculating the environments, so it might be worthwhile to think about how we could parallelize those for large unit cells.
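For illustration, parallelizing over the unit cell could look roughly like this with Base threading (`compute_environment` is a hypothetical stand-in here, not the MPSKit API):

```julia
using Base.Threads

compute_environment(site) = (sleep(0.01); site^2)  # hypothetical per-site workload

L = 16
envs = Vector{Int}(undef, L)
@sync for site in 1:L
    @spawn envs[site] = compute_environment(site)  # one task per unit-cell site
end
```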
I switched out the parallelization for the OhMyThreads.jl implementation. It would be really nice if you could still add the benchmarks so we can get an idea of whether this helps, and the documentation still needs to be updated. Do you maybe have time for this, @Gertian? Let me know if something is not clear or if I can help.
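For reference, the scheduler symbols presumably map onto the OhMyThreads.jl scheduler types; a minimal sketch with placeholder data (not MPSKit code):

```julia
using OhMyThreads

xs = randn(1000)

# :dynamic (default) — work chunks become tasks the runtime may migrate between threads
y_dyn = tmap(sin, xs; scheduler = DynamicScheduler())

# :static — chunks are pinned to threads: lowest overhead, no load-balancing
y_sta = tmap(sin, xs; scheduler = StaticScheduler())

# :greedy — a fixed pool of worker tasks pulls items from a shared channel;
# shown with a reduction since the greedy scheduler does not preserve order
s = tmapreduce(sin, +, xs; scheduler = GreedyScheduler())
```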
Looks like I did mess up something with the Grassmann stuff, given that the linesearch is complaining again...
Ok, I finally got to benchmarking the new code. The file I used is:

```julia
using LinearAlgebra
using MPSKit
using TensorKit
using BenchmarkTools
using MKL

BLAS.set_num_threads(1)

# We make a basic 2+1D NN Ising Hamiltonian to benchmark the parallel GD.
# To get the desired large unit cell we put the Ising model on a cylinder with circumference L.
# Example for L = 4:
#   1 2 3 4
#   5 6 7 8
L = 16
ph = ComplexSpace(2)
lattice = PeriodicArray(repeat([ph], L))
NN_sites = vcat([(n, n + 1) for n in 1:L], [(n, n + L) for n in 1:L])

Z = TensorMap([1.0 0.0; 0.0 -1.0], ph, ph)
X = TensorMap([0.0 1.0; 1.0 0.0], ph, ph)
@tensor ZZ[-1 -2; -3 -4] := Z[-1 -3] * Z[-2 -4]

H = InfiniteMPOHamiltonian(lattice, i => X for i in 1:L) +
    InfiniteMPOHamiltonian(lattice, (i, j) => ZZ for (i, j) in NN_sites)
for i in 1:length(H)
    MPSKit.dropzeros!(H[i])
end

# Make an initial state:
D = ComplexSpace(50)
initial_state = InfiniteMPS(randn, ComplexF64, repeat([ph], L), repeat([D], L))

# Perform the optimization :)
MPSKit.Defaults.set_scheduler!(:serial) # disable multithreading
serial_bm = @benchmark find_groundstate(initial_state, H, GradientGrassmann(maxiter=10)) setup=(find_groundstate(initial_state, H, GradientGrassmann(maxiter=10)))

if false # the greedy scheduler does not work; set to true for more info :)
    MPSKit.Defaults.set_scheduler!(:greedy) # multithreading with greedy load-balancing
    greedy_bm = @benchmark find_groundstate(initial_state, H, GradientGrassmann(maxiter=10)) setup=(find_groundstate(initial_state, H, GradientGrassmann(maxiter=10)))
end

MPSKit.Defaults.set_scheduler!(:dynamic) # default: multithreading with some load-balancing
dynamic_bm = @benchmark find_groundstate(initial_state, H, GradientGrassmann(maxiter=10)) setup=(find_groundstate(initial_state, H, GradientGrassmann(maxiter=10)))
```
where the `setup` phase serves as a warm-up run. The results of the benchmark are:
Let me know if there are any other things that you'd like me to check.

PS: Apart from the fact that the greedy scheduler doesn't seem to work, the documentation was very sufficient to get this up and running! +1 from me 👍

PS PS: this is all run on @lkdvos's latest push :)
From looking into the manual of OhMyThreads, the greedy scheduler uses a fixed number of long-lived tasks that take work items from a shared channel. Since this also limits allocations, it might also help with the large memory usage that was noticed before. (Although this was greatly improved in the new …)
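A sketch of what that would look like (assuming the OhMyThreads `GreedyScheduler` API, where `ntasks` caps the worker-task pool):

```julia
using OhMyThreads

# A fixed pool of `ntasks` workers pulls items from a shared channel,
# so task allocations stay bounded even for many small work items.
s = tmapreduce(x -> x^2, +, 1:10^6;
               scheduler = GreedyScheduler(ntasks = Threads.nthreads()))
```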
Yeah, I also noticed the greedy scheduler failing.

I'm a bit confused by the benchmark results, in the sense that I would have expected a much larger impact of using threads. What kind of machine are you using to test this?

Some minor notes about your benchmark:
```julia
alg = GradientGrassmann(maxiter=10)
serial_bm = @benchmark find_groundstate($(copy(initial_state)), $H, $alg) # same initial state, copy not really necessary

# alternative: generate a new initial state for every sample:
serial_bm = @benchmark find_groundstate(initial_state, $H, $alg) setup=(initial_state = InfiniteMPS(randn, ComplexF64, repeat([ph], L), repeat([D], L)))
```
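(The `$` interpolation splices the values in as constants, so BenchmarkTools does not measure global-variable access on top of the actual work.)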
So it seems like I found a race condition in the infinite environments, although I have to admit I have no clue why this is happening. Somehow the locking mechanism must be failing.

Aside from that, since TensorKitManifolds updated its compat, TensorKit v0.14 is now used, which apparently was not made compatible yet. That should be fixed in #223.
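For anyone following along, the pattern in question is a lock-guarded, lazily-filled cache, roughly like the sketch below (an illustration, not MPSKit's actual environment code):

```julia
# Illustrative lock-guarded lazy cache; a race appears whenever a slot
# can be read or written outside the lock.
struct LazyEnvCache{T}
    lock::ReentrantLock
    slots::Vector{Union{Nothing,T}}
end
LazyEnvCache{T}(n::Integer) where {T} =
    LazyEnvCache{T}(ReentrantLock(), fill!(Vector{Union{Nothing,T}}(undef, n), nothing))

function getenv!(compute, cache::LazyEnvCache, i)
    lock(cache.lock) do
        if cache.slots[i] === nothing
            cache.slots[i] = compute(i)  # computed at most once, under the lock
        end
        return cache.slots[i]
    end
end
```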
If tests turn green, this should be good to go.
Great, many thanks for all this @lkdvos. For completeness I redid the benchmark (exact same code). Since this is now merged with master, I did this for the master branch :) The results are:

- Single-threaded run, i.e. `julia -t 1` and `BLAS.set_num_threads(1)`
- BLAS-threaded run, i.e. `julia -t 1` and `BLAS.set_num_threads(16)`
- Julia-threaded run, i.e. `julia -t 16` and `BLAS.set_num_threads(1)`
- Half-half run, i.e. `julia -t 4` and `BLAS.set_num_threads(4)`
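(For anyone reproducing these configurations, the in-session settings can be verified with standard calls:)

```julia
using LinearAlgebra

Threads.nthreads()      # Julia threads, set via `julia -t N`
BLAS.get_num_threads()  # BLAS threads, set via BLAS.set_num_threads(N)
```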
That's definitely strange and should not be happening... Before I investigate: this is on the same computer, with no additional tasks running, right?
Yes, this was on the same computer doing nothing else.
This PR adds some parallelization to the gradient descent (GD) routine using the `@static` and `@spawn` macros.