
Parallelization through OhMyThreads.jl #219

Merged: 22 commits merged into QuantumKitHub:master on Jan 8, 2025

Conversation

Gertian
Collaborator

@Gertian Gertian commented Dec 23, 2024

This PR adds some parallelization to gradient descent (GD) using the `@static` and `@spawn` macros.
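For context, a minimal sketch of the kind of task-based parallel map this refers to (illustrative only, not the exact code added in this PR; `threaded_map` is a made-up name):

```julia
using Base.Threads

# Spawn one task per element and collect the results; the @spawn-based pattern
# lets Julia's scheduler distribute the work over the available threads.
function threaded_map(f, xs)
    tasks = map(x -> Threads.@spawn(f(x)), xs)
    return fetch.(tasks)
end

threaded_map(x -> x^2, 1:8)  # returns [1, 4, 9, ..., 64]
```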


codecov bot commented Dec 23, 2024

Codecov Report

Attention: Patch coverage is 85.29412% with 15 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/algorithms/timestep/tdvp.jl | 60.00% | 6 Missing ⚠️ |
| src/algorithms/approximate/vomps.jl | 66.66% | 2 Missing ⚠️ |
| src/algorithms/groundstate/vumps.jl | 66.66% | 2 Missing ⚠️ |
| src/algorithms/statmech/vomps.jl | 85.71% | 2 Missing ⚠️ |
| src/algorithms/statmech/vumps.jl | 85.71% | 2 Missing ⚠️ |
| src/utility/defaults.jl | 80.00% | 1 Missing ⚠️ |
| Files with missing lines | Coverage Δ |
|---|---|
| src/MPSKit.jl | 100.00% <100.00%> (ø) |
| src/algorithms/changebonds/svdcut.jl | 92.85% <100.00%> (ø) |
| src/algorithms/derivatives.jl | 82.88% <ø> (+7.47%) ⬆️ |
| ...c/algorithms/excitation/quasiparticleexcitation.jl | 60.26% <100.00%> (+1.66%) ⬆️ |
| src/algorithms/grassmann.jl | 99.20% <100.00%> (+0.05%) ⬆️ |
| src/algorithms/groundstate/gradient_grassmann.jl | 73.33% <100.00%> (+1.90%) ⬆️ |
| src/environments/abstract_envs.jl | 87.75% <100.00%> (ø) |
| src/environments/multiple_envs.jl | 90.90% <100.00%> (+1.16%) ⬆️ |
| src/utility/dynamictols.jl | 72.72% <100.00%> (ø) |
| src/utility/defaults.jl | 94.11% <80.00%> (+41.94%) ⬆️ |

... and 5 more

@Gertian
Collaborator Author

Gertian commented Dec 23, 2024

I understand that this is not a proper benchmark, but I just compared the above code (executed on my desktop) with the non-parallel code (executed on a cluster) and got:

On the cluster:
julia -t 32 with BLAS.set_num_threads(2) takes about 23 minutes for one gradient descent step.
On my desktop:
julia -t 8 with BLAS.set_num_threads(2) takes about 5 minutes for one gradient descent step.

So this is a tremendous speedup! Furthermore, from monitoring htop I see that the non-parallel code spends roughly 25% of its time running on very few cores. With the above code this is reduced to roughly 5% of the time.

@lkdvos
Member

lkdvos commented Dec 24, 2024

These benchmarks don't really mean a lot like this; you are comparing different machines with completely different thread setups... A better comparison could be:

  • a single-threaded run (1 julia thread, 1 blas thread)
  • a blas-threaded run (1 julia thread, all blas threads)
  • a julia-threaded run (all julia threads, 1 blas thread)
  • a half-half run (e.g. for 16 core machine: 4 julia threads, 4 blas threads, with MKL)

Using the first as a baseline, you would be able to deduce the efficiency of the multi-threading, i.e. the performance gain per thread, and compare it with how well BLAS does.

As a sidenote, I think we discussed this before, but I would much rather migrate to OhMyThreads.jl for handling the multithreading, since that would make it very easy to toggle between different schedulers, which might be relevant for different kinds of systems.
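To illustrate what that buys us, here is a minimal sketch using OhMyThreads.jl's documented `tmap`/scheduler API (not MPSKit code; the data and mapping are made up): the same call can be switched between schedulers by swapping a keyword argument.

```julia
using OhMyThreads

xs = collect(1:100)

# Same mapping, three schedulers: serial (useful as a baseline), dynamic
# (the default, with some load balancing), and static (fixed chunk-to-thread assignment).
ys_serial  = tmap(x -> x^2, xs; scheduler = SerialScheduler())
ys_dynamic = tmap(x -> x^2, xs; scheduler = DynamicScheduler())
ys_static  = tmap(x -> x^2, xs; scheduler = StaticScheduler())
```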

@Gertian
Collaborator Author

Gertian commented Dec 24, 2024

I'll do a better benchmark in the near future.

Apart from that, I don't think you ever mentioned wanting to change to OhMyThreads.jl. I guess this makes this PR rather useless.

That being said, if this is something that's relevant now I could look into it. By schedulers, do you mean the underlying CPU scheduler, or are these some settings in OhMyThreads?

@maartenvd
Collaborator

I would also be interested in seeing a breakdown of the time spent. I would assume that almost all time is spent calculating the environments, so it might be worthwhile to think about how we could parallelize those for large unit cells.

@lkdvos
Member

lkdvos commented Dec 30, 2024

I switched out the parallelization for the OhMyThreads approach, which I think is a bit cleaner (and should be more versatile as well). Apologies that I didn't discuss this with you sooner, I thought I had mentioned this before.

It would be really nice if you could still add the benchmarks so we can get an idea of whether this helps, and the documentation still needs to be updated. Do you maybe have time for this, @Gertian? Let me know if something is not clear or if I can help.

@lkdvos changed the title from "Parallel GD for infinite MPS" to "Parallelization through OhMyThreads.jl" on Dec 30, 2024
@lkdvos
Member

lkdvos commented Dec 30, 2024

Looks like I did mess up something with the Grassmann stuff, given that the linesearch is complaining again...

@Gertian
Collaborator Author

Gertian commented Jan 5, 2025

Ok, I finally got to benchmarking the new code. The file I used is:

```julia
using LinearAlgebra
using MPSKit
using TensorKit
using BenchmarkTools
using MKL

BLAS.set_num_threads(1)

# We will make a basic 2+1D NN Ising Hamiltonian to benchmark the parallel GD. To get the
# desired large unit cell we put the Ising model on a cylinder with circumference L.
# Example for L = 4:
# 1 2 3 4
# 5 6 7 8
L = 16
ph = ComplexSpace(2)
lattice = PeriodicArray(repeat([ph], L))

NN_sites = vcat([(n, n + 1) for n in 1:L], [(n, n + L) for n in 1:L])

Z = TensorMap([1.0 0.0; 0.0 -1.0], ph, ph)
X = TensorMap([0.0 1.0; 1.0 0.0], ph, ph)
@tensor ZZ[-1 -2; -3 -4] := Z[-1 -3] * Z[-2 -4]

H = InfiniteMPOHamiltonian(lattice, i => X for i in 1:L) + InfiniteMPOHamiltonian(lattice, (i, j) => ZZ for (i, j) in NN_sites)
for i in 1:length(H)
    MPSKit.dropzeros!(H[i])
end

# make an initial state:
D = ComplexSpace(50)
initial_state = InfiniteMPS(randn, ComplexF64, repeat([ph], L), repeat([D], L))

# perform the optimization :)
MPSKit.Defaults.set_scheduler!(:serial) # disable multithreading
serial_bm = @benchmark find_groundstate(initial_state, H, GradientGrassmann(maxiter=10)) setup=(find_groundstate(initial_state, H, GradientGrassmann(maxiter=10)))
if false # the greedy scheduler does not work; set this to true for more info :)
    MPSKit.Defaults.set_scheduler!(:greedy) # multithreading with greedy load-balancing
    greedy_bm = @benchmark find_groundstate(initial_state, H, GradientGrassmann(maxiter=10)) setup=(find_groundstate(initial_state, H, GradientGrassmann(maxiter=10)))
end
MPSKit.Defaults.set_scheduler!(:dynamic) # default: multithreading with some load-balancing
dynamic_bm = @benchmark find_groundstate(initial_state, H, GradientGrassmann(maxiter=10)) setup=(find_groundstate(initial_state, H, GradientGrassmann(maxiter=10)))
```

where BLAS.set_num_threads(1) is changed for the different runs. Note that the greedy scheduler does not work: it doesn't like the usage of tmap. To see the stacktrace, just put true instead of false. Additionally, to fix the number of iterations rather than waiting for convergence, I set maxiter=10.

The results of the benchmark are:

  1. Single-threaded run, i.e. julia -t 1 and BLAS.set_num_threads(1)
    [benchmark output screenshot]

  2. BLAS-threaded run, i.e. julia -t 1 and BLAS.set_num_threads(16)
    [benchmark output screenshot]

  3. Julia-threaded run, i.e. julia -t 16 and BLAS.set_num_threads(1)
    [benchmark output screenshot]
    @lkdvos, congrats, you outperformed MKL :)

  4. Half-half run, i.e. julia -t 4 and BLAS.set_num_threads(4)
    [benchmark output screenshot]

Let me know if there are any other things that you'd like me to check.

PS: apart from the fact that the greedy scheduler doesn't seem to work, the documentation was more than sufficient to get this up and running! +1 from me 👍

PPS: this was all run on @lkdvos's latest push :)

@Gertian
Collaborator Author

Gertian commented Jan 6, 2025

From looking into the manual of OhMyThreads.jl, it seems that the failure of the greedy scheduler can be fixed by telling tmap its output type, i.e.:

[screenshot of the OhMyThreads.jl documentation]

Since this also limits allocations, it might also help with the large memory usage that was noticed before. (Although this was greatly improved in the new Julia version, it's probably still nice?)
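For reference, a small sketch of what this looks like with OhMyThreads' documented API (the data and mapping are made up, not the actual MPSKit call sites):

```julia
using OhMyThreads

xs = randn(100)

# The greedy scheduler processes elements in an arbitrary order, so tmap cannot
# infer the output container; declaring the output element type fixes this.
ys = tmap(x -> x^2, Float64, xs; scheduler = GreedyScheduler())
```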

@lkdvos
Member

lkdvos commented Jan 6, 2025

Yeah, I also noticed the :greedy scheduler not working. It's not too hard to fix, indeed either by defining the output type or by preallocating the resulting vector. Note that this does not actually change the number of allocations, since we really do need a new vector, and also that this is not really that big of a problem, since it is just a single vector for each retraction, ..., which contains just pointers to the actual tensors. Changing tmap to tmap! shouldn't really affect performance 😉 . I'll try and fix this ASAP.
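A sketch of the preallocated tmap! variant mentioned here, again with made-up data rather than the actual MPSKit call sites:

```julia
using OhMyThreads

xs  = randn(100)
out = similar(xs)   # the single extra vector per call, reused as the output

# In-place threaded map: writes f(xs[i]) into out[i], so tmap! does not have to
# allocate or infer the result container.
tmap!(x -> x^2, out, xs)
```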

I'm a bit confused by the benchmark results, in the sense that I would have expected a much larger impact of using threads. What kind of machine are you using to test this?
It's a bit strange that 16 threads only give a speedup of more or less 1.5x, although I would expect that this does increase with the bond dimension.
I'm also surprised that the serial implementations outperform the dynamic ones, and the dynamic ones somehow allocate 20% less memory. This is a bit counterintuitive, since you would expect that the multi-threaded approach requires strictly more memory...

Some minor notes about your benchmark:

  • From what I see, I think you are testing a helix instead of a cylinder (which is fine by me), just wanted to make sure you knew this. The last axial interaction should be (L, 1), not (L, L+1) in order to make this a true cylinder.
  • The setup block of @benchmark is something that is run before each sample, and is typically used to initialize variables. You don't have to manually add a run to precompile the code; @benchmark will actually do that for you (I am assuming this is why you added these setup blocks?).
  • When benchmarking, it is common practice to interpolate global variables to avoid having untyped globals that might affect the performance. This is achieved with $, see the snippet below.
  • You might want to make the benchmark slightly more robust by including more than 1 sample, especially in this case where this is still relatively cheap (see the benchmark parameters here, in particular samples and seconds, and the extra snippet after the code below).
```julia
alg = GradientGrassmann(maxiter=10)
serial_bm = @benchmark find_groundstate($(copy(initial_state)), $H, $alg) # same initial state, copy not really necessary

# alternative: generate a new initial state every sample:
serial_bm = @benchmark find_groundstate(initial_state, $H, $alg) setup=(initial_state = InfiniteMPS(randn, ComplexF64, repeat([ph], L), repeat([D], L)))
```
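For that last point, the sample count and time budget can be passed directly to @benchmark; a small illustrative addition (the numbers are arbitrary):

```julia
serial_bm = @benchmark find_groundstate($(copy(initial_state)), $H, $alg) samples=5 seconds=600
```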

@lkdvos
Member

lkdvos commented Jan 7, 2025

So it seems like I found a race condition in the infinite environments, although I have to admit I have no clue why this is happening. Somehow the locking mechanism must be failing, and the ManifoldPoint constructor seems to cause it to recompute multiple times, which gives invalid results since the output tensors are used by the other threads.
While fixing it does not seem too hard, I fail to understand why the locking mechanism doesn't prevent these issues from happening. @maartenvd, do you have any clue?

Aside from that, since TensorKitManifolds updated its compat, TensorKit v0.14 is now used, which apparently was not made compatible yet. That should be fixed in #223 .
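For context, this is roughly the double-checked locking pattern one would expect around such a recomputation; a hypothetical sketch with made-up names (CachedEnvs, compute_envs, environments!), not MPSKit's actual environment code:

```julia
compute_envs(state) = Any[sum(state)]   # stand-in for the real environment computation

mutable struct CachedEnvs
    envs::Vector{Any}
    valid::Bool
    lock::ReentrantLock
end
CachedEnvs() = CachedEnvs(Any[], false, ReentrantLock())

function environments!(c::CachedEnvs, state)
    c.valid && return c.envs              # fast path: cache is already valid
    Base.@lock c.lock begin
        if !c.valid                       # re-check after acquiring the lock
            c.envs = compute_envs(state)
            c.valid = true
        end
    end
    return c.envs
end

c = CachedEnvs()
environments!(c, rand(10))                # first call computes and caches the result
```

If the valid flag is cleared or read outside the lock in the wrong order, two tasks can both take the slow path and overwrite tensors the other is still using, which would match the symptoms described above.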

Member

@lkdvos lkdvos left a comment


If tests turn green, this should be good to go.

@lkdvos lkdvos enabled auto-merge (squash) January 7, 2025 23:03
@lkdvos lkdvos merged commit d86b5fa into QuantumKitHub:master Jan 8, 2025
28 checks passed
@Gertian
Collaborator Author

Gertian commented Jan 8, 2025

Great, many thanks for all this, @lkdvos.

For completeness I redid the benchmark (exact same code). Since this is now merged with master, I did this for the master branch :)
My output of versioninfo() is:
[screenshot of versioninfo()]
And the packages I have installed are:
[screenshot of the package status]

The results are:

Single-threaded run, i.e. julia -t 1 and BLAS.set_num_threads(1)
[benchmark output screenshot]
Note: as expected this is unchanged, apart from the fact that the greedy scheduler now works.

BLAS-threaded run, i.e. julia -t 1 and BLAS.set_num_threads(16)
[benchmark output screenshot]
Note: this now seems slightly slower, but that might just be due to numerical noise?

Julia-threaded run, i.e. julia -t 16 and BLAS.set_num_threads(1)
[benchmark output screenshot]
Note: this is substantially slower than before...

Half-half run, i.e. julia -t 4 and BLAS.set_num_threads(4)
[benchmark output screenshot]
Note: this also seems to be quite a bit slower than before.

@lkdvos
Member

lkdvos commented Jan 8, 2025

That's definitely strange, and should not be happening... Before I investigate: this is on the same computer, with no additional tasks running, right?

@Gertian
Collaborator Author

Gertian commented Jan 13, 2025

Yes, this was on the same computer doing nothing else.
