Multithreading optimization analysis

Multithreading and other optimizations in Core

*Edit, after readings Waffle's much better explanation on the topic I have learned that my explanation is inaccurate, and this is in fact not multithreading even though it is multiple "threads". I recommend you read Waffles reply first, and disregard my explanations for most of this, but the numbers are accurate.

In this I am going to be covering Multithreading and its Pros and Cons for code in Core. I would also appreciate it if others add onto this with their own observations and methods that they have found.

It is best to note that these optimizations are intended for code with loops or nested loops containing lots of logic.

First, I will display the data and then the code for each of the things. This was benchmarked on a Ryzen 5 3600 with no background apps open using the os.clock() function for maximal accuracy.

Benchmark one (100x100)

Time (Seconds) for 100x100 benchmark (5)

Benchmark two (200x200)

Time (Seconds) for 200x200 benchmark (3)

Important Note
Even though the Multithreading trials typically completed faster, it is important to note that the fps during the runtime of these was much lower than that of not multithreading.

Code for different optimizations

No optimization:

        for x = 0, SizeX do
            for y = 0, SizeY do
                genTile(x, y)
            end 
            Task.Wait() //To avoid instruction limit
        end

Multithreading:
(xMultithreadingCount and yMultithreadingCount are either 2X2, 4X4, 10X10, or 25X25 depending on benchmark)

        for xThread = 0, xMultithreadingCount - 1 do
            Task.Spawn(function() 
                for x = xThread * SizeX/xMultithreadingCount,  (xThread + 1) * SizeX/xMultithreadingCount do  
                    Task.Spawn(function() 
                        for yThread = 0, yMultithreadingCount - 1 do
                            Task.Spawn(function()
                                for y = yThread * SizeY/yMultithreadingCount,  (yThread + 1) * SizeY/yMultithreadingCount do  
                                    genTile(x,y)
                                end
                            end)
                        end
                    end)
                end
            end)
        end

Optimal Multithreading

        for x = 0, SizeX do  
            Task.Spawn(function() 
                for y = 0,  SizeY do  
                    genTile(x,y)
                end
            end)
        end

Some background info on what Task.Spawn() does and what Multithreading is.

Edit* I have removed my explanation, as Waffle has posted a much better and more accurate one in the replies.

What is going on here?

I honestly don't really know.
This is why I am asking for others to give their thoughts on why this might be happening, but will give my best theory below.

My best theory

*Please read Waffles reply in the Replies as his reply gives a more accurate explanation.

Why is Single Threaded slow?
As far as I know the greatest factor that is slowing down the single threaded code is the Instruction Limit which forces us to have the Task.Wait() which adds great delay.

Why is Multi Threaded slower with more threads?
My best guess for this is that after the point where you are no longer suffering from instruction limit, having more threads doesn't make it any faster, but actually makes it slower from the time it takes to Spawn them and the extra processing power of having more threads.

Why is the "Optimal Multithreading" so fast?
The Optimal Multithreading is fast in speed, but low in fps. Sure its a lot faster than no optimization, but during this time FPS is below 1. The reason it is so much faster is likely because of no Task.Wait()s because it won't go over instruction limit.

Conclusion

So, the conclusion of my observations is that Multithreading is faster* if used correctly. By faster I mean in terms of time taken, however, it does drop FPS by a huge amount. The best place to use it would most likely be in loading screens or such where FPS doesn't really matter, but time does. I still don't have a good explanation for why it is faster other than instruction limit, and spreading the workload out across multiple processes.

Full testing data:

100x100

Trial None 2X2 Multithreading 4X4 Multithreading 10X10 Multithreading 25X25 Multithreading Optimal Multithreading
Trial 1 3.787 2.846 3.102 3.597 5.277 2.789
Trial 2 3.802 3.014 3.246 3.682 5.283 2.622
Trial 3 3.823 2.922 3.263 3.747 5.269 2.745
Average 3.804 2.927333333 3.203666667 3.675333333 5.276333333 2.718666667

200X200

Trial None 2X2 Multithreading 4X4 Multithreading 10X10 Multithreading 25X25 Multithreading Optimal Multithreading
Trial 1 18.247 12.795 13.523 14.348 17.846 11.844
Trial 2 17.959 13.063 13.06 14.345 18.257 11.535
Trial 3 18.386 12.505 13.363 14.381 17.995 11.587
Average 18.19733333 12.78766667 13.31533333 14.358 18.03266667 11.65533333

If you have a better explanation, or more methods of optimization please leave a reply! I know my explanation is probably a bit off and most likely a bit hard to follow as I don't really have a very good idea of why this is happening the way it is.

Thanks for reading, I hope this helped you out, and be sure to check the replies! Have an awesome day and may your code be optimized!

1 Like

Maybe just few quick thoughts.

The amount of time for the task itself versus the overhead of dividing up the work is important. For fast operations, the time saved by having the threads run parallel with each other could be overshadowed by the time it takes to distribute the work.

The number of cores in the CPU should be taken into consideration, since that's the max number of parallel operations that can run at a time. Having more threads than that could just mean that the CPU time is now split up between multiple tasks, and you have the extra overhead from dividing up the work.

I am curious about the case of say 5 scripts that are each assigned some tiles to generate, that are waiting for for a broadcast. You would keep track of the time form when the broadcast is sent out, to when the scripts are done.

1 Like

Yeah. The overhead of the amount of time that it takes to divide up the work is important to take into consideration as well. That is likely the main reason that the trials with more threads ended up being slower.

What's going on here isn't technically multithreading; Lua is single-threaded and runs in a single thread in Core. There are two things that take time here, the actual code, and pauses caused by Task.Wait(), which does nothing for one frame, and Task.Spawn, which doesn't call the function until the next frame. By creating more tasks you're ensuring more of the code is running in a single frame and fewer delays where no code is running are inserted in-between. The fastest case should be that everything is completed in a single frame, where the game is frozen until the code finishes running. The instruction limit prevents this, but the instruction limit is per-task, so running multiple tasks bypasses the instruction limit.

1 Like

I see. This makes more sense. Thank you for the explanation! I always assumed that a Lua thread would be a completely separate thread.

Is each script its own thread, and is the order deterministic?
Does this mean that everything is pretty much synchronous, without having to worry as much about certain types of race conditions?

Lua uses coroutines, everything is in a single thread so it is all synchronous, which does avoid certain race conditions. Context switching happens so that when one coroutine yields, another resumes. "Tasks" in core are separate coroutines. Individual scripts and event listeners all run in separate tasks.

2 Likes