Probably the lesson I'd take away from this example is that libc specifically often has surprising shared state, and you have to be really careful about using it from multiple contexts. And of course, you should avoid hammering a mutex from multiple threads -- but you already know that.
This reminds me of the fact that Go 1.20 added a clever optimization: If you call `rand.Seed` to seed the global RNG, it's assumed that you want it to produce a single deterministic sequence across all threads/goroutines, which requires locking. But if you never explicitly seed it, you get a different implementation that uses thread-local RNG state without locking.
(In practice, there is no way for a program to distinguish this from a single RNG that was initialized with a truly random seed that you can't recover, so it doesn't break backwards compatibility.)
Unfortunately, the C standard prescribes that failing to call `srand` must be equivalent to calling `srand(1)`, so I don't think glibc can get away with this trick.
Mentioned here: https://go.dev/blog/randv2
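For the C side, a minimal illustration of that srand(1) requirement (the printed values are implementation-defined, but they have to match):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* C11 7.22.2.2: if rand() is called before any call to srand(), the
       sequence is the same as if srand(1) had been called first. */
    int first = rand();   /* no srand() yet: implicit seed of 1 */
    srand(1);
    int again = rand();   /* explicit srand(1): restarts the same sequence */
    printf("%d %d -> %s\n", first, again, first == again ? "equal" : "different");
    return 0;
}
```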
Personally I think it's kind of silly that we've normalized the idea of PRNGs-as-global-state. I'm not asking everyone to go Full Monad, but please just consider giving each Thing that needs access to randomness its own independent PRNG.
PRNGs are one of those things that need a single, consistent state if you want decent statistics out of them. You can't have multiple threads racing on the same PRNG state and fetching the same value - it will destroy the statistical quality of the output. Thread-local PRNGs with separate seeds are the way to go if you need parallelism, and that was the author's solution, but I can easily see not knowing that and running into this.
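For illustration, here's a minimal sketch of that thread-local pattern in C using POSIX rand_r (the thread count, seeds, and names are made up for the example, not taken from the article):

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum { N_THREADS = 4, N_DRAWS = 1000000 };

/* Each thread owns its PRNG state outright, so the hot path never touches
   a shared lock or a shared cache line. */
static void *worker(void *arg) {
    unsigned int seed = (unsigned int)(uintptr_t)arg;   /* per-thread seed */
    unsigned long sum = 0;
    for (int i = 0; i < N_DRAWS; i++)
        sum += (unsigned long)rand_r(&seed);            /* only touches `seed` */
    printf("seed %u: sum = %lu\n", (unsigned int)(uintptr_t)arg, sum);
    return NULL;
}

int main(void) {
    pthread_t tid[N_THREADS];
    for (int t = 0; t < N_THREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)(uintptr_t)(t + 1));
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}
```

If you also need the per-thread streams to be statistically independent (not just free of data races), seed each one from a real entropy source rather than from small consecutive integers like this sketch does.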
I feel like it's a rite of passage for every engineer to write a program with the assumption of "more threads = faster", only to find their program going considerably slower.
Many, many, many years ago, my employer needed me to write some code to read 8 COM ports in OS/2. Then based on that, do some operations.
So, me, being brand new to OS/2, immediately wanted to try out the new multi-threading features. Without knowing anything about locks and spins or deadlocks or shared state, I plunged into it just using the OS/2 documentation and a programming book I bought. Keep in mind, this was a single CPU machine, probably an Intel 486dx or something like that.
I spent the next couple of weeks debugging all sorts of things that should not have been happening. There were crashes, lockups, slowdowns, missing variables, etc. that I couldn't resolve.
After a while, I gave up and did what the OP did: just put everything in a loop and it solved everything.
I don't know what we can learn from analyzing a poorly written program aside from what not to do.
If the program were properly written, there wouldn't be any issues with multithreading.
The title of the article gives the false impression that multithreading slows programs down in the general case. In fact, poorly written programs are slow, and that is true for both multithreaded and single-threaded programs.
If your program runs slow but you don't need it to run fast, don't bother.
If your program runs slow and you need it to run fast, go see what optimizations can be done.
I've had trouble with futex congestion. Worst case was under Wine. Wine has DLL files which emulate various Microsoft libc-level functions. They have their own memory allocator. It has three nested locks, some of them have spinlocks, and at least one is locked while a "realloc" call is in progress and a buffer is being recopied.
Now try a compute-bound multithreaded program with more threads than CPUs that does Rust vector "push" operations. A push can force the vector to reallocate, the allocator's locks get taken, other threads hit those locks and go into spinlock mode, a thread inside a spinlock gets context-switched out and loses control, and performance drops by two orders of magnitude. Most CPU time is going into spinlocks.
I'm not convinced that user-space spinlocks are a good idea. If you're never compute-bound, they can work, but if you run out of CPU time and a context switch occurs inside a spinlock, it's all downhill from there.
Windows itself doesn't do this; it has a different locking strategy.
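To make that failure mode concrete, here's a deliberately naive user-space spinlock sketch in C11 atomics (a generic illustration, not Wine's allocator or Windows' actual locking code). With more runnable threads than CPUs, every waiter burns its timeslice whenever the holder gets preempted:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

/* Naive test-and-set spinlock. */
static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long counter = 0;

static void spin_lock(void) {
    int spins = 0;
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire)) {
        /* If the holder was preempted, we burn CPU waiting on a thread that
           isn't even running. Yielding after a bounded spin count is the
           usual mitigation, but each yield is itself a context switch. */
        if (++spins > 1000) { sched_yield(); spins = 0; }
    }
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        counter++;               /* trivially short critical section */
        spin_unlock();
    }
    return NULL;
}

int main(void) {
    /* Oversubscribe: more threads than most machines have CPUs. */
    enum { N = 32 };
    pthread_t t[N];
    for (int i = 0; i < N; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < N; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}
```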
Discussed at the time:
Make your program slower with threads - https://news.ycombinator.com/item?id=8711162 - Dec 2014 (47 comments)
This is the exact same problem I encountered back when I was taking the Parallel Computing course, except I was an absolute novice and had no idea about debugging or profiling. I remember it took me several days of speculating and searching for an answer on Google, until I finally found an explanation on Stack Overflow, which also suggested using rand_r. It was one of the first times I was introduced to solving and debugging a difficult programming problem, and it still sticks with me to this day.
Good article; it also has references to the glibc source. I've never dived deep into the random() function before. This is from the man page:
"The random() function should not be used in multithreaded programs where reproducible behavior is required. Use random_r(3) for that purpose."
Perhaps the notes should be updated to state that the performance of multi-threaded programs is also affected due to the shared global resource?
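For anyone curious, here's a minimal sketch of the reentrant interface the man page points to (glibc-specific; the seed and buffer size here are arbitrary). All generator state lives in caller-owned buffers, so each thread can keep its own with no locking at all:

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    struct random_data rd = {0};  /* glibc requires state == NULL before initstate_r() */
    char statebuf[64];            /* 8 bytes minimum; larger buffers select better generators */
    int32_t value = 0;

    if (initstate_r(12345u, statebuf, sizeof statebuf, &rd) != 0)
        return 1;
    random_r(&rd, &value);        /* draws from this state only, no global lock */
    printf("%d\n", (int)value);
    return 0;
}
```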
This needs 2014 in the title.
Same thing happened to me. I used random(), then deployed to production. I could see the CPU on Grafana going nuts. That was one rollback I never forgot.
If you used Zig you wouldn't have this problem: just grab your entropy from std.crypto.random and you have yourself a thread-local RNG, properly seeded, with fork safety.
A sort isn’t necessary. Duplicates can be counted in linear time by using additional memory (e.g. with a hashmap)
However, the point of the article still stands.
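For concreteness, a minimal sketch of that linear-time approach in C (the open-addressing hash set and the sample values are made up for illustration; the article's program would feed it random() outputs instead):

```c
#include <stdio.h>
#include <stdlib.h>

/* Count how many elements of a[0..n) have already appeared earlier in the
   array, in expected O(n) time, using a simple open-addressing hash set. */
static size_t count_duplicates(const long *a, size_t n) {
    size_t cap = 1;
    while (cap < 2 * n) cap <<= 1;              /* power of two, load factor < 0.5 */
    long *slots = malloc(cap * sizeof *slots);
    char *used  = calloc(cap, 1);
    size_t dups = 0;

    for (size_t i = 0; i < n; i++) {
        size_t h = (size_t)((unsigned long long)a[i] * 0x9E3779B97F4A7C15ULL) & (cap - 1);
        while (used[h] && slots[h] != a[i])
            h = (h + 1) & (cap - 1);            /* linear probing */
        if (used[h]) {
            dups++;                             /* value seen before */
        } else {
            used[h] = 1;
            slots[h] = a[i];
        }
    }
    free(slots);
    free(used);
    return dups;
}

int main(void) {
    long a[] = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};
    printf("%zu duplicates\n", count_duplicates(a, sizeof a / sizeof *a));
    return 0;
}
```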
"There's no semantic reason my program isn't embarrassingly parallel but oh WTF are my libraries/frameworks doing? is it all a lie?" is apparently the pretty universal first experience trying to speed things up with threads. Hopefully we're working towards a future where single threaded legacies don't routinely stab the unwary in the back.
I remember in the early 2000s doubling the performance of a CPU-bound C++ program by using two threads, and my boss being shocked that I was able to do it. He had written it off as a naive idea. I was too young and inexperienced to understand why he was surprised; to me it just seemed like he didn't understand what I was doing. In retrospect, now I feel like maybe he had the right default expectation, and I got lucky that it worked out so easily.
Immediately as I started reading this, my mind said, "random() has global shared state so of course more threads will make it slower". I say this here not to attempt to demonstrate my oh so special advanced knowledge, but just to marvel how bad our abstractions are sometimes. Maybe "bad" is the wrong word. Leaky, inadequate, inscrutable.
"Give me a random number" feels like it shouldn't require all this overhead and hidden complexity under the hood. But unless you've read a lot about implementation details, or have been bitten by that exact problem before, how would you know? You wouldn't, of course.