A lot of negativity in these threads. I say ~cudas~ kudos to the author for getting this far! The only similar project I'm aware of is Futhark, and that has Haskell-y syntax - great for some people, but pretty arcane and hard to work with for the general class of C/C++/Python/JS/Java/etc. devs. My biggest complaint is that, unlike Futhark, this only targets CUDA or multi-core CPU, whereas Futhark can target OpenCL, CUDA, ISPC, HIP, single-core CPU, or multi-core CPU. The performance problems others are pointing out I'm certain can be tackled.
OP comes around with some of the coolest things posted on HN recently, and all he gets is extensive criticism, when it is clear that this is an early version :/
I just wanted to comment on how good the homepage is - it's immediately clear what you do. Most people working with "combinators" would feel a need to use lots of scary lingo, but OP actually shows the simple idea behind the tool (this is the opposite take of most academics, who instead show every last detail and never tell you what's going on). I really appreciate it - we need more of this.
Look, I understand the value proposition and how cool it is from a theoretical standpoint, but I honestly don't think this will ever become relevant.
Here are some notes from my first impressions and after skimming through the paper. And yes, I am aware that this is very very early software.
1. Bend looks like an extremely limited DSL. No FFI. No way of interacting with raw buffers. Weird 24-bit floating-point format.
2. There's a reason why interaction combinators (ICs) are not relevant: performance is, and will always be, terrible. There is no other way to put it; graph traversal simply doesn't map well onto hardware.
3. The premise of optimal reduction is valid. However, you still need to write the kernels in a way that can be parallelized (i.e. no data dependencies, use of recursion).
4. There are no serious examples that directly compare Bend/HVM code with its equivalent OMP/CUDA program. How am I supposed to evaluate the reduction in implementation complexity, and what should I expect in performance? So many claims, so few actual comparisons (see the sketch after this list for the kind of baseline I mean).
5. In the real world of high-performance parallel computing, tree-like structures are non-existent. Arrays are king. And that's because of the physical nature of how memory works at the hardware level. And do you know what works best on mutable contiguous memory buffers? Loops. We'll see when HVM implements those.
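For concreteness, here is the sort of baseline I'd want the recursive-sum demo measured against: a flat loop over a contiguous range with an OpenMP reduction. This is just an illustrative sketch I'm adding, not tuned code; the 2^30 bound mirrors the example used elsewhere in this thread, and I'm assuming a g++ -O3 -fopenmp build.

    #include <cstdio>

    // Sum 0 .. 2^30 - 1 with a plain loop and an OpenMP reduction.
    // Assumed build: g++ -O3 -fopenmp sum_omp.cpp
    int main() {
        long long sum = 0;
        #pragma omp parallel for reduction(+ : sum)
        for (long long i = 0; i < (1LL << 30); i++) {
            sum += i;
        }
        std::printf("%lld\n", sum); // 576460751766552576
        return 0;
    }

That, plus a CUDA equivalent, is the apples-to-apples comparison I'd want to see next to the Bend numbers.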
In the end, what we currently have is a half-baked language that is (almost) fully isolated from external data, extremely slow, and a massive abstraction over the underlying hardware (unutilised features: multilevel caches, tensor cores, SIMD, atomics).
I apologize if this comes across as harsh; I still find the technical implementation and the theoretical background very interesting. I'm simply not (yet) convinced of its usefulness in the real world.
Ten years ago, I took a course on parallel algorithms (15-210 at CMU). It pitched parallelism as the future of computing as Moore's law would hit inevitable limits. I was sold and I was excited to experiment with it. Unfortunately, there weren't many options for general parallel programming. Even the language we used for class (SML) wasn't parallel (there was a section at the end about using extensions and CUDA but it was limited from what I recall).
Since then, I was able to make some experiments with multithreading (thanks Rust) and getting very creative with shaders (thanks Shadertoy). But a general parallel language on the GPU? I'm super excited to play with this!
Very cool idea - but unless I'm missing something, this seems very slow.
I just wrote a simple loop in C++ to sum up 0 to 2^30. With a single thread without any optimizations it runs in 1.7s on my laptop -- matching Bend's performance on an RTX 4090! With -O3 it vectorizes the loop to run in less than 80ms.
    #include <iostream>

    int main() {
        int sum = 0;
        for (int i = 0; i < 1024 * 1024 * 1024; i++) {
            sum += i;
        }
        std::cout << sum << "\n";
        return 0;
    }
I want to congratulate the author on this, it's super cool. Making correct automatic parallelization is nothing to sneeze at, and something you should absolutely be proud of.
I'm excited to see how this project progresses.
Why so much negativity? The angry crowd sounds more like bots trying to test OP's intelligence by exploiting imperfections in the README while trying to change the context and intent of the post. It's so ignorant and brutal. They spent hours arguing without taking two minutes to properly read the README. OP is a one-man show and now they all want to piss on OP. Keep going OP!
Fala Taelin, nice work! Does HVM2 compile interaction nets to e.g. spirv, or is this an interpreter (like the original HVM) that happens to run on the GPU?
I ask because a while back I was messing around with compiling interaction nets to C after reducing as much of the program as possible (without reducing the inputs), as a form of whole program optimization. Wouldn't be too much harder to target a shader language.
Edit: Oh I see...
> This repository provides a low-level IR language for specifying the HVM2 nets, and a compiler from that language to C and CUDA HVM
Will have to look at the code then!
https://github.com/HigherOrderCO/HVM
Edit: Wait, nvm, it looks like the HVM2 CUDA runtime is an interpreter that traverses an in-memory graph and applies reductions.
https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2...
I was talking about traversing an interaction net to recover a lambda-calculus-like term, which can be lowered to C a la lisp in small pieces with minimal runtime overhead.
Honestly, the motivation is that you are unlikely to outperform a hand-written GPU kernel for, say, ML workloads using Bend. In theory, HVM could act as glue, stitching together and parallelizing the dispatch order of compute kernels, but you need a good FFI to do that. Interaction nets are hard to translate across FFI boundaries. But if you compile nets to C, keeping track of FFI compute-kernel nodes embedded in the interaction net, you can recover a sensible FFI with no translation overhead.
The other option is implementing HVM in hardware, which I've been messing around with on a spare FPGA.
This is incredible. This is the kind of work we need to crack open the under utilized GPUs out there. I know LLMs are all the rage, but there's more gold in them hills.
Bend looks like a nice language.
> That's a 111x speedup by doing nothing. No thread spawning, no explicit management of locks, mutexes. We just asked bend to run our program on RTX, and it did. Simple as that. Note that, for now, Bend only supports 24-bit machine ints (u24), thus, results are always mod 2^24.
Ahh, not even 32-bit? Hmm, that seems pretty arbitrary to someone not accustomed to GPUs, and limiting if you want to solve problems requiring 64 bits (a gravitational simulation of the solar system at millimeter resolution would need ~58-bit ints for position).
Been watching your development for a while on Twitter. This is a monumental achievement and I hope it gets the recognition it deserves.
This is nice, and obvious. I've waited about 20 years since I learned MATLAB and GNU Octave for someone to make a graph solver like this. And about 25 years since I first had the idea, when I was learning VLSI with VHDL in college and didn't see anything like the functional programming of circuits in what at the time was the imperative C++ world. The closest thing then was Lisp, but nobody talked about how the graph representation (intermediate code or i-code in the imperative world) could be solved in an auto-parallelized way.
We still see this today in how languages go out of their way to implement higher-order method libraries (map/reduce/filter), but then under the hood there is no multithreading; they just expect the developer to annotate their loops as parallel, because the languages aren't formal enough to know about side effects in the innermost logic, and don't support immutability or performant pass-by-value semantics with copy-on-write anyway. So we end up with handwavy languages like Rust that put all of that mental load onto the developer for basically no gain; they just save memory by performing computation in place imperatively.
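To make that concrete, here is the kind of opt-in annotation I mean, sketched with C++17's parallel algorithms (just one example of the pattern; the same shape shows up as pragmas or annotations in most mainstream languages):

    #include <cstdio>
    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<long long> xs(1 << 20);
        std::iota(xs.begin(), xs.end(), 0LL);
        // The parallelism is opt-in: the developer passes an execution policy
        // and personally promises the reduction is side-effect free, because
        // the language can't verify that on its own.
        long long total = std::reduce(std::execution::par, xs.begin(), xs.end(), 0LL);
        std::printf("%lld\n", total);
        return 0;
    }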
I also like how Bend sidesteps the nonexistence of highly scaled symmetric multiprocessing CPUs by supporting GPUs. It makes the argument moot that GPUs can't be stopped because they're too big to fail. Julia is the only other language I've seen that tries this. I wish Clojure did, although it's been a long time since I followed it so maybe it has some parallelism?
I would have dearly loved to work on something like Bend, had someone solved the funding issue. Nobody wants to pay for pure research, and nobody sees the need for languages that do what everyone else is doing except easier. We have Kickstarter for widgets and Patreon for influencers, but makers have to bootstrap themselves or learn everything about finance or live in the right city or have a large network to hopefully meet an angel investor or work in academia and lose all rights to what they invent while spending the majority of their time hustling for grants anyway. So it just never happens and we're stuck with the same old busted techniques. Like how Hollywood only has money for sequels and reboots or the recording industry only has money for canned corporate music and hits from already famous artists and yet another cover that yanks the original better song off the radio.
A quarter of a century can go by in the blink of an eye if you get suckered into building other people's dreams as a people-pleaser. Be careful what you work on.
Is the recursive sum the best function to show multi-threading or GPU speedups? Seems unlikely. FWIW, I ported the Python example to Julia and it ran in about 2.5 seconds, the same as the C++ version. Pure Python 3.12 took 183 seconds.
    function sum(depth, x)
        if depth == 0
            return x
        else
            fst = sum(depth - 1, x * 2 + 0)
            snd = sum(depth - 1, x * 2 + 1)
        end
        return fst + snd
    end
    println(sum(30, 0))

This is really, really cool. This makes me think, "I could probably write a high performance GPU program fairly easily"...a sentence that's never formed in my head.
This is awesome and much needed. Keep going, forget the overly pedantic folks, the vision is great and early results are exciting.
I remember seeing HVM on here a year or two back when it came out and it looked intriguing. Exciting to see something being built on top of it!
I would say that the play on words that gives the language its name ("Bend") doesn't really make sense...
https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md
> Bending is the opposite of folding. Whatever fold consumes, bend creates.
But in everyday language bending is not the opposite of folding; they are more or less the same thing. Why not "unfold", which also has a connotation of something being in the process of happening, as well as being the opposite of folding?
I have a question about the example code and output for bending:
    type Tree:
      Node { ~lft, ~rgt }
      Leaf { val }

    def main():
      bend x = 0:
        when x < 3:
          tree = Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
        else:
          tree = Tree/Leaf { val: 7 }
      return tree

    tree = fork(0)
    tree = ![fork(1), fork(1)]
    tree = ![![fork(2),fork(2)], ![fork(2),fork(2)]]
    tree = ![![![fork(3),fork(3)], ![fork(3),fork(3)]], ![![fork(3),fork(3)], ![fork(3),fork(3)]]]
    tree = ![![![7,7], ![7,7]], ![![7,7], ![7,7]]]
Where does the initial "tree = fork(0)" come from?

I've made a benchmark of Bend running a simple counter program on CPU vs GPU, vs Haskell, Node, Python, and C, which I plan to write a blog post about, probably this Sunday:
https://docs.google.com/spreadsheets/d/1V_DZPpc7_BP3bmOR8Ees...
It's magical how the GPU version is basically flat (although with a high runtime init cost).
Congrats on the HVM2 launch! Been following for a while, excited to see where this project goes. For others who are lost on the interaction net stuff, there was a neat show hn that gave a more hands-on interactive intro: https://news.ycombinator.com/item?id=37406742 (the 'Get Started' writeup was really helpful)
This is very exciting. I don't have any GPU background, but I have been worrying a lot about CUDA cementing itself in the ecosystem. Here devs don't need CUDA directly, which would help decouple the ecosystem from cynical mega corps; always good! Anyway, enough politics...
Tried to see what the language is like beyond hello world and found the guide [1]. It looks like a Python and quacks like a Haskell? For instance, variables are immutable, and tree-like divide-and-conquer data structures/algorithms are promoted for getting good results. That makes sense, I guess! I'm not surprised to see a functional core, but I am surprised to see the Pythonic frontend, not that it matters much. I must say I highly doubt it will make it much easier for Python devs to learn Bend, though I don't know if that's the goal.
What are some challenges of programming with these kinds of restrictions in practice? Also, are there good FFI options?
[1]: https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md
As a resident of Bend, Oregon... it was kind of funny to read this and I'm curious about the origin of the name.
Oh wow do I wish this existed when I was playing with evolutionary computation and genetic algorithms in college…
I have no interest in this tech as it's apparently for backend stuff and not actually rendering things by itself.
But the demo gif is probably the best I have seen in a Github readme. I watched it till the end. It was instantly engaging. I wanted to see the whole story unfold.
Would a compiler be faster by using HVM? Would love to see a fully parallel version of typescript tsc
Maybe I missed it, but there seems to be no license attached to HVM2, nor to Bend or Kind?
The interesting comparison nowadays would be against Mojo:
Congrats on the launch.
I know the docs say this will be fixed soon, but what is the main reason for restricting number types to 24 bits? I saw in the code that they are wrappers around the 32-bit system number types, so what prevents Bend from changing them to U32(u32) right now?
I think I have a use for this, but I'm realizing that I don't know how to build a mental model of what is going to parallelize in this system. Surely some algorithms are better at getting chopped up than others; how can I tell what is going on?
This looks cool. I find myself wishing for a language and introductory tutorial that isn't so targeted at Python programmers, however (though I understand from a commercial point of view why that may make sense).
It seems like this is actually an elegant typed functional language, but the Python syntax looks ugly and verbose, as if it's trying to hide that, compared to something more ML/F#- or Haskell-inspired.
I'll try to get past that though, as it does look like there's something pretty interesting here.
> Everything that can run in parallel, will run in parallel.
On the CPU, there's typically a threshold where dividing and coordinating the parallel work takes more time than simply doing the work on a single thread. Thus you can make the overall runtime much faster by not dividing the work all the way, but rather stop at that optimal threshold and then just loop over the remaining work in the worker threads.
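For reference, this is roughly what that cutoff pattern looks like on the CPU, sketched in C++ with std::async on the recursive-sum example from this thread. (My own illustrative sketch; tree_sum and the cutoff depth of 26 are names and values I picked, not anything from Bend or HVM.)

    #include <cstdint>
    #include <future>
    #include <iostream>

    // Subtrees at or below this depth are summed sequentially instead of
    // spawning more tasks; the value is an arbitrary illustrative choice.
    constexpr int kCutoff = 26;

    uint64_t tree_sum(int depth, uint64_t x) {
        if (depth <= kCutoff) {
            // Sequential base case: the leaves of this subtree are the values
            // x*2^depth .. x*2^depth + 2^depth - 1, so just loop over them.
            uint64_t base = x << depth, total = 0;
            for (uint64_t i = 0; i < (1ull << depth); ++i) total += base + i;
            return total;
        }
        // Above the cutoff: fork one half into a task, recurse on the other.
        auto fst = std::async(std::launch::async, tree_sum, depth - 1, x * 2);
        uint64_t snd = tree_sum(depth - 1, x * 2 + 1);
        return fst.get() + snd;
    }

    int main() {
        std::cout << tree_sum(30, 0) << "\n"; // 576460751766552576
        return 0;
    }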
How does this work on the GPU using Bend? Been too long since I did any GPU programming.
The website claims "automatically achieves near-ideal speedup"
12x for 16 threads
51x for 16,000 threads
Can someone point me to a page that explains why that counts as "ideal speedup"? Is there a formula?
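For what it's worth, the textbook bound people usually mean by "ideal" is Amdahl's law (general background, not something from the Bend site): with a fraction f of the work parallelizable and p threads, the best possible speedup is

    S(p) = \frac{1}{(1 - f) + \frac{f}{p}}

which approaches p-fold as f approaches 1. By that yardstick 12x on 16 threads is close to ideal, while 51x on ~16,000 GPU threads clearly isn't p-fold; GPU "threads" aren't independent cores, so presumably the right baseline there is the card's peak throughput rather than the thread count.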
I've made a benchmark of the current version of Bend running a simple counter program on CPU vs GPU, vs Haskell, Node, Python, C: https://docs.google.com/spreadsheets/d/1V_DZPpc7_BP3bmOR8Ees...
It's magical how the GPU version is basically flat (although with a high runtime init cost).
> CPU, Apple M3 Max, 1 thread: 3.5 minutes
> CPU, Apple M3 Max, 16 threads: 10.26 seconds
Surprised to see a more than linear speedup in CPU threads. What’s going on here?
Speaking of parallel computing, are there any books that can help an engineer learn parallel programming and go from zero to hero? Ideally the books would cover intuition, in-depth details, and theory. Something like The Art of Multiprocessor Programming by Herlihy et al. for concurrent programming, even though that book arguably still has too steep a learning curve.
Congratulations on the launch and hard work so far! We need projects like this. Great readme and demo as well.
Every time I try to write shaders, or even peek through my fingers at CUDA C(++) code, I recoil in disbelief that we don't have high level programming yet on the GPU. I can't wait until we do. The more great projects attacking it the better in my book.
This is a very cool project! I sometimes dream about an instruction set architecture (ISA) that runs in some kind of VM and allows existing languages to run on CPUs/GPUs/FPGAs/ASICs automatically.
I think this is a much more practical approach, and I hope it will give some inspiration toward that possibility.
What's the expected effort needed for supporting GPUs other than Nvidia — e.g. AMD GPUs or the GPU in a MacBook Pro M1/2/3?
As I understand, it's a lot of work because there's no common way to target these different GPUs. Is this correctly understood?
Dupe of https://news.ycombinator.com/item?id=40387394 and https://news.ycombinator.com/item?id=40383196
Congrats!
I’ve been watching HVM for a while and think it’s extremely cool.
My intuition is that this will eventually be a really big deal.
Incredible feat, congratulations!
Looks cool but what's one toy problem that it can solve more efficiently than others?
This kind of sound like Mojo. I wonder how these compare? (Besides HVM/Bend being opensource, which is awesome.)
Reminds me a little bit of Concurnas [0] which sadly seemed to get abandoned right at the point when it was nearly viable.
What is the terminal used for that demo at https://github.com/HigherOrderCO/Bend? Does it just skip commands it cannot execute, or?
Wow this is very impressive!
The first graphic midway or so down the page has this tag:
tested on: CPU - Apple M3 Max, GPU - NVIDIA RTX 4090
But how? I thought eGPUs don't work on Apple silicon, and the PCIe-equipped Mac Pro is still M2-based, no?
This seems pretty cool!
Question: Does this take into account memory bandwidth and caches between cores? Because getting them wrong can easily make parallel programs slower than sequential ones.
What's going on with the super-linear speedup going from one thread to all 16?
210 seconds (3.5 minutes) to 10.5 seconds is a roughly 20x speedup, which is more than the 16x you'd expect from 16 threads.
So HVM finally yields fruit. I've been eagerly awaiting this day! Bend seems like a very suitable candidate for a Lispy S-expression makeover.
This is cool! Is the idea to put Kind2 on top of this in some way?
I’d also love to find an example of writing a small interpreter in Bend - which runs on the GPU.
Looking forward to using this. Curious about how far away WebGPU/WASM support might be, it could provide a single cross-platform backend.
What kind of software would this language be good for? I assume it's not the kind of language you'd use for web servers exactly.
What would be super interesting is having this as a library, so you could use something like it inside Elixir or Ruby to optimize hotspots.
This and HVM2 are some of the most interesting work I know of currently. Nice break from all the LLM stuff.
Now I just need a Common Lisp implemented using it!
If folds and bends are isomorphic to loops, then can loops be parallelized a la Occam-Pi?
I am really enjoying this implementation :)
I just read through the HVM2 and Lafont papers, and I'm pretty impressed with this style of computation!
It's pretty cool that you *actually* built this much-needed developer interface aimed at making HPC accessible!
This looks like the language I've wanted for a long time. I'm excited to see how this plays out.
Has someone written the example in "native" GPU code (C/CUDA) to compare performance?
I can already see how many people are so illiterate about GPUs.
Is there an OpenCL backend?
Awesome! It looks similar to the problems that Modular AI aims to solve.
Nice!
Python-like + high-performance.
And, unlike Mojo, it's fully open-source.
Reminds me a lot of the reduceron, sans FPGA.
Exciting project, congrats on the release!
Honestly incredible, and congrats on the release after what looks like an insane amount of work.
Another project locking its communications to the Discord black hole.
So cool. I see it has a lib target; can we use it as a crate instead of an external program?
I for one found the 'how is this possible' video near the bottom of the page to be unhelpful:
- surely for `3 x 3 = 9`, there is some concept of primitive operations?
- I get that replacement of patterns in a graph can be done in parallel, but (a) identifying when a rewrite rule should apply, (b) communicating the state of the updated graph to worker threads, and (c) organizing worker threads to agree on who does each task all take some effort. When is this more work than the original computation (as in the 3x3 example)?
24-bit integers and floats, no array datatype, and a maximum 4GB heap of nodes are very harsh restrictions, especially for any workloads that would actually want to be running on a GPU. The limitations in the HVM2 whitepaper about unsound evaluation around closures and infinite loops, because both sides of a conditional are evaluated without any short-circuiting, are also extremely concerning.
Before you reply "these are things we can address in the future": that doesn't matter. Everyone can address everything in the future. They are currently hard technical barriers to its use, with no way of knowing the level of effort that will require or the knock-on effects, especially since some of these issues have been "we can fix that later" for ten years.
I also highly recommend changing your benchmark numbers from "interactions per second" to a standard measurement like FLOPS. No one else on earth knows how many of those interactions are pure overhead from your evaluation semantics rather than useful work. As they stand, they come across as an attempt to wow an audience with big numbers, not as an apples-to-apples comparison with other languages.
Thank you for sharing!
> First, install Rust nightly.
Eeek.
This is really cool!
Very nice!
congratz on the release
amazing
Pure functions only? This is disappointing. Furthermore, it invites a comparison with JAX.
This is cool as shit.
> That's a 57x speedup by doing nothing.
Okay, I'll have what you're having.
Massive promises of amazing performance, but they can't find one convincing example to showcase. It's hard to see what they're bringing to the table when even the simplest possible Haskell code runs just as fast on my 4-year-old laptop with an ancient version of GHC (8.8). No need for an RTX 4090.
    module Main where

    sum' :: Int -> Int -> Int
    sum' 0 x = x
    sum' depth x = sum' (depth - 1) ((x * 2) + 0) + sum' (depth - 1) ((x * 2) + 1)

    main = print $ sum' 30 0
Runs in 2.5s. Sure, it's not on a GPU, but it's faster! And things don't get much more high-level. If you're going to promise amazing performance from a high-level language, I'd want to see a comparison against JAX.
It's an improvement over traditional interaction nets, sure! But interaction nets have always been a failure performance-wise. Interaction nets are the PL equivalent of genetic algorithms in ML: they sound like a cool idea and have a nice story, but they always seem to be a dead end.
Interaction nets optimize parallelism at the cost of everything else. Including single-threaded performance. You're just warming up the planet by wasting massive amounts of parallel GPU cores to do what a single CPU core could do more easily. They're just the wrong answer to this problem.
For what it's worth, I ported the sum example to pure python.
Under pypy3 it executes in 0m4.478s, single-threaded. Under Python 3.12, it executed in 1m42.148s, again single-threaded. I mention that because you include benchmark information: the Bend single-threaded version has been running for 42 minutes on my laptop, is consuming 6GB of memory, and still hasn't finished (12th Gen Intel(R) Core(TM) i7-1270P, Ubuntu 24.04). That seems to be an incredibly slow interpreter. Has this been tested or developed on anything other than Macs / aarch64?

I appreciate this is early days, but it's hard to get excited about what seems to be incredibly slow performance from the really simple example you give. If the simple stuff is slow, what does that mean for the complicated stuff?
If I get a chance tonight, I'll re-run it with `-s` argument, see if I get anything helpful.