Show HN: Factorio Learning Environment – Agents Build Factories

by noddybear on 3/11/25, 12:02 PM with 209 comments
by vessenes on 3/11/25, 2:53 PM

OK, you’ve permanently nerd-baited me, and I wish to apply for a job at the Anthropic Factorio lab immediately.

I can’t tell from the paper or these comments if you’re sending multimodal data back — I’m guessing no, because many of these models aren’t multimodal. But some are — and of course we now have the recently released Qwen 2.5 VLM, which seems to be quite strong for its size.

You harp on this lack of spatial ability a fair amount - fair enough - and you mention difficulties in both planning and spatial planning. Are you sending images back? If not, any thoughts on this?

Thanks for this amazing bit of work; I really am reorganizing my day to play with it now.

P.S. Seems like MCP-enabling the Python library is a natural must-do so that all tool-enabled LLMs everywhere can play Factorio.

by scottmsul on 3/11/25, 5:00 PM

There was an HN post here not too long ago about a team that used reinforcement learning to train an agent to beat Pokémon Red. They mentioned how they had to tweak the reward function to give small rewards for exploring and big rewards for completing "essential tasks" like beating gyms.

I wonder if this same approach could be used here in Factorio. Using the Pokémon Red analogy, the main "essential tasks" in Factorio are setting up automation for new items and new science packs. I think a good reward function could involve small rewards for the production rate of each item/sec, medium rewards for setting up automation for new items, and big rewards for automating each new science pack.

Telling a Factorio agent to just "make a big factory" is like telling a Pokémon Red agent to just "beat the game": it has to be broken down into smaller steps with a very carefully tuned reward function.
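
Something like this sketch, where every weight, field name, and milestone bonus is made up purely to show the shape of the reward I have in mind:

    # Illustrative reward shaping for a Factorio agent: small dense rewards for
    # production throughput, larger one-time bonuses for automation milestones.
    # All weights and state fields here are invented for the sketch.
    AUTOMATION_BONUS = 50.0
    SCIENCE_BONUS = 500.0

    def shaped_reward(prev_state, state, seen_automated, seen_science):
        reward = 0.0
        # Small reward: any increase in items/sec, summed over every item type.
        for item, rate in state["production_rates"].items():
            prev_rate = prev_state["production_rates"].get(item, 0.0)
            reward += 0.1 * max(0.0, rate - prev_rate)
        # Medium reward: first time an item is produced by a machine, not by hand.
        reward += AUTOMATION_BONUS * len(state["automated_items"] - seen_automated)
        # Big reward: first time a new science pack is automated.
        reward += SCIENCE_BONUS * len(state["automated_science_packs"] - seen_science)
        seen_automated |= state["automated_items"]
        seen_science |= state["automated_science_packs"]
        return reward

    # Example: agent just automated iron gear wheels and raised iron plate output.
    print(shaped_reward(
        {"production_rates": {"iron-plate": 1.0}},
        {"production_rates": {"iron-plate": 2.5},
         "automated_items": {"iron-gear-wheel"},
         "automated_science_packs": set()},
        set(), set()))  # -> 0.15 + 50.0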

Thinking about this is really making me want to jump into this project!

by noosphr on 3/12/25, 12:36 AM

>We evaluate six frontier language models across both settings: Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct.

While I appreciate the effort and creativity that went into this, there are a lot of much simpler dynamic benchmarks that can saturate the planning capabilities of non-reasoning models.

Something as simple as giving a list of flight connections between cities and then asking for an itinerary between them confuses all these models when the shortest path between two nodes is long enough.

Longest shortest path the models could reliably find (8/10 tests for a given length) between two cities:

    | Model             | Path Length |
    |-------------------+-------------|
    | Claude Sonnet 3.5 |          10 |
    | GPT-4o            |           7 |
    | GPT-4o-mini       |           4 |
    | Deepseek-v3       |           6 |
    | Gemini-2-Flash    |  Not tested |
    | Llama-3.3-70B-Ins |           4 |
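
The generator and ground-truth scorer for this kind of test are tiny; here's a minimal sketch (prompting and answer parsing omitted, and the graph parameters are arbitrary):

    import random
    from collections import deque

    def random_flight_graph(n_cities=20, n_flights=40, seed=0):
        """Build a random symmetric flight network. Parameters are arbitrary."""
        rng = random.Random(seed)
        cities = [f"City{i}" for i in range(n_cities)]
        edges = set()
        while len(edges) < 2 * n_flights:
            a, b = rng.sample(cities, 2)
            edges.add((a, b))
            edges.add((b, a))  # flights run in both directions
        return cities, edges

    def shortest_path(edges, start, goal):
        """Plain BFS; this is the ground truth the model's itinerary is scored against."""
        neighbours = {}
        for a, b in edges:
            neighbours.setdefault(a, []).append(b)
        queue, seen = deque([[start]]), {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in neighbours.get(path[-1], []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None  # unreachable in this particular random graph

    cities, edges = random_flight_graph()
    print(shortest_path(edges, "City0", "City7"))

Making the graph larger while keeping it sparse is what stretches the shortest paths out and eventually breaks the models.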

by owenpalmer on 3/11/25, 6:13 PM

> All models exhibited limitations in spatial planning when constructing multi-section factories. Common failures included placing entities too close together, not allocating space for connections, or incorrect inserter placement

It makes sense why LLMs are bad with spatial reasoning. Not a lot of training data for it. I wonder what additional reasoning abilities will emerge when spatial reasoning is solved.

by Imnimo on 3/11/25, 5:02 PM

Another category of "Lab Play" task I'd be interested in seeing is balancer design. Even small balancers can be quite complicated (https://factorioprints.com/view/-NopheiSZZ7d8VitIQv9), and it would be interesting to see how models do at designing and troubleshooting them.

by spieswl on 3/11/25, 2:01 PM

Fantastic idea.

It seems like there are a lot of interesting experiments to be had here. The lab-play scenarios having a time-related component seems like a good idea. I assume most Factorio players who keep biters on treat them as a combined temporal-spatial constraint, so putting the agents on a timer gives you a sort-of proxy for a real game situation.

I like the way that the framework design is testing different things from micromanagement proficiency, such as what we have seen in DOTA 2 or StarCraft 2 experiments. Notably, severe worker micromanagement (in the case of the latter game) becomes a way to squeak out extra minerals when you have infinite APM available. This is an interesting learned behavior in a narrow context, but that tactic is really control-intensive, and even pro players have a high chance of screwing it up when they attempt it. It also doesn't seem to give additional insight into an agent's longer-term planning, execution, and analytical performance. FLE seems way more interesting as a higher-level "thinking" evaluation framework, with all that in mind.

Any plans for layout optimization benchmarks? As in, start with a given factory cell with X inputs and Y outputs, and optimize its performance.

by jxjnskkzxxhx on 3/12/25, 12:35 PM

I don't understand - were these models post-trained to play Factorio? A) If so, how is that possible given that e.g. Claude doesn't have public weights? B) If not, how would the agent know what the API does? Even if it's "guessing" from the English meaning of the API commands (e.g. place_entity_next_to places an entity next to something), how would it know what the recipes are? If it's trying and learning, we go back to A).

Having read the PDF, I don't think these models were post-trained, so how do we explain the questions in B)?

And if indeed there's no post-training and the authors expected exploration of recipes to come from the context window... I think that's way too short for RL-style improvement.

In short, I don't understand how they could've tested those models with post-training, and without post-training they all did unbelievably well.

If the authors read this: can you give us an idea of how many API query/response pairs fit within the context window, on average? Follow-up: do you get better results if you abbreviate the API call names, so that more response pairs fit within one context window?

by infogulch on 3/11/25, 4:36 PM

Interesting to see only a handful of complex scenarios. I've always suspected ML game agents need hundreds of tiny puzzles with hundreds of variations each to learn game mechanics properly. Like:

    The factory is not powered, place the missing power pole(s)
    The factory is missing items, place the missing belt(s)
    Craft and place these 200 assembly machines
    The assembly machine is not running for some reason, fix it
    The factory production is too low, double it
    Get to this other point in the factory as fast as possible
    Fix the brownout
    All of the above with and without bots

Programmatically generating a few thousand example scenarios like these should be relatively easy. Then use it like an IQ test question bank: draw a dozen scenarios from the bank and evaluate performance on each based on time & materials used.

I hypothesize that ML agents learn faster when evaluated on a sample from a large bank of scenarios of smoothly increasing complexity where more complex scenarios are presented after it scores sufficiently high on lower complexity scenarios.
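
A rough sketch of that bank-plus-curriculum idea, with the templates, tiers, and thresholds all invented just for illustration:

    # Sketch of the question-bank idea: templates generate many parameterised
    # scenarios, and the sampler only unlocks harder tiers once the agent scores
    # well on the easier ones. Template wording and thresholds are illustrative.
    import random

    SCENARIO_TEMPLATES = [
        # (difficulty tier, template)
        (1, "The factory is not powered; place the missing power pole(s) at {spots}."),
        (1, "The belt feeding {machine} is broken; place the missing belt segment(s)."),
        (2, "Craft and place {n} assembly machines producing {item}."),
        (3, "Production of {item} is {rate}/s; double it."),
    ]

    def sample_scenarios(tier_scores, k=12, unlock_threshold=0.8, seed=0):
        """Draw k scenarios, only from tiers whose predecessor tier is 'passed'."""
        rng = random.Random(seed)
        unlocked = 1
        while tier_scores.get(unlocked, 0.0) >= unlock_threshold:
            unlocked += 1
        pool = [t for tier, t in SCENARIO_TEMPLATES if tier <= unlocked]
        return [rng.choice(pool).format(spots="(12, 8)", machine="iron furnace",
                                        n=rng.randint(5, 200), item="iron gear wheel",
                                        rate=rng.randint(1, 10))
                for _ in range(k)]

    print(sample_scenarios({1: 0.9}))  # tier 1 passed, so tiers 1 and 2 are in the pool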

by mNovak on 3/11/25, 7:10 PM

Is there a human-play benchmark (even informally) for this style of interface? Not saying it's necessary or even relevant, I'm just curious to know what programmatic Factorio feels like -- I imagine spatial reasoning around text prompts would be fairly challenging for human players to navigate as well.

by p10jkle on 3/11/25, 12:15 PM

Wow, fascinating. I wonder if in a few years every in-game opponent will just be an LLM with access to a game-controlling API like the one you've created.

Did you find there are particular types of tasks that the models struggle with? Or does difficulty mostly just scale with the number of items they need to place?

by gglon on 3/11/25, 5:54 PM

I was thinking: to build a large, efficient factory autonomously, one could use an LLM as a high-level agent that uses specialized tools. The overall strategy would perhaps look like the following:

1. create an (intermediate) goal for resource production

2. create a factory graph with the calculated number of machines and the resource flows required between them. This would be done using linear programming (Factorio calculator); a rough sketch of this step follows after the list

3. somehow map the resulting graph to a hardware description language, such that each entity is mapped to a unique logic component and each transport lane is mapped to a unique wire (most difficult)

4. compile to a 2D FPGA layout using all the VLSI algos like partitioning and routing (HDL compiler)

5. map the resulting plan back to a concrete Factorio design
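
For step 2, a rough sketch with made-up recipe numbers, handling only the simple acyclic case (recipe loops like oil cracking are where the linear programming really becomes necessary):

    from collections import defaultdict

    # (craft time in seconds, items produced per craft, {ingredient: amount})
    # Numbers here are illustrative, not taken from the game files.
    RECIPES = {
        "electronic-circuit": (0.5, 1, {"iron-plate": 1, "copper-cable": 3}),
        "copper-cable":       (0.5, 2, {"copper-plate": 1}),
    }
    CRAFT_SPEED = 0.75  # e.g. a basic assembling machine

    def machine_counts(item, rate, totals=None):
        """Given a target output rate (items/s), walk the recipe graph and
        accumulate how many machines each intermediate product needs."""
        if totals is None:
            totals = defaultdict(float)
        if item not in RECIPES:  # raw resource: miners/furnaces handle it
            return totals
        craft_time, outputs, ingredients = RECIPES[item]
        crafts_per_sec = rate / outputs
        totals[item] += crafts_per_sec * craft_time / CRAFT_SPEED
        for ingredient, amount in ingredients.items():
            machine_counts(ingredient, crafts_per_sec * amount, totals)
        return totals

    print(dict(machine_counts("electronic-circuit", 10)))  # ~6.7 circuit and 10 cable machines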

by myrmidon on 3/11/25, 1:16 PM

Fascinating. Would have loved to see more pictures of the bigger factories-- or is the zig-zag belt into plastic production currently the best result?

I think this very clearly illustrates a big weakness of current LLMs-- humans might struggle just as much at first, but are able to specialize and adapt to a task, while LLMs can't-- yet.

I'm expecting even greater improvements from figuring out online learning/adaptation than what we got from chain-of-thought approaches.

Do you think the "API" to interact with the game is a big obstacle, compared to a human interacting with the game via monitor? Did anyone try to interact with the game via this API, and how does human effort measure up to the AIs?

by moconnor on 3/11/25, 12:52 PM

Incredible idea and execution, very interesting results. Genuinely: what a time to be alive!

by barrystaes on 3/11/25, 2:43 PM

I have long dreamt of automating Factorio the way an HDL and a PCB router work: just specify the ingredients and it produces a Factorio Blueprint.

First an MVP with naive designs, then optimized routing, and eventually something usable in-game that connects to provided inputs/outputs.

Would be more fun to develop than to play, obviously...

I liked the Nilaus mega base with those factory-train-block blueprints; it's basically Factorio DUPLO.

by kevmo314 on 3/11/25, 1:39 PM

Does it provide screenshots of the game state? I, too, would struggle to play the game effectively if I could not visually see it.

by iliketrains on 3/11/25, 4:33 PM

This is awesome! I like the idea of abstracting the factory building with a code-like structure. I wonder if a supplemental 2D image (mini-map style) as an input to the policy would help with the spatial reasoning?

I work on a similar factory game (Captain of Industry) and I have always wanted an agent that can play the game for testing and balancing reasons. However, a pixels-to-mouse-actions RL policy (similar to DeepMind's StarCraft agent) always seemed like a very hard and inefficient approach. Using a code-like API seems so much better! I might try to find some time to port this framework to COI :) Thanks for sharing!
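
Even a coarse ASCII rasterization of entity positions might give a text-only policy some spatial signal; a toy sketch of what I mean (entity names and symbols invented):

    # Sketch: render a coarse ASCII mini-map from (x, y, entity) tuples so a
    # text-only policy gets some spatial context. Entities and symbols are made up.
    SYMBOLS = {"assembling-machine": "A", "transport-belt": "-",
               "inserter": "i", "electric-pole": "p"}

    def ascii_minimap(entities, width=20, height=10, cell=2):
        """entities: iterable of (x, y, name) in world coordinates."""
        grid = [["." for _ in range(width)] for _ in range(height)]
        for x, y, name in entities:
            col, row = int(x // cell), int(y // cell)
            if 0 <= row < height and 0 <= col < width:
                grid[row][col] = SYMBOLS.get(name, "?")
        return "\n".join("".join(row) for row in grid)

    print(ascii_minimap([(2, 2, "assembling-machine"), (4, 2, "inserter"),
                         (6, 2, "transport-belt"), (8, 8, "electric-pole")]))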

by jharohit on 3/11/25, 12:51 PM

Every time a paper like this comes out, I always have one question: how do they control the game using the LLMs? How does the control-feedback loop work? What tools, software, and APIs do they use to do it on Mac or Windows?
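
My working mental model is a loop roughly like the sketch below: the game runs as a server, a harness turns its state into text, the LLM replies with tool calls or code, and the harness executes them and feeds the result back. Every name here is invented for illustration and may not match what this project actually does:

    # Hypothetical control-feedback loop; fle_client and call_llm are stand-ins
    # for whatever game API and model client a given project actually uses.
    def run_episode(fle_client, call_llm, max_steps=50):
        history = []
        observation = fle_client.observe()           # text description of game state
        for _ in range(max_steps):
            prompt = build_prompt(history, observation)
            action_code = call_llm(prompt)           # model returns a snippet of tool calls
            result = fle_client.execute(action_code) # run it, capture output and errors
            history.append((action_code, result))
            observation = fle_client.observe()
        return history

    def build_prompt(history, observation):
        recent = "\n".join(f">>> {a}\n{r}" for a, r in history[-5:])
        return f"API docs...\nRecent steps:\n{recent}\nCurrent state:\n{observation}\nNext action:"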

by devit on 3/11/25, 5:21 PM

Seems like it might be more effective to use the LLMs to write a program that plays Factorio rather than having them pick the next action given a game state.

Also, in general, I think the issue with Factorio is that you can just find an "optimal" factory design and build order and follow it every time; perhaps starting with a suboptimal building layout already present, plus restrictions like being unable to change it or build others of the same type, could help.

by mritchie712 on 3/11/25, 1:29 PM

Have you tried Sonnet 3.7 yet? Guessing these aren't cheap evals to run.

leaderboard: https://jackhopkins.github.io/factorio-learning-environment/...

by sturza on 3/11/25, 1:00 PM

> 1. Coding skill predicts performance

> Models with stronger coding abilities (Claude 3.5-Sonnet, GPT-4o) achieved higher Production Scores and completed more lab tasks. Claude outperformed others with a PS of 293,206 and 28 milestones, progressing beyond early-game resource extraction.

by danso on 3/11/25, 1:41 PM

Tangentially: been wondering when we’d ever see the breakthroughs in LLMs trickle down to making better adversarial game AIs. Haven’t tried Civ 7 b/c of its terrible reviews, but I’d happily buy in if there were AIs that were more human-like and varied in their scheming

by onehair on 3/11/25, 4:43 PM

Hi Jack, just reached 85/88 achievements in Space Age. Seeing an article about computer science and Factorio at this stage is either the nicest romantic gesture or something very cruel intended to keep me playing this beautiful game forever.

by mentalgear on 3/11/25, 12:57 PM

> [LLMs] yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis

This reflects my experience with gen-LLM coding, where LLMs keep trying to do the same thing in a loop.

by cmgriffing on 3/11/25, 9:21 PM

Their first key insight is interesting. It says that coding ability predicts model performance in the game. I wonder if it also predicts performance of human players in some way?

by zoogeny on 3/11/25, 5:02 PM

This reminds me of how I'll judge LLM ability: when it can compete in a 1v1 Dark Souls battle. Even Factorio is pretty slow compared to what is possible in modern games.

by ramesh31 on 3/12/25, 2:32 PM

Claude is in a league of its own, of course. It's so painfully obvious Anthropic is going to be the winner here, barring any further upsets.

by nickvec on 3/12/25, 6:34 PM

> The factory consists of a electricity steam generator (top-left)...

nit: "a" should be "an" here

by bocklund on 3/11/25, 6:31 PM

Is there an additional arXiv paper? The abstract referenced at the bottom refers to a completely different paper.

by artemonster on 3/11/25, 12:46 PM

Diagonal belts are signs of evil

by keeganpoppen on 3/11/25, 12:48 PM

This is an absolutely fascinating project-- wow! I am going to have to fire up Factorio again and try it out! The implications of what the experience of playing games is like in this new LLM era / world order are fascinating.

by alexop on 3/11/25, 1:01 PM

It's funny how video games are the hardest benchmark that humanity has for AI.

by ainiriand on 3/11/25, 8:08 PM

Wait, are you basically telling me that now I can play Factorio with code?

by cgannett on 3/11/25, 1:55 PM

I wonder if anyone has done something similar with Dwarf Fortress

by sc68cal on 3/11/25, 1:26 PM

The diagonal belts and diagonal pipes are especially cursed.

by leetbulb on 3/11/25, 12:32 PM

Very cool project. Lovely diagrams.

by idiotsecant on 3/11/25, 8:06 PM

...And that was how a bunch of very enthusiastic HN Factorio fans inadvertently started the exponentially expanding intelligence that consumed the universe to make more science cubes.

by Python3267 on 3/11/25, 2:45 PM

The factory must grow.

by zelias on 3/11/25, 12:38 PM

Fantastic! Now I can sit back and watch the factory grow itself!

More seriously, I think this is a great "next step" in the evolution of benchmarks. Contemporary benchmarks are really just standardized tests, but the Factorio problem set presents an unbounded canvas of creativity for solutioning.

by deadbabe on 3/11/25, 10:40 PM

Am I the only one who doesn’t find the results promising?

This is a ton of compute power and complexity for what is basically a shitty AI. It has no practical purpose. Better AIs have been built with less; why don't people appreciate them? Or do we just take them for granted?

by loveparade on 3/11/25, 1:04 PM

"put the right signals into my train network"

Not even humans can pass this benchmark.

by johnisgood on 3/11/25, 6:26 PM

Claude wins.

by andbberger on 3/11/25, 6:49 PM

stream this on twitch

by tux3 on 3/11/25, 12:47 PM

For the real frontier benchmark, at the edge of what humans will put up with, install Pyanodon's mod. The scale of it puts a real strain on your organizational skills. Overbuilding, underbuilding, or bad planning can all cause significant pain down the line as the factory risks becoming an unmanageable, tangled mess with no sane capacity for expansion. It's a real test of executive function and organization.

For something humans will definitely not put up with, install PyBlock in Hard Mode. I suspect that benchmark will not fall anytime soon. It is borderline impossible without superhuman patience.

by WJW on 3/11/25, 1:12 PM

Very cool and also pretty expected results tbh. Some thoughts:

Factorio is a game that requires SIGNIFICANT amounts of thinking ahead, often demanding investments in things that won't pay off until much later and which might even significantly hamper initial development. Building a main bus vs spaghetti belts is one of the obvious examples here.

Humans with a little bit of experience playing Factorio know that while building 1 item/s of some new resource is good, the game is about eventually building thousands of the new item. Until the LLM learns not to be short-term minded, it will probably build itself into a corner very quickly.

It is kind of amazing that these models manage to figure out a strategy at all, considering the game is not in their training set. That said, the current research goals are not very good IMO. Building the largest possible base has the predictable result of the AI building a humongous belt loop covering much of the map. A much better target would be the "standard" goal of SPM.

I think 99% of Factorio could be "solved" with GOFAI algorithms from the 80s and enough processing power. Set up a goal like 10k SPM, work backwards to how many of each resource you need, then recursively figure out the fastest way to set up production for each subresource using standard optimization algorithms from OR. No LLMs needed.

by Starlord2048 on 3/11/25, 4:28 PM

[flagged]

by philipwhiuk on 3/11/25, 1:55 PM

It's great to see that LLMs too, struggle with oil production.

by dkural on 3/11/25, 3:17 PM

So the battle of AGI will be won on the playing-fields of Factorio and StarCraft..