Hacker News

by johnnyapol on 2/9/18, 2:42 PM with 174 comments

by mmanfrin on 2/9/18, 6:50 PM

When Bluehole take down PUBG for 5 hours, there's no communication outside of two tweets. When Epic see degraded performance for less than 2 hours, they give a postmortem.

There's a difference in the level of respect each company shows its customers. I play PUBG a lot, but I want to see Epic win in the long run.

by jasonjayr on 2/9/18, 5:24 PM

> We run Fortnite’s dedicated game servers primarily on thousands of c4.8xlarge AWS instances, which scale up and down with our daily peak of players.

That's between $572,000 (500 instances, 30 days) and $2,863,800 (2500 instances, 30 days) per month at current prices, and it seems like that's only for one aspect of their infrastructure.

That seems .... excessive? Is that a typical spend with a game server system like this? That does seem to suggest that once this becomes less than profitable, it's all going away ...
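
The monthly figures quoted above can be reproduced with a few lines of arithmetic. This is a rough sketch, not a pricing reference: the hourly rate of $1.591 is an assumption inferred from the upper figure (2,863,800 / (2500 × 24 × 30)), and it assumes every instance runs around the clock, whereas the article says the fleet scales up and down with daily peaks.

```python
from decimal import Decimal

# Assumption: hourly rate implied by the comment's upper figure,
# 2,863,800 / (2500 instances * 720 hours) = 1.591 USD per hour.
HOURLY_RATE_USD = Decimal("1.591")
HOURS_PER_DAY = 24


def monthly_cost(instances: int, days: int = 30) -> Decimal:
    """Cost of running `instances` machines non-stop for `days` days."""
    return HOURLY_RATE_USD * instances * HOURS_PER_DAY * days


low = monthly_cost(500)
high = monthly_cost(2500)
print(f"500 instances:  ${low:,.0f}")
print(f"2500 instances: ${high:,.0f}")
```

Under that assumed rate, the 500-instance case works out to $572,760, so the $572,000 in the comment looks like a rounded-down figure.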

by matt_s on 2/9/18, 4:12 PM

A lot of players of massive online games tend to get hand-wavey when there are problems and act like "dude just get more servers" is the answer.

This clearly shows how complex a system is needed that has to handle 3.4 million concurrent, connected users. I think the connected part compounds any scale problems you have since it is implied they are connected to each other.

by pkilgore on 2/9/18, 3:27 PM

Love this because it shows two things: 1) competent people are handling problems and 2) they actually care.

A whole lot better than spoon feeding customers bullshit for weeks while hamstringing your product rather than investing in it (looks at EA, mumbles about SimCity).

by tweenagedream on 2/9/18, 3:39 PM

Disclaimer: I work on Google Cloud so I will be speaking from the bias of knowing those products.

They talk a lot about reducing operating complexity and scaling their infrastructure. I wonder what the cost of their current infrastructure plus the staff to maintain it might be versus the managed solutions that cloud providers offer now.

For example, they could use Cloud Datastore, Spanner, or Bigtable as a persistence layer. These managed services can definitely scale to the current need, and I've seen them go much higher as well.

For log ingestion and analysis, BigQuery can be a very powerful tool as well, and with streaming inserts that data can be queried in near real time. For things that are less urgent, there are batch queries. For other things, Dataflow can help with streaming workloads.

I think one of the problems they alluded to, though, was that at the moment they're on a single provider, and what they're looking for is a multi-cloud strategy, which totally makes sense. A lot of the above products create some kind of lock-in, with some exceptions, like using HBase as an interface to Bigtable or Beam as an interface to Dataflow. Though I don't know what the other providers offer that may have these same interfaces.

Another option is Kubernetes, which I believe all providers are pretty strongly embracing. Having most of the supporting infrastructure be brought up with a few kubectl commands could help them scale across several cloud providers quickly.

by victorqhong on 2/9/18, 3:57 PM

Really surprised that they use XMPP. Since you don't really hear anything about XMPP anymore, I think most people assumed that it's dropped off in usage/popularity (or people have moved to some other proprietary solution).

I've always thought that XMPP would be useful for games, just surprised to hear that people are actually doing it.

by swaggyBoatswain on 2/9/18, 8:32 PM

I was playing Fortnite on 2-04-18 22:00 UTC during the "Friends Service" outage.

You couldn't see friends lists at all during that time period, so you couldn't queue up for a match with friends or people you knew. The only options were playing solo or using a "filled" team with random players.

I've been playing Fortnite since I was one of the early 60k concurrent users, all the way to the 3.4M, so it's been interesting seeing their load / server issues over time and then reading this (granted, I don't understand everything discussed in their blog). They've done an outstanding job handling their growing traffic.

One thing I've noticed with Fortnite, compared to PUBG or other MMOs, is how large their patch updates are. They're usually several GB, and they come fairly frequently, about once a week.

by fokinsean on 2/9/18, 3:44 PM

As an addicted Fortnite player, this is a neat read. However, as an application-layer dev, the architecture specifics were slightly over my head. My biggest concern is shipping a working Docker image; the architecture is mostly abstracted away at our company. This gave me some inspiration to dive deeper into our architecture.

by aaossa on 2/9/18, 3:36 PM

Loved the tone of the article. They know they have some problems to work on, they're being transparent about them and they're explicitly saying that they need help with it.

by iBotPeaches on 2/9/18, 3:17 PM

That was an incredibly fun read. Makes me curious about the other failures in this industry, if they could be explained in this detail.

by eterm on 2/9/18, 3:20 PM

This is an interesting read. It's always interesting to hear why something that ought to be fairly heavily federated or sharded can nevertheless fall over centrally.

by SilverSurfer972 on 2/10/18, 1:21 PM

> "Along with a number of things mentioned, even small performance changes over N nodes collectively make large impacts for our services and player experience."

I think this is where Stacktical helps with proactively detecting performance regressions at the CI level, before they hit production: https://stacktical.com

Disclaimer: I am Stacktical's CTO

by einrealist on 2/10/18, 12:42 AM

Nice read. And nice to see Java running at the backend.

I wonder whether Epic can solve its problems by rearchitecting more into a CQRS-driven system with event sourcing: store events in a more write-optimized DB (e.g. Cassandra) and then process the events for fast reads through whatever is required for the use cases. Maybe they hit the limits of MongoDB's ability to handle both reads and writes at their scale.
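
To make the CQRS plus event sourcing idea above concrete, here is a minimal in-memory sketch. It is only an illustration of the pattern, not Epic's design: the `FriendAdded` / `FriendRemoved` events, the `EventStore` class, and the `FriendsReadModel` class are all hypothetical names, and a Python list stands in for the write-optimized database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FriendAdded:
    """Hypothetical event: `user` added `friend` to their friends list."""
    user: str
    friend: str


@dataclass(frozen=True)
class FriendRemoved:
    """Hypothetical event: `user` removed `friend` from their friends list."""
    user: str
    friend: str


class EventStore:
    """Write side: an append-only log (stand-in for a write-optimized DB)."""

    def __init__(self):
        self._log = []
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def append(self, event):
        # Writes only ever append; existing events are never updated.
        self._log.append(event)
        for handler in self._subscribers:
            handler(event)

    def replay(self, handler):
        # Feed the full history to a handler, e.g. to rebuild a read model.
        for event in self._log:
            handler(event)


class FriendsReadModel:
    """Read side: a denormalized view derived purely from events."""

    def __init__(self):
        self._friends = defaultdict(set)

    def apply(self, event):
        if isinstance(event, FriendAdded):
            self._friends[event.user].add(event.friend)
        elif isinstance(event, FriendRemoved):
            self._friends[event.user].discard(event.friend)

    def friends_of(self, user):
        return sorted(self._friends.get(user, ()))


store = EventStore()
view = FriendsReadModel()
store.subscribe(view.apply)

store.append(FriendAdded("alice", "bob"))
store.append(FriendAdded("alice", "carol"))
store.append(FriendRemoved("alice", "bob"))
print(view.friends_of("alice"))

# A fresh read model can be rebuilt at any time from the event log.
rebuilt = FriendsReadModel()
store.replay(rebuilt.apply)
print(rebuilt.friends_of("alice"))
```

The point of the split is that the write path only appends, while any number of read models can be shaped for specific queries and rebuilt from the log if they are lost or need a new layout.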

by tlynchpin on 2/9/18, 8:00 PM

This is a great article with lots of detail. Props to the Epic team for generally killing it and specifically for putting this together.

by orliesaurus on 2/9/18, 6:18 PM

I never spent a dime on any of these free-to-play games. I am in awe at how dedicated the team behind Fortnite seems to be when it comes to providing us data (real data?) of what's happening on their side, while I am sitting on my couch logging into one of the matches with my keyboard and mouse.

by halflings on 2/10/18, 7:21 AM

Meanwhile, the game is pretty much unplayable [1] on Mac OS while it was heralded as the first game to support Metal (even featured in Apple's keynote).

[1] Getting ~16 FPS on medium settings, with the high-end late 2016 MBP 15".

by aecorredor on 2/10/18, 3:38 PM

How do you get to this level of expertise? What resources do people like these use to learn about this type of scalable system? Any good books that start from the ground up on these topics?

by dom96 on 2/9/18, 4:45 PM

Looks like they are still having scaling issues. I just tried creating a new Epic account and was shown an error.

by je42 on 2/10/18, 8:13 AM

I wonder why they want to do the step:

- Followed by removing Nginx + Memcached couple altogether out of equation.

by andrea_s on 2/9/18, 4:55 PM

404 not found at the moment - does anyone have a snapshot?

edit: nevermind, it is available again - weird

by skaplun on 2/10/18, 5:24 AM

any thoughts by game production managers on the success of the battle royale concept? anything we should take away for other products?

by NelsonMinar on 2/9/18, 6:57 PM

Launching this web page in Firefox on Windows causes my Oculus Rift software to start up. I am.. not happy about that. WTF?

by mevile on 2/9/18, 7:53 PM

I'd be more likely to share this with my team if they didn't have a recruiting pitch in there. I have a (probably irrational) fear of people abandoning the team for some new shiny thing. Just a thought. It's a great writeup otherwise!

by bullen on 2/9/18, 6:17 PM

This is why you use an async-to-async database:

Why: https://github.com/tinspin/rupy/wiki/Fuse

How: https://github.com/tinspin/rupy/wiki/Storage

Postmortem of Service Outage at 3.4M Concurrent Users