[Devblog] Tranquility Tech IV!

Spot on analysis @MalcomReynolds_Serenity – BTW, we did indeed look at the Platinum processors but the price was not right.

1 Like

Yeah, what’s up with that @CCP_Swift?

4 Likes

@CCP_DeNormalized and I love to hate NUMA.

3 Likes

Uuhhh, I don’t know. @CCP_DeNormalized, can you ask around and reply?

1 Like

7 years actually, that’s the extended warranty.

But the part that I covered very quickly in the devblog is that in between TQ Tech III and TQ Tech IV there was TQ Tech IIIS (evidently S stands for “semis”, meaning half in Roman numerals). So we are likely to remove the Gold 5122 and Gold 5222 powered machines in a few years’ time (possibly in 3-4 years), replace them with whatever is most high-end at that time (which will not be FLEX blades since IBM/Lenovo are stopping production), and move the heaviest workload there - JITA, The Forge market, fleet fights, etc.

Well, there is more to this. The Xeon Platinum processors are very expensive and some of the Gold ones are power-hungry and/or don’t fit our setup. Some liberal colour-coding:

It is also possible that we will move to AMD machines next time since IBM/Lenovo will not be making FLEX blades anymore (but purchasing FLEX blades was a requirement for this upgrade due to existing infrastructure) and the AMD EPYC Zen 4 processors are an interesting option.

Note the lack of AMD CPUs and the lack of Intel 4th gen:

2 Likes

Love and hate is right :slight_smile:

A few years ago, just months before super devs fixed our NUMA issues for good (no jinx), we were evaluating new DB boxes and had 2 single-socket servers in the lineup for testing.

An Intel 24/48 core box and the EPYC 64/128 core. Due to whatever was going on within the code base at the time, we could not get the cluster to start with the single NUMA node. Blocking chains would tie up all the worker threads, max the CPU, and threadpool waits would starve the simulation out - sols unable to heartbeat. We so hated it…

The EPYC on the other hand made short work of it - was it the fact that it had 128 cores vs. the 48 on the Intel box? Or was it due to Windows being unable to address all cores in a single NUMA node and instead splitting them into 2 NUMA nodes?

NUMA to the rescue! We love you NUMA! While 1 NUMA node’s CPUs were maxed out, the other had room to get stuff done.
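A rough sketch of the arithmetic behind that second theory, assuming the usual Windows limit of 64 logical processors per processor group (the constant and function name below are illustrative, not anything from our tooling):

```python
import math

# Assumption: a Windows processor group holds at most 64 logical processors,
# so a larger node has to be presented as more than one group/node.
MAX_LOGICAL_PER_GROUP = 64


def min_processor_groups(logical_processors: int) -> int:
    """Smallest number of groups needed to cover the logical processors."""
    return math.ceil(logical_processors / MAX_LOGICAL_PER_GROUP)


intel_box = min_processor_groups(48)   # 24 cores / 48 threads
epyc_box = min_processor_groups(128)   # 64 cores / 128 threads
print(intel_box, epyc_box)
```

Under that assumption the 48-thread box fits in a single group while the 128-thread box cannot, which would match what we saw.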

In the end, we picked our current set of boxes, dual socket 16 core CPUs - 2 physical NUMA nodes, a nice bump in cores, but not so many that the SQL license cost made our eyes jump out of our heads.

In the time it took to get the DB hardware ordered/shipped/racked, devs had the NUMA issues sorted and now I rarely even think about it! :slight_smile:

4 Likes

I’m told VxRail was looked into some years ago. At the time the idea of having to buy entire compute+storage nodes to expand didn’t make sense for our needs.

It would be very cool to see TQDB as a virtual SQL cluster on one of these though - 4 TB of memory and 8 TB of storage - spread over multiple compute nodes.

2 Likes

There isn’t. All those other extensions have been standard additions for an extremely long time. New extensions don’t come along that often, and even if one player gets to adding one before anyone else does, it typically doesn’t matter: software support has to be added for it months or years later.

Pretty much all of that is totally wrong. The launch date of a product doesn’t start a countdown for the hardware life cycle for organizations that won’t even buy the product until potentially years into the future. It starts on the start date of the warranty. In typical SMBs with dirty power, inadequate UPS, inadequate cooling, and poorly maintained firmware and drivers, you’ll typically see that work out to about 5 years. In an ideal datacenter maintained by an enterprise or at a colo offsite, most hardware has no problem lasting 7 years, which is also typically the maximum practical warranty limit. (Some vendors will sell sleazy support contracts with minimal support beyond that, but don’t count on them to fix your hardware problem.)

The specs you posted are irrelevant. 4th gen Intel is a paper launch only and not validated by any quality player. It could be a year before we start seeing the best in the microcomputer server business offering them. DDR5 isn’t only a paper launch, it’s vaporware. If you can’t say exactly what in a system would benefit from higher RAM clock speed, then the system doesn’t need it. There is so little that benefits from it. With Tranquility operating asynchronously on 1 second ticks and SQL Server having a 4 millisecond quantum, it is unlikely to matter for Tranquility and certainly doesn’t matter for SQL Server.

The architecture of the chips really doesn’t matter either. The job the chips do is what matters. Gen 1 and gen 2 Xeons are still sold and widely used, especially the Xeon Gold 5122 and 5222. Both of them have very high clock speed and a narrow gap between base clock and boost clock. These make absolutely outstanding mid-size database servers, or hosts for any other application with a relatively low degree of parallelism in its workload, or anything else that needs a high clock speed with little variability in performance. They may be getting a little slow 7 years from now, but that doesn’t change their utility now or for the next several years.

As they already state in the post, the warranty period was up on the existing servers. There is no waiting. The mere suggestion of waiting for the next technology to come out is totally ridiculous. We would all still be loading programs from ropes with knots tied in them if everyone waited for the next technology to come out before buying.

The 6334 is the highest performance per core processor available with vendor support, beneath one third gen platinum processor that is only nominally faster and is really more geared towards HPC and 8 socket machines. There is no comparison to be made between stateless applications running in docker and stateful applications running in a farm. Cache also doesn’t matter because everything else is a paper launch.

2 Likes

My first exposure to NUMA was right before they changed the licensing from sockets to cores. A client got clever and thought that if they upgraded their single socket, quad core SQL 2008 box to a 4 socket, 8 core each box with the same amount of memory (16 or 32 GB), they would be more awesome for a long time - they were resoundingly wrong about that. I was still pretty green at the time and no one else had even thought of NUMA beyond theory before, so it took us weeks to figure it out while constantly being bitched at by the client. The fix ended up being to pull three of the processors out and move the RAM :slight_smile:

Years later, getting into soft-NUMA with oversubscribed virtualization hosts, and reading posts on Spiceworks and Experts Exchange by people who understood neither NUMA nor soft-NUMA but spoke with enough authority to create an amazing amount of confusion, was also pretty great. (It’s nice that Microsoft’s documentation is better on this now…)

I have been so scarred by it that whenever I think it may come into play, I push really hard to get a lot more RAM than requested to try to avoid it.

3 Likes

Any chance then to move to a higher tick rate than 1Hz at some point? The entire game having a full 1 second lag feels really really clunky.

This is, in my opinion, the one biggest thing that makes the game feel really dated.

No, what makes the game feel dated/laggy is the client side “estimation” stuff that tries to smooth things out between server ticks (e.g. a 3.5 second cycle time on a weapon falling between server ticks), and then gets jankily auto-corrected to the server tick after the fact (cycling and uncycling weapons repeatedly with frigate sized guns will show this very well). It’s one of the many janky and dated UI features that pre-date Photon and are not touched by Photon… only, if the current status is anything to go by, it’s probably a good thing that they weren’t added, or else they would’ve been screwed up to be even worse. Anyhow, I’ve digressed enough on that.

Though you could argue that many text based games, including most Multi-User Dungeons (MUDs) that are still around, are on a 4 or 5 Hz interval, and have been for 25-30ish years… but the server update rate alone doesn’t really make anything feel or be “dated”.
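To put numbers on the 3.5 second weapon example, here is a minimal sketch. It assumes the server only resolves a cycle on the first tick at or after the cycle’s nominal end; that rounding rule and both function names are my assumptions, not confirmed server behaviour.

```python
import math


def ticks_needed(cycle_seconds: float, tick_hz: float) -> int:
    """Whole server ticks before a cycle can be resolved.

    Assumption: an action resolves on the first tick at or after its end time.
    The round() guards against tiny floating-point overshoot before ceil().
    """
    return math.ceil(round(cycle_seconds * tick_hz, 9))


def effective_cycle(cycle_seconds: float, tick_hz: float) -> float:
    """Cycle time as the server would see it, snapped to the tick grid."""
    return ticks_needed(cycle_seconds, tick_hz) / tick_hz


for hz in (1.0, 4.0, 5.0):
    print(hz, effective_cycle(3.5, hz))
```

Under that assumption the client shows 3.5 seconds while a 1 Hz server lands on the 4 second tick, which is the visible correction. On a 4 Hz grid the same cycle falls exactly on a tick and nothing needs correcting.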

All the upgrades and still can’t have a proper big fight.

2 Likes

The fight in X47L-Q showed the quality of the server upgrade.

2 Likes

It comes with an “Edge of Tomorrow” feature :rofl:

“Round 1 2 3”

2 Likes

We are investigating what happened there, but currently suspect some sort of infinite loop or O(n^m) processing where m is 2 or more. No amount of server hardware will help.
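As a rough illustration of why hardware does not rescue super-linear processing, here is a sketch for the m = 2 case using a naive all-pairs pass. The functions and the budget model are hypothetical and have nothing to do with the actual server code.

```python
def pairwise_checks(n: int) -> int:
    """Work units for a naive all-pairs pass over n entities (O(n^2))."""
    return n * (n - 1) // 2


def entities_supported(budget_checks: int) -> int:
    """Largest n whose all-pairs pass still fits in the per-tick budget."""
    n = 0
    while pairwise_checks(n + 1) <= budget_checks:
        n += 1
    return n


# Suppose the current hardware can just afford the all-pairs pass for 1000 entities.
budget = pairwise_checks(1000)
print(entities_supported(budget))       # same hardware
print(entities_supported(2 * budget))   # hardware twice as fast
```

In this toy model, doubling the per-tick budget raises the supported entity count from 1000 to only 1414, and it gets worse as m grows.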

2 Likes

[quote=“Axhind, post:6, topic:398191, full:true, username:Axhind”]
Interesting changes.[/quote]

There’s a lot more unsaid than said. It’s interesting…

Proven reliability in datacentres. I love AMD myself (Ryzen 3700x in my home PC), but I’ve also worked for one of the big Tier 1 server and storage hardware vendors, and I can tell you that Intel vs AMD is about a 95:5 split in the corporate world for general purpose workloads (AMD’s big where parallelism is key, Intel don’t do 128-core CPUs - but then you get a 360W TDP with an Epyc 128-core CPU, and that takes some cooling…).

So a code issue? What were the stats on the node/physical hardware running - did it run out of RAM, or was CPU an issue?

I’m still astounded that in an age where 16-core and 20-core CPUs are readily available (and even 32-core CPUs can run at over 3GHz, though those do have a killer TDP) and RAM is stupidly cheap (2x24-core servers with 768GB were commonly available 5+ years ago, and there are 4th Gen Xeon Scalable Gold CPUs that have 16 cores @ 3.6GHz and 3rd Gen at 3.1GHz), 8-core CPUs are being run - is the server code optimised for single-threaded performance rather than parallel execution (which is generally how you track more things at once…)?

2 Likes

There’s not much to track in parallel. Each scene tick is a single thread. It’s much easier and more cost effective to run a thread per scene, than to try and split scene processing between multiple threads.
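A minimal sketch of the thread-per-scene idea, for illustration only. The `Scene` class, its fields and the helper functions are hypothetical, and the real server is certainly not structured like this toy.

```python
import threading


class Scene:
    """Hypothetical stand-in for one solar system's simulation state."""

    def __init__(self, name, entities):
        self.name = name
        self.entities = entities
        self.ticks_done = 0

    def tick(self):
        # Only this scene's thread ever touches this state, so no locks are needed.
        for entity in self.entities:
            entity["age"] += 1
        self.ticks_done += 1


def run_scene(scene, tick_count):
    """Run every tick of a single scene sequentially on the calling thread."""
    for _ in range(tick_count):
        scene.tick()


def run_all(scenes, tick_count):
    """Start one thread per scene and wait for all of them to finish."""
    threads = [
        threading.Thread(target=run_scene, args=(scene, tick_count))
        for scene in scenes
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

The point of the layout is ownership rather than speed: each scene’s state has exactly one writer, so there is nothing to synchronise. Splitting a single scene across threads would mean adding locking or partitioning inside `tick`, which is where the cost and complexity come from.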

2 Likes

Sure, where there’s a limited number of items. But when you’re talking about 5000 undocked players in a system, and each sub-cap launches 5 drones and each carrier/super launches fighters, and a structure or two or three starts launching things like bombs and missiles and then they all have to be tracked (along with missiles from ships - there were plenty of Muninns yesterday) - that’s a LOT of things to track in a single thread!
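Back-of-the-envelope, using the numbers above. The missiles-in-flight parameter is a made-up knob for illustration, and fighters, bombs and structures are left out entirely.

```python
def tracked_objects(pilots: int, drones_per_ship: int = 5,
                    missiles_in_flight_per_ship: int = 0) -> int:
    """Rough count of objects in one scene: each ship plus what it has launched."""
    return pilots * (1 + drones_per_ship + missiles_in_flight_per_ship)


print(tracked_objects(5000))                                 # ships + drones only
print(tracked_objects(5000, missiles_in_flight_per_ship=2))  # hypothetical 2 missiles each
```

Ships and drones alone come to 30,000 objects for one thread to step through every tick, before anything else is counted.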