Proposal: Lemmy incoming federation data, queue to database INSERT into comments, votes and other large tables

RoundSparrow · edit-2 2 years ago

Proposal: Lemmy incoming federation data, queue to database INSERT into comments, votes and other large tables

RoundSparrow · 2 years ago

GItHub issue 3188 has mostly been ignored by the project, I commented on it today - because even with the fixes lemmy.world pushed into production today, PostgreSQL insert is still slow once you get significant amounts of data into the tables: https://github.com/LemmyNet/lemmy/issues/3188

RoundSparrow · 2 years ago

Ok, so attention is going to it: https://github.com/LemmyNet/lemmy/pull/3493

chiisana · 2 years ago

So glad to see this and think this is super important conversation to be had if Reddit exodus makes their way to ActivityPub platforms such as Lemmy.

The current issue is two parts:

Messages are signed for only 10 seconds; I don’t know why, but I’m hoping the change in activitypub-federation-rust (line 70) will alleviate some backed up queue issue.
Protocol itself doesn’t seem very scalable; if every action must be emitted outwards to every federated server via HTTPS POST to every applicable instance, it bears to reason that as more people embrace the Fediverse and spin up their own server, and as communities grow, the outbound message requirement will grow exponentially and be unsustainable.

Having independent queues, and message workers, all deployed as independently scalable components is going to be a big step forward, but ultimately will still impose a lot of load on the big servers such as lemmy.ml. I think on top of improvements to the implementation of ActivityPub, Lemmy needs to add additional extensions such as statically cached interval-based activity log (with tiered clumping and eventual fall off) for each community that can be requested and ingested. That is, it would be very beneficial if, for example, when a community reach a certain scale (think !technology@lemmy.ml for example), it could publish activity log of past 15 minutes, half hour, hour, day, and week. That way, even if there were missed/delayed messages, instances could “catch up” by consuming these cached files (that doesn’t even need to hit the DB).

I hope this makes sense, and I hope we see Lemmy grow further :)

RoundSparrow · 2 years ago

A GitHub issue was opened on the topic of using message queues for outbound.

https://github.com/LemmyNet/lemmy/issues/3230

I think both inbound and outbound should have it.

RoundSparrow · 2 years ago

Moving federation out of lemmy_server into an independent app and service would also allow rework of the backend without breaking federation.

Comments are the bulk data of the site, and most end users are only going to be reading what is ‘fresh’ on the site, loading data form the past 7 days. There is potential for tiered storage.

RoundSparrow · 2 years ago

The comment database table is going to have a lot of concurrency concerns, remote federated servers are going to be all connecting at the same time with INSERT transactions into that table. The primary key on that table is going to have a lot of contention, and the interactive website users should be given highest priority.

Let the HTTPS connection be accepted, get the data, queue the data to a place that does not rely on locking the comment database table, release the HTTPS connection. Then do a linear insert of those new records one at a time.