Quick Links

Re: UPDATE on two large datasets is very slow

From:	Listmail <lists(at)peufeu(dot)com>
To:	"Scott Marlowe" <smarlowe(at)g2switchworks(dot)com>, "Steve Gerhardt" <ocean(at)ocean(dot)fraknet(dot)org>
Cc:	"pgsql general" <pgsql-general(at)postgresql(dot)org>
Subject:	Re: UPDATE on two large datasets is very slow
Date:	2007-04-03 17:29:02
Message-ID:	op.tp7xiooezcizji@apollo13
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

> I can't help but think that the way this application writes data is
> optimized for MySQL's transactionless table type, where lots of
> simultaneous input streams writing at the same time to the same table
> would be death.
>
> Can you step back and work on how the app writes out data, so that it
> opens a persistent connection, and then sends in the updates one at a
> time, committing every couple of seconds while doing so?

No, he can't, I also coded a bittorrent tracker of the same kind, and the
problem is :

- each user has N torrents active
- torrent client does N tracker requests
- tracker only does 1 UDPATE to update user's stats

So if you do instantaneous updates you multiply your query load by N (on
average between 6 and 8).

Besides, these kinds of trackers face several problems :

- they are accessed by clients which have near saturated connections
since they're leeching illegal prOn like crazy
- therefore these HTTP connections are very slow
- therefore you have a hell of a lot of concurrent connections.

Therefore using a threaded server for this kind of load is asking for
trouble.
All decent torrent trackers are designed like lighttpd : select() /
poll() or other variants, and no threads.
No threads means, database queries are something long and to be avoided.
Hosting providers will delete your account if they see a php torrent
tracker on it, and for good reason.

600 hits/s = 600 connections = 600 apache and PG process = you cry.

Anyway my tracker was in Python with select/poll asynchronous HTTP model.
It handled 200 HTTP requests per second using 10% CPU on a Core 2. I guess
thats pretty decent.

(I do NOT work on it anymore, DO NOT ASK for sources, it is illegal now
in my country to code trackers so I have completely dropped the project,
but I guess helping a fellow living in a free country is OK)

Back to databases.

You complain that postgres is slow for your application.
Yes, it is a lot slower than MyISAM *on this application* (but try InnoDB
and cry).
But PG is a real database. It is simply not the right tool to your
application.
You have to choose between in-place updates and transactions.

(besides, your website is locked while MySQL does your big UPDATE).

Here is how you can do it :

Your problem is that you put the peers in the database.
Ask yourself why ?

You need seeders / leechers count for each torrent ?
-> Two INTEGERs in your torrents table, updated in batch by the tracker
every hour.

You need to have all peers saved somewhere so that you may exit and
restart your tracker ?
-> code your tracker in python and be able to reload running code
-> or just save it when you exit
-> or don't save it, it's not like it's your bank accounting data, who
cares
-> the FBI will be happy to have all that data when they seize your
server (see: piratebay laughs as all data was in RAM and police had to
unplug the server to seize it.)

So, DO NOT put the peers in the database.
IF you put the peers info in the database you get one UPDATE per user per
torrent.
If you only update the user stats you only get one UPDATE per user.
And the tracker never inserts users and torrents (hopefully) so you only
get UPDATES to users and to torrents tables, never inserts.

Now you need to display realtime info on the user's and torrents pages.
This is easily done : your tracker is a HTTP server, it can serve data
via HTTP (php serialized, JSON, whatever) that is inserted via AJAX of PHP
in your webpages.

From my stats my tracker needs about 100 microseconds to serve a HTTP web
page with the peer counts for a torrent.

So, you don't need Postgres for your tracker ! Use it for your website
instead...

In response to

Re: UPDATE on two large datasets is very slow at 2007-04-03 15:44:07 from Scott Marlowe

Responses

Re: UPDATE on two large datasets is very slow at 2007-04-03 23:22:24 from Steve Gerhardt

Browse pgsql-general by date

	From	Date	Subject
Next Message	Listmail	2007-04-03 17:31:22	Re: BitmapScan mishaps
Previous Message	brian	2007-04-03 17:26:32	Re: COPY FROM - how to identify results?