Using PostgreSQL to store URLs for a web crawler

From: Simon Connah <scopensource(at)gmail(dot)com>
To: "pgsql-novice(at)postgresql(dot)org" <pgsql-novice(at)postgresql(dot)org>
Subject: Using PostgreSQL to store URLs for a web crawler
Date: 2018-12-28 17:15:47
Message-ID: 0d3a8c63-55c7-5250-c52a-4244d18e6c92@gmail.com
Lists: pgsql-novice

First of all, apologies if this is a stupid question, but I've never used
a database for something like this and I'm not sure whether what I want
is possible with PostgreSQL.

I'm writing a simple web crawler in Python. Its first task is to get a
list of domain names and URL paths from the database, download the HTML
associated with each URL, and save that HTML in an object store.
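
For illustration, a minimal sketch of that fetch loop, assuming a
hypothetical "urls" table with id, url (unique), and crawled_at columns,
plus the psycopg2 driver and requests; all of the names here are
placeholders rather than my real schema:

import psycopg2
import requests

# placeholder connection string; in practice this would point at the RDS instance
conn = psycopg2.connect("dbname=crawler")

with conn, conn.cursor() as cur:
    # grab a batch of URLs that have not been crawled yet
    cur.execute("SELECT id, url FROM urls WHERE crawled_at IS NULL LIMIT 100")
    for url_id, url in cur.fetchall():
        html = requests.get(url, timeout=10).text
        # ... hand the HTML off to the object store here ...
        cur.execute(
            "UPDATE urls SET crawled_at = now() WHERE id = %s",
            (url_id,),
        )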

A second process then goes through the downloaded HTML, extracts all the
links on each page, and saves them to the database (only if a URL does
not already exist there), so the next time the crawler runs it picks up
even more URLs to crawl. This growth is roughly exponential, so the
number of URLs stored in the database will increase very quickly.

I'm just a bit concerned that writing so much data to PostgreSQL so
quickly would cause performance issues.

Is this something PostgreSQL could handle well? I won't be running the
PostgreSQL server myself; I'll be using Amazon RDS to host it.
