Re: How to use full-text search URL parser to filter query results by domain name?

From: hamann(dot)w(at)t-online(dot)de
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: How to use full-text search URL parser to filter query results by domain name?
Date: 2019-04-07 06:42:06
Message-ID: wolfgang-1190407084206.A0129904@laptop.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

>> I am trying to understand how to use the full-text search parser for
>> URLS and hostnames to filter results from a text field containing URLS
>> based on domain, and also how to index text columns for fast
>> lookup/matching based on domain.
>> >> I have a PostgreSQL database containing documents and links downloaded
>> by a web crawler, with the following tables:
>> >>         pages
>> >>         ----------
>> >>         id:          Integer (primary key)
>> >>         url:         String  (unique)
>> >>         title:       String
>> >>         text:        String
>> >>         html:        String
>> >>         last_visit:  DateTime
>> >>         word_pos:    TSVECTOR
>> >>         >> >>         links
>> >>         ----------
>> >>         id         Integer (primary key)
>> >>         source:    String
>> >>         target:    String  >> >>         link_text: String
>> >>         UNIQUE(source,target)
>> >>         >> >>         crawls
>> >>         ---------
>> >>         id:         Integer (primary key)
>> >>         query:      String
>> >>         >> >>         crawl_results
>> >>         -------------
>> >>         id:       Integer (primary key)
>> >>         score:    Integer (constraint 0<=score<=1)
>> >>         crawl_id: Integer (foreign key, crawls.id)
>> >>         page_id:  Integer (foreign key, pages.id)
>> >> >> The `source` and `target` fields in the `links` table contain URLs. I am
>> running the following query to extract scored links from the top-ranking
>> search results, for pages that haven't been fetched yet:
>> >>         WITH top_results AS >> >>             (SELECT page_id, score FROM crawl_results >> >>             WHERE crawl_id=$1 >> >>             ORDER BY score LIMIT 100)
>> >>         SELECT top_results.score, l.target
>> >>         FROM top_results >> >>             JOIN pages p ON top_results.page_id=p.id
>> >>             JOIN links l on p.url=l.source >> >>         WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)
>> >> >> However, *I would like to filter these results so that only one row is
>> returned for a given domain (the one with the lowest score)*. So for
>> instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8,
>> 'http://www.foo.com/zor'), I only want the first because it has same
>> domain `foo.com` and has the lower score.
>> >> I was able to find documentation for the builtin full text search
>> parsers <https://www.postgresql.org/docs/11/textsearch-parsers.html>,
>> which can parse URLS and extract the hostname. For instance, I can
>> extract the hostname from a URL as follows:
>>
Hi,

I have no real idea about solving the complete problem, and would probably try
something with a temp table first.
For extracting the hostname from a url you could use

select regex_replace('https?://(.*=)/.*', '\\1', url)

instead of the fulltext parser

Best regards
Wolfgang

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Joe Conway 2019-04-07 13:08:23 Re: Query much slower on 9.6.5 than on 9.3.5
Previous Message Jess Wren 2019-04-07 04:06:26 How to use full-text search URL parser to filter query results by domain name?