Re: grep -f keyword data query

From: Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com>
To: Hiroyuki Sato <hiroysato(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>, Andreas Kretschmer <andreas(at)a-kretschmer(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: grep -f keyword data query
Date: 2016-01-04 07:49:35
Message-ID: CAG6W84JXVLnQ6njNyM4-itC=guZBq9ScPaQuTWzxWMmrJfUz2g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Dec 28, 2015 00:55, "Hiroyuki Sato" <hiroysato(at)gmail(dot)com> wrote:
>
> Hello Andreas and Tom
>
> Thank you for replying.
>
> Sorry, I re-created my questions. I was mis-pasted query log on previous
question.
> (@~ operator is PGroonga extension (http://pgroonga.github.io))
> Please ignore it.
>
> Best regards.
>
> 1, Problem.
> (1) Following query is exteme slow. (478sec)
> SELECT
> u.url
> FROM
> url_lists4 u,
> keywords4 k
> WHERE
> u.url like k.url
> AND
> k.name = 'esc_url';
>
>
> (2) grep -f kwd.txt sample.txt (exec time under 1sec)
>

These are not the same 'query'. Grep will match any of the patterns for
each url while postgres finds all matching combinations. You need to use
'... OR ... OR ...' or 'ANY()' in postgres to get the same result.

> 2, Questions
>
> (1) Is it possible to improve this query like the command ``grep -f
keyword data``?

I get the best results by using OR of all the different patterns:

SELECT url FROM url_lists4
WHERE
url LIKE 'http://ak.yahoo.co.jp/xwv/%'
OR url LIKE 'http://ao.yahoo.co.jp/wdl/%'
OR ...

In theory you could use:

... WHERE url LIKE ANY(ARRAY(SELECT url FROM keywords4 k WHERE k.name =
'esc_url'));

but that's very slow.

> (2) What kind of Index should I create on url_lists table?

Both btree text_pattern_ops and gin trigram on the URL.

>
> 3, Environment
> OS: CentOS7
> PostgreSQL 9.4
>
> 4, sample source
> https://github.com/hiroyuki-sato/postgres_like_test
>
>
> 5, Create table
>
> drop table if exists url_lists4;
> create table url_lists4 (
> id int not null primary key,
> url text not null
> );
> create index ix_url_url_lists4 on url_lists4(url);
>
> drop table if exists keywords4;
> create table keywords4 (
> id int not null primary key,
> name varchar(40) not null,
> url text not null
> );
>
> create index ix_url_keywords4 on keywords4(url);
> create index ix_name_keywords4 on keywords4(name);
>
>
> \copy url_lists4(id,url) from 'sample.txt' with delimiter ',';
> \copy keywords4(id,name,url) from 'keyword.txt' with delimiter ',';
>
> vacuum url_lists4;
> vacuum keywords4;
> analyze url_lists4;
> analyze keywords4;
>
> 6, Query
>
> EXPLAIN SELECT
> u.url
> FROM
> url_lists4 u,
> keywords4 k
> WHERE
> u.url like k.url
> AND
> k.name = 'esc_url';
>
> EXPLAIN ANALYZE SELECT
> u.url
> FROM
> url_lists4 u,
> keywords4 k
> WHERE
> u.url like k.url
> AND
> k.name = 'esc_url';
>
>
> SELECT
> u.url
> FROM
> url_lists4 u,
> keywords4 k
> WHERE
> u.url like k.url
> AND
> k.name = 'esc_url';
>
> 7, EXPLAIN
>
> QUERY PLAN

>
-----------------------------------------------------------------------------
> Nested Loop (cost=0.00..37510799.00 rows=12500000 width=57)
> Join Filter: (u.url ~~ k.url)
> -> Seq Scan on url_lists4 u (cost=0.00..10682.00 rows=500000
width=57)
> -> Materialize (cost=0.00..129.50 rows=5000 width=28)
> -> Seq Scan on keywords4 k (cost=0.00..104.50 rows=5000
width=28)
> Filter: ((name)::text = 'esc_url'::text)
> (6 rows)
>
> 8, EXPLAIN ANALYZE
> QUERY PLAN

>
-----------------------------------------------------------------------------------------------------------------------------
> Nested Loop (cost=0.00..37510799.00 rows=12500000 width=57) (actual
time=6011.642..478011.117 rows=4850 loops=1)
> Join Filter: (u.url ~~ k.url)
> Rows Removed by Join Filter: 2499995150
> -> Seq Scan on url_lists4 u (cost=0.00..10682.00 rows=500000
width=57) (actual time=0.034..192.646 rows=500000 loops=1)
> -> Materialize (cost=0.00..129.50 rows=5000 width=28) (actual
time=0.000..0.261 rows=5000 loops=500000)
> -> Seq Scan on keywords4 k (cost=0.00..104.50 rows=5000
width=28) (actual time=0.021..1.705 rows=5000 loops=1)
> Filter: ((name)::text = 'esc_url'::text)
> Planning time: 0.061 ms
> Execution time: 478011.773 ms
> (9 rows)
>
>
> 2015年12月28日(月) 3:39 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
>>
>> Andreas Kretschmer <andreas(at)a-kretschmer(dot)de> writes:
>> >> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> hat am 27. Dezember 2015 um 19:11
geschrieben:
>> >> What in the world is this @~ operator? And what sort of index are
>> >> you using now, that can accept it? Are the rowcount estimates in
>> >> the EXPLAIN output accurate? (If they are, it's hardly surprising
>> >> that the query takes a long time.)
>>
>> > in a privat mail he called an other operator: ~~. I think, the @~ is
an error.
>>
>> Well, ~~ isn't directly indexable by btree indexes either, so there's
>> still something wrong with either the EXPLAIN output or the claimed
>> index definitions.
>>
>> regards, tom lane

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Hiroyuki Sato 2016-01-04 08:23:21 Re: grep -f keyword data query
Previous Message subhan alimy 2016-01-04 05:00:17 Re: PostgreSQL GUI tools