Re: grep -f keyword data query

From: Hiroyuki Sato <hiroysato(at)gmail(dot)com>
To: Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com>
Cc: Andreas Kretschmer <andreas(at)a-kretschmer(dot)de>, pgsql-general <pgsql-general(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: grep -f keyword data query
Date: 2016-01-05 09:31:39
Message-ID: CA+Tq-RqUm+pPii60VCOxsf+K=9-oypZLth+aW3VW6g9BgfnSBQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello Arjen.

Thank you for replying.

I'll try OR query on my environment.
Thanks.

2016年1月4日(月) 23:03 Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com>:

>
> On Jan 4, 2016 09:45, "Hiroyuki Sato" <hiroysato(at)gmail(dot)com> wrote:
> >
> > Hello Arjen
> >
> > Thank you for replying.
> >
> > 2016年1月4日(月) 16:49 Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com>:
> >>
> >>
> >> On Dec 28, 2015 00:55, "Hiroyuki Sato" <hiroysato(at)gmail(dot)com> wrote:
> >> >
> >> > Hello Andreas and Tom
> >> >
> >> > Thank you for replying.
> >> >
> >> > Sorry, I re-created my questions. I was mis-pasted query log on
> previous question.
> >> > (@~ operator is PGroonga extension (http://pgroonga.github.io))
> >> > Please ignore it.
> >> >
> >> > Best regards.
> >> >
> >> > 1, Problem.
> >> > (1) Following query is exteme slow. (478sec)
> >> > SELECT
> >> > u.url
> >> > FROM
> >> > url_lists4 u,
> >> > keywords4 k
> >> > WHERE
> >> > u.url like k.url
> >> > AND
> >> > k.name = 'esc_url';
> >> >
> >> >
> >> > (2) grep -f kwd.txt sample.txt (exec time under 1sec)
> >> >
> >>
> >> These are not the same 'query'. Grep will match any of the patterns for
> each url while postgres finds all matching combinations. You need to use
> '... OR ... OR ...' or 'ANY()' in postgres to get the same result.
> >
> > I thought the following query are same meaning.
> > Could you point me same examples about two differences?
>
> If one URL matches two patterns, grep will return the match only once. In
> your SQL query it will match multiple times and will be in the result
> multiple times.
>
> You can test this by putting both 'http' and 'yahoo' in the wordlist.
>
> >
> > (1) u.url like k.url
> > (2) u.url like 'k.url 1', or u.url like 'k.url2' ...
> >
> >> > 2, Questions
> >> >
> >> > (1) Is it possible to improve this query like the command ``grep -f
> keyword data``?
> >>
> >> I get the best results by using OR of all the different patterns:
> >>
> >> SELECT url FROM url_lists4
> >> WHERE
> >> url LIKE 'http://ak.yahoo.co.jp/xwv/%'
> >> OR url LIKE 'http://ao.yahoo.co.jp/wdl/%'
> >> OR ...
> >
> >
> > I'll try it.
> >
> > BTW Do you know how many OR can I use ?
> > I have 5000 URLs.
>
> There is no real limit. I tried with your test file (5000 patterns). I got
> 6 seconds planning time and 12 seconds execution time.
>
> (I also tried making one big regular expression with url1|url2|... but
> that did fail)
>
> >
> >> In theory you could use:
> >>
> >> ... WHERE url LIKE ANY(ARRAY(SELECT url FROM keywords4 k WHERE k.name
> = 'esc_url'));
> >>
> >> but that's very slow.
> >
> > It's very interesting. It seems same query.
> > I'll check EXPLAIN
> >
> >
> >>
> >> > (2) What kind of Index should I create on url_lists table?
> >>
> >> Both btree text_pattern_ops and gin trigram on the URL.
> >>
> >> >
> >> > 3, Environment
> >> > OS: CentOS7
> >> > PostgreSQL 9.4
> >> >
> >> > 4, sample source
> >> > https://github.com/hiroyuki-sato/postgres_like_test
> >> >
> >> >
> >> > 5, Create table
> >> >
> >> > drop table if exists url_lists4;
> >> > create table url_lists4 (
> >> > id int not null primary key,
> >> > url text not null
> >> > );
> >> > create index ix_url_url_lists4 on url_lists4(url);
> >> >
> >> > drop table if exists keywords4;
> >> > create table keywords4 (
> >> > id int not null primary key,
> >> > name varchar(40) not null,
> >> > url text not null
> >> > );
> >> >
> >> > create index ix_url_keywords4 on keywords4(url);
> >> > create index ix_name_keywords4 on keywords4(name);
> >> >
> >> >
> >> > \copy url_lists4(id,url) from 'sample.txt' with delimiter ',';
> >> > \copy keywords4(id,name,url) from 'keyword.txt' with delimiter
> ',';
> >> >
> >> > vacuum url_lists4;
> >> > vacuum keywords4;
> >> > analyze url_lists4;
> >> > analyze keywords4;
> >> >
> >> > 6, Query
> >> >
> >> > EXPLAIN SELECT
> >> > u.url
> >> > FROM
> >> > url_lists4 u,
> >> > keywords4 k
> >> > WHERE
> >> > u.url like k.url
> >> > AND
> >> > k.name = 'esc_url';
> >> >
> >> > EXPLAIN ANALYZE SELECT
> >> > u.url
> >> > FROM
> >> > url_lists4 u,
> >> > keywords4 k
> >> > WHERE
> >> > u.url like k.url
> >> > AND
> >> > k.name = 'esc_url';
> >> >
> >> >
> >> > SELECT
> >> > u.url
> >> > FROM
> >> > url_lists4 u,
> >> > keywords4 k
> >> > WHERE
> >> > u.url like k.url
> >> > AND
> >> > k.name = 'esc_url';
> >> >
> >> > 7, EXPLAIN
> >> >
> >> > QUERY PLAN
>
> >> >
> -----------------------------------------------------------------------------
> >> > Nested Loop (cost=0.00..37510799.00 rows=12500000 width=57)
> >> > Join Filter: (u.url ~~ k.url)
> >> > -> Seq Scan on url_lists4 u (cost=0.00..10682.00 rows=500000
> width=57)
> >> > -> Materialize (cost=0.00..129.50 rows=5000 width=28)
> >> > -> Seq Scan on keywords4 k (cost=0.00..104.50
> rows=5000 width=28)
> >> > Filter: ((name)::text = 'esc_url'::text)
> >> > (6 rows)
> >> >
> >> > 8, EXPLAIN ANALYZE
> >> > QUERY
> PLAN
> >> >
> -----------------------------------------------------------------------------------------------------------------------------
> >> > Nested Loop (cost=0.00..37510799.00 rows=12500000 width=57)
> (actual time=6011.642..478011.117 rows=4850 loops=1)
> >> > Join Filter: (u.url ~~ k.url)
> >> > Rows Removed by Join Filter: 2499995150
> >> > -> Seq Scan on url_lists4 u (cost=0.00..10682.00 rows=500000
> width=57) (actual time=0.034..192.646 rows=500000 loops=1)
> >> > -> Materialize (cost=0.00..129.50 rows=5000 width=28)
> (actual time=0.000..0.261 rows=5000 loops=500000)
> >> > -> Seq Scan on keywords4 k (cost=0.00..104.50
> rows=5000 width=28) (actual time=0.021..1.705 rows=5000 loops=1)
> >> > Filter: ((name)::text = 'esc_url'::text)
> >> > Planning time: 0.061 ms
> >> > Execution time: 478011.773 ms
> >> > (9 rows)
> >> >
> >> >
> >> > 2015年12月28日(月) 3:39 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> >> >>
> >> >> Andreas Kretschmer <andreas(at)a-kretschmer(dot)de> writes:
> >> >> >> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> hat am 27. Dezember 2015 um 19:11
> geschrieben:
> >> >> >> What in the world is this @~ operator? And what sort of index are
> >> >> >> you using now, that can accept it? Are the rowcount estimates in
> >> >> >> the EXPLAIN output accurate? (If they are, it's hardly surprising
> >> >> >> that the query takes a long time.)
> >> >>
> >> >> > in a privat mail he called an other operator: ~~. I think, the @~
> is an error.
> >> >>
> >> >> Well, ~~ isn't directly indexable by btree indexes either, so there's
> >> >> still something wrong with either the EXPLAIN output or the claimed
> >> >> index definitions.
> >> >>
> >> >> regards, tom lane
> >
> >
> > Best regards.
> >
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Andreas Kretschmer 2016-01-05 09:57:36 Re: Streaming replication stacked.
Previous Message Andreas Kretschmer 2016-01-05 09:17:17 Re: Streaming replication stacked.