Re: grep -f keyword data query

From: Hiroyuki Sato <hiroysato(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andreas Kretschmer <andreas(at)a-kretschmer(dot)de>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: grep -f keyword data query
Date: 2015-12-29 15:21:28
Message-ID: CA+Tq-Rrm_mf_DQzwSjaLmEfoU9Yv9kcXVavxMxH+S9Kt3ZQb5A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello Jeff

Thank you for replying.

2015年12月29日(火) 4:35 Jeff Janes <jeff(dot)janes(at)gmail(dot)com>:

> On Sun, Dec 27, 2015 at 3:53 PM, Hiroyuki Sato <hiroysato(at)gmail(dot)com>
> wrote:
> > Hello Andreas and Tom
> >
> > Thank you for replying.
> >
> > Sorry, I re-created my questions. I was mis-pasted query log on previous
> > question.
> > (@~ operator is PGroonga extension (http://pgroonga.github.io))
> > Please ignore it.
> >
> > Best regards.
> >
> > 1, Problem.
> > (1) Following query is exteme slow. (478sec)
> > SELECT
> > u.url
> > FROM
> > url_lists4 u,
> > keywords4 k
> > WHERE
> > u.url like k.url
> > AND
> > k.name = 'esc_url';
> >
> >
> > (2) grep -f kwd.txt sample.txt (exec time under 1sec)
>
> Certainly not in my hands. The best I can get is 9 seconds.
>
> >
> > 2, Questions
> >
> > (1) Is it possible to improve this query like the command ``grep -f
> > keyword data``?
>
> You will not get a general tool to match a specialized tool in the
> specialized tool's own domain. fgrep is good at what fgrep does.
>

This is just same data and real data is forward proxy log.
There are no own domain. It is contain 20,000,000 log per day.

>
>
> Since your queries all have constant text strings at the beginning,
> they could use the index. But if you are not using the C collation,
> then you need build a special index:
>
> create index on url_lists4 (url text_pattern_ops);
>

Currently I just use C collation (ASCII) only.

>
> But, the planner refuses to use this index for your query anyway,
> because it can't see that the patterns are all left-anchored.
>
> Really, your best bet is refactor your url data so it is stored with a
> url_prefix and url_suffix column. Then you can do exact matching
> rather than pattern matching.
>

I see, exact matching faster than pattern matting.
But I need pattern match in path part
(ie, http://www.yahoo.com/a/b/c/... )
I would like to pattern match '/a/b/c' part.

That's why I asked this question.

If it is impossible to improve join speed, I will dump data once, and
match it with grep or something tools.

Thanks.

>
> Cheers,
>
> Jeff
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Melvin Davidson 2015-12-29 19:10:32 Re: Regex help again (sorry, I am bad at these)
Previous Message Hiroyuki Sato 2015-12-29 15:11:00 Re: grep -f keyword data query