From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, text search selectivity and dllist enhancments
Date: 2008-07-08 22:33:48
Message-ID: 4873EB4C.3070404@students.mimuw.edu.pl
Lists: pgsql-hackers
Jan Urbański wrote:
> If you think the Lossy Counting method has potential, I could test it
> somehow. Using my current work I could extract a stream of lexemes as
> ANALYZE sees it and run it through a python implementation of the
> algorithm to see if the result makes sense.
I hacked together a simplistic Python implementation and ran it on a
table with 244901 tsvectors (45624891 lexemes in total), comparing the
results from my current approach with the results I'd get from a Lossy
Counting algorithm.
I experimented with statistics_target set to 10 and 100, and ran pruning
in the LC algorithm every 3, 10 or 100 tsvectors.
The sample size with statistics_target set to 100 was 30000 rows, and
that was the input to the script: the lexemes from those 30000
tsvectors.
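The core of the script looks roughly like this (a simplified sketch of
plain Lossy Counting; the parameter names and the detail of pruning every
N tsvectors rather than every N lexemes are illustrative, not necessarily
exactly what produced the results described below):

# Sketch of Lossy Counting over a stream of tsvectors (lists of lexemes).
# The names (prune_interval, stats_target) are illustrative only.
def lossy_count(tsvectors, stats_target=100, prune_interval=10):
    counters = {}      # lexeme -> [count, delta]
    bucket = 0         # advanced at every prune
    for i, lexemes in enumerate(tsvectors):
        for lexeme in lexemes:
            if lexeme in counters:
                counters[lexeme][0] += 1
            else:
                counters[lexeme] = [1, bucket]
        if (i + 1) % prune_interval == 0:
            bucket += 1
            # drop entries whose maximum possible count can no longer
            # exceed the number of buckets seen so far
            for lexeme in list(counters):
                count, delta = counters[lexeme]
                if count + delta <= bucket:
                    del counters[lexeme]
    # report the stats_target most frequent surviving lexemes
    top = sorted(counters.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(lexeme, cd[0]) for lexeme, cd in top[:stats_target]]

Pruning on tsvector boundaries keeps the bookkeeping trivial; the buckets
just end up having slightly varying widths in terms of lexemes.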
I found that with pruning every 10 tsvectors I got precisely the same
results as with the original algorithm (same most common lexemes, same
frequencies). With pruning every 100 tsvectors the results changed very
slightly: they drifted a tiny bit further from the original algorithm's
output and were, I think, a tiny bit more accurate, but I didn't look
into that closely.
The bottom line seems to be that the Lossy Counting algorithm gives
roughly the same results as the currently used algorithm, is possibly
faster, and scales better with respect to statistics_target.
This should probably get more testing than running one script five times
over a fixed set of data, but I already had enough trouble pulling ~300
MB of tsvectors from one of my production sites, getting it onto my
laptop and so on.
Do you think it's worthwhile to implement the LC algorithm in C and post
it so others could try it out? Heck, maybe it's even worthwhile to
replace the current compute_minimal_stats() algorithm with LC and see
how that compares?
Anyway, I can share the Python script if someone would like to do some
more tests (I suppose no one will, since you first need to apply my
ts_typanalyze patch and then change it some more to extract lexemes from
the sample).
Cheers,
Jan
--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin