From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-30 14:41:40
Message-ID: 1275230500.1541.4.camel@Nokia-N900-42-11
Lists: pgsql-hackers
> Jesper Krogh <jesper(at)krogh(dot)cc> writes:
> > On 2010-05-29 15:56, Jan Urbański wrote:
> > > AFAIK statistics for everything other than tsvectors are built based
> > > on the values of whole rows.
>
> > Wouldn't it make sense to treat array types like the tsvectors?
>
> Yeah, I have a personal TODO item to look into that in the future.
There were plans to generalise the functions in ts_typanalyze and use lossy counting (LC) for array types as well. If one day I find myself with a lot of free time, I'll take a stab at that.
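For reference, here is a minimal Python sketch of the lossy counting scheme (Manku & Motwani) that ts_typanalyze is built around; the names and the epsilon parameter below are mine, for illustration, not the ones used in the C code:

    import math

    def lossy_count(stream, epsilon):
        # Track items with error bound epsilon; returns (n, {item: [count, delta]}).
        bucket_width = math.ceil(1.0 / epsilon)
        tracked = {}            # item -> [count, delta]
        n = 0                   # items seen so far
        for item in stream:
            n += 1
            if item in tracked:
                tracked[item][0] += 1
            else:
                # delta is the maximum count the item could have had
                # before we started tracking it
                current_bucket = math.ceil(n / bucket_width)
                tracked[item] = [1, current_bucket - 1]
            if n % bucket_width == 0:
                # bucket boundary: prune entries that can no longer matter
                current_bucket = n // bucket_width
                for key in [k for k, (c, d) in tracked.items()
                            if c + d <= current_bucket]:
                    del tracked[key]
        return n, tracked

    def most_common(stream, support, epsilon):
        # Items whose true frequency may reach `support` (no false negatives).
        n, tracked = lossy_count(stream, epsilon)
        return {item: c for item, (c, d) in tracked.items()
                if c >= (support - epsilon) * n}

The guarantee is that nothing with true frequency >= support is missed, at the price of also reporting some entries whose frequency is only (support - epsilon); the paper suggests choosing epsilon an order of magnitude smaller than support. Generalising this for arrays would essentially mean swapping the lexeme-specific hashing and comparisons for the element type's own functions.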
> > > The results are attached in a text (CSV) file, to preserve
> > > formatting. Based on them I'd like to propose top_stopwords and
> > > error_factor to be 100.
>
> > I know it is not perceived as the correct way to do things, but I would
> > really like to keep the "stop words" in the dataset and have
> > something that is robust to that.
>
> Any stop words would already have been eliminated in the transformation
> to tsvector (or not, if none were configured in the dictionary setup).
> We should not assume that there are any in what ts_typanalyze is seeing.
Yes, and as a side note: if you want stopwords to be indexed, just don't pass a stopword file when creating the text search dictionary (or pass a custom one).
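For example (a hypothetical setup, connection string and object names made up): with the snowball template, simply leaving out the StopWords option means nothing gets discarded, so stopwords are stemmed and indexed like any other lexeme. A quick sketch using psycopg2:

    import psycopg2

    conn = psycopg2.connect("dbname=test")   # assumed DSN
    with conn, conn.cursor() as cur:
        # No "StopWords = english" here, so stopwords end up in the
        # tsvectors like every other lexeme.
        cur.execute("""
            CREATE TEXT SEARCH DICTIONARY english_nostop (
                TEMPLATE = snowball,
                Language = english
            )
        """)
    conn.close()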
>
> I think the only relevance of stopwords to the current problem is that
> *if* stopwords have been removed, we would see a Zipfian distribution
> with the first few entries removed, and I'm not sure if it's still
> really Zipfian afterwards. However, we only need the assumption of
> Zipfianness to compute a target frequency cutoff, so it's not like
> things will be completely broken if the distribution isn't quite
> Zipfian.
That's why I was proposing s = 0.07 / (MCE-count + 10), but that probably doesn't matter much.
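For concreteness (the numbers below are mine, just to show the magnitudes involved): under Zipf's law the k-th most common lexeme has a frequency of roughly 0.07 / k, so the proposed s is the expected frequency of the last entry we still want to keep, with 10 extra slots reserved for the (possibly removed) most common words:

    # Proposed support threshold: s = 0.07 / (MCE-count + 10).
    def proposed_s(num_mcelem):
        return 0.07 / (num_mcelem + 10)

    # If num_mcelem is ten times the statistics target (an assumption here,
    # purely for illustration), the default target of 100 would give:
    print(proposed_s(10 * 100))   # ~6.9e-05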
Jan