Re: split string by special characters

From: Andreas Wenk <a(dot)wenk(at)netzmeister-st-pauli(dot)de>
To: Jan-Erik <jan-erik(dot)larka(at)os2world(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: split string by special characters
Date: 2009-07-24 21:22:55
Message-ID: 4A6A262F.3050506@netzmeister-st-pauli.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Jan-Erik wrote:
> I wonder if you could please help me out to extract a character string
> to an array or better yet, a table.
>
> I'd like to split strings of text up into words and delimiters (but
> not delete the delimiters). The delimiters are defined as comma,
> space, dot, singe/double quotation mark, question mark etc.¹ in a
> separate table (delimiters) depending on what rules apply for the
> input.
>
> regexp_split_to_array/table seem quite suitable but I have
> difficulties to form the right expression with it, apart from that it
> remove the delimiters as well.
>
> Example:
> This is just a text that contain special characters such as , (comma),
> "(", ")" (left and right parenthesis) as well as "?" question mark.
> How do I split it up with PostgreSQL?
>
> Expected result:
> {This, " ", is, " ", just, " ", a, ..., PostgreSQL, "?" }
> __________________
> ¹) Also later on tags such as <html> and at other times something
> else depending on the circumstances.
>
> //Jan-Erik
>
Hi,

I was thinking about that and in my opinion the approach to let the
database do that is the wrong direction. Sure you can do a lot with
regexp_split_to_table or regexp_split_to_array but they are kind of
limited compared to a programming language using regular expressions. If
I had to try to get your jobdone, I would try regexp_matches() like:

SELECT regexp_matches('This is just a text, that contain special
characters such as, (comma),"(", ")" (left and right parenthesis) as
well as "?" question, mark.How do I split it up with PostgreSQL?',
E'(\\w*.)\\s+','g');

regexp_matches
----------------
{This}
{is}
{just}
{a}
{"text,"}
{that}
{contain}
{special}
{characters}
{such}
{"as,"}
{","}
{"\""}
{left}
{and}
{right}
{parenthesis)}
{as}
{well}
{as}
{"\""}
{"question,"}
{How}
{do}
{I}
{split}
{it}
{up}
{with}
(29 rows)

So, you have the ability to catch the seperators like ','. But for now,
teh example just catches the comma. But you want to catch a lot of other
seperators as well. I suggest you do that within the logic of your
coding language because I don't think this will be an easy way to walk
;-). This is no database job in my opinion.

Cheers

Andy

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Brian A. Seklecki 2009-07-24 23:02:18 Disable databse listing for non-superuser (\l) ?
Previous Message Peter Hunsberger 2009-07-24 21:13:00 Find difference between two Text fields