Power Developer • Anagram or Levenshtein

Unanswered topics | Active topics

Board index » »

All times are UTC-06:00

Anagram or Levenshtein

Post new topic Reply to topic

Page 1 of 1

[ 6 posts ]

Previous topic | Next topic

Author

Message

gunnar

Post subject: Anagram or Levenshtein

PostPosted: Tue Oct 04, 2005 7:02 am

Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

I wonder if there are any anagram algorythms which can be vectorized.

What I need is an algorythm which returns how similar two strings are.
Levenshtein is such an algorythm but its very expensive.

Is there a similar algorythm know which can be vectorized?

Cheers
Gunnar

Top

Profile

Reply with quote

markos

Post subject: Re: Anagram or Levenshtein

PostPosted: Tue Oct 04, 2005 9:03 pm

Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

Quote:

I wonder if there are any anagram algorythms which can be vectorized.

What I need is an algorythm which returns how similar two strings are.
Levenshtein is such an algorythm but its very expensive.

what do you mean how similar? qualitative or quantitative? do you need a "distance" index like in Levenshtein? Do you need the exact same results?

Quote:

Is there a similar algorythm know which can be vectorized?

I'll try to work on sth, as one of the functions i had to vectorize is strfry(), which performs a random anagram on a string (not a word).

Konstantinos

Top

Profile

Reply with quote

gunnar

Post subject: Re: Anagram or Levenshtein

PostPosted: Tue Oct 04, 2005 9:33 pm

Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Hi Markos,

Sorry my fault, I think I need to better explain what I need and why.

Often people use a database and want to find "similar" results.
Usually the people have a string column and they are looking for records which are equal to the search string or which are nearly equal.

Such a typical db-query works on >1000 rows of strings of typically 32 char length
The query should return all rows which are equal or which have nearly equal to the search pattern.

Normally I would compile levenshtein as function into the MySQL server
and use this to find strings which have less than 2 different chars.
But levenshtein is quite CPU intensive.

Such a SQL query would look like:
SELECT * FROM customer WHERE key=somevalue AND levenshtein("Gunnar von Boehn",contact_name) <= 1;

This query would look for rows that contain a string similar to "Gunnar von Boehn" in the contatc_name column.
The <=1 means that one letter is allowed to be missing or to be more in the string.

My question basicly is if there is an algorythm know which allows us to identify
similar strings that could be vectorized to get more speed.

Cheers
Gunnar

Top

Profile

Reply with quote

markos

Post subject: Re: Anagram or Levenshtein

PostPosted: Tue Oct 04, 2005 9:57 pm

Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

Quote:

Hi Markos,
Such a typical db-query works on >1000 rows of strings of typically 32 char length
The query should return all rows which are equal or which have nearly equal to the search pattern.

For such string sizes, Altivec can be of great use, yes

I did some preliminary tests on vector permutes to find similar buffers.

Quote:

My question basicly is if there is an algorythm know which allows us to identify
similar strings that could be vectorized to get more speed.

There are different anagram algorithms that work on words or on complete strings.
For example, you want to check for both anagrams of the complete string like "Bunnar nov Goehn" or only for anagrams of its words like: "rannuB onv oeGnh" or for strings that are similar like : "Gunnarius novo Boehm"?
These are 3 different algorithms to find similar strings like that. Of course we could try to vectorize levenshtein itself, and looking at its code, i think it's possible.
Sorry for messing with your name like that, but i wanted to make this point, no offence meant

Konstantinos

Top

Profile

Reply with quote

gunnar

Post subject: Re: Anagram or Levenshtein

PostPosted: Tue Oct 04, 2005 11:22 pm

Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161

Hi Markos,

We need a routined for something like this:

You search "Gunnar von Boehn"

You should find an algoruthm
which will help us to find such typical typos:

"Gunna von Boehn" - missing "r"
"Gunner von Boehn" - "e" instead "a"
"Gunar von Boehn" - missing "n"
"Gunnar von Boen" - missing "h"
"Gunnar von Boehm" - "m" instead of "n"
"Gunnar von Doehn" - "D instead of "B"
"Gunnar von Boehnn" - "exta "n"

So basicly we need to match
- single missing chars
- single exchanged chars
- single extra chars

Maybe even something like this
"Gunnar von Bohen" - "h" and "e" have exchanged places.
- This is a char swap.
- or one missing char and one extra chars.

It would be got to be able to find results with one char differrence and maybe two chars
difference.

For some customer this problem is basicly bottlenecking their application if altivec
would be able to provide a similar result faster then scalar llevenshtein code does
then this would worth a lot.

Many thanks for your help
Gunnar

Top

Profile

Reply with quote

markos

Post subject: Re: Anagram or Levenshtein

PostPosted: Fri Oct 21, 2005 6:53 am

Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348

Quote:

For some customer this problem is basicly bottlenecking their application if altivec
would be able to provide a similar result faster then scalar llevenshtein code does
then this would worth a lot.

Hi Gunnar,

Finally, after a lot of effort and lots of silence

, I vectorized Levenshtein.
Well, actually, right now it's just reference code without any altivec code in there, so it's scalar but easily vectorizable. The nice thing is that I've split the logic into vectorizable parts, I am doing some tests now and most of the times I reproduce the result from scalar Levenshtien (most of the times, I probably have some bug in there...)

Anyway, expect a paper and the code soon, and I'm very interested to see benchmarks in MySQL from this.

PS. If anyone knows of a generic AltiVec matrix tranpose routine (that works on tables NxM, I'd be very interested as I need a matrix transpose routine for this algorithm and most I've found are for square NxN matrices. I suppose I could write my own, but thought I'd check first

Konstantinos

Top

Profile

Reply with quote

Post new topic Reply to topic

Page 1 of 1

[ 6 posts ]

Board index » »

All times are UTC-06:00

Who is online

Users browsing this forum: No registered users and 0 guests

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum