All times are UTC-06:00




Post new topic  Reply to topic  [ 6 posts ] 
Author Message
 Post subject: Anagram or Levenshtein
PostPosted: Tue Oct 04, 2005 7:02 am 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
I wonder if there are any anagram algorythms which can be vectorized.

What I need is an algorythm which returns how similar two strings are.
Levenshtein is such an algorythm but its very expensive.

Is there a similar algorythm know which can be vectorized?

Cheers
Gunnar


Top
   
PostPosted: Tue Oct 04, 2005 9:03 pm 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
I wonder if there are any anagram algorythms which can be vectorized.

What I need is an algorythm which returns how similar two strings are.
Levenshtein is such an algorythm but its very expensive.
what do you mean how similar? qualitative or quantitative? do you need a "distance" index like in Levenshtein? Do you need the exact same results?
Quote:
Is there a similar algorythm know which can be vectorized?
I'll try to work on sth, as one of the functions i had to vectorize is strfry(), which performs a random anagram on a string (not a word).

Konstantinos


Top
   
PostPosted: Tue Oct 04, 2005 9:33 pm 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Hi Markos,

Sorry my fault, I think I need to better explain what I need and why. :)

Often people use a database and want to find "similar" results.
Usually the people have a string column and they are looking for records which are equal to the search string or which are nearly equal.

Such a typical db-query works on >1000 rows of strings of typically 32 char length
The query should return all rows which are equal or which have nearly equal to the search pattern.

Normally I would compile levenshtein as function into the MySQL server
and use this to find strings which have less than 2 different chars.
But levenshtein is quite CPU intensive.

Such a SQL query would look like:
SELECT * FROM customer WHERE key=somevalue AND levenshtein("Gunnar von Boehn",contact_name) <= 1;

This query would look for rows that contain a string similar to "Gunnar von Boehn" in the contatc_name column.
The <=1 means that one letter is allowed to be missing or to be more in the string.

My question basicly is if there is an algorythm know which allows us to identify
similar strings that could be vectorized to get more speed.


Cheers
Gunnar


Top
   
PostPosted: Tue Oct 04, 2005 9:57 pm 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
Hi Markos,
Such a typical db-query works on >1000 rows of strings of typically 32 char length
The query should return all rows which are equal or which have nearly equal to the search pattern.
For such string sizes, Altivec can be of great use, yes :-)
I did some preliminary tests on vector permutes to find similar buffers.
Quote:
My question basicly is if there is an algorythm know which allows us to identify
similar strings that could be vectorized to get more speed.
There are different anagram algorithms that work on words or on complete strings.
For example, you want to check for both anagrams of the complete string like "Bunnar nov Goehn" or only for anagrams of its words like: "rannuB onv oeGnh" or for strings that are similar like : "Gunnarius novo Boehm"?
These are 3 different algorithms to find similar strings like that. Of course we could try to vectorize levenshtein itself, and looking at its code, i think it's possible.
Sorry for messing with your name like that, but i wanted to make this point, no offence meant :-)

Konstantinos


Top
   
PostPosted: Tue Oct 04, 2005 11:22 pm 
Offline

Joined: Tue Nov 02, 2004 2:11 am
Posts: 161
Hi Markos,

We need a routined for something like this:

You search "Gunnar von Boehn"

You should find an algoruthm
which will help us to find such typical typos:

"Gunna von Boehn" - missing "r"
"Gunner von Boehn" - "e" instead "a"
"Gunar von Boehn" - missing "n"
"Gunnar von Boen" - missing "h"
"Gunnar von Boehm" - "m" instead of "n"
"Gunnar von Doehn" - "D instead of "B"
"Gunnar von Boehnn" - "exta "n"

So basicly we need to match
- single missing chars
- single exchanged chars
- single extra chars

Maybe even something like this
"Gunnar von Bohen" - "h" and "e" have exchanged places.
- This is a char swap.
- or one missing char and one extra chars.


It would be got to be able to find results with one char differrence and maybe two chars
difference.


For some customer this problem is basicly bottlenecking their application if altivec
would be able to provide a similar result faster then scalar llevenshtein code does
then this would worth a lot.

Many thanks for your help
Gunnar


Top
   
PostPosted: Fri Oct 21, 2005 6:53 am 
Offline

Joined: Wed Oct 13, 2004 7:26 am
Posts: 348
Quote:
For some customer this problem is basicly bottlenecking their application if altivec
would be able to provide a similar result faster then scalar llevenshtein code does
then this would worth a lot.
Hi Gunnar,

Finally, after a lot of effort and lots of silence :-), I vectorized Levenshtein.
Well, actually, right now it's just reference code without any altivec code in there, so it's scalar but easily vectorizable. The nice thing is that I've split the logic into vectorizable parts, I am doing some tests now and most of the times I reproduce the result from scalar Levenshtien (most of the times, I probably have some bug in there...)

Anyway, expect a paper and the code soon, and I'm very interested to see benchmarks in MySQL from this.

PS. If anyone knows of a generic AltiVec matrix tranpose routine (that works on tables NxM, I'd be very interested as I need a matrix transpose routine for this algorithm and most I've found are for square NxN matrices. I suppose I could write my own, but thought I'd check first :-)

Konstantinos


Top
   
Display posts from previous:  Sort by  
Post new topic  Reply to topic  [ 6 posts ] 

All times are UTC-06:00


Who is online

Users browsing this forum: No registered users and 13 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum

Search for:
Jump to:  
PowerDeveloper.org: Copyright © 2004-2012, Genesi USA, Inc. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.
All other names and trademarks used are property of their respective owners. Privacy Policy
Powered by phpBB® Forum Software © phpBB Group