Please tell me how to correctly implement search (ideally in scala), such that "Lodz" will match "Łódź". Heck, making "noël" match "noël" is nontrivial. I don't think existing string types are adequate for this (e.g. java.lang.String.equals() gives exceedingly surprising results).
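For what it's worth, the "noël" case comes down to Unicode normalization: the same rendered text can be stored precomposed (ë as one code point) or decomposed (e plus a combining diaeresis), and java.lang.String.equals compares UTF-16 code units, not rendered characters. A minimal JVM sketch of the surprise and of the stock fix, java.text.Normalizer (plain Java, usable unchanged from Scala since Scala strings are java.lang.String):

```java
import java.text.Normalizer;

public class EqualsSurprise {
    public static void main(String[] args) {
        String composed   = "no\u00ebl";   // "noël" with precomposed ë (U+00EB)
        String decomposed = "noe\u0308l";  // "noël" as e + combining diaeresis (U+0308)

        // Both render identically, but equals() compares code units:
        System.out.println(composed.equals(decomposed));  // false

        // Normalizing both to the same form (here NFC) makes them comparable:
        String a = Normalizer.normalize(composed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b));                  // true
    }
}
```

Note this only solves the "noël" half of the question; "Łódź" vs "Lodz" is a different (locale-dependent) problem, since those really are different letters.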
You are looking for a distance metric on strings. If d(s1, s2) = 0, where s1 and s2 are strings, then they are equivalent.
"Łódź" may be the same as "Lodz" to you in the same way that "komputor" may be the same as "computer" to a Russian speaker. To someone else the names "Anderson" and "Andersson" are equivalent. Now you see the problem -- exact matching is futile and you should use fuzzy matching instead, like normalized Levenshtein distance, and rank the results based on similarity.
Even that is not enough if you want to support non-Latin alphabets, because they have different ideas about what a character is, but it should get you started.
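To make "fuzzy matching and rank by similarity" concrete, here is one illustrative JVM sketch (the class and method names are made up for the example): plain dynamic-programming Levenshtein distance, normalized by the longer string's length so 0 means identical and 1 means nothing in common.

```java
import java.util.Comparator;
import java.util.List;

public class FuzzyRank {
    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalize to [0, 1] by the longer string's length.
    static double normalizedDistance(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 0.0 : (double) levenshtein(a, b) / max;
    }

    public static void main(String[] args) {
        List<String> cities = List.of("Lodz", "London", "Lisbon");
        String query = "Lods";
        // Rank candidates by similarity to the query, best match first.
        cities.stream()
              .sorted(Comparator.comparingDouble((String c) -> normalizedDistance(query, c)))
              .forEach(c -> System.out.println(c + " " + normalizedDistance(query, c)));
    }
}
```

In practice you would run both sides through a Unicode normalization step first, otherwise "noël" vs "noël" still counts as an edit.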
I shouldn't have to implement this from scratch, though. This problem isn't unique to my program; programming languages should have some support for solving it in their standard libraries (or a readily available library).
> Why would the strings match? Aren't they completely different?
In some theoretical sense, yes. In terms of solving users' problems and providing business value, I need to make it possible for users to find the entry for "Łódź" without typing accents.
> (What language are we talking about, anyway?)
We're talking about Scala; it runs on the JVM and so java.lang.String is the string type.
This right here is where we start to mix up two totally different things by using the same words for them.
This thread is talking about two completely different types of Equals; we shouldn't be using the same word for them and certainly not the same function name in code:
- SomeString1.Equals(SomeString2) -> are all the bytes in array 1 equal to the bytes in array 2?
- HumanSimilarity(t1, t2) -> given text t1 and text t2, give me a number that tells me how similar a person would perceive these strings to be. You could even go further:
SimilarityForReaderInLocale(locale, t1, locale1, t2, locale2) - for a human reader in locale, given text t1 written by a human in locale1 and text t2 written by a human in locale2, how similar would the reader perceive these two pieces of text?
When we talk about 'least surprise' in text manipulation, what I think we really mean is that text manipulation should be defined by a locale -- or rather by a superset of a locale: your typical human reader in that locale.
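Incidentally, the JVM already ships a primitive close to SimilarityForReaderInLocale: java.text.Collator compares strings per locale, at a configurable strength (PRIMARY considers only base letters, ignoring case and accents). A small sketch:

```java
import java.text.Collator;
import java.util.Locale;

public class LocaleAwareCompare {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);
        // PRIMARY strength: case and accent differences are ignored.
        collator.setStrength(Collator.PRIMARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        System.out.println(collator.compare("no\u00ebl", "noel") == 0);  // true
        System.out.println(collator.compare("Noel", "noel") == 0);       // true
        System.out.println("no\u00ebl".equals("noel"));                  // false
    }
}
```

It is only binary equivalence per locale, not a similarity score, and whether "Łódź" matches "Lodz" under a given collator is less clear-cut (Ł is a distinct letter, not an accented L), so treat this as a starting point rather than the answer.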
Oh, so you're coding around your users' inability to type the letters they actually want? That sounds basically impossible... you'd have to pick letters according to how similar they look instead of their meaning in any context. And, I meant human language! :)
"impossible" isn't good enough -- I could write a bunch of hacks that would cover many of the cases, but the programming language (or ideally the Unicode standard) should provide a standard answer, rather than each programmer having to implement this themselves.
Human language? English speakers talking about cities in a variety of countries (which should be correctly named, but searchable by English speakers).
Search is always a hairy problem; you would probably store a normalized version of your searchable text and match a normalized input against that. What should actually match will be different depending on what you're doing, so I wouldn't rely on something like string.equals for it.
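As a sketch of that store-it-normalized approach (the fold function here is one made-up choice of normalization, not a standard): decompose with NFD, strip combining marks, lowercase, and index on the result. Note the gotcha that this alone does not turn "Ł" into "L" -- U+0141 is a distinct letter with no canonical decomposition, hence the explicit mapping.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class NormalizedIndex {
    // One possible "fold": decompose, strip combining marks, lowercase.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}", "")       // drop combining marks
                         .replace('\u0141', 'L')          // Ł has no decomposition,
                         .replace('\u0142', 'l')          // so map it by hand
                         .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Store the folded form alongside the original; match on the folded form.
        Map<String, String> index = new HashMap<>();
        for (String city : List.of("\u0141\u00f3d\u017a", "Z\u00fcrich", "M\u00fcnchen")) {
            index.put(fold(city), city);
        }
        System.out.println(index.get(fold("lodz")));    // Łódź
        System.out.println(index.get(fold("Zurich")));  // Zürich
    }
}
```

The hand-written Ł mapping is exactly the kind of hack the thread complains about; a fuller solution would use a transliteration table (e.g. ICU's Latin-to-ASCII transform) rather than per-letter patches.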
> Search is always a hairy problem; you would probably store a normalized version of your searchable text and match a normalized input against that. What should actually match will be different depending on what you're doing
Sure, but the language should offer support for this, even if it requires some level of configuration. It's not a problem we want every programmer solving anew for every program.
> I wouldn't rely on something like string.equals for it.
True enough. But I think the existing String.equals method is broken: it behaves very surprisingly and leads programmers to introduce bugs. Likewise e.g. String.substring (which can chop a character in half). These methods cannot be fixed (because existing code assumes their current behaviour) and are very difficult to use correctly; they should be deprecated, which in practice means a new String type.
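The substring complaint is easy to demonstrate: supplementary characters such as 𝄞 (U+1D11E) take two UTF-16 code units in a Java string, and index-based substring will happily cut between them, leaving an unpaired surrogate. The code-point-aware API avoids it:

```java
public class SubstringSurprise {
    public static void main(String[] args) {
        String s = "\uD834\uDD1Eclef";  // "𝄞clef": U+1D11E is a surrogate pair

        System.out.println(s.length());      // 6, not 5
        String broken = s.substring(0, 1);   // chops the pair: an unpaired high surrogate
        System.out.println(broken.codePointAt(0));

        // A code-point-aware cut keeps the character whole:
        int end = s.offsetByCodePoints(0, 1);
        String whole = s.substring(0, end);  // the full "𝄞"
        System.out.println(whole);
    }
}
```

Same story for String.length(), charAt(), and index arithmetic generally: they count code units, not characters, so any character above U+FFFF breaks naive slicing.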
>It's not a problem we want every programmer solving anew for every program.
Yeah, I completely agree. You can make an argument that this stuff belongs in a library (or even several libraries) rather than the language, simply because of how many decisions are involved:
- do you know what language your search input is in? can it be any of several languages? any at all? only one specific language, ever?
- what language is the text you're searching on? do you know? are you potentially searching across text in multiple languages at once?
- how exact does the match need to be? do you want phonetic matches? do you want to match characters that "look" similar but may sound different?
- do you need a binary or a fuzzy match (e.g. are you doing a search ranked by relevance)? do you need to compute some sort of Levenshtein distance?
So the use cases can range from a simple byte-level exact comparison all the way to a full search server like Solr. How much of that should the language implement? (not a rhetorical question, I actually don't think there's an obvious answer).