With a question like this, your first reaction might be to think of regular expressions. While there are common forms of misspelling, such as transposed characters ("ie" <--> "ei") and (un)doubled consonants (Cincinnati, Mississippi), this is not a tenable approach. First, you are not going to capture all of the variants, common and uncommon. Second, brand names are not necessarily dictionary words and so may not follow normal spelling rules.
If we had a set of target brands, we might be able to use edit distance to associate some variants, but "d+g" is very far away from "Dolce & Gabbana". That won't work. Besides, we don't have a limited list of brands. This is an open-ended question and gives rise to open-ended results. For instance, Dale Earnhardt Jr. has a line of glasses (who knew?) and appeared in our results.
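To make that concrete, here is a quick sketch of Levenshtein edit distance (my own helper, written for illustration, not part of the survey code). It shows why edit distance handles near-misses like "tommy hilfigger" but is hopeless for abbreviations like "d+g":

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: prev[j] holds the
    # distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]
```

A doubled consonant is one edit away from the correct spelling, but "d+g" is a dozen or more edits away from "dolce & gabbana" — no usable threshold separates the two cases.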
To get a sense of the problem, here are the variants of just Tommy Hilfiger in our dataset:
tommy helfinger
tommy hf
tommy hildfigers
tommy hilfger
tommy hilfieger
tommy hilfigar
tommy hilfiger
tommy hilfigger
tommy hilfigher
tommy hilfigur
tommy hilfigure
tommy hilfiinger
tommy hilfinger
tommy hilifiger
tommy hillfiger
tommy hillfigger
tommy hillfigur
tommy hillfigure
tommy hillfinger
Even if such an approach kind of worked and you could resolve the easy cases, leaving the remainder to resolve manually, it still is not desirable. We want to run this kind of survey at regular intervals and I'm lazy: I want to write code for the whole problem once and rerun it multiple times later. Set it and forget it.
This kind of problem is something that search engines contend with all the time. So, what would Google do? They have a sophisticated set of algorithms which associate document content, and especially link text, with target websites. They also have a ton of data and can reach out along the long tail of these variants. For my purposes, however, it doesn't matter how they do it but whether I can piggyback off their efforts.
Here was my solution to the problem:
If I type "Diane von Burstenburg" into Google, it returns:
Showing results for Diane von Furstenberg
We now have a reasonable approach. What about implementation? Google has really locked down API access. Their old JSON API is deprecated; it is still available but limited to 100 queries per day. (I used pygoogle to query it easily with
>>> from pygoogle import pygoogle
>>> g = pygoogle('ray ban')
>>> g.pages = 1
>>> g.get_urls()[0]
u'http://www.ray-ban.com/'
but was shut down by Google within minutes.) Even if you wget their results pages, they don't contain the search results, as those are all AJAXed in. I didn't want to pay for API access to their results for a small one-off project, so I went looking elsewhere. DuckDuckGo has a nice JSON API but its results are limited. I didn't feel like parsing Bing's results page. Yahoo (+ BeautifulSoup + urllib) saves the day!
The following works well, albeit slowly due to my rate limiting (sleeping 15 seconds between queries):
from bs4 import BeautifulSoup
import urllib
import time
import urlparse

f_out = open("output_terms.tsv", "w")
f = open("terms.txt", "r")  # list of terms, one per line
for line in f.readlines():
    term = line.strip()
    try:
        print term
        # quote_plus properly encodes the spaces and quote marks in the query
        page = urllib.urlopen("http://search.yahoo.com/search?p=" +
                              urllib.quote_plus('"' + term + '"'))
        soup = BeautifulSoup(page)
        # href of the first organic search result
        link = soup.find("a", {"id": "link-1"})['href']
        parsed_uri = urlparse.urlparse(link)
        domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        f_out.write(term + "\t" + link + "\t" + domain + "\n")
        time.sleep(15)
    except TypeError:
        print "ERROR with " + term
f.close()
f_out.close()
where terms.txt contained the set of unique terms after I lowercased each term and removed hyphens, ampersands and " and ".
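That preprocessing looks something like the sketch below. The exact rules are my own reconstruction — the post only names the transformations, and I've chosen to replace hyphens and ampersands with spaces rather than delete them outright:

```python
def normalize(term):
    # lowercase, then treat hyphens, ampersands and the word "and" as spaces
    t = term.lower().replace("-", " ").replace("&", " ")
    t = t.replace(" and ", " ")
    # collapse any repeated whitespace left behind
    return " ".join(t.split())

def unique_terms(raw_terms):
    # deduplicated, normalized terms, ready to write to terms.txt
    return sorted(set(normalize(t) for t in raw_terms))
```

This folds "Dolce & Gabbana", "dolce and gabbana" and "Dolce&Gabbana" into a single term before any queries are issued, which cuts down on wasted (rate-limited) requests.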
This code outputs rows such as:
under amour http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
under armer http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
under armor http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
under armour http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
underarmor http://www.underarmour.com/shop/us/en/ http://www.underarmour.com/
karl lagerfeld http://en.wikipedia.org/wiki/Karl_Lagerfeld
For these, I did
cat output_terms.tsv | grep wikipedia | awk -F '\t' '{print $1}' > wikipediaterms.txt
and reran these through my code using this query instead:
f = urllib.urlopen("http://search.yahoo.com/search?p=" + urllib.quote_plus('"' + term + '" -wikipedia'))
This worked well and Lagerfeld now maps to
karl lagerfeld http://www.karl.com/ http://www.karl.com/
(There are of course still errors: Catherine Deneuve maps to http://www.imdb.com/name/nm0000366/ http://www.imdb.com/, a perfectly reasonable response. I tried queries with '"searchterm" +glasses' for greater context, but the overall results were not great: I got lots of eBay links appearing.)
Data scientists need to be good at lateral thinking. One skill is not to focus too much on the perfect algorithmic solution to a problem but, when possible, to find a quick, cheap and dirty solution that gets you what you want. If that means a simple hack to piggyback off a huge team of search engine engineers and their enormous corpus, so much the better.