First, a distinction should be made between sparse data (nearly-binary) and dense data (nearly-continuous). Second, a distinction should be made between calculating similarity for an unsupervised problem and for a supervised one. Then present the most common similarity measures and variable selection algorithms for each of the four resulting types of problems.
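To make that sparse-versus-dense distinction concrete, here is a minimal sketch (the vectors are invented for illustration) contrasting a binary measure, Jaccard, with measures more commonly used on continuous data:

```python
import numpy as np
from scipy.spatial.distance import jaccard, cosine, euclidean

# Sparse, nearly-binary data (e.g. "did the user click item i?")
a = np.array([1, 0, 0, 1, 0, 1, 0, 0], dtype=bool)
b = np.array([1, 0, 1, 1, 0, 0, 0, 0], dtype=bool)

# Dense, nearly-continuous data (e.g. sensor readings)
x = np.array([0.9, 2.1, 3.0, 4.2])
y = np.array([1.0, 1.9, 3.1, 4.0])

print("Jaccard similarity (binary):    ", 1 - jaccard(a, b))
print("Cosine similarity (continuous): ", 1 - cosine(x, y))
print("Euclidean distance (continuous):", euclidean(x, y))
```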
This is a vast area of research. Diving in head first might result in serious injury. For example, the R package simba (http://cran.r-project.org/web/packages/simba/index.html) lists 56 different similarity/dissimilarity measures just for binary data.
This is a bit of an oversimplification of data mining, to the point where I am not sure it is useful. Most interesting data lives in only a subset of a large feature set, and most features are irrelevant to the similarity metric. Take movies, for example: if you tried to find similar movies using all features, key grip names and minor actors would unrealistically skew your similarity score. This relates to the "curse of dimensionality".
Many data mining approaches therefore start with a feature selection or feature extraction step, that is, an approach that finds the relevant feature subsets, or discovers the underlying features of the data set.
Reverse image search and the solution to the Netflix Prize both used feature extraction approaches.
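As a hedged illustration of feature extraction in general (not what either of those systems actually did), here is a small sketch using PCA from scikit-learn to recover a few underlying components before measuring similarity:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 items described by 50 raw features, but only ~3 directions carry signal
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# Extract a handful of underlying features; similarity would then be
# computed in this reduced space instead of on all 50 raw features.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (200, 3)
print(pca.explained_variance_ratio_.sum())  # most of the variance kept
```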
You can solve complex problems with k-nearest neighbors as long as you use an appropriate distance function. That is the beauty of it: the distance function abstracts away the complexity. I tackled a tricky biology problem by applying a cascade of similarity filters. Check out the presentation I gave at Clojure/conj 2011:
http://prezi.com/zaaoq6pjrl2z/clojure-conj-final/
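As a rough sketch of the "the distance function carries the ___domain knowledge" point (toy data and a made-up metric, not the approach from the talk), scikit-learn's k-nearest-neighbors classifier accepts a user-supplied distance function:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def my_distance(u, v):
    # Toy ___domain-specific metric: weight the first feature more heavily.
    # In practice, this is where the problem knowledge goes.
    w = np.array([3.0, 1.0, 1.0])
    return np.sqrt(np.sum(w * (u - v) ** 2))

X = np.array([[0.0, 1.0, 2.0],
              [0.1, 1.1, 1.9],
              [5.0, 5.0, 5.0],
              [5.1, 4.9, 5.2]])
y = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=1, metric=my_distance)
knn.fit(X, y)
print(knn.predict([[0.2, 1.0, 2.1]]))  # -> [0]
```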
My startup provides a very fast similarity engine (in a DB of 100 million objects it can find similar objects in under 20 milliseconds with one CPU), in case you are worried about scalability. URL: http://simmachines.com
This was useful. In my case I have a million articles and I want to group related ones together, similar to Google News. I'm guessing I can use one of these algorithms (cosine similarity) to calculate the similarity of every pair of articles and group the ones with close scores together. Any recommendations on how I should go about it? I'm trying to find Python libraries that can make this easier.
Seconding gtani -- reading up on LSA and SVD for dimensionality reduction will get you pretty far in semantic analysis, after which you can do an additional step of analysis on the reduced dimensions to better suit your needs.
I don't know the Python libraries off the top of my head, but the Python interface to R (RPy) is decent, and R is heavily used enough in most circles to have great literature written for it.
It's not entirely clear what you're asking: whether you want to cluster by topic, pick out specific named/physical entities, or maybe do sentiment analysis.
Two good first steps to look into, depending on your needs, are Bayesian classifiers and SVD (reduction of high dimensionality; the application to text processing was patented as Latent Semantic Indexing/Analysis, LSI or LSA, by IBM, though I don't know if that's lapsed).
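A minimal SVD/LSA sketch with scikit-learn (the four example documents are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "stock markets fell sharply on inflation fears",
    "central bank raises rates as inflation climbs",
    "home team wins the championship final",
    "star striker injured before the final match",
]

# Term-document matrix, then LSA via truncated SVD on top of it.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X)
print(X_lsa.shape)  # (4, 2): each article as a point in 2 latent "topics"
```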
To stick with good first-step approaches, look at n-grams and n-words.
Basically, you need a reasonable feature to match similarity on. N-grams are pretty easy to construct; a word 2-gram (bigram), for example, is every pair of adjacent words in a document.
Tf-idf is a good weighting for that kind of feature, because it handles the bias toward frequent words like "the" well.
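Tying that back to the original question, here is a minimal sketch (library choice, toy articles and the grouping threshold are all assumptions, not a recipe) of tf-idf over word 1- and 2-grams plus pairwise cosine similarity with scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "government announces new tax policy for small businesses",
    "small businesses react to the new tax policy",
    "local team clinches playoff spot with overtime win",
]

# Word unigrams + bigrams (2-grams), weighted by tf-idf.
vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(articles)

# Pairwise cosine similarities; pairs above some threshold get grouped.
sims = cosine_similarity(X)
print(sims.round(2))
```

At a million articles you would not build the full pairwise matrix; some kind of nearest-neighbor index or clustering over the tf-idf (or LSA-reduced) vectors scales better.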
Like many people, I'm not familiar with data mining for this type of matching.
For example, I want to provide "similar items" for vacation rentals, where the "dimensions" or attributes could be "___location", "bedrooms", "price", etc. It's hard to quantify anything that would surface what might be more relevant to someone based on the properties they have previously been viewing.
Instead I have just taken the approach of creating a bounding box based on the geo coordinates, and then offering up similar properties within their search price range. But I would really love to eventually implement something like your original article. (Suggestions welcome.)
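For what it's worth, one hedged way to go beyond the bounding box (the attribute names, scales and weights below are all made up) is to normalise each attribute and combine them into a single weighted distance; the properties with the smallest distance to what the visitor has been viewing become the "similar items":

```python
import math

def rental_distance(a, b, price_scale=100.0, km_scale=10.0):
    """Smaller is more similar. 'a' and 'b' are dicts with made-up keys."""
    # Rough geo distance in km (fine for nearby points).
    dlat = (a["lat"] - b["lat"]) * 111.0
    dlon = (a["lon"] - b["lon"]) * 111.0 * math.cos(math.radians(a["lat"]))
    geo = math.hypot(dlat, dlon) / km_scale
    beds = abs(a["bedrooms"] - b["bedrooms"])
    price = abs(a["price"] - b["price"]) / price_scale
    # The weights express what "similar" means for your users; tune them.
    return 2.0 * geo + 1.0 * beds + 1.5 * price

viewed = {"lat": 43.77, "lon": 11.25, "bedrooms": 2, "price": 120}
candidate = {"lat": 43.79, "lon": 11.30, "bedrooms": 3, "price": 150}
print(rental_distance(viewed, candidate))
```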
Amazon takes an indirect approach. They are not necessarily comparing items directly to offer suggestions, although they probably take categories into consideration, and that's because they have a good stream of traffic, ratings and purchases to rely on (more and better data tends to beat smarter algorithms).
Their suggestions are things like: customers who viewed this item also viewed; customers who viewed this item ended up buying; customers who bought this product also bought these other products.
That last metric in particular is interesting, because it tells you, for a given product, which complementary products customers may be interested in. So you don't actually have to somehow measure the physical properties of the objects being sold to discover relationships.
In your case I don't know enough about the problem ___domain to give advice, but "customers who viewed this deal also viewed ..." is always a great addition. Also add ratings, and follow up with people by email after they come back from their trip, asking them to rate their vacation. I don't know how well it will work; there's no general solution. You try something, and if it doesn't work, you try something else.
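A bare-bones sketch of that "also viewed / also bought" counting (the session data here is invented for illustration; at scale you would run the same count over your logs):

```python
from collections import Counter, defaultdict
from itertools import combinations

# Each session is the set of items one visitor viewed (toy data).
sessions = [
    {"villa_rome", "apt_florence", "villa_tuscany"},
    {"apt_florence", "villa_tuscany"},
    {"villa_rome", "chalet_alps"},
]

co_viewed = defaultdict(Counter)
for items in sessions:
    for a, b in combinations(sorted(items), 2):
        co_viewed[a][b] += 1
        co_viewed[b][a] += 1

# "Customers who viewed X also viewed ...", ranked by co-view count.
print(co_viewed["apt_florence"].most_common(3))
```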
Thanks for the tip, bad_user. I hadn't actually thought about trying to figure out "customers that viewed this item also viewed".
I could very easily create something that takes every visit, MapReduces it, and then tracks the entropy between potential matches to provide the "best" match based on which other properties visitors of that property also viewed.
To extend the example, it would also be really useful to know that, say, people from Germany aren't interested in the slightest in our Italian properties, based on trends in their national behaviour.
I think the overall community needs this sort of tutorial to really get value out of the vast amounts of data that we as a community collect. Everyone needs a place to start.
The "Collective Intelligence" books, by Alag and by Marmanis / Babenko, are well done (source code is java). Along with NLP texts by Jurafsky/Martin and Manning / Schütze, and Norvig/Russell AI text, and a large number good texts on data mining(1st one i bought by Witten /Frank, has been recently updated and has lots weka examples), should give you a good base. The database collection and cleaning (spidering, scraping/info extraction, database dedup/record linkage) is the part that's not as ewll documented.
On a related note, I just started the "Programming Collective Intelligence" book and I'm in the process of porting its code to Ruby. In case anyone wants to contribute: https://github.com/herval/ruby_intelligence