Analyzing pixels is something ML has gotten much better in the near history :) Even if you have a very simple recommendation engine running behind the scenes, you can throw in some convolutions to utilize your data to create a feature space just like word2vec, which would give you great power to generalize. Even putting in pre-trained networks like VGG16 (instead of training one from scratch) could give you a great headstart on this.