Hacker News
Show HN: We made an open-source personalization engine (github.com/metarank)
284 points by shutty on March 23, 2022 | 60 comments
Hey, HN! You probably know that the ordering of products on Amazon, posts in FB, and search results in Google is personalized for each visitor, as it directly affects conversion, click rate and engagement. But not everyone can afford to hire an army of PhDs to squeeze every penny out of the ranking, and not everyone agrees on the current (im)balance between privacy and profits.

So we built Metarank, an open-source and privacy-focused personalization engine. It can rerank in real-time any type of content, using only the data you allow, and optimize metrics you define.

We made a lot of proprietary DIY services for personalization in e-commerce in our past careers and heard so many complaints from other companies also struggling to implement personalization. It’s often considered "too risky" to spend 6+ months on an in-house moonshot project to reinvent the wheel without an experienced team and with no existing open-source tools. Like other people in the industry, we were tired of building everything from the bottom up each time we approached personalization - it should be easy not only for Amazon to do such magical ML tricks, but for everyone else.

A small demo of the tool with personalized recommendations: https://demo.metarank.ai

A blog post on how this demo was made: https://medium.com/metarank/personalizing-recommendations-wi...

The project itself: https://github.com/metarank/metarank




I’m one of the contributors to this project. The idea of the tool is to focus on typical ML feature engineering challenges. It takes a stream of business events like clicks and impressions, and computes a ton of common ML features on top:

* Parse User-Agent field, make a GeoIP lookup

* Count number of clicks over different items on multiple time windows, like 1-2-3-4 weeks

* Conversion and CTR rates

* Basic customer profiling, like “you clicked on a red item in the past, and this item is also red”

There is just a LambdaMART with xgboost inside, no rocket science. It won’t replace an in-house highly-focused solution, but building everything from scratch may take a ton of time. With Metarank you can quickly hack a good enough solution in a day, hopefully :)
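
To make the feature list above concrete, here is a minimal Python sketch of two of those features: per-item click counts over 1-2-3-4 week windows and a click-through rate. It is not Metarank's code; the `Event` shape and the names `window_counts` and `ctr` are assumptions made up for this illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds in a day


@dataclass(frozen=True)
class Event:
    item: str
    kind: str        # "impression" or "click" (hypothetical event types)
    timestamp: int   # Unix seconds


def window_counts(events, now, kind, windows_days=(7, 14, 21, 28)):
    """Per item, count events of one kind inside each trailing window."""
    counts = defaultdict(lambda: {w: 0 for w in windows_days})
    for e in events:
        age = now - e.timestamp
        if e.kind != kind or age < 0:
            continue  # wrong event type, or an event from the future
        for w in windows_days:
            if age <= w * DAY:
                counts[e.item][w] += 1
    return {item: dict(c) for item, c in counts.items()}


def ctr(events, now, window_days=7):
    """Clicks divided by impressions, per item, within one trailing window."""
    clicks = window_counts(events, now, "click", (window_days,))
    views = window_counts(events, now, "impression", (window_days,))
    return {
        item: clicks.get(item, {}).get(window_days, 0) / v[window_days]
        for item, v in views.items()
    }


now = 100 * DAY
events = [
    Event("red-shirt", "impression", now - 1 * DAY),
    Event("red-shirt", "impression", now - 2 * DAY),
    Event("red-shirt", "click", now - 1 * DAY),
    Event("red-shirt", "click", now - 10 * DAY),
    Event("blue-hat", "impression", now - 3 * DAY),
]
print(window_counts(events, now, "click"))
print(ctr(events, now))
```

In a ranking setup, values like these would be computed per item (and per visitor) and passed to the model as input columns.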


Not only could it be good enough -- it's a great reference to benchmark commercial custom solutions against! (And I say this as an engineer working on one of those commercial custom solutions!)


What are the approximate costs to train and run the service in your demo example (https://demo.metarank.ai/)?


Right now it runs in a dev-mode on a single EC2 t3.large instance with loadavg ~0.30, but the inference load is quite tiny right now: around 3-4 reranking requests per second. And yes, as a typical open-source project it still crashes from time to time :)

The training dataset is not that huge (see https://github.com/metarank/ranklens/ for details, it's open-source), so we do a full retraining directly on the node right after the deployment, and it takes around 1 minute to finish. We also run the same process inside the CI: https://github.com/metarank/metarank/blob/master/run_e2e.sh

There is an option to run this thing in a distributed mode:

* training is done using a separate batch job running on Apache Flink (and on k8s using flink's integration)

* feature updates are done in a separate streaming Flink job, writing everything in Redis

* The API fetches latest feature values from Redis and runs the ML model.

The dev-mode I've mentioned earlier is when all three of these things are bundled together in a single process to make it easier to play with the tool. But we didn't spend much time testing the distributed setup, as this thing is still a hobby side-project and we're limited in the time we can spend developing it.
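
The split described above (a streaming job that writes features, and an API that reads them and runs the model) can be sketched in a few lines of Python. Everything here is an assumption for illustration: `FeatureStore` is an in-memory stand-in for Redis, `update_features` stands in for the streaming job, and the weighted sum in `rerank` is a placeholder for the trained ranking model.

```python
from collections import defaultdict


class FeatureStore:
    """In-memory stand-in for Redis: maps (item, feature name) to a value."""

    def __init__(self):
        self._values = defaultdict(float)

    def increment(self, item, feature, amount=1.0):
        self._values[(item, feature)] += amount

    def get(self, item, feature):
        return self._values.get((item, feature), 0.0)


def update_features(store, event):
    """Streaming step: fold one incoming event into the stored counters."""
    store.increment(event["item"], event["type"] + "_count")


def rerank(store, items, weights):
    """API step: read the latest feature values and sort items by score.

    The weighted sum is only a placeholder for a real trained model.
    """
    def score(item):
        return sum(w * store.get(item, name) for name, w in weights.items())

    return sorted(items, key=score, reverse=True)


store = FeatureStore()
for ev in [
    {"item": "a", "type": "impression"},
    {"item": "b", "type": "impression"},
    {"item": "b", "type": "click"},
    {"item": "c", "type": "impression"},
    {"item": "c", "type": "click"},
    {"item": "c", "type": "click"},
]:
    update_features(store, ev)

weights = {"click_count": 1.0, "impression_count": 0.1}
ranking = rerank(store, ["a", "b", "c"], weights)
print(ranking)
```

Keeping writes and reads behind one small store interface is what would let the same logic run bundled in one process or split across separate jobs.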


From reading some of the repository and architecture overview, I think this is true, but: could you confirm that users of metarank can self-train their own models from scratch?


This is actually part of our CI process: https://github.com/metarank/metarank/blob/master/run_e2e.sh . This script runs on every PR to retrain the model used on a demo and confirms that it's working fine.

So you can just download the jar file from the releases page and run ./run_e2e.sh <jar file> in the checked-out repository; it should do the job.


Thanks!


What cloud infrastructure budget (approximate range) should we expect for an e-commerce website with 100K buyers per month and typical purchase habits? I am new to Flink. We use Redis in production.


> “you clicked on a red item in the past, and this item is also red”

Layman here: is this why I keep seeing ads for things I've already bought?


No. When the average person sees an ad for something they just bought, it increases their satisfaction with their purchase (thus making them less likely to return it.) Also, when you’ve just bought something, there is a non-zero chance you will return it and want to buy a different model of the same type of item.


Also perhaps if you made an informed purchase, the system knows you looked for information on your item but not that you bought it.


> Metarank is industry-agnostic and can be used in any place of your application where some content is displayed.

I'm afraid I'm skeptical.

Content ranking in small, well defined contexts is not hard to do and doesn't require an ML approach – rules based systems are often easier to specify, easier for both creators and users to understand, and easier to make conform to business rules.

When ML does need to be introduced, when the scale or complexity is large enough that a rules-based approach will be infeasible or worse, having a generic implementation is unlikely to return useful results. So much of the work of optimising an ML approach is engineering features out of the data that make sense and that don't introduce bias.

It's that last point that's really important because if you do the wrong feature engineering, then the bias introduced effectively means you're back to building a rules-based system, just one that has a bunch of inaccuracy built in, and where you don't understand what rules you've specified, or even that you have specified them.

I'm not an expert here, but I've worked on basic recommender systems for products, and worked with people who were far more knowledgeable about this, all of whom seemed to have a low opinion of generic systems.


Excellent. A system like this barely beats, if at all, a simple rule-based approach, which is easier to implement and easier to explain.

When the use case demands it, domain-relevant feature engineering is where the value is. A generic approach is unlikely to add any value.


BTW, accessing metarank.ai gives a warning. Maybe it's because it has Meta in its domain name, but MetaMask shows this message --

This domain is currently on the MetaMask domain warning list. This means that based on information available to us, MetaMask believes this domain could currently compromise your security and, as an added safety feature, MetaMask has restricted access to the site. To override this, please read the rest of this warning for instructions on how to continue at your own risk.


I got the same warning. I forgot I even had that thing, to be honest.

But, in saying that - what kind of filter is MetaMask using to just blatantly wipe out domains like this? Kind of on the fence on how I feel about it.


Doesn't seem to be a Metafilter.


thats ridiculous.. MetaMask puts warning on anything with Meta* in the name? good luck with the horde of metaverse startups on the way


According to the code on https://github.com/MetaMask/eth-phishing-detect/blob/45ea5cf..., it looks like everything within a Levenshtein distance of 3 from whitelisted hosts (like "metamask.*") is blocked.

Metarank and Metamask fall within that distance. I opened a ticket some time ago in their GitHub repo (https://github.com/MetaMask/eth-phishing-detect/issues/6855), but it seems to have been lost among thousands of similar tickets.
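
For readers who have not met Levenshtein distance, here is a minimal Python sketch of a fuzzy match of this kind. It is not the eth-phishing-detect implementation; `looks_like` and its `tolerance` parameter are hypothetical names for this example. On the bare names "metarank" and "metamask" it yields 2 (two substitutions), which is inside a tolerance of 3; the real detector may normalize hosts differently.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free on a match
            ))
        previous = current
    return previous[-1]


def looks_like(name: str, protected: Sequence[str], tolerance: int) -> bool:
    """Flag names that are close to, but not equal to, a protected name."""
    return any(0 < levenshtein(name, p) <= tolerance for p in protected)


print(levenshtein("metarank", "metamask"))
print(looks_like("metarank", ["metamask"], tolerance=3))
print(looks_like("github", ["metamask"], tolerance=3))
```

A rule like this catches typosquats, but it also flags any unrelated short name that happens to sit a few edits away, which is the false positive discussed in this thread.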


Yikes, best to just uninstall it then. That's insanely hostile to harmless sites.


Sorry that slipped through, I'll bring the team's attention to it.


lol, a lot of "metaXXXX" "fixes" in the PRs too... https://github.com/MetaMask/eth-phishing-detect/pulls


Thanks for bringing this up, I've created an issue in their github to unblock us


Great project! Elasticsearch / OpenSearch / Solr have their own learning to rank plugins. Have you considered integrating Metarank with such systems? Or is your vision to provide a reranker layer, that can be independent of the underlying search engine architecture?


We were considering creating a plugin for elasticsearch, but there's already one (ES-LTR) and such architecture limits the ability to create a good multi-purpose system.

We're still considering building plugins to integrate more easily with existing search technologies and will keep an eye on the demand for this.


This is super interesting!

On the demo page, nothing is happening when I try clicking on any of the buttons. I'm in a browser with no adblocking or jsblocking. Is this just the hug of death, or am I holding it wrong?


Same here, when using my default Chrome profile, with uBlock disabled.

However it seems to work in incognito.

EDIT: If you're using Metamask, I think that's the reason. After disabling it the demo worked. Also, when visiting metarank.ai from Github, I'm getting a warning containing:

  This domain is currently on the MetaMask domain warning list. This means that based on information available to us, MetaMask believes this domain could currently compromise your security and, as an added safety feature, MetaMask has restricted access to the site. To override this, please read the rest of this warning for instructions on how to continue at your own risk.
Screenshot: https://i.ibb.co/bHWTdtM/image.png


Looks like our demo is struggling with the load, typically it would display a list of movies with which you can interact.

We're looking at what we can do to revive it


same here. maybe they are being hit hard as this article reached the top 50.


What's a scenario or a method to apply a personalization engine that gives the lowest chance of making the overall UX worse?

I usually dislike personalized content. I prefer search results that accurately match my query, and I find it distracting to see suggestions or uncommon ordering (to the point that I search for Netflix movies via an external website to avoid going through their UI).


I can actually relate to this, especially when personalization is applied in search.

However, our stats and A/B test results show that personalization improves overall store conversion, CTR and other important metrics in e-commerce. And seeing how it's applied everywhere now (social networks, ads, etc.), the majority of users are engaging better with it.

Some sites opt to include a 'disable personalization' option, that might do the trick for some of the users


> However our [...] results show that personalization improves overall store conversion ...

So many questions of the form "this thing annoys me, how do I fix it" are answered by "These other people are making money annoying you, and they like it."


Honestly, personalization seems crazy to me. I can't believe how well it works and how fast I can get personalized stuff. I wouldn't know where to start to design a system to handle it. Sites like YouTube or Pixiv have so much content that it seems hard to rank it all for a single person.


That's exactly the problem we want to tackle - democratize machine learning, to make more developers and businesses apply it


I get CORS policy warnings on the demo page: Access to XMLHttpRequest at 'https://demo-api.metarank.ai:3000/movies?user=pnsar&session=...' from origin 'https://demo.metarank.ai' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.


This might have been due to the overload of the backend that we got thanks to HN :)


Could you publish the Dockerfile? When attempting to run train in Docker according to tutorial instructions I get "Cannot load library: java.lang.UnsatisfiedLinkError: /tmp/lightgbm[123]/lib_lightgbm.so: libgomp.so.1"

I logged into the container as root and ran "apt-get update && apt-get install libgomp1" and then training worked, but it'd be nice to be able to view/tweak your existing Dockerfile if/when needed. Thanks, cool project!


Thanks for bringing up the issue, we will take a look!

As for the Dockerfile, it's built via sbt at the moment, but we will look at adding it to the repo as well.


So cool, would love to see this integrated with https://github.com/medusajs/medusa


Thanks for the tip, we will take a look!


This is cool. Not having read the behavior, I expected the demo to allow me to downvote films as well.

Partly, because there is a settings icon overlay. Maybe I’m missing something on mobile.

Also, it drove me nuts that I couldn’t like only the first matrix film.


Generally it's possible to implement any type of behaviour as you control what features and actions are taken into account. The demo is not really optimized for mobile as the goal, apart from clicks on the movies, is to show the feature values that are being changed real time.


Very cool - I haven't had time to peruse the offering or code, but it seems like a very needed tool for industries and small businesses which don't have the resources to make it happen.


Let us know if you want to try it out and we will help as much as we can!

The tool might be useful for larger companies as well, as it can give a head start for the machine learning engineers as they won't need to build the tools from scratch


Hug of death on the demo app. (504 on calls to https://demo-api.metarank.ai:3000/movies)


Are there any privacy implications? i.e. you're learning to show me the best results based on my experience, what happens to that learning when I leave the site?


As people with heavy e-commerce background, we feel that the main pain point of typical old-school offline personalization solutions is that 80% of customers in medium-sized online stores are coming only once:

* you have a very short window to adapt your store, as the visitor will never come back in the future.

* even if you have zero past knowledge about a new visitor, there is still something to compare with other similar visitors: are they from mobile? Is it ios or android? Are they US? Is it a holiday now? Did they come from google search or facebook ad?

* this knowledge is ephemeral and makes sense only within their current session. But a visitor can still do a couple of interactions like browsing different collections of items or clicking on search results, and it can also be taken into account.

But compared to Amazon and Google, it's you who define which features should be used for the ranking and how long they are stored (see the "ttl" option on all feature extractors in our docs for details).

For example, here is the config of features used in the movie recommendations demo: https://github.com/metarank/metarank/blob/master/src/test/re... - in the most privacy-sensitive setup you can just drop all the "interacted_with" extractors and get zero private data stored for each visitor.
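
The idea of a per-feature "ttl" can be illustrated with a small Python sketch: a value is stored with an expiry time and is forgotten once that time passes. This is not Metarank's storage code or config format; `TtlFeatureStore`, its methods, and the 30-minute ttl are assumptions invented for this example.

```python
import time


class TtlFeatureStore:
    """Keeps per-visitor feature values only until their ttl elapses."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so expiry can be demonstrated
        self._data = {}      # (visitor, feature) -> (value, expires_at)

    def put(self, visitor, feature, value, ttl_seconds):
        self._data[(visitor, feature)] = (value, self._clock() + ttl_seconds)

    def get(self, visitor, feature, default=None):
        entry = self._data.get((visitor, feature))
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[(visitor, feature)]  # expired: forget the value
            return default
        return value


# A fake clock makes the expiry visible without actually waiting.
now = [0.0]
store = TtlFeatureStore(clock=lambda: now[0])
store.put("visitor-1", "platform", "ios", ttl_seconds=1800)
fresh = store.get("visitor-1", "platform")
now[0] = 1801.0
stale = store.get("visitor-1", "platform")
print(fresh, stale)
```

A short ttl keeps session-level signals useful during a visit while limiting how long anything about the visitor is retained.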


Very cool! Thanks for sharing.

Rather than an offline model, why not use an online, continuously relearning model like a Multi-Armed Bandit to do the re-ranking?


We're completely on board with you on reinforcement learning; however, we wanted to start with something simpler to build the tool faster. RL is on the plate, however!
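
As background for this exchange, here is a minimal Python sketch of an epsilon-greedy multi-armed bandit, one of the simplest online-learning strategies. It is not part of Metarank; the arm names and click rates are invented for this simulation.

```python
import random


class EpsilonGreedy:
    """Mostly pick the best-known arm, sometimes explore a random one."""

    def __init__(self, arms, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng if rng is not None else random.Random()
        self.pulls = {arm: 0 for arm in arms}
        self.mean_reward = {arm: 0.0 for arm in arms}

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))            # explore
        return max(self.mean_reward, key=self.mean_reward.get)  # exploit

    def update(self, arm, reward):
        self.pulls[arm] += 1
        n = self.pulls[arm]
        # Incremental running mean, so no reward history is stored.
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / n


# Simulated click rates for two hypothetical ranking strategies.
rng = random.Random(42)
true_ctr = {"popular-first": 0.05, "personalized": 0.15}
bandit = EpsilonGreedy(true_ctr, epsilon=0.1, rng=rng)
for _ in range(5000):
    arm = bandit.choose()
    reward = 1.0 if rng.random() < true_ctr[arm] else 0.0
    bandit.update(arm, reward)
print(bandit.pulls, bandit.mean_reward)
```

Unlike a model retrained offline, the estimates here change after every single interaction, which is the appeal of the bandit approach raised in the question.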


When I promoted dark knight it just shows all other super hero movies when I really like Nolan movies more than other action hero movies.


You can check out the features that increased, as we're using not only the director, but tags and other movie features in the model.

All of this is configurable, of course, so you can adjust the behaviour to your use case


Congrats on the launch.

It's a bit unsettling to hit the landing page and find a typo in "personalizaton made easy."


Spell checking should be a default feature in IDEs. I've seen teammates struggle to find the source of a bug, only to find a spelling error in a variable or configuration setting. I'm not the greatest touch typist, and my IDE catches double letters, missing letters, reversed letters and other mistakes all the time.


Thanks for bringing this up :)


Another one: "The actions you take will diretly affect"


Why do people use Scala?


The same question can be asked about JavaScript, but it's still one of the most popular languages in the world :) It's common wisdom to use the language you know best for an MVP - that's the main reason it's Scala.

And it's not a framework, so you don't really need to write/read any Scala to play with it.



a) Runs on the JVM. So it’s fast, solid, well supported and has the largest array of enterprise grade libraries.

b) It is one of the few languages that lets you use the same code for frontend (Scala.js), backend (Scala) and desktop (Scala Native).

c) FP and the strong type system when used intelligently can make your code simpler, cleaner and safer.

d) Libraries such as ZIO (zio.dev) make robust concurrency a breeze. Not yet seen any other language/library except for Erlang come close.


Crazy that the GitHub repo went from almost no stars to 800+ in one day.


We had 56 just before we posted on HN, we actually got blown away by the community interest



