Show HN: We made an open-source personalization engine

shutty · on March 23, 2022

I’m one of the contributors to this project. The idea of the tool is to focus on typical ML feature engineering challenges. It takes a stream of business events like clicks and impressions, and computes a ton of common ML features on top:

* Parse the User-Agent field and do a GeoIP lookup

* Count clicks on different items over multiple time windows, like 1, 2, 3 and 4 weeks

* Conversion rates and CTR

* Basic customer profiling, like “you clicked on a red item in the past, and this item is also red”

There is just a LambdaMART with xgboost inside, no rocket science. It won’t replace an in-house highly-focused solution, but building everything from scratch may take a ton of time. With Metarank you can quickly hack a good enough solution in a day, hopefully :)
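To make the feature list above concrete, here is a minimal Python sketch of two of those features: click counts over several time windows and per-item CTR. It is only an illustration, not Metarank's implementation; the event tuple format and the function names are assumptions made up for this example.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical event format (an assumption): (timestamp, event_type, item_id).

def windowed_click_counts(events, now, weeks=(1, 2, 3, 4)):
    """Count clicks per item that happened within the last N weeks, for each N."""
    counts = {}
    for ts, kind, item in events:
        if kind != "click":
            continue
        age = now - ts
        for w in weeks:
            if timedelta(0) <= age <= timedelta(weeks=w):
                counts.setdefault(item, Counter())[w] += 1
    return counts

def ctr(events):
    """Click-through rate per item: clicks divided by impressions."""
    clicks, impressions = Counter(), Counter()
    for _, kind, item in events:
        if kind == "click":
            clicks[item] += 1
        elif kind == "impression":
            impressions[item] += 1
    return {item: clicks[item] / n for item, n in impressions.items() if n > 0}

now = datetime(2022, 3, 23)
events = [
    (now - timedelta(days=2), "impression", "red-shirt"),
    (now - timedelta(days=2), "click", "red-shirt"),
    (now - timedelta(days=10), "impression", "red-shirt"),
    (now - timedelta(days=10), "click", "red-shirt"),
    (now - timedelta(days=20), "impression", "red-shirt"),
    (now - timedelta(days=3), "impression", "blue-hat"),
]
click_features = windowed_click_counts(events, now)
ctr_features = ctr(events)
```

Each item ends up with one number per window plus a CTR, which is the kind of flat feature vector a ranking model can consume.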

kqr · on March 23, 2022

Not only could it be good enough -- it's a great reference to benchmark commercial custom solutions against! (And I say this as an engineer working on one of those commercial custom solutions!)

Ennergizer · on March 23, 2022

What are the approximate costs to train and run the service in your demo (https://demo.metarank.ai/)?

shutty · on March 23, 2022

Right now it runs in a dev-mode on a single EC2 t3.large instance with loadavg ~0.30, but the inference load is quite tiny right now: around 3-4 reranking requests per second. And yes, as a typical open-source project it still crashes from time to time :)

The training dataset is not that huge (see https://github.com/metarank/ranklens/ for details, it's open-source), so we do a full retraining directly on the node right after the deployment, and it takes around 1 minute to finish. We also run the same process inside the CI: https://github.com/metarank/metarank/blob/master/run_e2e.sh

There is an option to run this thing in a distributed mode:

* training is done using a separate batch job running on Apache Flink (and on k8s using flink's integration)

* feature updates are done in a separate streaming Flink job, writing everything in Redis

* The API fetches latest feature values from Redis and runs the ML model.

The dev-mode I mentioned earlier is when all three of these things are bundled together in a single process to make it easier to play with the tool. But we didn't spend much time testing the distributed setup, as this is still a hobby side project and we have limited time to develop it.
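A rough Python sketch of how the three pieces described above fit together when bundled in one process. This is an assumption-heavy toy: a plain dict stands in for Redis, the "model" is a single fixed weight rather than a trained ranker, and every name is hypothetical.

```python
# In-memory stand-in for Redis (assumption); keys are item ids.
feature_store = {}

def update_features(event, store=feature_store):
    """Streaming job: fold one event into per-item counters."""
    stats = store.setdefault(event["item"], {"clicks": 0, "impressions": 0})
    if event["type"] == "click":
        stats["clicks"] += 1
    elif event["type"] == "impression":
        stats["impressions"] += 1

def train(store):
    """Batch job stand-in: returns fixed weights.
    A real job would fit a ranking model on historical data."""
    return {"ctr": 1.0}

def rerank(items, model, store=feature_store):
    """API: fetch the latest feature values and sort candidates by model score."""
    def score(item):
        s = store.get(item, {"clicks": 0, "impressions": 0})
        item_ctr = s["clicks"] / s["impressions"] if s["impressions"] else 0.0
        return model["ctr"] * item_ctr
    return sorted(items, key=score, reverse=True)

for e in [
    {"item": "a", "type": "impression"}, {"item": "a", "type": "impression"},
    {"item": "a", "type": "click"},
    {"item": "b", "type": "impression"}, {"item": "b", "type": "impression"},
    {"item": "b", "type": "click"}, {"item": "b", "type": "click"},
]:
    update_features(e)

model = train(feature_store)
ranking = rerank(["a", "b", "c"], model)
```

In the distributed mode each function would live in its own job or service, with the shared store being the only coupling between them.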

jka · on March 23, 2022

From reading some of the repository and architecture overview, I think this is true, but: could you confirm that users of metarank can self-train their own models from scratch?

shutty · on March 23, 2022

This is actually part of our CI process: https://github.com/metarank/metarank/blob/master/run_e2e.sh . This script runs on every PR to retrain the model used on a demo and confirms that it's working fine.

So you can just download the jar file from the releases page and run ./run_e2e.sh <jar file> in the checked-out repository; it should do the job.

jka · on March 23, 2022

Thanks!

dannywarner · on March 24, 2022

What cloud infrastructure budget, as an approximate range, would an ecommerce website with 100K buyers per month and typical purchase habits need? I am new to Flink. We use Redis in production.

airstrike · on March 23, 2022

> “you clicked on a red item in the past, and this item is also red”

Layman here: is this why I keep seeing ads for things I've already bought?

AmblingAvocado · on March 23, 2022

No. When the average person sees an ad for something they just bought, it increases their satisfaction with their purchase (thus making them less likely to return it.) Also, when you’ve just bought something, there is a non-zero chance you will return it and want to buy a different model of the same type of item.

tinus_hn · on March 23, 2022

Also perhaps if you made an informed purchase, the system knows you looked for information on your item but not that you bought it.

danpalmer · on March 23, 2022

> Metarank is industry-agnostic and can be used in any place of your application where some content is displayed.

I'm afraid I'm skeptical.

Content ranking in small, well defined contexts is not hard to do and doesn't require an ML approach – rules based systems are often easier to specify, easier for both creators and users to understand, and easier to make conform to business rules.

When ML does need to be introduced, when the scale or complexity is large enough that a rules-based approach will be infeasible or worse, having a generic implementation is unlikely to return useful results. So much of the work of optimising an ML approach is engineering features out of the data that make sense and that don't introduce bias.

It's that last point that's really important because if you do the wrong feature engineering, then the bias introduced effectively means you're back to building a rules-based system, just one that has a bunch of inaccuracy built in, and where you don't understand what rules you've specified, or even that you have specified them.

I'm not an expert here, but I've worked on basic recommender systems for products, and worked with people who were far more knowledgeable about this, all of whom seemed to have a low opinion of generic systems.

naijaboiler · on March 24, 2022

Excellent. A system like this barely beats, if at all, a simple rule-based approach, which is easier to implement and easier to explain.

When the use case demands it, domain-relevant feature engineering is where the value is. A generic approach is unlikely to add any value.

Sharma · on March 23, 2022

BTW, accessing metarank.ai gives a warning. Maybe it's because it has Meta in its domain name, but MetaMask shows this message --

This domain is currently on the MetaMask domain warning list. This means that based on information available to us, MetaMask believes this domain could currently compromise your security and, as an added safety feature, MetaMask has restricted access to the site. To override this, please read the rest of this warning for instructions on how to continue at your own risk.

skilled · on March 23, 2022

I got the same warning. I forgot I even had that thing, to be honest.

But, in saying that - what kind of filter is MetaMask using to just blatantly wipe out domains like this? Kind of on the fence on how I feel about it.

prionassembly · on March 23, 2022

Doesn't seem to be a Metafilter.

swyx · on March 23, 2022

That's ridiculous. MetaMask puts a warning on anything with Meta* in the name? Good luck with the horde of metaverse startups on the way.

shutty · on March 23, 2022

According to the code on https://github.com/MetaMask/eth-phishing-detect/blob/45ea5cf..., it looks like everything within a Levenshtein distance of 3 from whitelisted hosts (like "metamask.*") is blocked.

Metarank and Metamask are at a distance of 2 (two substituted letters), which is within that threshold. I've made a ticket some time ago in their github repo (https://github.com/MetaMask/eth-phishing-detect/issues/6855), but it seems that it was lost in thousands of similar tickets.
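For reference, a minimal Python sketch of the kind of fuzzy check being described. It is not MetaMask's code: the tolerance of 3 is taken from the comment above and not verified against the library, comparing bare names without the TLD is an assumption, and the function names are hypothetical. By the standard edit-distance definition the two bare names come out two substitutions apart (r to m, n to s), inside a threshold of 3.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def looks_like(name: str, protected: str, tolerance: int = 3) -> bool:
    """Flag names that are close to, but not identical to, a protected name.
    The default tolerance is an assumption taken from the thread."""
    return name != protected and levenshtein(name, protected) <= tolerance

distance = levenshtein("metarank", "metamask")
flagged = looks_like("metarank", "metamask")
```

With a tolerance this loose on an eight-letter name, a large share of "meta"-prefixed names would be flagged, which matches the false positives reported in this thread.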

oauea · on March 23, 2022

Yikes, best to just uninstall it then. That's insanely hostile to harmless sites.

Legogris · on March 23, 2022

Sorry that slipped through, I'll bring the team's attention to it.

detaro · on March 23, 2022

lol, a lot of "metaXXXX" "fixes" in the PRs too... https://github.com/MetaMask/eth-phishing-detect/pulls

vgoloviznin · on March 23, 2022

Thanks for bringing this up, I've created an issue in their github to unblock us

dmitrykan · on March 23, 2022

Great project! Elasticsearch / OpenSearch / Solr have their own learning to rank plugins. Have you considered integrating Metarank with such systems? Or is your vision to provide a reranker layer, that can be independent of the underlying search engine architecture?

vgoloviznin · on March 24, 2022

We were considering creating a plugin for elasticsearch, but there's already one (ES-LTR) and such an architecture limits the ability to create a good multi-purpose system.

We're still considering building plugins to integrate more easily with existing search technologies and will keep an eye on the demand for this.

mushufasa · on March 23, 2022

This is super interesting!

On the demo page, nothing is happening when I try clicking on any of the buttons. I'm in a browser with no adblocking or jsblocking. Is this just the hug of death, or am I holding it wrong?

punkspider · on March 23, 2022

Same here, when using my default Chrome profile, with uBlock disabled.

However it seems to work in incognito.

EDIT: If you're using Metamask, I think that's the reason. After disabling it the demo worked. Also, when visiting metarank.ai from Github, I'm getting a warning containing:

  This domain is currently on the MetaMask domain warning list. This means that based on information available to us, MetaMask believes this domain could currently compromise your security and, as an added safety feature, MetaMask has restricted access to the site. To override this, please read the rest of this warning for instructions on how to continue at your own risk.

Screenshot: https://i.ibb.co/bHWTdtM/image.png

vgoloviznin · on March 23, 2022

Looks like our demo is struggling with the load, typically it would display a list of movies with which you can interact.

We're looking at what we can do to revive it

hirako2000 · on March 23, 2022

same here. maybe they are being hit hard as this article reached the top 50.

thih9 · on March 23, 2022

What's a scenario or a method to apply a personalization engine that gives the lowest chance of making the overall UX worse?

I usually dislike personalized content. I prefer search results that accurately match my query, and I find it distracting to see suggestions or uncommon ordering (to the point that I search for Netflix movies via an external website to avoid going through their UI).

vgoloviznin · on March 23, 2022

I can actually relate to this, especially when personalization is applied in search.

However, our stats and A/B test results show that personalization improves overall store conversion, CTR and other important metrics in ecommerce. And seeing how it's applied everywhere now (social networks, ads, etc.), the majority of users are engaging better with it.

Some sites opt to include a 'disable personalization' option, that might do the trick for some of the users

_lqaf · on March 23, 2022

> However our [...] results show that personalization improves overall store conversion ...

So many questions of the form "this thing annoys me, how do I fix it" are answered by "These other people are making money annoying you, and they like it."

charcircuit · on March 23, 2022

Honestly, personalization seems crazy to me. I can't believe how well it works and how fast I can get personalized stuff. I wouldn't know where to start to design a system to handle it. Sites like YouTube or Pixiv have so much content that it seems hard to rank it all for a single person.

vgoloviznin · on March 23, 2022

That's exactly the problem we want to tackle - democratize machine learning, to make more developers and businesses apply it

GrumpyNl · on March 23, 2022

I get cross-origin policy warnings on the demo page: Access to XMLHttpRequest at 'https://demo-api.metarank.ai:3000/movies?user=pnsar&session=...' from origin 'https://demo.metarank.ai' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.

vgoloviznin · on March 24, 2022

This might have been due to the overload of the backend that we got thanks to HN :)

themgt · on March 24, 2022

Could you publish the Dockerfile? When attempting to run train in Docker according to tutorial instructions I get "Cannot load library: java.lang.UnsatisfiedLinkError: /tmp/lightgbm[123]/lib_lightgbm.so: libgomp.so.1"

I logged into the container as root and ran "apt-get update && apt-get install libgomp1" and then training worked, but it'd be nice to be able to view/tweak your existing Dockerfile if/when needed. Thanks, cool project!

vgoloviznin · on March 24, 2022

Thanks for bringing up the issue, we will take a look!

As for the Dockerfile, it's built via sbt at the moment, but we will look at adding it to the repo as well.

sebrindom · on March 23, 2022

Soo cool would love to see this integrated with https://github.com/medusajs/medusa

vgoloviznin · on March 23, 2022

Thanks for the tip, we will take a look!

bredren · on March 24, 2022

This is cool. Not having read the behavior, I expected the demo to allow me to downvote films as well.

Partly, because there is a settings icon overlay. Maybe I’m missing something on mobile.

Also, it drove me nuts that I couldn’t like only the first matrix film.

vgoloviznin · on March 24, 2022

Generally it's possible to implement any type of behaviour, as you control which features and actions are taken into account. The demo is not really optimized for mobile, as the goal, apart from clicks on the movies, is to show the feature values changing in real time.

czbond · on March 23, 2022

Very cool - I haven't had time to peruse the offering or code, but it seems like a very needed tool for industries and small businesses which don't have the resources to make it happen.

vgoloviznin · on March 23, 2022

Let us know if you want to try it out and we will help as much as we can!

The tool might be useful for larger companies as well, as it can give a head start for the machine learning engineers as they won't need to build the tools from scratch

nwsm · on March 23, 2022

Hug of death on the demo app. (504 on calls to https://demo-api.metarank.ai:3000/movies)

orliesaurus · on March 23, 2022

Are there any privacy implications? i.e. you're learning to show me the best results based on my experience, what happens to that learning when I leave the site?

shutty · on March 23, 2022

As people with a heavy e-commerce background, we feel that the main pain point of typical old-school offline personalization solutions is that 80% of customers in medium-sized online stores come only once:

* you have a very short window to adapt your store, as the visitor will never come back in the future.

* even if you have zero past knowledge about a new visitor, there is still something to compare with other similar visitors: are they on mobile? Is it iOS or Android? Are they from the US? Is it a holiday now? Did they come from a Google search or a Facebook ad?

* this knowledge is ephemeral and makes sense only within their current session. But a visitor can still do a couple of interactions like browsing different collections of items or clicking on search results, and it can also be taken into account.

But compared to Amazon and Google, it's you who define which features should be used for the ranking and how long they are stored (see the "ttl" option on all feature extractors in our docs for details).

For example, here is https://github.com/metarank/metarank/blob/master/src/test/re... the config of features used in the movie recommendations demo - in the most privacy-sensitive setup you can just drop all the "interacted_with" extractors and get zero private data stored for each visitor.
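To illustrate the idea of a per-feature "ttl" mentioned above, here is a small Python sketch of a store where every value expires after a configured lifetime. It does not reflect Metarank's actual config format or storage code; the class, its methods and the feature names are hypothetical.

```python
import time

class TtlFeatureStore:
    """Toy per-visitor feature store where each value carries its own lifetime."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._values = {}     # (visitor, feature) -> (value, expires_at)

    def put(self, visitor, feature, value, ttl_seconds):
        self._values[(visitor, feature)] = (value, self._clock() + ttl_seconds)

    def get(self, visitor, feature, default=None):
        entry = self._values.get((visitor, feature))
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: forget the value entirely instead of just hiding it.
            del self._values[(visitor, feature)]
            return default
        return value

# Fake clock (seconds) so the example is deterministic.
now = [0.0]
store = TtlFeatureStore(clock=lambda: now[0])
store.put("visitor-1", "device", "mobile", ttl_seconds=1800)  # session-scoped
fresh = store.get("visitor-1", "device")
now[0] = 1800.0                                               # session is over
stale = store.get("visitor-1", "device")
```

A short ttl keeps session-level signals like device or referrer usable for ranking while ensuring nothing about the visitor outlives the session.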

nelsondev · on March 23, 2022

Very cool! Thanks for sharing.

Rather than an offline model, why not use an online, continuously relearning model like a Multi-Armed Bandit to do the re-ranking?

vgoloviznin · on March 23, 2022

We're completely on board with you on reinforcement learning; however, we wanted to start with something simpler to build the tool faster. RL is on the plate, however!

gizmodo59 · on March 23, 2022

When I promoted The Dark Knight, it just showed all the other superhero movies, when I really like Nolan movies more than other action hero movies.

vgoloviznin · on March 24, 2022

You can check out the features that increased, as we're using not only the director, but tags and other movie features in the model.

All of this is configurable, of course, so you can adjust the behaviour to your use case

nonoesp · on March 23, 2022

Congrats on the launch.

It's a bit unsettling to hit the landing page and find a typo in "personalizaton made easy."

joemaffei · on March 23, 2022

Spell checking should be a default feature in IDEs. I've seen teammates struggle to find the source of a bug, only to find a spelling error in a variable or configuration setting. I'm not the greatest touch typist, and my IDE catches double letters, missing letters, reversed letters and other mistakes all the time.

vgoloviznin · on March 23, 2022

Thanks for bringing this up :)

dewey · on March 23, 2022

Another one: "The actions you take will diretly affect"

minroot · on March 23, 2022

Why do people use Scala?

shutty · on March 23, 2022

The same question can be asked about JavaScript, but it's still one of the most popular languages in the world :) It's common wisdom to use the language you know best for an MVP - that's the main reason it's Scala.

And it's not a framework, so you don't really need to write/read any Scala to play with it.

gmartres · on March 23, 2022

https://www.lihaoyi.com/post/FromFirstPrinciplesWhyScala.htm...

threeseed · on March 23, 2022

a) Runs on the JVM. So it’s fast, solid, well supported and has the largest array of enterprise grade libraries.

b) It is one of the few languages that lets you use the same code for frontend (Scala.js), backend (Scala) and desktop (Scala Native).

c) FP and the strong type system when used intelligently can make your code simpler, cleaner and safer.

d) Libraries such as ZIO (zio.dev) make robust concurrency a breeze. Not yet seen any other language/library except for Erlang come close.

nonoesp · on March 24, 2022

Crazy that the GitHub repo went from almost no stars to 800+ in one day.

vgoloviznin · on March 24, 2022

We had 56 just before we posted on HN, we actually got blown away by the community interest