MasterIdiot's comments

I think the distribution he uses is pretty close to the one in the paper he links, "Exploiting Cloud Object Storage for High-Performance Analytics": https://www.durner.dev/app/media/papers/anyblob-vldb23.pdf


Anecdotally, Meta has pretty lax moderation of anti-Palestinian content in Hebrew, allowing tons of extremely racist/violent speech.


Yeah, probably because it's Hebrew and no one outside Israel will read it anyway.


Having worked for a SIEM vendor, I can say that all security software is extremely invasive, and most security people can probably track every action you make on company-issued devices, and that includes HTTPS decryption.


Reminds me of a guy I know openly bragging that he can watch all of his customers who installed his company's security cameras. I won't reveal his details but just imagine any cloud security camera company doing the same and you would probably be right.

I guess it's pretty much the same principle.


Yeah the question is always if the cure is better than the disease. I'm quite ambivalent on this. On the one hand I tend to agree with the "Anti AV camp" that a sufficiently maintained machine can do well when following best practices. Of course that includes SIEM which can also be run on-premise and doesn't necessarily have to decrypt traffic if it just consumes properly formatted logs.

On the other hand, there was e.g. WannaCry in 2017, where more than 200,000 systems across 150 countries, many running unpatched or unsupported Windows versions, were hit by ransomware. It shows that companies worldwide had trouble properly maintaining the life cycle of their systems. I think it's too easy to only accuse security vendors of quality problems.


Prometheus push gateway maybe? I guess it depends on limitations like battery life, bandwidth and network access (maybe it's behind a NAT or something)
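For what it's worth, here is a minimal sketch of what a push could look like from a constrained device, using only the Python standard library. It assumes a Pushgateway reachable at a URL you supply; the job name, label and metric name are made up for illustration. The path layout (`/metrics/job/<job>/<label>/<value>`) and the plain-text body follow the Pushgateway's documented HTTP API, but treat the helper names as hypothetical.

```python
import urllib.request
from urllib.parse import quote


def pushgateway_url(base_url, job, grouping=None):
    """Build /metrics/job/<job>[/<label>/<value>...] under base_url."""
    parts = [base_url.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for label, value in (grouping or {}).items():
        # Note: values containing "/" need the gateway's base64 form instead.
        parts += [quote(label, safe=""), quote(value, safe="")]
    return "/".join(parts)


def exposition_body(metrics):
    """Render {name: value} as text exposition lines, one sample per line."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


def push(base_url, job, metrics, grouping=None, timeout=5):
    """PUT replaces every metric in the group; use POST to replace by name."""
    req = urllib.request.Request(
        pushgateway_url(base_url, job, grouping),
        data=exposition_body(metrics).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


# Hypothetical usage (needs a running gateway, so it is left commented out):
# push("http://localhost:9091", "sensor", {"battery_volts": 3.7},
#      grouping={"instance": "dev1"})
```

Since the device initiates the connection, this works from behind a NAT, and one small request per reading keeps bandwidth and battery use low.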


hnswlib is written in C++ and has Python bindings (you should be able to make your own for other languages). Faiss and Annoy (by Spotify) should also provide similar functionality.

https://github.com/nmslib/hnswlib


There's one, but with some limitations (for example, only vectors of up to 1024 dimensions).

https://github.com/pgvector/pgvector


If you really think that's enough to build a real product, go for it. Even open-source companies (Elastic, Mongo, Scylla) have to build tons of infra around their core codebase in order to make it an actual cloud product.


Not that easy, the founder was a director at AWS. This is just devops/obfuscation on top of an open source library:

FAISS


Pinecone doesn’t use Faiss, nor ScaNN. We love Faiss and even teach people to use it[1]. There happens to be a sizable population of engineers who need more than what Faiss provides (like live index updates and metadata filtering, for example), and can’t be bothered or aren’t being paid to customize and manage open-source libraries all day.

[1] https://www.pinecone.io/learn/faiss/


So you guys developed and implemented state-of-the-art neural network vector search from scratch? In a year? And something better than libraries with tens of contributors over years of research?


Most vector search research teams are a lot smaller than you suggest, and haven't been around that long (e.g. the FAISS paper was published in 2017).

From public info, you can see they have at least one researcher working there. It's believable to me that they could have some new innovations, especially since the product space they're focusing on is different from other teams working on vector search. State-of-the-art for a specific set of constraints is still state-of-the-art.

However, considering how much of their edu-marketing content is posted to HN, it would be great if they could share more details about the internals of their index with the community. One of the great things about vector search is how many techniques are open sourced or documented in papers :).

Disclaimer: I work on vector search at a different company


Many very competitive vector search libraries are done by small teams.

HNSW in NMSLIB[1] is mostly 3 people's work and it's very competitive[2].

[1] https://github.com/nmslib/nmslib

[2] http://ann-benchmarks.com/glove-100-angular_10_angular.html


I actually built a similar solution supporting similar operations (including filtering by meta-data) using open-source libraries. Took me about 2 weeks net.

I can see a clientele for such a database (people who want a turnkey solution), but honestly it looks like an attempt to use a dev-ops solution to address deeper issues with problem formulation, e.g.:

1. Is there really a need to search all items in the database? Can subsampling make simple similarity comparison feasible?

2. Do the embeddings really need to have that many dimensions? Can we reduce their dimensionality and fit them in RAM?

3. Is embedding accurate enough compared to pairwise comparison? Can we formulate the problem to make the latter feasible?

I also could not find any explanation of the underlying algorithms or their accuracy, especially around meta-data filtering, which FAISS does not solve. (Happy to hear otherwise.)
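To make questions 1 and 3 concrete, here is a rough sketch of the naive baseline: filter by meta-data first, optionally subsample, then do an exact cosine comparison against whatever is left. This is not the solution mentioned above, nor how any particular product works; the `search` function, its parameters and the toy data are all hypothetical, and it uses only the Python standard library.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(items, query, k=3, where=None, sample_size=None, seed=0):
    """Exact top-k over items shaped as (id, vector, metadata dict).

    where:       optional predicate on the metadata dict (pre-filter).
    sample_size: optional cap; candidates are randomly subsampled to it.
    """
    candidates = [it for it in items if where is None or where(it[2])]
    if sample_size is not None and len(candidates) > sample_size:
        candidates = random.Random(seed).sample(candidates, sample_size)
    scored = [(cosine(vec, query), item_id) for item_id, vec, _ in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(item_id, score) for score, item_id in scored[:k]]


# Toy data, purely illustrative.
items = [
    ("a", [1.0, 0.0], {"lang": "en"}),
    ("b", [0.9, 0.1], {"lang": "de"}),
    ("c", [0.0, 1.0], {"lang": "en"}),
]
results = search(items, [1.0, 0.0], k=2, where=lambda m: m["lang"] == "en")
```

Pre-filtering keeps results exact for the filtered subset, at the cost of a linear scan; whether that is fast enough depends on how selective the filter is and how much subsampling the use case tolerates.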


Among other things: larger caches, more instructions, more cores and more buses in between them.


I've seen similar solutions being built internally in multiple companies, none with a syntax as well thought out as this. Amazing work!


While there's probably some value in posting the resources "everyone knows", Papers with Code was submitted multiple times in the past few years, which is a pretty common HN thing (whether this behavior is desired is up to dang and the community I guess).

