I've delayed the move away from PostgreSQL search on an open source project. Of course it's important to do the text matching quickly, but beyond that we have three concepts that are each simple enough on their own but together create a degree of complexity.
Thoughts on how to proceed are welcome. In essence we have:
1. Textual documents indexed.
2. Permissions which are essentially group based (but lots of groups, potentially hundreds), and then owners || admins || moderators get special permissions.
3. "Followed"/"Watched" items.
4. "Ignored"/"Hidden" items.
Elasticsearch remains our target, and we want to be able to reduce the results on the basis of permissions and the ignore list, whilst allowing a further restriction to "only the stuff I'm watching".
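To make the requirement concrete, here is a minimal sketch of how the three restrictions might combine in one Elasticsearch `bool` query. The field names `body` and `allowed_groups` and the function `build_search_query` are assumptions for illustration, not part of any existing schema; it also assumes owner/admin/moderator access has already been folded into `allowed_groups` at index time.

```python
def build_search_query(text, user_groups, ignored_ids, watched_ids=None):
    """Build an Elasticsearch query body (a plain dict).

    Assumed (hypothetical) mapping: each document has a `body` text field
    and an `allowed_groups` keyword field listing groups that may read it.
    """
    # Permission filter: the document must be readable by at least one
    # of the groups the user belongs to.
    filters = [{"terms": {"allowed_groups": sorted(user_groups)}}]

    # Optional restriction to "only the stuff I'm watching".
    if watched_ids is not None:
        filters.append({"ids": {"values": sorted(watched_ids)}})

    bool_query = {
        "must": [{"match": {"body": text}}],  # scored full-text match
        "filter": filters,                    # unscored restrictions
    }

    # Ignore list: drop anything the user has hidden.
    if ignored_ids:
        bool_query["must_not"] = [{"ids": {"values": sorted(ignored_ids)}}]

    return {"query": {"bool": bool_query}}
```

The returned dict could be passed as the body of a search request. Keeping the permission and watch clauses under `filter` means they restrict results without affecting relevance scoring.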
[Full disclosure: I’m a product manager at MarkLogic.]
MarkLogic can handle all of these requirements with aplomb. You can think of MarkLogic as a database built with search engine technology. It uses a document data model (text documents in XML or JSON). Each term (word, phrase, parent-child relationship, etc.) is indexed on ingest. There are index knobs and levers for things like diacritics, wildcards, and scalars, like you'd expect in a real search engine.
As for document permissions, they're indexed just like other terms. However, they’re automatically ANDed on to each query in the database engine, not in application code. MarkLogic supports role-based permissions (read, write, and execute for stored procedures) with optional Kerberos and/or LDAP auth*n. “Ignored/hidden items” are those that a user doesn’t have permissions to access.
"Followed/watched items" is a pretty common requirement. MarkLogic uses a special "reverse index" to index queries along with text, values, and structures. With regular "forward" queries, queries find documents. With reverse queries, documents find queries. Thus anything that can be expressed in a query can be turned into an alert. This provides some pretty powerful match-making where a document can express its own attributes as well as those it’s interested in matching. Hook that up to a trigger (pre- or post-commit) and you have alerting that scales to billions of documents and millions of queries. One of the world’s biggest news sites uses this infrastructure on a MarkLogic cluster to handle saved searches and alerts.
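The "documents find queries" idea can be illustrated with a toy sketch. This is not MarkLogic's API or implementation; the class `ReverseIndex` and its methods are hypothetical, and a saved query here is simply a set of terms that must all appear in a document.

```python
from collections import defaultdict


class ReverseIndex:
    """Toy reverse index: stores queries, then matches documents to them."""

    def __init__(self):
        self._queries = {}                # query_id -> frozenset of terms
        self._by_term = defaultdict(set)  # term -> ids of queries using it

    def save_query(self, query_id, required_terms):
        terms = frozenset(t.lower() for t in required_terms)
        self._queries[query_id] = terms
        for term in terms:
            self._by_term[term].add(query_id)

    def matching_queries(self, document_text):
        """Return the sorted ids of saved queries the document satisfies."""
        doc_terms = set(document_text.lower().split())
        # Only queries sharing at least one term with the document are
        # candidates, so most saved queries are never examined.
        candidates = set()
        for term in doc_terms:
            candidates |= self._by_term.get(term, set())
        return sorted(
            qid for qid in candidates if self._queries[qid] <= doc_terms
        )
```

In this sketch, calling `matching_queries` on each newly ingested document plays the role of the trigger: every returned id is a saved search whose owner could be alerted.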
Makes no sense to me. Why not have a document for a file and a document for a folder? Or if you must have a single document, why not a field for folder permissions and one for file permissions? The corresponding query would be trivial.
This seems overly complicated and I can't find any justification for it in the article.
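For concreteness, the single-document variant I have in mind might look like the sketch below. The field names `folder_groups` and `file_groups` are my own assumptions, not taken from the article, and it assumes a user must pass both the folder check and the file check.

```python
def file_visibility_filter(user_groups):
    """Filter clause: user must match the folder AND the file permissions.

    Assumes each file document carries two hypothetical keyword fields,
    `folder_groups` and `file_groups`.
    """
    groups = sorted(user_groups)
    return {
        "bool": {
            "filter": [
                {"terms": {"folder_groups": groups}},
                {"terms": {"file_groups": groups}},
            ]
        }
    }
```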
The article was maybe a bit too simplistic. What if you have more than one folder? What if a file can be located within a whole tree of folders, just like on a file system? Would the corresponding query still be trivial? Maybe there's a trivial way of doing recursion in Elasticsearch?
elasticsearch.com is working on the security problem and will offer a solution: Shield [1]
It seems that you have to pay for support to enable Shield on Elasticsearch. It is probably the same business model as for Marvel: free (as in free beer) for development, paid for production.
Based on what I read here, it seems to be the equivalent of "MySQL users/Postgres roles" applied to ES.
I don't know how they will offer the feature, but if we compare to MySQL, it wouldn't be possible to map application permissions to MySQL users/privileges. MySQL privileges are for "administration" accounts, not for the user management of the application itself.
Anyhow, Shield isn't likely to be an open source product. Tuleap, on its side, is 100% open source, so it's a no-go!