Show HN: PageFish search algorithm (github.com/peterburk)
45 points by peterburkimsher on July 17, 2018 | 16 comments



Instead of addition, why not use an actual hashing algorithm, truncated to a certain number of bits? That'll give you better binning and give you flexibility in the filesize / number-of-files tradeoff.
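Something like this, roughly (a Node-flavoured sketch; the 12-bit / 4096-bin choice is arbitrary):

  const crypto = require('crypto');

  // Hash the term with any standard algorithm, then keep only the low
  // `bits` bits, giving 2^bits evenly distributed bins.
  function binFor(term, bits = 12) {
    const digest = crypto.createHash('sha1').update(term.toLowerCase()).digest();
    const value = digest.readUInt32BE(0);   // first 4 bytes as an integer
    return value & ((1 << bits) - 1);       // truncate to `bits` bits
  }

  // Each bin is one file, e.g. "bins/0f3.txt"
  console.log('bins/' + binFor('Geneva').toString(16).padStart(3, '0') + '.txt');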

I like the idea of using unicode characters -- have you performance tested it to see if it gives superior performance to an ASCII filename of the same number of bytes?


Binning! That's the word, not "hash distribution" like I wrote. Yes, I'd like to experiment more with different hashing algorithms. It needs to run fast on the query side.

Unicode seems great, but I noticed that filenames are restricted on various platforms, including GitHub, where uppercase and lowercase URLs are treated as equivalent. So perhaps I should consider a different subset of characters instead of the first 65536 Unicode characters.

Performance is much faster than grep, because it only reads a fraction of the file. I couldn't find a good benchmark for ElasticSearch/Lucene/Sphinx, and apparently there isn't an easy way to compare.

https://stackoverflow.com/questions/44846271/performance-com...


Whilst there are a few hashing algorithms explicitly designed to be memory- and time-intensive to make password retrieval impractical, most are really fast, even in Javascript, and so should be fine, even if not particularly lightweight. (Given there's going to be network access, I expect the hashing cost to be entirely negligible.)

In terms of performance I was more wondering whether or not "£" (one code point, two bytes) was much less expensive than "Ao" (two code points, two bytes). Just restricting yourself to ASCII might also have advantages for ensuring portability.
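For reference, both really do come out to two UTF-8 bytes; a quick check in a browser console or Node:

  // UTF-8 byte length vs. code point count
  const enc = new TextEncoder();
  console.log(enc.encode('£').length,  [...'£'].length);   // 2 bytes, 1 code point
  console.log(enc.encode('Ao').length, [...'Ao'].length);  // 2 bytes, 2 code points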

https://qntm.org/safe might also be of interest: discusses potential issues with using unicode as a high-density encoding -- base-2048 is Twitter-safe.


Making a longer hash than 2 bytes would help selectivity, and filesystems are usually very good at finding files by name.

Also, the files don't appear to be sorted by name (or I failed to see it). Searching a sorted file with binary search is logarithmic, much faster than grepping, which is linear.

Sorting can be done once, when the files are formed. BTW inserting into a sorted file should also be pretty fast (cheaper than appending and re-sorting).
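A minimal in-memory sketch of that binary search, assuming one bin file has already been downloaded and split into sorted lines (the tab-separated records here are made up):

  // Lower-bound binary search for the first line starting with `prefix`,
  // over the sorted lines of one already-downloaded bin file.
  function findFirstWithPrefix(lines, prefix) {
    let lo = 0, hi = lines.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (lines[mid] < prefix) lo = mid + 1;
      else hi = mid;
    }
    return lines[lo] && lines[lo].startsWith(prefix) ? lo : -1;
  }

  const lines = ['Basel\t47.56\t7.59', 'Bern\t46.95\t7.44',
                 'Geneva\t46.20\t6.14', 'Zurich\t47.37\t8.54'];
  console.log(findFirstWithPrefix(lines, 'Geneva'));  // 2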


The data is in the order it came in from OSMNames.

http://osmnames.org/download/

When there are two elements with the same name (e.g. Geneva), the first result is the most common one (the city where I was born, not the town in the US).


I proposed using a simple polynomial hash: https://github.com/peterburk/pagefish/issues/1
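Roughly this shape (base 31 and 65536 bins are placeholders here, not necessarily what the issue settles on):

  // Simple polynomial rolling hash: h = ((c0*B + c1)*B + c2)... mod bins.
  function polyHash(str, base = 31, bins = 65536) {
    let h = 0;
    for (const ch of str.toLowerCase()) {
      h = (h * base + ch.codePointAt(0)) % bins;
    }
    return h;   // index of the bin file, same on the index and query side
  }

  console.log(polyHash('Geneva'));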


> Server-side operations cost money. Client-side JavaScript is free.

This is a reasonable heuristic. OTOH when a page spins up fans in your laptop, or, worse, makes your browser unresponsive due to the amount of JavaScript executed, it's not as free as many users would like.

(I'm talking about the general principle, not the particular implementation.)


Obviously it is faster than grep... which is "linear search"! Compare it instead to some "hashing search" to see how good or fast it really is. For example, take SQLite's Full Text Search as a reference.

IMHO, there is no point in such a custom solution: it is too limited and too simplified for real, general use.


Using some database-backed full-text search will beat this by a huge margin.

Especially when it comes to actually working correctly.

As the author states, there are some caveats with invalid "hashes", and I managed to get false negatives on the third try.

For example, 'Baden-Württemberg' and 'Egypt' are in the file but not found by the search.

Not to mention fuzzy search, which is highly important if you are doing any kind of text search.

And setting up PostgreSQL or SQLite costs nothing compared to the cost of debugging this rather naive approach.


Some queries (e.g. San Francisco) work on localhost but not on GitHub, apparently because of case-insensitive URLs. That's probably what's up with Egypt and B-W.

A better hashing/binning algorithm should fix that.
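For example (just a sketch, not the current repo code): lowercase the term before hashing, and spell the bin filename only with characters that case folding can't collide, i.e. digits and lowercase hex.

  // Whatever hash produces binIndex, the filename uses lowercase hex only,
  // so a host treating "ABC.txt" and "abc.txt" as the same URL is harmless.
  // (Lowercase the term before hashing too, so "San Francisco" and
  // "san francisco" land in the same bin.)
  function binFileName(binIndex, bins = 4096) {
    const width = Math.ceil(Math.log2(bins) / 4);   // hex digits needed
    return binIndex.toString(16).padStart(width, '0') + '.txt';
  }

  console.log(binFileName(250));  // "0fa.txt"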

I don't think this can compete with a database for features. It's more a proof of concept for using the filesystem as a search index, which might have some other applications when server-side isn't available (e.g. IPFS).


Why not just have one file per city name? Or use e.g. the first 2 characters of the city name as the hash, so that near misses that only differ later in the name still appear in the downloaded files.


Good idea! Even take it to the extreme: a subfolder for every character in the string.
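Roughly (untested; the depth cutoff and character cleanup are arbitrary choices):

  // "Geneva" -> "g/e/n/geneva.txt"
  function pathFor(name, depth = 3) {
    const clean = name.toLowerCase().replace(/[^a-z0-9]/g, '_');
    return clean.slice(0, depth).split('').join('/') + '/' + clean + '.txt';
  }

  console.log(pathFor('San Francisco'));  // s/a/n/san_francisco.txt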


I realised why numbers are important, rather than characters! It scales to datacenter size.

Assign IP ranges to 255 servers or VMs, and hash to base 255 (not 65536 like this example).

Then let the router do the search.
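The mapping itself is trivial, something like this (the 10.0.0.0/24 prefix and base-131 hash are made up):

  // Hash the term to one of 255 shards; the shard number doubles as the
  // last octet of a shard host in a made-up 10.0.0.0/24 block.
  function shardOctet(term) {
    let h = 0;
    for (const ch of term.toLowerCase()) h = (h * 131 + ch.codePointAt(0)) % 255;
    return h + 1;   // octets 1..255
  }

  console.log('10.0.0.' + shardOctet('Geneva'));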


This is a very vertical search algorithm; it is useful only for indexing and searching cities as shown.


I plan to use it for much more than just city names, specifically Chinese-language lyrics/subtitles/Bible/menus for Pingtype. I could also use it to speed up the word spacing, which does a lot of dictionary searching.

https://pingtype.github.io

It's really hard to type Chinese. But that means there are no typos in the query! Therefore fuzzy searching isn't as important.


It could be made into a real meta search engine: you could have layers of searchers, a network of searchers with a common memory space, parallelize the whole thing in SequenceL, throw it on a VPS with a few hundred or thousand cores, and index the entire IPv4 space... You would probably need to use a big cloud.





