Instead of addition, why not use an actual hashing algorithm, truncated to a certain number of bits? That'll give you better binning and give you flexibility about the file-size / number-of-files tradeoff.
I like the idea of using unicode characters -- have you performance tested it to see if it gives superior performance to an ASCII filename of the same number of bytes?
Binning! That's the word, not "hash distribution" like I wrote. Yes, I'd like to experiment more with different hashing algorithms. It needs to run fast on the query side.
Unicode seems great, but I noticed that filenames are restricted on various platforms, including Github, where uppercase and lowercase URLs are equivalent. So perhaps I should consider a different subset of characters instead of the first 65536 Unicode characters.
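One possible subset, sketched below under my own assumptions (the alphabet and the binName helper are illustrative): name bins using only lowercase letters and digits, which survive case-folding URLs and are legal filenames everywhere. This is essentially base-32 with an RFC 4648-style alphabet.

```javascript
// Sketch: encode a bin index with a case-insensitive, filesystem-safe alphabet
// instead of raw Unicode code points.
const SAFE = 'abcdefghijklmnopqrstuvwxyz234567'; // 32 symbols

function binName(n, width = 4) {               // n: bin index, e.g. 0..65535
  let s = '';
  while (n > 0) { s = SAFE[n % 32] + s; n = Math.floor(n / 32); }
  return s.padStart(width, SAFE[0]);
}
```

Four such characters cover 32^4 = 1,048,576 bins, comfortably more than the 65,536 single-character Unicode bins, at the cost of slightly longer filenames.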
Performance is much faster than grep, because it only reads a fraction of the file. I couldn't find a good benchmark for ElasticSearch/Lucene/Sphinx, and apparently there isn't an easy way to compare.
Whilst there's a few hashing algorithms explicitly designed to be memory- and time-intensive to make password retrieval impractical, most are really fast, even in Javascript. (Given there's going to be network access, I expect the hashing cost to be entirely negligible.)
In terms of performance I was more wondering whether or not "£" (one code point, two bytes) was much less expensive than "Ao" (two code points, two bytes). Just restricting yourself to ASCII might also have advantages for ensuring portability.
https://qntm.org/safe might also be of interest: discusses potential issues with using unicode as a high-density encoding -- base-2048 is Twitter-safe.
Making a longer hash than 2 bytes would help selectivity, and filesystems are usually very good at finding files by name.
Also, apparently the files are not sorted by name (or I failed to see it). Searching in a sorted file using binary search is logarithmic, much faster than grepping which is linear.
Sorting can be done once, when the files are formed. BTW, inserting into an already-sorted file should also be pretty fast (cheaper than appending and re-sorting the whole thing).
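The binary-search idea above can be sketched like this (here `lines` stands in for a sorted bin file read into memory; the function name is illustrative):

```javascript
// Sketch: once a bin file's lines are sorted, lookup is O(log n)
// instead of a linear grep-style scan.
function bsearch(lines, key) {
  let lo = 0, hi = lines.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (lines[mid] === key) return mid;
    if (lines[mid] < key) lo = mid + 1; else hi = mid - 1;
  }
  return -1; // not found; on exit `lo` is the insertion point for new entries
}
```

The same search gives you the insertion point for free, which is what makes inserting into a sorted file cheaper than re-sorting.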
> Server-side operations cost money. Client-side JavaScript is free.
This is a reasonable heuristic. OTOH when a page spins up fans in your laptop, or, worse, makes your browser unresponsive due to the amount of JavaScript executed, it's not as free as many users would like.
(I'm talking about the general principle, not the particular implementation.)
Obviously it is faster than grep... which is "linear search"! Compare it instead to some "hashing search" to see how good or fast it really is. For example, take SQLite's Full Text Search as reference.
Imho, there is no point in such a custom solution: it is too limited and too simplified for real general uses.
Some queries (e.g. San Francisco) work on localhost but not Github, apparently because of case-insensitive URLs. That's probably what's up with Egypt and B-W.
A better hashing/binning algorithm should fix that.
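One simple fix, sketched here as an assumption rather than the project's actual code: fold case (and normalize Unicode) before hashing, so both casings of a query land in the same bin regardless of what the URL layer does.

```javascript
// Sketch: canonicalize the key before binning, so 'San Francisco' and
// 'SAN FRANCISCO' hash identically even behind case-insensitive URLs.
function normalizeKey(term) {
  return term.normalize('NFC').toLowerCase();
}
```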
I don't think this can compete with a database for features. It's more a proof of concept for using the filesystem as a search index, which might have some other applications when server-side isn't available (e.g. IPFS).
Why not just have one file per city name? Or use e.g. the first 2 characters of the city name as the hash, so that certain near misses later in the city name appear in the downloaded files.
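The second suggestion above is about the simplest possible binning scheme; a sketch (prefixBin is a hypothetical name), padding short names so every bin filename has the same shape:

```javascript
// Sketch: bin by the first two characters of the lowercased name, so
// near misses later in the name end up in the same downloaded file.
function prefixBin(name) {
  return name.toLowerCase().slice(0, 2).padEnd(2, '_') + '.txt';
}
```

The tradeoff: bins are human-readable and prefix near-misses share a file, but the distribution is skewed (many cities start with "sa", few with "xq").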
I plan to use it for much more than just city names, specifically Chinese-language lyrics/subtitles/Bible/menus for Pingtype. I could also use it to speed up the word spacing, which does a lot of dictionary searching.
It could be made into a real meta search engine. You could have layers of searchers, a network of searchers with a common memory space, parallelize the whole thing in SequenceL, throw it on a VPS with a few hundred/thousand cores, index the entire IPv4 space...
You would probably need to use a big cloud.