Designing better file organization around tags, not hierarchies (2017) (nayuki.io)
230 points by enobrev on April 5, 2018 | 161 comments



Hello everyone, thank you for all the comments. Seeing this on the HN front page caught me by surprise. In the past year I shared this article publicly (Reddit) and privately (with tech-savvy acquaintances) for comment, and the general sentiment I received was that these ideas were not ready to be read by a mass audience. The article is way too long and pulls in many disparate ideas; it explains both why traditional features are problematic and how new features would work better. In the end, it is unclear what a real implementation would look like, and what concrete benefits and annoyances would come out of real-world usage. I was hoping to build an ugly prototype before asking for feedback.

Regarding the comments on this HN thread, it seems the general discussion is around tagging. This is indeed the title of the article and the main idea that motivated my exploration, but I believe the other ideas are just as important. I explored notions like no-filenames, strong preference for hash addressing and references, location independence, immutability, backups and deduplication, preference for external (non-embedded) file metadata, first-class media libraries, and more.

I think the debate about tagging is quite adequate, and would be happy to hear comments about the other features/non-features, and whether all the ideas fit or don't fit cohesively as a system.
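
To make "hash addressing" and "deduplication" a little more concrete, here is a minimal sketch. It assumes SHA-256 as the hash and an in-memory dictionary as the store; the `ContentStore` class and its method names are hypothetical and are not taken from the article.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: a blob's key is the SHA-256 of its bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Identical bytes map to the same key, so a second copy is not stored again.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
first = store.put(b"holiday photo bytes")
second = store.put(b"holiday photo bytes")  # same content under another "name"
other = store.put(b"a different file")
```

Because the address is derived from the content, the same file reached by two routes gets one address, and a reference to that address never silently points at changed bytes.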


I think a big problem with metadata-aware “file systems” is that the metadata is lost once the file is exported out of the system. This is a problem with ID3 tags for instance.

Another problem is where you make the compromise in the no-man's land between fully fledged data structure and file system. As soon as you start adding meaningful metadata to the file system, it quickly becomes apparent that you want the files themselves to be structured data and not just opaque sequences of bytes. At that point you’re redesigning the OS since that mode of usage requires user and application buy-in. It’s just a tough design problem to make any universally applicable progress in this space and it seems like any sort of non-HFS system is destined for application specific use cases.


> the metadata is lost once the file is exported out of the system. This is a problem with ID3 tags for instance.

Are you sure? ID3 tags are embedded in the file itself, and therefore remain in the file no matter what medium you store it on.


That’s my point.


"I think a big problem with metadata-aware “file systems” is that the metadata is lost once the file is exported out of the system."

This is my thought as well. I used to run BeOS as my main OS, and when I finally had to move away from it, all the BeFS metadata was left behind as well.


> the metadata is lost once the file is exported out of the system

As someone who has implemented such a "file system" (or three) for various types of enterprise clients, some of which in turn serve it to other b2b clients of their own... I just have to say... this isn't necessarily a bad thing, and also, it's not necessarily true either. For starters, one can easily give every file a unique uuid, and map that uuid to a spreadsheet full of metadata. Additionally, a little vendor lock in to keep "special features" like management of a bespoke file system isn't necessarily a bad thing, either, if it's in your best interest to keep paying customers. Application specific use cases? Sure... but what isn't? You can build a generic abstract non-hierarchical file system though... easily.
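
A rough sketch of the "unique uuid mapped to a spreadsheet of metadata" idea, for illustration only. The `register` and `export_csv` helpers and the column names are hypothetical; a real system would persist the table somewhere rather than keep it in a dictionary.

```python
import csv
import io
import uuid


def register(table, filename, **metadata):
    """Assign a fresh UUID to a file and record its metadata under that UUID."""
    file_id = str(uuid.uuid4())
    table[file_id] = {"filename": filename, **metadata}
    return file_id


def export_csv(table, columns):
    """Render the metadata table as CSV text, one row per UUID."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["uuid"] + columns)
    writer.writeheader()
    for file_id, row in table.items():
        writer.writerow({"uuid": file_id, **{c: row.get(c, "") for c in columns}})
    return buf.getvalue()


table = {}
report_id = register(table, "q3-report.pdf", client="acme", year="2017")
sheet = export_csv(table, ["filename", "client", "year"])
```

The exported sheet can travel with the files, which is one way the metadata can survive leaving the original system.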


I would recommend looking at some of the ideas that ReiserFS was trying to do -- and some of the ideas about metadata and expanding some of the concepts of filesystems were present in their ideas as well. In particular one idea they had was allowing you to do things like SQL searches in your filesystem, using filesystem plugins (effectively the idea is to allow for database structures to be stored using filesystem plugins so that searches and operations become just VFS operations).

Obviously we know what happened to Hans Reiser, but I've always felt that some of the clever ideas in reiser4 were not fully explored because of what happened.


Hello, I am trying to make an application for my wife to manage embroideries. I encounter almost all your issues. My wife has thousands of embroideries downloaded from internet. There are many duplicates (filenames not unique because of internationalisation and special characters). She needs to add tags to help search. She also needs groups of tags (tiger belongs to animals, ...). She also has metadata (origin of the file, which license applies to which file). Sometimes she modifies an embroidery. She needs to ensure that the original file is not modified and to keep a link between the two files. Sometimes there are groups of embroideries (for example letters) or there is documentation attached to an embroidery. It is a mess and I think that your work would help a lot to handle this kind of use cases. The current paradigm of directory tree is outdated and something smarter could be done.


How large are these embroidery files? I feel like this is something that a SQL database might be able to help with. It's probably not an ideal solution, but remember that a filesystem is really nothing more than a database, which organizes files on-disk in a particular way and indexes them so that you can find them. It obviously has serious limitations due to its hierarchical nature, which is why relational databases were invented, so using existing tools it's probably quite feasible to create an application that uses a SQL database to index all the embroidery files and store all this data on them (license, whether it's a derivative of another one, and tags), and then if they're too large to just store in the DB itself, just point to files in the regular filesystem.
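
As a hedged sketch of what such an index might look like, the following uses Python's built-in `sqlite3` module. The table and column names, the sample paths, and the `files_with_tag` helper are all made up for illustration; the files themselves stay in the regular filesystem and the database only points at them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,                          -- location in the regular filesystem
    license TEXT,
    origin TEXT,
    derived_from INTEGER REFERENCES files(id)    -- link from a modified copy to its original
);
CREATE TABLE tags (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    parent_id INTEGER REFERENCES tags(id)        -- tag groups: tiger belongs to animals
);
CREATE TABLE file_tags (
    file_id INTEGER NOT NULL REFERENCES files(id),
    tag_id INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (file_id, tag_id)
);
""")
conn.execute("INSERT INTO tags (id, name, parent_id) VALUES (1, 'animals', NULL)")
conn.execute("INSERT INTO tags (id, name, parent_id) VALUES (2, 'tiger', 1)")
conn.execute(
    "INSERT INTO files (id, path, license, origin) VALUES (1, 'designs/tiger.pes', 'personal use', 'example.com')"
)
conn.execute(
    "INSERT INTO files (id, path, license, derived_from) VALUES (2, 'designs/tiger-edited.pes', 'personal use', 1)"
)
conn.executemany("INSERT INTO file_tags VALUES (?, ?)", [(1, 2), (2, 2)])


def files_with_tag(conn, tag_name):
    """Return paths tagged with tag_name or with any tag grouped beneath it."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM tags WHERE name = ?
            UNION
            SELECT t.id FROM tags t JOIN subtree s ON t.parent_id = s.id
        )
        SELECT DISTINCT f.path
        FROM files f
        JOIN file_tags ft ON ft.file_id = f.id
        JOIN subtree s ON s.id = ft.tag_id
        ORDER BY f.path
        """,
        (tag_name,),
    )
    return [row[0] for row in rows]
```

Searching for `animals` finds both tiger designs even though neither is tagged `animals` directly, and `derived_from` keeps the link between an edited design and its untouched original.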


To be pragmatic, it sounds like she needs a relational database more than a filesystem. (Whether or not filesystems should be more like relational databases is a hypothetical at present.)

If she actually manages that sort of data outside of an RDBMS I do suffer with her.


Your description is very similar to a problem I have with academic papers as PDFs. I was toying with an `rsync` solution and naming conventions for original file names (as they were named when acquired) and renaming after review (e.g. using the file system). Organizing the files with tags and collections would be a great improvement.


If she is using Windows 10 (or similar), I've found http://tabbles.net/ is a pretty cool solution.


This is good thinking, and mostly overlaps with what I've tried to do a few times, so I would love to see it finally happen somehow.

For example, the Newton storage system I worked on at Apple 1990-1996 was based on separating organization from storage so we could have multiple (tag and/or hierarchy) organization systems. That eventually became the "soup" system in the shipping Newton OS, where objects were retrieved by content rather than hierarchy.

More recently, I spent some painful years working on Microsoft's WinFS, which had a ton of overlap with the principles here, and demonstrates just how hard it is to go from some nice principles that all seem like the Right Thing to an actual successful adopted implementation of the principles.


I think most people would agree a tag filesystem or a similar concept is a great idea. I have wanted one myself for a long time. Yet, for some reason, it doesn't take off. Do you think it is just a problem of implementation?


Hierarchical file systems allow you to say for sure where a file isn't. Every tagging system I've ever played with has turned out to be a mess in actual use compared to the simplicity a hierarchy provides.

(I am just a user, not a developer of filesystems, so this may be a naive opinion.)


Tags are fun when you have a few thousand items to test your MVP with. It gets much less fun when you have millions of items with thousands of tags, all on a flat hierarchy.

On the other hand, when you're stuck with a flat hierarchy anyway (e.g. thousands of pictures, all named DCIMxxxx.jpg), tags can be more useful. But only if they're automatic.

I want the best of both worlds. I want to organize my stuff into folders and use tags to search for individual items. There's no need to be a purist on either side. "Designing better file organization around tags" is a good thing. "Designing better file organization around tags, not hierarchies" is not.


I absolutely agree with this, except to add that it seems, in theory, in the case of a 100% tag-based file-system, there would never (*very rarely) be a flat list where you have to scroll through millions of files. The UX of a single flat list with millions of files named DCIMxxxx.jpg is a limitation of the current format and doesn't make sense when so much information can be generated about our files upon creation.

In this hypothetical FS, all files would have dynamically generated tags for created date/time, modified date/time, access date/time, originating application, given filename, owner, group, permissions, format, originating hostname, geo-data if available, and so on. EXIF data should be first-class tags as prominent as the others. I picture it working, by design, something like the Google Photos UX, which groups everything by EXIF data automatically before you ever start organizing things on your own.

Just as well, with any files created manually, in the application "tagging" should be just as prominent a function as "naming" is right now.
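
A small sketch of what automatically generated tags could look like, using only what `os.stat` already exposes. The `automatic_tags` function and the `key:value` tag spelling are assumptions for illustration; EXIF, geo-data, and originating application would need extra sources that are not shown here.

```python
import os
from datetime import datetime, timezone


def automatic_tags(path):
    """Derive tags from metadata the filesystem already has, with no manual tagging."""
    st = os.stat(path)
    modified = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    tags = {
        f"modified-year:{modified.year}",
        f"modified-month:{modified.year}-{modified.month:02d}",
        f"size-bytes:{st.st_size}",
    }
    if ext:
        tags.add(f"format:{ext}")
    return tags
```

Even this much would break a folder of DCIMxxxx.jpg files into groups by month without anyone typing a tag.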


Part of the adoption problem here would be trust. Hierarchical filesystems are something we're used to, and we can trust they're implemented correctly. That means, if I visit a folder and see some files, I know what I see is all of the files there (+/- hidden file settings); if something is missing, it's not there, period.

Tag search is a search. Can be broken. Can be optimized in a way that causes it to lie. I look at the results, and I'm not sure if they're complete. Maybe the file I'm looking for is really not there, or maybe the search gave up too early. Or the tag was slightly misformatted?

Maybe I'm too used to the old thing, but I like the notion that there's one canonical tree structure that makes all my data reachable. In case I've misplaced something, the search space of all paths through the filesystem tree (or a subtree of interest) is vastly smaller than the search space of all possible values of all relevant tags.


> Tag search is a search. Can be broken. Can be optimized in a way that causes it to lie. I look at the results, and I'm not sure if they're complete. Maybe the file I'm looking for is really not there, or maybe the search gave up too early. Or the tag was slightly misformatted?

Google Docs, while not exactly a tag system, wants you to search instead of use a hierarchy, and so is my go-to example: A good 90+% of the time, I can't find something I know is there and have to ask a co-worker for a link.

The missing consideration when it comes to tags is simple discoverability. You have to already know enough about what you're looking for in order to find it. A hierarchical system lets you do systematic browsing.


You make excellent points, although I think the issues you raise exist in our current filesystems as well. Provided the FS is indexed properly, opening a tag should show all files associated with that tag immediately. Just like opening (or listing) a directory does.

And in the case of search, it absolutely sucks on today's filesystems. Don't get me wrong, find and grep are incredible tools. I simply mean that it's not like searching for files beyond the hierarchy is known for the pleasurable UX. The only way I know a grep of a whole drive or deep directory is done is because I get my blinking cursor back.

At the very least with a proper tagging system, we would be inherently familiar with the indexes available to us.


> And in the case of search, it absolutely sucks on today's filesystems.

It does. You mention grep, I'd even mention find - half of the time I'm wondering whether it has searched everything I wanted, or I misspelled the command. Or file search in Windows (Vista+) - I just don't trust it; I'm pretty sure it missed some data in the past for one reason or another.

Now with traditional file systems, I at least have the file tree. With tag-based systems, I'd only have search - so it better be trustworthy, both in reality and UX-wise. It needs to project the feeling of correctness and completeness of results.

> The only way I know a grep of a whole drive or deep directory is done is because I get my blinking cursor back.

The only way I know a find of a whole drive does what it's supposed to be doing is because it emits a stream of "find: `/some/path': Permission denied" messages.

> At the very least with a proper tagging system, we would be inherently familiar with the indexes available to us.

Fair enough.


> I'd only have search - so it better be trustworthy

Absolutely agreed. It should be as reliable and as immediate as what we have now.

> I at least have the file tree

I don't know how this would work in practice, but I'm imagining something where, UX-wise, a tag-based FS could act very much like what we're already used to. Google was very much on this track in their early versions of "labels" in Gmail and Google Drive (shame they've slowly moved away from it).

Just last night I used some desktop app I found to tag a few thousand scanned documents so I could do my taxes this morning (researching my options is how I ended up finding this article). Once they were all tagged, I was able to traverse in a very familiar way.

At "root", there's too much noise, but as soon as I pick a tag, say "2017" - now I have whittled down my available tags. And then I pick "receipts". Smaller list of files and a smaller list of tags. And then "restaurants". And then "business".

That seems quite a bit like a hierarchy to me. The subset of tags that are related to the first one I chose act just like sub-directories. The UI could work exactly like what we already know and love. As we know it now, I would have ended up at ./2017/receipts/restaurants/business.

Of course with directories, that's the only way I could organize my files. But if we're working with tags, I would get the exact same results going to:

/business/receipts/2017/restaurants/

/receipts/2017/business/restaurants/

You get the idea. But, I could also potentially do something like:

/receipts/2017/client_1+client_2+client_5

or

/2017/receipts/business+!client_3

Now, still within the realm of a directory structure - even using terminal commands we're all familiar with and a bit of extra sugar - I have access to more features. I can't merge directories in a tree. Not that easily, anyway. But in this case I can `cd` into a directory of exactly what I want in a familiar way without trying to remember if what I'm looking for is in ~/Dropbox/receipts/2017 or ~/Documents/business/client_1/receipts.

It's in both. "Dropbox" and "Documents" are no longer necessary. Nor is ~/.
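
The narrowing walk described above (pick a tag, see fewer files and fewer remaining tags) can be sketched roughly as follows. The `narrow` function and the sample file names are hypothetical, and the data lives in a plain dictionary rather than a real index.

```python
def narrow(files, chosen):
    """Return the files carrying every chosen tag, plus counts of the tags still left to pick."""
    chosen = set(chosen)
    matches = {name: tags for name, tags in files.items() if chosen <= tags}
    remaining = {}
    for tags in matches.values():
        for tag in tags - chosen:
            remaining[tag] = remaining.get(tag, 0) + 1
    return sorted(matches), remaining


files = {
    "scan-001.pdf": {"2017", "receipts", "restaurants", "business"},
    "scan-002.pdf": {"2017", "receipts", "restaurants", "personal"},
    "scan-003.pdf": {"2017", "invoices", "business"},
    "scan-004.pdf": {"2016", "receipts", "business"},
}

# Picking "2017" first leaves only the tags that co-occur with it, like sub-directories.
names, next_tags = narrow(files, ["2017"])
```

Because `chosen` is treated as a set, the order in which tags are picked does not change the result, which is the property that makes /business/receipts/2017 and /2017/receipts/business equivalent.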


I like the content of your answer: filtering by tags to narrow down the search results, only showing tags that belong to the current set of results, the benefits of order-insensitive path parts, and the ease of taking unions of tag results.

The path examples that you created are Boolean queries with different symbols: slash means AND (low precedence), plus means OR (medium precedence), and exclamation means NOT (high precedence). Your last example could be rendered as "2017 AND receipts AND (business OR NOT client_3)" and mean the same thing.
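
A minimal sketch of an evaluator for that path syntax, assuming exactly the precedence just described and no parentheses or escaping. The `matches` and `select` functions and the sample files are hypothetical.

```python
def matches(query, tags):
    """Evaluate a path-style tag query against one file's set of tags."""
    for segment in query.strip("/").split("/"):  # '/' separates AND-ed segments
        if not segment:
            continue  # an empty path (the "root") matches everything
        alternatives = segment.split("+")  # '+' separates OR-ed alternatives
        if not any(
            (alt[1:] not in tags) if alt.startswith("!") else (alt in tags)  # '!' negates one tag
            for alt in alternatives
        ):
            return False
    return True


def select(query, files):
    """Return the sorted names of all files whose tags satisfy the query."""
    return sorted(name for name, tags in files.items() if matches(query, tags))


files = {
    "a.pdf": {"2017", "receipts", "business", "client_1"},
    "b.pdf": {"2017", "receipts", "business", "client_3"},
    "c.pdf": {"2017", "receipts", "personal", "client_3"},
    "d.pdf": {"2017", "receipts", "personal"},
    "e.pdf": {"2016", "receipts", "business", "client_2"},
}

result = select("/2017/receipts/business+!client_3", files)
```

Here `c.pdf` is the only 2017 receipt excluded, since it is neither tagged `business` nor free of `client_3`.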

In any case, the illustration you made is indeed the sort of user interaction that I want to design into a future prototype.


>if I visit a folder and see some files, I know what I see is all of the files there (+/- hidden file settings); if something is missing, it's not there, period.

Funny you should say that. Only a few weeks ago a colleague of mine was perplexed by a file which showed up in a 'save as' box, but not in Windows Explorer. It was an ordinary log file, same as a bunch of others in that folder, no reason for it to be different. Apparently he later discovered the file was visible if navigated to from C:, but not through the desktop shortcut he'd made to that folder. We could only conclude it was a Windows bug. Whatever the cause, it wasted a good deal of our time hunting for that file...


I think the parent wasn't concerned about scrolling through large numbers of files, so much as the performance issues associated with querying them.


I understand your concerns, and they are indeed valid. First off, I doubt that managing millions of files in a traditional hierarchical file system is fun either. You'd likely run into problems with making unique names, sharding folders, and categorizing files that logically belong in multiple places. I also have some worries about whether existing file systems (say NTFS or XFS) will behave or perform well with millions of files. I believe that implementing tags is a starting point for the problem of managing millions of files in a sensible way.

Speaking of thousands of pictures, what I really want is to dump all my photos into one folder. Right now, tools are ill-equipped to deal with large folders, so I am forced to manually create a new folder for every thousand or so items.


> I also have some worries that existing file systems (say NTFS or XFS) will behave or perform well with millions of files.

I believe the mantra for XFS is "if you have large or lots, use XFS". XFS has a lot of optimisations for metadata operations which should mean it's better than most filesystems for lots-of-files and large-files cases (Dave Chinner has given several talks about the performance characteristics of XFS with "large or lots" cases).


A few things I can see off the top of my head that would need to be solved for tag filesystems to be able to take off:

1. Inertia. Most software assumes hierarchical filesystems, and assumes it can control some portion of that hierarchy. This includes things ranging from search paths for various things ($PATH for binaries, library paths, etc), temporary files, preferences files for applications, assumptions made about hierarchical filesystems in archive formats like ZIP and TAR, etc.

2. Permissions. With a hierarchical filesystem, you can apply permissions on higher levels of the hierarchy to control access to lower levels, and you can have various forms of permission inheritance to control permissions on new files. Need a design for how to do that on tag filesystems.

3. Mounting. Filesystems come and go; some are on your OS drive, some are on removable media, some are network filesystems. Hierarchical filesystems means that each one has a single root, and it's easy to tell where the boundaries are.

4. Tagging taxonomy. What kinds of tags do you use? What happens if you mount a filesystem in which someone else used a different tagging taxonomy than you used? Who controls different parts of the tag space? What happens if you import an archive of material which uses a different tagging scheme than you use?

5. Projects. How do you group files of different types with different tags into discrete projects? How do you bundle related files together? How often would you want to see files by arbitrary tag lumped together, rather than looking in particular projects that have a pre-defined structure?

6. UI. How do you browse tags? How do you refine down? In many cases, rather than a general purpose tags based interface, you actually want media-specific browsers, like ones specialized for music which let you browse by artist, album, playlist, etc, or photo galleries that can show you previews of the photos, or video browsers which can show projects, bins, and sequences (for video editing), or IDEs which can either show you file hierarchy or allow you to browse by class, function, etc.

7. And finally, why is a tag-based filesystem necessary for this? What is wrong with the current approach, in which there are special purpose applications which can index, tag, and display certain types of media in certain ways? For instance, you can use your text editor or IDE to navigate among files within development projects, iTunes or Play Music or whatever to browse your music, Lightroom or Darktable or iPhoto or Google Photos to manage your pictures, iMovie or Final Cut or Premiere or Avid for browsing and managing your video, and so on. They all frequently have some way of tagging files, but also have specialized UIs for browsing the specific types of files they are defined for without having to do explicit tagging, and the actual files are just stored on a normal hierarchical filesystem.

There are a couple of good thoughts in the original post, but a lot is handwaved away, such as mutability of files, which is an incredibly important use case, for a huge amount of what files are used for today. Lots of people have brought up the idea of making tag based or database based filesystems, such as the failed WinFS effort (https://en.wikipedia.org/wiki/WinFS), but it's actually a pretty big problem to solve.


I think many of these issues can be addressed with mechanisms proposed by the author.

Mainly, the more complex tags which can themselves refer to other tags.

1. This is probably the trickiest one. You may be able to do some sort of translation between a hierarchical system and the tag system using tags themselves. You could have a series of tags that refer to each other, such that the hierarchical location is essentially encoded in the tags themselves.

2. Again, maybe just special tags?

3. Yeah, again, tags. Just tag the thing with the media it's on.

4. Aside from the basic UI side of things which should help, there is the idea of shared tagging systems. I don't recall if that came from the author or another commenter on HN. And you can basically ask the same question about hierarchical systems. It's not exactly a solved problem there either.

5. Again, the complex tags. Just make a tag for the project.

6. Obviously UI is a big question. I'm not sure how it relates so much to media-specific browsers though. They basically present a different view of a section of a filesystem. You have to do some work to let them do this, or else use a system like iTunes and buy all of your media through them.

7. Although I feel this is well addressed by the author, one thing I think you aren't considering is that each of these applications requires their own setup in order to provide that view. You often can't just take the directory from one of these programs and use a different program to view it and have it all work properly. If you only have one program for each media type and never want to use anything else that works, sort of. Many years ago I directed iTunes to redo the file layout for my music collection and rendered it effectively useless for direct browsing. I never really recovered from that due to the time involved to sort it out.
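
As one possible reading of point 1 above, a hierarchical location can be encoded as a chain of tags that each refer to a parent tag. This is only a sketch under that assumption; the `Tag` class and both helper functions are hypothetical.

```python
class Tag:
    """A tag that may refer to another tag as its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def tags_for_path(path, registry):
    """Create or reuse one tag per directory level and return the innermost tag."""
    parent = None
    prefix = ""
    for part in path.strip("/").split("/"):
        prefix = f"{prefix}/{part}"
        if prefix not in registry:
            registry[prefix] = Tag(part, parent)
        parent = registry[prefix]
    return parent


def path_for_tag(tag):
    """Rebuild the hierarchical path by following parent references upward."""
    parts = []
    while tag is not None:
        parts.append(tag.name)
        tag = tag.parent
    return "/" + "/".join(reversed(parts))


registry = {}
music = tags_for_path("/home/me/music", registry)
docs = tags_for_path("/home/me/docs", registry)
```

Software that assumes a hierarchy could then be handed `path_for_tag(...)` as a compatibility view, while everything else queries the tags directly.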

And mutability isn't totally handwaved away, again with the complex tag system you could tag mutated works with a reference back to the original. This doesn't cover the case where you don't wish to retain the original, but then you could just do a simple find/replace with the old and new hashes in the simplest case.


I think you're missing some of the subtleties of solving these problems using "just more tags."

In a hierarchical system, a lot of these organizational issues are local. If I have one directory that consists of a project organized one way, and another directory that consists of a different project organized a different way, those different organizations don't really interact with each other in any way.

If you are using tags for everything, in order to avoid weird mishmashes of different ways of using tags, you would need to either have a completely standardized tagging system that everything used consistently, or you'd have to always include various contextual information in your queries or in your browsing in order for the queries to make sense. For instance: [mount: my-hd][project: my-project][type: jpeg]
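
To illustrate what such a contextual query involves, here is a rough sketch that parses the bracketed form into key/value pairs and filters on all of them. The `parse_query` and `find` functions and the sample metadata are assumptions for illustration, not an existing tool.

```python
import re


def parse_query(text):
    """Turn '[key: value][key: value]' into a dict of required key/value pairs."""
    pairs = re.findall(r"\[([^:\]]+):([^\]]*)\]", text)
    return {key.strip(): value.strip() for key, value in pairs}


def find(files, query):
    """Return the sorted names of files whose metadata matches every pair in the query."""
    return sorted(
        name
        for name, meta in files.items()
        if all(meta.get(key) == value for key, value in query.items())
    )


files = {
    "logo.jpg": {"mount": "my-hd", "project": "my-project", "type": "jpeg"},
    "notes.txt": {"mount": "my-hd", "project": "my-project", "type": "text"},
    "logo-copy.jpg": {"mount": "usb-stick", "project": "my-project", "type": "jpeg"},
}

query = parse_query("[mount: my-hd][project: my-project][type: jpeg]")
hits = find(files, query)
```

Dropping the `mount` pair from the query returns both JPEGs, which shows how much of the scoping a directory gives implicitly has to be stated explicitly here.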

I think you overstate the problem with different applications as well. For a large amount of the metadata that is relevant for these applications, there is a standard tagging system. ID3 for music, EXIF for images, XMP for various image and video formats. It's true that there is some metadata that these applications store in proprietary databases, but that's mostly an issue of it being difficult to come to a consensus on standards that meet everyone's needs, and it's easier to just write some proprietary metadata somewhere. With tagging systems, if there wasn't agreement on the schema of tags, you'd still have the same issue.

I don't think it's a bad idea to consider alternatives that are more general and more flexible than what we're doing now, but I do think that it's pretty easy to handwave about how nice a tag based system would be, but a lot harder to solve all of the little problems that are going to come up and turn it into a real, coherent, working whole, and then getting enough critical mass so that it is used outside of a small niche with a handful of applications.


I'm sure you're right that there are a lot of overlooked subtleties. That said I'm not sure some of those problems you mentioned would exist, or at least I'm not sure they would be any worse with a tag system than a hierarchical one.

For example, how is that example query any worse than the current situation? Right now you'd navigate to the project directory (requires specifying more than your example already) and then use some search method depending on OS/WM/etc. And then you still end up with a big list of jpegs to look through. This is sort of a worst-case example for both systems, and still I think the tag system comes out ahead here - by a little - just because it would give you the ability to spread the project across multiple drives without requiring you to do two searches if you don't know which drive the desired image is on. You can improve the situation for either system by manually specifying more information. Put better tags on the images or put them in more specific directories or title them.

As for specific applications, it's not the metadata encoded into the files that I'm talking about. It's as simple as the directory structure itself that is used to store all of this. I can't have one application organize everything and then trivially point another application at the directory and have it work.

With a tag-based system this starts to change. I don't need to tell a new music player where my music is, and then go through whatever process is needed to let it properly work with the current directory organization. At worst I tell it which tags to include or perhaps exclude. From there many options exist. Maybe it pulls in metadata from the files themselves. Maybe I provide an external file in whatever format. Maybe I tell it which tags to associate with which fields. You could do a lot of things here.

I also won't end up telling the application to reorganize things as I did many years ago with iTunes, which promptly made it nearly impossible to wade through my music manually. I had it sort everything into directories based on the artist with subdirectories for albums. It sounded great, until I remembered just how much music I had off OCRemix, where an album is a large collaboration between many people. All of those albums were ripped apart. Ironically, I also had some standardization issues with things like artist names which caused more trouble. Once I stopped using iTunes I basically abandoned that collection because of the work required to fix it.

Yeah, standardization is going to be sort of a problem, but I don't think it's quite as big of a deal as you think. For one, the OS is going to ship with a bunch of standard tags just for itself to work. There will also just be a lot of really standard stuff people are interested in that can be shipped with them. You also have file extensions, for both specific extensions and also generally what kind of information they contain. And finally there is just good old translations. The hierarchical system basically utilizes all these methods and suffers from the same problem - namely you can put directories wherever you want and name them whatever you want. Same problem, different manifestation.

I think the biggest benefit would come from a system that can present itself either hierarchically or tag-based. They both have merits. I've already presented some ideas on how you could store the hierarchical structure in the tags. I'm not so sure how you store the tags in a hierarchical system directly. You could probably fake it with a separate datastore easily enough though.

Finally, when did this discussion of general design goals turn into one of a real-world implementation, much less widespread adoption? I'm not sure how this is relevant.


Once upon a time there was Google Desktop, and it was incredibly useful. I wouldn't trust Google on my machine anymore, but an open source replacement would be great. https://en.m.wikipedia.org/wiki/Google_Desktop


OSX has Spotlight. It searches pretty much all the same stuff Google Desktop is listed as searching. If you want something open source instead then check out Quicksilver [https://qsapp.com].

Dunno about Windows or Your Favorite Unix. alternativeto.net may be of help: https://alternativeto.net/software/google-desktop/


Thanks. My subtle point was why use tags when you could use search? Search engines index the whole internet - millions of hierarchical file systems of all kinds. Why reinvent file systems?

I don't use Mac. Windows 10 "Cortana" is no match for where Google Desktop was 10 years ago. I rarely use it unless I have to. Google searched the contents. Cortana just searches filenames and tries to route searches to the web. Linux... I don't see anything - Ubuntu has a Cortana-like feature, but it's not Google Desktop. It's very odd to go backwards technology-wise.


Google Desktop was indeed a great solution partially solving the tag problem in Windows. A good tagging system trumps search as you can make the results much more deterministic. The ability to create your own namespace and then organize every file into that namespace ensures encapsulation. In a search result, you almost always have to filter out irrelevant items.

Good working search solves the problem pragmatically though :)


I am far too lazy to tag every file and I would despise such a system if forced into it. Some files I want to keep, realizing I may never need them again. Tagging is a waste of time. It's also difficult to anticipate future use and what tags are helpful.

Now a tagging system that could be built over time from search results could be very useful. Apply tags as you go in batches in other words to aid future searches.

If I were designing a new OS, I'd force each piece of software to auto-tag and index the file contents in meta tags and feed it to a global OS search function.


You assume that you need to tag manually. You assume that you are forced to do so. Currently, it's perfectly possible to put all files in a single folder. Your OS will be happy about that (except maybe some technical limitation of max filecount of a folder).


I agree, Cortana is terrible at searching. I use Everything Search [1] coupled with Wox Launcher [2], it's great and just oh so fast!

[1] https://www.voidtools.com/ [2] https://github.com/Wox-launcher/Wox


All of the things you pointed out are legitimate concerns. My ideas are at an early stage (esp. without an implementation or usage experience), so they are necessarily incomplete. To respond to your points:

1. Inertia on hierarchies. Totally correct. I want to tackle immutable media collections first (photo, audio, video, documents), because developer tools (scripts, source code, path configurations) have a more intimate coupling with the hierarchical model.

2. Permissions. I have no clear answers. I suppose there could be a meta mechanism like, "for every file tagged as WordDocument, make it readable to $ANOTHER_USER". However, I believe the default of share-nothing is about the level that we get in most web-based applications. And I think the file-level permissions (like Unix, NTFS, etc.) are too fine-grained and confusing.

3. Mounting. I don't want to repeat myself, so first read https://www.nayuki.io/page/designing-better-file-organizatio... . However, it's true that you might want to vary the query domain depending on what you're doing.

4. Tag taxonomy. My proposal isn't any worse than the ad hoc taxonomies that people make in hierarchical file systems. At least when I proposed public "tag cores" (search in the article), there is one possibly viable way to unify everyone's vocabulary (opt-in, of course).

5. Projects. I suggest tagging every relevant file with its project tag. I'm not sure what you are pre-supposing with "pre-defined structure", because you could tag tags with the project tag. Also, I like the idea of tagging over hierarchies in this case because many of my project files also belong to other projects (e.g. monthly account statements vs. tax documents) or to a general stream (e.g. photos).

6. User interface. This is one of my big obstacles at the moment. With a hierarchy, there is really only one reasonable way to present the files, and one reasonable way to browse them. With tags, the options are numerous. There might even be metadata to control how tags are ranked, hidden, etc. At the very least, a UI would need to present many facets of metadata, such as file type, topic, timestamp, author, etc.

7. Necessity. You hinted at the problem in your statement. Application software has demonstrated to us that media libraries (e.g. songs in iTunes, photos in Adobe Lightroom) are extremely convenient and valuable. But they are all proprietary data silos. The metadata you generate is only accessible within one media program. Moving to a different program means giving up your investment. Also, media libraries are brittle, and when you move/rename the data files in the file system, the media library often behaves poorly (broken links, slow rescans, etc.). I believe that keeping metadata in an application-neutral format and letting the file system handle queries is a better alternative than proprietary media libraries. I wrote about this in a section: https://www.nayuki.io/page/designing-better-file-organizatio...
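
To make points 2 and 5 a bit more concrete, here is a minimal sketch, not a design. It assumes a toy in-memory model in which every object (a file or a tag) has an ID and a set of tags. The names `tags_of`, `read_rules`, `readers`, `tagged_with`, and `in_project`, and all the sample IDs and users, are made up for illustration.

```python
# Toy model (assumption): object ID -> set of tags. Tags are objects too,
# so a tag can itself carry tags.
tags_of = {
    "report.docx": {"WordDocument", "TaxReturn2017"},
    "statement-03.pdf": {"BankStatement"},
    "BankStatement": {"TaxReturn2017"},  # a tag tagged with a project tag
    "photo-0412.jpg": {"Photo"},
}

# Point 2: a meta rule of the form "everything tagged T is readable by user U".
read_rules = [("WordDocument", "alice")]  # hypothetical rule


def readers(obj):
    """Users who may read `obj` under the tag-based rules (default: nobody)."""
    return {user for tag, user in read_rules if tag in tags_of.get(obj, set())}


def tagged_with(tag):
    """All objects (files or tags) that carry `tag` directly."""
    return {obj for obj, tags in tags_of.items() if tag in tags}


def in_project(project):
    """Point 5: objects tagged with the project, plus objects carrying a tag
    that is itself tagged with the project (one level only)."""
    direct = tagged_with(project)
    via_tags = {obj for tag in direct for obj in tagged_with(tag)}
    return direct | via_tags
```

With this model, `readers("photo-0412.jpg")` is empty (share-nothing by default), and `in_project("TaxReturn2017")` picks up the bank statement even though the file itself was never tagged with the project.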


I'm not sure about that. I'm fine with a hierarchical file system and links for my projects (desktop or explorer side bar). I tried a few times to make use of macOS tags, but I didn't find them useful. I guess, for people who work with actual documents, it might be different. For me, all those systems trying to be smart end up showing me billions of .class files from my build directories and wasting cycles trying to index them.


Would you have time to say a bit more about why WinFS failed from your point of view?


Oh man, that would take a book. If I had to severely boil it down: You can have a beautiful unifying idea that seems to make total sense in the abstract, but then bog down completely when you have to make real implementation decisions in the context of a massive existing platform with a several-thousand-person development team. At some point a system's assumptions become so established that there are certain aspects that are literally impossible to change. Good intentions, even with the full support of the highest levels of management, can't resolve all the tangles.



Wow, I forgot Hal practically did write a book. :)

I was on the Windows Shell side, sitting in a lot of schema meetings where products couldn't figure out how to combine their incompatible complex existing schemas into one new WinFS schema that would be understandable to anyone, and in a lot of other meetings where I was sort of a negotiation translator, decoding user interface concepts for the SQL team and SQL concepts for the Shell team, though their priorities were wildly different.

The common theme throughout was that it was pretty clear none of the product teams saw any reason they should go through unknown years of effort and compatibility hell, delaying the feature roadmap for their product, just to fulfill Bill's desire for this abstract notion of Integrated Storage. So these meetings went on interminably to make Bill and David and Bob and Jim happy, while none of the participants outside the WinFS team itself saw any point to the exercise.

It was not a pleasant experience, but I learned a lot about how a large organization works! (A reason I'm no longer in one.)


If you were one of the people behind the ideas in Soup, I salute you.


Thanks! That was pretty much just me, but it was inspired by reading a whole bunch of other people's books and papers on persistent object systems. That was a big topic of research at the time. I don't really know why the persistent object idea fell out of sight so completely.


Have you considered a file system organized as a timeline that _also_ supports tagging?

I find one of the key concepts that's not a first-class concept is _when_ the file was modified. Rather than a file-and-folder physical analogy for the file system UI, I think a timeline-oriented UI could present some advantages for the way that humans actually think and work. Tags would be a helpful orthogonal organization scheme, but I don't think they work as a primary UI for navigation.

This is great work, though! I love the compilation of various other works, the references, and the way you've dug into the details!


I have thought about file tagging for over a decade, before setting out to write the article. But a timeline-oriented file organization only came to my awareness near the end of writing the first draft.

A year has passed since I wrote the article, and the idea of timeline presentation has grown on me a lot. Especially because I use numerous data systems daily that are already time-oriented: Every chat program, email, Twitter, Facebook personal profile timeline, the "recent documents" view in major popular applications like Microsoft Office or Adobe Reader.

You should find this blurb in my article helpful:

> The Lifestreams Software Architecture http://www.cs.yale.edu/homes/freeman/dissertation/etf.pdf (185 pages)

> Comprehensively designs and tests a system for workflow and archival, based on chronological presentation plus keyword and attribute filtering.

Thanks for your compliments on the thoroughness of my exposition. I did a lot of research and thinking as preparation for writing. I wanted to see how other people viewed the problem of file organization and what kind of solutions they proposed. I wanted to find weaknesses in my arguments and to avoid repeating unnecessary work, and of course I wanted to move toward the best solution.


If all you want is knowledge about when a piece of data was created and/or modified, then tags do that just fine.


Not sure if it generalizes for the entire filesystem - not all files are modified due to explicit user action. Software keeps logging all the time. Many applications update their config files when closing, or during runtime. Saving a document in a program may trigger saving another 3 files elsewhere. All in all, it seems like a recipe for seeing the least relevant data first.


You've highlighted one of the primary challenges with such a filesystem. Determining which files have been modified through a _meaningful user action_ (determined by the user thinking it was meaningful) and _what changed_ would be a very hard problem, but solving it would be incredibly valuable for the user.

I think fundamentally the success and adoption of a timeline-oriented filesystem leans on new user experiences that have yet to be designed.


This might be a good idea if clocks were reliable.


That used to be more of a problem in the past. Nowadays, it's very common for a computer to have its clock synchronized via either NTP or a cell phone network.


Sure, it's common, but I work with devices all the time that have no access to the internet. Cameras, robots, and scientific instruments. Many, like the Raspberry Pi, can't even keep time when they are turned off. I also go on expeditions where I have no internet access for weeks. A time-based filesystem would be useless in all of these cases.


>But fundamentally, there is a mismatch between the narrowness of hierarchies and the rich structure of human knowledge, and the proposed system will not presuppose the features of HFSes.

This hits the nail on the head!

All the file systems I've had to work with are fine as engineering tools. By that I mean that using them as an engineer works just fine; their own implementation is off topic.

As a user, though.

What the hell!

I don't want to go to c:/users/me/documents/talks/stockholm2018/draft3

I just want to open my document!

I really hope that someday we expose a document-based filesystem to the user.

The underlying implementation does not matter, we can always add a layer on top of the hierarchical file system.

I just want to be able to display:

-all the games installed on my system

-all the pictures

-all of my text documents

-all of my pictures of Paris

etc
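
Each of the views listed above is just "every file carrying all of these tags", whatever the storage underneath looks like. A minimal sketch, assuming a hypothetical in-memory `index` from file ID to tag set (the IDs, the tag names, and `find` are invented for illustration):

```python
# Hypothetical index: file ID -> set of tags. Where the bytes live is irrelevant here.
index = {
    "f1": {"picture", "Paris"},
    "f2": {"picture", "cats"},
    "f3": {"text-document"},
    "f4": {"game"},
}


def find(index, *wanted):
    """Return the IDs of files that carry every tag in `wanted`."""
    wanted = set(wanted)
    return {file_id for file_id, tags in index.items() if wanted <= tags}


all_pictures = find(index, "picture")             # f1 and f2
paris_pictures = find(index, "picture", "Paris")  # f1 only
```

A layer like this could sit on top of a hierarchical file system, but it says nothing about how the tags get there in the first place.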


How do you make sure things are tagged properly though? A file must reside within a folder, even if it's a default location, which forces a user to think about the folder where the file is stored. With tags, a user could very easily forget one tag on a file, and now any filtering on that tag is never going to be aware of the new file existing.

What if you find another picture somewhere from your <Paris> trip, but you forget to add the <Travel> tag? Or you have a tag for "cool architecture pics" or something and you miss tagging one of your Paris pics with this when you upload?

There just seems to be so much friction in keeping tags properly organized, despite how much better the "read" UI is for someone browsing or searching file collections.


I imagine the process as being similar to how you would currently add a file.

Now:

Your friend messages you "Here's another pic from Paris xxx". You click on the picture, "Save as". Your file system comes up, you navigate to benjammer/travel/paris/pics. You can see the rest of the pictures from Paris in the directory. You hit save.

With tags:

Your friend messages you "Eiffel Tower :-)". You click on the picture, "Save as". A list of your tags comes up for you to select some. On the other side of the screen are the files that match the selected tags, so that you can see what company your file is going to end up in. They're shown in the order of how many files they tag (that also have the tags you've already selected). The "benjammer" tag is selected by default, as is "picture" and "png" (because your application knows it's been given a png). The "Paris" tag isn't at the top of the list so you type it in and select it. Now "Travel" is at the top of your list of tags, so you add it, along with "June 2017" and "Europe".
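
The ordering described above (rank the remaining tags by how many files carry them together with everything already selected) can be sketched in a few lines. This is only an illustration: `index` is a hypothetical mapping from file ID to tag set, and `suggest_tags` is a made-up name.

```python
from collections import Counter


def suggest_tags(index, selected):
    """Rank not-yet-selected tags by how many files carry them together
    with every tag in `selected`. Ties are broken alphabetically."""
    selected = set(selected)
    counts = Counter()
    for tags in index.values():
        if selected <= tags:                 # file matches the current selection
            counts.update(tags - selected)   # count its other tags
    return sorted(counts, key=lambda tag: (-counts[tag], tag))


# Hypothetical existing files.
index = {
    "a.png": {"benjammer", "picture", "Paris", "Travel", "Europe"},
    "b.png": {"benjammer", "picture", "Paris", "Travel"},
    "c.png": {"benjammer", "picture", "cats"},
}
```

With "benjammer", "picture" and "Paris" selected, "Travel" comes first because two matching files already carry it, followed by "Europe".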


Manual tags require a lot of curation and upkeep, but some "tags" are really just restatements of attributes or facts about a file, like search filters, e.g. ("Pictures downloaded from the web on 2018-04-05", "Files created during installation of World of Warcraft", "Files opened in the last two weeks").

In fact, a tag-based document filesystem is largely useless without powerful search, where tag keys and values can be searched at will.
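
One way to read "tags that are really search filters" is as named predicates over file attributes, evaluated at query time instead of being stored. A rough sketch under that assumption; the attribute keys (`mime`, `source`, `last_opened`) and the tag names are invented for the example:

```python
from datetime import datetime, timedelta

# Hypothetical "virtual tags": each one is a predicate over a file's attributes.
VIRTUAL_TAGS = {
    "picture": lambda meta, now: meta["mime"].startswith("image/"),
    "downloaded-from-web": lambda meta, now: meta.get("source") == "web",
    "opened-in-last-two-weeks": lambda meta, now: now - meta["last_opened"] <= timedelta(weeks=2),
}


def virtual_tags(meta, now):
    """Names of all virtual tags whose predicate holds for this file."""
    return {name for name, holds in VIRTUAL_TAGS.items() if holds(meta, now)}


now = datetime(2018, 4, 5)
recent_download = {"mime": "image/png", "source": "web", "last_opened": datetime(2018, 4, 1)}
old_note = {"mime": "text/plain", "last_opened": datetime(2017, 1, 1)}
```

Nothing here needs curation: `recent_download` matches all three virtual tags, `old_note` matches none, and the answer changes by itself as time passes.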


I suppose I should mention that while macOS search is far from perfect, Apple put quite a lot of work into this kind of autotagging.

Here's a screenshot showing a tiny portion of the tags available in search: https://imgur.com/a/AYm5V


Maybe not use the same file system for everything?


I would probably use a hybrid of the two. Use the tag system for things I've tagged, and fall back to a regular filesystem for untagged files. That's basically how I have Steam set up. Most of my games are tagged, usually with multiple tags, and all untagged games just go under Games.

Once I got everything organized (via automation, not manually -- not necessarily relevant, though), keeping up on tagging new games became easy and quick.

I imagine there would be automation tools written for a tagging filesystem that just "know" a lot of common software, etc., and can get you started.


That's a group effort.

For pictures, they could read the metadata for something as simple as Paris.

The tech is there to search for 'architecture' pics; Google Photos demonstrates it.

Of course there is the issue of who can see this data, and that's not a small one. The recognition itself could run locally, but training data would still need to be pooled.

And most apps would need to adapt and provide more metadata for your documents.


My problem, at least with Windows, is that it's an unholy mishmash of the two ideas.

I really just want something like a Unix filesystem, with a single root. Just with things coming off that root that actually make some goddamn sense and aren't relics from the PDP days.

Yet on Windows I have different drives that act as independent file trees, and then I have special folders that are handled somewhat differently, applications that save things in hidden folders that non-power users don't know are there, three or four different cloud storage providers that all need their own special folders, applications that can and cannot be installed in custom locations, weird special-case access control restrictions on some folders, some of the time. And all manner of other abominations.

It is far too much work to try to set up document storage in a sane way, and commit to actually doing it, and so my experience is that people increasingly don't even bother to try, and just flood their Downloads and Documents folders.


It doesn't help that Windows /never/ (still hasn't) developed an actual standard library abstraction of what the filesystem is. In the Unix world your standard library either handles all supported filesystems for you, or more often, just provides a working abstraction of that via the kernel's own VFS abstraction of the supported filesystems.

Literally, in the UNIX/POSIX world there is exactly one way to open a file for a normal application, irrespective of the storage backing for that file and what filesystem might contain it.

(IIRC, it's been a while) On Windows the closest a programmer can get to this is opening up a file via calls specific to NTFS (since it's now the predominant filesystem), FAT (most flash drives), exFAT (other flash drives), ISO9660(etc)/UDF for optical media, or something else for SMB/CIFS mounts.

For many years I've thought MS should just make a VFS that presents everything as (virtually) NTFS, and use that as the standard library abstraction.

Which of course has very little relation to actually tagging/organizing/searching/recalling files in a more human way. The biggest hurdle there is probably getting humans to tag/file things correctly and verbosely enough. MOST humans simply can't do a good job at this. For at least 60%, maybe as high as 80 or 90% of the population, it is /literally/ beyond their technical capability to do a good job (at least without a lot of help).


What do you mean? fopen exists on Windows.


It seems you are correct. I /do/ remember something a /long/ time ago having this issue, but even some quick searching reveals that MSDN documents fopen existing back in 2008, and the notes supporting UNCs. I wonder how the underlying support is handled.


CreateFile can open anything on any mounted filesystem, as well as things like raw drives and named pipes.


> The underlying implementation does not matter, we can always add a layer on top of the hierarchical file system.

I don't think so. Hierarchies (trees) are too limited, and my intuition tells me that this leads to an impedance mismatch.

This is the reason the relational model took over: it is far more powerful.

To have something comparable, you need full graphs.

I think it is no coincidence that tag-based filesystems have never taken off. A tree-based file system can never be a good foundation for one; you need either a graph database or a relational database (or both?). And this is probably only viable on SSD/RAM-like storage.

Finally, tags are not enough. You need a way to do groups, aliases, and groups of groups, plus full-text search and seriously good metadata alongside the file...


I eventually had to use a series of command-line tools to shepherd my badly-organized photo collection into something like a usable state, and it basically involved finding all the JPEGs on my system, de-duplicating them, and dropping them into folders based on their date-taken metadata.

It was a huge pain, and entirely an artifact of the tyranny of the folder-based filesystem.
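
For anyone facing the same chore, the steps described (find every JPEG, drop byte-identical duplicates, file the rest by date) look roughly like the sketch below. It is not the set of tools I used. It makes two loud assumptions: the file's modification time stands in for the EXIF date-taken field (reading EXIF needs a third-party library), and `dest` lies outside `src`. The function name `organize_photos` is made up.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def organize_photos(src: Path, dest: Path) -> int:
    """Copy each unique *.jpg under `src` into dest/YYYY-MM-DD/.
    Returns the number of files copied. Assumes `dest` is not inside `src`."""
    seen = set()
    copied = 0
    for path in sorted(src.rglob("*.jpg")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            continue  # byte-identical to a file already handled
        seen.add(digest)
        # Assumption: mtime (in UTC) stands in for the EXIF date-taken value.
        taken = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        folder = dest / taken.strftime("%Y-%m-%d")
        folder.mkdir(parents=True, exist_ok=True)
        # Name the copy after its hash so equal file names cannot collide.
        shutil.copy2(path, folder / f"{digest[:16]}.jpg")
        copied += 1
    return copied
```

Notice that the only reason the date has to become a folder name at all is that the folder tree is the one index the file system offers.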


for i in *.jpg; do date=$(exif $i --machine-readable --tag=0x9003 | cut -d ' ' -f 1 | tr : -) mkdir -p $date mv $i $date done

But don't many photo-browsing programs allow browsing by date, or even GPS-tagged location? e.g. on KDE's DigiKam I choose "timeline" view (or map view).


I like the smiley before mkdir


Oh, damn. The formatting was lost, I had line breaks in there!


Indent with 2 spaces to format code: https://news.ycombinator.com/formatdoc


On OSX these issues have been solved for quite a few years.

1. Press Apple-Space.

2. Type "draft3".

3. Select first item.

And for finding "all my pictures" just create a Smart Folder based on a series of rules e.g. "File Type = JPG".


There is TMSU for Linux, among others. That's the only semantic file system I've tried, and that was around 10 years ago, but it was easy to use then.


Why can't you do the tagging with a series of symlinks in one directory instead? It is essentially an index.


Symlinks are brittle when files are renamed and when files move from device to device. Also, symlinks don't solve any of the other problems I'm interested in, such as: Not being forced to make unique file names, being able to find files by hash, being able to find duplicate file content, being able to have tags about tags, and more.


Didn't Windows try to do this starting with 7? It has "libraries". It was awful and I just wanted the real file system back.


Libraries would be a-OK if they were the only abstractions available. You can group your stuff into a number of libraries and call it a day.

Unfortunately, Microsoft changed their mind and kinda-deprecated-but-not-completely-eliminated libraries in Windows 10. They also added a whole bunch of similar features that don't fully replace the functionality that libraries used to offer. Now there are at least four different places where the same content might be shown: Quick access, folders under This PC, folders in the actual drives (also under This PC), and Libraries (if you choose to display them). It's an unholy mess that needs registry hacks to clean up.


Every time I read an article about attempts at non-hierarchical filesystems, I try to figure out how I'd take the huge piles of stuff I generate when I'm drawing (and publishing) a graphic novel and reorganize it under tags. It's never pretty.

Like, okay, sure, I tag everything with the name of the project, that's a no-brainer. But if I just do that then I get the hundreds of files I generate (one per page) mixed up with everything else - web-res renderings of each page, model sheets (and their source files), promotional material, stuff sent to publishers to try and convince them to deal with that part of the process, and the huge mass of files I generate for each book I print (which can be more than one for a single multi-year project): source files tweaked for print, print-res renderings thereof, files for the kickstarter for each book... So I tag all of these attributes too, and imagining putting all these tags on a file as I save it sure is a lot of fun, even if I imagine some sort of save requestor that keeps a list of all my previously-used tags, including ways to filter those - I don't care about any of the tags attached to my music collection or my collection of cartoon porn or my programming projects when I'm working on my comics projects, for instance, so I'd want to quickly narrow it down to just tags found in my art projects, and...

Ultimately it just starts to look like a hierarchical structure in my mind, except for the fact that I'm interacting with it by some kind of tag-filtering file browser on top of a huge filesystem that mixes everything together in a non-human-browseable structure.


I can strain my imagination and optimistically say that with a suitable UI, the tag-based file browser would be no worse than the hierarchical one — using tags to categorize your files in a not-too-coarse, not-too-fine way, and maybe even having more flexibility in how you organize and browse your files. Do you ever put information in the file name that could be a tag, or make up meaningless file names when there are few enough files related to a particular page, for example? Or put information in names that could be attached as some sort of comment or notes metadata instead? But I think you bring up some important issues when it comes to organizing your stuff.

Would it be better if, instead of having kitchen drawers and cabinets, we had a tag-based system, because, you know, the structure of human knowledge and all that? Why should I be forced to put a utensil in zero or one drawers? Actually, the cabinets and drawers system is nice because you have a sense of “place” — you can think intuitively about where an item is, where an item goes; even if where an item goes is a somewhat arbitrary choice, at least you know exactly what decisions have to be made (which drawer and which compartment in the drawer organizer, for example) to put it away. You can also do a traversal through the cabinets and drawers to see what objects are being stored and how they are being organized, and any time you open a cabinet, you are focusing on a different set of objects. Imagine if you had 10 cabinets and 10 objects in each cabinet, but you actually only have 20 objects total. Every time you open a different cabinet, you see a different 10 of those 20 items. Confusing.

I wonder if it would help to have required tags, and exclusive tags. Files with tag X must have tags of type A, B, and C, and may have other nonessential tags. X would be something like, “is a project file for some project,” and A could be the type of project name tags, and so on.
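
A quick sketch of what such a check might look like. I'm interpreting "exclusive" as "at most one tag of that type per file", which is my own assumption; the vocabulary (`TAG_TYPE`, `REQUIRED`, `EXCLUSIVE`) and the tag names are invented for the example.

```python
# Hypothetical vocabulary: each known tag belongs to a "type".
TAG_TYPE = {
    "comic-alpha": "project-name",
    "comic-beta": "project-name",
    "page": "kind",
    "model-sheet": "kind",
}
# Files carrying the trigger tag must have at least one tag of each listed type.
REQUIRED = {"project-file": {"project-name", "kind"}}
# Assumption: "exclusive" means at most one tag of this type per file.
EXCLUSIVE = {"project-name"}


def problems(tags):
    """Return human-readable rule violations for one file's tag set."""
    types = [TAG_TYPE[tag] for tag in tags if tag in TAG_TYPE]
    issues = []
    for trigger, needed in REQUIRED.items():
        if trigger in tags:
            for missing in sorted(needed - set(types)):
                issues.append(f"missing a {missing} tag")
    for tag_type in sorted(EXCLUSIVE):
        if types.count(tag_type) > 1:
            issues.append(f"more than one {tag_type} tag")
    return issues
```

A save dialog could run something like `problems(...)` before accepting the file, which gives back some of the "you must decide where it goes" discipline of folders.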


Tags are great BUT

It's pretty important to realize that a file's position is merely its default tag (and you can tag it further with many different types of systems like extended attributes, as I think both Gnome and KDE have used at times).

Without that default tag you have a mess.

It's also important to note that tags have a very high maintenance cost of their own.

Duplicate, inconsistently applied and redundant tags are an aggressive cancer in any of these systems.

No, you can't just ignore them, as they make it more and more difficult to accomplish even basic viewing/scanning over files, for the system and the user.

Many, many users have trouble doing even basic maintenance on their file locations (that default tag), which makes a tag-based system even more prone to failure.


I spent last year ruminating on this (independently; I find it interesting that the article was published around the same time I was voicing my ideas on this to a friend!) and toying with a few prototypes. This year I committed myself to hacking on a proper implementation (named 'libkoios' and 'koios') of it, using the Extended Filesystem Attributes. What I found interesting is that while there is a lot of prior work on tagging systems, none of it uses the extended attributes system, which to me feels like a waste. However, there are problems with ext(2,3,4)'s implementation of file tags that make it difficult to store a lot of data without compression (I'm storing one bit per tag, which allows fast masking and comparison operations per file), so I guess that is understandable.
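
For readers unfamiliar with the one-bit-per-tag idea, here is a generic sketch of the technique. It is not libkoios code and says nothing about its on-disk format; `TagRegistry`, `has_all`, and `has_any` are names invented for this example.

```python
class TagRegistry:
    """Assigns each tag name a bit position the first time it is seen."""

    def __init__(self):
        self._bit = {}

    def mask(self, *tags) -> int:
        """Return an integer with one bit set per given tag."""
        result = 0
        for tag in tags:
            position = self._bit.setdefault(tag, len(self._bit))
            result |= 1 << position
        return result


def has_all(file_mask: int, query: int) -> bool:
    """True if the file carries every tag in the query mask."""
    return (file_mask & query) == query


def has_any(file_mask: int, query: int) -> bool:
    """True if the file carries at least one tag in the query mask."""
    return (file_mask & query) != 0


registry = TagRegistry()
page = registry.mask("comic", "page", "web-res")
```

A query then costs one AND and one comparison per file, and the stored value grows by one bit for every tag in the vocabulary, which is where the size limits start to hurt.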

I believe that for image-based systems there is 8ch's /hydrus/ (probably the only good thing to come out of the chan-networks). One upshot of there being existing network sharing systems for tags is that it should be possible to scrape them when autotagging things (Nobody. NOBODY, wants to manually tag hundreds of photo memes, which is the main foreseeable problem with file tagging).


I never personally used it, but I've heard the BeFS was designed to have significant non-hierarchical use cases:

https://en.wikipedia.org/wiki/Be_File_System:

> [BeFS] includes support for extended file attributes (metadata), with indexing and querying characteristics to provide functionality similar to that of a relational database.

IIRC, this was pretty hyped at the time, but they had to back away from it. I don't know if it was because the concept was too unfamiliar to people familiar with the hierarchical paradigm or if it didn't work as well in practice as it was imagined.

There's also a book about it written by its designer and now freely available: http://www.nobius.org/dbg/practical-file-system-design.pdf


BeFS extended file attributes are a good example to point out. I watched this excellent talk which shows the power of live queries: https://systemswe.love/archive/minneapolis-2017/ivan-richwal...

Unfortunately, file attributes are not a feature I want to see. They don't solve the problems with naming files or deduplicating files. Metadata is attached to a file, so when the file is gone the metadata is gone. I instead proposed that metadata can be freestanding, and can exist even if the main file is missing. Relevant section to read: https://www.nayuki.io/page/designing-better-file-organizatio...
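
To illustrate what I mean by freestanding metadata, here is a toy sketch, assuming metadata is keyed by the SHA-256 hash of the content instead of being attached to a file. `Store` and its methods are hypothetical names, and this is nowhere near a real implementation.

```python
import hashlib


class Store:
    """Toy model: content and metadata live in separate maps, both keyed by hash."""

    def __init__(self):
        self.blobs = {}  # content hash -> bytes
        self.tags = {}   # content hash -> set of tags, independent of blobs

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data  # identical content is stored only once
        return digest

    def tag(self, digest: str, *tags: str) -> None:
        self.tags.setdefault(digest, set()).update(tags)

    def forget_content(self, digest: str) -> None:
        """Drop the bytes (e.g. moved to offline backup); metadata stays."""
        self.blobs.pop(digest, None)


store = Store()
photo = store.put(b"photo bytes")
store.tag(photo, "Paris", "2017")
store.forget_content(photo)
```

After `forget_content`, the tags are still queryable, and putting the same bytes back yields the same hash, so the content rejoins its metadata without any file name being involved.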

I did read the entire book "Practical File System Design with the Be File System", but didn't find it helpful for what I was working on. I can skip the low-level bits because I'll probably build on top of a NoSQL database or something.


Tagging is the first step, but how do you know that you don't have overlapping or duplicate tags, say country and folk music? If you need something that fits both categories, you start designing taxonomies and eventually ontologies; there's just no end to it. I think tagging is a sensible, lightweight approach, but it has limitations...


> country and folk music

Folk music has nothing to do with country.

Country was invented in the 1800s by a businessman who aimed it at primarily white Americans. Folk music has a rich history dating before the 1200s and before much of written language. Many songs of Irish and Welsh descent date before record, still being played today.

Nevertheless, to answer your question:

1) metatags, parent tags, and the like provide ways to structure tag relationships to encompass and describe what you speak of.

2) There are many networks out there that share users' tags based on file hashes. A system could scrape these networks for existing tags and autotag many items without the user's interaction (beyond initiating the autotagger). Users could also get their hands dirty and ask for only tags relating to parent categories, or something like that.
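
As a rough illustration of point 1, parent tags can be modeled as a small graph in which a tag may have several parents, and a query for a parent also matches files carrying any descendant tag. The `parents` table and the genre relationships in it are purely illustrative, not a claim about music history.

```python
# Hypothetical parent-tag table: tag -> set of parent tags (a tag may have several).
parents = {
    "country": {"music"},
    "folk": {"music"},
    "americana": {"country", "folk"},
}


def ancestors(tag):
    """All tags reachable by following parent links upward from `tag`."""
    seen = set()
    stack = [tag]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def matches(file_tags, wanted):
    """True if the file carries `wanted` directly or through a parent tag."""
    expanded = set(file_tags)
    for tag in file_tags:
        expanded |= ancestors(tag)
    return wanted in expanded
```

A file tagged only "americana" then shows up under "folk", "country" and "music" without the user applying any of them by hand.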


"Folk music" possibly rides on the definition of "folk", which brings in a lot of diversity if interpreted broadly.

The word "folk" also denotes a recognizable format in the context of commercial broadcasting and streaming of canned music. It basically refers to a locus roughly centered around someone crooning while strumming chords on an acoustic guitar.


> The word "folk" also denotes a recognizable format in the context of commercial broadcasting and streaming of canned music.

That's a very american-centric view you have there.


That isn't what you might call my "view"; I'm just remarking on how it seems that a word happens to be used in a certain culture and context.


We can also blame Nashville for taking the 'Western' out of 'Country & Western'.


> > country and folk music

> Folk music has nothing to do with country.

I think the point was that they're both "music"; congrats, now you have a hierarchy again!


> Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.

and any tagging scheme good enough to describe the real world will implicitly create a hierarchical ontology


But it will create multiple hierarchies, as opposed to the single hierarchy to which you are constrained by a conventional filesystem.


Do you see that as a positive or a negative?


Generally a positive, there are drawbacks related to maintenance, but that's a problem for any system.

Say you're tagging hundreds of academic papers, so you have top-level tags math, physics, computer-science, graphics, machine-learning. Each of these has many subfields, and there is significant overlap between many of those. Perhaps more importantly, there is unpredictable future overlap, so it is generally impossible to define mutually exclusive categories (i.e. conventional directories [not counting links]) that are future-proof. I like to be able to define each hierarchy separately, then freely apply tags from any hierarchy to any item.

That's just one group of related hierarchies, and other categories of items (mutually exclusive categories DO exist, with care they can be suitable for very high-level organization) will have similar groups, which may or may not overlap with the top-level tags listed above.

I would love to see a GUI file browser designed with tags as a first-class concept. You might just call them "dynamic folders" - in any folder you can view a list of dynamic folders, generated by the tags in the current folder. This dynamic folder list could be adjusted from 1) only the top levels of the local tag hierarchy, down to N) all unique tags that are present in the current folder.


Not GP, but I'd definitely call that a positive! That's one of my issues with single-inheritance OOP too... There are many hierarchies you might want depending on the aspects you're interested in.

"A glass of juice is certainly a drink, but it is also a source of nutrition -- but not every drink is. There's not much nutrition in a glass of water. Likewise, there are many sources of nutrition which are not drinks. [...] A cup of tea is technically a source of energy (it is hot, it contains thermal energy), but so is a battery. Do they have a common base class?"[1]

(I remember a better example using soft drinks, but I couldn't find it with a quick google)

[1](https://stackoverflow.com/a/1079003/5534735)


I found this to be a real blocker every time I tried to tag something.

However, it’s a complete non-issue if the tags are automatically generated and updated.

I first noticed this in the iOS Photos app, which automatically generates collections such as “Weekend in Paris” and “Pictures of Bob”. The accuracy is good enough to be useful and can be improved automatically in future releases without requiring user interaction.


That's what I was thinking, having used RDF a bit. I liked that the article touched on some more intricate details, things like the timestamp when a fact was added, or disputed facts (well, tags ...). But obviously it gets complicated quite quickly.


It is similar to CSS: either use a clumsy naming convention like BEM, or use a preprocessor that provides a way to define a hierarchy of classes. Both would work.


I find it hard to remember if Google Docs originally had tags instead of folders. Or if I just imagined it. I can’t find it through googling.

I would rather use tags than folders, but can’t find good support in an operating system.

Google used to be the closest, since you could use the search bar as a command line and search queries as tags. There are no folders. But that has changed: now they try to guess what you’re looking for rather than searching for what you type.

OSX has tags, but their search is slow and inaccurate.

The closest I’ve come is using Gmail as an organizer with inbox infinity rather than inbox zero: nothing organized other than tags, and using search to find anything.


Google Drive used to be based around tags, not hierarchies. It was wonderful. Then as it matured and catered to more and more 'normal' people it introduced the concept of folders. The folders were initially tags 'really' - the same file could exist in multiple folders at the same time. But they made that harder and harder, and now I think it's hierarchical folders through and through.

I miss the old days.


Even today, Google Drive isn't strictly hierarchical, in that a file can still be in multiple folders / locations. They have, however, made it awfully hard to discover.

If you have an existing item or items selected, you can "add" them to a different folder by hitting Shift+Z; I don't think this exists in any menu. It is in the list of keyboard shortcuts (hit "?").

These feel sort of closer to some form of link than a tag, though.


> as it matured and catered to more and more 'normal' people it introduced the concept of folders

I think it has more to do with capacity growth. Relying on tag+search would be insane if most queries returned hundreds of files.


Yeah. Imagine a Google search that returned more than 100 results. Madness!


I found this to be a great collection of insights!

My criticisms:

Organizing your files and digital “stuff” has very little to do with the “rich structure of human knowledge,” to me, any more than organizing your kitchen or garage is an exercise in philosophy. The goal should be as usable a system as possible, full stop. Now, the actual content of the article is extremely practically oriented, so I have no beef with that. I just think people get carried away with the idea that storing a file is “representing knowledge,” and it takes them in weird directions like trying to create elaborate universal ontologies. The question is, is it easier or harder to find your files, and save your files?

Whenever a phrase like “representing knowledge” or “augmenting intelligence” comes up, it’s like everyone gets a boner, and then moves on to something unrelated, like (hopefully) usability.

Mutability: Everything changes. The only way to have immutable facts is to have timestamps. Image hosting sites, message boards, etc are misleading examples of file storage because they are really means of publishing. When you publish something, and people link to it, there’s a case for thinking of immutability as the default, though even then, most things that can be published can be retracted or edited. This comment can be edited after I publish it. I think true immutability as a default, for files as opposed to time-stamped facts, only makes sense in a very narrow domain.


I've used a self-hosted Booru for all of my image sorting. It took me roughly 2 months to upload and tag 53,000 images and another week of cleaning up rarely used or redundant tags. Since I use it for things other than what the built-in Artist/Series/Character hierarchies were meant for, I repurposed them as Topic/Subject/Details.

For example, visualizations of various algorithms would be filed under:

Topic: Computer Science / Subjects: Algorithms, Visualizations / Data: {Algorithm Name}

By personal restriction - something may only be filed under one topic, no more than three subjects, and can have as much data as is relevant. It gets stored under what I believe to be the primary subject.

The #1 problem is I add files to my filesystem without uploading and tagging them to the Booru. Also, since the only open-source Booru software I could find is quite dated/buggy, I'm often fighting the Booru for how I use it. Now that I think about it, this might be a good problem for me to solve myself.


ALL available meta data should be exposed and forced into tags, categories and folders.

Move everything file-like into the file system. Make emails into folders with tagged files. Link Torrents to their files and folders. Treat zips like folders. etc

Bit more on date range sliders, colors, files as tags and 3d models here:

https://steemit.com/filesystem/@gaby-de-wilde/how-a-file-sys...


An approach I find interesting is the [Perkeep](https://perkeep.org/) or Google Drive model for post hierarchy.

Storing all files as objects and then indexing them.

An interesting indexer for images would be one that groups objects by recognized faces or EXIF data (camera model, GPS location, lens, date, etc.). Google Drive does this.

Perkeep can deal with tags, span devices, deal with permissions. Check out HN user @bradfitz


My main issue with Perkeep is that a lot of the automated tags are very limited, and adding more is not really easy-to-do. Though the last time I used it, it was known as Camlistore. So maybe things have changed (there were some pretty bad UI issues back then as well).

I do like the idea that nothing is deleted and everything is stored using "permanodes" and signatures of objects defining mutations of a "permanode". The downside is that everything is so incredibly dependent on the indexer, and my experience is that if the indexer has a bug you are in a lot of trouble. Also sometimes you don't want to keep everything you made 10 years ago around -- especially if it's burning storage space that costs you money.


Every time I use a tagging-based system, I become more convinced that tags are what I want for almost all things, not just files.


How do you find an untagged file? Are wrongly tagged files undiscoverable?


This is more of a UI/UX question, but I thought about it on and off for months and have a partial answer. Look at how Gmail works - you have All Mail, but you also have Inbox, Sent, and your own custom labels (tags). Every message can always be found in All Mail, and you can restrict your search by date range. Similarly on an image board like Danbooru, even if you don't tag an image, it will appear on the chronological stream of every image ever uploaded to the system. So, I'm hoping the design will end up something like these two examples. You should be able to list every file, preferably in chronological order; you should also be able to exclude files that have at least one tag; and the software might have a special nag section showing you all the files that you left uncategorized.
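A rough sketch of the "All Mail plus nag section" behaviour described above, assuming a catalog where each entry records when it was added and which tags it has. The field names (`name`, `added`, `tags`) are hypothetical.

```python
def all_files(catalog):
    """Every file, oldest first: the equivalent of Gmail's All Mail."""
    return sorted(catalog, key=lambda entry: entry["added"])

def untagged(catalog):
    """The "nag section": files that have no tags at all, oldest first."""
    return [entry for entry in all_files(catalog) if not entry["tags"]]

catalog = [
    {"name": "scan.pdf", "added": 2, "tags": set()},
    {"name": "trip.jpg", "added": 1, "tags": {"paris"}},
    {"name": "memo.txt", "added": 3, "tags": set()},
]
to_review = [entry["name"] for entry in untagged(catalog)]
```

Because the chronological listing never depends on tags, an untagged or wrongly tagged file is still reachable; it is just harder to find.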


Does it help to think of the path as a tag then?


I've wanted something along these lines for a long time as well. I have trouble drawing hard lines and distinctions (this is pervasive; things like having a "favorite" anything, or the desire to debate what genres a song or movie fall into, are rather alien to me). This makes picking "one" place for something difficult. Because these are fine/fuzzy distinctions for me, it's also tricky to reason my way back to where I would have put something.

The biggest directory in my document hierarchy is "flotsam".

I think part of the problem is that organizational tactics/schema/heuristics aren't global. We need an array of safe, high-quality tools with good system support/interfaces, and the knowledge to reason about how and when to use which. Patterns.

A stack is probably a fine way to think about organizing mail or clothes. It's probably less useful for deciding where furniture or paintings should go. A filesystem that made sense for organizing source code is probably not the best tool for organizing a movie collection or a lifetime of personal documents. Genre apparently seems like a great way to organize most of the world's movie, book, and music store/sections, but I (unless I can get someone to check the store's inventory system) never know whether what I'm looking for is out of stock or just hiding in the taxonomical hinterlands.

Search can help. Tags can help. Hierarchy can help. Metadata can help.


I don't see why one couldn't have a tag system whose scope is defined by its location in a hierarchical structure.


apparently someone thought that the notion of a tag system with a semantic scope bound to a file tree hierarchy is such a ridiculous--nay, offensive!--idea that they had to downvote me for it, but couldn't bother to address it on the merits. The problem I see with tag based structures is that tags are global in scope, and so, tags have to mean one thing and one thing only; on the other hand, having one set of tags for a photo directory subtree, and another for my code repository makes a hell of a lot of sense to me.
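One way such scoping could look, as a sketch only: each subtree declares its own tag vocabulary, and a tag is resolved against the nearest enclosing scope. The `SCOPES` table, the paths, and the `scope#tag` naming are all made up for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical per-subtree tag vocabularies. The same word ("blue") can
# exist in two scopes and mean something different in each.
SCOPES = {
    PurePosixPath("/home/me/photos"): {"family", "travel", "blue"},
    PurePosixPath("/home/me/code"): {"python", "wip", "blue"},
}

def scope_of(path):
    """Return the nearest ancestor directory that defines a tag scope."""
    p = PurePosixPath(path)
    for candidate in [p, *p.parents]:
        if candidate in SCOPES:
            return candidate
    return None

def qualified_tag(path, tag):
    """Turn a local tag into a globally unambiguous one, e.g. '/home/me/photos#blue'."""
    scope = scope_of(path)
    if scope is None or tag not in SCOPES[scope]:
        raise ValueError(f"tag {tag!r} is not defined for {path}")
    return f"{scope}#{tag}"
```

With this layout, `qualified_tag("/home/me/photos/2017/a.jpg", "blue")` and `qualified_tag("/home/me/code/x.py", "blue")` resolve to two distinct tags.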


I agree, though I think this kind of hierarchy should be shallow and high level, representing only typical search scope boundaries.


Just a braindump of what I don't like about tags instead of directories, no need to repeat the advantages (I agree with some of them):

- lack of identity. Somewhat watered down in presence of hardlinks/symlinks, but still much closer to identity than a tag cloud.

- does a file without a tag exist? It is conceptually very clear how removal of the path identity causes the actual bytes to go back into the free storage pool (wiggle a bit for hardlinks, but still pretty clear), it would be quite weird however to have files stick around based solely on secondary tags like "blue".

- lack of a consistent threshold for tagging: tags are binary, but relevance is not. If some files are tagged close to a full text index while others are tagged following a more minimalistic approach, the combined soup will not be very useful.

- too powerful for convenience: file creation in a hierarchical filesystem usually happens with a somewhat meaningful default. PWD, an app-wide default or an app-specific last used folder. The default sets some of the information you might put into tags, and this information is easily corrected if it was wrong, with a single operation that might be as easy as dragging to a different folder. "Is the default folder the right one for this file?" is easily discerned and corrected, a default tag cloud however would require a full mental scan to check for applicability to the new file. Every attempt of making those defaults more clever would just force even more scrutiny onto the user.


https://web.archive.org/web/20070927003401/http://www.namesy...

Hans Reiser had some good ideas.

And by the way, there is a good analogy with the WWW: originally we had just the addresses (somewhat hierarchical), then we had Yahoo's hierarchical catalogue, and then it quickly became too much for that, and we now rely on search.


There's a difference in searching the Internet and searching your own data. On the Internet, there's much more of everything than you'd ever need, and close to none of it is something you've created, or even seen before. So you use search to get some reasonably relevant results. The search doesn't have to be - and isn't - complete nor correct.

On the filesystem, I'd like to know the data browser isn't hiding files from me by reporting only "top 100 relevant results", or not indexing half of it because $reasons, or not showing them because of faulty query. Being able to iterate through all files on your disk in a tree-like fashion seems like a feature.


I 100% agree that a tag based file system is better than a hierarchical or folder based system. The problem is that people seem to fall into one of two camps - too confused by how to make tags work, or too enamored with organizing things into folders.


A way to see folders & tags is to treat folders as tags, but with the restriction that an item may be in only one folder. The same happens in biology, where scientists try to classify each animal into one folder; a cat is put into the mammals folder (which has multiple levels of subfolders). This classification system might be much more effective if it used tags instead, where items may belong to multiple folders (or categories). This solves problems like the platypus, which belongs to multiple "biological folders".

It seems that humans have a hard time reasoning with the tag concept. I'm not sure where this comes from; is it decades of working with the folder/subfolder idiom in Windows, which most people grew up with? Is it the resemblance to the physical world, where we also put documents into one folder and one folder only? Or is it our intuition to simplify matters, so that it seems simpler to have each item belong to exactly one container? I don't know; most likely, it's all the reasons above plus a few that I didn't mention.



I'm going to really enjoy piecing apart the incredible detail put into this article.

https://tmsu.org/ This is nice.


I wouldn't really call a hierarchical FS a DAG (directed acyclic graph) because of the flaws with links you've already called out. It's not a true DAG.

Are there graph-structure filesystems?

Tagging has always seemed confusing to me because it seems like a degenerate case of a graph if your tags can't contain other tags. A graph that is only two levels deep doesn't have the flexibility of a real DAG. I'm having trouble visualizing the true correspondence between tagging and a graph structure, but I think they're pretty much the same thing if you can tag tags. Does that sound right?
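A small sketch of the "tag the tags" correspondence, assuming tags may themselves carry parent tags and that the resulting graph is meant to be acyclic. The `parents` mapping and function names are hypothetical. A file matches a query tag if the tag is reachable from any of the file's direct tags.

```python
def expand(tag, parents):
    """Return the tag together with every tag reachable through 'parents'.

    parents: dict mapping a tag -> iterable of the tags applied to it.
    The 'seen' set also guards against accidental cycles.
    """
    seen = set()
    stack = [tag]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def files_with(tag, file_tags, parents):
    """Files whose direct tags lead to 'tag' anywhere up the tag graph."""
    return sorted(
        name
        for name, tags in file_tags.items()
        if any(tag in expand(t, parents) for t in tags)
    )

parents = {"kitten": ["cat"], "cat": ["mammal", "pet"], "mammal": ["animal"]}
file_tags = {"tom.jpg": {"kitten"}, "rex.jpg": {"dog"}}
```

With flat tags, `parents` is empty and the graph is only two levels deep (files and tags); allowing tags on tags is what makes it a general DAG.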

Finding easy fast ways to navigate graphs in various UXs (shell, file explorers, etc) is an interesting challenge. Deletion is tricky.


I think this is fundamentally NOT how humans remember things - I think we are masters in "geospatial" memory compared to abstract unrelated concepts, and "geospatial" memory is probably organised in hierarchies.

Worse, it's next to impossible to efficiently explore a large tag-cloud compared to a hierarchical structure, which means it's much harder to learn about things organised in a tag-cloud compared to hierarchical tree, or a graph that is mostly tree-like.

As an example; it totally breaks the xkcd techsupport cheat sheet - you end up in the "click one at random" branch basically all the time. https://xkcd.com/627/

Obviously, tagging things is good too, but file systems and computers should emphasise the tree (a better one than the Windows file structure, though) rather than inventing a confusing cloud/fog of unrelated things.


The pain with tags is the overhead of generating them and semantic drift. Likely the best solution is simply search (some smart semantic search) with ad hoc tags to help.



I believe many developers and designers have been annoyed by hierarchies in filesystems.

But I wanted to comment on this: > there is a mismatch between the narrowness of hierarchies and the rich structure of human knowledge

Absolutely true, this is exactly what I think annoys us more than anything; it shows us how limited hierarchies are. But at the same time, I think it's very relevant to keep in mind that our knowledge and the mental relationships we can find between ideas are very hard to make explicit and complete, like you would ideally want in a tag-based filesystem. I feel that serious tagging, if manually defined, is quite expensive if we want it to be really useful (surely, we can also consider complementary automatic tagging, like AI). Hierarchies, instead, might not be very expressive, but they are very simple to use in "most" cases. So I would say we're still far from getting the best of both worlds. The problem to "solve" is information organization/structuring, and not even humans handle that ideally (we are more like faulty search engines with random inputs, prone to forget XD).

About the other ideas, I think they are all interesting, agree a lot with hashes usage and no-filenames, not so convinced about metadata, but haven't really thought enough about it. I don't think we can talk about the ideas fitting cohesively or not yet (but hey, I don't even think links in HFS are cohesive from any perspective), we would have to see more formal proposals for implementation and interface. This said, I hope we see more work along these lines in the future, it's a very worthy field to explore! Maybe start small, testing some of the ideas, we get a lot of design insight when we are working on the implementation.


My pet peeve with tagging systems in general, but especially community-based tagging is false negatives. If I search using a tag, there's no guarantee at all that it will display ALL the items that qualify for it.

I'll use my favorite porn site as an example, without going into any specifics and especially linking. I just skimmed the HN Guidelines and I don't think I'm breaking any.

Suppose I try the tag #bigtits. It is highly unlikely that I will get all the pictures with women who have especially large breasts. It's because no one will review all the images and verify if the tag #bigtits applies to them. That would be very time-consuming even for the most motivated individual who uses both hands for typing. So if I were into that particular fetish, I would need to try #bigtits, then #busty, then #nicerack, #slimandbusty, #ygwbt... because each tag has its proponents, and there's definitely overlap between them. You could - and I've seen non-porn sites doing that - use a program for automatic tagging, but then in my opinion you are defeating the purpose of tagging, which is grouping things by interesting categories. Machine-generated tags tend to be lifeless.

As I've said it is a pet peeve of mine, and I will likely start a project or two to implement my fixes for a web framework or a static blog generator. I mean that I should have confidence that a tag has been considered for all content in the collection. Program-assisted tags can help, such as keeping track of what tags existed at the point when a picture was added.
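One possible reading of "keeping track of what tags existed at the point when a picture was added", sketched under the assumption that an item can only have been considered for tags that already existed when it was added. The timestamps are plain integers and all names here are hypothetical.

```python
def never_considered(item_added, tag_created, tag):
    """Items added before 'tag' existed, so nobody could have applied it yet.

    item_added:  dict mapping item name -> time the item was added
    tag_created: dict mapping tag -> time the tag was first introduced
    """
    created = tag_created[tag]
    return sorted(name for name, added in item_added.items() if added < created)

item_added = {"old.jpg": 1, "mid.jpg": 5, "new.jpg": 9}
tag_created = {"cat": 0, "sunset": 5}
backlog = never_considered(item_added, tag_created, "sunset")
```

A search for the tag could then report both its matches and this backlog, so the user knows how much of the collection the result does not cover.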

Then there are almost identical tags. #cat vs #cats, #tortoise vs #turtle, #color vs #colour.
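A sketch of one common mitigation for near-duplicate tags: map every spelling to a canonical tag before storing or searching. The `ALIASES` table is a hand-maintained assumption; whether two tags (say, tortoise and turtle) really are the same is a judgment the table's maintainer has to make.

```python
# Hypothetical alias table: variant spelling -> canonical tag.
ALIASES = {"cats": "cat", "colour": "color", "turtle": "tortoise"}

def canonical(tag):
    """Normalize case and a leading '#', then apply the alias table."""
    cleaned = tag.lstrip("#").strip().lower()
    return ALIASES.get(cleaned, cleaned)

def search(items, tag):
    """Names of items carrying the tag under any of its known spellings."""
    wanted = canonical(tag)
    return sorted(
        name
        for name, tags in items.items()
        if wanted in {canonical(t) for t in tags}
    )

items = {"a.jpg": {"#Cats"}, "b.jpg": {"cat"}, "c.jpg": {"dog"}}
hits = search(items, "#cat")
```

This only removes false negatives caused by spelling; it does nothing for items that were never tagged in the first place.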

Overall, in practice, I think tagging, as usually implemented, is the most overrated feature of the Web 2.0 era.


I purchased the Sony Digital Paper system (DPT-RP1) and it has possibly the most ill-conceived file system design possible. All files are stored in a flat directory on the device (i.e. one long list).

Users on the Sony community site are frequently looking for updates to the software. I'm curious which file organization solution, hierarchy or tags, would be easier to implement?


I'm a big believer in tags. I tried to make a tag-based tool that merely relies on directory names as tags, so ~/t/.tag1/.tag2/tag3/your-file-or-directory and moves your files around so that the tag directories are always organized by tag counts.
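To illustrate the "organized by tag counts" idea, here is a sketch that is not taken from tagmv itself: it computes where each entry would live if its tag directories were nested from the most used tag to the least used. The root `~/t`, the function name, and the tie-breaking by tag name are assumptions, and the dot-prefix convention from the example path is left out.

```python
from collections import Counter
from pathlib import PurePosixPath

def tag_paths(entries, root="~/t"):
    """Map each entry to a path whose directories are its tags,
    ordered from the globally most common tag to the least common.

    entries: dict mapping entry name -> set of tags.
    """
    counts = Counter(tag for tags in entries.values() for tag in set(tags))
    paths = {}
    for name, tags in entries.items():
        # Most frequent tag first; ties broken alphabetically for stability.
        ordered = sorted(set(tags), key=lambda t: (-counts[t], t))
        paths[name] = str(PurePosixPath(root, *ordered, name))
    return paths

entries = {
    "a.pdf": {"tax", "work"},
    "b.pdf": {"work"},
    "c.pdf": {"work", "tax", "2017"},
}
layout = tag_paths(entries)
```

Because the ordering depends on global counts, adding one tagged file can change where other files belong, which is why a tool like this has to move files around.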

I had the idea there would be a series of tmv tcd tls commands that work with the tag directory structure.

https://github.com/foucist/tagmv

Warning - The regexp I'm using is likely broken, I suspect directories with .git/ or other dot-directories in them causes issues. It sometimes causes the .git/ innards to be moved out into the project directory or something like that. I never got around to fixing it.


OFTN OSWG started work on a tag-based filesystem called TPFS[1] back in 2011. The Haskell source code might be useful to anyone interested in developing platforms that use tag-based filesystems.

[1]: https://github.com/oftn-oswg/TPFS


As I recently inherited a huge, well-tagged music collection (tens of terabytes of files) I am very interested in this. Is there something like it, also supporting .cue files and also storing the original filenames and structure? A mediaplayer agnostic way to access this treasure trove would be the best.


I highly highly highly highly recommend beets [0].

[0] http://beets.io/


Wow! Thank you.


I'm not speaking from experience, but git-annex comes to mind: https://git-annex.branchable.com/tips/metadata_driven_views/


It sounds like you grabbed a copy of What.CD - or perhaps the entire Touhou lossless music collection torrent (2 TB). ;-)


Since 1 June 2012, I've been taking notes in unicode text files, which contain (occasional or adjacent) lines starting with 'nb ' and then a list of tags. I wrote a simple tool ("nb") in Inferno's shell (thanks to Robert J. Ennis for the port to Plan 9's rc), to (1) search for given keywords in per-directory index files pointed to by the global index, (2) index all of the nb lines in files in the current directory, and (3) if necessary, append, to a global index file, a reference to the index file in the current directory.

https://github.com/catenate/notabene

I've found that I'm comfortable with the eventual consistency this offers, in exchange for fast lookups when I want something (as opposed to indexing first, and/or indexing globally, and so waiting for indexing to get a result). This distributed-file approach also allows me to add tags to a variety of files: local files, or networked file-system files, or sshfs-mounted files, or Dropboxed files, or files under version control, or files with varying text formats; and find tags across all of them and across all the time I've been indexing.

It runs in linear time with respect to the number of tags I've entered, plus the time to read and process the global index, so obviously there are many ways I could improve the time performance (as an easy example, I could permute the index to list all the tags in alphabetical order, and next to each tag list the files that contain that tag).
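The "permuted index" improvement mentioned above amounts to an inverted index. A sketch in Python rather than the Inferno shell of the actual tool, assuming tags on an `nb ` line are separated by whitespace (the real index file format is not shown here, so this layout is hypothetical):

```python
from collections import defaultdict

def nb_tags(text):
    """Collect the tags from every line that starts with 'nb '."""
    tags = set()
    for line in text.splitlines():
        if line.startswith("nb "):
            tags.update(line[3:].split())
    return tags

def inverted_index(notes):
    """Build tag -> set of file names from {file name: file contents}."""
    index = defaultdict(set)
    for filename, text in notes.items():
        for tag in nb_tags(text):
            index[tag].add(filename)
    return index

def lookup(index, *tags):
    """Files containing all of the given tags (a conjunction of tags)."""
    if not tags:
        return []
    sets = [index.get(tag, set()) for tag in tags]
    return sorted(set.intersection(*sets))
```

Once the index is built, a lookup touches only the entries for the requested tags instead of scanning every nb line.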

I also wrote other tools, since the layout is so simple: for example, "nbdoc", to catenate the actual contents of the references returned by the primary tool (nb); and "so" (second-order), to return all the tags which appear in any nb line with the given tag(s).

I've also found that it's not easy for me to remember what tags I might have used in the past, or how I was thinking about something, so I try to use the conjunction of several tags to narrow down search results, rather than try to remember one specific tag (this seems to correspond to the observation that it can be difficult to remember exactly where in a hierarchy you put something).

The modular approach, of per-directory indexes referenced in a global file, also makes it easy for me to combine work-specific notes, with public notes, with private notes, all in the same global index file, at work; but only have the same public and private notes at home.


I wonder if Microsoft would be willing to take another shot at WinFS? It would have met most of the requirements. But the project bogged down and never shipped.

https://en.wikipedia.org/wiki/WinFS


git-annex might have some interesting thoughts about this: https://git-annex.branchable.com/tips/metadata_driven_views/


Have you seen "Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems" ? It is very relevant and if tags could be accurately machine generated we could derive organizational hierarchies from just tags.


> When people realize they need to classify a file in more than one way, they will start to use shortcuts/links to try to solve the problem. (Windows has shortcuts, Unix has soft/symbolic links, and Mac has aliases. This is a ubiquitous feature, but transferring shortcuts across existing platforms is very hard.) This sounds like a reasonable solution, but will face trouble in all but the simplest use cases.

They rule out this solution because it's not perfect, but surely if the idea had real merit this would be a serviceable test bed?

I think we have hierarchies because it's human nature to create hierarchies to make sense of the world, and we try to force them into places where they don't make sense. We see it in biology, we see it in organisations and we see it in code; I'm sure most of us here have worked with examples of OO hierarchies that made no sense.


Hi. Have you tried managing a collection of shortcuts? How do you deal with recategorization, removing files, renaming files, moving files to different storage devices, etc.?

The article has a whole section illustrating why both hardlinks and softlinks won't work in the general case. https://www.nayuki.io/page/designing-better-file-organizatio...


> The article has a whole section illustrating why both hardlinks and softlinks won't work in the general case.

Yes, that's the part I quoted.

> Have you tried managing a collection of shortcuts?

No, tagging is the bottleneck for me, which is why I don't think it will ever be useful.

> How do you deal with recategorization, removing files, renaming files, moving files to different storage devices, etc.?

Aside from moving to a different storage device, you can make a minimum viable product with a few shell scripts: one to tag a file, one to untag, one to search by tag(s), and one to listen for events (move, delete). For bonus points you could do some auto tagging from metadata.
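A sketch of what such a minimum viable product might look like, written in Python with the standard `sqlite3` module instead of shell scripts. The `TagStore` class and its schema are hypothetical. The event listener itself is not implemented; `moved` and `deleted` are the hooks it would call.

```python
import sqlite3

class TagStore:
    """Path/tag pairs in a single SQLite table."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tags ("
            "path TEXT, tag TEXT, PRIMARY KEY (path, tag))"
        )

    def tag(self, path, tag):
        self.db.execute("INSERT OR IGNORE INTO tags VALUES (?, ?)", (path, tag))
        self.db.commit()

    def untag(self, path, tag):
        self.db.execute("DELETE FROM tags WHERE path = ? AND tag = ?", (path, tag))
        self.db.commit()

    def search(self, *tags):
        """Paths that carry every one of the given tags."""
        wanted = sorted(set(tags))
        if not wanted:
            return []
        marks = ",".join("?" * len(wanted))
        rows = self.db.execute(
            f"SELECT path FROM tags WHERE tag IN ({marks}) "
            "GROUP BY path HAVING COUNT(DISTINCT tag) = ? ORDER BY path",
            (*wanted, len(wanted)),
        )
        return [row[0] for row in rows]

    def moved(self, old_path, new_path):
        """Hook for a move/rename event: carry the tags over to the new path."""
        self.db.execute(
            "UPDATE OR REPLACE tags SET path = ? WHERE path = ?", (new_path, old_path)
        )
        self.db.commit()

    def deleted(self, path):
        """Hook for a delete event: drop all tags for the path."""
        self.db.execute("DELETE FROM tags WHERE path = ?", (path,))
        self.db.commit()
```

Keying on paths is the weak point: any move that the listener misses silently orphans the tags, which is one of the problems the article's hash-addressing idea is meant to avoid.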

A working implementation (even with limitations) would be a lot more convincing than all this theory.


I wrote a paper about the same topic while in Uni about 15 years ago, and also developed a proof of concept 'filesystem' with a file explorer that uses tags. Too bad it isn't the standard in any OS yet.


Would you care to share the name or a link to your paper?


I would need to dig it up from my old harddrive, if it still exists. To be honest not sure it still exists, as I said it was a long time ago.


I have thought about this a bit and I think if similarity hashes (probably LSH forest) were used, automatic tagging could occur with a preset of hashes.


for simple use cases, folder beats everything else.

when you have a large number of files in deep paths then tags should be the way to go; you will need a database to manage them, for portability across OSes etc.

with tags we need to isolate how-to-store-the-files from how-to-organize-them-for-easy-access; tags can be used to build a virtual folder hierarchy, for example.


just curious, why the down vote? I wish I could see who down voted; is there a way to check?


What would you do with the names?


Tag them.


make an enemies of ausjke list probably


nope, just ask why? you're what you think I guess, I have not made any lists in my whole life and surprised you immediately guessed that.


If they wanted to say why, they would have. You've got no need to go around hassling people for information. You could be down-voted a trillion times and it wouldn't matter for shit in the world, so get over it.


It's a line from an old Steve Martin movie (Dead Men Don't Wear Plaid) - thought it might be too obscure :-)


ReiserFS flashbacks.


You can pry HFS from my dead hands!

seriously, graphs are hard, and the possibility to lose data is serious.


just to clarify: not lose data in the hardware sense, but in the cognitive one.

In a tag-based fs I'm sure I will find a way to "lose" files.



