This is good thinking, and mostly overlaps with what I've tried to do a few times, so I would love to see it finally happen somehow.
For example, the Newton storage system I worked on at Apple 1990-1996 was based on separating organization from storage so we could have multiple (tag and/or hierarchy) organization systems. That eventually became the "soup" system in the shipping Newton OS, where objects were retrieved by content rather than hierarchy.
More recently, I spent some painful years working on Microsoft's WinFS, which had a ton of overlap with the principles here, and demonstrates just how hard it is to go from some nice principles that all seem like the Right Thing to an actual successful adopted implementation of the principles.
I think most people would agree a tag filesystem or a similar concept is a great idea. I have wanted one myself for a long time. Yet, for some reason, it doesn't take off. Do you think it is just a problem of implementation?
Hierarchical file systems allow you to say for sure where a file isn't. Every tagging system I've ever played with has turned out to be a mess in actual use compared to the simplicity a hierarchy provides.
(I am just a user, not a developer of filesystems, so this may be a naive opinion.)
Tags are fun when you have a few thousand items to test your MVP with. It gets much less fun when you have millions of items with thousands of tags, all on a flat hierarchy.
On the other hand, when you're stuck with a flat hierarchy anyway (e.g. thousands of pictures, all named DCIMxxxx.jpg), tags can be more useful. But only if they're automatic.
I want the best of both worlds. I want to organize my stuff into folders and use tags to search for individual items. There's no need to be a purist on either side. "Designing better file organization around tags" is a good thing. "Designing better file organization around tags, not hierarchies" is not.
I absolutely agree with this, except to add that it seems, in theory, in the case of a 100% tag-based file-system, there would never (*very rarely) be a flat list where you have to scroll through millions of files. The UX of a single flat list with millions of files named DCIMxxxx.jpg is a limitation of the current format and doesn't make sense when so much information can be generated about our files upon creation.
In this hypothetical FS, all files would have dynamically generated tags for created date/time, modified date/time, access date/time, originating application, given filename, owner, group, permissions, format, geo-data if available, originating hostname, and so on. EXIF data should all be first-class tags, as prominent as the others. I visualize it working, by design, something like the Google Photos UX, which groups everything by EXIF data automatically before you ever start organizing things on your own.
Just as well, with any files created manually, in the application "tagging" should be just as prominent a function as "naming" is right now.
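To make the automatically generated tags described above a bit more concrete, here is a minimal sketch that derives a few of them from metadata the OS already tracks. The "key:value" tag format and the `auto_tags` name are assumptions made up for this illustration, and EXIF or geo-data extraction is left out because it would need a third-party library.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def auto_tags(path):
    """Derive a set of "key:value" tags from metadata the OS already tracks."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    name = os.path.basename(path)
    extension = os.path.splitext(name)[1].lstrip(".").lower()
    return {
        f"name:{name}",
        f"format:{extension or 'unknown'}",
        f"modified-year:{modified.year}",
        f"modified-month:{modified.month:02d}",
        f"owner-uid:{info.st_uid}",
        f"permissions:{stat.filemode(info.st_mode)}",
    }


# Usage: tag a throwaway file named the way a camera would name it.
with tempfile.TemporaryDirectory() as folder:
    photo = os.path.join(folder, "DCIM0001.jpg")
    with open(photo, "wb") as handle:
        handle.write(b"not a real image")
    tags = auto_tags(photo)
    print(sorted(tags))
```

Even with an uninformative filename, the file ends up with several tags it can be grouped by without anyone typing anything.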
Part of the adoption problem here would be trust. Hierarchical filesystems are something we're used to, and we can trust they're implemented correctly. That means, if I visit a folder and see some files, I know what I see is all of the files there (+/- hidden file settings); if something is missing, it's not there, period.
Tag search is a search. Can be broken. Can be optimized in a way that causes it to lie. I look at the results, and I'm not sure if they're complete. Maybe the file I'm looking for is really not there, or maybe the search gave up too early. Or the tag was slightly misformatted?
Maybe I'm too used to the old thing, but I like the notion that there's one canonical tree structure that makes all my data reachable. In case I've misplaced something, the search space of all paths through the filesystem tree (or a subtree of interest) is vastly smaller than the search space of all possible values of all relevant tags.
> Tag search is a search. Can be broken. Can be optimized in a way that causes it to lie. I look at the results, and I'm not sure if they're complete. Maybe the file I'm looking for is really not there, or maybe the search gave up too early. Or the tag was slightly misformatted?
Google Docs, while not exactly a tag system, wants you to search instead of use a hierarchy, and so is my go-to example: A good 90+% of the time, I can't find something I know is there and have to ask a co-worker for a link.
The missing consideration when it comes to tags is simple discoverability. You have to already know enough about what you're looking for in order to find it. A hierarchical system lets you do systematic browsing.
You make excellent points, although I think the issues you raise exist in our current filesystems as well. Provided the FS is indexed properly, opening a tag should show all files associated with that tag immediately. Just like opening (or listing) a directory does.
And in the case of search, it absolutely sucks on today's filesystems. Don't get me wrong, find and grep are incredible tools. I simply mean that it's not like searching for files beyond the hierarchy is known for the pleasurable UX. The only way I know a grep of a whole drive or deep directory is done is because I get my blinking cursor back.
At the very least with a proper tagging system, we would be inherently familiar with the indexes available to us.
> And in the case of search, it absolutely sucks on today's filesystems.
It does. You mention grep, I'd even mention find - half of the time I'm wondering whether it has searched everything I wanted, or I misspelled the command. Or file search in Windows (Vista+) - I just don't trust it; I'm pretty sure it missed some data in the past for one reason or another.
Now with traditional file systems, I at least have the file tree. With tag-based systems, I'd only have search - so it better be trustworthy, both in reality and UX-wise. It needs to project the feeling of correctness and completeness of results.
> The only way I know a grep of a whole drive or deep directory is done is because I get my blinking cursor back.
The only way I know a find of a whole drive does what it's supposed to be doing is because it emits a stream of "find: `/some/path': Permission denied" messages.
> At the very least with a proper tagging system, we would be inherently familiar with the indexes available to us.
> I'd only have search - so it better be trustworthy
Absolutely agreed. It should be as reliable and as immediate as what we have now.
> I at least have the file tree
I don't know how this would work in practice, but I'm imagining something where, UX-wise, a tag-based FS could act very much like what we're already used to. Google was very much on this track in their early versions of "labels" in gmail and google drive (shame they've slowly moved away from it)
Just last night I used some desktop app I found to tag a few thousand scanned documents so I could do my taxes this morning (researching my options is how I ended up finding this article). Once they were all tagged, I was able to traverse in a very familiar way.
At "root", there's too much noise, but as soon as I pick a tag, say "2017" - now I have whittled down my available tags. And then I pick "receipts". Smaller list of files and a smaller list of tags. And then "restaurants". And then "business".
That seems quite a bit like a hierarchy to me. The subset of tags that are related to the first one I chose act just like sub-directories. The UI could work exactly like what we already know and love. As we know it now, I would have ended up at ./2017/receipts/restaurants/business.
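The narrowing behaviour described here (pick a tag, get fewer files and fewer remaining tags) can be sketched in a few lines. This is only an illustration under assumptions: the file names and tag sets are invented, and files are modelled as a plain dictionary rather than anything a real filesystem would expose.

```python
def narrow(files, chosen):
    """Return the files carrying every chosen tag, plus the tags still worth offering."""
    matches = {name: tags for name, tags in files.items() if chosen <= tags}
    # Only tags that appear on a matching file act like "sub-directories".
    remaining = set().union(*matches.values()) - chosen
    return matches, remaining


# Hypothetical scanned documents and their tags.
files = {
    "lunch.pdf": {"2017", "receipts", "restaurants", "business"},
    "dinner.pdf": {"2017", "receipts", "restaurants", "personal"},
    "laptop.pdf": {"2017", "receipts", "hardware", "business"},
    "w2.pdf": {"2016", "taxes"},
}

matches, remaining = narrow(files, {"2017", "receipts"})
print(sorted(matches))    # files left after two picks
print(sorted(remaining))  # tags still offered for further narrowing

final, leftover = narrow(files, {"business", "restaurants", "receipts", "2017"})
print(sorted(final))
```

Because the chosen tags form a set, the order in which they are picked does not change the result.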
Of course with directories, that's the only way I could organize my files. But if we're working with tags, I would get the exact same results going to:
/business/receipts/2017/restaurants/
/receipts/2017/business/restaurants/
You get the idea. But, I could also potentially do something like:
/receipts/2017/client_1+client_2+client_5
or
/2017/receipts/business+!client_3
Now, still within the realm of a directory structure - even using terminal commands we're all familiar with and a bit of extra sugar - I have access to more features. I can't merge directories in a tree. Not that easily, anyway. But in this case I can `cd` into a directory of exactly what I want in a familiar way without trying to remember if what I'm looking for is in ~/Dropbox/receipts/2017 or ~/Documents/business/client_1/receipts.
It's in both. "Dropbox" and "Documents" are no longer necessary. Nor is ~/.
I like the content of your answer: filtering by tags to narrow down the search results, only showing tags that belong to the current set of results, the benefits of order-insensitive path parts, and the ease of taking unions of tag results.
The path examples that you created are Boolean queries with different symbols: slash means AND (low precedence), plus means OR (medium precedence), and exclamation means NOT (high precedence). Your last example could be rendered as "2017 AND receipts AND (business OR NOT client_3)" and mean the same thing.
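A small evaluator shows how those path symbols could map onto Boolean logic. It is a sketch, assuming tag names never contain "/", "+" or "!" and that "!" only ever prefixes a single tag; the function names and sample files are made up for the example.

```python
def matches_path(path, tags):
    """Evaluate a tag path against one file's tag set.

    "/" separates terms that must all hold (AND), "+" separates alternatives
    inside a term (OR), and a leading "!" negates a single tag (NOT).
    """
    terms = [term for term in path.split("/") if term]
    for term in terms:
        satisfied = any(
            (alt[1:] not in tags) if alt.startswith("!") else (alt in tags)
            for alt in term.split("+")
        )
        if not satisfied:
            return False
    return True  # an empty path (the "root") matches every file


def select(path, files):
    """List the files a tag path would show, sorted by name."""
    return sorted(name for name, tags in files.items() if matches_path(path, tags))


# Hypothetical files and tags.
files = {
    "a.pdf": {"2017", "receipts", "business", "client_1"},
    "b.pdf": {"2017", "receipts", "personal", "client_3"},
    "c.pdf": {"2017", "receipts", "personal"},
    "d.pdf": {"2016", "receipts", "business"},
}

print(select("/2017/receipts/business+!client_3", files))
print(select("/receipts/2017/client_1+client_3", files))
```

Splitting on "/" first and "+" second is what gives slash the lowest precedence and plus the middle one, with "!" binding tightest because it is only checked on an individual tag.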
In any case, the illustration you made is indeed the sort of user interaction that I want to design into a future prototype.
>if I visit a folder and see some files, I know what I see is all of the files there (+/- hidden file settings); if something is missing, it's not there, period.
Funny you should say that. Only a few weeks ago a colleague of mine was perplexed by a file which showed up in a 'save as' box, but not in Windows Explorer. It was an ordinary log file, same as a bunch of others in that folder, no reason for it to be different. Apparently he later discovered the file was visible if navigated to from C:, but not through the desktop shortcut he'd made to that folder. We could only conclude it was a Windows bug. Whatever the cause, it wasted a good deal of our time hunting for that file...
I understand your concerns, and they are indeed valid. First off, I doubt that managing millions of files in a traditional hierarchical file system is fun either. You'd likely run into problems with making unique names, sharding folders, and categorizing files that logically belong in multiple places. I also have some worries that existing file systems (say NTFS or XFS) will behave or perform well with millions of files. I believe that implementing tags is a starting point for the problem of managing millions of files in a sensible way.
Speaking of thousands of pictures, what I really want is to dump all my photos into one folder. Right now, tools are ill-equipped to deal with large folders, so I am forced to manually create a new folder for every thousand or so items.
> I also have some worries that existing file systems (say NTFS or XFS) will behave or perform well with millions of files.
I believe the mantra for XFS is "if you have large or lots, use XFS". XFS has a lot of optimisations for metadata operations which should mean it's better than most filesystems for lots-of-files and large-files cases (Dave Chinner has given several talks about the performance characteristics of XFS with "large or lots" cases).
A few things I can see off the top of my head that would need to be solved for tag filesystems to be able to take off:
1. Inertia. Most software assumes hierarchical filesystems, and assumes it can control some portion of that hierarchy. This includes things ranging from search paths for various things ($PATH for binaries, library paths, etc), temporary files, preferences files for applications, assumptions made about hierarchical filesystems in archive formats like ZIP and TAR, etc.
2. Permissions. With a hierarchical filesystem, you can apply permissions on higher levels of the hierarchy to control access to lower levels, and you can have various forms of permission inheritance to control permissions on new files. Need a design for how to do that on tag filesystems.
3. Mounting. Filesystems come and go; some are on your OS drive, some are on removable media, some are network filesystems. Hierarchical filesystems means that each one has a single root, and it's easy to tell where the boundaries are.
4. Tagging taxonomy. What kinds of tags do you use? What happens if you mount a filesystem in which someone else used a different tagging taxonomy than you used? Who controls different parts of the tag space? What happens if you import an archive of material which uses a different tagging scheme than you use?
5. Projects. How do you group files of different types with different tags into discrete projects? How do you bundle related files together? How often would you want to see files by arbitrary tag lumped together, rather than looking in particular projects that have a pre-defined structure?
6. UI. How do you browse tags? How do you refine down? In many cases, rather than a general purpose tags based interface, you actually want media-specific browsers, like ones specialized for music which let you browse by artist, album, playlist, etc, or photo galleries that can show you previews of the photos, or video browsers which can show projects, bins, and sequences (for video editing), or IDEs which can either show you file hierarchy or allow you to browse by class, function, etc.
7. And finally, why is a tag-based filesystem necessary for this? What is wrong with the current approach, in which there are special purpose applications which can index, tag, and display certain types of media in certain ways? For instance, you can use your text editor or IDE to navigate among files within development projects, iTunes or Play Music or whatever to browse your music, Lightroom or Darktable or iPhoto or Google Photos to manage your pictures, iMovie or Final Cut or Premiere or Avid for browsing and managing your video, and so on. They all frequently have some way of tagging files, but also have specialized UIs for browsing the specific types of files they are defined for without having to do explicit tagging, and the actual files are just stored on a normal hierarchical filesystem.
There are a couple of good thoughts in the original post, but a lot is handwaved away, such as mutability of files, which is an incredibly important use case, for a huge amount of what files are used for today. Lots of people have brought up the idea of making tag based or database based filesystems, such as the failed WinFS effort (https://en.wikipedia.org/wiki/WinFS), but it's actually a pretty big problem to solve.
I think many of these issues can be addressed with mechanisms proposed by the author.
Mainly, the more complex tags which can themselves refer to other tags.
1. This is probably the trickiest one. You may be able to do some sort of translation between a hierarchical system and the tag system using tags themselves. You could have a series of tags that refer to each other, such that the hierarchical location is essentially encoded in the tags themselves.
2. Again, maybe just special tags?
3. Yeah, again, tags. Just tag the thing with the media it's on.
4. Aside from the basic UI side of things which should help, there is the idea of shared tagging systems. I don't recall if that came from the author or another commenter on HN. And you can basically ask the same question about hierarchical systems. It's not exactly a solved problem there either.
5. Again, the complex tags. Just make a tag for the project.
6. Obviously UI is a big question. I'm not sure how it relates so much to media-specific browsers though. They basically present a different view of a section of a filesystem. You have to do some work to let them do this, or else use a system like iTunes and buy all of your media through them.
7. Although I feel this is well addressed by the author, one thing I think you aren't considering is that each of these applications requires its own setup in order to provide that view. You often can't just take the directory from one of these programs and use a different program to view it and have it all work properly. If you only have one program for each media type and never want to use anything else, that works, sort of. Many years ago I directed iTunes to redo the file layout for my music collection and rendered it effectively useless for direct browsing. I never really recovered from that due to the time involved to sort it out.
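As a rough illustration of point 1, tags that refer to each other could encode a hierarchical location like this. It is a sketch under assumptions: the parent-reference layout and both function names are invented here, and it ignores the case where the same name appears at two levels of one path.

```python
def tags_for_path(path):
    """Translate a hierarchical path into {tag: parent_tag} references."""
    parts = [part for part in path.split("/") if part]
    return {child: parent for parent, child in zip([None] + parts, parts)}


def path_for(tag, parent_of):
    """Rebuild the hierarchical location by following parent references."""
    parts = []
    seen = set()
    while tag is not None:
        if tag in seen:
            raise ValueError(f"cycle in tag references at {tag!r}")
        seen.add(tag)
        parts.append(tag)
        tag = parent_of[tag]
    return "/" + "/".join(reversed(parts))


# Round trip: a legacy path becomes tag references and comes back unchanged.
parent_of = tags_for_path("/Documents/business/client_1/receipts")
print(parent_of)
print(path_for("receipts", parent_of))
```

Software that insists on a path could be handed the rebuilt string, while everything else queries the tags directly.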
And mutability isn't totally handwaved away, again with the complex tag system you could tag mutated works with a reference back to the original. This doesn't cover the case where you don't wish to retain the original, but then you could just do a simple find/replace with the old and new hashes in the simplest case.
I think you're missing some of the subtleties of solving these problems using "just more tags."
In a hierarchical system, a lot of these organizational issues are local. If I have one directory that consists of a project organized one way, and another directory that consists of a different project organized a different way, those different organizations don't really interact with each other in any way.
If you are using tags for everything, in order to avoid weird mishmashes of different ways of using tags, you would need to either have a completely standardized tagging system that everything used consistently, or you'd have to always include various contextual information in your queries or in your browsing in order for the queries to make sense. For instance: [mount: my-hd][project: my-project][type: jpeg]
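A query in that bracketed form can be sketched as a set of required key/value pairs. The parser below is an assumption about what such syntax might look like, not an existing tool, and the file records are invented.

```python
import re


def parse_query(query):
    """Turn "[key: value][key: value]" into a dict of required attributes."""
    pairs = re.findall(r"\[([^:\]]+):([^\]]*)\]", query)
    return {key.strip(): value.strip() for key, value in pairs}


def run_query(query, files):
    """Return the names of files whose attributes satisfy every pair in the query."""
    wanted = parse_query(query)
    return sorted(
        name
        for name, attrs in files.items()
        if all(attrs.get(key) == value for key, value in wanted.items())
    )


# Hypothetical files with contextual attributes.
files = {
    "logo.jpg": {"mount": "my-hd", "project": "my-project", "type": "jpeg"},
    "logo-old.jpg": {"mount": "my-hd", "project": "other-project", "type": "jpeg"},
    "notes.txt": {"mount": "my-hd", "project": "my-project", "type": "text"},
    "backup.jpg": {"mount": "usb-stick", "project": "my-project", "type": "jpeg"},
}

print(run_query("[mount: my-hd][project: my-project][type: jpeg]", files))
print(run_query("[project: my-project][type: jpeg]", files))
```

Every piece of context dropped from the query widens the result, which is the trade-off in question: less typing, but more risk of mixing unrelated material.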
I think you overstate the problem with different applications as well. For a large amount of the metadata that is relevant for these applications, there is a standard tagging system. ID3 for music, EXIF for images, XMP for various image and video formats. It's true that there is some metadata that these applications store in proprietary databases, but that's mostly an issue of it being difficult to come to a consensus on standards that meet everyone's needs, and it's easier to just write some proprietary metadata somewhere. With tagging systems, if there wasn't agreement on the schema of tags, you'd still have the same issue.
I don't think it's a bad idea to consider alternatives that are more general and more flexible than what we're doing now. But I do think it's pretty easy to handwave about how nice a tag based system would be, and a lot harder to solve all of the little problems that are going to come up, turn it into a real, coherent, working whole, and then get enough critical mass so that it is used outside of a small niche with a handful of applications.
I'm sure you're right that there are a lot of overlooked subtleties. That said I'm not sure some of those problems you mentioned would exist, or at least I'm not sure they would be any worse with a tag system than a hierarchical one.
For example, how is that example query any worse than the current situation? Right now you'd navigate to the project directory (requires specifying more than your example already) and then use some search method depending on OS/WM/etc. And then you still end up with a big list of jpegs to look through. This is sort of a worst-case example for both systems, and still I think the tag system comes out ahead here - by a little - just because it would give you the ability to spread the project across multiple drives without requiring you to do two searches if you don't know which drive the desired image is on. You can improve the situation for either system by manually specifying more information. Put better tags on the images or put them in more specific directories or title them.
As for specific applications, it's not the metadata encoded into the files that I'm talking about. It's as simple as the directory structure itself that is used to store all of this. I can't have one application organize everything and then trivially point another application at the directory and have it work.
With a tag-based system this starts to change. I don't need to tell a new music player where my music is, and then go through whatever process is needed to let it properly work with the current directory organization. At worst I tell it which tags to include or perhaps exclude. From there many options exist. Maybe it pulls in metadata from the files themselves. Maybe I provide an external file in whatever format. Maybe I tell it which tags to associate with which fields. You could do a lot of things here.
I also won't end up telling the application to reorganize things as I did many years ago with iTunes, which promptly made it nearly impossible to wade through my music manually. I had it sort everything into directories based on the artist with subdirectories for albums. It sounded great, until I remembered just how much music I had off OCRemix, where an album is a large collaboration between many people. All of those albums were ripped apart. Ironically, I also had some standardization issues with things like artist names which caused more trouble. Once I stopped using iTunes I basically abandoned that collection because of the work required to fix it.
Yeah, standardization is going to be sort of a problem, but I don't think it's quite as big of a deal as you think. For one, the OS is going to ship with a bunch of standard tags just for itself to work. There will also just be a lot of really standard stuff people are interested in that can be shipped with them. You also have file extensions, for both specific extensions and also generally what kind of information they contain. And finally there are just good old translations. The hierarchical system basically utilizes all these methods and suffers from the same problem - namely you can put directories wherever you want and name them whatever you want. Same problem, different manifestation.
I think the biggest benefit would come from a system that can present itself either hierarchically or tag-based. They both have merits. I've already presented some ideas on how you could store the hierarchical structure in the tags. I'm not so sure how you store the tags in a hierarchical system directly. You could probably fake it with a separate datastore easily enough though.
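Faking tags on top of a hierarchical system with a separate datastore might look roughly like this, using SQLite as the side database. The table layout, function names, and paths are assumptions for the sketch, and nothing here keeps the database in sync when a file is moved or renamed.

```python
import sqlite3

# A side database that maps ordinary file paths to tags.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE file_tags ("
    " path TEXT NOT NULL,"
    " tag TEXT NOT NULL,"
    " PRIMARY KEY (path, tag))"
)


def add_tags(path, *tags):
    """Attach tags to a path; repeating a tag is harmless."""
    db.executemany(
        "INSERT OR IGNORE INTO file_tags (path, tag) VALUES (?, ?)",
        [(path, tag) for tag in tags],
    )


def files_with(*tags):
    """Return the paths that carry every one of the given tags."""
    wanted = sorted(set(tags))
    if not wanted:
        raise ValueError("need at least one tag")
    placeholders = ", ".join("?" for _ in wanted)
    rows = db.execute(
        f"SELECT path FROM file_tags WHERE tag IN ({placeholders}) "
        "GROUP BY path HAVING COUNT(*) = ? ORDER BY path",
        (*wanted, len(wanted)),
    )
    return [row[0] for row in rows]


# Hypothetical files living in unrelated directories.
add_tags("/home/me/Dropbox/receipts/2017/lunch.pdf", "2017", "receipts", "business")
add_tags("/home/me/Documents/business/client_1/receipts/hotel.pdf",
         "2017", "receipts", "business", "client_1")
add_tags("/home/me/Pictures/DCIM0001.jpg", "2017", "photos")

results = files_with("receipts", "2017")
print(results)
```

The files stay wherever the hierarchy put them; only the index knows that two of them belong together.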
Finally, when did this discussion of general design goals turn into one of a real-world implementation, much less widespread adoption? I'm not sure how this is relevant.
Once upon a time there was Google Desktop, and it was incredibly useful. I wouldn't trust Google on my machine anymore, but an open source replacement would be great.
https://en.m.wikipedia.org/wiki/Google_Desktop
OSX has Spotlight. It searches pretty much all the same stuff Google Desktop is listed as searching. If you want something open source instead then check out Quicksilver [https://qsapp.com].
Thanks. My subtle point was why use tags when you could use search? Search engines index the whole internet - millions of hierarchical file systems of all kinds. Why reinvent file systems?
I don't use Mac. Windows 10 "Cortana" is no match for where Google Desktop was 10 years ago. I rarely use it unless I have to. Google searched the contents. Cortana just searches filenames and tries to route searches to the web. Linux...I don't see anything - Ubuntu has a Cortana-like feature, but it's not Google Desktop. It's very odd to go backwards technology-wise.
Google Desktop was indeed a great solution, partially solving the tag problem in Windows. A good tagging system trumps search because you can make the results much more deterministic. The ability to create your own namespace and then organize every file into that namespace ensures encapsulation. In a search result, you almost always have to filter out irrelevant items.
Good working search solves the problem pragmatically though :)
I am far too lazy to tag every file and I would despise such a system if forced into it. Some files I want to keep while realizing I may never need them again. Tagging is a waste of time. It's also difficult to anticipate future use and what tags are helpful.
Now a tagging system that could be built over time from search results could be very useful. In other words, apply tags in batches as you go, to aid future searches.
If I were designing a new OS, I'd force each piece of software to auto-tag and index the file contents in meta tags and feed it to a global OS search function.
You assume that you need to tag manually. You assume that you are forced to do so. Currently, it's perfectly possible to put all files in a single folder. Your OS will be happy about that (except maybe some technical limitation of max filecount of a folder).
All of the things you pointed out are legitimate concerns. My ideas are at an early stage (esp. without an implementation or usage experience), so they are necessarily incomplete. To respond to your points:
1. Inertia on hierarchies. Totally correct. I want to tackle immutable media collections first (photo, audio, video, documents), because developer tools (scripts, source code, path configurations) have a more intimate coupling with the hierarchical model.
2. Permissions. I have no clear answers. I suppose there could be a meta mechanism like, "for every file tagged as WordDocument, make it readable to $ANOTHER_USER". However, I believe the default of share-nothing is about the level that we get in most web-based applications. And I think the file-level permissions (like Unix, NTFS, etc.) are too fine-grained and confusing.
4. Tag taxonomy. My proposal isn't any worse than the ad hoc taxonomies that people make in hierarchical file systems. At least when I proposed public "tag cores" (search in the article), there is one possibly viable way to unify everyone's vocabulary (opt-in, of course).
5. Projects. I suggest tagging every relevant file with its project tag. I'm not sure what you are pre-supposing with "pre-defined structure", because you could tag tags with the project tag. Also, I like the idea of tagging over hierarchies in this case because many of my project files also belong to other projects (e.g. monthly account statements vs. tax documents) or to a general stream (e.g. photos).
6. User interface. This is one of my big obstacles at the moment. With a hierarchy, there is really only one reasonable way to present the files, and one reasonable way to browse them. With tags, the options are numerous. There might even be metadata to control how tags are ranked, hidden, etc. At the very least, a UI would need to present many facets of metadata, such as file type, topic, timestamp, author, etc.
7. Necessity. You hinted at the problem in your statement. Application software has demonstrated to us that media libraries (e.g. songs in iTunes, photos in Adobe Lightroom) are extremely convenient and valuable. But they are all proprietary data silos. The metadata you generate is only accessible within one media program. Moving to a different program means giving up your investment. Also, media libraries are brittle, and when you move/rename the data files in the file system, the media library often behaves poorly (broken links, slow rescans, etc.). I believe that keeping metadata in an application-neutral format and letting the file system handle queries is a better alternative than proprietary media libraries. I wrote about this in a section: https://www.nayuki.io/page/designing-better-file-organizatio...
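For point 2, the "meta mechanism" idea could be sketched as tag-based grant rules layered over a share-nothing default. Everything here is hypothetical: the `Rule` shape, the access names, and the users are invented for illustration, and real permission systems have to handle groups, revocation, and much more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Grant one kind of access to one user for every file carrying a tag."""
    tag: str
    user: str
    access: str  # "read" or "write" in this sketch


def can_access(user, access, file_owner, file_tags, rules):
    """Share-nothing default: only the owner gets in unless a rule grants access."""
    if user == file_owner:
        return True
    return any(
        rule.user == user and rule.access == access and rule.tag in file_tags
        for rule in rules
    )


# "For every file tagged as WordDocument, make it readable to bob."
rules = [Rule(tag="WordDocument", user="bob", access="read")]

report_tags = {"WordDocument", "2017"}
photo_tags = {"photo", "2017"}

print(can_access("bob", "read", "alice", report_tags, rules))
print(can_access("bob", "write", "alice", report_tags, rules))
print(can_access("bob", "read", "alice", photo_tags, rules))
```

One consequence of this design is that adding or removing a tag silently changes who can read a file, which may be part of why permissions remain an open question.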
I'm not sure about that. I'm fine with a hierarchical file system and links for my projects (desktop or explorer side bar). I tried a few times to make use of macOS tags, but I didn't find them useful. I guess, for people who work with actual documents, it might be different. For me, all those systems trying to be smart end up showing me billions of .class files from my build directories and wasting cycles trying to index them.
Oh man, that would take a book. If I had to severely boil it down: You can have a beautiful unifying idea that seems to make total sense in the abstract, but then bog down completely when you have to make real implementation decisions in the context of a massive existing platform with a several-thousand-person development team. At some point a system's assumptions become so established that there are certain aspects that are literally impossible to change. Good intentions, even with the full support of the highest levels of management, can't resolve all the tangles.
Wow, I forgot Hal practically did write a book. :)
I was on the Windows Shell side, sitting in a lot of schema meetings where products couldn't figure out how to combine their incompatible complex existing schemas into one new WinFS schema that would be understandable to anyone, and in a lot of other meetings where I was sort of a negotiation translator, decoding user interface concepts for the SQL team and SQL concepts for the Shell team, though their priorities were wildly different.
The common theme throughout was that it was pretty clear none of the product teams saw any reason they should go through unknown years of effort and compatibility hell, delaying the feature roadmap for their product, just to fulfill Bill's desire for this abstract notion of Integrated Storage. So these meetings went on interminably to make Bill and David and Bob and Jim happy, while none of the participants outside the WinFS team itself saw any point to the exercise.
It was not a pleasant experience, but I learned a lot about how a large organization works! (A reason I'm no longer in one.)
Thanks! That was pretty much just me, but it was inspired by reading a whole bunch of other people's books and papers on persistent object systems. That was a big topic of research at the time. I don't really know why the persistent object idea fell out of sight so completely.