The root cause of all this is a relatively obscure NTFS feature called alternate data streams.
Obscure indeed, I've never seen them used for anything other than hiding malicious content. Curious, I read about them on Wikipedia[1] and it turns out they were originally created to support resource forks in Services for Macintosh. Browsers also use them to flag files downloaded from the internet.
Hardly obscure, every modern OS has an equivalent feature, but only OSX and Windows unify it with the regular filesystem API.
Streams and resource forks are a play on a now-standard UNIX feature that almost nobody uses because it has a shitty non-file based API that also breaks most tools unless they are specifically aware of them: extended attributes. Resource forks and extended attributes are almost equivalent in every single way, except that extended attributes can only be read/written atomically (limiting their size to strings that will fit in RAM), whereas a fork or stream can be opened like a regular file. Stick that in your pipe and smoke it, UNIX sycophants, another case where Windows is more UNIX than UNIX ;)
The file-or-directory vagueness created by the hierarchy of resources buried within a file also maps more closely to how the most popular path naming scheme on the planet (URLs) works: a URL can always represent both a file and a collection simultaneously, so I see this as closer to an ideal than the alternative where files can have no children at all. Sadly nobody actually uses these APIs like that, because all our tooling sucks so bad at coping with it. I sometimes wonder what the world would look like if directories on popular operating systems had simply been made 0 byte files.
> feature that almost nobody uses because it has a shitty non-file based API that also breaks most tools unless they are specifically aware of them: extended attributes
Mind you, OS X makes extensive use of extended attributes in addition to resource forks (and it's largely deprecated resource forks in favor of app folders). Spend some time poking around Siracusa's reviews (since Tiger); he loves to go into detail about every new way Apple makes use of extended attributes.
Also, it's not fair to say that almost nobody uses them. Chrome makes use of extended attributes, as does KDE's metadata system and a few other things.
> (limiting their size to strings that will fit in RAM)
That's an understatement. The Linux kernel API limits the size of all extended attributes to 64KB, and the most popular filesystems limit them further to 4KB. That's not really comparable to a true fork.
ZFS is the exception: its extended attributes are implemented as forks, and the maximum size of an extended attribute is the same as that of a file. Unfortunately, those aren't accessible on ZOL because the kernel won't support it, so you can really only take advantage of it on Solaris/Illumos (and maybe FreeBSD?).
Does OSX use extended attributes to track things like the "color" attribute on a file (that shows up in Finder)? Or is this tracked via the .DS_Store hidden file?
I believe the .DS_Store files are just a fallback so if you are accessing files on a network share or on a file system that doesn't support extended attributes like FAT32, those features can still work. The native implementation on HFS+ uses the extended attributes.
.DS_Store is where Finder saves the folder UI configuration (icon positions, list mode, etc.), and these files are present even on native HFS+ (although "hidden" by default like any other Unix file name starting with ".").
There were some reports online that the future APFS in the 10.12 betas didn't leave .DS_Store files around.
- Unix xattrs have a terrible API and awful command line tools: listxattr(2) returning \0-separated character arrays with lists of attributes that are next to impossible to decipher in C? - check! Hiding certain xattrs by default based only on their names? - check!
- xattrs have magical qualities based on their names, the kernel version, the kernel configuration, and the filesystem mount options (eg. "security.selinux", "trusted.*")
- Some xattrs are \0 terminated (and the APIs set and return the \0, making them very awkward to use from shell scripts), some aren't, and some are indeterminate. They can also be binary blobs.
Also, add too many xattrs and you can no longer get a list of them:
As noted in xattr(7), the VFS imposes a limit of 64 kB on the size of
the extended attribute name list returned by listxattr(7). If the
total size of attribute names attached to a file exceeds this limit,
it is no longer possible to retrieve the list of attribute names.
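For anyone who hasn't seen the raw format being complained about, here is a minimal sketch of decoding a listxattr(2)-style buffer in Python. The function name and the sample buffer are made up for illustration; in real code Python's os.listxattr already does this splitting for you on Linux.

```python
def parse_xattr_names(buf):
    """Split a raw listxattr(2)-style buffer into attribute names."""
    # Every name is terminated (not just separated) by a NUL byte, so a
    # well-formed buffer ends in b"\0" and split() produces one trailing
    # empty chunk, which is skipped here.
    return [chunk.decode("utf-8", "surrogateescape")
            for chunk in buf.split(b"\0") if chunk]


# Hypothetical buffer, shaped like what the syscall hands back.
sample = b"user.mime_type\0security.selinux\0user.xdg.origin.url\0"
names = parse_xattr_names(sample)
print(names)
```

Note that this only covers the name list. The values are a separate problem, since they may or may not carry their own trailing \0.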
It's used in lots of places. Internet Explorer uses it to save whether a file was downloaded via IE. They are just only useful on NTFS, and often not even then because file hating utilities like Dropbox don't store them. So if you upload a file with an ADS to Dropbox, then copy it back again, you'll have lost that data.
One thing I'm not sure about is whether it appears in the file size when using dir. And if a hashing utility only covers the file attributes, base file name and the default (unnamed) data stream, then you can append to the file via another data stream without changing the hash. So hash utilities need to be ADS aware to be truly useful in Windows.
Unless you are calling data an "attribute" though, it's really a bit of a silly comparison. Literally it's a separate namespace in which you store data. The standard tools and utilities provided by Windows generally only look at the default stream. The article is correct, git is pretty much doing something very similar, only the data is stored in .git (or specified somewhere else) and you use a tool like git to get access to that data, but you can also dive into the directory directly with any other tool. In Windows you use streams.exe, and it's a. generalised, b. non-portable as it's an intrinsic part of NTFS, and c. denoted as part of the NTFS filename by the delimiter ":", which is a reserved character and documented as such.
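A rough sketch of what "stream aware" hashing could look like in Python, under some loud assumptions: the caller already knows the stream names (enumerating them needs Win32 calls that are not shown), the "path:name" syntax only means a stream on NTFS, and digest_with_streams and its open_stream parameter are hypothetical names, not any existing tool's API.

```python
import hashlib


def digest_with_streams(path, stream_names=(), open_stream=open):
    """SHA-256 over the unnamed stream plus each named stream given."""
    outer = hashlib.sha256()
    # "" stands for the unnamed (default) stream, i.e. the bare path.
    for name in [""] + sorted(stream_names):
        target = f"{path}:{name}" if name else path
        inner = hashlib.sha256()
        with open_stream(target, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                inner.update(block)
        # Mix in the stream name with a fixed-size per-stream digest so
        # bytes cannot be shuffled between streams without changing the
        # overall result.
        outer.update(name.encode("utf-8") + b"\0" + inner.digest())
    return outer.hexdigest()
```

With no stream names this hashes just the file's main content. Passing something like ["Zone.Identifier"] makes the digest depend on that stream as well, so data tacked on via a named stream no longer goes unnoticed.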
> Internet Explorer uses it to save whether a file was downloaded via IE.
Wait, but why? Chromium (at least on Linux) uses extended attributes too, but to record the origin and referrer of downloaded files (which can be really useful, once you know about it).
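If you want to look at those yourself, something like the following should work on Linux. The attribute names are the freedesktop "user.xdg.*" ones that Chromium is understood to write. The helper names are mine, and the sketch simply returns None when the attribute is missing or the filesystem has no xattr support.

```python
import os

# Attribute names from the freedesktop extended-attribute guidelines.
ORIGIN_ATTR = "user.xdg.origin.url"
REFERRER_ATTR = "user.xdg.referrer.url"


def read_text_xattr(path, name):
    """Return the attribute as text, or None if missing or unsupported."""
    try:
        return os.getxattr(path, name).decode("utf-8", "replace")
    except (OSError, AttributeError):
        # OSError: no such file, attribute absent, or no xattr support.
        # AttributeError: os.getxattr is only available on Linux.
        return None


def download_provenance(path):
    """Collect origin and referrer URLs recorded on a downloaded file."""
    return {"origin": read_text_xattr(path, ORIGIN_ATTR),
            "referrer": read_text_xattr(path, REFERRER_ATTR)}
```

Calling download_provenance on a file in your downloads folder should give you the two URLs if the browser recorded them.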
Chrome does it too. As does Outlook, Firefox, possibly a bunch of other things. I think you'll find that the stream is Zone.Identifier. It can contain a value of 1 to 4, where each corresponds to one of Windows' security zones. (Restricted sites, internet sites, trusted sites, and local intranet from 4 to 1 respectively. There's a fifth option, zone 0, which is "local computer", but it's unused.)
This is the source of the prompts in Windows that say "this file came from the Internet, are you sure you wish to run it?".
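The stream itself is just a small INI-style text blob, so it is easy to decode. A minimal sketch, assuming the usual "[ZoneTransfer]" section with a "ZoneId" key; the zone table follows the standard URLZONE numbering, and the function name and sample text are illustrative only.

```python
import configparser

# Standard URLZONE numbering used by Windows security zones.
ZONES = {
    0: "Local computer",
    1: "Local intranet",
    2: "Trusted sites",
    3: "Internet",
    4: "Restricted sites",
}


def parse_zone_identifier(text):
    """Return (zone id, zone name) from Zone.Identifier stream contents."""
    # Interpolation is disabled because newer streams may also carry URL
    # keys whose values contain '%' characters.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    zone_id = parser.getint("ZoneTransfer", "ZoneId")
    return zone_id, ZONES.get(zone_id, "Unknown")


sample = "[ZoneTransfer]\r\nZoneId=3\r\n"
print(parse_zone_identifier(sample))
```

On Windows with NTFS, the text would come from opening the downloaded file's path with ":Zone.Identifier" appended.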
To be more specific, IE (and most browsers on Windows, actually) use alternate streams to record that the file originates from the network, in a certain standardized way. When such a file is an executable file, and the user attempts to launch it (via Explorer; I don't think this happens for command line), they will get a confirmation dialog from the OS telling them that it's unsafe.
Other applications can perform similar checks on file formats that they handle, if the payload can be dangerous when untrusted. E.g. Visual Studio will give you a warning if you're trying to open a project file with this bit set.
Solaris unifies it too. You can even use the runat command to open a shell where extended attributes are exposed as, and can be manipulated as, regular files.
It's used by all the browsers on Windows these days. They all create a 'Zone.Identifier' stream when a file is downloaded to mark it as downloaded. Its contents are what triggers the 'You downloaded this file! It's Evil!' warning in Windows.
To be fair, it's not used by a ton of things, since it requires NTFS, disappears when files are moved to different filesystems, and various things that read and write files destroy them if they're not careful, not to mention actually enumerating the streams is tricky, last I checked.
some history: this was introduced with XP SP2 as part of the windows security push. it was a clever way to track the information without touching the binary data directly, and supporting it in IE meant the majority of customers saw the benefit right away. and since most people (on windows) don't move files across file systems, the marker usually stays with the file.
People on Windows often move files across file systems: from the internal hard drive (generally NTFS) to external USB drives (often FAT32 or exFAT).
I've not seen that warning before. I just tested on a file I downloaded, which had the 'Zone.Identifier' stream. Using Explorer, I copied it to a FAT32 volume, then back to my NTFS drive. Sure enough, it lost the 'Zone.Identifier' stream, and there was no warning when I opened it.
This is on a fairly normal Windows 10 installation. YMMV on different versions, of course.
I saw this warning on files copied from Mac OS X (when I transferred them further to a FAT32 filesystem). Maybe it depends on which stream it is. (Windows 8 Pro.)
iTunes for Windows uses them to store how much of a streaming file it has already downloaded. I wrote it (but I won't take credit for most things in iTunes for Windows)
It's a nifty feature but I'll admit NTFS is really obscure at times.
Great place to store metadata about a file, never thought about that before. I guess if the download stream is interrupted it reads that to know where to pick up again if resumed?
Another obscure feature of NTFS is Transactional NTFS which I'd never heard of until recently.
Windows even includes mechanisms to perform transactions over different things like file system, registry, and even multiple machines.
Back when SVN was horribly slow and implemented transactions by actually touching thousands of small files in the .svn directories, I actually wanted to implement its file system layer on Windows with NTFS transactions, figuring that a native solution would probably be better. But by now they completely changed their working copy format so I don't think it's necessary anymore.
Unfortunately, transactional NTFS is being deprecated. MSDN says:
"Microsoft strongly recommends developers utilize alternative means to achieve your application’s needs. Many scenarios that TxF was developed for can be achieved through simpler and more readily available techniques. Furthermore, TxF may not be available in future versions of Microsoft Windows."
Which is a shame, because, conceptually speaking, a true transactional filesystem with snapshot semantics makes some things so much easier.
The original idea on the Macintosh was to have some place to put non-code assets - icons, images, etc - that came with an application. So MacOS files had a "data fork" and a "resource fork". The "resource fork" was a tree structure managed by the Resource Manager.
The problem was that the original Macintosh had limited memory and only a floppy disk, and the implementation of writing to the resource fork wasn't very good. Many programs wrote to their own resource fork for preferences and such. The tree structure wasn't updated fully until the program was closed, because writing to the floppy was so slow. If the program exited abnormally, the resource fork's links were broken. This gave the resource fork approach a bad reputation.
Since Windows programs had to run on DOS, which didn't have resource forks, Windows never used this much. Windows put non-code assets in the executable as read-only objects.
NT, which was supposed to do everything (originally it had POSIX and OS/2 compatibility, and ran on MIPS, Alpha, and x86) added generalized support for resource forks, just in case. But since most applications were written for Windows 3.1/95/ME, they didn't use those facilities.
> The tree structure wasn't updated fully until the program was closed, because writing to the floppy was so slow
Not to mention in many cases on the original Macs, you probably didn't even have the program floppy in the drive when you were working, because with only 400K on a disk you had to swap to the disk with your document on it.
I recall Inside Macintosh had a big disclaimer at the top that warned "The Resource Manager IS NOT A DATABASE". It was originally just meant to handle localizable resources, but since it was already there it was handy for developers (including Apple themselves) to use to load any kind of structured data. And who didn't love going messing around in system and application files with ResEdit?
>It was originally just meant to handle localizable resources
Not quite. An application's executable code was also stored in the resource fork, as CODE resources (one or several, so parts of the code could be loaded and unloaded as needed; initially there was also a size limit of 64k per CODE resource).
When Apple switched to PPC, the PPC code was stored in the data fork and the 68k code in CODE resources.
I may have unconsciously filled in some blanks in my memory that weren't actually there - the story mentions Andy Hertzfeld used the Resource Manager to manage the swapping in and out of code segments and I think I read it as a hack to use the Resource Manager in a way it wasn't intended, but it may very well have been intended that way to begin with.
In the early 90's I worked at a company that made server software that allowed Mac AppleTalk (AFP) clients to connect to a PC network. Eventually IBM had us write a custom version for OS/2 called LAN Server for Macintosh. We were really excited about using the streams/resource forks feature but had to give up eventually. We used a separate database to store what's in the resource forks instead.
SQL Server used it from version 2005 through 2012 to create database snapshots in order to run DBCC CHECKDB (consistency check).
So it was actually used for a critical feature of MSSQL. I suppose this was the reason why ReFS was not supported for SQL data disks.
It seems they are not used anymore since SQL Server 2014.
We had a system that generated millions of images and needed to be sure that from one version to the next the images produced by a given request were the same, and also have some diagnostic data in case of problematic images. The images could be either JPG or PNG and we needed a unified way to associate arbitrary metadata with them.
We had a special mode that would store an equivalent of the request in an alternate data stream of the image. When a problem was detected we would open the alternate data stream and test the request manually.
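Not our actual code, but the idea can be sketched in a few lines of Python. The stream name "request", the JSON encoding and the helper names are all assumptions for illustration, and the "path:stream" syntax only addresses an alternate data stream on NTFS (elsewhere it is at best an ordinary file with a colon in its name).

```python
import json


def stream_path(path, stream):
    """Build the NTFS 'file:stream' name for a named stream."""
    return f"{path}:{stream}"


def save_request(image_path, request, stream="request"):
    """Attach the generating request to the image as a named stream."""
    with open(stream_path(image_path, stream), "w", encoding="utf-8") as f:
        json.dump(request, f)


def load_request(image_path, stream="request"):
    """Read the stored request back for manual replay."""
    with open(stream_path(image_path, stream), encoding="utf-8") as f:
        return json.load(f)
```

The nice property is that the image bytes themselves are never touched, so the same approach works for JPG and PNG alike.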
they should market this as a feature! alternate streams for people who think it is "an obscure feature". I mean, that many people using alternate streams would be interesting for anyone forensicating systems for malware or as protection from...
This is used in specific sectors like data loss prevention. For example, you can tag files based on their security sensitivity, and if the file is copied it retains the tags.
I've worked on Windows-only software that used resource forks. It stored mail messages, one per file, with the message metadata in a resource fork so we didn't have to modify the file containing the actual mail when the metadata changed.
There were once plans to store the individual streams which make up Microsoft Office files (OLE2) as alternate data streams, which would have been... interesting.
> Browsers also use them to flag files downloaded from the internet.
Is that where that annoying shit comes from? Good to know. When Firefox kills off DownThemAll I will then use a FAT partition to store downloaded files (and see if I can force the temporary files to go there too).
Unless it has changed in newer Windows versions, you can simply disable that warning in the Internet Settings, no need to keep files in an outdated filesystem.
Except that millions of developers routinely make use of file permissions; as evidenced by this discussion, many - perhaps even a majority - haven't heard of alternate data streams.
I really like NTFS as a file system... seems to offer a lot more than many other file systems, and pretty interestingly so for as old as it is now. That said, hopefully broader adoption can happen when the patents expire (ugh, in 7 years). Maybe the "new" MS could be convinced to create a royalty-free spec release/promise.
Would love for NTFS to become the default for external storage. I already use it, but getting it on macOS and Linux isn't always as straightforward as it could be. NTFS-3G ftw.
[1] https://en.wikipedia.org/wiki/NTFS#Alternate_data_streams_.2...