Before people panic and again try to claim that HTTPS does not help here, note that the leak isn't in HTTPS itself per se: it's in DASH and VBR encodings. Segment sizes are predictable and effectively unique per video, and higher variation in bitrate leaks a more unique fingerprint; Netflix happens to support high variation in bitrates. HTTPS still guarantees integrity and confidentiality.
Stepping back a bit, although this paper is definitely valuable, it isn't that startling, because we already know that encrypted communications are vulnerable to passive attacks when the contents are predictable. It's a good reminder that "vanilla" encryption isn't necessarily the best way to protect privacy when the attacker can simply guess what we're transmitting because the search space is so small; in this case, it's easy to compare the length of what is being transmitted against a corpus -- and bam. There's only ~42k entries...
Entropy entropy entropy. It is your friend. Just so happens that VBR and DASH weren't designed to increase entropy when transmitting segments.
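A toy sketch of the corpus-matching idea described above; all titles and sizes here are invented, and a real attack would use much longer sequences of segment sizes, but the principle is just nearest-neighbor matching against a fingerprint database:

```python
# Toy sketch: match an observed sequence of VBR segment sizes (bytes, as seen
# on the wire) against a corpus of known fingerprints. Titles/sizes invented.
corpus = {
    "title_a": [310_000, 520_000, 180_000, 440_000],
    "title_b": [305_000, 515_000, 175_000, 450_000],
    "title_c": [120_000, 120_000, 121_000, 119_000],  # low variation: less distinctive
}

def best_match(observed, corpus):
    """Return the title whose fingerprint has the smallest total absolute error."""
    def distance(fp):
        return sum(abs(a - b) for a, b in zip(observed, fp))
    return min(corpus, key=lambda title: distance(corpus[title]))

observed = [309_000, 521_000, 181_000, 441_000]  # sniffed from encrypted traffic
print(best_match(observed, corpus))  # → title_a
```

Note how `title_c`, with little bitrate variation, would be much harder to distinguish from other flat-bitrate titles, which is the entropy point above.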
Re: Entropy: Note that just adding random padding to packets doesn't actually protect you from this kind of analysis. You'd want a constant bit-rate ("CBR") encoding instead. Even with CBR, the exact length of the video might give away the contents too.
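A minimal illustration of why small random padding fails, using made-up segment sizes: the padding jitters each value, but the overall shape of the size sequence survives, so matching still works.

```python
import random

# Sketch: per-segment random padding shifts sizes but preserves the
# ordering/shape of a VBR size sequence. Sizes (KB) are invented.
random.seed(0)
true_sizes = [300, 900, 150, 1200, 400]                   # hypothetical segment sizes
padded = [s + random.randint(0, 50) for s in true_sizes]  # small random padding

# The rank order of segments is unchanged (padding < gaps between sizes),
# so the padded sequence still correlates with the true fingerprint.
rank = lambda xs: sorted(range(len(xs)), key=lambda i: xs[i])
print(rank(true_sizes) == rank(padded))  # → True
```

Padding only helps once it is large enough to swamp the differences between segments, at which point you have effectively paid the bandwidth cost of CBR anyway.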
From a bandwidth perspective, such CBR encodings are either wasteful or low-quality for high-motion scenes, or both. So it makes sense that Netflix chose a VBR system, but it does come with this privacy caveat.
I just want to point out that Signal, the encrypted messaging/voice-chat app, uses CBR for exactly that reason: with VBR, it would be possible to reconstruct the words spoken from the size metadata the encoding leaks.
Of course, that means losing some bandwidth, so it is not the most bandwidth-efficient app.
> Even with CBR, the exact length of the video might give away the contents too.
To avoid that issue you artificially make multiple movies have the same length, i.e. pad the runtime up to the next round interval, e.g. 10 minutes (so a 1:48:23 movie becomes 1:50:00). To do so, the player keeps streaming in the background for the remaining time (some random audiovisual noise).
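That rounding is just ceiling division; a small sketch (the 10-minute block size is the example's own choice):

```python
import math

def pad_runtime(seconds, block=600):
    """Round a runtime up to the next multiple of `block` seconds (10 min)."""
    return math.ceil(seconds / block) * block

runtime = 1 * 3600 + 48 * 60 + 23   # 1:48:23 → 6503 seconds
padded = pad_runtime(runtime)       # → 6600 seconds
print(padded // 3600, (padded % 3600) // 60)  # → 1 50  (i.e. 1:50:00)
```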
Maybe Netflix should have a "fully safe" mode that uses CBR instead of VBR, so the user knows the trade-off (slower/heavier buffering).
It wouldn't work; that sounds like adding sleep calls to prevent timing attacks, which doesn't work...
The random audiovisual noise would itself have different bitrates (white noise is incompressible and thus weighs more; pure black compresses too well and weighs less, etc.).
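The compressibility point is easy to demonstrate with zlib (the buffer sizes here are arbitrary):

```python
import random
import zlib

random.seed(1)
noise = bytes(random.randrange(256) for _ in range(10_000))  # "white noise"
black = bytes(10_000)                                        # all-zero "pure black"

# Random bytes barely compress; a constant buffer collapses to a few bytes.
print(len(zlib.compress(noise)) > len(zlib.compress(black)))  # → True
```

So even the padding traffic would carry a content-dependent size signature unless it were itself rate-shaped.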
So I'm somewhat ignorant of how a lot of TLS works, but wouldn't this have been solved if all packets under TLS were forced to be the same size? That isn't part of the standard, as I understand it, but wouldn't it essentially prevent this type of snooping?
The fingerprint is derived from the aggregate size of all packets that comprise a segment of video (what the researchers call an ADU), not from individual packet sizes, so forcing a fixed packet size wouldn't hide it.
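To see why per-record padding doesn't help, here is a hedged sketch (the 1400-byte fixed record payload is an assumption, not a TLS constant): with fixed-size records, the record *count* alone pins down the aggregate segment size to within one record.

```python
RECORD = 1400  # hypothetical fixed record payload, bytes

def records_needed(segment_bytes):
    """Number of fixed-size records needed to carry a segment (ceiling division)."""
    return -(-segment_bytes // RECORD)

# Two hypothetical segment sizes still produce clearly distinct record counts.
print(records_needed(310_000), records_needed(520_000))  # → 222 372
```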
Some possible counter-measures are discussed in the conclusion:
"To that end, we believe that Netflix could defend against passive traffic analysis by ensuring that the byte-range portion of the HTTP GETs sent by the browser do not perfectly align with individual video segment boundaries. For instance, the browser could average the size of several consecutive segments and send HTTP GETs for this average size. As an alternative approach, the browser could randomly combine consecutive segments and send HTTP GETs for the combined video data. Designing obfuscation techniques for VBR DASH streams that do not degrade video quality remains a potential area for future research."
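The second suggestion (randomly combining consecutive segments) could be sketched as follows; the segment sizes and the grouping bound are invented, and a real client would also have to manage buffering:

```python
import random

# Sketch: instead of one HTTP GET per segment, randomly coalesce consecutive
# segments so observed request sizes no longer align with per-segment
# boundaries. Sizes (KB) are invented.
def coalesce(segment_sizes, rng, max_group=3):
    requests = []
    i = 0
    while i < len(segment_sizes):
        k = rng.randint(1, max_group)          # group 1-3 segments per GET
        requests.append(sum(segment_sizes[i:i + k]))
        i += k
    return requests

rng = random.Random(42)
sizes = [310, 520, 180, 440, 290, 610]         # hypothetical segment sizes
print(coalesce(sizes, rng))
```

The total bytes transferred are unchanged; only the boundaries visible to an eavesdropper move, which is why it doesn't degrade video quality.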
That might be true, but the number of transmitted packets per unit of time betrays the bitrate of the video being watched. When an attacker is monitoring a transmission coming from a Netflix CDN, the nature of the traffic is a given.
The scraping and automated viewing in question pretty clearly violate Netflix's terms of use. As junior officers in the U.S. Army, the authors are more vulnerable than most to trivial but "correct" accusations of illegal activity, so I wonder if they were at all concerned about the government's sweeping interpretation of the Computer Fraud and Abuse Act.
> In order to generate these fingerprints, we first mapped every available video on Netflix. We took advantage of Netflix’s search feature to do this mapping by conducting iterative search queries to enumerate all of Netflix’s videos. This enumeration was done by visiting https://www.netflix.com/search/<value> where <value> was ‘a’, then ‘b’, etc. and then parsing the returned HTML into a list of videos with matching URLs.
This is not the same as but still in the same class of "unauthorized" use that Weev was charged with carrying out on AT&T endpoints. No privacy concern here, and in theory you are authorized to view this Netflix content but not to "use any robot, spider, scraper or other automated means to access the Netflix service; decompile, reverse engineer or disassemble any software or other products or processes accessible through the Netflix service; insert any code or product or manipulate the content of the Netflix service in any way; or use any data mining, data gathering or extraction method." Though Weev's conviction was vacated on appeal, that was only based on a venue problem so the prosecution's legal theory about violating terms of use still seems to be in play.
Not concern trolling here; I do this sort of scraping all the time and there's no reason to believe the authors are at any risk. It's just an interesting juxtaposition that illustrates how overly broad the DOJ's interpretation of the CFAA is, and how selectively it can be pursued. As the EFF notes, one of the major impacts is that it puts security researchers in a legal gray area (https://www.eff.org/issues/cfaa).
Very interesting that they can get a video fingerprint without even downloading the video. So they can fingerprint 44k in 4 days (7 seconds each) instead of downloading each video which would be very demanding. I wonder if Netflix had any monitoring that noticed them initiating a stream of every single video. I wonder if they used multiple Netflix accounts.
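A back-of-the-envelope check on that rate:

```python
# Sanity check of the throughput claim: ~44k fingerprints in 4 days.
videos = 44_000
seconds = 4 * 24 * 3600           # 345,600 seconds in 4 days
print(round(seconds / videos, 1))  # → 7.9 seconds per video
```

So roughly 8 seconds per title if run serially, which is consistent with fingerprinting without downloading full videos.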
They mention they used Silverlight. I wonder if this also works for videos when viewed with HTML5, and if the same fingerprints can be used.
I'm curious why you're very curious why I'm curious.
I would think Netflix would be protective of their content and would likely have monitoring to detect mass downloading. The adversarial nature of one person trying to do something and other people trying to detect and stop them is interesting to me. I find JSTOR's account of their monitoring, detection, and attempted blocking of Aaron Swartz's downloading of academic papers (not just metadata, as in this post), and the cat-and-mouse game that followed, to be very interesting: https://docs.jstor.org/summary.html
And the perspective from the other side: I wonder whether the authors of this paper were concerned about being detected by Netflix and possibly blocked, or even banned from Netflix for life, and whether they took action to avoid that, such as using multiple accounts or VPNs.
How is Netflix going to ban someone for life? Get a new email address, a new IP, a new credit card, and you're a new person. You may have lost your ratings, but maybe a fresh start is nice from time to time.
Possible but not that simple since a new CC is also tied to your identity and billing address.
Paypal does do checks to ensure that blocked accounts are not easily resurrected.
Netflix has less incentive to perform such a costly operation but it's more than possible, this is what every credit and background check agency can do.
Sure, if you want to get a completely new identity, credit history and address you can probably fool most of these, but you'll be violating a few laws in the process, and it would probably be cheaper to purchase the entire Netflix library on DVD/BR at that point.
> Possible but not that simple since a new CC is also tied to your identity and billing address.
Billing address, sort of, but address verification usually checks only the numbers, not the street name. Very few credit card systems pass the name on to the bank when requesting authorization. If you're only using streaming, it doesn't really matter if the street address isn't correct.
Credit and background checks usually request a lot more information than Netflix does; nobody would give Netflix their social security number or recent addresses.
It was MIT's network, not a government network. The network was not "secure"; campus guests can use it, though leaving a computer connected 24/7 in a wiring closet (which could have been locked but wasn't) clearly exceeds campus policy. The JSTOR link makes it pretty clear that his activity, including the part involving the hidden computer, came before federal involvement. Given the language of the Computer Fraud and Abuse Act, I think the outcome could have been the same for an MIT student or faculty member; I doubt it would have mattered to the federal attorney. JSTOR's and MIT's hands were tied in the sense that they were not party to the aggressive prosecution of Swartz; the federal attorney had a lot of discretion in how to handle the case and chose to be very aggressive.
The result of this is generalizable. Looking at your encrypted HTTPS traffic, people can still tell what you are browsing and downloading, especially when they have a good idea of what you could be browsing or downloading.

That said, I am not sure how many people should be afraid of others learning what they are watching on Netflix.
Yep, people can infer a lot. I did a demo of this a couple of years ago for my employer at the time, creating a tool which, in a slightly contrived scenario, was able to figure out what someone was looking at on Google Maps over SSL.