We’re Back
So, I am finally back at it after several weeks away from writing. My absence has been partly due to work and family obligations, but mainly due to the major announcement Google made on July 22 that it is deferring the deprecation of third-party advertising cookies in Chrome and instead implementing a consumer-choice mechanic.
Despite this change, Google said in multiple forums which I attended that the work on the Privacy Sandbox would continue unabated, but given such a major change I wanted to wait and let the dust settle before I jumped back into the fray. For the moment, things look to be stable without any further major shifts in the offing. So I will pick up where I left off.
Continuing from the prior post on headers brings us to browser fingerprinting and some new browser header elements designed to reduce the ability of companies to fingerprint a user agent. These elements come under the heading of Client Hints Infrastructure, and a subset known as User Agent Client Hints.
In order to talk about Client Hints, I first need to introduce the concept of fingerprinting - what it is and how it works. Then we'll discuss guidance from the W3C on a framework to reduce the ability to fingerprint. This also involves introducing some basic concepts of differential privacy. At that point, we'll discuss the Client Hints mechanism and how it attempts to accomplish the goals laid out in the W3C's framework. I will cover the first three of these items in this post. In the next post I will explore how technologists have worked to reduce the ability to fingerprint using multiple methods, including Client Hints Infrastructure and User Agent Client Hints.
What Is Fingerprinting?
Fingerprinting is a set of techniques for identifying a user agent from characteristics of the browser or the device on which it runs. Some of these techniques are deterministic - for example, reading the user agent header - but many are derived using statistical learning. I am particularly familiar with fingerprinting as I built algorithms to do this work in 2012 in my first role in ad tech. At that time fingerprinting was fairly new. Peter Eckersley of the Electronic Frontier Foundation had published one of the earliest papers on a variant known as browser fingerprinting, “How Unique is Your Web Browser”, in 2010. In that paper, Eckersley found that five characteristics of browsers - browser plugins, system fonts, the User-Agent string (UA), HTTP Accept headers, and screen dimensions - allowed his team to identify a browser uniquely ~84% of the time. Note that this didn’t even take IP address into account.
At the same time, Eckersley built a web-based tool called Panopticlick to test browser uniqueness. That tool still exists today at www.coveryourtracks.eff.org. A separate tool, called AmIUnique, is also available. To give you a sense of how powerful browser fingerprinting is today, I put my Chrome browser (in which I am currently writing this) through AmIUnique, as its report is a bit easier to comprehend. Even though I have multiple layers of protection from online tracking, AmIUnique could uniquely identify my browser (a partial printout is shown in Figure 1; the full analysis appears as an appendix at the end of this article). In fact, it could use my browser protection elements, such as my do-not-track setting or my Ghostery plugin, as part of the fingerprint.
Figure 1 - Partial Printout for My Browser from AmIUnique.org
Since Eckersley published his research, there has been a large body of further work identifying and testing browser/device features to determine the most impactful. One especially robust study, which tracked 2,315 participants on a weekly basis for 3 years, examined over 300 browser and device features. However, most fingerprinting techniques rely on somewhere between 10 and 20 features. These are shown in the top half of the table in Figure 2.
Figure 2 - Main Categories of Browser and Device Features Used for Browser Fingerprinting
Mobile devices have other features that can be fingerprinted. These include compass, accelerometer, gyroscope, and barometer readouts. I won’t cover these in any detail here as right now they are tertiary signals; only one or two companies actually use them in any way to fingerprint mobile devices. But I mention them for completeness and to call out the fact that mobile fingerprinting uses slightly different methods to accomplish device (vs. browser) fingerprinting.
Some of these features are easily available in the contents of web requests. An example is the user agent header. Using just these features for creating a fingerprint is called passive fingerprinting. However, most fingerprinting is active, which means it depends on JavaScript or other code running in the local user agent to observe additional characteristics.
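To make the active variety concrete, here is a minimal sketch (my own illustration, not code from any particular vendor) of the kind of script a fingerprinting library might run in a page. It reads a handful of widely available browser properties and joins them into a single string; real libraries gather far more features and hash them into a compact identifier.

```ts
// Minimal sketch of *active* fingerprinting: script running in the page reads
// browser/device properties and combines them into a single identifier.
// The particular features chosen here are illustrative, not exhaustive.
function collectFingerprintFeatures(): string[] {
  return [
    navigator.userAgent,                                      // full UA string
    navigator.language,                                       // preferred language
    String(navigator.hardwareConcurrency),                    // logical CPU cores
    `${screen.width}x${screen.height}x${screen.colorDepth}`,  // display geometry
    Intl.DateTimeFormat().resolvedOptions().timeZone,         // time zone
    String(new Date().getTimezoneOffset()),                   // UTC offset in minutes
  ];
}

// Real libraries hash the joined features (and dozens more) into a short ID;
// joining them with a separator is enough to show the idea.
const fingerprint = collectFingerprintFeatures().join("||");
console.log(fingerprint);
```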
There is a third form of fingerprinting - called cookie-like fingerprinting - which covers techniques that circumvent the end user’s attempts to clear cookies.
Evercookie, invented by Samy Kamkar in 2010, is an example of this. Evercookie is a JavaScript application programming interface (API) that identifies and reproduces intentionally deleted cookies in the client’s browser storage. Evercookie effectively hides duplicate copies of cookies and critical identifying information in storage locations in the browser - such as IndexedDB or web history storage - so that when the user returns, that information can be queried and the cookie recreated, even if cookies have been deleted.
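Below is a stripped-down sketch of the respawning idea, assuming just two stores: a regular cookie and localStorage. On every page load it recovers the identifier from whichever copy survived and rewrites both. Evercookie itself spreads copies across many more locations (IndexedDB, web history, caches, and so on); the storage key and function names here are my own placeholders.

```ts
// Sketch of cookie "respawning": keep redundant copies of an identifier and
// restore whichever copy the user has cleared. Evercookie uses many more stores;
// a cookie plus localStorage is enough to show the trick.
const KEY = "persistent_id"; // illustrative name

function readCookie(name: string): string | null {
  const match = document.cookie.match(new RegExp(`(?:^|; )${name}=([^;]*)`));
  return match ? decodeURIComponent(match[1]) : null;
}

function writeCookie(name: string, value: string): void {
  document.cookie = `${name}=${encodeURIComponent(value)}; max-age=31536000; path=/`;
}

function respawnId(): string {
  // Recover the identifier from whichever copy survived...
  let id = readCookie(KEY) ?? localStorage.getItem(KEY);
  if (!id) {
    id = crypto.randomUUID(); // first visit: mint a new identifier
  }
  // ...then rewrite every copy so clearing one store is not enough.
  writeCookie(KEY, id);
  localStorage.setItem(KEY, id);
  return id;
}

respawnId();
```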
Why the W3C Cares About Fingerprinting
While I am focused on explaining Chrome’s approach to privacy in the Privacy Sandbox, browser fingerprinting is a broad issue that all browser manufacturers care about. The World Wide Web Consortium (known by its shorthand name, the “W3C”) has published a document entitled “Mitigating Browser Fingerprinting in Web Specifications” that provides guidance to the various working groups developing web specifications. The point of the guidance is to ensure that each working group considers the fingerprinting “surface” its specification creates and works to minimize it.
The W3C leadership has been concerned about fingerprinting for quite some time. But it has become especially concerned as cookies and other obvious forms of cross-site tracking are deprecated, because statistical methods of fingerprinting will become the de facto workaround as those mechanisms are restricted. It doesn’t pay to close the front door when the back door is wide open. So browser manufacturers, including Google, are enhancing the privacy features of their browsers to reduce the ability to fingerprint even as they remove obvious cross-site tracking mechanisms like cookies.
Which brings us to the new Client Hints and User Agent Client Hints APIs as one technology to reduce the ability to fingerprint a browser. As part of this discussion, we are going to have to delve into the topic of entropy, which comes from the information theory developed by Claude Shannon in 1948. This will serve as an introduction to a very mathematical topic that will become exceedingly critical later in our discussions of privacy budgets and the Attribution Reporting API. But for now a high-level summary will suffice.
What Are Client Hints and User Agent Client Hints APIs?
Client Hints Infrastructure is a specification that identifies a series of browser and device features and allows access to the information about them to be controlled by the user agent in a privacy-preserving manner. It uses several techniques to accomplish this:
- It allows each browser manufacturer to establish a “baseline” set of user agent features that can be easily available for any website to request for the purposes of serving content.
- It also identifies a set of “critical” features that a website can request in order to serve a web page correctly. These features are not easily available because they provide a large amount of information value - known as entropy - that can be used to fingerprint a user agent. Examples of this are the exact operating system version on the device and the physical device model.
- It provides for the ability of the browser manufacturer to give some control of these settings to the end user in a consumer-friendly fashion.
- It establishes a structured mechanic for content negotiation of these elements between the user agent and a web server.
- It allows for information sharing only between the user agent and the primary web server (the top-level site). Third parties whose content is on a web page cannot gain access to this information without express permission from the primary website.
- Any opt-ins and data stored for features subject to client hints control must be deleted whenever the user deletes their cookies or the session ends.
There are several types of client hints, each of which is handled differently:
- UA client hints contain information about the user agent that might once have been found in the user-agent header. There is a separate specification for these features, appropriately named the User Agent Client Hints Specification, which extends Client Hints to provide a way of exposing browser and platform information via User-Agent response and request headers, and a JavaScript API (a sketch of the JavaScript API follows this list).
- Device client hints contain dynamic information about the configuration of the device on which the browser is running.
- Network client hints contain dynamic information about the browser's network connection.
- User Preference Media Features client hints contain information about the user agent's preferences as represented in CSS media features.
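To give a feel for the JavaScript API side of UA client hints mentioned above, here is a small sketch using navigator.userAgentData as it exists in Chromium-based browsers. The low-entropy values (brands, mobile, platform) are readable synchronously, while higher-entropy values must be requested explicitly via getHighEntropyValues() - exactly the kind of gated access the specification calls for. The cast is only there because the API is not yet in the default TypeScript DOM typings.

```ts
// Sketch of the User Agent Client Hints JavaScript API (Chromium-based browsers).
// Low-entropy values are freely readable; high-entropy values must be requested.
const uaData = (navigator as any).userAgentData;

if (uaData) {
  // Low-entropy, always available: brand list, mobile flag, platform name.
  console.log(uaData.brands, uaData.mobile, uaData.platform);

  // High-entropy values are gated behind an explicit, asynchronous request.
  uaData
    .getHighEntropyValues(["platformVersion", "model", "fullVersionList"])
    .then((values: Record<string, unknown>) => {
      console.log(values); // e.g. { platformVersion: "...", model: "...", ... }
    });
} else {
  console.log("User Agent Client Hints API not available in this browser.");
}
```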
As we will discuss later, these hints are requested using a new response header called Accept-CH, and each data element communicated in a request/response interaction is carried in a header whose name begins with Sec-CH - for the user agent hints, Sec-CH-UA (I assume the abbreviation is short for “secure client hints user agent”).
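As a preview of that negotiation, the exchange might look roughly like this (the header names are real; the values are made up for illustration). First, the server opts in to two high-entropy UA hints via a response header:

```http
HTTP/1.1 200 OK
Accept-CH: Sec-CH-UA-Platform-Version, Sec-CH-UA-Model
```

On subsequent requests, the browser sends the low-entropy UA hints by default and, if it honors the opt-in, the requested higher-entropy ones as well:

```http
GET /page HTTP/1.1
Host: example.com
Sec-CH-UA: "Chromium";v="127", "Google Chrome";v="127", "Not)A;Brand";v="99"
Sec-CH-UA-Mobile: ?1
Sec-CH-UA-Platform: "Android"
Sec-CH-UA-Platform-Version: "13.0.0"
Sec-CH-UA-Model: "Pixel 7"
```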
To get to that point, we need to go step-by-step through three topics. First we will take a walk down memory lane and review the history of browsers and the information they share. Next we will begin the discussion of entropy. Then we will go through and show some of the simple things that browser manufacturers did even before the Client Hints Specification to limit fingerprinting from the user agent header.
The History of the User Agent Header
User agent strings date back to the beginning of the Worldwide Web. Mosaic was the first truly widely-adopted browser. It was released in 1993 and it had a very simple user-agent string: NCSA_Mosaic/1.0, which consisted of the product name and its version number.
The original purpose of the user-agent string was to allow for analytics and debugging of issues within the browser implementation. At that time, the W3C recommended it be included in all HTTP requests. Thus, openly including the user-agent header became the normal practice.
But as the web evolved, so did the user agent string. Browsers, the devices they ran on, and the operating systems they supported multiplied. Many major and minor versions of all three platforms (browser, device, OS) were in use at the same time. The combinations became extensive, and it became difficult for web developers to have their code run correctly across the various combinations. So the user-agent header accumulated more information, letting the web server know what combination it was serving and adapt its code to ensure a web page rendered properly on that combination of platforms. Before Client Hints and User Agent Client Hints, a user agent header looked something like this (I will show what it looks like after User Agent Client Hints in the next post):
Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.0.0 Mobile Safari/537.36
As you can see, it contains a lot of information passed in the clear and available automatically to any website the user agent contacts. Mind you, it looks like a lot of gobbledygook, and how it became this way is a story in and of itself (for those interested, a very humorous take on the evolution can be found in Aaron Andersen’s post "History of the browser user-agent string"). But the key point is that the user agent header evolved to create a better user experience. No one was really thinking about the privacy implications at the time, so no one thought twice about openly sharing that information.
But then came commercialization and advertising, which has consistently followed every new medium since the mid-1800s like bees to nectar. The unique part of this new medium was that its effectiveness could be measured in detail. Slowly but surely, advertisers and publishers got more sophisticated at knowing exactly who they were advertising to, in order to maximize newly measurable metrics like conversion rate and return on ad spend. They discovered that the very public information in the user-agent header, when combined with other signals, allowed them to easily identify a specific viewer.
These techniques, of which fingerprinting was only one, created significant privacy concerns among regulators and consumers. Consumers especially did not like that they kept seeing the same ads over and over on every site they visited, which occurred before good frequency-capping tools existed. They felt stalked and surveilled, which ultimately resulted in privacy regulations like GDPR.
Equally important, the values of the platform owners in the industry, especially Apple and Mozilla (whose lineage traces back through Netscape to Mosaic), began to change. After all, their executives were consumers too, and experienced the same tracking. Plus they had to worry about regulators imposing increasingly restrictive rules and penalties for failing to follow them. Like any new behavior, at first this change was driven by those mandates, but ultimately it became a reflex, and now almost a religion. And where one browser developer went, others followed, thanks to the standards-based approach to web technologies that runs through the W3C.
The annual W3C meetings are a place where the key technical owners of browsers (and, in the case of Apple and Google, operating systems and devices) come together and share ideas. These are some of the brightest and most opinionated minds on the planet, and the discussions between them can be wide-ranging, brilliantly insightful, and intense. It was in these meetings, and in very specific working groups, that the privacy-first mantra first emerged and then became the undisputed correct approach. Apple started it with the creation of its Identifier for Advertisers (IDFA), which was the first control with a mandatory opt-out default. Ultimately, that viewpoint came to be accepted across the board. Since then, a huge amount of work has been done across multiple working groups to ensure that the consumer has a privacy-first experience of the web. Much of the technology I discuss on theprivacysandbox.com emerged from this work.
And while cookie deprecation and visible user controls for opting out of cookies in Safari and Firefox were some of the earliest (and easiest) results of this work, masking information in the user-agent header wasn’t far behind because its very public sharing of identifying information was an obvious privacy vulnerability.
To do this, the industry turned to something called information theory and its notion of entropy.
Introduction to Information Theory
Information theory was a completely new field of endeavor created almost out of whole cloth by Claude Shannon while he was working at Bell Labs in 1948. Shannon had the insight that you could measure the amount of information in any communication. Today we take the concept of “signal-to-noise ratio” - an indication of the quantity of information in a transmission - for granted. But in 1948 the idea that you could measure information was unheard of.
The intuition behind quantifying information is that unlikely events, which are “surprising”, contain more information than high-probability events, which are not. Rare events are more uncertain and thus require more bits to represent than common events. Alternately - and this is what matters for us in the privacy domain - observing a rare event tells you more than observing a common one.
Let’s take an example that impacts the user-agent header and which was actually implemented. This is the current breakdown of Windows OS versions in the market (Figure 3):
Figure 3 - Market Share of Windows Versions as of September 2024 (Source: Wikipedia)
So if there are 250 million Windows PCs that access the Internet today and if the user agent says that I am dealing with a Windows 11 device, I know that I am dealing with a 1 in 84 million chance of identifying an individual user agent. Not great for targeting an ad. But if I see a machine running Windows XP, that gives me a 1 in 850,000 chance of identifying an individual user agent. That is a less likely event, and as such has much higher information content.
But now let’s look at the percentage of Windows 11 minor releases (figure 4):
Figure 4 - Windows 11 Versions (as a percentage of all Windows 11 Machines)
If I see a Windows machine with a 24H2 minor release, then my ability to identify an individual user agent is 1 in 1 million. That is much better than just knowing the major version, and contains more information, but still less than the “surprise” I get finding that there is still a Windows XP machine out there.
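To attach numbers to these examples, information content (“surprisal”) is standardly measured as -log2 of the probability. Here is a quick sketch of my own using the illustrative counts from the examples above (250 million Windows PCs, roughly 84 million on Windows 11, about 1 million on the 24H2 release, and about 850,000 still on XP):

```ts
// Self-information ("surprisal") of observing a value that occurs with
// probability p, measured in bits: rarer observations carry more bits.
const bits = (p: number): number => -Math.log2(p);

const totalWindowsPCs = 250_000_000; // illustrative figure from the examples above

console.log(bits(84_000_000 / totalWindowsPCs)); // Windows 11      -> ~1.6 bits
console.log(bits(1_000_000 / totalWindowsPCs));  // Windows 11 24H2 -> ~8.0 bits
console.log(bits(850_000 / totalWindowsPCs));    // Windows XP      -> ~8.2 bits
```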
I will not go into the full mathematical derivation here, but it is important to understand for purposes of this discussion that the amount of information falls off in a non-linear fashion as probability rises (Figure 5).
Figure 5 - The Probability vs. Information Curve
What the chart shows is that surprise falls off faster than linearly as you move from low-probability to high-probability events. In other words, low-probability events carry disproportionately more information per “unit of increase in likelihood” than a straight line would suggest. Put another way, removing a low-likelihood predictor from a fingerprinting equation removes far more information than removing a common one. As you will see, this is why we care so much about low-probability versus high-probability events as we attempt to limit the information exposed via Client Hints Infrastructure.
Here is the second important point, and it gets to the definition of what is called entropy. Note that the chart represents the tradeoff between probability and information for a single variable. But the user-agent header contains eight pieces of information (variables) in our example that allow for identification of a specific user agent. We need to know how much total information is contained in this complete set of features. We can then identify which features are high information versus low information and alter the high-information ones, since doing so will have the most impact on the identifiability of a specific user agent.
This is where the concept of entropy comes into play. Let’s say we have a specific user agent, X, that we want to identify. The way we do that is to look at all the available elements in the user agent, the values they take (e.g. Android, Version 14.5, Chrome, Release 127.1.1.5, mobile device, manufacturer = Google, model = Pixel, model version 7), and how much information each observed value contributes toward an exact match to device X. If p(y) represents “the probability that feature y has the observed value”, then each observed feature contributes -log2 p(y) bits of information, and - treating the features as independent for simplicity - the total information f(x) revealed by the header is the sum:
f(x) = -log2 p(OS) - log2 p(OS version) - log2 p(Browser) - log2 p(Browser version) - log2 p(device type) - log2 p(manufacturer) - log2 p(model) - log2 p(model version)
The units of f(x) are bits of information: each -log2 p(y) term contributes some number of bits to the total. Note also that f(x) varies from user agent to user agent, because the observed values (and hence their probabilities) differ. The more bits of information in the observed combination, the higher the likelihood that we can say f(x) pins down the specific user agent X.
The expected number of bits in f(x), averaged over all user agents, is known as the Shannon entropy.
The Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits needed on average to encode an outcome drawn from a distribution P.
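In standard textbook notation (my addition; the post states this only in words), the self-information of a single outcome and the entropy of the distribution it is drawn from are:

```latex
% Self-information (surprisal) of an outcome x with probability p(x)
I(x) = -\log_2 p(x)

% Shannon entropy of a distribution P: the expected surprisal, i.e. the average
% number of bits needed to encode an outcome drawn from P
H(P) = \mathbb{E}_{x \sim P}[I(x)] = -\sum_x p(x)\,\log_2 p(x)
```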
The intuition for entropy is that it is the average number of bits required to represent or transmit an event (e.g. identifying a specific user agent) drawn from the underlying probability distribution over feature values for the random variable X.
If a combination of p(y)’s yields an f(x) of 30 bits, that is enough to single out one user agent in a population of roughly a billion (2^30 ≈ 1.07 billion). Our job as privacy experts is to alter or remove the high-information elements so that as little of that information as possible is available for making the identification. The fewer bits allowed in the actual calculation relative to those 30, the lower the granularity of the match: at 20 bits the same observation narrows things down only to a set of about a thousand devices, and at 10 bits to a set of about a million devices (each bit removed doubles the size of the crowd a user agent blends into).
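Putting the pieces together, here is a small sketch that sums the per-feature surprisals (again assuming, for simplicity, that the features are independent) and converts the total into the size of the resulting “anonymity set” - the number of devices in a population that share the observed combination. The feature probabilities are made-up placeholders, not measured values.

```ts
// Total information revealed by a set of observed features, assuming
// independence: sum the surprisal (-log2 p) of each observed value.
const surprisal = (p: number): number => -Math.log2(p);

// Made-up probabilities for the observed value of each user-agent feature.
const observedFeatureProbabilities: Record<string, number> = {
  os: 0.3,             // e.g. p(Android)
  osVersion: 0.1,
  browser: 0.6,        // e.g. p(Chrome)
  browserVersion: 0.2,
  deviceType: 0.5,     // e.g. p(mobile)
  manufacturer: 0.05,
  model: 0.01,
  modelVersion: 0.2,
};

const totalBits = Object.values(observedFeatureProbabilities)
  .map(surprisal)
  .reduce((sum, b) => sum + b, 0);

// How many devices in a population would share this exact combination?
const population = 1_000_000_000; // ~1 billion devices, illustrative
const anonymitySet = Math.max(1, population / 2 ** totalBits);

console.log(`${totalBits.toFixed(1)} bits`);                         // more bits = more identifying
console.log(`~${Math.round(anonymitySet)} devices share this fingerprint`);
```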
We will stop here for today as I just poured a huge number of bits of information (ok, I’m not above a bad pun) into your brain. We’ll pick up next on how privacy experts have gone about using these concepts to ensure the privacy of user agents.
Appendix: Full AmIUnique Printout for My Browser
Shown below is the full printout of the AmIUnique analysis shown partially in Figure 1. This should give you a good sense of just how much information is available to fingerprint your device.