Web Storage After the Privacy Sandbox

June 19, 2024
Chapter 2: Browser Elements

This will be our last post on browser-side storage, thankfully.  Thankfully because we can now move on to the core reason I began writing this blog in the first place - understanding the details of the Topics API, Protected Audiences API, and the Attribution Reporting API, along with their companion APIs like the Private Aggregation API. But before we get there, we have to cover three topics:

  • Topics API Model (and Audience) Storage
  • Interest Group Storage
  • The Shared Storage API

The first two sections will be relatively brief as there isn’t that much to say.  So, most of this post will focus on the Shared Storage API.

Interest Group Storage

As we have discussed before (here and here), interest groups are the audiences defined by the Protected Audiences API specification. They are often categorized as behavioral audiences to distinguish them from Topics API audiences, which are similar to, but not exactly the same as, contextual audiences. However, interest groups can be more than behavioral. An interest group uploaded to a specific browser or mobile device by a publisher using the Protected Audiences API can be of any type: demographic, psychographic, or taste-based, as well as behavioral.

Interest groups are loaded into an individual user agent using the navigator.joinAdInterestGroup() call. They are stored in a SQLite file called InterestGroups on your hard drive (on Windows, the file can be found in C:\Users\arthu\AppData\Local\Google\Chrome\User Data\Default). It is possible to use a SQLite editor - as discussed here - to see the history of interest group activity on a given user agent. Interest groups in a user agent are also displayed in Chrome Developer Tools (Figure 1):

Figure 1 - Example of How Interest Groups Display in Chrome Developer Tools
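
To make that concrete, here is a minimal, hypothetical sketch of a joinAdInterestGroup() call. The owner, name, and lifetime values are made up, and the exact dictionary shape has evolved across versions of the Protected Audiences explainer, so treat this as illustrative rather than definitive:

// Hypothetical minimal interest group; real groups also carry bidding logic
// and ad creative URLs.
const interestGroup = {
  owner: 'https://dsp.example',          // origin that owns the group (illustrative)
  name: 'running-shoes-enthusiasts',     // illustrative audience name
  lifetimeMs: 30 * 24 * 60 * 60 * 1000   // how long the membership persists
};

// Asks the browser to add this user agent to the interest group.
await navigator.joinAdInterestGroup(interestGroup);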

Topics API Model and Audience Storage

We haven’t talked much at all about the Topics API yet; that discussion begins in a few more posts. But at a high level: the Topics API collects contextual information on how a specific user browses the Internet. It models that behavior locally in the browser on a weekly basis. The model takes as its input the hostnames of the sites visited by the user and categorizes the user agent into topics drawn from a taxonomy of several hundred entries (informed in part by the IAB Content Taxonomy); a caller sees at most three of them, one per week for the last three weeks. Both the model and the resulting topics are stored in the user agent.

There isn’t much to say about the storage used by the Topics API models and the audiences they create, because for the most part anything to do with the Topics API happens ‘behind the scenes’ in the user agent and the mechanics are opaque to both developers and end-users. The end result of the algorithms, however - the actual audiences the browser is modeled into - is transparent to both the developer and the end-user. In fact, the end-user can actually see what Topics API audiences they are part of. The end-user can also opt out of Topics API audiences through a number of mechanisms already existing in Chrome. An example of one such mechanism is clearing all browsing history, which clears the data used to model the user into a group.

Here is an example of a call that a developer can make to the Topics API to retrieve the current audiences into which the user agent is categorized:

// document.browsingTopics() returns an array of up to three topic objects in random order.
const topics = await document.browsingTopics();
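
Each element of the returned array is a plain object. The field names below reflect what Chrome returns at the time of writing; the values shown are made up:

// Illustrative shape of one entry in the returned array (values made up):
// {
//   configVersion: "chrome.2",
//   modelVersion: "4",
//   taxonomyVersion: "2",
//   topic: 309,             // numeric ID into the topics taxonomy
//   version: "chrome.2:2:4"
// }
for (const t of topics) {
  console.log(t.topic);      // e.g., 309
}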

There is a more interesting tool available to developers that can be found by typing the following into the Chrome address bar:

chrome://topics-internals

This provides a testing/debugging tool for developers who use the Topics API. In the Classifier tab, you can type in the hostnames of websites a group of viewers might look at. When you hit the “Classify” button, the browser displays the topics that Chrome’s locally stored model associates with each host (Figure 2).

Figure 2 – Topics for Websites That Are Stored in the Browser by the Topics API

For an explanation of how these topics get associated with these websites, see the Topics API post.

Developers can also see Topics audiences in the developer console under the same Interest Groups tab used for Protected Audiences interest groups. It is not obvious whether there is any indication of which audiences come from which API, nor is it obvious why the Google Chrome team decided this was the best way to handle things. Most likely it was a first approximation for an MVP, with more enhancements to follow as market feedback comes in.

Unless you are a browser developer, that’s about as much about Topics API model and audience storage as you need to know or worry about.

Shared Storage

We have talked a great deal about how the Privacy Sandbox uses dual-key partitioning to isolate data to prevent cross-site reidentification of a user’s profile and behaviors.  The dual keys are:

  • the site from which the content originates (the origin or context origin)
  • the site on which the content is displayed (the top-level domain of the web page in which the context is displayed, also called in the specification the top frame site or the top-level traversable).  

While this is great for privacy, it also creates problems for a variety of use cases that are essential for advertising. Let’s go through an example – implementing A/B testing of creatives – to help us understand the issues that partitioned storage creates. This example is taken from the Shared Storage API Explainer in the API’s GitHub repository, but I am going to take it more slowly and use pictures to help explain what is going on.

A/B Testing Under Partitioned Storage

Let’s start with the case where we have only dual-key partitioned storage. As discussed in prior posts, you can think of a single storage partition as a storage bucket in which critical data, like first-party cookies or information about which ads were served to the browser, is stored. The storage bucket concept from the Storage API is an overarching mechanism that provides improved isolation for critical data. So even though cookies live in a SQLite file called Cookies, the way Chrome stores them in that file is subject to the isolation techniques implicit in the Storage API.

For any given user, I want them to see only one of two creatives, A or B, no matter what site they are on when they see the ad.  In a world with partitioned storage, I cannot do that consistently since my activities on different sites can’t be cross-referenced.

Figure 3 shows, step by step, why this won’t work with partitioned storage.

Figure 3- Attempting A/B Testing with Partitioned Storage

Brand X has two different creatives, Creative A and Creative B, that it will have publishers display on any given site. It wants 50% of viewers who see a Brand X ad to always see Creative A and the other 50% to always see Creative B.

User A comes to a publisher site, Publisher1, using their browser – in this case a third-party publisher like CNN or Raw Story.

Even with the Privacy Sandbox, Publisher1 can place a first-party cookie. As a result, Publisher1 can identify User A’s user agent (browser or mobile device), consistently serve them Creative A every time they visit the site, and record that information in partitioned storage. This is true even if User A has opted out of anything but “essential cookies” (and note that there are kinds of first-party cookies to which that opt-out does apply). This latter case is a bit “gray,” and no doubt the privacy compliance folks may argue with me about it. But for purposes of this example, I am going to take a looser interpretation and say that showing the same ad to the same user agent on the same site using nothing but a first-party cookie isn’t a privacy violation.

With that latter assumption, this case is obvious and easy to implement.

The problem comes when Brand X now wants to find User A’s browser on Publisher2’s site.  There is no third-party cookie to depend on, so Publisher2 puts its own first-party cookie in User A’s browser.  It can decide to consistently show either Creative A or Creative B to User A’s browser and store that data in its (Publisher2’s) partitioned storage in the browser.  

Now there are two problems.  First, Publisher1’s first-party cookie has no tie-in to Publisher2’s first-party cookie, so there is no way to guarantee that User A is shown the same Brand X creative on both sites.

The second problem shows up even if, just by chance, User A is served Creative A on both sites. Statistically this will happen 50% of the time, and if we could connect the data from the two sites, we might still have enough data to make statistically valid reports about the performance of the two creatives for decision-making purposes. But in a partitioned storage world, when it comes time to do reporting, we can’t make that connection because the partitions prevent us from selectively combining data on User A. What we would need to do is look in both partitions, see where Creative A was served on both Publisher1 and Publisher2, and in those cases allow the data from both partitions to be aggregated in a reporting script runner for reporting with either the Attribution Reporting API or the Private Aggregation API. But we can’t do that. In the Privacy Sandbox, we can’t look inside the reporting script runner and see individual transactions. All we can do is aggregate ALL the data on impressions served on both sites, which means we cannot eliminate the impressions where User A was shown Creative B.

As a result, you cannot do cross-site A/B testing in a world without third-party cookies but with partitioned storage.

A/B Testing with Shared Storage

Figure 4 shows the same use case when Shared Storage is available.  We will only talk about the general concepts here.  The next section will discuss the actual mechanics for how this works.  The items highlighted in blue are what is different in the process between the two cases.

Figure 4 - Attempting A/B Testing with Shared Storage

In this case, when User A goes to Publisher1’s site, Brand X’s script (which runs on Publisher1’s page when the ad request occurs) checks whether a “seed” already exists for this browser. If not, it writes a seed to Brand X’s shared storage indicating that User A was assigned Creative A in an experiment identified as Experiment1. The experiment number, Experiment1, was provided by Brand X at the time the A/B test was designed. The seed is tied to Experiment1, which in turn is associated with the URL where Creative A can be found.
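
Here is a hypothetical sketch of what Brand X’s script might do on Publisher1’s page; the key name and seed encoding are made up for illustration:

// Runs in Brand X's context on Publisher1's page when the ad request occurs.
// 'brandx-experiment1-seed' is an illustrative key, not from the spec.
function generateSeed() {
  // '0' -> Creative A, '1' -> Creative B; assigned once per browser.
  return Math.random() < 0.5 ? '0' : '1';
}

// Only write the seed if this browser doesn't already have one.
await window.sharedStorage.set('brandx-experiment1-seed', generateSeed(),
                               { ignoreIfPresent: true });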

When User A shows up at Publisher2, Publisher2 also has Brand X’s script and the experiment number Experiment1. The script on Publisher2’s site makes a request against Brand X’s shared storage, via a worklet that tightly controls what data can be accessed and shared, using the Experiment1 ID as a match key. When the match key for Experiment1 is found, the seed is read and the browser provides an opaque URL that will deliver Creative A to User A’s browser. A record of the creative delivery is then written back to Brand X’s shared storage.

When it comes time to report, the data from Publisher1 and Publisher2 are aggregated and are consistent in that both have shown Creative A to User A.  Thus, any measurements for A/B testing will accurately reflect, as much as can be done with Privacy Sandbox aggregate reporting (which will be discussed later), the real performance of each unique creative.
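
To make the read side a little more concrete, here is a hypothetical sketch of what the worklet-side module might look like (the spec’s own page-side example, which loads a module called experiment.js, appears later in this post). The key and operation names are illustrative; the register()/run() pattern follows the Shared Storage explainer:

// experiment.js - loaded into the shared storage worklet via addModule().
// Hypothetical sketch; key and operation names are illustrative.
class SelectURLOperation {
  async run(urls, data) {
    // Read the previously stored seed ('0' or '1') from shared storage.
    const seed = await sharedStorage.get('brandx-experiment1-seed');
    // Return the index of the creative to show. The page never sees this
    // index directly; it only receives an opaque URL / fenced frame config.
    return parseInt(seed, 10) || 0;
  }
}

register('select-url-for-experiment', SelectURLOperation);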

Other Use Cases That Require Shared Storage

What are the most critical use cases where shared storage is considered necessary?  They include:

  • Cross-Site Reach Measurement
  • Frequency and Recency Capping
  • K+ Frequency Measurement
  • Reporting Embedder Context

The Mechanics of Shared Storage

Now that we’ve explained why shared storage is essential for certain use cases, let’s explore how shared storage works. Let me note that up until now we have been focusing on browser elements more generally and have been setting up the tools and concepts you need to delve into the internals of the Privacy Sandbox. Moving forward from this post, we will get into technical discussions about the operation of the Privacy Sandbox itself. We won’t go to the code level except occasionally, where it can exemplify some “higher level” conceptual point. We’ve done this before, and hopefully you didn’t feel you needed to be a software developer to understand the point I was making.

What Is a Shared Storage Worklet?

A shared storage worklet is a worklet with extra security restrictions on it to allow it to handle data shared between many sources in a privacy-preserving manner.  These restrictions include:

  • Shared storage worklets have limits on the APIs they can access relative to standard worklets.
  • Shared storage worklets cannot directly access the DOM, cookies, or other web page data.
  • Standard worklets can process data in its original format. Shared storage worklets can only process obfuscated data. The mechanics of that data obfuscation are internal to Chrome and are not available to the general public.
  • Standard worklets can communicate with the main webpage and other scripts using standard JavaScript mechanisms.  Shared storage worklets, on the other hand, have limited external communication channels. They interact with webpages (like fenced frames) through predefined "output gates" that control what information can be shared based on specific purposes.

These differences are summarized in Table 1.

Table 1 – Differences Between Standard Worklets and Shared Storage Worklets

  • Purpose. Standard worklet: general-purpose tasks within the browser. Shared storage worklet: processing and managing data in Shared Storage.
  • Data Access. Standard worklet: broad access to browser APIs, the DOM, cookies, and other storage mechanisms. Shared storage worklet: restricted access to specific Shared Storage APIs.
  • Data Processing. Standard worklet: processes data in its original format. Shared storage worklet: processes data in a privacy-preserving format (encryption or similar).
  • Security Environment. Both run in a secure, isolated environment.
  • Communication. Standard worklet: communicates with the main webpage and other scripts using standard JavaScript mechanisms. Shared storage worklet: limited communication through predefined "output gates" to destinations such as fenced frames.
  • Permissions. Standard worklet: may require specific permissions depending on the functionality accessed. Shared storage worklet: likely requires additional permissions for Shared Storage access and processing.
  • Focus. Standard worklet: performance optimization, handling complex tasks, interacting with various APIs. Shared storage worklet: secure, privacy-preserving data processing within Shared Storage.
  • Example Use Case. Standard worklet: offloading complex calculations from the main thread, updating UI elements asynchronously. Shared storage worklet: processing auction signals for ad selection in a privacy-preserving manner.

How Is Data Retrieved from a Shared Storage Worklet?

Data from a shared storage worklet can only be read via output gates. An output gate is a specially restricted channel through which data can leave the worklet. Think of output gates as a limited set of allowed use cases, in contrast to the free-form data output allowed from a standard worklet. Today there are two output gates defined in the specification:

  1. Fenced Frame (URL Selection) Output Gate. In this case, the worklet selects a URL from a short list of candidates, and the output must be rendered in a fenced frame. This fenced-frame requirement will not be enforced until at least 2026; in the meantime, the output can be rendered in an iframe.
  2. Private Aggregation Output Gate. This output gate allows data to leave the worklet only as aggregatable reports formatted according to the Private Aggregation API (a sketch of this gate follows the quote below).

The following quote from the Shared Storage API specification describes these two output gates in a bit more detail:

In particular, an embedder (author’s note: the embedder is the origin whose script calls the Shared Storage API and embeds the resulting fenced frame) can select a URL from a short list of URLs based on data in their shared storage and then display the result in a fenced frame. The embedder will not be able to know which URL was chosen except through specific mechanisms that will be better-mitigated in the longer term…
…An embedder is also able to send aggregatable reports through the Private Aggregation Service, which adds noise in order to achieve differential privacy, uses a time delay to send reports, imposes limits on the number of reports sent, and processes the reports into aggregate data so that individual privacy is protected.
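
As a hypothetical illustration of the second output gate, here is roughly what a worklet operation that emits an aggregatable report might look like. The module, operation, key names, and bucket value are all made up; contributeToHistogram() is the call the Private Aggregation API defines for this, and the pattern mirrors the cross-site reach use case listed earlier:

// measure-reach.js - hypothetical worklet module; names and bucket are illustrative.
class ReachMeasurementOperation {
  async run(data) {
    // Only count each browser once for this piece of content.
    const hasReported = await sharedStorage.get('has-reported-content');
    if (hasReported) {
      return;
    }
    // Contribute to an aggregatable histogram. The resulting report is
    // noised, delayed, and aggregated before anyone can read it.
    privateAggregation.contributeToHistogram({
      bucket: 1234567n,  // 128-bit bucket key (BigInt); made up here
      value: 1
    });
    await sharedStorage.set('has-reported-content', 'true');
  }
}

register('measure-reach', ReachMeasurementOperation);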

How Do Shared Storage Worklets Relate to Fenced Frames?

As noted above, fenced frames are a specific output format that can be used by a shared storage worklet. However, if you review the specification, it isn’t 100% clear that data will come into shared storage only from fenced frames. Fenced frames appear all over the code examples in the specification. For example (and, once again, don't worry about what the code means, just note the use of fenced frames):

function generateSeed() { ... }
await window.sharedStorage.worklet.addModule('experiment.js');

// Only write a cross-site seed to a.example's storage if there isn't one yet.
window.sharedStorage.set('seed', generateSeed(), { ignoreIfPresent: true });

let fencedFrameConfig = await window.sharedStorage.selectURL(
  'select-url-for-experiment',
  [
    {url: "blob:https://a.example/123...", reportingMetadata: {"click": "https://report.example/1..."}},
    {url: "blob:https://b.example/abc...", reportingMetadata: {"click": "https://report.example/a..."}},
    {url: "blob:https://c.example/789..."}
  ],
  { data: { name: 'experimentA' } });

// Assumes that the fenced frame 'my-fenced-frame' has already been attached.
document.getElementById('my-fenced-frame').config = fencedFrameConfig;

However, nothing in the specification states outright that shared storage worklets must only take in data from fenced frames.

Although there is no stated requirement, it is pretty clear why data for shared storage worklets must originate from, or be sent to, a fenced frame or some other privacy-preserving destination like a private aggregation report. Since the whole point of the Privacy Sandbox is to preserve privacy, it doesn’t do any good to use privacy-preserving storage for data that could be collected in a non-privacy-preserving manner. Moreover, many of the use cases for shared storage are advertising-driven, which means they center around ads delivered to a page. Once the Privacy Sandbox is fully implemented, all ads delivered to a site will be served in fenced frames. It thus makes sense that fenced frames are the assumed data source, and one of the two data receivers, for shared storage.

Does that mean that fenced frames are the only privacy-preserving source that shared storage can use?  That definitely is not clear, but it is certainly possible that shared storage worklets might be allowed to access specific, privacy-preserving data points from the main webpage through controlled APIs. However, directly accessing the entire webpage or user data is almost certainly not allowed.

How Is Data Stored in a Shared Storage Worklet?

Data that moves into shared storage is obfuscated on entry. Exactly how that is done is internal to Chrome and not documented publicly. It may involve privacy-enhancing technologies (PETs) such as homomorphic encryption or secure multi-party computation (MPC), but note that the specification’s privacy guarantees come mainly from restricting how data can leave the worklet through the output gates.

Each shared storage worklet is associated with a database. Each origin has its own shared storage database, which provides methods to store, retrieve, delete, clear, and purge expired data. The data in the database takes the form of entries, where each entry has a unique key and an associated value. In the prior A/B testing example, the key might identify the experiment and the value might encode items like the time/date, the advertiser name (Brand X), the experiment number (Experiment1), the creative assigned (Creative A or B), and an entry expiration date/time.
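
For reference, here is a rough sketch of the basic read/write surface (key names are illustrative). Writes can happen from the page or from inside a worklet, but raw values can only be read back inside a worklet:

// On the page (window context): write-only access to your origin's shared storage.
await window.sharedStorage.set('experiment1-creative', 'A');            // create or overwrite
await window.sharedStorage.set('seed', '1', { ignoreIfPresent: true }); // keep an existing value
await window.sharedStorage.append('impression-log', '|pub2');           // append to a value
await window.sharedStorage.delete('stale-key');                         // remove one entry
await window.sharedStorage.clear();                                     // remove all entries (shown for completeness)

// Inside a worklet module: raw values can be read, but results can only
// leave through an output gate (URL selection or Private Aggregation).
// const creative = await sharedStorage.get('experiment1-creative');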

Navigation Entropy Budget

The Google Privacy Sandbox has the notion of a privacy budget. This concept is not unique to Chrome. Privacy budgets come from the world of differential privacy and are one of many new concepts borrowed from Privacy Enhancing Technologies (PETs).

The basic notion of a privacy budget has to do with the information required to reconstruct a unique user profile. Every report generated from a browser releases a small quantity of information known as entropy. At some point the cumulative entropy from all these reports could surpass the threshold needed to do reidentification. As a result, when cumulative entropy reaches a certain level, browsers are prevented from certain actions.

We will discuss privacy budgets in excruciating detail later (because they are really cool and have serious implications for introducing bias into reporting). But for now it is enough to note that data leaving a shared storage worklet generates some amount of entropy. According to the specification, the most leakage that can occur when a specific URL is chosen from within a shared storage worklet (for example, when calling a specific creative) is log2(8), or 3 bits. This is because selectURL() accepts a list of at most 8 candidate URLs, so revealing which one was chosen when the result leaves the worklet via the fenced frame output gate can leak at most 3 bits.
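
The arithmetic is just the base-2 logarithm of the number of candidate URLs:

// Choosing 1 of N candidate URLs can leak at most log2(N) bits.
// selectURL() accepts at most 8 candidates, so:
const maxBitsLeaked = Math.log2(8);  // 3 bits per URL selection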

It is possible that, if enough data exits shared storage in a specific browser, that browser may not be able to continue exporting the data needed for specific use cases like A/B testing of ads. The Shared Storage API enforces a privacy budget per calling site per budget lifetime, or epoch. The specification does not mandate a specific lifetime over which entropy accumulates before being reset to zero, but the explainer in the GitHub repository proposes a one-day lifetime in its Output Gates and Privacy section. When a calling site exhausts its budget, the specification says the browser falls back to the first (default) entry in the list of candidate URLs.

We’ll stop there for today. That should be more than enough information about the new forms of storage in the browser related to the Privacy Sandbox to carry forward into the core APIs. Even on rereading, this is pretty dense material.