ANDREOU Athanasios

Person has left EURECOM

Thesis

Bringing transparency to personalized services through statistical inference

Personalized services are online services that use information about their users to offer

each user a service that is better adapted to her. With the proliferation of personal data over

the Internet, personalized services have become omnipresent in our daily life, including for

instance all services offering recommendations. Although this data-based personalization

has increased the utility of services for users and for service providers, it has also raised

privacy concerns that have become increasingly serious in recent years. One example of a

personalized service for which this issue is particularly acute is targeted advertising.

Advertisement is the main source of revenue for many free web services such as Facebook

and Google. The ad ecosystem is complex and can be composed of many actors; here we

abstract away this complexity and we refer to the whole chain of organizations that are

responsible for sending an ad (e.g., companies that want to advertise, data brokers,

advertising platforms) as the ad engine. The prominent advertisement model today is

pay-per-click, which has led to an increasing amount of targeted advertising to increase the

likelihood that a user clicks on an ad. Targeted advertising has increased advertisement

revenues significantly. However, targeted advertising has also been raising more and more

concerns from users who often feel that it constitutes an invasion of their private sphere. In

particular, users often wonder "what data do advertisers have about me?" or "why am I being

shown this ad?". In a nutshell, users' concerns are mainly kindled by the lack of transparency

of current targeted advertising systems.

The main objective of this thesis is to increase the transparency of targeted

advertising by providing users with tools and methods to understand why they are

targeted with a particular ad, to infer what information the ad engines possibly have

about them, and ultimately to control it. Concretely, we propose to build a browser plugin

that collects the ads shown to a user and provides her with analytics about these ads and

tools to control them. The browser plugin can either give information for a particular ad such

as "you are being shown this ad because the ad engine likely thinks that you are a student"

or give analytics on a longer term such as "given the ads you have been shown in the last 3

months, the ad engine likely thinks that you earn less than $50k per year".

One of the main challenges to build such a tool is to infer the information that the ad engine

knows about a user from the ads received. To explain our approach we abstract the system

into three components: the information the ad engine collects about a user either online from

tracking, or offline from data brokers (inputs), the ad engine that processes the inputs to put

users in certain marketing categories (the black box), and the ads sent to the user (outputs).

In this thesis, we propose to observe only the outputs and to infer the categories the user

was put in by the ad engine, regardless of whether this was due to a particular input or not. In

order to do that, we will simply collect the ads users receive, then group together all the

users that received the same ad, and look at the most common demographics and interests

of users in the group. We detail in Section B.1.b. the methods that we propose to develop to

do this statistical inference task. The main novelty of our technique is that it relies only on the

output, i.e., the ads observed by users and not on any input data the users may have

Thesis

description

Athanasios Andreou (EURECOM), advisor: Patrick Loiseau (EURECOM)

__________________________________________________________________________________

explicitly given. This makes our approach much more realistic. Then, we propose ways to

control the information services have about a user by noise addition rather than by trying to

directly block leakage of information, which is also a much more realistic process.

2. History and related work

Previous works made a number of contributions either by discovering problems [2], or by

proposing methods to bring more transparency to the ad ecosystem [1, 3, 4, 2]. We focus on

the studies that are the closest to our proposal and refer the reader to [5] for an overview.

Two studies [1, 4] proposed techniques to detect whether an ad is contextual, re-targeted or

behavioral. While this is an important first step for transparency, the studies did not take the

next step to detect why the ads are being targeted. Towards this direction, two studies

proposed techniques to see how the activities of a user influence the ads she receives [3, 2].

At a high level, these approaches monitor the input of users (e.g., the emails users receive

and send, the videos users see on YouTube, the sites users visit) and they propose methods

to estimate the likelihood that a given ad was shown due to a given input. Thus, these

studies look at the inputs and outputs of the ad engine and infer which inputs triggered which

outputs. On the contrary, our goal is to look at the outputs of the ad engine and infer what the

ad engine knows about the user regardless of whether this was due to a particular input. This

has numerous advantages: it requires less invasive monitoring, it has a much lower

overhead and it captures ads that are not triggered by any particular input.

B. Scientific Content

1. Approach, detailed content and expected results

a. System architecture: the three main components

Browser plugin: The browser plugin has two functionalities: collect ads and present ads

analytics to users. First, the browser plugin parses the web pages a user is browsing and

collects all the ads the user receives, and sends the ads to the storage server. We plan to

build this functionality based on an existing open-source, low-overhead ad blocker plugin

(e.g., https://adblockplus.org/) that has already been extensively tested. Second, the browser

plugin provides analytics to the user about the ads she receives, for a particular ad or over a

longer period. In addition, the plugin will include a webpage where users can optionally provide

personal information such as demographics, and include popup functionalities for active

labeling (see Method 1 below).

Data storage server: All the ads parsed by the browser plugin are sent and stored in an

SQL database. The server will be placed behind a firewall to secure the data from unwanted

access. The users will be tracked on our server by a user ID that will be randomly generated

for each plugin installation and we will not collect and store any identifying data about the

user (except the demographic and interest data the user willingly provides us).
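The storage design described above can be sketched as follows. This is only an illustration under stated assumptions: the proposal specifies an SQL database and a random per-installation user ID, but the table names, the columns and the use of SQLite are hypothetical choices made here for the example.

```python
import sqlite3
import uuid


def new_installation_id() -> str:
    """Random ID created once per plugin installation; it encodes nothing about the user."""
    return uuid.uuid4().hex


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) tables: observed ads and optional self-reported labels."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS ad_impressions (
            user_id TEXT NOT NULL,   -- random installation ID only
            ad_url  TEXT NOT NULL,
            seen_at TEXT NOT NULL    -- ISO 8601 timestamp
        );
        CREATE TABLE IF NOT EXISTS user_labels (
            user_id  TEXT NOT NULL,
            category TEXT NOT NULL,  -- demographic or interest given voluntarily
            PRIMARY KEY (user_id, category)
        );
        """
    )
    return conn


def record_ad(conn: sqlite3.Connection, user_id: str, ad_url: str, seen_at: str) -> None:
    """Store one ad observation using a parameterized query."""
    conn.execute(
        "INSERT INTO ad_impressions (user_id, ad_url, seen_at) VALUES (?, ?, ?)",
        (user_id, ad_url, seen_at),
    )
    conn.commit()
```

Keeping the voluntary labels in a separate table makes it easy to delete them without touching the ad observations.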

Data analysis server: The server will run all data analysis scripts that infer why a given ad is

being shown to a particular user at a particular time. The scripts will infer, for each ad, what

are the likely marketing categories to which the ad is targeted by analyzing the ad itself and

all the users that received the ad. The output results will be stored in an SQL database.

b. Inferring why users are targeted

The main methodological challenge of this thesis is to infer the reasons why a given ad was

shown to a given user. We propose three methods to solve this challenge. Method 1

corresponds to the approach already described above, and we envision using it in the long run once the

tool has sufficiently many users. However, to bootstrap the tool and not let the project rely

entirely on getting a large deployment, we propose Methods 2 and 3 that can provide

analytics starting from day one. At a high level, all three methods rely on defining a

probabilistic identity (i.e., users are associated with distributions over demographics rather

than a specific one) for each user and inferring these probabilities from the ads received.
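To make the notion of a probabilistic identity concrete, here is a minimal sketch of one possible update rule. It assumes, purely for illustration, that each ad comes with a known likelihood of being shown to each category and that ads are conditionally independent given the category (a naive Bayes assumption); the proposal does not commit to this model, and the function name and numbers are hypothetical.

```python
def update_identity(prior, ad_likelihoods):
    """Return the posterior distribution over categories after observing a list of ads.

    prior: dict mapping category -> prior probability (values sum to 1).
    ad_likelihoods: list with one dict per observed ad, mapping
        category -> P(ad is shown | user is in that category).
    """
    posterior = dict(prior)
    for likelihood in ad_likelihoods:
        # Bayes rule: multiply by the likelihood of this ad, then renormalize.
        for category in posterior:
            posterior[category] *= likelihood.get(category, 0.0)
        total = sum(posterior.values())
        if total == 0.0:
            raise ValueError("observed ad is impossible under every category")
        posterior = {c: p / total for c, p in posterior.items()}
    return posterior


# Hypothetical example: an ad shown four times more often to students.
student_ad = {"student": 0.8, "employed": 0.2}
identity = update_identity({"student": 0.5, "employed": 0.5}, [student_ad, student_ad])
```

After two such ads the distribution leans strongly toward "student" while still leaving some probability on "employed", which is the point of keeping a distribution rather than a single label.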

Method 1: Ask users. We can simply ask a subsample of users that installed the tool to

provide us with their demographics and interests and use this as training data to infer why

other users are being targeted with a particular ad. To collect the data we will include in the

browser plugin an opt-in option that will trigger an initial questionnaire that will ask users

about their general demographics and interests. In addition, to improve efficiency even with

few labeled samples, we plan to use active learning. Active learning is a set of techniques

that combine machine learning algorithms with real-time input from users to optimize the

accuracy of the overall process by selecting the examples to label that will be the most

informative for the learning algorithm. Concretely, for users and categories optimally selected

using active learning techniques, we will trigger quick questionnaires where we just ask users

to confirm whether they have a particular interest or demographics.
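One common way to choose which questions to ask is uncertainty sampling, sketched below. This is a stand-in chosen for illustration, not necessarily the selection rule the thesis will adopt; the function names and the input format (a current probability estimate per user and category) are assumptions.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no question whose answer is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def pick_questions(estimates, budget):
    """Select the (user_id, category) pairs whose answer is currently most uncertain.

    estimates: dict mapping (user_id, category) -> estimated probability that
        the user belongs to the category.
    budget: maximum number of quick questionnaires to trigger.
    """
    ranked = sorted(estimates, key=lambda pair: binary_entropy(estimates[pair]), reverse=True)
    return ranked[:budget]


# Hypothetical estimates: the pair closest to 0.5 is asked first.
to_ask = pick_questions(
    {("u1", "student"): 0.95, ("u2", "student"): 0.5, ("u3", "parent"): 0.7},
    budget=2,
)
```

Asking only about pairs near 0.5 avoids bothering users with questions whose answer the system can already predict with high confidence.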

Once we have labeled examples, the process of inferring why a given ad is targeted consists

essentially in grouping users that received the particular ad and analyzing the demographics

and interests of the labeled examples in this group. The confidence we have in the prediction

will depend on the number of labeled examples we have in the group and how homogeneous

the examples are with respect to a particular category. Developing this method will require

advanced research in statistical methods to measure similarity between ads, optimally

group users to maximize the information inferred, and evaluate the estimation confidence.
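The grouping step described above can be sketched as follows. This is a deliberately simple baseline under stated assumptions: ads are matched by an exact identifier (no similarity measure), and confidence is reported only as the share of labeled recipients in the top category together with the group size. The function name, the input formats and the `min_labeled` threshold are hypothetical.

```python
from collections import Counter, defaultdict


def infer_ad_targeting(impressions, labels, min_labeled=5):
    """Guess, for each ad, the most common category among its labeled recipients.

    impressions: iterable of (user_id, ad_id) pairs.
    labels: dict mapping user_id -> set of categories (labeled users only).
    Returns a dict mapping ad_id -> (category, share, n_labeled), or None for
    ads with fewer than min_labeled labeled recipients.
    """
    recipients = defaultdict(set)
    for user_id, ad_id in impressions:
        recipients[ad_id].add(user_id)

    result = {}
    for ad_id, users in recipients.items():
        labeled = [u for u in users if u in labels]
        counts = Counter(cat for u in labeled for cat in labels[u])
        if len(labeled) < min_labeled or not counts:
            result[ad_id] = None  # too little evidence to say anything
            continue
        category, hits = counts.most_common(1)[0]
        # share close to 1 means the group is homogeneous for this category
        result[ad_id] = (category, hits / len(labeled), len(labeled))
    return result
```

A share near 1 on a large group corresponds to the high-confidence case in the text, while a small or mixed group yields either `None` or a low share.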

Method 2: Analyze the ads. A different technique to infer why an ad is being targeted is to

simply analyze the ad. We can do so with multiple sources. We can use sites such as Alexa

or Web of Trust to infer the categories of the ad's landing page. We can additionally use

natural language processing tools (such as Mashape, CoreNLP, AlchemyAPI, OpenCalais,

Semantria, TAGME) to analyze the content of the landing page and infer entities, context,

topics or sentiment related to the ad. (To avoid spending the advertiser's budget, we will not

click on the ads; we will copy the URLs, remove any user identifier and paste them in another

browser.) Finally, when available, we can use information provided by Quantcast.
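Removing user identifiers from a copied ad URL could look like the sketch below. The list of query parameters is an assumption made for illustration (a few widely used click-ID parameters plus generic names); real ad URLs may also carry identifiers in the path or behind redirect chains, which this sketch does not handle.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed, non-exhaustive list of query parameters that identify a click or a user.
IDENTIFYING_PARAMS = {"gclid", "fbclid", "msclkid", "uid", "user_id"}


def strip_identifiers(url: str) -> str:
    """Return the URL without identifying query parameters and without its fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IDENTIFYING_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


clean = strip_identifiers("https://shop.example/item?id=42&gclid=abc123#top")
```

Parameters that are not on the list, such as the product `id` here, are kept so that the landing page still resolves to the same content.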

Method 3: Infer from controlled experiments. Finally, the last method is to build controlled

experiments to do the mapping between ads and interests/demographics. We will create

different browsing profiles that reflect particular demographics and interests using the

techniques in [1, 2], and monitor what ads are shown to these browsing profiles. To compute

the probability that a given ad was targeted due to a certain demographic or interest we will

build on the technique proposed by [3] extended to our setting.
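As a rough illustration of what a controlled experiment yields, the sketch below compares how often an ad appears in profiles built with a given trait and in control profiles without it. It is a simple stand-in, not the technique of [3]: it assumes the two kinds of profile are equally likely a priori and uses add-one style smoothing, and all names and numbers are hypothetical.

```python
def trait_posterior(shown_trait, n_trait, shown_control, n_control, smoothing=1.0):
    """Estimate P(profile has the trait | ad was shown) under a uniform prior.

    shown_trait / n_trait: trait profiles that saw the ad / trait profiles run.
    shown_control / n_control: same counts for control profiles without the trait.
    smoothing: pseudo-count that keeps rates away from exactly 0 or 1.
    """
    if n_trait <= 0 or n_control <= 0:
        raise ValueError("both groups need at least one profile")
    rate_trait = (shown_trait + smoothing) / (n_trait + 2.0 * smoothing)
    rate_control = (shown_control + smoothing) / (n_control + 2.0 * smoothing)
    if rate_trait + rate_control == 0.0:
        raise ValueError("ad never shown and smoothing is 0: no evidence either way")
    # Bayes rule with equal prior weight on the two profile types.
    return rate_trait / (rate_trait + rate_control)


# Hypothetical run: the ad appeared in 9 of 10 trait profiles and 1 of 10 controls.
score = trait_posterior(9, 10, 1, 10, smoothing=0.0)
```

A score near 0.5 means the ad is shown regardless of the trait, while a score near 1 suggests the trait is a likely targeting criterion.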

Strategy to evaluate the accuracy of our inferences. To evaluate the accuracy of our

results, we plan to collect data from the new "Why am I seeing this" functionality on

Facebook. This functionality will provide ground truth data to evaluate our tool and methods.

Yet, our tool goes well beyond it for several reasons: (i) Facebook does not always give all

the reasons why an ad is targeted; (ii) companies can come with a list of contacts (emails,

cookies or phone numbers) and ask Facebook to send ads to the users in their list, in which

case Facebook simply says that "you were in the list" whereas our tool might be able to infer

why the user is in the list; and (iii) we analyze ads on all websites and not just Facebook.
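One simple way to score an inference against such ground truth is precision and recall over the sets of categories, sketched below. The metric choice and the function name are assumptions, not part of the proposal. Because the platform's explanation may omit some reasons, as noted in (i), the precision computed this way should be read as a lower bound.

```python
def precision_recall(inferred, ground_truth):
    """Compare inferred targeting categories with those given by the platform.

    Returns (precision, recall): the fraction of inferred categories that the
    platform confirms, and the fraction of platform categories that we found.
    """
    inferred, ground_truth = set(inferred), set(ground_truth)
    hits = len(inferred & ground_truth)
    precision = hits / len(inferred) if inferred else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical example: two inferred categories, both confirmed, out of four stated reasons.
p, r = precision_recall({"student", "paris"}, {"student", "paris", "age 18-24", "sports"})
```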

Methods to control the information known about a user. Lastly, we will investigate

methods for users to act on the information that is known about them. Since controlling the

information gathered by services is almost impossible, we propose to instead add noisy

information to obfuscate the real information. We will investigate methods that add noise in

order to achieve a given wanted probabilistic identity for the user. Our tool to infer this

probabilistic identity will make it possible to verify the effectiveness of our method.
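To illustrate the idea of adding noise to reach a wanted probabilistic identity, the sketch below computes how many artificial signals per category would be needed. It rests on a strong simplifying assumption that is not made in the proposal: that the ad engine's view of a user is proportional to the number of signals it observes per category. Since real signals cannot be removed, the only lever is adding noise; all names are hypothetical.

```python
import math


def noise_needed(observed, target):
    """Number of noise signals to add per category to approximate the target identity.

    observed: dict mapping category -> count of real signals already emitted.
    target: dict mapping category -> wanted probability (values sum to 1).
    """
    if abs(sum(target.values()) - 1.0) > 1e-9:
        raise ValueError("target probabilities must sum to 1")
    for category, count in observed.items():
        if count > 0 and target.get(category, 0.0) <= 0.0:
            raise ValueError(f"cannot reach probability 0 for observed category {category!r}")
    # Smallest total number of signals such that no category needs signals removed.
    total = math.ceil(max((observed.get(c, 0) / p for c, p in target.items() if p > 0), default=0))
    return {c: max(0, round(total * p) - observed.get(c, 0)) for c, p in target.items()}


# Hypothetical example: 8 "student" signals and 2 "employed" signals, aiming for a 50/50 identity.
plan = noise_needed({"student": 8, "employed": 2}, {"student": 0.5, "employed": 0.5})
```

The inference tool can then be rerun on the ads received after the noise is injected to check whether the observed identity actually moved toward the target.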

c. Deployment strategy and risks

Incentives for users to install the tool. As evidenced by the success of other similar

projects such as Ghostery (with > 3.5 million adopters), many users are interested in

transparency. Still, to further minimize the risk of low adoption, we will take the following actions. To increase

the tool's utility, we will package it with an ad blocker (Adblock Plus has more

than 50 million adopters on Chrome alone). To incentivize users to provide their

demographics and interests we will investigate different incentive techniques based on

lotteries and gift certificates proposed in our prior work [6,7].

Privacy risks. To use our tool, users will need to donate the ads they see when browsing

the Internet. Even if such data does not include any PII, some users might feel that ads could

reveal information that is personal and the data collection might therefore entail privacy

concerns. Users installing the plugin will be provided guarantees about the treatment of their

data. In particular: no information will be collected beyond their ads (unless they voluntarily

consent to providing demographics), all information will be stored and communicated

securely and the data will be used solely for the purpose of providing ads analytics. We

believe that these guarantees will be sufficient for users to confidently adopt our plugin.

2. Qualifications involved and collaborations

The main qualifications needed for this thesis are network measurement, statistical inference

and incentives design, which exactly correspond to the director's expertise. The student also has

excellent qualifications in these areas and excellent potential for the topic. In

addition, the thesis will be performed in collaboration with Prof. Krishna Gummadi and

Dr. Oana Goga from the Max Planck Institute for Software Systems. They have

expertise in systems building and in online social systems that will be useful for the thesis

and this collaboration with a top EU institution will strengthen the student's education.