Data Leakage Detection - Engineering Seminar


Data Leakage Detection
Abstract
We study the following problem: A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data are leaked and found in an unauthorized place (e.g., on the web or somebody's laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. We propose data allocation strategies (across the agents) that improve the probability of identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases, we can also inject "realistic but fake" data records to further improve our chances of detecting leakage and identifying the guilty party.

INTRODUCTION
In the course of doing business, sometimes sensitive data must be handed over to supposedly trusted third parties. For example, a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data. Another enterprise may outsource its data processing, so data must be given to various other companies. We call the owner of the data the distributor and the supposedly trusted third parties the agents. Our goal is to detect when the distributor's sensitive data have been leaked by agents, and if possible to identify the agent that leaked the data.
We consider applications where the original sensitive data cannot be perturbed. Perturbation is a very useful technique where the data are modified and made "less sensitive" before being handed to agents. For example, one can add random noise to certain attributes, or one can replace exact values by ranges. However, in some cases, it is important not to alter the original distributor's data. For example, if an outsourcer is doing our payroll, he must have the exact salary and customer bank account numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data for the patients.
Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases, but again, involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.
In   this   paper,   we   study  unobtrusive  techniques  for
detecting leakage  of a set of objects or records. Specifically,
we study the following scenario:  After giving a set of objects to  agents,   the  distributor  discovers some   of  those   same objects  in  an  unauthorized place.  (For  example, the  data may  be found on a website,  or may be obtained through a legal  discovery process.)  At this  point,  the  distributor can assess the likelihood that  the leaked  data  came from one or more  agents,   as  opposed  to  having been  independently gathered by  other  means. Using  an  analogy with  cookies stolen  from  a  cookie  jar, if we  catch  Freddie with  a single cookie, he can argue  that a friend  gave him the cookie. But if we catch  Freddie with  five cookies,  it will be much  harder for him to argue  that his hands were  not in the cookie jar. If the distributor sees “enough evidence” that an agent leaked  data,  he may  stop  doing  business with him, or may  initiate  legal proceedings.
In this paper, we develop a model for assessing the "guilt" of agents. We also present algorithms for distributing objects to agents, in a way that improves our chances of identifying a leaker. Finally, we also consider the option of adding "fake" objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out that an agent was given one or more fake objects that were leaked, then the distributor can be more confident that agent was guilty.

We start  in Section 2 by introducing our  problem setup
and  the notation we use. In Sections  4 and  5, we present a model for calculating “guilt” probabilities in  cases  of data leakage.  Then, in Sections 6 and  7, we present strategies for data  allocation to agents.  Finally, in Section 8, we evaluate the strategies in different data leakage  scenarios, and  check whether they  indeed help us to identify a leaker.

PROBLEM  SETUP  AND NOTATION
Entities and Agents
A distributor owns a set T = {t1, ..., tm} of valuable data objects. The distributor wants to share some of the objects with a set of agents U1, U2, ..., Un, but does not wish the objects to be leaked to other third parties. The objects in T could be of any type and size, e.g., they could be tuples in a relation, or relations in a database. An agent Ui receives a subset of objects Ri ⊆ T, determined either by an explicit request (all T objects that satisfy a given condition condi) or by a sample request (a sample of mi objects from T).

Although we  do  not  discuss it here,  our  model  can  be easily  extended to  requests for  a  sample of  objects  that satisfy  a condition (e.g., an agent  wants any  100 California customer   records).   Also   note   that   we   do   not   concern  ourselves with  the  randomness of a  sample. (We  assume that  if a  random sample is  required, there  are  enough T records so that the to-be-presented object selection  schemes can pick random records from  T .)

Guilty  Agents
Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has leaked. This means that some third party, called the target, has been caught in possession of S. For example, this target may be displaying S on its website, or perhaps as part of a legal discovery process, the target turned over S to the distributor.
Since the agents U1, ..., Un have some of the data, it is reasonable to suspect them of leaking the data. However, the
agents  can argue  that they are innocent, and  that the S data were   obtained  by  the   target   through  other  means.  For example,  say  that   one  of  the  objects  in  S  represents  a customer X.  Perhaps X  is also  a  customer of some  other company, and that company provided the data to the target.  Or perhaps X  can  be  reconstructed from  various publicly available sources on the web.
Our goal is to estimate the likelihood that the leaked data
came   from   the   agents   as   opposed    to  other   sources. Intuitively, the  more   data   in  S,  the  harder  it  is  for  the agents  to argue  they  did  not  leak  anything. Similarly,  the “rarer” the  objects,  the  harder it is to argue  that  the  target obtained them  through other means. Not  only  do we want  to  estimate the  likelihood  the  agents  leaked  data,  but  we would also like to find out if one of them,  in particular, was more  likely  to  be  the  leaker.   For  instance, if  one  of  the S objects was only given to agent  U1 , while  the other  objects were  given  to  all  agents,   we  may  suspect U1    more.   The model  we present next captures this  intuition.
We say an agent Ui is guilty if it contributes one or more objects to the target. We denote the event that agent Ui is guilty by Gi and the event that agent Ui is guilty for a given leaked set S by Gi|S. Our next step is to estimate Pr{Gi|S}, i.e., the probability that agent Ui is guilty given evidence S.

RELATED  WORK
The guilt detection approach we present is related to the data provenance problem: tracing the lineage of the S objects implies essentially the detection of the guilty agents. Tutorials on data provenance provide a good overview of the research conducted in this field. Suggested solutions are domain specific, such as lineage tracing for data warehouses, and assume some prior knowledge on the way a data view is created out of data sources. Our problem formulation with objects and sets is more general and simplifies lineage tracing, since we do not consider any data transformation from Ri sets to S.
As far as the data allocation strategies are concerned, our work is mostly relevant to watermarking that is used as a means of establishing original ownership of distributed objects. Watermarks were initially used in images, video, and audio data whose digital representation includes considerable redundancy. More recently, other works have also studied mark insertion into relational data. Our approach and watermarking are similar in the sense of providing agents with some kind of receiver-identifying information. However, by its very nature, a watermark modifies the item being watermarked. If the object to be watermarked cannot be modified, then a watermark cannot be inserted. In such cases, methods that attach watermarks to the distributed data are not applicable.
Finally, there are also many other works on mechanisms that allow only authorized users to access sensitive data through access control policies. Such approaches prevent data leakage, in some sense, by sharing information only with trusted parties. However, these policies are restrictive and may make it impossible to satisfy agents' requests.

AGENT  GUILT MODEL
To compute this Pr{Gi|S}, we need an estimate for the probability that values in S can be "guessed" by the target. For instance, say that some of the objects in S are e-mails of individuals. We can conduct an experiment and ask a person with approximately the expertise and resources of the target to find the e-mail of, say, 100 individuals. If this person can find, say, 90 e-mails, then we can reasonably guess that the probability of finding one e-mail is 0.9. On the other hand, if the objects in question are bank account numbers, the person may only discover, say, 20, leading to an estimate of 0.2. We call this estimate pt, the probability that object t can be guessed by the target.
Probability pt is analogous to the probabilities used in designing fault-tolerant systems. That is, to estimate how likely it is that a system will be operational throughout a given period, we need the probabilities that individual components will or will not fail. A component failure in our case is the event that the target guesses an object of S. The component failure is used to compute the overall system reliability, while we use the probability of guessing to identify agents that have leaked information. The component failure probabilities are estimated based on experiments, just as we propose to estimate the pt values. Similarly, the component probabilities are usually conservative estimates, rather than exact numbers. For example, say that we use a component failure probability that is higher than the actual probability, and we design our system to provide a desired high level of reliability. Then we will know that the actual system will have at least that level of reliability, but possibly higher. In the same way, if we use pt values that are higher than the true values, we will know that the agents will be guilty with at least the computed probabilities.
To simplify the formulas that we present in the rest of the paper, we assume that all T objects have the same pt, which we call p. Our equations can be easily generalized to diverse pt values, though they become cumbersome to display.
Next, we make two assumptions regarding the relationship among the various leakage events. The first assumption simply states that an agent's decision to leak an object is not related to other objects. We have also studied a scenario where the actions for different objects are related, and how our results are impacted by different independence assumptions.
Assumption 1. For all t, t' ∈ S such that t ≠ t', the provenance of t is independent of the provenance of t'.
The term "provenance" in this assumption statement refers to the source of a value t that appears in the leaked set. The source can be any of the agents who have t in their sets or the target itself (guessing).
To simplify our formulas, the following assumption states that joint events have a negligible probability. As we argue in the example below, this assumption gives us more conservative estimates for the guilt of agents, which is consistent with our goals.
Assumption 2. An object t ∈ S can only be obtained by the target in one of two ways: either a single agent Ui leaked t from its own Ri set, or the target guessed (or obtained through other means) t without the help of any of the n agents.
Note that if Assumption 2 did not hold, our analysis would be more complex because we would need to consider joint events, e.g., the target guesses t1, and at the same time, one or two agents leak the value. In our simplified analysis, we say that an agent is not guilty when the object can be guessed, regardless of whether the agent leaked the value. Since we are "not counting" instances when an agent leaks information, the simplified analysis yields conservative values (smaller probabilities).
In the general case (with our assumptions), to find the probability that an agent Ui is guilty given a set S, we first compute the probability that he leaks a single object t to S. To compute this, we define the set of agents Vt = {Ui | t ∈ Ri} that have t in their data sets. Then, under Assumption 2, each leaked object t ∈ S ∩ Ri is attributed either to the target's guessing (with probability p) or, with the remaining probability, to one of the agents in Vt; combining these per-object probabilities under Assumption 1 yields Pr{Gi|S}.
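To make this estimation concrete, the following short Python sketch computes Pr{Gi|S} for every agent. It is only an illustration of the model above, under the added assumption that each leaked object is guessed by the target with probability p and is otherwise attributed with equal probability to each of the agents in Vt, with objects treated independently (Assumption 1); the function name and data layout are illustrative.

def guilt_probabilities(S, R, p):
    # S: set of leaked objects.
    # R: dict mapping agent id -> set of objects given to that agent.
    # p: probability that the target guessed an object on its own.
    guilt = {}
    for i, Ri in R.items():
        prob_not_guilty = 1.0
        for t in S & Ri:
            # Vt: agents that were given object t.
            Vt = [j for j, Rj in R.items() if t in Rj]
            # Object t implicates agent i unless it was guessed (prob. p) or
            # leaked by another holder; the residual 1 - p is split equally over Vt.
            prob_not_guilty *= 1.0 - (1.0 - p) / len(Vt)
        guilt[i] = 1.0 - prob_not_guilty
    return guilt

# Example: t1 was given only to agent 1, t2 to both agents, and both leaked.
print(guilt_probabilities(S={"t1", "t2"}, R={1: {"t1", "t2"}, 2: {"t2"}}, p=0.5))

In the example call, object t1 was given only to agent 1, so a leak that includes t1 weighs more heavily against that agent.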

DATA ALLOCATION  PROBLEM
The main  focus of this paper is the data  allocation problem: how can the distributor “intelligently” give data to agents  in order  to improve the chances  of detecting a guilty  agent?  As illustrated in Fig. 2, there  are four instances of this problem we address, depending on the type  of data  requests made by agents  and  whether “fake objects”  are allowed.
The  two  types  of requests we  handle were  defined  in
Section   2:  sample  and   explicit.   Fake  objects  are  objects generated by the distributor that are not in set T . The objects are designed to look like real objects, and  are distributed to agents   together with   T  objects,  in  order  to  increase   the chances  of detecting agents  that  leak data.  We discuss fake objects in more  detail  in Section 6.1.
As shown in Fig. 2, we represent our four problem instances with the names EF, EF̄, SF, and SF̄, where E stands for explicit requests, S for sample requests, F for the use of fake objects, and F̄ for the case where fake objects are not allowed.
Note  that,  for simplicity, we are assuming that  in the E
problem instances, all agents make explicit requests, while in the S instances, all agents make sample requests. Our results can be extended to handle mixed cases, with some explicit and some sample requests. We provide here a small example to illustrate how mixed requests can be handled, but then do not elaborate further. Assume that we have two agents with requests R1 = EXPLICIT(T, cond1) and R2 = SAMPLE(T', 1), where T' = EXPLICIT(T, cond2). Further, say that cond1 is "state = CA" (objects have a state field). If agent U2 has the same condition cond2 = cond1, we can create an equivalent problem with sample data requests on set T'. That is, our problem will be how to distribute the CA objects to two agents, with R1 = SAMPLE(T', |T'|) and R2 = SAMPLE(T', 1). If instead U2 uses condition "state = NY," we can solve two different problems for sets T' and T − T'. In each problem, we will have only one agent. Finally, if the conditions partially overlap, R1 ∩ T' ≠ ∅ but R1 ≠ T', we can solve three different problems for sets R1 − T', R1 ∩ T', and T' − R1.

Fake Objects
The  distributor may   be  able  to  add   fake  objects  to  the distributed data  in  order   to  improve his  effectiveness in detecting guilty  agents.  However, fake objects may  impact  the correctness of what  agents  do, so they  may  not always be allowable.
The idea of perturbing data to detect leakage is not new. However, in most cases, individual objects are perturbed, e.g., by adding random noise to sensitive salaries, or adding a watermark to an image. In our case, we are perturbing the set of distributor objects by adding fake elements. In some applications, fake objects may cause fewer problems than perturbing real objects. For example, say that the distributed data objects are medical records and the agents are hospitals. In this case, even small modifications to the records of actual patients may be undesirable. However, the addition of some fake medical records may be acceptable, since no patient matches these records, and hence, no one will ever be treated based on fake records.
Our  use of fake objects is inspired by the use of “trace”
records in  mailing lists.  In  this  case,  company A  sells  to company B a  mailing list  to  be  used   once  (e.g.,  to  send  advertisements).  Company  A  adds   trace   records  that contain  addresses owned by company A.  Thus,  each  time company  B  uses   the  purchased  mailing  list,  A  receives copies  of  the  mailing.   These  records are  a  type   of  fake objects that  help  identify improper use of data.
The distributor creates  and  adds fake objects to the data
that he distributes to agents. We let Fi ⊆ Ri be the subset of fake objects that agent Ui receives. As discussed below, fake objects must be created carefully so that agents cannot distinguish them from real objects.
In  many   cases,  the  distributor may  be  limited in  how many fake objects he can create.  For example, objects may contain  e-mail addresses, and  each fake e-mail address may require the creation of an actual inbox (otherwise, the agent  may  discover  that   the  object  is  fake).  The  inboxes   can actually  be   monitored   by  the   distributor:  if  e-mail   is received  from   someone  other   than   the   agent   who   was given the address, it is evident that the address was leaked. Since  creating and  monitoring e-mail  accounts  consumes resources, the distributor may have a limit of fake objects. If there  is a limit,  we denote it by B fake objects.
Similarly,  the  distributor may  want  to limit  the number
of fake  objects  received by each  agent  so as to not  arouse suspicions and  to  not  adversely impact  the  agents’  activ- ities. Thus, we say that the distributor can send  up to bi  fake objects to agent  Ui .
Creation. The creation of fake but real-looking objects is a nontrivial problem whose thorough investigation is beyond the scope of this paper. Here, we model the creation of a fake object for agent Ui as a black box function CREATEFAKEOBJECT(Ri, Fi, condi) that takes as input the set of all objects Ri, the subset of fake objects Fi that Ui has received so far, and condi, and returns a new fake object. This function needs condi to produce a valid object that satisfies Ui's condition. Set Ri is needed as input so that the created fake object is not only valid but also indistinguishable from other real objects. For example, the creation function of a fake payroll record that includes an employee rank and a salary attribute may take into account the distribution of employee ranks, the distribution of salaries, as well as the correlation between the two attributes. Ensuring that key statistics do not change by the introduction of fake objects is important if the agents will be using such statistics in their work. Finally, function CREATEFAKEOBJECT() has to be aware of the fake objects Fi added so far, again to ensure proper statistics.
The distributor can also use function CREATEFAKEOBJECT() when it wants to send the same fake object to a set of agents. In this case, the function arguments are the union of the Ri and Fi tables, respectively, and the intersection of the conditions condi.
Although we do not deal with the implementation of CREATEFAKEOBJECT(), we note that there are two main design options. The function can either produce a fake object on demand every time it is called, or it can return an appropriate object from a pool of objects created in advance.
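As a purely illustrative sketch of the second design option (serving fake objects from a pool prepared in advance), one possible Python shape for such a black box is shown below; the class name, the record fields, and the pool-based policy are assumptions made for this example rather than anything prescribed above.

import random

class FakeObjectCreator:
    # Illustrative stand-in for the black box CREATEFAKEOBJECT(Ri, Fi, condi).
    def __init__(self, pool):
        # pool: fake objects prepared in advance (the second design option).
        self.pool = list(pool)

    def create(self, Ri, Fi, condi):
        # Return a fake object that satisfies condi and is not already
        # present in Ri or among the fakes Fi handed out so far.
        candidates = [o for o in self.pool
                      if condi(o) and o not in Ri and o not in Fi]
        if not candidates:
            raise RuntimeError("no suitable fake object available")
        obj = random.choice(candidates)
        self.pool.remove(obj)
        return obj

# Hypothetical usage: records are (name, rank, salary) tuples.
creator = FakeObjectCreator(pool=[("J. Doe", 3, 52000), ("A. Roe", 2, 61000)])
fake = creator.create(Ri={("M. Poe", 3, 50000)}, Fi=set(), condi=lambda rec: rec[1] >= 2)
print(fake)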

Optimization Problem
The distributor’s data allocation to agents  has one constraint and  one  objective.  The  distributor’s  constraint  is to  satisfy agents’ requests, by  providing  them  with  the  number of objects they  request or with all available objects that  satisfy their conditions. His objective is to be able to detect  an agent who  leaks  any portion of his data.
We consider the constraint as strict. The distributor may
not  deny  serving an  agent  request as in [13] and  may  not provide agents   with   different perturbed  versions  of  the same  objects as in [1]. We consider fake object distribution as the only  possible  constraint relaxation.
Our detection objective is ideal and intractable. Detection
would be assured only if the distributor gave no data object to any  agent  (Mungamuru and  Garcia-Molina discuss that  to  attain   “perfect” privacy and  security, we  have  to sacrifice  utility).   We  use  instead  the  following objective: maximize the chances  of detecting a guilty  agent  that  leaks all his data  objects.
We now introduce some notation to state formally the distributor's objective. Recall that Pr{Gj|S = Ri}, or simply Pr{Gj|Ri}, is the probability that agent Uj is guilty if the distributor discovers a leaked table S that contains all Ri objects. We define the difference functions Δ(i, j) as

Δ(i, j) = Pr{Gi|Ri} − Pr{Gj|Ri},   i, j = 1, ..., n.   (6)

Note that differences Δ have nonnegative values: given that set Ri contains all the leaked objects, agent Ui is at least as likely to be guilty as any other agent. Difference Δ(i, j) is positive for any agent Uj whose set Rj does not contain all data of S. It is zero if Ri ⊆ Rj. In this case, the distributor will consider both agents Ui and Uj equally guilty since they have both received all the leaked objects. The larger a Δ(i, j) value is, the easier it is to identify Ui as the leaking agent. Thus, we want to distribute data so that Δ values are large.
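The following minimal sketch evaluates Δ(i, j) for a candidate allocation, reusing the same simplified guilt estimate as in the earlier sketch (guessing probability p, residual probability split equally among the agents holding an object); names and data layout are again illustrative.

def guilt(i, S, R, p):
    # Pr{Gi | S} under the simplified model sketched earlier.
    prob_not_guilty = 1.0
    for t in S & R[i]:
        Vt = sum(1 for Rj in R.values() if t in Rj)   # number of holders of t
        prob_not_guilty *= 1.0 - (1.0 - p) / Vt
    return 1.0 - prob_not_guilty

def delta(i, j, R, p):
    # Difference of equation (6): Pr{Gi | Ri} - Pr{Gj | Ri}.
    S = R[i]                      # the leaked table equals agent i's data
    return guilt(i, S, R, p) - guilt(j, S, R, p)

R = {1: {"t1", "t2"}, 2: {"t1"}}
print(delta(1, 2, R, p=0.5), delta(2, 1, R, p=0.5))

For this allocation, delta(1, 2) is positive while delta(2, 1) is zero, matching the observation that Δ(i, j) is zero when Ri ⊆ Rj.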
Problem Definition. Let the distributor have data requests from n agents. The distributor wants to give tables R1, ..., Rn to agents U1, ..., Un, respectively, so that

• he satisfies agents' requests, and
• he maximizes the guilt probability differences Δ(i, j) for all i, j = 1, ..., n and i ≠ j.
Assuming that  the  Ri   sets  satisfy  the  agents’  requests,  we can  express   the  problem as  a  multicriterion  optimization problem:

maximize (over R1, ..., Rn)   ( ..., Δ(i, j), ... ),   i ≠ j.   (7)

If the optimization problem has an optimal solution, it means that there exists an allocation D* = {R1*, ..., Rn*} such that any other feasible allocation D = {R1, ..., Rn} yields Δ*(i, j) ≥ Δ(i, j) for all i, j. This means that allocation D* allows the distributor to discern any guilty agent with higher confidence than any other allocation, since it maximizes the probability Pr{Gi|Ri} with respect to any other probability Pr{Gi|Rj} with j ≠ i.
Even if there is no optimal allocation D*, a multicriterion problem has Pareto optimal allocations. If D^po = {R1^po, ..., Rn^po} is a Pareto optimal allocation, it means that there is no other allocation that yields Δ(i, j) ≥ Δ^po(i, j) for all i, j. In other words, if an allocation yields Δ(i, j) ≥ Δ^po(i, j) for some i, j, then there is some i', j' such that Δ(i', j') ≤ Δ^po(i', j'). The choice among all the Pareto optimal allocations implicitly selects the agent(s) we want to identify.

Objective Approximation
We can approximate the objective of (7) with (8), which does not depend on agents' guilt probabilities, and therefore, on p:

minimize (over R1, ..., Rn)   ( ..., |Ri ∩ Rj| / |Ri|, ... ),   i ≠ j.   (8)

This approximation is valid if minimizing the relative overlap |Ri ∩ Rj| / |Ri| maximizes Δ(i, j). The intuitive argument for this approximation is that the fewer leaked objects set Rj contains, the less guilty agent Uj will appear compared to Ui (since S = Ri). The example of Section 5.2 supports our approximation. In Fig. 1, we see that if S = R1, the difference Pr{G1|S} − Pr{G2|S} decreases as the relative overlap |R2 ∩ S| / |S| increases. Theorem 1 shows that a solution to (7) yields the solution to (8) if each T object is allocated to the same number of agents, regardless of who these agents are. The proof of the theorem is in [14].
Theorem 1. If a distribution D = {R1, ..., Rn} that satisfies agents' requests minimizes the relative overlaps |Ri ∩ Rj| / |Ri| and |Vt| = |Vt'| for all t, t' ∈ T, then D maximizes Δ(i, j).
The approximate optimization problem still has multiple criteria and it can yield either an optimal or multiple Pareto optimal solutions. Pareto optimal solutions let us detect a guilty agent Ui with high confidence, at the expense of an inability to detect some other guilty agent or agents. Since the distributor has no a priori information for the agents' intention to leak their data, he has no reason to bias the object allocation against a particular agent. Therefore, we can scalarize the problem objective by assigning the same weights to all vector objectives. We present two different scalar versions of our problem in (9a) and (9b). In the rest of the paper, we will refer to objective (9a) as the sum-objective and to objective (9b) as the max-objective:

minimize (over R1, ..., Rn)   Σ_{i=1}^{n} (1/|Ri|) Σ_{j=1, j≠i}^{n} |Ri ∩ Rj|,   (9a)

minimize (over R1, ..., Rn)   max_{i,j=1,...,n, j≠i} |Ri ∩ Rj| / |Ri|.   (9b)

Both scalar optimization problems yield the optimal solution of the problem of (8), if such a solution exists. If there is no global optimal solution, the sum-objective yields the Pareto optimal solution that allows the distributor to detect the guilty agent, on average (over all different agents), with higher confidence than any other distribution. The max-objective yields the solution that guarantees that the distributor will detect the guilty agent with a certain confidence in the worst case. Such a guarantee may adversely impact the average performance of the distribution.
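Because the approximate objectives depend only on pairwise overlaps, they are easy to evaluate for any candidate allocation. The sketch below computes the sum-objective (9a) and the max-objective (9b) for an allocation given as a dict from agent id to object set; the function names are illustrative.

def sum_objective(R):
    # Sum-objective (9a): sum over i of (1/|Ri|) * sum over j != i of |Ri intersect Rj|.
    total = 0.0
    for i, Ri in R.items():
        overlap = sum(len(Ri & Rj) for j, Rj in R.items() if j != i)
        total += overlap / len(Ri)
    return total

def max_objective(R):
    # Max-objective (9b): max over i != j of |Ri intersect Rj| / |Ri|.
    return max(len(Ri & Rj) / len(Ri)
               for i, Ri in R.items()
               for j, Rj in R.items() if j != i)

# Small example allocation: R1 = {t1, t2}, R2 = {t1}.
R = {1: {"t1", "t2"}, 2: {"t1"}}
print(sum_objective(R))   # 1.5
print(max_objective(R))   # 1.0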

ALLOCATION  STRATEGIES
In this section, we describe allocation strategies that solve exactly or approximately the scalar versions of (8) for the different instances presented in Fig. 2. We resort to approximate solutions in cases where it is inefficient to solve the optimization problem accurately.
In Section 7.1, we deal with problems with explicit data requests, and in Section 7.2, with problems with sample data requests.
The proofs of theorems that are stated in the following sections are available in [14].

Explicit Data  Requests
In problems of class EF̄, the distributor is not allowed to add fake objects to the distributed data. So, the data allocation is fully defined by the agents' data requests. Therefore, there is nothing to optimize.
In EF problems, objective values are initialized by agents' data requests. Say, for example, that T = {t1, t2} and there are two agents with explicit data requests such that R1 = {t1, t2} and R2 = {t1}. The value of the sum-objective is in this case

Σ_{i=1}^{2} (1/|Ri|) Σ_{j=1, j≠i}^{2} |Ri ∩ Rj| = 1/2 + 1/1 = 1.5.

The distributor cannot remove or alter the R1 or R2 data to decrease the overlap R1 ∩ R2. However, say that the distributor can create one fake object (B = 1) and both agents can receive one fake object (b1 = b2 = 1). In this case, the distributor can add one fake object to either R1 or R2 to increase the corresponding denominator of the summation term. Assume that the distributor creates a fake object f and gives it to agent U1. Agent U1 now has R1 = {t1, t2, f} and F1 = {f}, and the value of the sum-objective decreases to 1/3 + 1/1 = 1.33 < 1.5.
If the distributor is able to create more fake objects, he could further improve the objective. We present in Algorithms 1 and 2 a strategy for randomly allocating fake objects. Algorithm 1 is a general "driver" that will be used by other strategies, while Algorithm 2 actually performs the random selection. We denote the combination of Algorithm 1 with 2 as e-random. We use e-random as our baseline in our comparisons with other algorithms for explicit data requests.

Algorithm 1. Allocation for Explicit Data Requests (EF)
Input: R1, ..., Rn, cond1, ..., condn, b1, ..., bn, B
Output: R1, ..., Rn, F1, ..., Fn
1:  R ← ∅  ▷ Agents that can receive fake objects
2:  for i = 1, ..., n do
3:    if bi > 0 then
4:      R ← R ∪ {i}
5:      Fi ← ∅
6:  while B > 0 do
7:    i ← SELECTAGENT(R, R1, ..., Rn)
8:    f ← CREATEFAKEOBJECT(Ri, Fi, condi)
9:    Ri ← Ri ∪ {f}
10:   Fi ← Fi ∪ {f}
11:   bi ← bi − 1
12:   if bi = 0 then
13:     R ← R \ {Ri}
14:   B ← B − 1

Algorithm 2. Agent Selection for e-random
1:  function SELECTAGENT(R, R1, ..., Rn)
2:    i ← select at random an agent from R
3:    return i
In lines 1-5, Algorithm 1 finds the agents that are eligible to receive fake objects in O(n) time. Then, in the main loop in lines 6-14, the algorithm creates one fake object in every iteration and allocates it to a random agent. The main loop takes O(B) time. Hence, the running time of the algorithm is O(n + B).
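A compact Python rendering of the e-random strategy (Algorithms 1 and 2) is sketched below. The fake-object factory is passed in as a callable standing in for CREATEFAKEOBJECT(), the agent condition is omitted, and the signature is illustrative rather than a faithful reimplementation.

import random

def e_random(R, b, B, create_fake):
    # R: dict agent id -> set of real objects (modified in place).
    # b: dict agent id -> per-agent fake-object limit.
    # B: total fake-object budget.
    # create_fake: callable(Ri, Fi) -> new fake object (stands in for CREATEFAKEOBJECT).
    eligible = [i for i in R if b[i] > 0]          # lines 1-5 of Algorithm 1
    F = {i: set() for i in R}
    budget = dict(b)
    while B > 0 and eligible:                      # main loop, lines 6-14
        i = random.choice(eligible)                # Algorithm 2: random agent
        f = create_fake(R[i], F[i])
        R[i].add(f)
        F[i].add(f)
        budget[i] -= 1
        if budget[i] == 0:
            eligible.remove(i)
        B -= 1
    return F

# Example run with trivially named fake objects.
R = {1: {"t1", "t2"}, 2: {"t1"}}
F = e_random(R, b={1: 1, 2: 1}, B=1,
             create_fake=lambda Ri, Fi: "f%d" % (len(Fi) + 1))
print(R, F)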
If B ≥ Σ_{i=1}^{n} bi, the algorithm minimizes every term of the objective summation by adding the maximum number bi of fake objects to every set Ri, yielding the optimal solution. Otherwise, if B < Σ_{i=1}^{n} bi (as in our example, where B = 1 < b1 + b2 = 2), the algorithm just selects at random the agents that are provided with fake objects. We return to our example and see how the objective would change if the distributor adds fake object f to R2 instead of R1. In this case, the sum-objective would be 1/2 + 1/2 = 1 < 1.33.
The reason why we got a greater improvement is that the addition of a fake object to R2 has greater impact on the corresponding summation terms, since

1/|R1| − 1/(|R1| + 1) = 1/6 < 1/|R2| − 1/(|R2| + 1) = 1/2.
The left-hand side of the inequality corresponds to the objective improvement after the addition of a fake object to R1 and the right-hand side to R2. We can use this observation to improve Algorithm 1 with the use of procedure SELECTAGENT() of Algorithm 3. We denote the combination of Algorithms 1 and 3 by e-optimal.

Algorithm 3. Agent Selection for e-optimal
1:  function SELECTAGENT(R, R1, ..., Rn)
2:    i ← argmax_{i': Ri' ∈ R} (1/|Ri'| − 1/(|Ri'| + 1)) Σ_j |Ri' ∩ Rj|
3:    return i

Algorithm 3 makes a greedy choice by selecting the agent that will yield the greatest improvement in the sum-objective. The cost of this greedy choice is O(n^2) in every iteration. The overall running time of e-optimal is O(n + n^2 B) = O(n^2 B). Theorem 2 shows that this greedy approach finds an optimal distribution with respect to both optimization objectives defined in (9).
Theorem 2. Algorithm e-optimal yields an object allocation that minimizes both sum- and max-objective in problem instances of class EF.
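The greedy choice of Algorithm 3 can be sketched as follows: each eligible agent is scored by the decrease in the sum-objective that one additional (fake) object in Ri would produce, mirroring the argmax above. The code is illustrative and assumes the same dict-of-sets representation as the earlier sketches.

def select_agent_optimal(eligible, R):
    # Pick the agent whose set, when grown by one fake object, reduces the
    # sum-objective the most (the argmax of Algorithm 3).
    def improvement(i):
        overlap = sum(len(R[i] & Rj) for j, Rj in R.items() if j != i)
        return (1.0 / len(R[i]) - 1.0 / (len(R[i]) + 1)) * overlap
    return max(eligible, key=improvement)

# In the running example, agent 2 (with R2 = {t1}) is the better choice.
R = {1: {"t1", "t2"}, 2: {"t1"}}
print(select_agent_optimal([1, 2], R))   # 2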
Sample Data Requests
With sample data requests, each agent Ui may receive any subset of T of size mi, out of (|T| choose mi) different ones. Hence, there are ∏_{i=1}^{n} (|T| choose mi) different object allocations. In every allocation, the distributor can permute T objects and keep the same chances of guilty agent detection. The reason is that the guilt probability depends only on which agents have received the leaked objects and not on the identity of the leaked objects. Therefore, from the distributor's perspective, there are

∏_{i=1}^{n} (|T| choose mi) / |T|!

different allocations. The distributor's problem is to pick one out so that he optimizes his objective. This problem can be formulated as a nonconvex QIP that is NP-hard to solve.
Note that the distributor can increase the number of possible allocations by adding fake objects (and increasing |T|), but the problem is essentially the same. So, in the rest of this section, we will only deal with problems of class SF̄, but our algorithms are applicable to SF problems as well.

Random
An object allocation that satisfies requests and ignores the distributor's objective is to give each agent Ui a randomly selected subset of T of size mi. We denote this algorithm by s-random and we use it as our baseline. We present s-random in two parts: Algorithm 4 is a general allocation algorithm that is used by other algorithms in this section. In line 6 of Algorithm 4, there is a call to function SELECTOBJECT() whose implementation differentiates algorithms that rely on Algorithm 4. Algorithm 5 shows function SELECTOBJECT() for s-random.

Algorithm 4. Allocation for Sample Data Requests (SF̄)
Input: m1, ..., mn, |T|  ▷ Assuming mi ≤ |T|
Output: R1, ..., Rn
1:  a ← 0^|T|  ▷ a[k]: number of agents who have received object tk
2:  R1 ← ∅, ..., Rn ← ∅
3:  remaining ← Σ_{i=1}^{n} mi
4:  while remaining > 0 do
5:    for all i = 1, ..., n : |Ri| < mi do
6:      k ← SELECTOBJECT(i, Ri)  ▷ May also use additional parameters
7:      Ri ← Ri ∪ {tk}
8:      a[k] ← a[k] + 1
9:      remaining ← remaining − 1

Algorithm 5. Object Selection for s-random
1:  function SELECTOBJECT(i, Ri)
2:    k ← select at random an element from set {k' | tk' ∉ Ri}
3:    return k

In s-random, we introduce vector a ∈ N^|T| that shows the object sharing distribution. In particular, element a[k] shows the number of agents who receive object tk.
Algorithm s-random allocates objects to agents in a round-robin fashion. After the initialization of vector a and of the sets R1, ..., Rn in lines 1 and 2 of Algorithm 4, the main loop in lines 4-9 is executed while there are still data objects (remaining > 0) to be allocated to agents. In each iteration of this loop (lines 5-9), the algorithm uses function SELECTOBJECT() to find a random object to allocate to agent Ui. This loop iterates over all agents who have not received the number of data objects they have requested.
The running time of the algorithm is O(Σ_{i=1}^{n} mi) times the running time of the object selection function SELECTOBJECT(). In the case of random selection, object selection takes O(1) time if we keep in memory a set {k' | tk' ∉ Ri} for each agent Ui.
Algorithm s-random may yield a poor data allocation. Say, for example, that the distributor set T has three objects and there are three agents who request one object each. It is possible that s-random provides all three agents with the same object. Such an allocation maximizes both objectives (9a) and (9b) instead of minimizing them.
A second strategy, s-overlap, instead aims to minimize the number of agents that share each object (using the counters a[k]). Algorithm s-overlap does not in general minimize the sum-objective; however, it does minimize the sum of overlaps, i.e., Σ_{i,j: j≠i} |Ri ∩ Rj|. If requests are all of the same size (m1 = ⋯ = mn), then s-overlap minimizes the sum-objective.
To illustrate that s-overlap does not minimize the sum-objective, assume that set T has four objects and there are four agents requesting samples with sizes m1 = m2 = 2 and m3 = m4 = 1. A possible data allocation from s-overlap is

R1 = {t1, t2}, R2 = {t3, t4}, R3 = {t1}, R4 = {t3}.   (10)

Allocation (10) yields:

Σ_{i=1}^{4} (1/|Ri|) Σ_{j=1, j≠i}^{4} |Ri ∩ Rj| = 1/2 + 1/2 + 1/1 + 1/1 = 3.
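The general sample-request allocator of Algorithm 4 and two object-selection rules can be sketched as follows: select_random corresponds to Algorithm 5 (s-random), while select_least_shared is one plausible reading of the s-overlap rule (prefer objects currently held by the fewest agents, using the counters a[k]); treat the latter, and the function names, as assumptions for illustration.

import random

def allocate_samples(m, T, select):
    # m: dict agent id -> requested sample size (m[i] <= len(T)).
    # T: list of all objects.
    # select: callable(i, Ri, a, T) -> index k of the object to give next.
    a = [0] * len(T)                       # a[k]: number of agents holding T[k]
    R = {i: set() for i in m}
    remaining = sum(m.values())
    while remaining > 0:                   # main loop of Algorithm 4
        for i in list(m):
            if len(R[i]) < m[i]:
                k = select(i, R[i], a, T)
                R[i].add(T[k])
                a[k] += 1
                remaining -= 1
    return R, a

def select_random(i, Ri, a, T):
    # Algorithm 5 (s-random): any object the agent does not yet have.
    return random.choice([k for k, t in enumerate(T) if t not in Ri])

def select_least_shared(i, Ri, a, T):
    # Assumed s-overlap-style rule: an object allocated to the fewest agents so far.
    candidates = [k for k, t in enumerate(T) if t not in Ri]
    return min(candidates, key=lambda k: a[k])

T = ["t1", "t2", "t3", "t4"]
print(allocate_samples({1: 2, 2: 2, 3: 1, 4: 1}, T, select_least_shared))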


CONCLUSIONS
In a perfect world, there would be no need to hand over sensitive data to agents that may unknowingly or maliciously leak it. And even if we had to hand over sensitive data, in a perfect world, we could watermark each object so that we could trace its origins with absolute certainty. However, in many cases, we must indeed work with agents that may not be 100 percent trusted, and we may not be certain if a leaked object came from an agent or from some other source, since certain data cannot admit watermarks.
In spite of these difficulties, we have shown that it is possible to assess the likelihood that an agent is responsible for a leak, based on the overlap of his data with the leaked data and the data of other agents, and based on the probability that objects can be "guessed" by other means. Our model is relatively simple, but we believe that it captures the essential trade-offs. The algorithms we have presented implement a variety of data distribution strategies that can improve the distributor's chances of identifying a leaker. We have shown that distributing objects judiciously can make a significant difference in identifying guilty agents, especially in cases where there is large overlap in the data that agents must receive.
Our   future  work   includes  the   investigation  of  agent  guilt  models that  capture leakage   scenarios that  are  not studied in this paper. For example, what  is the appropriate model  for cases where agents  can collude and  identify fake tuples?  A   preliminary  discussion   of  such   a   model    is available in [14]. Another open problem is the extension of our   allocation  strategies  so  that   they   can  handle  agent  requests  in  an   online   fashion   (the   presented  strategies assume that  there  is  a  fixed  set  of  agents   with  requests known in  advance).
