Data Leakage Detection - Engineering Seminar

Abstract
We study  the following problem:  A data  distributor has  given sensitive data  to a set  of supposedly trusted  agents (third parties). Some  of the data  are leaked  and found in an unauthorized place (e.g., on the web or somebody’s laptop). The distributor must assess the likelihood that the leaked  data came from one or more agents, as opposed to having been independently gathered by other means. We propose data  allocation strategies (across the agents) that improve the probability of identifying leakages. These methods do not rely on alterations of the released data  (e.g.,  watermarks). In some cases, we can also inject “realistic but fake” data  records to further improve our chances of detecting leakage and identifying the guilty party.

IN the course of doing business, sometimes sensitive data must be handed over to supposedly trusted third parties. For example, a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data. Another enterprise may outsource its data processing, so data must be given to various other companies. We call the owner of the data the distributor and the supposedly trusted third parties the agents. Our goal is to detect when the distributor's sensitive data have been leaked by agents, and if possible to identify the agent that leaked the data.
We consider applications where the original sensitive data cannot be perturbed. Perturbation is a very useful technique where the data are modified and made "less sensitive" before being handed to agents. For example, one can add random noise to certain attributes, or one can replace exact values by ranges. However, in some cases, it is important not to alter the original distributor's data. For example, if an outsourcer is doing our payroll, he must have the exact salary and customer bank account numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data for the patients.
Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases, but again, involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.
In   this   paper,   we   study  unobtrusive  techniques  for
detecting leakage  of a set of objects or records. Specifically,
we study the following scenario:  After giving a set of objects to  agents,   the  distributor  discovers some   of  those   same objects  in  an  unauthorized place.  (For  example, the  data may  be found on a website,  or may be obtained through a legal  discovery process.)  At this  point,  the  distributor can assess the likelihood that  the leaked  data  came from one or more  agents,   as  opposed  to  having been  independently gathered by  other  means. Using  an  analogy with  cookies stolen  from  a  cookie  jar, if we  catch  Freddie with  a single cookie, he can argue  that a friend  gave him the cookie. But if we catch  Freddie with  five cookies,  it will be much  harder for him to argue  that his hands were  not in the cookie jar. If the distributor sees “enough evidence” that an agent leaked  data,  he may  stop  doing  business with him, or may  initiate  legal proceedings.
In this paper, we develop a model for assessing the "guilt" of agents. We also present algorithms for distributing objects to agents, in a way that improves our chances of identifying a leaker. Finally, we also consider the option of adding "fake" objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out that an agent was given one or more fake objects that were leaked, then the distributor can be more confident that agent was guilty.

We start  in Section 2 by introducing our  problem setup
and  the notation we use. In Sections  4 and  5, we present a model for calculating “guilt” probabilities in  cases  of data leakage.  Then, in Sections 6 and  7, we present strategies for data  allocation to agents.  Finally, in Section 8, we evaluate the strategies in different data leakage  scenarios, and  check whether they  indeed help us to identify a leaker.

Entities and Agents
A distributor owns a set T = {t1, ..., tm} of valuable data objects. The distributor wants to share some of the objects with a set of agents U1, U2, ..., Un, but does not wish the objects to be leaked to other third parties. The objects in T could be of any type and size, e.g., they could be tuples in a relation, or relations in a database.

Although we  do  not  discuss it here,  our  model  can  be easily  extended to  requests for  a  sample of  objects  that satisfy  a condition (e.g., an agent  wants any  100 California customer   records).   Also   note   that   we   do   not   concern  ourselves with  the  randomness of a  sample. (We  assume that  if a  random sample is  required, there  are  enough T records so that the to-be-presented object selection  schemes can pick random records from  T .)
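The two types of requests discussed here, sample and explicit, can be sketched in code; the function names mirror the paper's EXPLICIT and SAMPLE operators, while the predicate-based condition and the use of uniform sampling are our simplifications:

```python
import random

def explicit(T, cond):
    # EXPLICIT(T, cond): the agent receives every object satisfying cond.
    return {t for t in T if cond(t)}

def sample(T, m):
    # SAMPLE(T, m): the agent receives m objects from T. As noted above,
    # the selection need not be uniformly random; this sketch uses it
    # only for concreteness.
    return set(random.sample(sorted(T), m))
```

A conditional sample such as "any 100 California customer records" would then compose the two, sampling from the subset that `explicit` returns.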

Guilty  Agents
Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has leaked. This means that some third party, called the target, has been caught in possession of S. For example, this target may be displaying S on its website, or perhaps as part of a legal discovery process, the target turned over S to the distributor.
Since the agents U1, ..., Un have some of the data, it is reasonable to suspect them of leaking the data. However, the agents can argue that they are innocent, and that the S data were obtained by the target through other means. For example, say that one of the objects in S represents a customer X. Perhaps X is also a customer of some other company, and that company provided the data to the target. Or perhaps X can be reconstructed from various publicly available sources on the web.
Our goal is to estimate the likelihood that the leaked data
came   from   the   agents   as   opposed    to  other   sources. Intuitively, the  more   data   in  S,  the  harder  it  is  for  the agents  to argue  they  did  not  leak  anything. Similarly,  the “rarer” the  objects,  the  harder it is to argue  that  the  target obtained them  through other means. Not  only  do we want  to  estimate the  likelihood  the  agents  leaked  data,  but  we would also like to find out if one of them,  in particular, was more  likely  to  be  the  leaker.   For  instance, if  one  of  the S objects was only given to agent  U1 , while  the other  objects were  given  to  all  agents,   we  may  suspect U1    more.   The model  we present next captures this  intuition.
We say an agent Ui is guilty if it contributes one or more objects to the target. We denote the event that agent Ui is guilty by Gi, and the event that agent Ui is guilty for a given leaked set S by Gi|S. Our next step is to estimate Pr{Gi|S}, i.e., the probability that agent Ui is guilty given evidence S.

The guilt detection approach we present is related to the data provenance problem: tracing the lineage of the S objects essentially implies the detection of the guilty agents. A tutorial provides a good overview of the research conducted in this field. Suggested solutions are domain specific, such as lineage tracing for data warehouses, and assume some prior knowledge on the way a data view is created out of data sources. Our problem formulation with objects and sets is more general and simplifies lineage tracing, since we do not consider any data transformation from the Ri sets to S.
As far as the data allocation strategies are concerned, our work is mostly relevant to watermarking that is used as a means of establishing original ownership of distributed objects. Watermarks were initially used in images, video, and audio data, whose digital representation includes considerable redundancy. More recently, other works have also studied mark insertion into relational data. Our approach and watermarking are similar in the sense of providing agents with some kind of receiver identifying information. However, by its very nature, a watermark modifies the item being watermarked. If the object to be watermarked cannot be modified, then a watermark cannot be inserted. In such cases, methods that attach watermarks to the distributed data are not applicable.
Finally, there are also many other works on mechanisms that allow only authorized users to access sensitive data through access control policies. Such approaches prevent, in some sense, data leakage by sharing information only with trusted parties. However, these policies are restrictive and may make it impossible to satisfy agents' requests.

To compute this Pr{Gi|S}, we need an estimate of the probability that values in S can be "guessed" by the target. For instance, say that some of the objects in S are e-mails of individuals. We can conduct an experiment and ask a person with approximately the expertise and resources of the target to find the e-mails of, say, 100 individuals. If this person can find, say, 90 e-mails, then we can reasonably guess that the probability of finding one e-mail is 0.9. On the other hand, if the objects in question are bank account numbers, the person may only discover, say, 20, leading to an estimate of 0.2. We call this estimate pt, the probability that object t can be guessed by the target.
Probability pt is analogous to the probabilities used in designing fault-tolerant systems. That is, to estimate how likely it is that a system will be operational throughout a given period, we need the probabilities that individual components will or will not fail. A component failure in our case is the event that the target guesses an object of S. The component failure probabilities are used to compute the overall system reliability, while we use the probability of guessing to identify agents that have leaked information. The component failure probabilities are estimated based on experiments, just as we propose to estimate the pt's. Similarly, the component probabilities are usually conservative estimates, rather than exact numbers. For example, say that we use a component failure probability that is higher than the actual probability, and we design our system to provide a desired high level of reliability. Then we will know that the actual system will have at least that level of reliability, but possibly higher. In the same way, if we use pt's that are higher than the true values, we will know that the agents will be guilty with at least the computed probabilities.
To simplify the formulas that we present in the rest of the paper, we assume that all T objects have the same pt, which we call p. Our equations can be easily generalized to diverse pt's, though they become cumbersome to display.
Next, we make two assumptions regarding the relationship among the various leakage events. The first assumption simply states that an agent's decision to leak an object is not related to other objects. Elsewhere, we study a scenario where the actions for different objects are related, and we study how our results are impacted by the different independence assumptions.

Assumption 1. For all t, t′ ∈ S such that t ≠ t′, the provenance of t is independent of the provenance of t′.

The term "provenance" in this assumption statement refers to the source of a value t that appears in the leaked set. The source can be any of the agents who have t in their sets, or the target itself (guessing).

To simplify our formulas, the following assumption states that joint events have a negligible probability. As we argue in the example below, this assumption gives us more conservative estimates for the guilt of agents, which is consistent with our goals.

Assumption 2. An object t ∈ S can only be obtained by the target in one of two ways: a single agent Ui leaked t from its own Ri set, or the target guessed (or obtained through other means) t without the help of any of the n agents.

Note that if Assumption 2 did not hold, our analysis would be more complex because we would need to consider joint events, e.g., the target guesses t1, and at the same time, one or two agents leak the value. In our simplified analysis, we say that an agent is not guilty when the object can be guessed, regardless of whether the agent leaked the value. Since we are "not counting" instances when an agent leaks information, the simplified analysis yields conservative values (smaller probabilities).

In the general case (with our assumptions), to find the probability that an agent Ui is guilty given a set S, we first compute the probability that he leaks a single object t to S. To compute this, we define the set of agents Vt = {Ui | t ∈ Ri} that have t in their data sets. Then, using Assumption 2, the probability that agent Ui leaked object t to S is (1 − p)/|Vt| if Ui ∈ Vt, and 0 otherwise. Combining the independent per-object events of Assumption 1, we obtain

  Pr{Gi|S} = 1 − ∏ over t ∈ S ∩ Ri of (1 − (1 − p)/|Vt|). (5)
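Under these assumptions, the guilt computation is straightforward to sketch in code; the function name and data layout here are ours, not the paper's (objects are hashable IDs, and V maps each object to the set of agents that received it):

```python
from math import prod

def guilt_probability(Ri, S, V, p):
    # Pr{Gi | S}: agent i escapes guilt only if, for every leaked object
    # it holds, either the target guessed it (probability p) or one of
    # the |V[t]| holders other than i leaked it. Each object contributes
    # an independent factor (Assumption 1).
    return 1 - prod(1 - (1 - p) / len(V[t]) for t in S & Ri)
```

For example, with S = {t1, t2}, t1 given to two agents, t2 only to U1, and p = 0.5, this gives Pr{G1|S} = 1 − (1 − 0.25)(1 − 0.5) = 0.625.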

The main  focus of this paper is the data  allocation problem: how can the distributor “intelligently” give data to agents  in order  to improve the chances  of detecting a guilty  agent?  As illustrated in Fig. 2, there  are four instances of this problem we address, depending on the type  of data  requests made by agents  and  whether “fake objects”  are allowed.
The  two  types  of requests we  handle were  defined  in
Section   2:  sample  and   explicit.   Fake  objects  are  objects generated by the distributor that are not in set T . The objects are designed to look like real objects, and  are distributed to agents   together with   T  objects,  in  order  to  increase   the chances  of detecting agents  that  leak data.  We discuss fake objects in more  detail  in Section 6.1.
As shown in Fig. 2, we represent our four problem instances with the names EF̄, EF, SF̄, and SF, where E stands for explicit requests, S for sample requests, F for the use of fake objects, and F̄ for the case where fake objects are not allowed.
Note that, for simplicity, we are assuming that in the E problem instances, all agents make explicit requests, while in the S instances, all agents make sample requests. Our results can be extended to handle mixed cases, with some explicit and some sample requests. We provide here a small example to illustrate how mixed requests can be handled, but then do not elaborate further. Assume that we have two agents with requests R1 = EXPLICIT(T, cond1) and R2 = SAMPLE(T′, 1), where T′ = EXPLICIT(T, cond2). Further, say that cond1 is "state = CA" (objects have a state field). If agent U2 has the same condition, cond2 = cond1, we can create an equivalent problem with sample data requests on set T′. That is, our problem will be how to distribute the CA objects to two agents, with R1 = SAMPLE(T′, |T′|) and R2 = SAMPLE(T′, 1). If instead U2 uses condition "state = NY," we can solve two different problems, for sets T′ and T − T′. In each problem, we will have only one agent. Finally, if the conditions partially overlap, i.e., R1 ∩ T′ ≠ ∅ but R1 ≠ T′, we can solve three different problems, for sets R1 − T′, R1 ∩ T′, and T′ − R1.
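The partially overlapping case reduces to independent subproblems by partitioning the objects into three disjoint regions; a minimal sketch (the helper name is ours):

```python
def partition_overlapping(R1, T_prime):
    # Split into the three disjoint regions discussed above: objects only
    # in R1, objects requested by both agents, and objects only in T'.
    return R1 - T_prime, R1 & T_prime, T_prime - R1
```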

Fake Objects
The  distributor may   be  able  to  add   fake  objects  to  the distributed data  in  order   to  improve his  effectiveness in detecting guilty  agents.  However, fake objects may  impact  the correctness of what  agents  do, so they  may  not always be allowable.
The idea of perturbing data to detect leakage is not new. However, in most cases, individual objects are perturbed, e.g., by adding random noise to sensitive salaries, or adding a watermark to an image. In our case, we are perturbing the set of distributor objects by adding fake elements. In some applications, fake objects may cause fewer problems than perturbing real objects. For example, say that the distributed data objects are medical records and the agents are hospitals. In this case, even small modifications to the records of actual patients may be undesirable. However, the addition of some fake medical records may be acceptable, since no patient matches these records, and hence, no one will ever be treated based on fake records.
Our  use of fake objects is inspired by the use of “trace”
records in  mailing lists.  In  this  case,  company A  sells  to company B a  mailing list  to  be  used   once  (e.g.,  to  send  advertisements).  Company  A  adds   trace   records  that contain  addresses owned by company A.  Thus,  each  time company  B  uses   the  purchased  mailing  list,  A  receives copies  of  the  mailing.   These  records are  a  type   of  fake objects that  help  identify improper use of data.
The distributor creates and adds fake objects to the data that he distributes to agents. We let Fi ⊆ Ri be the subset of fake objects that agent Ui receives. As discussed below, fake objects must be created carefully so that agents cannot distinguish them from real objects.
In many cases, the distributor may be limited in how many fake objects he can create. For example, objects may contain e-mail addresses, and each fake e-mail address may require the creation of an actual inbox (otherwise, the agent may discover that the object is fake). The inboxes can actually be monitored by the distributor: if e-mail is received from someone other than the agent who was given the address, it is evident that the address was leaked. Since creating and monitoring e-mail accounts consumes resources, the distributor may have a limit on the number of fake objects. If there is a limit, we denote it by B.
Similarly, the distributor may want to limit the number of fake objects received by each agent, so as not to arouse suspicions and not to adversely impact the agents' activities. Thus, we say that the distributor can send up to bi fake objects to agent Ui.
Creation. The creation of fake but real-looking objects is a
nontrivial problem whose  thorough investigation is beyond the scope of this paper. Here, we model  the creation of a fake object for agent  Ui  as a black box function CREATEFAKEOB- JECT ðRi ; Fi ; condi Þ that takes as input the set of all objects Ri , the subset  of fake objects Fi  that  Ui  has received so far, and condi , and  returns a  new  fake  object.  This  function needs  condi   to produce a valid  object that  satisfies  Ui ’s condition. Set Ri  is needed as input so that the created fake object is not only valid but also indistinguishable from other real objects. For example, the creation function of a fake payroll record that  includes an employee rank  and  a salary attribute may take  into  account the  distribution  of  employee ranks,  the distribution of salaries, as well as the correlation between the two attributes. Ensuring that key statistics do not change  by the introduction of fake objects is important if the agents will be  using   such   statistics  in  their   work.   Finally,   function
the Ri  and  Fi  tables, respectively, and  the intersection of the conditions  condi s.
Although we  do  not  deal  with  the  implementation  of CREATEFAKEOBJECT(),  we  note  that  there   are  two  main design  options. The  function  can  either   produce  a  fake object on demand every  time  it is called  or it can return an appropriate object from a pool of objects created in advance.

Optimization Problem
The distributor’s data allocation to agents  has one constraint and  one  objective.  The  distributor’s  constraint  is to  satisfy agents’ requests, by  providing  them  with  the  number of objects they  request or with all available objects that  satisfy their conditions. His objective is to be able to detect  an agent who  leaks  any portion of his data.
We consider the constraint as strict. The distributor may
not  deny  serving an  agent  request as in [13] and  may  not provide agents   with   different perturbed  versions  of  the same  objects as in [1]. We consider fake object distribution as the only  possible  constraint relaxation.
Our detection objective is ideal and intractable. Detection
would be assured only if the distributor gave no data object to any  agent  (Mungamuru and  Garcia-Molina discuss that  to  attain   “perfect” privacy and  security, we  have  to sacrifice  utility).   We  use  instead  the  following objective: maximize the chances  of detecting a guilty  agent  that  leaks all his data  objects.
We  now  introduce some  notation to  state  formally  the
distributor’s objective.  Recall that  P rfGj jS ¼ Ri g or simply  P rfGj jRi g is the  probability that  agent  Uj   is  guilty  if the distributor  discovers  a  leaked   table   S   that  contains  all Ri  objects.  We define  the difference functions    ði; jÞ as

 ði; jÞ ¼ P rfGi jRi g P rfGj jRi g    i; j ¼ 1; ... ; n: ð6Þ

Note that differences    have nonnegative values:  given that set Ri   contains all the leaked  objects, agent  Ui  is at least as likely  to be guilty  as any  other  agent.  Difference   ði; jÞ  is positive for any agent  Uj , whose  set Rj  does  not contain  all data  of S. It is zero  if Ri    Rj . In this  case, the  distributor will consider both agents  Ui  and  Uj  equally guilty  since they have both received all the leaked  objects. The larger  a    ði; jÞ value  is, the  easier  it is to identify Ui   as the  leaking  agent. Thus, we want  to distribute data  so that      values  are large.
Problem Definition. Let the distributor have data requests from n agents. The distributor  wants to give tables R1 ; ... ; Rn     to agents U1 ; ... ; Un , respectively, so that

. he satisfies agents’ requests, and
. he maximizes the guilt probability differences   ði; jÞ
for all i; j ¼ 1; ... ;n and i ¼ j.
Assuming that  the  Ri   sets  satisfy  the  agents’  requests,  we can  express   the  problem as  a  multicriterion  optimization problem:

ð ð Þ Þ ¼ ð  Þ
maximize   ... ;     i; j ; .. . i j: 7
over  R1  ;...;Rn
CREATEFAKEOBJECT() has to be aware of the fake objects Fi ð Þ
added so far, again  to ensure proper statistics.
The  distributor can  also  use  function  CREATEFAKEOB-
If the  optimization problem has  an  optimal solution,  it
means that there exists an allocation D     ¼ fR  ; ... ; R   g such
1 n
JECT() when it wants to send  the same fake object to a set of
agents.  In this case, the function arguments are the union of
that  any  other  feasible  allocation D ¼ fR1 ; ... ; Rn g  yields
 ði; jÞ     ði; jÞ  for  all  i; j.  This  means that  allocation  D 
allows   the  distributor  to  discern  any  guilty   agent  with higher   confidence  than   any   other   allocation,   since   it maximizes the  probability P rfGi jRi g with  respect to  any other probability P rfGi jRj g with  j ¼ i.
Even if there  is no optimal allocation D   , a multicriterion

problem    h as   Paret o   optimal  a llocations.  If po  ¼ D
The  max-objective yields  the  solution that  guarantees  that the  distributor will  detect  the  guilty  agent  with  a  certain  confidence in the worst  case. Such guarantee may adversely impact the average performance of the distribution.

1   ; ... ; Rpo g is a Pareto  optimal allocation, it means that
there  is no other  allocation that yields
 ði; jÞ    po ði; jÞ for
In this  section,  we  describe allocation strategies that  solve
i; j.  In  other   words,  if  an  allocation  yields      ði; jÞ   
exactly  or  approximately the  scalar  versions of (8) for  the
 po ði; jÞ  for  some  i; j,  then  there   is  some  i0 ; j0   such  that
 ði0 ; j0 Þ    po ði0 ; j0 Þ. The choice among all the Pareto optimal allocations implicitly selects the agent(s) we want to identify.

Objective Approximation
We can approximate the objective of (7) with (8) that does not depend on agents’  guilt  probabilities, and  therefore, on p:

different   instances  presented  in   Fig.  2.  We   resort   to
approximate  solutions in  cases  where it  is  inefficient  to solve accurately the optimization problem.
In Section 7.1, we deal  with  problems with  explicit data requests, and  in  Section  7.2, with  problems with  sample data  requests.
The  proofs  of theorems that  are  stated in the  following sections  are available in [14].
ðover R1 ;...;Rn  Þ
... ; jRi  \ Rj j ; ... 
jRi j
i ¼ j: ð8Þ

Explicit Data  Requests
In problems of class  EF , the  distributor is not  allowed  to
jRi j
This  approximation is  valid  if minimizing the  relative  overlap jRi \Rj j   maximizes    ði; jÞ. The intuitive argument for this  approximation is that  the  fewer  leaked objects  set  Rj contains, the less guilty agent Uj  will appear compared to Ui (since  S ¼ Ri ).  The  example  of  Section  5.2  supports our
approximation.  In  Fig.  1,  we   see   that   if  S ¼ R1 ,   the difference P rfG1 jSg  P rfG2 jSg  decreases as  the  relative
overlap jR2  \Sj   increases. Theorem 1 shows  that a solution to
add   fake   objects   to  the   distributed  data.   So,  the   data allocation  is  fully  defined by  the  agents’   data  requests. Therefore, there  is nothing to optimize.
In   EF   problems,  objective   values   are   initialized   by agents’   data   requests. Say,  for  example, that  T ¼ ft1 ; t2 g and  there  are  two  agents  with  explicit  data  requests such that   R1   ¼ ft1 ; t2 g  and   R2   ¼ ft1 g.  The  value   of  the  sum- objective  is in this  case
(7) yields  the  solution to (8) if each  T object is allocated to
2 2
X X jRi  \ Rj j ¼    þ

¼ 1:5:
the same  number of agents,  regardless of who  these agents  are. The proof  of the theorem is in [14].
Theorem 1. If a distribution  D ¼ fR1 ; ... ; Rn g  that  satisfies agents’ requests minimizes jRi \Rj j    and jVt j ¼ jVt j for all t; t0  2
i¼1 jRi j j¼1 2 1

The distributor cannot  remove or alter  the R1   or R2   data to decrease   the   overlap   R1   \ R2 .  However,   say   that   the
jRi j 0
T , then D maximizes    ði; jÞ.
distributor  can  create   one  fake  object  (B ¼
1)  and   both
The approximate optimization problem has still multiple criteria  and  it can yield either  an optimal or multiple Pareto  optimal solutions. Pareto  optimal  solutions let us  detect  a guilty  agent  Ui   with  high confidence, at the  expense of an inability  to detect  some other  guilty  agent  or agents.  Since the  distributor has  no  a priori  information for  the  agents’ intention to  leak their  data,  he  has  no  reason  to  bias  the object  allocation against a particular agent.  Therefore, we can  scalarize the  problem objective  by assigning the  same weights to  all  vector  objectives.  We  present two  different scalar versions of our problem in (9a) and  (9b). In the rest of the paper, we will refer to objective  (9a) as the sum-objective and  to objective  (9b) as the max-objective:
agents  can receive  one fake object (b1  ¼ b2  ¼ 1). In this case,
the distributor can add  one fake object to either  R1   or R2  to increase  the corresponding denominator of the summation term. Assume that the distributor creates a fake object f and he  gives  it to agent  R1 . Agent  U1    has  now  R1   ¼ ft1 ; t2 ;f g and  F1   ¼ ff g and  the value  of the sum-objective decreases
to 1 1
3 þ 1  ¼ 1:33 < 1:5.
If the  distributor is able  to create  more  fake  objects,  he
could   further    improve  the   objective.   We   present   in Algorithms 1 and  2 a strategy for randomly allocating fake objects. Algorithm 1 is a general “driver” that  will be used  by  other  strategies, while  Algorithm 2  actually  performs the   random  selection.   We   denote   the   combination  of Algorithm 1 with  2 as e-random. We use  e-random as our

ðover R1 ;...;Rn  Þ

n n
X jR  \ R j; ð9aÞ
i   1 jRi j j¼1
max jRi  \ Rj j : ð9bÞ
baseline   in  our   comparisons  with   other   algorithms   for explicit  data  requests.

Algorithm 1. Allocation for Explicit  Data  Requests (EF )
Input: R1 ; ... ; Rn , cond1 ; ... ; condn , b1 ; ... ; bn , B
ðover R1 ;...;Rn  Þ
i;j¼1;...;n j¼i
jRi j
Output: R1 ; ... ; Rn , F1 ; ... ; Fn
Both   scalar   optimization  problems  yield   the  optimal solution of  the  problem of  (8), if  such  solution  exists.  If there  is no global optimal solution, the sum-objective yields the  Pareto  optimal solution that  allows  the  distributor to detect   the   guilty   agent,   on   average  (over   all   different agents),  with  higher confidence than  any other  distribution.
1:  R ; . Agents  that  can receive  fake objects
2:  for i ¼ 1; ... ;n do
3: if bi  > 0 then
4: R R [ fig
5: Fi ;
6:  while B > 0 do
7: i SELECTAGENTðR; R1 ; ... ; Rn Þ
8: f CREATEFAKEOBJECTðRi ; Fi ; condi Þ
7.2 Sample Data  Requests
With sample data requests, each agent Ui  may receive any T
9: Ri Ri  [ ff g
subset  out  of ðjT jÞ different ones.  Hence,  there  are Qn
ðjT jÞ
mi i¼1  mi
10: Fi Fi  [ ff g
11: bi bi    1
12: if bi  ¼ 0then
13: R RnfRi g
14: B B     1

Algorithm 2. Agent  Selection  for e-random
1:  function SELECTAGENT  (R; R1 ; ... ; Rn )
2: i select at random an agent  from  R
3:  return i
In lines 1-5, Algorithm 1 finds  agents  that  are eligible to receiving fake objects in OðnÞ time.  Then,  in the main loop in lines 6-14, the algorithm creates  one fake object in every iteration and  allocates  it to random agent.  The  main  loop takes OðBÞ time.  Hence,  the running time  of the algorithm
is Oðn þ BÞ.
If B ≥ Σ_{i=1}^{n} b_i, the algorithm minimizes every term of the objective summation by adding the maximum number b_i of fake objects to every set R_i, yielding the optimal solution. Otherwise, if B < Σ_{i=1}^{n} b_i (as in our example, where B = 1 < b_1 + b_2 = 2), the algorithm simply selects at random the agents that are provided with fake objects. We return to our example to see how the objective would change if the distributor adds fake object f to R_2 instead of R_1. In this case, the sum-objective would be 1/2 + 1/2 = 1 < 1.33. The reason we get a greater improvement is that the addition of a fake object to R_2 has a greater impact on the corresponding summation terms, since

1/|R_1| − 1/(|R_1| + 1) = 1/6  <  1/|R_2| − 1/(|R_2| + 1) = 1/2.

The left-hand side of the inequality corresponds to the objective improvement after the addition of a fake object to R_1 and the right-hand side to R_2. We can use this observation to improve Algorithm 1 with the use of procedure SELECTAGENT() of Algorithm 3. We denote the combination of Algorithms 1 and 3 by e-optimal.

Algorithm 3. Agent Selection for e-optimal
1:  function SELECTAGENT(R, R_1, ..., R_n)
2:    i ← argmax_{i′ : R_{i′} ∈ R} (1/|R_{i′}| − 1/(|R_{i′}| + 1)) Σ_j |R_{i′} ∩ R_j|
3:    return i

Algorithm 3 makes a greedy choice by selecting the agent that will yield the greatest improvement in the sum-objective. The cost of this greedy choice is O(n²) in every iteration. The overall running time of e-optimal is O(n + n²B) = O(n²B). Theorem 2 shows that this greedy approach finds an optimal distribution with respect to both optimization objectives defined in (9).

Theorem 2. Algorithm e-optimal yields an object allocation that minimizes both sum- and max-objective in problem instances of class EF.

With sample data requests, there are

∏_{i=1}^{n} C(|T|, m_i)

different object allocations, where C(|T|, m_i) denotes the binomial coefficient. In every allocation, the distributor can permute T objects and keep the same chances of guilty agent detection. The reason is that the guilt probability depends only on which agents have received the leaked objects and not on the identity of the leaked objects. Therefore, from the distributor's perspective, there are

∏_{i=1}^{n} C(|T|, m_i) / |T|!

different allocations. The distributor's problem is to pick one so that he optimizes his objective. We formulate the problem as a nonconvex QIP that is NP-hard to solve.

Note that the distributor can increase the number of possible allocations by adding fake objects (and increasing |T|), but the problem is essentially the same. So, in the rest of this section, we will only deal with problems of class SF̄, but our algorithms are applicable to SF problems as well.

An object allocation that satisfies requests and ignores the distributor's objective is to give each agent U_i a randomly selected subset of T of size m_i. We denote this algorithm by s-random and use it as our baseline. We present s-random in two parts: Algorithm 4 is a general allocation algorithm that is used by other algorithms in this section. In line 6 of Algorithm 4, there is a call to function SELECTOBJECT() whose implementation differentiates the algorithms that rely on Algorithm 4. Algorithm 5 shows function SELECTOBJECT() for s-random.

Algorithm 4. Allocation for Sample Data Requests (SF̄)
Input: m_1, ..., m_n, |T|   . Assuming m_i ≤ |T|
Output: R_1, ..., R_n
1:  a ← 0_{|T|}   . a[k]: number of agents who have received object t_k
2:  R_1 ← ∅, ..., R_n ← ∅
3:  remaining ← Σ_{i=1}^{n} m_i
4:  while remaining > 0 do
5:    for all i = 1, ..., n : |R_i| < m_i do
6:      k ← SELECTOBJECT(i, R_i)   . May also use additional parameters
7:      R_i ← R_i ∪ {t_k}
8:      a[k] ← a[k] + 1
9:      remaining ← remaining − 1

Algorithm 5. Object Selection for s-random
1:  function SELECTOBJECT(i, R_i)
2:    k ← select at random an element from set {k′ | t_{k′} ∉ R_i}
3:    return k

In s-random, we introduce vector a ∈ N^{|T|} that shows the object sharing distribution. In particular, element a[k] shows the number of agents who receive object t_k.
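Putting Algorithms 4 and 5 together, here is a minimal Python sketch of s-random; the identifier names are my own. Each agent's complement set {k′ | t_{k′} ∉ R_i} is kept as a list with swap-removal so that each object selection runs in constant time, as the text suggests:

```python
import random

def s_random(T, m):
    """Sketch of Algorithms 4 and 5 (s-random): random sample allocation.

    T: list of data objects; m: per-agent sample sizes (m[i] <= len(T) assumed).
    Returns (R, a): R[i] is the set of object indices given to agent U_i,
    and a[k] counts the agents that received object t_k.
    """
    n = len(m)
    a = [0] * len(T)                        # line 1: a <- 0_|T|
    R = [set() for _ in range(n)]           # line 2: R_i <- empty set
    # Complement lists {k' | t_k' not in R_i}, one per agent.
    comp = [list(range(len(T))) for _ in range(n)]
    remaining = sum(m)                      # line 3
    while remaining > 0:                    # main loop (lines 4-9)
        for i in range(n):
            if len(R[i]) < m[i]:            # agent U_i still needs objects
                # Algorithm 5: draw a random object not yet in R_i (O(1)).
                j = random.randrange(len(comp[i]))
                k = comp[i][j]
                comp[i][j] = comp[i][-1]    # swap-remove the chosen index
                comp[i].pop()
                R[i].add(k)                 # line 7: R_i <- R_i U {t_k}
                a[k] += 1                   # line 8
                remaining -= 1              # line 9
    return R, a
```

For example, s_random(['t1','t2','t3','t4'], [2, 2, 1, 1]) returns one of the many allocations in which the agents receive 2, 2, 1, and 1 distinct objects, respectively.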
Algorithm s-random allocates objects to agents in a round-robin fashion. After the initialization of vector a and the sets R_i in lines 1 and 2 of Algorithm 4, the main loop in lines 4-9 is executed while there are still data objects (remaining > 0) to be allocated to agents. In each iteration of this loop (lines 5-9), the algorithm uses function SELECTOBJECT() to find a random object to allocate to agent U_i. This loop iterates over all agents who have not received the number of data objects they have requested.

The running time of the algorithm is O(τ Σ_{i=1}^{n} m_i) and depends on the running time τ of the object selection function SELECTOBJECT(). In the case of random selection, we can have τ = O(1) by keeping in memory a set {k′ | t_{k′} ∉ R_i} for each agent U_i.

Algorithm s-random may yield a poor data allocation. Say, for example, that the distributor set T has three objects and there are three agents who each request one object. It is possible that s-random provides all three agents with the same object. Such an allocation maximizes both objectives (9a) and (9b) instead of minimizing them.

Algorithm s-overlap minimizes the sum of overlaps, i.e., Σ_{i,j : j≠i} |R_i ∩ R_j|. If requests are all of the same size (m_1 = ... = m_n), then s-overlap minimizes the sum-objective.

To illustrate that s-overlap does not in general minimize the sum-objective, assume that set T has four objects and there are four agents requesting samples with sizes m_1 = m_2 = 2 and m_3 = m_4 = 1. A possible data allocation from s-overlap is

R_1 = {t_1, t_2}, R_2 = {t_3, t_4}, R_3 = {t_1}, R_4 = {t_3}.   (10)

Allocation (10) yields:

Σ_{i=1}^{4} (1/|R_i|) Σ_{j=1, j≠i}^{4} |R_i ∩ R_j| = 1/2 + 1/2 + 1 + 1 = 3.
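As a quick sanity check of the computation above, the sum-objective of allocation (10) can be evaluated with a short Python sketch (the helper name is my own):

```python
from fractions import Fraction

def sum_objective(R):
    """Evaluate sum_i (1/|R_i|) * sum_{j != i} |R_i ∩ R_j| exactly."""
    total = Fraction(0)
    for i, Ri in enumerate(R):
        overlap = sum(len(Ri & Rj) for j, Rj in enumerate(R) if j != i)
        total += Fraction(overlap, len(Ri))
    return total

# Allocation (10) from the text:
R = [{'t1', 't2'}, {'t3', 't4'}, {'t1'}, {'t3'}]
print(sum_objective(R))  # 3
```

Using exact fractions avoids floating-point rounding when comparing candidate allocations.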

In a perfect world, there would be no need to hand over sensitive data to agents that may unknowingly or maliciously leak it. And even if we had to hand over sensitive data, in a perfect world, we could watermark each object so that we could trace its origins with absolute certainty. However, in many cases, we must indeed work with agents that may not be 100 percent trusted, and we may not be certain if a leaked object came from an agent or from some other source, since certain data cannot admit watermarks.
In spite of these difficulties, we have shown that it is possible to assess the likelihood that an agent is responsible for a leak, based on the overlap of his data with the leaked data and the data of other agents, and based on the probability that objects can be "guessed" by other means. Our model is relatively simple, but we believe that it captures the essential trade-offs. The algorithms we have presented implement a variety of data distribution strategies that can improve the distributor's chances of identifying a leaker. We have shown that distributing objects judiciously can make a significant difference in identifying guilty agents, especially in cases where there is large overlap in the data that agents must receive.
Our future work includes the investigation of agent guilt models that capture leakage scenarios that are not studied in this paper. For example, what is the appropriate model for cases where agents can collude and identify fake tuples? A preliminary discussion of such a model is available in [14]. Another open problem is the extension of our allocation strategies so that they can handle agent requests in an online fashion (the presented strategies assume that there is a fixed set of agents with requests known in advance).
