Privacy-Preserving Multi-Class Support Vector Machine for Outsourcing the Data Classification in Cloud
Yogachandran Rahulamathavan, Member, IEEE, Raphael C.-W. Phan, Suresh Veluru, Kanapathippillai Cumanan, Member, IEEE, and Muttukrishnan Rajarajan, Senior Member, IEEE

Abstract—Emerging cloud computing infrastructure replaces traditional outsourcing techniques and provides flexible services to clients at different locations via the Internet. This leads to the requirement for data classification to be performed by potentially untrusted servers in the cloud. Within this context, the classifier built by the server can be utilized by clients in order to classify their own data samples over the cloud. In this paper, we study a privacy-preserving (PP) data classification technique where the server is unable to learn any knowledge about clients' input data samples, while the server-side classifier is also kept secret from the clients during the classification process. More specifically, to the best of our knowledge, we propose the first known client-server data classification protocol using support vector machine. The proposed protocol performs PP classification for both two-class and multi-class problems. The protocol exploits properties of Paillier homomorphic encryption and secure two-party computation. At the core of our protocol lies an efficient, novel protocol for securely obtaining the sign of Paillier encrypted numbers.
Index Terms—Privacy, data classification, cloud computing, support vector machine, homomorphic encryption
1 INTRODUCTION
In machine learning and artificial intelligence, data classification is the problem of identifying the category of an unknown data sample using a classifier which is built from a training set of known data samples. Building a good classifier requires a large number of valid training samples; hence, it is not feasible for individuals or small organizations to build their own classifier. The only viable solution to this problem is to outsource the data classification to a third party. Outsourcing the data classification mitigates the requirement of not only a large number of valid training data samples but also the high computational and storage resources for the clients (i.e., individuals or small organizations).
Recent advances in cloud computing replace the traditional outsourcing techniques and provide various services to the clients over the Internet in a flexible manner (i.e., on-demand, pay-per-use) [1]. This leads to a new paradigm of service where a server in a cloud could offer data classification to clients. In particular, the server can automatically process and classify the clients' data samples remotely based on privately owned training data samples.
However, releasing the data samples owned by the clients to the cloud raises privacy concerns, since the data processed in the cloud are often outsourced to untrusted third-party servers (servers) [2], [3]. Furthermore, the server may not wish to disclose any details or parameters of its training data set even if it offers a classification service to the client. Hence, in this paper we propose a method which preserves the privacy of both the client data samples and the server's training data while maximizing the benefits of the emerging cloud computing technology. The role of cloud computing in data classification can be categorized as follows:
1. the cloud is responsible for maintaining and updating the training data set for classification;
2. the cloud provides data classification as a service to any clients via the Internet while preserving the privacy of clients' data;
3. the cloud helps to offload a substantial amount of computation from clients.
A widely used classification tool in many classification scenarios is the support vector machine (SVM), due to its strong mathematical foundations and high reliability in many practical applications. SVM classification involves two phases: a training phase and a testing phase. In the training phase, the server trains an SVM using labeled data (i.e., training data samples belonging to different classes) in order to obtain the classification parameters. In the testing phase, any unlabeled data sample provided by a client can be classified and labeled with the matched class by the server using the classification parameters. Depending on the number of classes that the data samples are to be classified into, SVM can be divided into two problems: the two-class problem and the multi-class problem.
In addition to the availability of a highly reliable SVM classification tool and easy access to cloud computing services, it is necessary to include security and privacy measures to ensure that the privacy of the data provided by the client and of the classification parameters made available by the server is preserved. Hence, it is crucial to develop a privacy-preserving (PP) protocol for SVM whereby a client could ask a server to run the SVM classifier to classify an unlabeled data sample, while maintaining the following privacy guarantees for both the client and the server:
- hide the client's input data sample and the classification result from the server,
- hide the server-side classification parameters from the client.
In practical scenarios, the client might be a general practitioner (GP) who runs a private medical clinic, whereas the classification service could be provided by a national hospital in order to detect a disease from patients' symptoms. This service could be provided by the hospital via the Internet to any client.
In this paper, we present, to the best of our knowledge, the first known client-server PP SVM protocol for both two-class and multi-class problems. More specifically, a client sends the input data sample in an encrypted format to the server. Then the server exploits the homomorphic encryption properties to perform the operations directly on the encrypted data sample. If there are any operations that cannot be handled by the homomorphic properties, then there will be a limited amount of interaction between the client and the server based on a two-party secure computation protocol. We assume that both the client and the server will execute the protocol correctly in order to maintain their reputation; hence they will behave in a semi-honest manner (i.e., they are honest but curious), so privacy is a real issue.
The remainder of this paper is organized as follows: In Section 2, we describe conventional SVM, i.e., the steps involved in training the SVM and classification in the plain-domain (PD). In Section 3, we first briefly describe one of the building blocks, homomorphic encryption, and show how SVM classification can be extended to work in the encrypted-domain (ED) for the two-class problem setting. In particular, the core idea of finding the sign of an encrypted number is described in Section 3. In Section 4, we extend the protocol for the two-class problem to the multi-class problem. We analyze the performance of the ED techniques in Section 5. We review related works in Section 6. Conclusions are discussed in Section 7.
Notation. We use boldface lowercase letters to denote vectors; $(\cdot)'$ denotes the transpose operator; $\|\cdot\|_2$ the Euclidean norm; $\lfloor \cdot \rceil$ the nearest integer approximation; $[[m]]$ the encryption of message $m$; and $\mathrm{sign}(m)$ denotes the sign of the number $m$. The modular reduction operator is denoted by mod.
2 SUPPORT VECTOR MACHINE
SVM has been widely used in machine learning for data classification [10], [11], [20]. It has high generalization ability, which provides high reliability in real-world applications such as image processing [16], computer vision [12], text mining [14], natural language processing [17], bioinformatics [18] and many more. SVM was invented to classify two-class problems; however, it has later been extended to multi-class problems. In a two-class problem, the goal of SVM is to separate both classes by a function which is obtained from the training data samples. The multi-class problem can be solved by decomposing the multi-class problem into multiple two-class subproblems [19]. We formulate the classification functions of both two-class and multi-class problems in the following sections. These classification functions are crucial to derive the PP SVM proposed in Sections 3 and 4.
2.1 Two-Class Problem
We start with a training set of points $\tilde{\mathbf{x}}_i \in \mathbb{R}^n$, $i = 1, \ldots, N$, where each point $\tilde{\mathbf{x}}_i$ belongs to one of the two classes denoted by a label $y_i \in \{-1, +1\}$, $i = 1, \ldots, N$. Using the training data samples we can train an SVM to classify an unlabeled test sample. Initially, the training data need to be normalized to keep the numeric values of the training samples on the same scale. This prevents samples with a large original scale from biasing the solution. Let us denote the normalized training data samples as $\mathbf{x}_i \in \mathbb{R}^n$, $i = 1, \ldots, N$, where
$$\mathbf{x}_i = \frac{\tilde{\mathbf{x}}_i - \bar{\mathbf{x}}}{\boldsymbol{\sigma}}, \quad \forall i, \qquad (1)$$
where $\bar{\mathbf{x}}$ and $\boldsymbol{\sigma}$ denote the mean and standard deviation of the training data samples. Depending on the separability of the training data, the two-class problem is further divided into two: the linear classification problem and the non-linear classification problem.
2.1.1 Linear Classification Problem
The goal of the linear classification problem is to obtain two parallel hyperplanes as shown in Fig. 1, $\mathbf{w}'\mathbf{x} + b = -1$ and $\mathbf{w}'\mathbf{x} + b = +1$, where $\mathbf{w}$ and $b$ are the classification parameters obtained during the training process. Both hyperplanes separate the training data of the two classes such that the distance between those hyperplanes is maximized.
After the training stage, we can classify an unlabeled test sample, $\tilde{\mathbf{t}} \in \mathbb{R}^n$. Before the classification, the test sample is required to be normalized similarly to (1) as

$$\mathbf{t} = \frac{\tilde{\mathbf{t}} - \bar{\mathbf{x}}}{\boldsymbol{\sigma}}. \qquad (2)$$

[Fig. 1. Training data samples for the two different classes are denoted by + and − signs.]
Now the normalized test sample, $\mathbf{t}$, can be substituted into the following classification function:

$$f(\mathbf{t}) = \mathrm{sign}(\mathbf{w}'\mathbf{t} + b) = \mathrm{sign}\left( \sum_{s \in S} \alpha_s y_s \mathbf{x}_s' \mathbf{t} + b \right), \qquad (3)$$

where $f(\mathbf{t}) \in \{-1, +1\}$, $\alpha_i$, $i = 1, \ldots, N$, are Lagrangian variables and $\mathbf{x}_s$, $s = 1, \ldots, |S|$, are support vectors. If $f(\mathbf{t}) = +1$ then the test sample $\tilde{\mathbf{t}}$ belongs to the +ve class, else it belongs to the −ve class. Please note that a decision function $d(\mathbf{t})$ can be extracted from (3) as

$$d(\mathbf{t}) = \mathbf{w}'\mathbf{t} + b = \sum_{s \in S} \alpha_s y_s \mathbf{x}_s' \mathbf{t} + b, \qquad (4)$$

where $\mathbf{w}'\mathbf{x} + b = 0$ denotes the decision hyperplane which lies between the two hyperplanes (i.e., $\mathbf{w}'\mathbf{x} + b = -1$ and $\mathbf{w}'\mathbf{x} + b = +1$).
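To make the plain-domain computation concrete, the following minimal Python sketch evaluates (3) and (4) from a given set of support vectors. The function and variable names (`decision_function`, `alphas`, `ys`) are ours, not the paper's.

```python
import numpy as np

def decision_function(support_vectors, alphas, ys, b, t):
    """Evaluate d(t) = sum_s alpha_s * y_s * <x_s, t> + b, as in (4)."""
    return sum(a * y * np.dot(x_s, t)
               for a, y, x_s in zip(alphas, ys, support_vectors)) + b

def classify(support_vectors, alphas, ys, b, t):
    """Two-class decision f(t) = sign(d(t)), as in (3)."""
    return +1 if decision_function(support_vectors, alphas, ys, b, t) >= 0 else -1
```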
2.1.2 Non-Linear Classification Problem
In the previous section, we discussed the classification problem where the training data samples were linearly separable. However, it has been shown in the literature that a similar approach can be used for non-linear classification problems using kernel tricks [21], [22]. Hence, the non-linear classification algorithm is formally similar to the linear classification algorithm, except that the dot product in (3) (i.e., $\mathbf{x}_s'\mathbf{t}$) is replaced by a non-linear kernel function. The kernel function maps the data samples into a higher-dimensional feature space; hence the non-linear classification problem is transformed into a linear classification problem (see Fig. 2). In this work, we only consider a polynomial kernel. Hence, the dot product between $\mathbf{x}_s$ and $\mathbf{t}$ in (3) can be replaced as
$$\mathbf{x}_s'\mathbf{t} \;\Rightarrow\; K(\mathbf{x}_s, \mathbf{t}) = (\mathbf{x}_s'\mathbf{t} + 1)^p, \qquad (5)$$
where $p$ denotes the degree of the polynomial. Hence, the classification function in (3) can be modified as

$$f(\mathbf{t}) = \mathrm{sign}\underbrace{\left( \sum_{s \in S} \alpha_s y_s (\mathbf{x}_s'\mathbf{t} + 1)^p + b \right)}_{\text{decision function } d(\mathbf{t})}. \qquad (6)$$
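The change from (3) to (6) is a one-line change to the earlier sketch; this hypothetical helper is reused by the multi-class examples below.

```python
def poly_kernel_decision(support_vectors, alphas, ys, b, t, p=2):
    """Evaluate the polynomial-kernel decision function of (6)."""
    return sum(a * y * (np.dot(x_s, t) + 1) ** p
               for a, y, x_s in zip(alphas, ys, support_vectors)) + b
```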
2.2 Multi-Class Problem
In order to solve a multi-class problem using SVM, it has to be decoupled into multiple two-class subproblems. There are two classical approaches to decouple the multi-class problem: one-versus-all (1VA) and one-versus-one (1V1) [13], [15].
2.2.1 One-versus-All Approach
For a given number of classes $N_c$, we form $N_c$ subproblems. Hence, we train $N_c$ SVMs. For the $j$th subproblem, we train an SVM using the $j$th class training data samples as the +ve class and all the remaining samples as the −ve class. In the testing phase, for a given normalized test sample, $\mathbf{t}$, the matching class, $M_c$, can be obtained by modifying (6) as

$$M_c = \operatorname*{argmax}_{j = 1, \ldots, N_c} \left( \sum_{s \in S_j} \alpha_s y_s (\mathbf{x}_s'\mathbf{t} + 1)^p + b_j \right), \qquad (7)$$

where the subscript $j$ denotes the variables associated with the $j$th subproblem. Hence, the subproblem with the largest value of the decision function (i.e., $\sum_{s \in S_j} \alpha_s y_s (\mathbf{x}_s'\mathbf{t} + 1)^p + b_j$) will be chosen.
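A minimal sketch of (7), assuming each per-class model is stored as a tuple of the parameters used by the `poly_kernel_decision` helper above (the tuple layout is an assumption for illustration):

```python
def classify_1va(models, t):
    """One-versus-all: pick the subproblem with the largest decision value, as in (7).

    `models` is a list of (support_vectors, alphas, ys, b, p) tuples, one per class.
    Returns the 0-based index of the matching class.
    """
    scores = [poly_kernel_decision(sv, a, y, b, t, p) for (sv, a, y, b, p) in models]
    return max(range(len(models)), key=lambda j: scores[j])
```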
2.2.2 One-versus-One Approach
For this approach, we train $\frac{N_c(N_c - 1)}{2}$ subproblems, where the SVM for each subproblem is trained using data only from two classes. If we consider training samples from class $i$ (i.e., the +ve class) and class $j$ (i.e., the −ve class) for a given subproblem, then the classification function in (6) can be modified as

$$f_{i,j}(\mathbf{t}) = \mathrm{sign}\left( \sum_{s \in S_{i,j}} \alpha_s y_s (\mathbf{x}_s'\mathbf{t} + 1)^p + b_{i,j} \right), \qquad (8)$$

where the subscript $i,j$ denotes the variables associated with the $(i,j)$th subproblem. The matching class for the given test sample $\mathbf{t}$ can be obtained using a majority voting approach as

$$M_c = \operatorname*{argmax}_{i = 1, \ldots, N_c} \left( \sum_{j = 1, j \neq i}^{N_c} f_{i,j}(\mathbf{t}) \right), \qquad (9)$$

where $f_{j,i}(\mathbf{t}) = -f_{i,j}(\mathbf{t})$, $\forall i, j$.
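The voting rule of (9) can be sketched as follows, again reusing the `poly_kernel_decision` helper; the dictionary layout of `pairwise_models` is an illustrative assumption:

```python
def classify_1v1(pairwise_models, n_classes, t):
    """One-versus-one majority voting, as in (9).

    `pairwise_models[(i, j)]` holds (support_vectors, alphas, ys, b, p) for the
    subproblem with class i as +ve and class j as -ve (0-based indices, i < j).
    """
    votes = [0] * n_classes
    for (i, j), (sv, a, y, b, p) in pairwise_models.items():
        f = +1 if poly_kernel_decision(sv, a, y, b, t, p) >= 0 else -1
        votes[i] += f   # f_{i,j}(t)
        votes[j] -= f   # f_{j,i}(t) = -f_{i,j}(t)
    return max(range(n_classes), key=lambda i: votes[i])
```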
3 CLASSIFICATION IN THE ENCRYPTED-DOMAIN
We consider a client-server model where the server maintains a data set with training samples and obtains the classification parameters required for the classification functions (i.e., (6), (7) and (9)). In this section, we show how to preserve the privacy of the test sample, $\mathbf{t}$, and of the classification result from the server, and of the SVM parameters from the client, when the server executes the classification function to classify a test sample. Depending on the nature of the problem, the server executes one of the classification functions from (6), (7) and (9). First, let us explain the required building blocks in the next section.
[Fig. 2. Non-linear classification problem converted into a linear classification problem after the kernel mapping.]
3.1 Homomorphic Encryption
For concreteness and without loss of generality, our descriptions are based on the Paillier cryptosystem [9], although any other homomorphic encryption scheme could be used. The Paillier cryptosystem is a public-key encryption scheme. Its provable semantic security is based on the decisional composite residuosity problem: it is mathematically intractable to decide whether an integer $z$ is an $n$-residue modulo $n^2$ for some composite $n$, i.e., whether there exists some $y \in \mathbb{Z}^*_{n^2}$ such that $z = y^n \bmod n^2$. Let $n = pq$, where $p$ and $q$ are two large prime numbers. A message $m \in \mathbb{Z}_n$ can be encrypted using the Paillier cryptosystem as $[[m]] = g^m r^n \bmod n^2$, where $g \in \mathbb{Z}^*_{n^2}$ and $r \in \mathbb{Z}^*_n$. The Paillier cryptosystem is said to be an additively homomorphic cryptosystem because, for given encryptions $[[m_1]]$ and $[[m_2]]$, the encryption $[[m_1 + m_2]]$ of the sum $m_1 + m_2$ in the PD and the encryption $[[m_1 \cdot a]]$ of the product of $m_1$ with a constant $a$ in the PD can respectively be computed efficiently in the ED as

$$[[m_1 + m_2]] = [[m_1]] \cdot [[m_2]], \qquad [[m_1 \cdot a]] = [[m_1]]^a. \qquad (10)$$
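The following toy Python implementation of Paillier (with the standard choice $g = n + 1$) illustrates key generation, encryption, decryption and the two homomorphic properties of (10). It is a minimal sketch for intuition only: the hardcoded primes are far too small to be secure, and a real deployment would use a vetted library. It needs Python 3.9+ for `math.lcm` and modular inverse via `pow(x, -1, n)`.

```python
import math
import secrets

def L(u, n):
    """Paillier L-function: L(u) = (u - 1) / n, defined for u = 1 mod n."""
    return (u - 1) // n

def keygen(p, q):
    """Toy Paillier key generation; p and q must be large random primes in practice."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                                   # standard simple choice of g
    mu = pow(L(pow(g, lam, n * n), n), -1, n)   # mu = (L(g^lam mod n^2))^-1 mod n
    return (n, g), (lam, mu)

def encrypt(pk, m):
    """[[m]] = g^m * r^n mod n^2 for a fresh random r."""
    n, g = pk
    r = secrets.randbelow(n - 1) + 1            # with overwhelming probability in Z_n^*
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    """m = L(c^lam mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, n * n), n) * mu) % n

# Homomorphic properties of (10): toy primes, for illustration only.
pk, sk = keygen(2741, 3079)
n2 = pk[0] ** 2
assert decrypt(pk, sk, (encrypt(pk, 20) * encrypt(pk, 22)) % n2) == 42   # addition
assert decrypt(pk, sk, pow(encrypt(pk, 5), 7, n2)) == 35                 # scalar product
```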
In the setting considered in this paper, the client distributes its public key to the server while keeping its private key secret. The server performs encryptions under this public key and exploits the homomorphic properties of the Paillier cryptosystem to perform the required linear operations in the ED. However, only the client can decrypt encrypted messages, using its corresponding private key.
3.2 Encrypting Negative Integers
To represent negative integers, we exploit the cyclic property of modular arithmetic. We represent $-1$ by $n - 1$, since $n - 1 \equiv -1 \bmod n$. When the message space is $m \in \mathbb{Z}_n$, we represent the Paillier encryption of $-m$ using the homomorphic property as follows: $[[-m]] = [[-1 \cdot m]] = [[m]]^{-1} = [[m]]^{n-1}$. In the SVM classification problem considered in this work, the Paillier security parameter $n$ is much larger than the classification variables; hence, errors due to overflow can be avoided.
3.3 Two-Class Problem in the Encrypted-Domain

First we present our technique for SVM classification in the ED that solves the two-class problem. In particular, we show how the server can execute (6) when the test sample is in the ED. Let us assume that each class is represented by an associated string $\text{class}_i$, $i \in \{-, +\}$. Fig. 3 depicts the overview of the proposed PP SVM: the client supplies a test sample in an encrypted format to the server. The server has the normalization parameters (i.e., mean and standard deviation) and the SVM parameters (i.e., support vectors, Lagrangian variables and bias) in the PD. Hence, the server executes the four steps shown in Fig. 3, in the ED, in order to classify the encrypted test sample.

Initially, the client encrypts each element of the test sample $\tilde{\mathbf{t}}$ individually using the public key and sends $[[\tilde{\mathbf{t}}]]$ to the server. In general, the training samples, the test sample and the variables involved in SVM classification are continuous data. Since the Paillier cryptosystem only supports integers, all the variables involved in the decision function (6) need to be quantized to the nearest integer value before encryption. However, if we quantize those variables without any preprocessing then this will deteriorate the classification accuracy. To address this issue, we scale the decision function in (6) using a positive number $\gamma^{2p+1}$ before quantization. Hence, (6) can be modified into (11) as follows:
$$f(\mathbf{t}) = \mathrm{sign}\left\{ \gamma^{2p+1} \left[ \sum_{s \in S} \alpha_s y_s (\mathbf{x}_s'\mathbf{t} + 1)^p + b \right] \right\}
= \mathrm{sign}\Bigg\{ \underbrace{\sum_{s \in S} \gamma(\alpha_s y_s)\, \overbrace{[(\gamma\mathbf{x}_s)'(\gamma\mathbf{t}) + \gamma^2 \cdot 1]^p}^{\text{polynomial kernel } K_s^p} + \gamma^{2p+1} b}_{\text{decision function } d(\mathbf{t})} \Bigg\}. \qquad (11)$$

Note that (6) and (11) provide the same results as long as $\gamma > 0$. If the scaling factor $\gamma$ is sufficiently large (i.e., $\gamma > 10^6$) then the solutions obtained before and after the quantization of the variables will be nearly equal, since

$$\sum_{s \in S} \gamma(\alpha_s y_s) [(\gamma\mathbf{x}_s)'(\gamma\mathbf{t}) + \gamma^2 \cdot 1]^p + \gamma^{2p+1} b \;\approx\; \sum_{s \in S} \lfloor \gamma\alpha_s y_s \rceil \left( \lfloor \gamma\mathbf{x}_s \rceil' \lfloor \gamma\mathbf{t} \rceil + \lfloor \gamma^2 \cdot 1 \rceil \right)^p + \lfloor \gamma^{2p+1} b \rceil. \qquad (12)$$
This result is crucial in order to obtain high accuracy when executing (11) in the ED. Executing (11) in the ED involves four different steps (see Fig. 3). In the following sections, we propose methods to compute these four steps in the ED when the test sample, $\tilde{\mathbf{t}}$, is in encrypted format.
3.3.1 Step 1 – Normalization and Scaling
As a first step, the test sample $\tilde{\mathbf{t}} = [\tilde{t}_1, \ldots, \tilde{t}_n]'$ has to be normalized as in (2). Let us define a mean vector $\bar{\mathbf{x}} = [\bar{x}_1, \ldots, \bar{x}_n]'$ and a normalized test sample $\mathbf{t} = [t_1, \ldots, t_n]'$. Hence, the scaled and normalized test sample $\gamma\mathbf{t}$ can be written as

$$\gamma\mathbf{t} = \gamma\,\frac{\tilde{\mathbf{t}} - \bar{\mathbf{x}}}{\boldsymbol{\sigma}} = \frac{\gamma}{\boldsymbol{\sigma}}\tilde{\mathbf{t}} - \frac{\gamma\bar{\mathbf{x}}}{\boldsymbol{\sigma}} \;\Rightarrow\; \gamma t_i = \frac{\gamma}{\sigma}\tilde{t}_i - \frac{\gamma\bar{x}_i}{\sigma}, \quad \forall i. \qquad (13)$$

However, these operations have to be performed in the ED by a server who receives only the encrypted test sample, $[[\tilde{\mathbf{t}}]] = [[[\tilde{t}_1]], \ldots, [[\tilde{t}_n]]]'$, from the client. Since the server knows the vector $\bar{\mathbf{x}}$ and the scalars $\gamma$ and $\sigma$ in the PD, he can easily compute the values $-\frac{\gamma\bar{x}_i}{\sigma} = (-1)\cdot\frac{\gamma\bar{x}_i}{\sigma}$, $\forall i$, and encrypt each of these components by exploiting the homomorphic properties as $[[(-1)\cdot\frac{\gamma\bar{x}_i}{\sigma}]] = [[\frac{\gamma\bar{x}_i}{\sigma}]]^{(-1)}$, $\forall i$. Similarly, the encryption of $\frac{\gamma}{\sigma}\tilde{t}_i$ can be obtained as $[[\frac{\gamma}{\sigma}\tilde{t}_i]] = [[\tilde{t}_i]]^{\gamma/\sigma}$, $\forall i$. Hence, the scaled and normalized test sample in (13) can be obtained in the ED as follows:

$$[[\gamma t_i]] = \left[\left[\frac{\gamma}{\sigma}\tilde{t}_i - \frac{\gamma\bar{x}_i}{\sigma}\right]\right] = \left[\left[\frac{\gamma}{\sigma}\tilde{t}_i\right]\right] \cdot \left[\left[\frac{\gamma\bar{x}_i}{\sigma}\right]\right]^{(-1)} = [[\tilde{t}_i]]^{\gamma/\sigma} \cdot \left[\left[\frac{\gamma\bar{x}_i}{\sigma}\right]\right]^{(-1)}, \quad \forall i. \qquad (14)$$

Note that every computation in (14) is performed by the server without interacting with the client. Now the server obtains the normalized and scaled test sample in the ED as $[[\gamma\mathbf{t}]] = [[[\gamma t_1]], \ldots, [[\gamma t_n]]]'$.

[Fig. 3. The block diagram of the proposed PP SVM.]
3.3.2 Step 2 – Computing the Polynomial Kernel
The encrypted test sample $[[\gamma\mathbf{t}]]$ is used to compute the polynomial kernel $K_s^p = [(\gamma\mathbf{x}_s)'(\gamma\mathbf{t}) + \gamma^2 \cdot 1]^p$, $\forall s$, in (11) in the ED. Let us define the support vector $\mathbf{x}_s = [x_{s,1}, \ldots, x_{s,n}]'$, $\forall s$. Consider the case when $p = 1$, for which we have

$$K_s = [(\gamma\mathbf{x}_s)'(\gamma\mathbf{t}) + \gamma^2 \cdot 1] = [(\gamma x_{s,1})(\gamma t_1) + \cdots + (\gamma x_{s,n})(\gamma t_n) + \gamma^2 \cdot 1], \quad \forall s. \qquad (15)$$

Thus, the server can compute (15) in the ED as

$$[[K_s]] = [[(\gamma x_{s,1})(\gamma t_1) + \cdots + (\gamma x_{s,n})(\gamma t_n) + \gamma^2 \cdot 1]] = [[(\gamma x_{s,1})(\gamma t_1)]] \cdots [[(\gamma x_{s,n})(\gamma t_n)]] \cdot [[\gamma^2 \cdot 1]] = [[\gamma t_1]]^{\gamma x_{s,1}} \cdots [[\gamma t_n]]^{\gamma x_{s,n}} \cdot [[\gamma^2 \cdot 1]], \qquad (16)$$

$\forall s$. This can be computed as follows: since the server knows the scalars $x_{s,i}$, $\forall s, i$, in the PD, he can perform the required multiplications by exploiting (10). For example, in order to multiply the first component, the server has to compute $[[\gamma t_1]]^{\gamma x_{s,1}}$. To obtain the sum of all the products, he has to multiply the encryptions of each component with the other components. For the case $p = 1$, the server computes the kernel value without any interaction with the client. If $p > 1$ (i.e., $K_s^p$ is required), then the kernel could be computed via a secure two-party computation technique between the server and the client. If the server sends $[[K_s]]$ to the client then the client can decrypt it using her private key, raise the decrypted value to the degree $p$ and send the encrypted $[[K_s^p]]$ back to the server. However, this would leak the server-side SVM parameters to the client. Instead, the server blinds the encrypted $[[K_s]]$ with a uniformly random element $r_s$ from the PD as $[[\tilde{K}_s]] = [[K_s \cdot r_s]] = [[K_s]]^{r_s}$. Note that the server generates a fresh random element $r_s$ for each $[[K_s]]$. Then the server sends the blinded $[[\tilde{K}_s]]$ to the client. The client decrypts $[[\tilde{K}_s]]$ using her private key, computes $\tilde{K}_s^p$ and sends $[[\tilde{K}_s^p]]$ back to the server. Now the server can remove the random mask and compute $[[K_s^p]]$ as

$$[[K_s^p]] = [[\tilde{K}_s^p]]^{r_s^{-p}}, \qquad (17)$$

which provides the required result due to the following homomorphic relation:

$$[[\tilde{K}_s^p]]^{r_s^{-p}} = [[\tilde{K}_s^p \cdot r_s^{-p}]] = [[K_s^p \cdot r_s^p \cdot r_s^{-p}]] = [[K_s^p]].$$

Note that, due to the blinding, the client does not learn the value of $K_s$ even though she had access to $\tilde{K}_s$ in the PD (the privacy analysis is given in Section 3.2.5).
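To illustrate the blinding round for $p > 1$, here is a single-machine sketch of the exchange leading to (17), reusing the toy helpers above. Both roles run in one process here purely for demonstration (`sk` stands in for the client), and $r_s$ is assumed invertible mod $n$:

```python
def step2_kernel_p(pk, sk, enc_Ks, p, rs):
    """Step 2 for p > 1: blind [[K_s]], let the client exponentiate, then unmask per (17)."""
    n, _ = pk
    n2 = n * n
    blinded = pow(enc_Ks, rs, n2)               # server: [[K_s * r_s]] = [[K_s]]^{r_s}
    K_tilde = decrypt(pk, sk, blinded)          # client: K_s * r_s in the PD
    enc_Ktp = encrypt(pk, pow(K_tilde, p, n))   # client: [[(K_s * r_s)^p]]
    unmask = pow(pow(rs, -1, n), p, n)          # server: r_s^{-p} mod n
    return pow(enc_Ktp, unmask, n2)             # server: [[K_s^p]]
```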
3.3.3 Step 3 – Calculating the Decision Function
The server can now execute the third step in Fig. 3, i.e., compute the value of the decision function in (11) in the ED, as follows:

$$[[d(\mathbf{t})]] = \left[\left[ \sum_{s \in S} \gamma(\alpha_s y_s) [(\gamma\mathbf{x}_s)'(\gamma\mathbf{t}) + \gamma^2 \cdot 1]^p + \gamma^{2p+1} b \right]\right] = \left[\left[ \sum_{s \in S} \gamma(\alpha_s y_s) K_s^p + \gamma^{2p+1} b \right]\right] = [[\gamma^{2p+1} b]] \cdot \prod_{s \in S} [[K_s^p]]^{\gamma(\alpha_s y_s)}. \qquad (18)$$

This can be done as follows: since the server knows the values of $\gamma$ and $(\alpha_s y_s)$ in the PD, he can perform the required multiplications using (10). As an example, in order to compute $\gamma(\alpha_s y_s) K_s^p$ in the ED, the server computes $[[K_s^p]]^{\gamma(\alpha_s y_s)}$ for a given $s$. To obtain the sum of all the products together with $\gamma^{2p+1} b$, he multiplies the encryptions with each other.
The only part missing now is the computation of the sign of $d(\mathbf{t})$, and informing the client of the matched class, i.e., $\text{class}_+$ or $\text{class}_-$. Note that this classification result cannot be revealed to the server. In the next section, we describe the secure protocol required to obtain the classification result.
3.3.4 Step 4 – Obtaining the Sign of the Value of the Decision Function
This section studies how to obtain the sign of a Paillier encrypted number. According to (11), $f(\mathbf{t}) = \mathrm{sign}(d(\mathbf{t}))$. However, the server only has the value of $d(\mathbf{t})$ in the ED, as in (18). Let us assume that $|d(\mathbf{t})| < 10^l$, $l \in \mathbb{Z}$, in the PD. Note that since the training and test data samples are normalized, the value of $l$ can be determined using the range of $\sum_{s \in S} \alpha_s y_s (\mathbf{x}_s'\mathbf{t} + 1)^p + b$ and the scaling factor $\gamma^{2p+1}$ in (11). Now the server computes a new variable in the ED as

$$[[z]] = [[10^l + d(\mathbf{t})]] = [[10^l]] \cdot [[d(\mathbf{t})]]. \qquad (19)$$

Since $|d(\mathbf{t})| < 10^l$, the most significant digit of $z$ in the PD is either 1 (i.e., if $d(\mathbf{t}) > 0$) or 0 (i.e., if $d(\mathbf{t}) < 0$). Let us denote the most significant digit of $z$ as $\tilde{z} \in \{1, 0\}$. If $\tilde{z} = 1$ then the test sample belongs to $\text{class}_+$ and if $\tilde{z} = 0$ then the test sample belongs to $\text{class}_-$. Hence, the matched class, $M_c$, can be obtained as

$$M_c = \tilde{z} \cdot (\text{class}_+ - \text{class}_-) + \text{class}_-. \qquad (20)$$

The most significant digit $\tilde{z}$ could be computed using the following linear operation:

$$\tilde{z} = 10^{-l} \cdot [z - (z \bmod 10^l)], \qquad (21)$$