You are on page 1of 5

IM Session Identification by Outlier Detection in

Cross-correlation Functions

Saad Saleh∗ , Muhammad U. Ilyas∗ , Khawar Khurshid∗ , Alex X. Liu‡ and Hayder Radha§
∗ Dept. of Electrical Engineering, School of Electrical Engineering and Computer Science,
National University of Sciences and Technology, H-12, Islamabad – 44000, Pakistan
‡ Dept. of Computer Science and Engg, College of Engineering, Michigan State University, East Lansing, MI – 48824, USA
§ Dept. of Electrical and Computer Engg, College of Engineering, Michigan State University, East Lansing, MI – 48824, USA

Email: {saad.saleh, usman.ilyas, khawar.khurshid}@seecs.edu.pk∗ ,


alexliu@cse.msu.edu‡ , radha@egr.msu.edu§

Abstract—The identification of encrypted Instant Messaging approach, has a TP rate of 63% for a signal-to-noise ratio
(IM) channels between users is made difficult by the presence of (SNR) of −6.11dB [4]. Major factors for performance
variable and high levels of uncorrelated background traffic. In deterioration include delay, jitter, buffering, reordering,
this paper, we propose a novel Cross-correlation Outlier Detector duplicate messages and server messages. In the area of
(CCOD) to identify communicating end-users in a large group of de-anonymization of mix-networks several studies focused
users. Our technique uses traffic flow traces between individual
on the computation of mutual information between ingress
users and IM service provider’s data center. We evaluate the
CCOD on a data set of Yahoo! IM traffic traces with an average and egress traffic flows. High time-complexity and reliance
SNR of −6.11dB (data set includes ground truth). Results show on data that needs to be collected from multiple points
that our technique provides 88% true positives (TP) rate, 3% inside the network are major limitations. In social-network
false positives (FP) rate and 96% ROC area. Performance of de-anonymization, data sparsity and membership information
the previous correlation-based schemes on the same data set was is used to de-anonymize networks. Here, the requirement of
limited to 63% TP rate, 4% FP rate and 85% ROC area. detailed user information becomes a major limiting factor.
Keywords- Link de-anonymization; instant messaging; se- Several de-anonymization attempts have been made over Tor
curity; privacy; network but the major emphasis was the identification of
traffic using various fingerprints. Use of correlation attempts
has been limited in breaching attempts.
I. I NTRODUCTION
A. Background and Motivation In our previous works, we showed that the correlation
of wavelet decomposed time series of users’ traffic traces
Instant Messaging (IM) services are projected to reach can successfully breach IM session privacy [4]. In a recent
1.4 billion users worldwide by 2016 [1]. IM services provide study, we showed that the cause-effect relationship between
mostly free, ubiquitous access, mobility and privacy. However, packets appearing in two communicating (“talking”) users’
user privacy has been under attack for unlawful reasons in traffic traces can be leveraged to de-anonymize user sessions
the past few years, e.g. the theft of data of 35 million users [5]. Time complexity was a major limiting factor for these
in Korea in 2013 [2]. Government agencies, including the approaches.
National Security Agency (NSA), have also breached the
privacy of millions of users [3]. The aim of our research is
to assess the vulnerability of IM sessions to de-anonymization C. Proposed Approach
attacks (identifying who is communicating with whom) using
only transport layer session traces. Such link de-anonymization In this paper, we propose a novel Cross-correlation Outlier
is challenging for the following reasons: (1) IM messages Detector (CCOD) to de-anonymize users’ IM sessions using
are now often times encrypted (only IP and TCP headers are only undirected transport layer traffic traces collected at the
visible which become infeasible to log on a large scale), (2) data center (or gateway), in the form of flow logs. Our idea
IM data center establishes separate TCP connections between leverages the limited delay between the appearance of a packet
the source and destination users (at any time, no packet in a sending user’s traffic trace and its appearance in the
contains the source and destination IPs of both end users). receiving user’s traffic trace. We expect talking users to have
The complexity of de-anonymization increases further in the a high cross-correlation statistic, while for non-talking users
following practical scenarios, (1) Simultaneous multiple mes- the appearance of packets in the same time slot is expected
sage sessions by a user, (2) Thousands of users communicating to be coincidental. To de-anonymize a user we compute the
through IM data center at any time, (3) Duplicate packets due cross-correlation function of the time series of her traffic trace
to retransmissions and (4) Out-of-order packet delivery. derived from her flow log with the respective time series of all
other users. Therefore, the time-complexity of our approach is
Θ(N ), where N is the number of IM users. Next, we estimate
B. Limitations of Prior Art
the distribution of the cross-correlation function at all non-zero
Several prior works have focused on link de- time-shifts. Finally, we apply a binary classifier to the value of
anonymization. Time Series Correlation (TSC), the baseline the cross-correlation function at zero time shift and determine
whether that value is drawn from the preceding distribution or III. P ROPOSED D E - ANONYMIZATION A PPROACH
not.
We propose a cross-correlation outlier detector to map an
D. Experimental Results and Findings IM user to the user(s) she is communicating with. Let X(t)
and Y (t) be the respective time series of packets sent to and
We evaluated the performance of CCOD on a real world received from the IM data center by two users X and Y ,
Yahoo! IM based data set collected from the greater New York respectively. Let RXY (t − k) denote the cross-correlation of
area over a duration of 24 hours. Classification rates were flow log X(t) and k time unit (in our case, seconds) delayed
estimated for Naı̈ve Bayes, SVM and C 4.5 classifiers. Results flow log Y (t − k), respectively. The step-by-step procedure of
showed that CCOD provides up to 88% TP rate and 3% FP our approach is as follows:
rate with 95% precision and 88% recall. Results showed that
low SNR has the worst effect on classifier performance. We 1) Compute RXY (0) of the flow logs X(t) and Y (t).
compared the results of TSC and wavelet based decomposition 2) Compute RXY (t−k)  of the flow logs X(t) and Y (t−
called COLD with CCOD [4][5]. Results showed that correla- k), where k ∈ Z {[−Z, +Z]\0} , i.e. varies from
tion outlier outperformed the techniques of TSC and COLD. −Z to +Z, except 0.
3) Estimate the standard deviation σXY of all 2 × Z
E. Key Contributions values RXY (t − k) from the previous step.
Our major contributions are as follows: 4) Compute the deviation of RXY (0) from the mean
μXY of the distribution of RXY (t − k) of all other
1) We propose CCOD, a novel link de-anonymization values, i.e. RXY (0) − μXY .
approach for IM user session (see Section III). 5) Normalize the deviation of RXY (0)
2) We validated the approach on a real world Yahoo! (R(0) − μXY ) by the standard deviation σXY ,
messenger data set (see Section IV). i.e. (RXY (0) − μXY ) /σXY .
3) We compared the performance gains of our CCOD
approach with previous state-of-the-art schemes (see Our breaching strategy is to explore the “high correla-
Tab. II). tion” between undirected flow logs of communicating users,
where the threshold for what constitutes high correlation is
Paper Organization: determined by the value of σXY for each pair of users
The rest of the paper is organized as follows: Section- X and Y . Without any time shift, talking users have high
II presents the IM network scenario and de-anonymization cross-correlation. However, the server introduces a processing
challenges. Section-III presents our proposed approach. Re- delay for packets which decreases the correlation statistics.
sults of simulations over Yahoo! data set have been discussed Although noise decreases the correlation statistics, any increase
in Section-IV. Section-V presents the related work. Finally, in time-shift decreases correlation sharply. To increase the
Section-VI concludes the paper. performance gains, we developed our own measure which
computes the deviation of central value (non-shifted correlation
II. P ROBLEM D ESCRIPTION statistics) from the mean of all other values. Developed metric
is expressed in terms of standard deviation of all other values.
When two users u1 and u2 communicate with each other
via a relayed IM service, two TCP connections are established;
One session is established between user u1 and the IM data IV. E XPERIMENTAL R ESULTS AND F INDINGS
center and another one between user u2 and the IM data center. Performance of the correlation outlier approach was tested
All flows in source user appear in destination users time series on a data set collected from Yahoo! IM service [8]. The
with some delay and possible reordering. Fig. 1 shows the entire data was collected from the greater New York area over
setup used for data collection of IM traffic traces. Port filtering a duration of 24 hours. We were provided a flow log file
is used to isolate IM traffic from other traffic to/from the data (containing time and anonymized user IDs) and the ground
center. Users are identified by their IP addresses. Flow logs truth user communication file for testing the performance of
can be collected at two possible locations, (1) At the gateway the proposed CCOD approach. Analysis of flow log data set
to the IM data center, (2) At the gateway or proxy server of revealed the large noise present in form of about 200 control
a subnet. Logging at the IM data center can de-anonymize all and service messages resulting in a mean SNR of −6.11dB and
sessions established through the data center. On the contrary, a standard deviation of 5.81dB. Fig. 2a and Fig. 2b show the
traffic logged at the gateway of a subnet can only be used to correlation statistics with time shifts −20 ≤ k ≤ +20. Unlike
de-anonymize conversations between users located inside the non-talking pairs, all talking pairs bear a large correlation
subnet1 . Traffic logging tools record a number of packet header statistics without any time shift. Any increase in time shift
fields using packet headers including source and destination decreases the correlation statistics. Correlation does not drop
IP addresses, source and destination port numbers, protocol to zero at most time shifts because of the high levels of
versions, time stamps and packet sizes [7]. Owing to resource background traffic / low SNR of the data set. A few pairs
constraints, the flow logs collected as part of the data set we show large correlation statistics despite time shifts due to the
collected at the data center contained only timestamps and user variability of the data set. Thus using a number of time shifts
IDs / IP addresses [8]. We aim to answer the question, “Is it large enough to estimate σXY can remove this anomaly in
possible to link two communicating IM users from among a classifier design. Fig. 3a and Fig. 3b present the probability
large set of users using only user-server flow logs?” density function of the computed (RXY (0) − μXY ) /σXY for
1 In a recent case of a bomb hoax at Harvard University, the suspect was all talking and non-talking pairs. Results show that talking
identified based upon the local subnet traffic at Harvard University [6]. pairs have a larger spread than non-talking pairs. The data set
X

3DFNHW
6L]H
,QWHUQHW
7LPH *DWHZD\
,0 'DWD
&HQWHU

X X

3DFNHW
3DFNHW

6L]H
6L]H
7LPH 7LPH

Fig. 1: Collecting traffic around gateway and IM Data Center.

used for the results in this section is of 10 minute duration server messages, delays, jitters, no network visibility and large
containing 1962 talking and 3000 non-talking user pairs. number of users.
Tab. I presents the classifier results. WEKA was used to
design classifiers with 10-fold cross validation [9]. Classifiers B. Social Network De-anonymization
showing promising results include Naı̈ve Bayes, SVM and Using people’s friendship graphs to de-anonymize net-
C 4.5 decision tree. Results show that best performance is works has been a major focus of various studies. In [12],
obtained for C 4.5 classifier providing 88% TP rate, 3% FP authors calculate the distances in d-dimensional data sets of
rate, 95% precision, 88% recall and 96% ROC. users to measure the user separation. In [13], authors develop
Tab. II presents the performance comparison of previous an identification algorithm to de-anonymize the users. In [14],
TSC and COLD approaches with CCOD. CCOD provides [15], group membership information has been used to trace the
88% TP rate while TSC and COLD show only 63% and 55% users. All these studies have been ineffective for IM networks
TP rates, respectively. Similarly, CCOD shows a 3% FP rate because these approaches require connectivity and topological
while TSC and COLD show 3% and 4% FP rates, respectively. information of users.
Similar is the case for other statistics. TSC and COLD showed
poor de-anonymization statistics in comparison with CCOD. C. Tor Network Deanonymization
A number of studies have focused over the deanonymiza-
V. R ELATED W ORK
tion of the Tor network which uses a distributed network archi-
A. Mix Network De-anonymization tecture. Location of hidden services is identified by using route
selection, timing signatures and delays [16][17][18]. Several
Several studies have focused on the de-anonymization of researches focus over the identification of Tor traffic from other
mix networks. In [10], Troncoso and Danezis use a Markov network traffic using traffic fingerprints, packet sizes and proxy
Chain Monte Carlo engine to obtain the probabilities for link nodes [19][20][21]. Large number of breaching attempts have
de-anonymization. Authors limited the network size to only 10 been made using unpopular ports, circuit clogging and man-in-
nodes and collected data from multiple points. In [11], authors the-middle strategies [22][23][24]. “Correlation” based attacks
estimate the mutual information of all the information entering have been used to identify Tor traffic by using delays, times,
and leaving the mixers. Their algorithm is exhaustive because stream sizes [25][26][27]. However, no previous study has used
of large computation time. Time series correlation has been the deviation in correlations from the mean value without any
the most logical de-anonymization technique but it provides time shift, like our scheme, to deanonymize the Tor network.
only 63% TP rate for IM networks [4] [5]. IM network
differs from mix networks due to large number of control and
D. IM Network De-anonymization
In our study [4], we used the wavelet based decompo-
TABLE I: Performance of CCOD on Yahoo! data set. sition approach to de-anonymize IM users. Correlation of
coefficients was performed over the decomposed user time
Classifier TP Rate FP Rate Precision Recall ROC
series at multiple frequency scales. Our experiments showed
Naı̈ve Bayes 0.61 0.04 0.92 0.61 0.91
SVM 0.89 0.04 0.94 0.89 0.93 hit rates of 90% to 98% when the candidate set sizes varied
C 4.5 0.88 0.03 0.95 0.88 0.96 from 10 to 20. However, COLD showed poor performance
for a candidate set size of 1. In our recent study [5], we
used the causality based detection approaches for link de-
TABLE II: Performance comparison of CCOD with previous anonymization. Our study showed a TP rate of 99% for the best
de-anonymization approaches [4][5]. performing KCI causality based approach. Large computation
Technique TP FP Precision Recall ROC time and complexity were the major disadvantages for some
TSC 0.63 0.04 0.76 0.63 0.85 causality based approaches. In this paper, we focus over a
COLD 0.55 0.03 0.78 0.55 0.84 novel correlation outlier approach with better performance and
CCOD 0.88 0.03 0.95 0.88 0.96 computation time.
500 200

400
150

Correlation R(t)
Correlation R(t)

300
100
200

50
100

0 0
−20 −15 −10 −5 0 5 10 15 20 −20 −15 −10 −5 0 5 10 15 20
Time Shift (k) Time Shift (k)
(a) Talking pairs. (b) Non-talking pairs.

Fig. 2: Correlation with time shift for (a) talking pairs, and (b) non-talking pairs.

0.12 0.6

0.1 0.5

0.08 0.4
PDF

PDF
0.06 0.3

0.04 0.2

0.02 0.1

0 0
−10 0 10 20 30 40 50 60 70 80 90 100 −10 0 10 20 30 40 50 60 70 80 90 100
[R(0)−μ] / σ [R(0) − μ] / σ
(a) Talking pairs. (b) Non-talking pairs.

Fig. 3: CCOD statistics for (a) talking pairs, and (b) non-talking pairs.

VI. C ONCLUSION R EFERENCES


[1] CMO Council. Engage at Every Stage: Using Mobile Relationship
Our paper presents a novel Cross-correlation Outlier Detec- Marketing (MRM) to Put More Interaction in the Hands of the Customer
tor (CCOD) approach to breach user session in encrypted IM Report, January 2012.
networks using only flow log data. Our approach suggests that [2] D. Sancho. Large Data Breach in South Korea, Data of 35M Users
talking pairs bear large correlation statistics without any time Stolen, July 2013.
shift. Experiments over a real world Yahoo! messenger data [3] E. Chabrow. NSA E-Spying: Bad Governance, October 2013.
set showed a maximum of 88% TP rate, 3% FP rate, 95% [4] M. U. Ilyas, M. Z. Shafiq, A. X. Liu, and H. Radha. Who are you
accuracy, 88% precision and 96% ROC. Study suggests that talking to? Breaching privacy in encrypted IM networks. In Intl. Conf.
CCOD provides better performance than majority of previous on Network Protocols, 2013.
approaches with much less computational complexity. In fu- [5] S. Saleh, M. Raja, M. Shahnawaz, M. U. Ilyas, K. Khurshid, M. Z.
ture, we aim to deanonymize the user links in the Tor network Shafiq, A. X. Liu, H. Radha, and S. S. Karande. Breaching IM session
by real world experiments using our proposed approaches. privacy using causality. In Global Telecommunications Conference
(GLOBECOM), 2014.
[6] B. Crowley and S. Almasy. Harvard student charged in bomb hoax out
on 100,000 dollars bail, December 2013.
[7] B. Claise. Cisco systems NetFlow services export version 9. Wikipedia,
the free encyclopedia, October 2004.
ACKNOWLEDGMENT
[8] Yahoo! Webscope dataset ydata-ymessenger-client-server-protocol-
events-v1-0. [http://research.yahoo.com/Academic Relations].
This research is part of the project “Detecting covert links [9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.
in instant messaging networks using flow level log data” sup- Witten. The WEKA data mining software: an update. SIGKDD
ported by National ICT R&D Fund, Pakistan. We are thankful Explorations, 11(1):10–18, 2009.
to the people at Yahoo! Webscope program for providing us [10] C. Troncoso and G. Danezis. The Bayesian traffic analysis of mix
the real world data and assisting us in the estimation of users networks. In ACM Computer and Communications Security, 2009.
vulnerability in IM applications. [11] Y. Zhu, X. Fu, R. Bettati, and W. Zhao. Anonymity Analysis of
Mix Networks against Flow-Correlation Attacks. In IEEE Global Ubiquitous Computing (EUC), 2011 IFIP 9th International Conference
Communications Conference (GLOBECOM), 2005. on, pages 72–78. IEEE, 2011.
[12] A. Narayanan and V. Shmatikov. Robust de-anonymization of large [21] Sambuddho Chakravarty, Angelos Stavrou, and Angelos D Keromytis.
sparse datasets. In IEEE Symposium on Security and Privacy, 2008. Identifying proxy nodes in a tor anonymization circuit. In Signal
[13] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In Image Technology and Internet Based Systems, 2008. SITIS’08. IEEE
IEEE Symposium on Security and Privacy, 2009. International Conference on, pages 633–639. IEEE, 2008.
[14] G. Wondracek, T. Holz, E. Kirda, and C. Kruegel. A practical attack [22] Muhammad Aliyu Sulaiman and Sami Zhioua. Attacking tor through
to de-anonymize social network users. In IEEE Symposium on Security unpopular ports. In Distributed Computing Systems Workshops (ICD-
and Privacy, 2010. CSW), 2013 IEEE 33rd International Conference on, pages 33–38.
IEEE, 2013.
[15] E. Zheleva and L. Getoor. To Join or not to Join: The Illusion of Privacy
in Social Networks with Mixed Public and Private User Profiles. In [23] Eric Chan-Tin, Jiyoung Shin, and Jiangmin Yu. Revisiting circuit
World Wide Web (WWW) Conference, 2009. clogging attacks on tor. In Availability, Reliability and Security (ARES),
2013 Eighth International Conference on, pages 131–140. IEEE, 2013.
[16] Lasse Overlier and Paul Syverson. Locating hidden servers. In Security
and Privacy, 2006 IEEE Symposium on, pages 15–pp. IEEE, 2006. [24] Xiaogang Wang, Junzhou Luo, Ming Yang, and Zhen Ling. A
novel flow multiplication attack against tor. In Computer Supported
[17] Juan A Elices, Fernando Perez-Gonzalez, and Carmela Troncoso. Fin-
Cooperative Work in Design, 2009. CSCWD 2009. 13th International
gerprinting tor’s hidden service log files using a timing channel. In
Conference on, pages 686–691. IEEE, 2009.
Information Forensics and Security (WIFS), 2011 IEEE International
Workshop on, pages 1–6. IEEE, 2011. [25] Steven J Murdoch and George Danezis. Low-cost traffic analysis of
tor. In Security and Privacy, 2005 IEEE Symposium on, pages 183–
[18] Karsten Loesing, Werner Sandmann, Christian Wilms, and Guido Wirtz.
195. IEEE, 2005.
Performance measurements and statistics of tor hidden services. In Ap-
plications and the Internet, 2008. SAINT 2008. International Symposium [26] Lu Zhang, Junzhou Luo, Ming Yang, and Gaofeng He. Application-
on, pages 1–7. IEEE, 2008. level attack against tor’s hidden service. In Pervasive Computing and
Applications (ICPCA), 2011 6th International Conference on, pages
[19] Xuefeng Bai, Yong Zhang, and Xiamu Niu. Traffic identification of tor
509–516. IEEE, 2011.
and web-mix. In Intelligent Systems Design and Applications, 2008.
ISDA’08. Eighth International Conference on, volume 1, pages 548– [27] Ming Song, Gang Xiong, Zhenzhen Li, Junrui Peng, and Li Guo. A de-
551. IEEE, 2008. anonymize attack method based on traffic analysis. In Communications
and Networking in China (CHINACOM), 2013 8th International ICST
[20] John Barker, Peter Hannay, and Patryk Szewczyk. Using traffic analysis
Conference on, pages 455–460. IEEE, 2013.
to identify the second generation onion router. In Embedded and

You might also like