- Yizhou Sun;Jiawei Han;
Information networks that can be extracted from many domains have been widely studied recently. Different functions for mining these networks have been proposed and developed, such as ranking, community detection, and link prediction. Most existing network studies are on homogeneous networks, where nodes and links are assumed to be of a single type. In reality, however, heterogeneous information networks can better model real-world systems, which are typically semi-structured and typed, following a network schema. In order to mine these heterogeneous information networks directly, we propose to explore the meta structure of the information network, i.e., the network schema. The concept of meta-paths, which are defined as paths over the graph of the network schema, is proposed to systematically capture numerous semantic relationships across multiple types of objects. Meta-paths can provide guidance for search and mining of the network and help analyze and understand the semantic meaning of the objects and relations in the network. Under this framework, similarity search and other mining tasks such as relationship prediction and clustering can be addressed by systematic exploration of the network meta structure. Moreover, with the user's guidance or feedback, we can select the best meta-path or a weighted combination of meta-paths for a specific mining task.
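To make the idea of a meta-path concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy bibliographic network with only two object types (authors and papers). The names `WRITES`, `AUTHORS`, and `meta_path_counts` are hypothetical. Multiplying the adjacency matrices of the relations along a meta-path counts the path instances between each pair of endpoint objects.

```python
# Toy bibliographic network (hypothetical data): rows are authors, columns are papers.
AUTHORS = ["ann", "bob", "cat"]
# WRITES[i][j] == 1 means author i wrote paper j.
WRITES = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]


def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    b_columns = transpose(b)
    return [[sum(x * y for x, y in zip(row, column)) for column in b_columns]
            for row in a]


def meta_path_counts(relations):
    """Multiply the adjacency matrices along a meta-path, in order.

    Entry [i][j] of the result is the number of path instances that
    follow the meta-path from object i to object j.
    """
    result = relations[0]
    for relation in relations[1:]:
        result = matmul(result, relation)
    return result


# Meta-path Author -> Paper -> Author: entry [i][j] counts papers
# co-authored by authors i and j (the diagonal counts each author's papers).
APA = meta_path_counts([WRITES, transpose(WRITES)])
print(APA)
```

In this toy data, a different meta-path (for example Author -> Paper -> Venue -> Paper -> Author, if a venue relation were added) would yield a different notion of relatedness between the same two authors, which is why the choice of meta-path matters for a given mining task.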
2013, Issue 04, v.18, pp. 329-338
- Po Hu;Minlie Huang;Xiaoyan Zhu;
Patents are critically important for a company to protect its core business concepts and proprietary technologies. Effective patent mining in massive patent databases not only provides business enterprises with valuable insights to develop strategies for research and development, intellectual property management, and product marketing, but also helps patent offices to improve efficiency and optimize their patent examination processes. This paper describes the patent mining problem of automatically discovering core patents (i.e., novel and influential patents in a domain). In addition, the value of core patent mining is illustrated by revealing the potential competitive relationships among companies in their core patents. The work addresses the unique patent vocabulary usage, which is not considered in traditional word-based statistical methods, with a topic-based temporal mining approach that quantifies a patent's novelty and influence through topic activeness variations. Tests of this method on real-world patent portfolios show the effectiveness of this approach over state-of-the-art methods.
2013, Issue 04, v.18, pp. 339-352
- Fulan Qian;Yanping Zhang;Yuan Zhang;Zhen Duan;
Collaborative Filtering (CF) is a commonly used technique in recommendation systems. It can promote items of interest to a target user from a large selection of available items. It is divided into two broad classes: memory-based algorithms and model-based algorithms. The latter requires some time to build a model but recommends online items quickly, while the former is time-consuming but does not require pre-building time. Considering the shortcomings of the two types of algorithms, we propose a novel Community-based User domain Collaborative Recommendation Algorithm (CUCRA). The idea comes from the fact that recommendations are usually made by users with similar preferences. The first step is to build a user-user social network based on users' preference data. The second step is to find communities with similar user preferences using a community detection algorithm. Finally, items are recommended to users by applying collaborative filtering on communities. Because we recommend items to users in communities instead of to an entire social network, the method has perfect online performance. Applying this method to a collaborative tagging system, experimental results show that the recommendation accuracy of CUCRA is relatively good, and the online time complexity reduces to O(n).
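The three steps described above can be sketched as follows. This is an illustrative toy, not the CUCRA algorithm itself. The preference data, the Jaccard similarity measure, and the `threshold` parameter are all assumptions. Connected components are swapped in as a stand-in for the unspecified community detection algorithm. All function names are hypothetical.

```python
from collections import Counter

# Hypothetical preference data: user -> set of liked items.
PREFS = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"b", "c", "d"},
    "u4": {"x", "y"},
    "u5": {"x", "y", "z"},
}


def jaccard(s, t):
    """Jaccard similarity of two non-empty sets (an assumed similarity measure)."""
    return len(s & t) / len(s | t)


def build_network(prefs, threshold):
    """Step 1: link two users when their preference similarity reaches the threshold."""
    users = sorted(prefs)
    graph = {user: set() for user in users}
    for index, u in enumerate(users):
        for v in users[index + 1:]:
            if jaccard(prefs[u], prefs[v]) >= threshold:
                graph[u].add(v)
                graph[v].add(u)
    return graph


def communities(graph):
    """Step 2 (stand-in): connected components instead of a real community detector."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(component)
    return result


def recommend(user, prefs, comms, top_n=3):
    """Step 3: rank unseen items by votes from the user's own community only."""
    community = next(c for c in comms if user in c)
    votes = Counter()
    for other in community - {user}:
        for item in prefs[other] - prefs[user]:
            votes[item] += 1
    ranked = sorted(votes.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


GRAPH = build_network(PREFS, threshold=0.4)
COMMS = communities(GRAPH)
print(recommend("u1", PREFS, COMMS))
```

The online step only scans the members of one community rather than every user, which is the source of the speed-up the abstract describes.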
2013, Issue 04, v.18, pp. 353-359
- Shu Zhao;Chen Rui;Yanping Zhang;
Mining from ambiguous data is very important in data mining. This paper discusses one of the tasks for mining from ambiguous data, known as the multi-instance problem. In the multi-instance problem, each pattern is a labeled bag that consists of a number of unlabeled instances. A bag is negative if all instances in it are negative. A bag is positive if it has at least one positive instance. Because the instances in a positive bag are not labeled, each positive bag is ambiguous. The mining aim is to classify unseen bags. The main idea of existing multi-instance algorithms is to find true positive instances in positive bags, convert the multi-instance problem to a supervised problem, and obtain the labels of test bags by predicting the labels of unknown instances. In this paper, we aim at mining the multi-instance data from another point of view, i.e., excluding the false positive instances in positive bags and predicting the label of an entire unknown bag. We propose an algorithm called Multi-Instance Covering kNN (MICkNN) for mining from multi-instance data. Briefly, a constructive covering algorithm is first utilized to restructure the original multi-instance data. Then, the kNN algorithm is applied to discriminate the false positive instances. In the test stage, we label the tested bag directly according to the similarity between the unseen bag and the sphere neighbors obtained from the last two steps. Experimental results demonstrate that the proposed algorithm is competitive with most of the state-of-the-art multi-instance methods in both classification accuracy and running time.
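The bag-labeling rule stated above, and the ambiguity it creates, can be illustrated with a small sketch. This shows only the standard multi-instance assumption, not MICkNN. The function names are hypothetical, and labels are encoded as 1 (positive) and 0 (negative) by assumption.

```python
from itertools import product


def bag_label(instance_labels):
    """Standard multi-instance rule: a bag is positive (1) iff any instance is positive."""
    return int(any(instance_labels))


def consistent_labelings(bag_size, label):
    """Enumerate every hidden instance labeling that agrees with an observed bag label."""
    return [combo for combo in product((0, 1), repeat=bag_size)
            if bag_label(combo) == label]


# A negative bag of 3 instances has exactly one explanation: all instances negative.
print(len(consistent_labelings(3, 0)))
# A positive bag of 3 instances has 2**3 - 1 possible explanations, hence the ambiguity.
print(len(consistent_labelings(3, 1)))
```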
2013, Issue 04, v.18, pp. 360-368
- Jin Zhou;Liang Hu;Feng Wang;Huimin Lu;Kuo Zhao;
The Internet of Things (IoT) implies a worldwide network of interconnected objects that are uniquely addressable via standard communication protocols. The prevalence of IoT is bound to generate large amounts of multisource, heterogeneous, dynamic, and sparse data. However, IoT offers inconsequential practical benefits without the ability to integrate, fuse, and glean useful information from such massive amounts of data. Accordingly, to prepare us for the imminent invasion of things, a tool called data fusion can be used to manipulate and manage such data in order to improve process efficiency and provide advanced intelligence. In order to determine an acceptable quality of intelligence, diverse and voluminous data have to be combined and fused. Therefore, it is imperative to improve the computational efficiency of fusing and mining multidimensional data. In this paper, we propose an efficient multidimensional fusion algorithm for IoT data based on partitioning. The basic concept involves the partitioning of dimensions (attributes), i.e., a big data set with higher dimensions can be transformed into a certain number of relatively smaller data subsets that can be easily processed. Then, based on the partitioning of dimensions, the discernibility matrices of all data subsets in rough set theory are computed to obtain their core attribute sets. Furthermore, a global core attribute set can be determined. Finally, the attribute reduction and rule extraction methods are used to obtain the fusion results. The correctness and effectiveness of this algorithm are illustrated by proving a few theorems and by simulation.
2013, Issue 04, v.18, pp. 369-378
- Feng Tan;Li Li;Zheyu Zhang;Yunlong Guo;
With the development of social media and the Internet, discovering latent information from massive amounts of data is becoming particularly relevant to improving user experience. Research efforts based on preferences and relationships between users have attracted more and more attention. Predictive problems, such as inferring friend relationships and co-author relationships between users, have been explored. However, many such methods analyze either node features or network structures separately; few have tried to tackle both at the same time. In this paper, in order to discover latent co-interest relationships, we consider not only users' attributes but also network information. In addition, we propose an Interest-based Factor Graph Model (I-FGM) to incorporate these factors. Experiments on two data sets (a bookmarking network and a music network) demonstrate that this predictive method can achieve better results than the other three methods (ANN, NB, and SVM).
2013, Issue 04, v.18, pp. 379-386
- Le Yu;Bin Wu;Bai Wang;
Recently, complex networks have attracted considerable research attention. Community detection is an important problem in the field of complex networks and is useful in a variety of applications such as information propagation, link prediction, recommendation, and marketing. In this study, we focus on discovering overlapping community structures by using link partitions. We propose a Latent Dirichlet Allocation (LDA)-Based Link Partition (LBLP) method, which can find communities with an adjustable range of overlapping. This method employs the LDA model to detect link partitions, which can calculate the community belonging factor for each link. On the basis of this factor, link partitions with bridge links can be found efficiently. We validate the effectiveness of the proposed solution by using both real-world and synthesized networks. The experimental results demonstrate that the approach can find a meaningful and relevant link community structure.
2013, Issue 04, v.18, pp. 387-397
- Yanhua Yu;Meina Song;Yu Fu;Junde Song;
Traffic prediction plays an integral role in telecommunication network planning and network optimization. In this paper, we investigate the traffic forecasting for data services in 3G mobile networks. Although the Box-Jenkins model has been proven to be appropriate for voice traffic (since the arrival of calls follows a Poisson distribution), it has been demonstrated that the Internet traffic exhibits statistical self-similarity and has to be modeled using the Fractional AutoRegressive Integrated Moving Average (FARIMA) process. However, a few studies have concluded that the FARIMA process may fail in modeling the Internet traffic. To this end, we conducted experiments on the modeling of benchmark Internet traffic and found that the FARIMA process fails because of the significant multifractal characteristic inherent in the traffic series. Thereafter, we investigate the traffic series of data services in a 3G mobile network from a province in China. Rich multifractal spectra are found in this series. Based on this observation, an integrated method combining the AutoRegressive Moving Average (ARMA) and FARIMA processes is applied. The obtained experimental results verify the effectiveness of the integrated prediction method.
2013, Issue 04, v.18, pp. 398-405
- Zhen Chen;Linyun Ruan;Junwei Cao;Yifan Yu;Xin Jiang;
The archiving of Internet traffic is an essential function for retrospective network event analysis and forensic computer communication. The state-of-the-art approach for network monitoring and analysis involves storage and analysis of network flow statistics. However, this approach loses much valuable information within the Internet traffic. With the advancement of commodity hardware, in particular the volume of storage devices and the speed of interconnect technologies used in network adapter cards and multi-core processors, it is now possible to capture real-time network traffic at 10 Gbps and beyond using a commodity computer, for example with n2disk. Also, with the advancement of distributed file systems (such as Hadoop, ZFS, etc.) and open cloud computing platforms (such as OpenStack, CloudStack, and Eucalyptus), it is practical to store such large volumes of traffic data and analyze the communication within them in depth with an acceptable latency. In this paper, based on the well-known TimeMachine, we present TIFAflow, the design and implementation of a novel system for archiving and querying network flows. Firstly, we enhance the traffic archiving system named TImemachine+FAstbit (TIFA) with flow granularity, i.e., supply the system with a flow table and a flow module. Secondly, based on real network traces, we conduct performance comparison experiments of TIFAflow against other implementations such as a common database solution, TimeMachine, and the TIFA system. Finally, based on the comparison results, we demonstrate that TIFAflow achieves better storing and querying performance than TimeMachine and TIFA, in both time and space metrics.
2013, Issue 04, v.18, pp. 406-417
- Jianlin Xu;Yifan Yu;Zhen Chen;Bin Cao;Wenyu Dong;Yu Guo;Junwei Cao;
With the explosive increase in mobile apps, more and more threats are migrating from the traditional PC client to mobile devices. Just as the Win+Intel alliance dominated the PC, the Android+ARM alliance dominates the Mobile Internet, and apps have replaced PC client software as the major target of malicious usage. In this paper, to improve the security status of current mobile apps, we propose a methodology to evaluate mobile apps based on a cloud computing platform and data mining. We also present a prototype system named MobSafe to identify a mobile app's virulence or benignancy. Compared with traditional methods, such as the permission pattern based method, MobSafe combines dynamic and static analysis methods to comprehensively evaluate an Android app. In the implementation, we adopt the Android Security Evaluation Framework (ASEF) and the Static Android Analysis Framework (SAAF), two representative dynamic and static analysis methods, to evaluate Android apps and estimate the total time needed to evaluate all the apps stored in one mobile app market. Based on a real trace from a commercial mobile app market called AppChina, we collect statistics on the number of active Android apps, the average number of apps installed on one Android device, and the expanding ratio of mobile apps. As the mobile app market serves as the main line of defence against mobile malware, our evaluation results show that it is practical to use a cloud computing platform and data mining to verify all stored apps routinely and filter out malware apps from mobile app markets. As future work, MobSafe can extensively use machine learning to conduct automated forensic analysis of mobile apps based on the multifaceted data generated in this stage.
2013, Issue 04, v.18, pp. 418-427
The publication of Tsinghua Science and Technology was started in 1996. Since then, it has been an international academic journal sponsored by Tsinghua University and published bimonthly. This journal aims at presenting the state-of-the-art scientific achievements in computer science and other IT fields. One paper on Cloud Computing published in Vol. 18, Issue 1, 2013, has been ranked at the top of the IEEE download list continuously for five months:
2013, Issue 04, v.18, p. 428
Tsinghua Science and Technology (Tsinghua Sci Technol), an academic journal sponsored by Tsinghua University, is published bimonthly. This journal aims at presenting up-to-date scientific achievements with high creativity and great significance in computer and electronic engineering. Contributions from all over the world are welcome. Tsinghua Sci Technol is indexed by IEEE Xplore, Engineering Index (Ei, USA), INSPEC, SA, Cambridge Abstract, and other abstracting indexes.
2013, Issue 04, v.18, p. 429