PrePrint: Flexible Frameworks for Actionable Knowledge Discovery

Most data mining algorithms and tools stop at the mining and delivery of patterns satisfying expected technical interestingness. There are often many patterns mined but business people either are not interested in them or do not know what follow-up actions to take to support their business decisions. This issue has seriously affected the widespread employment of advanced data mining techniques in greatly promoting enterprise operational quality and productivity. In this paper, we present a formal view of actionable knowledge discovery (AKD) from the system and decision-making perspectives. AKD is a closed optimization problem-solving process from problem definition, framework/model design to actionable pattern discovery, and is designed to deliver operable business rules that can be seamlessly associated or integrated with business processes and systems. To support such processes, we correspondingly propose, formalize and illustrate four types of generic AKD frameworks: Post-Analysis-based AKD, Unified Interestingness-based AKD, Combined Mining-based AKD and Multi-Source Combined Mining-based AKD (MSCM-AKD). A real-life case study of MSCM-based AKD is demonstrated to extract debt prevention patterns from social security data. Substantial experiments show that the proposed frameworks are sufficiently general, flexible and practical to tackle many complex problems and applications by extracting actionable deliverables for instant decision-making.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=6ee8e04c11e51b57c7ef7a3429663cafp=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=6ee8e04c11e51b57c7ef7a3429663cafp=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Categories: IEEE Members Only

PrePrint: Conic Programming for Multi-Task Learning

Abstract.When we have several related tasks, solving them simultaneously has been shown to be more effective than solving them individually. This approach is called multi-task learning (MTL). In this paper, we propose a novel MTL algorithm. Our method controls the relatedness among the tasks locally, so all pairs of related tasks are guaranteed to have similar solutions. We apply the above idea to support vector machines and show that the optimization problem can be cast as a second-order cone program, which is convex and can be solved efficiently. The usefulness of our approach is demonstrated in ordinal regression, link prediction and collaborative filtering, each of which can be formulated as a structured multi-task problem.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=4ff0566b435aa6ebc4bc04281c74e3a0p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=4ff0566b435aa6ebc4bc04281c74e3a0p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Categories: IEEE Members Only

PrePrint: Performance Comparison of the R*-tree and the Quadtree for kNN and Distance Join Queries

Multi-dimensional point indexing plays a critical role in a variety of data-centric applications, ranging from image retrieval, sequence matching, to moving object search. A common choice of indexing method for these applications is often the "ubiquitous" R*-tree. Choosing the right indexing method requires careful consideration of various factors such as query operations and index construction methods. In this work, we present an experimental study comparing the R*-tree and Quadtree using various criteria including the query operations and index construction methods. Although a variety of query operations can be performed using these index structures, previous work has largely focused only on the range search operation. We go beyond this previous work and compare the performance of these index structures using k-nearest neighbor (kNN) and distance join queries. In addition, we also consider the impact of index construction methods in evaluating these index structures. Our study sheds light on how the choice of the underlying index structure affects the performance of different query operations, and shows that the method used for constructing the index and the dynamic nature of the dataset has a dramatic impact on the performance of these index structures.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=acbbcbf094dc6fd3e55e3a1487889879p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=acbbcbf094dc6fd3e55e3a1487889879p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Categories: IEEE Members Only

PrePrint: Dynamic Dissimilarity Measure for Support-Based Clustering

Clustering methods utilizing support estimates of a data distribution have recently attracted much attention because of their ability to generate cluster boundaries of arbitrary shape and to deal with outliers efficiently. In this paper, we propose a novel dissimilarity measure based on a dynamical system associated with support estimating functions. Theoretical foundations of the proposed measure are developed and applied to construct a clustering method that can effectively partition the whole data space. Simulation results demonstrate that clustering based on the proposed dissimilarity measure is robust to the choice of kernel parameters and able to control the number of clusters efficiently.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=1a0a5a535a9d0860c0beb91e1ee0e689p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=1a0a5a535a9d0860c0beb91e1ee0e689p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Categories: IEEE Members Only

PrePrint: Closeness: A New Privacy Measure for Data Publishing

The $k$-anonymity privacy requirement for publishing microdata requires that each equivalence class (i.e., a set of records that are indistinguishable from each other with respect to certain "identifying" attributes) contains at least $k$ records. Recently, several authors have recognized that $k$-anonymity cannot prevent attribute disclosure. The notion of $\ell$-diversity has been proposed to address this; $\ell$-diversity requires that each equivalence class has at least $\ell$ well-represented values for each sensitive attribute. In this article, we show that $\ell$-diversity has a number of limitations. In particular, it is neither necessary nor sufficient to prevent attribute disclosure. Motivated by these limitations, we propose a new notion of privacy called "closeness". We first present the base model $t$-closeness, which requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold $t$). We then propose a more flexible privacy model called $(n,t)$-closeness that offers higher utility. We describe our desiderata for designing a distance measure between two probability distributions and present two distance measures. We discuss the rationale for using closeness as a privacy measure and illustrate its advantages through examples and experiments.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=8e230014bbf88f47c291a8d7ec323663p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=8e230014bbf88f47c291a8d7ec323663p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Categories: IEEE Members Only

PrePrint: Credibility: How Agents Can Handle Unfair Third-party Testimonies in Computational Trust Models

Usually, agents within multi-agent systems represent different stakeholders that have their own distinct and sometimes conflicting interests and objectives. They would behave in such a way so as to achieve their own objectives, even at the cost of others. Therefore, there are risks in interacting with other agents. A number of computational trust models have been proposed to manage such risk. However, the performance of most computational trust models that rely on third-party recommendations as part of the mechanism to derive trust is easily deteriorated by the presence of unfair testimonies. There have been several attempts to combat the influence of unfair testimonies. Nevertheless, they are either not readily applicable since they require additional information which is not available in realistic settings, or ad-hoc as they are tightly coupled with specific trust models. Against this background, a general credibility model is proposed in this paper. Empirical studies have shown that the proposed credibility model is more effective than related work in mitigating the adverse influence of unfair testimonies.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=3098167591390a821551668d78453710p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=3098167591390a821551668d78453710p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Categories: IEEE Members Only

PrePrint: Superseding Nearest Neighbor Search on Uncertain Spatial Databases

This paper proposes a new problem, called {\em superseding nearest neighbor search}, on uncertain spatial databases, where each object is described by a multidimensional probability density function. Given a query point $q$, an object is a {\em nearest neighbor} (NN) {\em candidate} if it has a non-zero probability to be the NN of $q$. Given two NN candidates $o_1$ and $o_2$, $o_1$ {\em supersedes} $o_2$ if $o_1$ is more likely to be closer to $q$. An object is a {\em superseding nearest neighbor} (SNN) of $q$, if it supersedes all the other NN-candidates. Sometimes no object is able to supersede every other NN candidate. In this case, we return the {\em SNN-core} — the {\em minimum} set of NN-candidates {\em each of which} supersedes {\em all} the NN-candidates outside the SNN-core. Intuitively, the SNN-core contains the best objects, because any object outside the SNN-core is worse than {\em all} the objects in the SNN-core. We show that the SNN-core can be efficiently computed by utilizing a conventional multidimensional index, as confirmed by extensive experiments.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=7519abe8828595d1951e0b2870301c36p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=7519abe8828595d1951e0b2870301c36p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Categories: IEEE Members Only

PrePrint: An UpDown Directed Acyclic Graph Approach for Sequential Pattern Mining

Traditional pattern-growth based approaches for sequential pattern mining derive length-(k+1) patterns based on the projected databases of length-k patterns recursively. At each level of recursion, they grow the length of detected patterns by 1 uni-directionally along the suffix of detected patterns, which needs k levels of recursion to find a length-k pattern. In this paper a novel data structure, UpDown Directed Acyclic Graph (UDDAG), is invented for efficient sequential pattern mining. UDDAG allows bidirectional pattern growth along both ends of detected patterns. Thus a length-k pattern can be detected in ⌊log₂k+1⌋ levels of recursion at best, which results in fewer levels of recursion and faster pattern growth. When minSup is large such that the average pattern length is close to 1, UDDAG and PrefixSpan have similar performance because the problem degrades into frequent item counting problem. However, UDDAG scales up much better. It often outperforms PrefixSpan by almost one order of magnitude in scalability tests. UDDAG is also considerably faster than Spade and LapinSpam. Except for extreme cases, UDDAG uses comparable memory to that of PrefixSpan and less memory than Spade and LapinSpam. Additionally, the special feature of UDDAG enables its extension toward applications involving searching in large spaces.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=7793771ca6e376f66ae920b72fc34d24p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=7793771ca6e376f66ae920b72fc34d24p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Categories: IEEE Members Only

PrePrint: An Information Theoretic Foundation for the Measurement of Discrimination Information

Hitherto, it has not been easy to interpret the meaning of the amount of discrimination information conveyed in a term rationally and explicitly within practical application contexts; it has not been simple to introduce the concept of the extent of semantic relatedness between two terms meaningfully and successfully into scientific discussions. This study is part of an attempt to do this. We attempt to answer two important questions: (1) What is the discrimination information conveyed by a term and how to measure it? (2) What is the relatedness between two terms and how to estimate it? We focus on the first question, and present an in-depth investigation into the discrimination measures based on several information measures. The relatedness measures are then naturally defined according to the individual discrimination measures. Some key points are made for clarifying potential problems arising from using the relatedness measures and solutions are suggested. Two example applications in the contexts of text mining and information retrieval are provided. The aim of this study, of which this paper forms part, is to establish a unified theoretical framework, with MDI (measurement of discrimination information) at the core, for achieving effective MSR (measurement of semantic relatedness).br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=7235034ff0557c79c7eef4ecf41d489ep=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=7235034ff0557c79c7eef4ecf41d489ep=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Categories: IEEE Members Only

PrePrint: Incremental and General Evaluation of Reverse Nearest Neighbors

This paper presents a novel algorithm for Incremental and General Evaluation of continuous Reverse Nearest neighbor queries (IGERN, for short). The IGERN algorithm is general in that it is applicable for both continuous monochromatic and bichromatic reverse nearest neighbor queries. This problem is faced in a number of applications such as enhanced 911 services and in army strategic planning. A main challenge in these problems is to maintain the most up to date query answers as the dataset frequently changes over time. Previous algorithms for monochromatic continuous reverse nearest neighbor queries rely mainly on monitoring at the worst case of six pie regions, whereas IGERN takes a radical approach by monitoring only a single region around the query object. The IGERN algorithm clearly outperforms the state-of-the-art algorithms in monochromatic queries. We also propose a new optimization for the monochromatic IGERN. Furthermore, a filter and refine approach for IGERN (FR-IGERN) is proposed for the continuous evaluation of bichromatic reverse nearest neighbor queries which is an optimized version of our previous approach. The computational complexity of IGERN and FR-IGERN is presented, the correctness of IGERN and FR-IGERN are proved, and experimental analysis using synthetic and real datasets is shown.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=a49113ce37f927fb6dc7ab8ef8fbc242p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=a49113ce37f927fb6dc7ab8ef8fbc242p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Categories: IEEE Members Only

PrePrint: Bregman Divergence Based Regularization for Transfer Subspace Learning

The regularization principals lead approximation schemes to deal with various learning problems, e.g., the regularization of the norm in a reproducing kernel Hilbert space for the ill-posed problem. In this paper, we present a family of subspace learning algorithms based on a new form of regularization, which transfers the knowledge gained in training samples to testing samples. In particular, the new regularization minimizes the Bregman divergence between the distribution of training samples and that of testing samples in the selected subspace, so it boosts the performance when training and testing samples are not independent and identically distributed. To test the effectiveness of the proposed regularization, we introduce it to popular subspace learning algorithms, e.g., Principal Components Analysis (PCA) for cross-domain face modelling; and Fisher’s linear discriminant analysis (FLDA), locality preserving projections (LPP), marginal Fisher’s analysis (MFA), and Discriminative Locality Alignment (DLA) for cross-domain face recognition. Finally, we present experimental evidence on FERET, UMIST, and YALE face image datasets, suggesting that the proposed Bregman divergence based regularization is effective to deal with cross-domain learning problems.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=ac2153e1de1c3deecbe6340995d69f52p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=ac2153e1de1c3deecbe6340995d69f52p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Categories: IEEE Members Only

PrePrint: δ-Presence Without Complete World Knowledge

Advances in information technology, and its use in research, are increasing both the need for anonymized data and the risks of poor anonymization. Previous work presented a new privacy metric, δ-presence, that clearly links the quality of anonymization to the risk posed by inadequate anonymization. It was shown that existing anonymization techniques are inappropriate for situations where δ-presence is a good metric (specifically, where knowing an individual is in the database poses a privacy risk). This article addresses a practical problem with previous work, extending to situations where the data anonymizer is not assumed to have complete world knowledge. The algorithms are evaluated in the context of a real-world scenario, demonstrating practical applicability of the approach.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=930a63eaee90d0bd4fb4d817681d8078p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=930a63eaee90d0bd4fb4d817681d8078p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Categories: IEEE Members Only

PrePrint: Feature Selection Using f-Information Measures in Fuzzy Approximation Spaces

The selection of nonredundant and relevant features of real valued data sets is a highly challenging problem. A novel feature selection method is presented here based on fuzzy-rough sets by maximizing the relevance and minimizing the redundancy of the selected features. By introducing the fuzzy equivalence partition matrix, a novel representation of Shannon's entropy for fuzzy approximation spaces is proposed to measure the relevance and redundancy of features suitable for real valued data sets. The fuzzy equivalence partition matrix also offers an efficient way to calculate many more information measures, termed as f-information measures. Several f-information measures are shown to be effective for selecting nonredundant and relevant features of real valued data sets. This paper compares the performance of different f-information measures for feature selection in fuzzy approximation spaces. Some quantitative indices are introduced based on fuzzy-rough sets for evaluating the performance of proposed feature selection method. The effectiveness of the proposed method, along with a comparison with other methods, is demonstrated on a set of real life data sets.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=792cd502391b716aa5590f1d208f061ap=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=792cd502391b716aa5590f1d208f061ap=1//a
Categories: IEEE Members Only

PrePrint: A Non-Supervised Learning Framework of Human Behavior Patterns Based on Sequential Actions

In designing autonomous service systems such as assistive robots for the aged and the disabled, discovery and prediction of human actions are important and often crucial. Patterns of human behavior, however, involve ambiguity, uncertainty, complexity, and inconsistency caused by physical, logical, and emotional factors, and thus their modeling and recognition are known to be difficult. In this paper, a non-supervised learning framework of human behavior patterns is suggested in consideration of human behavioral characteristics. Our approach consists of two steps. In the first step, a meaningful structure of data is discovered by using Agglomerative Iterative Bayesian Fuzzy Clustering (AIBFC) with a newlyproposed cluster validity index. In the second step, the sequence of actions is learned on the basis of the structure discovered in the first step and by utilizing the proposed Fuzzy-state Q-learning (FSQL) process. These two learning steps are incorporated in an amalgamated framework, AIBFC-FSQL, which is capable of learning human behavior patterns in a non-supervised manner and predicting subsequent human actions. Through a number of simulations with typical benchmark datasets, we show that the proposed learning method outperforms several well-known methods. We further conduct experiments with two challenging real-world databases to demonstrate its usefulness from a practical perspective.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=964d058b2ac9b13dc273dae9d2b719d5p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=964d058b2ac9b13dc273dae9d2b719d5p=1//a
Categories: IEEE Members Only

PrePrint: Probabilistic Topic Models for Learning Terminological Ontologies

Probabilistic topic models were originally developed and utilised for document modeling and topic extraction in Information Retrieval. In this paper we describe a new approach for automatic learning of terminological ontologies from text corpus based on such models. In our approach, topic models are used as efficient dimension reduction techniques, which are able to capture semantic relationships between word-topic and topic-document interpreted in terms of probability distributions. We propose two algorithms for learning terminological ontologies using the principle of topic relationship and exploiting information theory with the probabilistic topic models learned. Experiments with different model parameters were conducted and learned ontology statements were evaluated by the domain experts. We have also compared the results of our method with two existing concept hierarchy learning methods on the same dataset. The study shows that our method outperforms other methods in terms of recall and precision measures. The precision level of the learned ontology is sufficient for it to be deployed for the purpose of browsing, navigation, and information search and retrieval in digital libraries.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=52f6f181dacf6854f94609cb668db94ep=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=52f6f181dacf6854f94609cb668db94ep=1//a
Categories: IEEE Members Only

PrePrint: The Tiled Bitmap Forensic Analysis Algorithm

Tampering of a database can be detected through the use of cryptographically-strong hash functions. Subsequently-applied forensic analysis algorithms can help determine when, what, and perhaps ultimately who and why. This paper presents a novel forensic analysis algorithm, the Tiled Bitmap Algorithm, which is more efficient than prior forensic analysis algorithms. It introduces the notion of a candidate set (all possible locations of detected tampering(s)) and provides a complete characterization of the candidate set and its cardinality. An optimal algorithm for computing the candidate set is also presented. Finally, the implementation of the Tiled Bitmap Algorithm is discussed, along with a comparison to other forensic algorithms in terms of space/time complexity and cost. An example of candidate set generation and proofs of the theorems and lemmata and of algorithm correctness can be found in the appendix.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=6b34ac951830b363d58e59e8d5d0b20bp=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=6b34ac951830b363d58e59e8d5d0b20bp=1//a
Categories: IEEE Members Only

PrePrint: k-Anonymity in the Presence of External Databases

The concept of k-anonymity has received considerable attention due to the need of several organizations to release microdata without revealing the identity of individuals. Although all previous k-anonymity techniques assume the existence of a public database (PD) that can be used to breach privacy, none utilizes PD during the anonymization process. Specifically, existing generalization algorithms create anonymous tables using only the microdata table (MT) to be published, independently of the external knowledge available. This omission leads to high information loss. Motivated by this observation we first introduce the concept of k-join-anonymity (KJA), which permits more effective generalization by exploiting the records of PD to reduce the information loss. Then, we propose two methodologies for adapting k-anonymity algorithms to their KJA counterparts. The first generalizes the combination of MT and PD, under the constraint that each group should contain at least one tuple of MT (otherwise, the group is useless and discarded). The second anonymizes MT, and then refines the resulting groups using PD. Finally, we evaluate the effectiveness of our contributions with an extensive experimental evaluation using real and synthetic datasets.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=272512237768c6ef3547d6e0d3978842p=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=272512237768c6ef3547d6e0d3978842p=1//a
Categories: IEEE Members Only

PrePrint: A Distance Measure Approach to Exploring the Rough Set Boundary Region for Attribute Reduction

Feature Selection (FS) or Attribute Reduction techniques are employed for dimensionality reduction and aim to select a subset of the original features of a dataset which are rich in the most useful information. The benefits of employing FS techniques include improved data visualisation and transparency, a reduction in training and utilisation times and potentially, improved prediction performance. Many approaches based on rough set theory up to now, have employed the dependency function, which is based on lower approximations as an evaluation step in the FS process. However, by examining only that information which is considered to be certain and ignoring the boundary region, or region of uncertainty, much useful information is lost.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=a1e7588ac133955eb979b838bb38c40bp=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=a1e7588ac133955eb979b838bb38c40bp=1//a
Categories: IEEE Members Only

PrePrint: Building a Rule-Based Classifier --- a Fuzzy-Rough Set Approach

The fuzzy-rough set (FRS) methodology, as a useful tool to handle discernibility and fuzziness, has been widely studied. Some researchers studied on the rough approximation of fuzzy sets while some others focused on studying one application of FRS: attribute reduction (i.e. feature selection). However, constructing classifier by using FRS, as another application of FRS, has been less studied. In this paper, we build a rule-based classifier by using one generalized FRS model after proposing a new concept named ‘consistence degree’ which is used as the critical value to keep the discerniblity information invariant in the processing of rule induction. First, we generalized the existing FRS to a robust model with respect to misclassification and perturbation by incorporating one controlled threshold into knowledge representation of FRS. Second, we propose a concept named ‘consistence degree’ and by the strict mathematical reasoning we show this concept is reasonable as a critical value to reduce redundant attribute-values in database. By employing this concept, we then design a discernibility vector to develop the algorithms of rule induction. The induced rule set can function as a classifier. Finally, the experimental results show that the proposed rule-based classifier is feasible and effective on noisy data.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=569774a472a0f7d87cbebc30062c91b5p=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=569774a472a0f7d87cbebc30062c91b5p=1//a
Categories: IEEE Members Only

PrePrint: Incremental Maintenance of 2-hop Labeling of Large Graphs

Recent interests on XML, Semantic Web, and Web ontology, among other topics, have sparked a renewed interest on graph-structured databases. A fundamental query on graphs is the reachability test of nodes. Recently, 2-hop labeling has been proposed to index large collections of XML and/or graphs for efficient reachability tests. However, there has been few work on updates of 2-hop labeling. This is compounded by the fact that data may change over time. In response to these, this paper studies the incremental maintenance of 2-hop labeling. We identify the main reason for the inefficiency of updates of existing 2-hop labels. We propose three updatable 2-hop labelings, hybrids of 2-hop labeling, and their incremental maintenance algorithms. The proposed 2-hop labeling is derived from graph connectivity, as opposed to SET COVER which is used by all previous work. Our experimental evaluation illustrates the space efficiency and update performance of various kinds of 2-hop labeling. Our results show that the incremental maintenance algorithm can be two orders of magnitude faster than previous methods and the size of our 2-hop labeling can be comparable to the existing 2-hop labeling. We conclude that there is a natural way to spare some index size for update performance in 2-hop labeling.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=372ec4bf17ed692fbcb0331f670fea30p=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=372ec4bf17ed692fbcb0331f670fea30p=1//a
Categories: IEEE Members Only
Syndicate content