PrePrint: A Sparse Learning Machine for High-Dimensional Data with Application to Microarray Gene Analysis

Extracting features from high-dimensional data is a critically important task for pattern recognition and machine learning applications. High-dimensional data typically have much more variables than observations, and contain significant noise, missing components, or outliers. Features extracted from high-dimensional data need to be discriminative, sparse, and can capture essential characteristics of the data. In this paper, we present a way to constructing multivariate features and then classify the data into proper classes. The resulting small subset of features is nearly the best in the sense of Greenshtein's persistence, however, the estimated feature weights may be biased. We take a systematic step to correct the biases. We use conjugate gradient based primal-dual interior-point techniques for large-scale problems. We apply our procedure to microarray gene analysis. The effectiveness of our method is confirmed by experimental results.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=c6b0d2ceafd37073b8639dc3ff6573abp=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=c6b0d2ceafd37073b8639dc3ff6573abp=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=c6b0d2ceafd37073b8639dc3ff6573ab style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: Data Mining on DNA Sequences of Hepatitis B Virus

In this study, a data mining framework which includes molecular evolution analysis, clustering, feature selection, classifier learning and classification, is introduced. Our research group has collected HBV DNA sequences, either genotype B or C, from over 200 patients specifically for this project. In the molecular evolution analysis and clustering, three subgroups have been identified in genotype C and a clustering method has been developed to separate the subgroups. In the feature selection process, potential markers are selected based on Information Gain for further classifier learning. Then meaningful rules are learnt by our algorithm called the Rule Learning which is based on Evolutionary Algorithm. Also, a new classification method by Nonlinear Integral has been developed. Good performance of this method comes from the use of the fuzzy measure and the relevant nonlinear integral. The nonadditivity of the fuzzy measure reflects the importance of the feature attributes as well as their interactions. These two classifiers give explicit information on the importance of the individual mutated sites and their interactions towards the classification (potential causes to liver cancer in our case). A thorough comparison study of these two methods with existing methods is detailed.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=6788101bd48a123c676799880154416cp=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=6788101bd48a123c676799880154416cp=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=6788101bd48a123c676799880154416c style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: A Metric on the Space of Reduced Phylogenetic Networks

Phylogenetic networks are leaf-labeled, rooted, acyclic, directed graphs, that model reticulate evolutionary histories. Several measures for quantifying the topological dissimilarity between two phylogenetic networks have been devised for various classes of phylogenetic networks. A biologically-motivated class of phylogenetic networks, namely reduced phylogenetic networks, was recently introduced. None of the existing measures is a metric on the space of reduced phylogenetic networks. In this paper, we provide a polynomiallycomputablebr clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=3922845ce88b844a3b3074ca31802866p=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=3922845ce88b844a3b3074ca31802866p=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=3922845ce88b844a3b3074ca31802866 style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: Information-Theoretic Model of Evolution over Protein Communication Channel

In this paper, we propose a communication model of evolution and investigate its information-theoretic bounds. The process of evolution is modeled as the retransmission of information over a protein communication channel, where the transmitted message is the organism’s proteome encoded in the DNA. We compute the capacity and the rate-distortion functions of the protein communication system for the three domains of life: Archaea, Bacteria and Eukaryotes. The tradeoff between the transmission rate and the distortion in noisy protein communication channels is analyzed. As expected, comparison between the optimal transmission rate and the channel capacity indicates that the biological fidelity does not reach the Shannon optimal distortion. However, the relationship between the channel capacity and rate distortion achieved for different biological domains provides tremendous insight into the dynamics of the evolutionary processes of the three domains of life. We rely on these results to provide a model of genome sequence evolution based on the two major evolutionary driving forces: mutations and unequal crossovers.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=7ae7a12d8f6921eee5c5211cdf3c406ap=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=7ae7a12d8f6921eee5c5211cdf3c406ap=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=7ae7a12d8f6921eee5c5211cdf3c406a style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: Bayesian Models and Algorithms for Protein beta-Sheet Prediction

Prediction of the three-dimensional structure greatly benefits from the information related to secondary structure, solvent accessibility, and non-local contacts that stabilize a protein's structure. Prediction of such components is vital to our understanding of the structure and function of a protein. In this paper, we address the problem of beta-sheet prediction. We introduce a Bayesian approach for proteins with six or less beta-strands, in which we model the conformational features in a probabilistic framework. To select the optimum architecture, we analyze the space of possible conformations by efficient heuristics. Furthermore, we employ an algorithm that finds the optimum pairwise alignment between beta-strands using dynamic programming. Allowing any number of gaps in an alignment enables us to model beta-bulges more effectively. Though our main focus is proteins with six or less beta-strands, we are also able to perform predictions for proteins with more than six beta-strands by combining the predictions of BetaPro with the gapped alignment algorithm. We evaluated the accuracy of our method and BetaPro. We performed a 10-fold cross validation experiment on the BetaSheet916 set and we obtained significant improvements in the prediction accuracy.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=905dda8a2d6878dbad2f75616d70a971p=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=905dda8a2d6878dbad2f75616d70a971p=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=905dda8a2d6878dbad2f75616d70a971 style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: RDCurve: A Nonparametric Method to Evaluate the Stability of Ranking Procedures

Great concerns have been raised about the reproducibility of gene signatures based on high-throughput techniques such as microarray. Studies analyzing similar samples often report poorly overlapping results, and the p-value usually lacks biological context. We propose a non-parametric Re-Discovery-Curve (RDCurve) method, to estimate the frequency of rediscovery of gene signature identified. Given a ranking procedure and a dataset with replicated measurements, the RDCurve bootstraps the dataset and repeatedly applies the ranking procedure, selects a subset of k important genes, and estimates the probability of rediscovery of the selected subset of genes. We also propose a permutation scheme to estimate the confidence band under the Null hypothesis for the significance of the RDCurve. The method is non-parametric and model independent. With the RDCurve we can assess the signal-noise ratio of the data, compare the performance of ranking procedures in term of their expected rediscovery rates, and choose the number of genes to be reported.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=de5c9f4165385c776946785b1716438dp=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=de5c9f4165385c776946785b1716438dp=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=de5c9f4165385c776946785b1716438d style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: Semi-Markov Models for Brownian Dynamics Simulation Algorithms in Biological Ion Channels

Constructing accurate computational models that explain how ions permeate through a biological ion channel is an important problem in biophysics and drug design. Brownian dynamics simulations are large scale interacting particle computer simulations for modeling ion channel permeation but can be computationally prohibitive. In this paper we show the somewhat surprising result that a small dimensional semi-Markov model can generate events (such as conduction events, dwell times at binding sites in the protein) that are statistically indistinguishable from Brownian dynamics computer simulation. This approach enables the use of extrapolation techniques to predict channel conduction when performing the actual Brownian dynamics simulation is computationally intractable. Numerical studies on the simulation of gramicidin A ion channels are presented.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=654d1d08b1f1737ccc40f4dcfaaf9c68p=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=654d1d08b1f1737ccc40f4dcfaaf9c68p=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=654d1d08b1f1737ccc40f4dcfaaf9c68 style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: Multidimensional Profiling of Cell Surface Proteins and Nuclear Markers

Cell membrane proteins play an important role in tissue architecture and cell-cell communication. We hypothesize that segmentation and multidimensional characterization of the distribution of cell membrane proteins, on a cell-by-cell basis, enable improved classification of treatment groups and identify important characteristics that can otherwise be hidden. We have developed a series of computational steps to (i) delineate cell membrane protein signals and associate them with specific nuclei; (ii) compute a coupled representation of the multiplexed DNA content with membrane proteins; (iii) rank computed features associated with such a multidimensional representation; (iv) visualize selected features for comparative evaluation; and (v) discriminate between treatment groups in an optimal fashion. The novelty of our method is in the segmentation of the membrane signal and the multidimensional representation of phenotypes on a cell-by-cell basis. To test the utility of the new method, the proposed computational steps were applied to images of cells that have been irradiated with different radiation qualities in the presence and absence of other small molecules. These samples are labeled for their DNA content and E-cadherin membrane proteins. We demonstrate that multidimensional representations of cell-by-cell phenotypes improve predictive and visualization capabilities among different treatment groups, and identify hidden variables.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=a589ccc8b62e91997b9c6ca96c656c1cp=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=a589ccc8b62e91997b9c6ca96c656c1cp=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=a589ccc8b62e91997b9c6ca96c656c1c style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: Quartets MaxCut: A Divide and Conquer Quartets Algorithm

Supertree methods are used to construct a large tree over a large set of taxa, from a set of small trees over overlapping subsets of the complete taxa set. Since accurate reconstruction methods are currently limited to a maximum of few dozens of taxa, the use of a supertree method in order to construct the tree of life is inevitable. Perhaps the simplest version of this that is widely applicable, yet quite challenging, is quartet based reconstruction. This problem lies at the root of many tree reconstruction methods and theoretical as well as experimental results have been reported. Nevertheless, dealing with false, conflicting quartets remains problematic. In this paper, we describe an algorithm for constructing a tree from a set of input quartet trees even with a significant fraction of errors. Our algorithm is based on a divide and conquer algorithm where our divide step uses a semi- definite formulation of max cut.We remark that this builds on previous work of ours [28] for piecing together trees from rooted triples. The recursion for quartets, however, is more complicated in that even with completely consistent quartets the problem is NP-hard.This complexity leads to several issues and some solutions of possible independent interest.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=8c477b8920dc5c73c15af1adbd26afaap=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=8c477b8920dc5c73c15af1adbd26afaap=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=8c477b8920dc5c73c15af1adbd26afaa style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: On the Complexity of uSPR Distance

We show that subtree prune and regraft (uSPR) distance on unrooted trees is fixed parameter tractable with respect to the distance. We also make progress on a conjecture of Steel on the preservation of uSPR distance under chain reduction, improving on lower bounds of Hickey.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=709b83de1ca185bfbc37e2a03d7b2fbap=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=709b83de1ca185bfbc37e2a03d7b2fbap=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=709b83de1ca185bfbc37e2a03d7b2fba style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: Using Gaussian Process with Test Rejection to Detect T-Cell Epitopes in Pathogen Genomes

A major challenge in the development of peptide-based vaccines is finding the right immunogenic element, with efficient and long-lasting immunisation effects, from large potential targets encoded by pathogen genomes. Computer models are convenient tools for scanning pathogen genomes to pre-select candidate immunogenic peptides for experimental validation. Current methods predict many false positives resulting from a low prevalence of true positives. We develop a test reject method based on the prediction uncertainty estimates determined by Gaussian process regression. This method filters false positives amongst predicted epitopes from a pathogen genome. The performance of stand-alone Gaussian process regression is compared to other state-of-the-art methods using cross-validation on eleven benchmark data sets. The results show that the Gaussian process method has the same accuracy as the top performing algorithms. The combination of Gaussian process regression with the proposed test reject method is used to detect true epitopes from the Vaccinia virus genome. The test rejection increases the prediction accuracy by reducing the number of false positives without sacrificing the method's sensitivity. We show that the Gaussian process in combination with test rejection is an effective method for prediction of T-cell epitopes in large and diverse pathogen genomes, where false positives are of concern.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=b64bf79ef063112e6afbfaeedf8a0a42p=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=b64bf79ef063112e6afbfaeedf8a0a42p=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=b64bf79ef063112e6afbfaeedf8a0a42 style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: CollHaps: A Heuristic Approach to Haplotype Inference by Parsimony

Haplotype data play a relevant role in several genetic studies, e.g. mapping of complex disease genes, drug design and evolutionary studies on populations. However, the experimental determination of haplotypes is expensive and time-consuming. This motivates the increasing interest in techniques for inferring haplotype data from genotypes, which can instead be obtained quickly and economically. Several such techniques are based on the maximum parsimony principle, which has been justified by both experimental results and theoretical arguments. However, the problem of haplotype inference by parsimony was shown to be NP-hard, thus limiting the applicability of exact parsimony-based techniques to relatively small datasets. In this paper we introduce collapse rule, a generalization of the well-known Clark's rule, and describe a new heuristic algorithm for haplotype inference (implemented in a program called CollHaps), based on parsimony and the iterative application of collapse rules. The performance of CollHaps is tested on several datasets. The experiments show that CollHaps enables the user to process large datasets obtaining very "parsimonious" solutions in short processing times. They also show a correlation, especially for large datasets, between parsimony and correct reconstruction, supporting the validity of the parsimony principle to produce accurate solutions.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=c5b6011b594aa6bd1b4917fef55cf28fp=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=c5b6011b594aa6bd1b4917fef55cf28fp=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=c5b6011b594aa6bd1b4917fef55cf28f style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: Human-Readable Rule Generator for Integrating Amino Acid Sequence Information and Stability of Mutant Proteins

Most of the bioinformatics tools developed for predicting mutant protein stability appear as a black box and the relationship between amino acid sequence/structure and stability is hidden to the users. We have addressed this problem and developed a human-readable rule generator for integrating the knowledge of amino acid sequence and experimental stability change upon single mutation. Using information about the original residue, substituted residue and three neighboring residues, classification rules have been generated to discriminate the stabilizing and destabilizing mutants and to explore the basis for experimental data. These rules are human readable and hence the method enhances the synergy between expert knowledge and computational system. Furthermore, the performance of the rules has been assessed on a non-redundant dataset of 1859 mutants and we obtained an accuracy of 80% using cross-validation. The results showed that the method could be effectively used as (i) a tool for knowledge discovery as well as (ii) for predicting mutant protein stability. We have developed a web for classification rule generator and it is freely available at http://bioinformatics.myweb.hinet.net/irobot.htm.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://www.pheedo.com/click.phdo?s=d0e631dd8301fa76c1057519a61de70ep=1img alt= style=border: 0; border=0 src=http://www.pheedo.com/img.phdo?s=d0e631dd8301fa76c1057519a61de70ep=1//a img src=http://www.pheedo.com/feeds/tracker.php?i=d0e631dd8301fa76c1057519a61de70e style=display: none; border=0 height=1 width=1 alt=/
Categories: IEEE Members Only

PrePrint: A Theoretical and Empirical Study of Search-Based Testing: Local, Global and Hybrid Search

Search based optimization techniques have been applied to structural software test data generation since 1992, with a recent upsurge in interest and activity within this area. However, despite the large number of recent studies of applicability of different search based optimization approaches, there has been very little theoretical analysis of the types of testing problem for which these techniques are well-suited. There are also few empirical studies that present results for larger programs. This paper presents a theoretical exploration of the most widely studied approach, the global search technique embodied by Genetic Algorithms. It also presents results from a large empirical study that compare the behaviour of both global and local search based optimization on real world programs. The results of this study reveal that there exist cases of test data generation problems that suit each algorithm, thereby suggesting that a hybrid global-local search (a Memetic Algorithm) may be appropriate. The paper presents a Memetic Algorithm along with further empirical results studying its performance.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=caecf08447182347edaf8b0a7e982a75p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=caecf08447182347edaf8b0a7e982a75p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/

PrePrint: Learning a Metric for Code Readability

In this paper, we explore the concept of code readability and investigate its relation to software quality. With data collected from 120 human annotators, we derive associations between a simple set of local code features and human notions of readability. Using those features, we construct an automated readability measure and show that it can be 80% effective, and better than a human on average, at predicting readability judgments. Furthermore, we show that this metric correlates strongly with three measures of software quality: code changes, automated defect reports, and defect log messages. We measure these correlations on over 2.2 million lines of code, as well as longitudinally, over many releases of select projects. Finally, we discuss the implications of this study on programming language design and engineering practice. For example, our data suggests that comments, in of themselves, are less important than simple blank lines to local judgments of readability.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=c48fdfd22f9f56b62b37c1520fe4cc19p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=c48fdfd22f9f56b62b37c1520fe4cc19p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/

PrePrint: How Developers' Experience and Ability Influence Web Application Comprehension Tasks Supported by UML Stereotypes: A Series of Four Experiments

In recent years, several design notations have been proposed to model domain-specific applications or reference architectures. In particular, James Conallen has proposed the UML Web Application Extension (WAE): a UML extension to model Web applications. The aim of our empirical investigation is to test whether the usage of the Conallen notation supports comprehension and maintenance activities with significant benefits, and whether such benefits depend on developers' ability and experience. This paper reports and discusses the results of a series of four experiments performed in different locations and with subjects possessing different experience#x2014;namely undergraduate students, graduate students, and research associates#x2014;and different ability levels. The experiments aim at comparing performances of subjects in comprehension tasks where they have the source code complemented either by standard UML diagrams or by diagrams stereotyped using the Conallen notation. Results indicate that, although in general it is not possible to observe any significant benefit associated with the usage of stereotyped diagrams, the availability of stereotypes reduces the gap between subjects with low skill or experience and highly skilled or experienced subjects. Organizations employing developers with low experience can achieve a significant performance improvement by adopting stereotyped UML diagrams for Web applications.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=ad593437b7c8392fad3d9ee4bb86a215p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=ad593437b7c8392fad3d9ee4bb86a215p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/

PrePrint: Generating Event Sequence-Based Test Cases Using GUI Run-Time State Feedback

This paper presents a fully automatic model-driven technique to generate test cases for Graphical user interface-based applications (GUIs). The technique uses feedback from the execution of a "seed test suite," which is generated automatically using an existing structural event-interaction graph model of the GUI. During its execution, the run-time effect of each GUI event on all other events pinpoints event-semantic interaction (ESI) relationships, which are used to automatically generate new test cases. Two studies on eight applications demonstrate that the feedback-based technique (1) is able to significantly improve existing techniques and help identify serious problems in the software and (2) the ESI relationships captured via GUI state yield test suites that most often detect more faults than their code-, event-, and event-interaction-coverage equivalent counterparts.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=91e88fd2959ceee8b9778f73379249e1p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=91e88fd2959ceee8b9778f73379249e1p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/

PrePrint: Better Debugging via Output Tracing and Callstack-Sensitive Slicing

Debugging often involves (1) finding the point of failure (the first statement that produces bad output), and (2) finding and fixing the actual bug. Print statements and debugger breakpoints can help with step 1. Slicing the program back from values used at the point of failure can help with step 2. However, neither approach is ideal: debuggers and print statements can be clumsy and time-consuming, and backward slices can be almost as large as the original program. This paper addresses both problems. We present icallstack-sensitive slicing/i, which reduces slice sizes by leveraging the series of calls active when a program fails. We also show how slice intersections may further reduce slice sizes. We then describe a set of tools that identify points of failure for programs that produce bad output. Finally, we apply our point-of-failure tools to a suite of buggy programs and evaluate callstack-sensitive slicing and slice intersection as applied to debugging. Callstack-sensitive slicing is very effective: on average, a callstack-sensitive slice is about 0.31 times the size of the corresponding full slice, down to just 0.06 times in the best case. Slice intersection is less impressive on average, but may sometimes prove useful in practice.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=214e56fa863351eeb77adf38ae1331fdp=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=214e56fa863351eeb77adf38ae1331fdp=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/

PrePrint: Grammar Recovery from Parse Trees and Metrics-Guided Grammar Refactoring

Many software development tools that assist with tasks such as testing and maintenance are specific to a particular development language and require a parser for that language. Because a grammar is required to develop a parser, construction of these software development tools is dependent upon the availability of a grammar for the development language. However, a grammar is not always available for a language, and in these cases, acquiring a grammar is the most difficult, costly, and time-consuming phase of tool construction. In this paper we describe a methodology for grammar recovery from a hard-coded parser. Our methodology comprises manual instrumentation of the parser, a technique for automatic grammar recovery from parse trees, and a semi-automatic metrics-guided approach to refactoring an iterative grammar to obtain a recursive grammar. We present the results of a case study in which we recover and refactor a grammar from version 4.0.0 of the GNU C++ parser and then refactor the recovered grammar using our metrics-guided approach. Additionally, we present an evaluation of the recovered and refactored by comparing it to the ISOC++98 grammar.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=198c959454fe4b91a1d765651ba6b2e6p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=198c959454fe4b91a1d765651ba6b2e6p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/

PrePrint: Engineering A Sound Assertion Semantics for the Verifying Compiler

The Verifying Compiler (VC) project is a core component of the Dependable Systems Evolution Grand Challenge. The VC offers the promise of automatically proving that a program or component is correct, where correctness is defined by program assertions. While several VC prototypes exist, all adopt a semantics for assertions that is unsound. This paper presents a consolidation of VC requirements analysis activities that, in particular, brought us to ask targeted VC customers what kind of semantics they wanted. Taking into account both practitioners#8217; needs and current technological factors, we offer recovery of soundness through an adjusted definition of assertion validity that matches user expectations and can be implemented practically using current prover technology. For decades there have been debates concerning the most appropriate semantics for program assertions. Our contribution here is unique in that we have applied fundamental software engineering techniques by asking primary stakeholders what they want and based on this, proposed a means of efficiently realizing the semantics stakeholders want using standard tools and techniques. We describe how support for the new semantics has been added to ESC/Java2, one of the most fully developed VC prototypes. Case studies demonstrate the effectiveness of the new semantics at uncovering previously indiscernible specification errors.br clear=both style=clear: both;/ br clear=both style=clear: both;/ a href=http://ads.pheedo.com/click.phdo?s=6f04eb1ce481ed206c0f1934939431b3p=1img alt= style=border: 0; border=0 src=http://ads.pheedo.com/img.phdo?s=6f04eb1ce481ed206c0f1934939431b3p=1//a img alt= height=0 width=0 border=0 style=display:none src=http://a.rfihub.com/eus.gif?eui=2225/
Syndicate content