Computer Science and Software Engineering Technical Reports

http://hdl.handle.net/2374.MIA/186

A Detailed Stylometric Investigation of the İnce Memed Tetralogy*
Patton, Jon; Can, Fazli
We analyze four İnce Memed novels of Yaşar Kemal using six style markers: “most frequent words,” “syllable counts,” “word type -or part of speech- information,” “sentence length in terms of words,” “word length in text,” and “word length in vocabulary.” For analysis we divide each novel into five thousand word text blocks and count the frequencies of each style marker in these blocks. The principal component analysis results show clear separation between the first two and the last two volumes; the blocks of the first two novels are also distinguishable from each other. The blocks of the last two volumes are intermixed. This parallels the fact that the author planned the last two volumes as three separate novels, but later condensed them into two. The style markers showing the best separation are “most frequent words” and “sentence length”. We use stepwise discriminant analysis to determine the best discriminators of each style marker and then use them in cross validation. The related results concur with the principal component analysis results. For example, the cross validation results obtained by “most frequent words” and “sentence length,” respectively, provide 87% and 81% correct classification of the text blocks to their corresponding volumes. Further investigation based on multiple analysis of variance (MANOVA) reveals how the attributes of each style marker group distinguish among the volumes.
Effectiveness Assessment of the Cover Coefficient Based Clustering Methodology
Can, Fazli; Ozkarahan, Esen
An algorithm for document clustering is introduced. The basic concept of the algorithm, the Cover Coefficient (CC) concept, provides means of estimating the number of clusters within a document database and relates indexing and clustering analytically. The CC concept is also used to identify the cluster seeds and to form clusters with these seeds. It is shown that the complexity of the clustering process is very low. The retrieval experiments show that the IR effectiveness of the algorithm is compatible with a very demanding complete linkage clustering method which is known to have good performance. The experiments also show that the algorithm is 15.1 to 63.5 (with an average of 47.5) percent better than four other clustering algorithms in cluster-based information retrieval. The experiments have validated the indexing-clustering relationships and the complexity of the algorithm, and have shown improvements in retrieval effectiveness. In the experiments two document databases are used: TODS and INSPEC; the latter is a common database with 12,684 documents.
Towards the Realization of a DSML for Machine Learning: A Baseball Analytics Use Case
Koseler, Kaan; Stephan, Matthew
Using machine learning (ML) for big data is challenging, requiring specialized knowledge of the domain, learning algorithms, and software engineering. To demonstrate the viability of model-driven engineering in the ML domain we consider an ML use case of baseball analytics by extending and applying an existing, but untested, ML domain specific modeling language (DSML). Additionally, we aim to make ML software development more accessible and formalized, and help facilitate future research in this area. This paper describes our plan, initial work, and anticipated contributions in extending, testing, and validating this DSML, and implementing a code generation scheme that is targeted at a binary classification baseball problem. Keywords: Model driven engineering, Domain specific modeling languages, Machine Learning, Analytics, Baseball
A Survey of Baseball Machine Learning: A Technical Report
Koseler, Kaan; Stephan, Matthew
Statistical analysis of baseball has long been popular, albeit only in limited capacity until relatively recently. The recent proliferation of computers has added tremendous power and opportunity to this field. Even an amateur baseball fan can perform types of analyses that were unimaginable decades ago. In particular, analysts can easily apply machine learning algorithms to large baseball data sets to derive meaningful and novel insights into player and team performance. These algorithms fall mostly under three problem class umbrellas: regression, binary classification, and multiclass classification. Professional teams have made extensive use of these algorithms, funding analytics departments within their own organizations and creating a thriving multi-million dollar industry. In the interest of stimulating new research and for the purpose of serving as a go-to resource for academic and industrial analysts, we have performed a systematic literature review of machine learning algorithms and approaches that have been applied to baseball analytics. We also provide our insights on possible future applications. We categorize all the approaches we encountered during our survey, and summarize our findings in two tables. We find that two algorithms dominated the literature: 1) Support Vector Machines for classification problems and 2) Bayesian Inference for both classification and regression problems. These algorithms are often implemented manually, but can also be easily utilized by employing existing software, such as WEKA or the Scikit-learn Python library. We speculate that the current popularity of neural networks in general machine learning literature will soon carry over into baseball analytics, although we found relatively fewer existing articles utilizing this approach when compiling this report.
Application of Object-oriented Programming in Simulation: A Simulation of Case Study Using Microsoft Visual C++
(1995-08-01) Chen, Hong
In this thesis, a prototype simulation environment is introduced. Simulation has always been important for systems analysis. The original idea of this thesis stems from the fact that modern analysis requires more flexible simulation programming tools. Six simulation classes are implemented in the thesis to support simple simulation cases, and Microsoft Visual C++ window classes are used to build a user-friendly interface. A simulation output class is also implemented to conduct simple simulation output analysis. Based on this work, more classes and features can be added to make the simulation environment more powerful, so that it can support different simulation situations.
System Programming - The Human and the Machine
(1991-12-01) Kline, Michael
The purpose of this paper is to document my experiences in planning, generating, and modifying the IBM VM/SP operating system (systems programming), survey literature on systems programming, and to draw conclusions as to what makes a successful systems programming experience. I will explore the skills necessary for the systems programmer to perform the tasks, as well as discuss aspects of the system itself (hardware, software, and documentation) that affect the success of any systems programming effort. This work is intended to serve as a case study of a VM/SP systems programmer working on WISPcompatible hardware. Judgments as to how these skills and conclusions may apply to other platforms are left to the reader.
HypIR: Hypertext Based Information Retrieval
(1992-08-01) Lee, Yuan; Can, Fazli
Information Retrieval (IR), which is also known as text or document retrieval, is the process of locating and retrieving documents that are relevant to the user queries. In hypertext environments, document databases are organized as a network of nodes which are interconnected by various types of links. This study introduces a hypertext-based text retrieval system, HypIR. In HypIR, the semantic relationships among documents are obtained using a clustering algorithm. A new approach providing the advantages of system maps and history list is introduced to prevent the user from being lost in the IR hyperspace. The paper presents the underlying concepts and implementation details. HypIR is based on the object-oriented paradigm and its execution platform is HyperCard.
Visual Programming: Concepts and Implementations
(1994-08-01) Howard, Elizabeth
The computing environment has changed dramatically since the advent of the computer. Enhanced computer graphics and sheer processing power have ushered in a new age of computing. User interfaces have advanced from simple line entry to powerful graphical interfaces. With these advances, computer languages are no longer forced to be sequentially and textually-based. A new programming paradigm has evolved to harness the power of today's computing environment - visual programming. Visual programming provides the user with visible models which reflect physical objects. By connecting these visible models to each other, an executable program is created. By removing the inherent abstractions of textual languages, visual programming could lead computing into a new era.
An Empirical Investigation of Four Strategies for Serializing Schedules in Transaction Processing
(1992-12-01) Johnson, Terri
A database management system (DBMS) is a very large program that allows users to create and maintain databases. A DBMS has many capabilities. This study will focus on the capability known as transaction management, the capability to provide correct, concurrent access to the database by many users at the same time. If a DBMS did not provide transaction management, livelocks, deadlocks, and non-serializable schedules could occur. A livelock can occur when a transaction is waiting on a locked data item, and another transaction appears. After the data item is unlocked, the second transaction locks the data item, which causes the first transaction to continue waiting. Conceivably, the first transaction could wait indefinitely to lock the data item. This situation is called livelock. Deadlock is a situation in which each member of a set of two or more transactions is waiting to lock an item currently locked by some other transaction in the set. None of the transactions can proceed, so they all wait indefinitely. A schedule is serial if for every pair of transactions, all of the operations of one transaction execute before any of the operations of the other transaction. A schedule is serializable if its effect on the database is the same as some serial execution of the same set of transactions. A schedule is nonserializable if its effect on the database is not equivalent to that of any serial schedule which processes the same transactions. The scheduler is a component of the DBMS, and it is responsible for resolving any livelocks, deadlocks, or non-serializable schedules that occur. This study looks specifically at non-serializable schedules. There are many methods by which the scheduler can serialize non-serializable schedules. This study proposes and examines four strategies to detect and resolve non-serializable schedules. Computer simulation is used to examine the four strategies. 
These strategies reduce a non-serializable schedule to a serializable or a serial schedule, thus eliminating the possibility of incorrectly updating data items within a database. It is shown experimentally that, of the four strategies, the best is the one that delays the transaction that has executed the fewest steps by the time non-serializability is detected.
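The definitions in this abstract can be illustrated with a minimal Python sketch. It is not one of the four strategies studied in the thesis. It checks whether a schedule is serial, and it uses the standard textbook precedence-graph test for conflict serializability, which is a sufficient (not necessary) condition for the effect-based definition given above. The schedule encoding as `(transaction, action, item)` tuples and all function names are assumptions made for illustration.

```python
# A schedule is a list of (transaction_id, action, item) steps,
# where action is "r" (read) or "w" (write). This encoding is assumed.

def is_serial(schedule):
    """True if every transaction's steps form one contiguous run."""
    finished = set()
    current = None
    for txn, _action, _item in schedule:
        if txn != current:
            if txn in finished:
                return False  # txn resumed after another transaction ran
            if current is not None:
                finished.add(current)
            current = txn
    return True


def precedence_edges(schedule):
    """Edge (a, b) means a step of a conflicts with a later step of b."""
    edges = set()
    for i, (t1, a1, x1) in enumerate(schedule):
        for t2, a2, x2 in schedule[i + 1:]:
            # Two steps conflict if they come from different transactions,
            # touch the same item, and at least one of them is a write.
            if t1 != t2 and x1 == x2 and "w" in (a1, a2):
                edges.add((t1, t2))
    return edges


def is_conflict_serializable(schedule):
    """True if the precedence graph has no cycle."""
    edges = precedence_edges(schedule)
    remaining = {txn for txn, _action, _item in schedule}
    while remaining:
        # Transactions with no incoming edge from a remaining transaction.
        sources = {
            n for n in remaining
            if not any(dst == n and src in remaining for src, dst in edges)
        }
        if not sources:
            return False  # every remaining node lies on or behind a cycle
        remaining -= sources
    return True
```

For example, the interleaving `[("T1", "r", "x"), ("T2", "w", "x"), ("T1", "w", "x")]` is neither serial nor conflict serializable, because T1 must both precede and follow T2.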
Node Re-Usability in Structured Hypertext Systems
(1992-08-01) Abdalla, Omer; Can, Fazli
When the size of a graph-based hyperdocument exceeds a certain limit, the graph structure gets complicated and causes navigation and document management problems. A simple solution for this problem is the structuring of the hyperdocument into several smaller units. In this approach each unit contains nodes that share common properties and their link structures. Smaller, more manageable networks (called webs) which have their own, less complex graph structures are the result. In this paper we propose a model for hypertext systems which allows hyperdocument structuring using webs. The model demonstrates node re-usability, which becomes essential as it is very likely that the smaller units created will share nodes. The implementation details of a hypertext authoring system, HypAS, based on the proposed model are also provided.
Analysis of Signature Generation Schemes for Multiterm Queries In Partitioned Signature File Environments
(1993-05-01) Aktug, Deniz; Can, Fazli
Our analysis explores the performance of three superimposed signature generation schemes as they are applied to a dynamic signature file organization based on linear hashing: Linear Hashing with Superimposed Signatures (LHSS). The first scheme (SM) lets all terms set the same number of bits, whereas the second and third methods (MMS and MMM) emphasize the terms with high discriminatory power. In addition, MMM considers the probability distribution of the number of query terms. The main contribution of the study is the combination of signature generation and signature file organization concepts together with the relaxation of the single term query and uniform frequency assumptions. The derivation of the performance evaluation formulas is provided, as well as the analysis of various experimental settings. Results indicate that MMM outperforms the others as terms become more distinctive in their discriminatory power. MMM accomplishes the highest savings in retrieval efficiency for the high query weight case. We also discuss the applicability of the derivations to other partitioned signature organizations, providing a detailed analysis of Fixed Prefix Partitioning (FPP) as an example. Finally, an approximate performance evaluation formula that works for both FPP and LHSS is modified to account for the multiterm case.
Interface-based Programming Assignments and Automatic Grading of Java Programs
(2006-01-01) Helmick, Michael
AutoGrader is a framework developed at Miami University for the automatic grading of student programming assignments written in the Java programming language. AutoGrader leverages the abstract concept of interfaces, brought out by the Java interface language construct, in both the assignment and grading of programming assignments. The use of interfaces reinforces the role of procedural abstraction in object-oriented programming and allows for a common API to all student code. This common API then enables automatic grading of program functionality. AutoGrader provides a simple instructor API and enables the automatic testing of student code through the Java language features of interfaces and reflection. AutoGrader also supports static code analysis using PMD [4] to detect possible bugs, dead code, suboptimal, and overcomplicated code. While AutoGrader is written in and only handles Java programs, this style of automated grading is adaptable to any language that supports (or can mimic) named interfaces and/or abstract functions and that also supports runtime reflection.
Compressed Bit-sliced Signature Files: An Index Structure for Large Lexicons
(1999-04-01) Can, Fazli; Carterette, Ben
We use the signature file method to search for partially specified terms in large lexicons. To optimize efficiency, we use the concepts of the partially evaluated bit-sliced signature file method and memory resident data structures. Our system employs signature partitioning, compression, and term blocking. We derive equations to obtain system design parameters, and measure indexing efficiency in terms of time and space. The resulting approach provides good response time and is storage-efficient. In the experiments we use four different lexicons, and show that the signature file approach outperforms the inverted file approach in certain efficiency aspects. KEYWORDS: Lexicon search, n-grams, signature files.
A Methodology for the Implementation and Maintenance Of a Data Warehouse
(1995-12-01) Jarrett, Wayne
A methodology for the implementation and maintenance of a data warehouse is described. The data warehouse forms the basis for a marketing decision support system for use by a large surgical equipment manufacturer and requires integration of multiple sources of data external to the company. A prototype system is developed based upon the methodology. Each phase of the data warehouse implementation is discussed, including the data conversion process from raw data files to the prototype and production environments. Emphasis is placed upon the selection of suitable software tools for each process. The research approach employed is action research, in which the researcher participates with a client organization that exhibits the problems of interest to the investigator. The client organization is a large surgical equipment manufacturer, Ethicon Endo-Surgery.
Investigation of Programming Languages for an Automated Manufacturing System
(1995-07-01) Ma, Mark
This paper is an investigation of alternative programming languages for use in manufacturing control applications. After reviewing several types of languages, two alternative languages for programming the flexible manufacturing cell in Miami University's Manufacturing Engineering Department are investigated. One language, called Cell Programming Language (CPL), is an object-like high level language developed at Miami University. The other is Relay Ladder Logic (RLL) which is the predominant language used in industry to program programmable logic controllers. An RLL program that is equivalent to an existing CPL program was developed for this purpose.
PangaeaMud: An Online, Object-oriented Multiple User Interactive Geologic Database Tool
(1993-12-01) Boring, Erich
This paper provides an overview of the design, development, and use of the PangaeaMud Database System. Section I gives an introduction to pertinent concepts and discusses previous work in the area. Section II is devoted to the non-technical aspects of the system. A brief user's view of the system is provided, along with discussion of the internal environment utilities. Section III illustrates the workings of the system from the programmer's viewpoint and contains information on the main entity relationships within the database and their implementation.
Concepts and Effectiveness of the Cover Coefficient Based Clustering Methodology for Text Databases
(1987-12-01) Can, Fazli; Ozkarahan, Esen
An algorithm for document clustering is introduced. The base concept of the algorithm, the Cover Coefficient (CC) concept, provides means of estimating the number of clusters within a document database. The CC concept is also used to identify the cluster seeds, to form clusters with the seeds, and to calculate Term Discrimination and Document Significance values (TDV, DSV). TDVs and DSVs are used to optimize document descriptions. The CC concept also relates indexing and clustering analytically. Experimental results indicate that the clustering performance in terms of the percentage of useful information accessed (precision) is forty percent higher, with accompanying reduction in search space, than that of random assignment of documents to clusters. The experiments have validated the indexing-clustering relationships and shown improvements in retrieval precision when TDV and DSV optimizations are used.
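The abstract does not state the formulas, so the following Python sketch rests on an assumption: it uses the cover coefficient as it is commonly defined for this methodology, c_ij = alpha_i * sum_k(d_ik * beta_k * d_jk), where alpha_i and beta_k are the reciprocals of the row and column sums of the document-by-term matrix D, and it estimates the number of clusters as the sum of the diagonal entries c_ii. The function names and toy matrices are illustrative only, and the sketch assumes every document has at least one term and every term occurs in at least one document.

```python
def cover_coefficient_matrix(d):
    """Compute the m-by-m matrix C from an m-by-n document-term matrix d.

    Assumed definition: c[i][j] = alpha[i] * sum_k d[i][k] * beta[k] * d[j][k],
    with alpha[i] = 1 / (row sum i) and beta[k] = 1 / (column sum k).
    """
    m, n = len(d), len(d[0])
    alpha = [1.0 / sum(row) for row in d]
    beta = [1.0 / sum(d[i][k] for i in range(m)) for k in range(n)]
    return [
        [
            alpha[i] * sum(d[i][k] * beta[k] * d[j][k] for k in range(n))
            for j in range(m)
        ]
        for i in range(m)
    ]


def estimated_cluster_count(d):
    """Sum of the diagonal of C, used here as the cluster-count estimate."""
    c = cover_coefficient_matrix(d)
    return sum(c[i][i] for i in range(len(d)))
```

Two limiting cases make the estimate plausible: documents with no terms in common (an identity matrix) yield one cluster per document, and identical documents yield a single cluster.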
Analysis of Signature Generation Schemes for Multiterm Queries In Linear Hashing with Superimposed Signatures
(1995-12-01) Can, Fazli; Ertugay, Osman
Signature files provide efficient retrieval of data by reflecting the essence of the data objects into bit patterns. Our analysis explores the performance of three superimposed signature generation schemes as they are applied to a dynamic signature file organization based on linear hashing: Linear Hashing with Superimposed Signatures (LHSS). The first scheme (SM) lets all terms set the same number of bits, whereas the second and third schemes (MMS and MMM) emphasize the terms with high discriminatory power. In addition, MMM considers the probability distribution of the number of query terms. The main contribution of the study is a detailed analysis of LHSS in multiterm query environments by incorporating the term discrimination values based on document and query frequencies. The approach of the study can also be extended to other signature file access methods based on partitioning. The derivation of the performance evaluation formulas, the simulation results based on these formulas for various experimental settings, and the implementation results based on INSPEC and NPL text databases are provided. Results indicate that MMM and MMS outperform SM in all cases in terms of access savings, especially when terms become more distinctive. MMM slightly outperforms MMS in high weight and low weight query cases. The performance gap among all three schemes decreases as the database size increases, and as the signature size increases the performances of MMM and MMS decrease and converge to that of the SM scheme when the hashing level is fixed.
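The general idea of superimposed signatures can be sketched in a few lines of Python. This is a generic illustration of superimposed coding, not the SM, MMS, or MMM formulas analyzed in the report: each term sets a number of bit positions chosen by hashing, a document signature is the bitwise OR of its term signatures, and a query can only match documents whose signatures contain all of the query's bits. The 64-bit width, the SHA-256-based hash, and the `bits_per_term` callback are assumptions made for illustration.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width


def term_signature(term, bits_to_set, width=SIGNATURE_BITS):
    """Set `bits_to_set` distinct bit positions derived from hashes of the term."""
    positions = set()
    counter = 0
    while len(positions) < bits_to_set:
        digest = hashlib.sha256(f"{term}:{counter}".encode()).digest()
        positions.add(int.from_bytes(digest[:4], "big") % width)
        counter += 1
    signature = 0
    for position in positions:
        signature |= 1 << position
    return signature


def superimpose(terms, bits_per_term, width=SIGNATURE_BITS):
    """OR together term signatures; bits_per_term(term) gives each term's weight."""
    signature = 0
    for term in terms:
        signature |= term_signature(term, bits_per_term(term), width)
    return signature


def may_match(document_signature, query_signature):
    """True if every query bit is set in the document signature.

    A True result can be a false drop, so matches still need verification;
    a False result safely filters the document out.
    """
    return (document_signature & query_signature) == query_signature
```

Passing a constant such as `lambda term: 3` as `bits_per_term` gives every term the same number of bits, while a callback that returns more bits for rarer terms gives those terms more influence on the filter, which is the kind of trade-off the report analyzes.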
Experiments on Tunable Indexing
(1988-04-01) Can, Fazli; Ozkarahan, Esen
The effectiveness and efficiency of an Information Retrieval (IR) system depend on the quality of its indexing system. Indexing can be used in inverted file systems or in cluster-based retrieval. In this article, a new concept called tunable indexing is introduced. With tunable indexing the number of clusters of a document clustering system can be varied to any desired value. Also covered are the computation of the Term Discrimination Value (TDV) with the cover coefficient (CC) concepts and its use in tunable indexing. A set of experiments has shown the consistency between the CC-based TDVs and the TDVs determined with the known methods. The main use of tunable indexing has been observed in determining the parameters of a clustering system.
Dynamic Signature File Partitioning Based on Term Characteristics
(1992-08-01) Aktug, Deniz; Can, Fazli
Signature files act as a filter on retrieval to discard a large number of non-qualifying data items. Linear hashing with superimposed signatures (LHSS) provides an effective retrieval filter to process queries in dynamic databases. This study is an analysis of the effects of reflecting the term query and occurrence characteristics to signatures in LHSS. This approach relaxes the unrealistic uniform frequency assumption and lets the terms with high discriminatory power set more bits in signatures. The simulation experiments based on the derived formulas show that incorporating the term characteristics in LHSS improves retrieval efficiency. The paper also discusses the further benefits of this approach to alleviate the potential imbalance between the levels of efficiency and relevancy.
