Methods of knowledge-mining from databases

Knowledge-mining from databases (KMD) is understood as a process of non-trivial extraction of new, understandable and useful knowledge "hidden" in large data files. Our goal in this field will be the development of new methods and tools for knowledge-mining from databases, putting an emphasis on the automation of the whole KMD process and the use of prior knowledge (ontologies). Further, we shall deal with the development of tools for the automatic exploitation of knowledge obtained in the KMD process, as well as for further operations with such knowledge.

Methods of automatic analysis of the WWW and multimedia data

In recent years, data analysis has paid increasing attention to sources that closely combine different data types: free or structured text, pictures, and possibly audio and video. Such sources are often available in the form of text documents, but also in the form of multimedia collections. The analysis of such data (searching, classification, extraction, clustering, etc.) requires cooperation between methods and tools that are often created independently of one another, so it is necessary to ensure their interoperability.

Inference in knowledge systems

The automatic processing (use) of knowledge has a long tradition both in expert (knowledge-based) systems and in Case-Based Reasoning. In systems of the first type, knowledge usually takes the form of generalizations (e.g. represented by generally valid rules), and inference is usually based on the deduction rules of propositional calculus, possibly supplemented by the ability to work with uncertainty. In Case-Based Reasoning systems, knowledge takes the form of typical cases and inference proceeds on the grounds of ontology. The two approaches are thus to some extent complementary, and it would be desirable to be able to combine them. This raises the questions to be examined, namely different representations of knowledge, inference, and working with uncertainty.
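To make the contrast between the two kinds of inference more concrete, the following Python sketch may help. It is only an illustration under our own assumptions: rules are propositional, uncertainty is handled by a simple scheme (minimum over premises, multiplied by a rule weight between 0 and 1), and a case is retrieved by plain feature similarity. None of these choices, nor the names `forward_chain` and `retrieve_case`, come from the project itself.

```python
from typing import Dict, List, Tuple

# A rule is (premises, conclusion, weight). Hypothetical format: premises must be
# non-empty and the weight is assumed to lie in [0, 1].
Rule = Tuple[List[str], str, float]
Case = Tuple[Dict[str, float], str]  # (feature values, stored solution)


def forward_chain(facts: Dict[str, float], rules: List[Rule]) -> Dict[str, float]:
    """Apply rules repeatedly until no certainty value increases."""
    known = dict(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion, weight in rules:
            if all(p in known for p in premises):
                # Conjunction of premises = minimum certainty, scaled by the rule weight.
                certainty = min(known[p] for p in premises) * weight
                if certainty > known.get(conclusion, 0.0):
                    known[conclusion] = certainty
                    changed = True
    return known


def retrieve_case(query: Dict[str, float], cases: List[Case]) -> str:
    """Return the solution of the stored case nearest to the query (squared distance)."""
    def distance(features: Dict[str, float]) -> float:
        keys = set(query) | set(features)
        return sum((query.get(k, 0.0) - features.get(k, 0.0)) ** 2 for k in keys)

    return min(cases, key=lambda case: distance(case[0]))[1]


if __name__ == "__main__":
    rules = [(["fever", "cough"], "flu", 0.7), (["flu"], "rest", 1.0)]
    print(forward_chain({"fever": 0.9, "cough": 0.8}, rules))
    print(retrieve_case({"x": 1.0}, [({"x": 0.9}, "A"), ({"x": 5.0}, "B")]))
```

The rule-based part generalizes from explicit rules, while the case-based part reuses the nearest stored example; a combined system would need a shared representation for both.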

Methods of ontological engineering

Ontologies, as formal conceptual models of reality that enable automatic inference, are understood as the principal tool for a shared understanding of the meaning of data on the so-called semantic web and in many intradepartmental and inter-firm applications. At present, their creation is a matter for experts with profound knowledge of the formal procedures used. The goal of contemporary research in the field of ontological engineering is to support the creation of ontologies with user-friendly tools, which would (among other things):

  • supply the user with debugged, self-initializing building blocks for the ontology, covering both its factual structure and its formal-logical constructions
  • continuously test the correctness of the ontology being created and look for potential causes of errors
  • offer, during its creation, interesting concepts and relations on the basis of the occurrence of terms in a corpus of relevant texts (so-called ontology learning)
  • enable the creation of a new ontology by the combining or mutual mapping of several existing ones.
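The third item in the list above, suggesting relations from term occurrences in a corpus, can be sketched very roughly as follows. This is a minimal illustration that assumes whitespace tokenization and simple document-level co-occurrence counts; the function name `suggest_relations`, the vocabulary and the threshold are hypothetical, and real ontology-learning tools use considerably richer linguistic processing.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set, Tuple


def suggest_relations(
    documents: List[str], vocabulary: Set[str], min_count: int = 2
) -> List[Tuple[Tuple[str, str], int]]:
    """Return term pairs that co-occur in at least `min_count` documents."""
    pair_counts: Counter = Counter()
    for text in documents:
        # Keep each known term once per document; sort so that pairs are canonical.
        present = sorted({word for word in text.lower().split() if word in vocabulary})
        pair_counts.update(combinations(present, 2))
    return [(pair, n) for pair, n in pair_counts.most_common() if n >= min_count]


if __name__ == "__main__":
    corpus = ["bank grants loan", "loan from bank", "client visits bank"]
    print(suggest_relations(corpus, {"bank", "loan", "client"}))
```

A frequent pair is only a candidate for the ontology author to accept or reject; the counts say nothing about the type of the relation.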

Multidimensional statistical methods

From the point of view of statistical approaches, we consider the use of statistical methods in the analysis of data characterizing collections of text documents. This involves three principal areas of analysis: the reduction of dimension (lowering the number of terms by means of principal component analysis and related methods), the clustering of documents, and the assignment of a query to the group of documents that best matches it. A typical feature of text documents is their considerable length, with the result that some statistical methods cannot be used (e.g. it is infeasible to compute the similarity matrix for all pairs of objects). In general, we are concerned with evaluating and comparing different approaches to document searching, and their implementations in different software systems, from the point of view of their efficiency. On the one hand, we have at our disposal a specialized tool developed in the Czech Republic (the Amphora system); on the other hand, classical statistical software systems. Specialized commercial products for data mining and text mining can also be used. For example, the new version of the GhostMiner system implements the SVD method (Singular Value Decomposition) for dimension reduction, a problem that current research publications address only in connection with text document analysis.

In this context, a further subject of investigation is dimension reduction, above all principal component analysis, factor analysis and, where appropriate, multidimensional scaling. Likewise, methods for the clustering of documents can be used: cluster analysis or other methods for measuring the proximity of structures (measures of similarity and dissimilarity).
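As an illustration of assigning a query to a group of documents without computing similarities for all pairs of documents, the following Python sketch compares the query only with one centroid per group. It assumes sparse term-weight vectors and cosine similarity; the names `cosine`, `centroid` and `assign_query` and the sample data are hypothetical and are not taken from the Amphora or GhostMiner systems.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # sparse representation: term -> weight


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors: List[Vector]) -> Vector:
    """Average of the document vectors of one group."""
    total: Vector = {}
    for vector in vectors:
        for term, weight in vector.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def assign_query(query: Vector, clusters: Dict[str, List[Vector]]) -> str:
    """Return the name of the group whose centroid is most similar to the query."""
    centroids = {name: centroid(docs) for name, docs in clusters.items()}
    return max(centroids, key=lambda name: cosine(query, centroids[name]))


if __name__ == "__main__":
    groups = {
        "finance": [{"bank": 1.0, "loan": 1.0}, {"bank": 1.0, "rate": 1.0}],
        "sport": [{"match": 1.0, "goal": 1.0}],
    }
    print(assign_query({"loan": 1.0}, groups))
```

With k groups, the query is compared with k centroids rather than with every document, which is one way around the infeasible full similarity matrix.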
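The idea behind principal component analysis can be sketched in a few lines of plain Python. The example below estimates only the first principal component by power iteration, working directly on the centered data so that the covariance matrix is never formed explicitly. It is a simplified illustration under our own assumptions (dense rows, a fixed number of iterations, a starting vector that is not orthogonal to the leading component); the function names are hypothetical, and a production analysis would rely on a numerical library.

```python
import math
from typing import List

Matrix = List[List[float]]


def center(rows: Matrix) -> Matrix:
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def first_principal_component(rows: Matrix, iterations: int = 200) -> List[float]:
    """Estimate the leading principal direction by power iteration."""
    centered = center(rows)
    n, d = len(centered), len(centered[0])
    v = [1.0 / math.sqrt(d)] * d  # assumed not orthogonal to the true component
    for _ in range(iterations):
        # Compute X^T (X v) without building the d x d covariance matrix.
        scores = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # degenerate data or an unlucky starting vector
        v = [c / norm for c in w]
    return v


def project(rows: Matrix, component: List[float]) -> List[float]:
    """One-dimensional coordinates of the rows along the given component."""
    return [sum(x[j] * component[j] for j in range(len(component))) for x in center(rows)]


if __name__ == "__main__":
    data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    pc = first_principal_component(data)
    print(pc, project(data, pc))
```

Replacing each document by its coordinates along a few such components is what lowers the number of dimensions before clustering or searching.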

Analysis and prediction of time-dependent data

A further objective of the project is the analysis and prediction of time-dependent data, or of spatially or otherwise intricately structured data. Models of data behavior in time (or in space) are essential for recognizing regularities in the data, for prediction and for decision-making support. The definition of the model relationship between the components and the whole from the point of view of time (or space) is the basis for the development of methods of disaggregation and retropolation in the case of time sequences (or allocation methods in the case of spatial sequences). In connection with predictions, it is also necessary to develop methods for detecting long-term regularities in time-dependent data that manifest themselves as distinctive turning points, because these turning points qualitatively predetermine the future structure of the process. Methods for the detection of turning points will be based on tests of changes of distribution in the corresponding processes (given a sufficient number of observations), or on singularities of probit type (given a smaller number of observations in time). Applications mainly include macroeconomic time series in the Czech Republic (GDP and national accounting aggregates, inflation, unemployment rate, exports, imports, the CZK/EUR and CZK/USD exchange rates, specific series of financial indicators, etc.).
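The idea of locating a turning point by testing for a change of distribution can be illustrated with a deliberately simple Python sketch. It assumes that the change shows up as a shift in the mean and scans every admissible split of the series with a Welch-type two-sample statistic; the text above does not specify which test the project uses, so this choice, the function name `detect_turning_point` and the sample data are all our own assumptions.

```python
import math
from statistics import mean, variance
from typing import List, Tuple


def detect_turning_point(series: List[float], min_segment: int = 3) -> Tuple[int, float]:
    """Return (index, statistic) of the split with the largest mean-shift statistic.

    The index is the position of the first observation after the turning point.
    (-1, 0.0) is returned when no split shows any difference. min_segment must
    be at least 2 so that both sample variances are defined.
    """
    best_index, best_stat = -1, 0.0
    for k in range(min_segment, len(series) - min_segment + 1):
        left, right = series[:k], series[k:]
        diff = abs(mean(left) - mean(right))
        std_err = math.sqrt(variance(left) / len(left) + variance(right) / len(right))
        if std_err > 0.0:
            stat = diff / std_err
        else:
            # Both segments are constant: any difference is a perfect break.
            stat = math.inf if diff > 0.0 else 0.0
        if stat > best_stat:
            best_index, best_stat = k, stat
    return best_index, best_stat


if __name__ == "__main__":
    values = [1.0, 1.1, 0.9, 1.0, 1.05, 5.0, 5.1, 4.9, 5.0, 5.2]
    print(detect_turning_point(values))
```

Because the statistic is maximized over many candidate splits, it cannot be compared with ordinary t-distribution critical values; a real test would need critical values that account for this scan.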

Attention will be given to the quantitative evaluation of economic processes in the transition period and their adaptation to the conditions and criteria of the European Union, with the objective of formulating alternative rules for achieving various rates of convergence to EU standards, including an analysis of their impacts.

The time series analyzed will be examined from the point of view of the construction of stochastic models, with an emphasis on the investigation and description of dependencies between these series. The expected result is the construction of models that enable high-quality predictions of the future values of these macroeconomic series. Further, these series will be compared with the same or similar series in European Union states and in the European Union as a whole. Again, the goal will be to model the dependencies between the series of the Czech economy and these "union" series. As the output, we expect the publication of a series of papers devoted to the above-mentioned problems. All results achieved will be summarized in a concluding publication and presented at an international conference held regularly by the Faculty.
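One elementary way to describe a dependence between two series, for instance a Czech series and its "union" counterpart, is the correlation at different time lags. The Python sketch below is only an exploratory illustration under our own assumptions (equally long series, non-negative lags, Pearson correlation); it is not the stochastic model the project intends to build, and the function names and data are hypothetical.

```python
import math
from typing import List


def lagged_correlation(x: List[float], y: List[float], lag: int) -> float:
    """Pearson correlation between x[t] and y[t + lag] for a lag >= 0."""
    if lag < 0 or len(x) != len(y) or len(x) - lag < 2:
        raise ValueError("need equally long series and at least two overlapping points")
    a, b = x[: len(x) - lag], y[lag:]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # a constant segment carries no information about dependence
    return cov / math.sqrt(var_a * var_b)


def best_lag(x: List[float], y: List[float], max_lag: int) -> int:
    """Lag in 0..max_lag at which y follows x most strongly (by absolute correlation)."""
    return max(range(max_lag + 1), key=lambda k: abs(lagged_correlation(x, y, k)))


if __name__ == "__main__":
    leader = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
    follower = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]  # repeats `leader` one step later
    print(best_lag(leader, follower, 2))
```

A strong correlation at a positive lag suggests that one series leads the other, which is a starting point for, not a substitute for, a proper joint model.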

Methods of knowledge acquisition and its analysis in the socioeconomic field

Within the research proposal, we shall deal, among other things, with a statistical topic that is highly relevant in conditions of economic globalization, namely the possibilities of obtaining knowledge from a broad data base (and of analyzing it), both at the level of the national economy and at the level of individual branches. Special attention will be paid to data analysis at the level of individual subjects (microdata) and to knowledge-mining from large administrative data resources.

We understand economic globalization as a new degree of international economic cooperation, in which both horizontal and, above all, vertical integration of entrepreneurial activity occurs regardless of existing state boundaries. From the point of view of statistical data analysis, globalization has a cross-sectional character and affects a number of fields (individual branch statistics, labor market statistics, prices, national accounts, social statistics, etc.).

The goal of the research is to examine the possibilities of obtaining knowledge for evaluating the extent of economic globalization at the level of the national economy, branches and individual subjects, and the impacts of economic globalization on competitiveness, economic growth, employment and other economic phenomena and indicators. A further goal is to examine the possibilities of obtaining statistical data efficiently in a globalizing world (clearly, globalization is not only something to be investigated through statistics; it also deeply influences statistics itself).

With the use of information technologies, large data resources arise in administrative processes. It can be assumed that this development will intensify in the near future as a result of the e-Government strategy. Whereas considerable attention is given to the use of such large information resources in the field of the Internet and in business applications within large companies (banks, telecommunication companies, etc.), their use within public administration still attracts little interest.

The systematic and efficient processing of an immense amount of information appears to be a key factor of success in the modern information society, not only in the business sphere but also in the public one. We therefore want to pay attention to the possibilities of applying knowledge-mining methods and of using the analytic potential of administrative data sources to produce qualitatively new statistical outputs.

Schematically, the linkage of the individual partial directions to the principal fields of the research proposal can be illustrated by the following table. It is nevertheless necessary to bear in mind that the individual partial directions blend together, as mentioned in the preceding text. The table should therefore be read as an orientation aid only.

knowledge acquisition and data analysis
  quantitative approaches:
    • knowledge-mining from databases
    • multidimensional statistical methods
    • analysis and prediction of time dependent data
    • methods of knowledge acquisition and their analysis in socioeconomic area
  semantic approaches:
    • automatic analysis of WWW and multimedia

knowledge representation, processing and use
  quantitative approaches:
    • inference in knowledge systems
  semantic approaches:
    • ontological engineering

LISp - Laboratory for Intelligent Systems Prague
University of Economics Prague, Ekonomicka 957, 148 01 Praha 4 - Kunratice
The Czech Republic

Phone: +420 224 094 226, fax: +420 224 094 213

[ University of Economics Prague ] - [ Faculty of Informatics and Statistics ] Copyright © 2002-2014 Laboratory for Intelligent Systems Prague. This page last modified on: 11-02-10