!! Challenge homepage !!
to access the data and submit papers
(http://lisp.vse.cz/challenge/CURRENT )
Motivation
Knowledge discovery in real-world databases requires a
broad scope of techniques and forms of knowledge. Both the knowledge
and the applied methods should fit the discovery tasks and should
adapt to knowledge hidden in the data. The ECML/PKDD2005 Discovery Challenge
will encourage a collaborative research effort, a broad and unified view of
knowledge and methods of discovery, and emphasis on business problems and solutions
to those problems.
The idea of the Discovery Challenge came from Jan Zytkow, who suggested
organizing such an event during PKDD'99 in Prague. In contrast to the KDD Cups
held within KDD Conferences, the Discovery Challenge stresses the aspect
of collaboration.
The Discovery Challenge provides a collection of data and problems as a common ground for
better comparison and discussion of the applicability of KDD methods to real-world problems, from
both the KDD and the application viewpoints. The main goals of the
Discovery Challenge are to:
- stimulate an open view of knowledge and discovery
- stimulate a collaborative approach to KDD and research on the unification of
different forms of knowledge and discovery
- integrate into KDD an emphasis on business problems and solutions to
those problems
Time and place
The Discovery Challenge will be held as a workshop during the ECML/PKDD2005 Conference,
October 6-10, 2005, Porto, Portugal.
Only those registered for ECML/PKDD2005 can participate in the Discovery Challenge.
Program Committee
Petr Berka (co-chair)
Bruno Cremilleux (co-chair)
Nicolas Bredeche
Martine Collard
Olivier Gandrillon
Tomas Kocka
Stefan Kramer
Claire Leschi
Jan Ramon
Jan Rauch
Olga Stepankova
Einoshin Suzuki
Shusaku Tsumoto
Hideto Yokoi
Data Sets
Three data sets will be available for the Discovery Challenge:
- Data about chronic hepatitis:
This data set contains administrative information as well as long
time series of laboratory examinations for 771 patients with hepatitis B and
C who were examined in the period 1982-2001. The data are organized in 7 tables: basic information about the patients,
results of biopsy, information on interferon therapy, results of
out-hospital examinations, results of in-hospital examinations (the largest
table, with about 1.5 million records), information about the measurements in in-hospital
examinations, and results of hematological analysis.
The data were prepared in cooperation with the Shimane Medical University,
School of Medicine and Chiba University Hospital, Japan.
Data from the same domain were used
in the previous challenge. To see the results from the ECML/PKDD 2004 Challenge, follow
this link.
- Gene expression data:
This data set is based on the publicly available SAGE data produced
from human cells (see
www.ncbi.nlm.nih.gov/SAGE/index.cgi ). It
deals with the human transcriptome, that is, an exhaustive list of
transcripts expressed at one given time point in one given biological
situation. Analyzing such data is relevant because this SAGE data source
has been largely under-exploited to date. The only approach available
online consists of comparing the existing libraries pairwise to
extract differential information. One obvious reason for such poor
exploitation lies in the structure of the data, including a high error
rate for low-frequency tags (especially tags appearing only once
in a library). The use of discretization operators provides a
solution to the problem of low-frequency tags. Biologists are
convinced that some essential biological information may be derived
from the mass of SAGE data.
These data can be seen as expression matrices in which the expression levels
of genes (the columns) are recorded in various biological situations
(the rows). In SAGE terminology, tags correspond to genes and
the biological situations are called libraries.
This data set was prepared by Dr. Olivier Gandrillon's team from the
Centre de Genetique Moleculaire et Cellulaire, Universite Claude Bernard Lyon I, France.
Data from the same domain were used
in the previous challenge. To see the results from the ECML/PKDD 2004 Challenge, follow
this link.
- Web server log data:
These data come from a Czech company running several internet shops.
The log data cover about three weeks of traffic on the web server.
This represents about 3 million records (each record is a single page view). The stored data
make it possible to derive the products (the internet shops sell electronics), the type of page
(such as shopping cart or product detail), and the internet shop (this information has been anonymised).
Each record contains a generated ID, so identifying sessions (the click-stream of
a single user) is easy.
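
As a rough illustration of how sessions might be recovered from the web server log data, the sketch below groups page views by their generated ID and orders each group by time. The record layout (ID, timestamp, page type) and all values are hypothetical; the actual format of the challenge logs is not described here.

```python
from collections import defaultdict

# Hypothetical records: (generated_id, unix_timestamp, page_type).
# The real field names and layout of the challenge logs are assumptions.
records = [
    ("a1", 1000, "detail"),
    ("b2", 1005, "list"),
    ("a1", 1030, "cart"),
    ("b2", 1002, "detail"),
]


def group_sessions(rows):
    """Group page views by generated ID and order each session by time."""
    by_id = defaultdict(list)
    for session_id, timestamp, page_type in rows:
        by_id[session_id].append((timestamp, page_type))
    # Sort each click-stream chronologically and keep only the page types.
    return {
        sid: [page for _, page in sorted(views)]
        for sid, views in by_id.items()
    }


sessions = group_sessions(records)
for sid, pages in sorted(sessions.items()):
    print(sid, "->", " > ".join(pages))
```

Each resulting list is one click-stream, which could then serve as input to sequence or association analysis.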
These data sets are available to prospective participants for download
and analysis.
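
For the gene expression data, the discretization idea mentioned above can be sketched as follows. This is only one possible operator, a fixed threshold chosen here for illustration; the library names, tag names, counts, and threshold value are all hypothetical and are not taken from the SAGE data.

```python
# Hypothetical expression matrix: rows are libraries (biological
# situations), columns are tags (genes). All values are made up.
libraries = ["lib_A", "lib_B", "lib_C"]
tags = ["tag_1", "tag_2", "tag_3"]
counts = [
    [0, 1, 12],
    [5, 0, 1],
    [9, 3, 0],
]


def discretize(matrix, threshold=2):
    """Map each count to 1 if it reaches the threshold, otherwise 0.

    With threshold=2, tags seen only once in a library are treated
    as not expressed, which sidesteps the unreliable singleton counts.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in matrix]


binary = discretize(counts)
for name, row in zip(libraries, binary):
    print(name, dict(zip(tags, row)))
```

The resulting Boolean matrix can be handed to pattern mining methods that expect binary attributes; other operators (for example, a per-tag threshold) would fit the same interface.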
To get access to the data, you have to fill in the
registration form.
The participants in the Challenge can analyze any of these data sets.
We are seeking papers presenting original uses or combinations of data
mining methods on the data sets, or new insights in the application
domain thanks to a data mining approach and, more generally, papers
which show an important collaborative effort between data miners and
experts of the domain. New data mining methods are also welcome.
Submission guidelines
Submitted papers should be in English and should be formatted according to
the Springer-Verlag Lecture Notes in Artificial Intelligence guidelines.
Authors' instructions and style files
can be downloaded from
http://www.springer.de/comp/lncs/authors.html (no copyright form is
requested, use the style for proceedings). The
maximum length of papers is 12 pages.
Papers must be
submitted electronically (as PostScript or PDF files) by e-mail to
Petr Berka.
The deadline for submission is July 25, 2005. An acceptance notification
will follow by August 22. The deadline for camera-ready papers is September 5, 2005.