PKDD'99 Discovery Challenge

Guide to the Medical Data Set




The database was collected at Chiba University Hospital. Each patient came to the hospital's outpatient clinic for collagen diseases on the recommendation of a home doctor or a general physician at a local hospital.

Collagen diseases are auto-immune diseases: patients generate antibodies that attack their own bodies. For example, if a patient generates antibodies in the lungs, he/she will gradually lose respiratory function and eventually die. The disease mechanisms are only partially known and their classification is still fuzzy. Some patients may generate many kinds of antibodies, and their manifestations may include all the characteristics of collagen diseases.

In collagen diseases, thrombosis is one of the most important and severe complications and one of the major causes of death. Thrombosis is an increased coagulation of blood that clogs blood vessels. Usually it lasts several hours and can recur over time. Thrombosis can arise from different collagen diseases. It has been found that this complication is closely related to anti-cardiolipin antibodies. This was discovered by physicians, one of whom donated the datasets for the discovery challenge.

Thrombosis must be treated as an emergency, so it is important to detect it and to predict the possibility of its occurrence. However, no immunology expert has yet carried out such a database analysis. Domain experts are very much interested in discovering regularities behind patients' observations.



  1. Search for patterns which detect and predict thrombosis.
  2. Search for temporal patterns specific/sensitive to thrombosis. (The examination date is very close to the date of thrombosis. If we can find specific/sensitive patterns before/after the thrombosis, they will be very useful.)
  3. Search for features which classify collagen diseases correctly.
  4. Search for temporal patterns specific/sensitive to each collagen disease.
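
The tasks above ask for patterns that are "specific/sensitive" to thrombosis. As a rough illustration of how those two measures could be computed for a candidate pattern, here is a minimal sketch. The function name and the toy data are invented for illustration and are not part of the challenge materials.

```python
def sensitivity_specificity(pattern_fires, has_condition):
    """Return (sensitivity, specificity) for two equally long boolean sequences.

    pattern_fires[i] says whether the candidate pattern matched patient i,
    has_condition[i] says whether patient i actually has the condition.
    """
    pairs = list(zip(pattern_fires, has_condition))
    tp = sum(1 for fired, truth in pairs if fired and truth)
    fn = sum(1 for fired, truth in pairs if not fired and truth)
    tn = sum(1 for fired, truth in pairs if not fired and not truth)
    fp = sum(1 for fired, truth in pairs if fired and not truth)
    # Guard against empty classes instead of dividing by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Invented toy data: six patients, not taken from the real tables.
pattern_fires = [True, True, False, False, True, False]
has_thrombosis = [True, False, False, False, True, True]
sens, spec = sensitivity_specificity(pattern_fires, has_thrombosis)
```

A pattern that fires on most thrombosis cases is sensitive; one that rarely fires on non-thrombosis cases is specific. Which of the two matters more is a question for the domain experts.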

Domain experts told us that useful patterns, if discovered, would be acceptable for publication in major journals on rheumatology (collagen diseases).


Evaluation Scheme

One of the domain experts, who is well known in rheumatology, will attend the PKDD'99 conference and evaluate all the results. The results will also be evaluated in the clinical environment in the future.



The database consists of three tables (TSUM_A.CSV, TSUM_B.CSV, TSUM_C.CSV). The patients in these three tables are linked by ID number.
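
Since the three tables are connected only through the patient ID, a first step in most analyses is to group the rows of each table by ID. The sketch below shows one way to do this with the Python standard library. The CSV snippets, column headers, and values are invented stand-ins; the real files may use different header names, delimiters, or encodings.

```python
import csv
import io
from collections import defaultdict

# Invented miniature stand-ins for the three files (assumed comma-separated
# with a header row; the real files may differ).
patients_csv = "ID,Admission\n1,+\n2,-\n"
special_csv = "ID,Thrombosis\n1,1\n"
lab_csv = "ID,Date,GOT\n1,990104,25\n1,990201,31\n2,981120,18\n"


def rows_by_id(text):
    """Parse CSV text and group its rows by the value of the ID column."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        grouped[row["ID"]].append(row)
    return grouped


patients = rows_by_id(patients_csv)
special = rows_by_id(special_csv)
lab = rows_by_id(lab_csv)

# One record per patient; patients without special or ordinary laboratory
# examinations simply get empty lists.
linked = {
    pid: {
        "patient": rows[0],
        "special_exams": special.get(pid, []),
        "lab_exams": lab.get(pid, []),
    }
    for pid, rows in patients.items()
}
```

Keeping the examinations as lists per patient preserves the one-to-many relationship, which matters for the temporal tasks.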



TSUM_A.CSV

Basic information about patients (input by doctors). This dataset includes all patients (about 1000 records).

ID                identification of the patient
Description date  the first date when a patient's data was recorded (YY.MM.DD)
First date        the date when a patient came to the hospital (YY.MM.DD)
Admission         patient was admitted to the hospital (+) or followed at the outpatient clinic (-)
Diagnosis         disease names (multivalued attribute)
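
The attributes above need some normalisation before analysis: two-digit-year dates, a +/- admission flag, and a multivalued diagnosis field. The following sketch shows one possible approach. The dictionary keys, the comma separator for diagnoses, and the sample values are assumptions; check them against the actual file before relying on them.

```python
from datetime import datetime


def parse_yy_mm_dd(text):
    """Parse a YY.MM.DD string into a date.

    Python's %y maps 69-99 to 1969-1999 and 00-68 to 2000-2068, which is
    an assumption about the data rather than something the guide states.
    """
    return datetime.strptime(text.strip(), "%y.%m.%d").date()


def parse_patient(record, diagnosis_sep=","):
    """Normalise one basic-information record given as a dict of strings."""
    return {
        "id": record["ID"],
        "description_date": parse_yy_mm_dd(record["Description date"]),
        "first_date": parse_yy_mm_dd(record["First date"]),
        "admitted": record["Admission"].strip() == "+",
        # The separator of the multivalued field is assumed, not documented.
        "diagnoses": [
            d.strip() for d in record["Diagnosis"].split(diagnosis_sep) if d.strip()
        ],
    }


# Invented sample record for illustration only.
sample = {
    "ID": "42",
    "Description date": "94.02.14",
    "First date": "93.02.10",
    "Admission": "+",
    "Diagnosis": "SLE, APS",
}
patient = parse_patient(sample)
```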



TSUM_B.CSV

Special laboratory examinations (measured by the Laboratory on Collagen Diseases, input by doctors). This dataset does not include all the patients, only those who underwent these special tests.

ID                identification of the patient
Examination Date  date of the test (YYYY/MM/DD)
aCL IgG           anti-cardiolipin antibody (IgG) concentration
aCL IgM           anti-cardiolipin antibody (IgM) concentration
ANA               anti-nuclear antibody concentration
ANA Pattern       pattern observed in the sheet of ANA examination
aCL IgA           anti-cardiolipin antibody (IgA) concentration
Diagnosis         disease names (multivalued attribute)
KCT               measure of degree of coagulation
RVVT              measure of degree of coagulation
LAC               measure of degree of coagulation
Symptoms          other symptoms observed (multivalued attribute)
Thrombosis        degree of thrombosis
                  0: negative (no thrombosis)
                  1: positive (the most severe one)
                  2: positive (severe)
                  3: positive (mild)
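
Note that the Thrombosis codes are not ordered by severity: 1 is the most severe and 3 the mildest, while 0 means no thrombosis. A small decoding helper can make this explicit. The sketch below is illustrative only; the function name and the derived `severity_rank` field are invented here.

```python
# Labels copied from the attribute description above.
THROMBOSIS_LABELS = {
    0: "negative (no thrombosis)",
    1: "positive (the most severe one)",
    2: "positive (severe)",
    3: "positive (mild)",
}

# Invented helper scale: 0 = none, 1 = mild, 2 = severe, 3 = most severe.
SEVERITY_RANK = {0: 0, 3: 1, 2: 2, 1: 3}


def decode_thrombosis(raw):
    """Turn a raw Thrombosis value (string or int) into a small record."""
    code = int(raw)
    if code not in THROMBOSIS_LABELS:
        raise ValueError(f"unexpected Thrombosis code: {raw!r}")
    return {
        "code": code,
        "label": THROMBOSIS_LABELS[code],
        "positive": code != 0,
        "severity_rank": SEVERITY_RANK[code],
    }


mild = decode_thrombosis("3")
worst = decode_thrombosis(1)
```

The binary `positive` flag suits detection tasks, while `severity_rank` gives a monotone scale if severity is to be modelled.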

The examination date is very close to the date of thrombosis. For negative examples, these tests were performed when thrombosis was suspected.
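
Because the examination date approximates the date of thrombosis, it can serve as an anchor for selecting ordinary laboratory tests from a time window around the event. The sketch below assumes the examination date is given as YYYY/MM/DD and the laboratory test date as YYMMDD, as stated in the attribute descriptions of this guide. The window length, function names, and sample rows are invented for illustration.

```python
from datetime import datetime, timedelta


def parse_exam_date(text):
    """Parse an examination date in YYYY/MM/DD form."""
    return datetime.strptime(text.strip(), "%Y/%m/%d").date()


def parse_lab_date(text):
    """Parse a laboratory test date in YYMMDD form.

    %y maps 69-99 to 1969-1999, which covers the stated 1980-1999 range.
    """
    return datetime.strptime(text.strip(), "%y%m%d").date()


def labs_in_window(lab_rows, exam_date, days_before=30, days_after=0):
    """Return the lab rows dated within the window around exam_date."""
    start = exam_date - timedelta(days=days_before)
    end = exam_date + timedelta(days=days_after)
    return [row for row in lab_rows if start <= parse_lab_date(row["Date"]) <= end]


# Invented rows for one patient; values are not from the real data.
exam = parse_exam_date("1997/05/20")
lab_rows = [
    {"Date": "970425", "PLT": "180"},
    {"Date": "970519", "PLT": "95"},
    {"Date": "970601", "PLT": "210"},
    {"Date": "960101", "PLT": "230"},
]
window = labs_in_window(lab_rows, exam)
```

Setting `days_after` to a positive value would select tests following the event instead, which corresponds to the "before/after" patterns mentioned in the task list.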



TSUM_C.CSV

Laboratory examinations stored in the Hospital Information System (from 1980 to March 1999). All the data are ordinary laboratory examinations and carry time stamps. The tests are not necessarily connected to thrombosis.

Item       Meaning                                       Normal range
ID         identification of the patient
Date       date of the laboratory tests (YYMMDD)
GOT        AST, glutamic oxaloacetic transaminase        N < 60
GPT        ALT, glutamic pyruvic transaminase            N < 60
LDH        lactate dehydrogenase                         N < 500
ALP        alkaline phosphatase                          N < 300
TP         total protein                                 6.0 < N < 8.5
ALB        albumin                                       3.5 < N < 5.5
UA         uric acid                                     N > 8.0 (Male); N > 6.5 (Female)
UN         urea nitrogen                                 N < 30
CRE        creatinine                                    N < 1.5
T-BIL      total bilirubin                               N < 2.0
T-CHO      total cholesterol                             N < 250
TG         triglyceride                                  N < 200
CPK        creatine phosphokinase                        N < 250
GLU        blood glucose                                 N < 180
WBC        White blood cell                              3.5 < N < 9.0
RBC        Red blood cell                                3.5 < N < 6.0
HGB        Hemoglobin                                    10 < N < 17
HCT        Hematocrit                                    29 < N < 52
PLT        platelet                                      100 < N < 400
PT         prothrombin time                              N < 14
Note       comment for the test PT
APTT       activated partial thromboplastin time         N < 45
FG         fibrinogen                                    150 < N < 450
AT3        marker of DIC, one of the most important      70 < N < 130
           complications of collagen diseases
A2PI       marker of DIC                                 70 < N < 130
U-PRO      proteinuria                                   0 < N < 30
IGG        Ig G                                          900 < N < 2000
IGA        Ig A                                          80 < N < 500
IGM        Ig M                                          40 < N < 400
CRP        C-reactive protein                            N = -, +-, or N < 1.0
RA         Rheumatoid factor                             N = -, +-
C3         complement 3                                  N > 35
C4         complement 4                                  N > 10
RNP        anti-ribonuclear protein                      N = -, +-
SM         anti-SM                                       N = -, +-
SCl70      anti-scl70                                    N = -, +-
SSA        anti-SSA                                      N = -, +-
SSB        anti-SSB                                      N = -, +-
CENTROMEA  anti-centromere                               N = -, +-
DNA        anti-DNA                                      N < 8
DNA-II     anti-DNA                                      N < 8
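
The normal ranges listed above can be turned into a simple abnormality flag for each measurement. The sketch below encodes a few of the ranges exactly as they are printed in this guide and treats the bounds as strict inequalities. The function names are invented, only a subset of items is included, and the handling of missing or non-numeric entries in the real file is not covered.

```python
# (lower, upper) bounds copied from the guide for a few items;
# None means the guide gives no bound on that side.
NUMERIC_RANGES = {
    "GOT": (None, 60),
    "TP": (6.0, 8.5),
    "PLT": (100, 400),
    "C3": (35, None),
}

# Qualitative results listed as normal in the guide ("N = -, +-").
NORMAL_QUALITATIVE = {"-", "+-"}


def is_normal_numeric(item, value):
    """Check a numeric result against the strict bounds for the given item."""
    low, high = NUMERIC_RANGES[item]
    above_low = low is None or value > low
    below_high = high is None or value < high
    return above_low and below_high


def is_normal_qualitative(value):
    """Check a qualitative result such as '-', '+-' or '+'."""
    return value.strip() in NORMAL_QUALITATIVE


# Invented example values, not taken from the real data.
flags = {
    "GOT": is_normal_numeric("GOT", 45),
    "PLT": is_normal_numeric("PLT", 95),
    "C3": is_normal_numeric("C3", 20),
    "RNP": is_normal_qualitative("+"),
}
```

Items such as CRP mix qualitative and numeric results, so they would need both checks, chosen per value.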


This database was donated by Dr. Katsuhiko Takabayashi and prepared by Prof. Shusaku Tsumoto.
For questions on the data and task description, contact Petr Berka. All questions and answers will be published as appendices to this document.


Asked Questions