1. Introduction
Since 1995, researchers have proposed to embed machine-learning techniques into computer-aided systems, such as medical diagnosis systems [1,2,3,4]. Andres et al. [5] proposed an ensemble of a fuzzy system and an evolutionary algorithm for breast cancer diagnosis, which can evaluate the confidence level with which the system responds and clarify how it derives its outputs. Huang et al. [6] constructed a hybrid SVM-based strategy with feature selection to find the important risk factors for breast cancer. Generally speaking, without domain-specific expertise in medicine, researchers in data mining prefer models with high classification accuracy and low computational complexity. In contrast, common people (including patients and their relatives) hope that the models also offer high-level interpretability. Bayesian network classifiers (BNCs) are such models: they can graphically describe the conditional dependence between attributes (or variables) and are considered to be among the most promising graphical models [7,8]. They can mine statistical knowledge from data and support inference under conditions of uncertainty [9,10]. BNCs, from the 0-dependence Naive Bayes (NB) [11] to the 1-dependence tree augmented Naive Bayes (TAN) [12], and then to the arbitrary k-dependence Bayesian classifier (KDB) [13], can represent knowledge with network structures of varying complexity. KDB can theoretically represent conditional dependence relationships of arbitrary complexity. However, this approach is not effective for some specific cases. A single model learned from training data cannot be expected to fit all testing instances; otherwise, its bias and variance would both always be 0, which contradicts the bias-variance dilemma [14].
In the case of breast cancer, the dependence relationships between attributes may differ from case to case. For BNCs, the conditional mutual information (CMI) [15]

$$I(X_i;X_j|C)=\sum_{x_i,x_j,c}P(x_i,x_j,c)\log\frac{P(x_i,x_j|c)}{P(x_i|c)P(x_j|c)} \qquad (1)$$

is commonly used to measure the conditional dependence between attributes $X_i$ and $X_j$ given the class variable $C$. Correspondingly, each summand of (1),

$$I(x_i;x_j|C)=\sum_{c}P(x_i,x_j,c)\log\frac{P(x_i,x_j|c)}{P(x_i|c)P(x_j|c)},$$

can measure the conditional dependence between $X_i$ and $X_j$ when they take the specific values $x_i$ and $x_j$. When $P(x_i,x_j|c)>P(x_i|c)P(x_j|c)$ or, equivalently, $P(x_i|x_j,c)>P(x_i|c)$, then $I(x_i;x_j|C)>0$ holds and the relationship between the attribute values $x_i$ and $x_j$ can be considered to be conditional dependence. In contrast, when $P(x_i,x_j|c)<P(x_i|c)P(x_j|c)$ or, equivalently, $P(x_i|x_j,c)<P(x_i|c)$, then $I(x_i;x_j|C)<0$ holds and we argue that the relationship between $x_i$ and $x_j$ can be considered to be conditional independence. When $P(x_i,x_j|c)=P(x_i|c)P(x_j|c)$ and thus $I(x_i;x_j|C)=0$, the relationship between $x_i$ and $x_j$ just turns from conditional dependence to conditional independence. On the dataset WBC (breast cancer), the CMI between the attributes "Clump Thickness" ($X_1$) and "Uniformity of Cell Size" ($X_2$), $I(X_1;X_2|C)$, achieves the largest value (0.4733) among all attribute pairs. The distribution of $I(x_1;x_2|C)$, whose values correspond to the different attribute-value pairs of $X_1$ and $X_2$, is shown in Figure 1. As Figure 1 shows, the relationship between attributes $X_1$ and $X_2$ is dependent in general, because the positive values of $I(x_1;x_2|C)$, which represent conditional dependence, account for a high proportion of all the values; in addition, some $I(x_1;x_2|C)$ values are especially large. In contrast, there also exist some negative values of $I(x_1;x_2|C)$ that represent conditional independence, i.e., the dependence relationship may vary rather than remain invariant when the attributes take different values. However, general BNCs (like NB, TAN and KDB), which build only one model to fit the training instances, cannot capture this difference and cannot represent the dependence relationships flexibly.
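To make the sign-based reading of $I(x_i;x_j|C)$ concrete, the following Python sketch (our illustration, not the authors' code; the toy data is hypothetical, not the WBC data) computes CMI and its per-value-pair terms:

```python
from collections import Counter
from math import log

# Toy data: rows of (x_i, x_j, c). Hypothetical values, not the WBC data.
data = [(1, 1, 0), (1, 1, 0), (2, 2, 0), (2, 2, 0),
        (1, 2, 0), (2, 1, 0), (1, 1, 1), (2, 2, 1)]
N = len(data)
n_xyc = Counter(data)                            # counts of (x_i, x_j, c)
n_c = Counter(c for _, _, c in data)             # counts of c
n_xc = Counter((x, c) for x, _, c in data)       # counts of (x_i, c)
n_yc = Counter((y, c) for _, y, c in data)       # counts of (x_j, c)

def clmi(x, y):
    """I(x_i; x_j | C): the contribution of one attribute-value pair."""
    s = 0.0
    for c in n_c:
        if n_xyc[(x, y, c)] == 0:
            continue                              # 0 * log(...) term vanishes
        # P(x,y|c) / (P(x|c) P(y|c)) reduces to N_xyc * N_c / (N_xc * N_yc)
        ratio = n_xyc[(x, y, c)] * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)])
        s += (n_xyc[(x, y, c)] / N) * log(ratio)
    return s

pairs = sorted({(x, y) for x, y, _ in data})
for x, y in pairs:
    print(f"I({x};{y}|C) = {clmi(x, y):+.4f}")    # sign: local (in)dependence
print("I(Xi;Xj|C) =", round(sum(clmi(x, y) for x, y in pairs), 4))
```

Running the sketch shows that the value pair (1, 2) receives a negative local term while (1, 1) receives a positive one, even though the overall CMI is positive, which mirrors the situation in Figure 1.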
To meet the needs of experts in machine learning or in medicine and of common people (including patients and their relatives), and to address the problem of breast cancer diagnosis mentioned above, we propose a novel sorting strategy and extend KDB from a single restricted network to unrestricted ensemble networks, i.e., the unrestricted k-dependence Bayesian classifier (UKDB), in terms of Markov blanket analysis and target learning. Target learning [16] is a framework that takes each unlabeled testing instance $\mathbf{x}$ as a target and builds a specific Bayesian model BNC$_\mathbf{x}$ to complement the BNC$_T$ learned from training data $T$.
To clarify the basic idea of UKDB, we introduce two concepts. "Domain knowledge" expresses a general knowledge framework learned from the training data; it focuses on describing the interdependencies between attributes, such as attributes $X_i$ and $X_j$. "Personalized knowledge" expresses a specific knowledge framework learned from the attribute values in the testing instance, such as attribute values $x_i$ and $x_j$. Taking breast cancer as an example, there is a strong correlation between the attributes "Clump Thickness" and "Uniformity of Cell Size" (the corresponding CMI achieves the maximum value, i.e., 0.4733), which can be considered domain knowledge. In contrast, for a testing instance with attribute values "Clump Thickness = 1" and "Uniformity of Cell Size = 3", the dependence relationship between those attribute values is approximately independent (the corresponding value of $I(x_i;x_j|C)$ is 0.0002), which can be regarded as personalized knowledge. Personalized knowledge with clear expressivity (the capacity to describe dependence relationships between attributes in different situations) and tight coupling (the capacity to describe the most significant dependencies between attributes) makes the quest for highly scalable learners ever more urgent.
UKDB contains two sub-models: UKDB$_T$ and UKDB$_\mathbf{x}$. UKDB$_T$ is learned from the training data $T$; it can be thought of as a spectrum of dependencies and is a statistical form of domain knowledge. UKDB$_\mathbf{x}$ is a specific BNC that mines the personalized knowledge implicated in each single testing instance $\mathbf{x}$, i.e., the specific knowledge that describes the conditional dependencies between the attribute values in $\mathbf{x}$. UKDB$_T$ and UKDB$_\mathbf{x}$ apply the same strategy to build the network structure, but they apply different probability distributions and target different data spaces; thus, they are complementary in nature. In contrast to a restricted BNC, e.g., KDB, UKDB can discriminatively learn different unrestricted Bayesian network structures to represent different knowledge from the training dataset and the testing instance, respectively.
The Wisconsin breast cancer (WBC) database [17] is usually used as a benchmark dataset [1,2,3,4] and is also selected in our main experiments as a case study to demonstrate personalized Bayesian network (BN) structures. The case study on the WBC database, as well as an extensive experimental comparison on 10 additional UCI datasets involving some benchmark BNCs, shows the advantages of the proposed approach.
2. Bayesian Network and Markov Blanket
All the symbols used in this paper are shown in Table 1. We wish to build a Bayesian network classifier from a labeled training dataset $T$ such that the classifier can estimate the probability $P(c|\mathbf{x})$ and assign a discrete class label $c$ to a testing instance $\mathbf{x}$. BNs are powerful tools for knowledge representation and inference under conditions of uncertainty. A BN consists of two parts: the qualitative part, in the form of a directed acyclic graph, in which each node represents a variable in the training data and the directed edges between pairs of nodes represent dependence relationships between them; and the quantitative part, based on local probability distributions, which specifies those dependence relationships. Even though BNs can deal with continuous variables, we exclusively discuss BNs with discrete nodes in this paper. Directed edges represent statistical or causal dependencies among the variables. The directions are used to define the parent-child relationships. For example, given an edge $X \rightarrow Y$, $X$ is the parent node of $Y$, and $Y$ is the child node.
A node is conditionally independent of every other node in the graph given its parents ($Pa$), its children ($Ch$), and the other parents of its children ($Sp$). The set $\{Pa, Ch, Sp\}$ forms the Markov blanket of the node [7], which contains all the information or knowledge necessary to describe the relationships between that node and the other nodes. BNCs are a special type of BN. By applying different learning strategies, BNCs encode the dependence relationships between the predictive attributes $\mathbf{X}=\{X_1,\dots,X_n\}$ and the class variable $C$. Thus, the Markov blanket of $C$ can provide the necessary knowledge for classification.
Suppose that an instance $\mathbf{x}$ is represented as $\{x_1,x_2,\dots,x_n\}$; the joint probability distribution $P(\mathbf{x},c)$ can be described in the form of the chain rule:

$$P(\mathbf{x},c)=P(c)P(x_1|c)P(x_2|x_1,c)\cdots P(x_n|x_1,\dots,x_{n-1},c). \qquad (2)$$

The unrestricted BNC shown in Figure 2, which corresponds to (2), is a full Bayesian classifier (i.e., one that assumes no independencies). Learning and performing inference in such an unrestricted model is an NP-hard problem.
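To see why the full model is intractable in practice, the following sketch (an illustration under our own assumptions, not an analysis from the paper) counts the conditional-probability-table entries implied by Equation (2); the count grows exponentially with the number of attributes:

```python
# Free parameters of the full BNC of Equation (2), where attribute X_i
# conditions on all of X_1..X_{i-1} and C; n attributes, v values each,
# m classes. Purely illustrative accounting.
def full_bnc_parameters(n_attributes: int, v: int, m: int) -> int:
    total = m - 1                               # P(c)
    for i in range(1, n_attributes + 1):
        total += m * v ** (i - 1) * (v - 1)     # P(x_i | x_1..x_{i-1}, c)
    return total

# Table size explodes exponentially with n, which is why learning and
# inference in the unrestricted model are intractable.
for n in (5, 10, 20):
    print(n, full_bnc_parameters(n, v=3, m=2))
```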
NB is the simplest of the BNCs. Given the class variable $C$, the predictive attributes are supposed to be conditionally independent of one another, i.e.,

$$P(\mathbf{x},c)=P(c)\prod_{i=1}^{n}P(x_i|c). \qquad (3)$$

Even though this supposition rarely holds, NB's classification performance is competitive with some benchmark algorithms, e.g., decision trees, due to its insensitivity to changes in the training data and its approximate estimation of the conditional probabilities $P(x_i|c)$ [10]. Figure 3 shows the structure of NB. In contrast to Figure 2, there exists no edge between attribute nodes in NB, and thus it represents 0 conditional dependencies. It is obvious that the conditional independence assumption is too strict to be true in reality. When complex attribute dependencies exist, the assumption results in classification bias.
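As a concrete reference point, here is a minimal sketch of NB classification under Equation (3); the toy counts and the use of Laplace smoothing are our own choices, not prescribed by the paper:

```python
from collections import Counter, defaultdict
from math import log

# Toy training data: (attribute-value tuple, class label).
train = [((1, 0), 'pos'), ((1, 1), 'pos'), ((0, 1), 'neg'), ((0, 0), 'neg')]
class_count = Counter(c for _, c in train)
value_count = defaultdict(Counter)          # value_count[(i, c)][x_i]
for x, c in train:
    for i, xi in enumerate(x):
        value_count[(i, c)][xi] += 1

def nb_log_joint(x, c, v=2, alpha=1.0):
    """log P(c) + sum_i log P(x_i | c), with Laplace smoothing alpha."""
    lp = log(class_count[c] / len(train))
    for i, xi in enumerate(x):
        lp += log((value_count[(i, c)][xi] + alpha) /
                  (class_count[c] + alpha * v))
    return lp

x_test = (1, 0)
print(max(class_count, key=lambda c: nb_log_joint(x_test, c)))  # -> 'pos'
```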
TAN relaxes the independence assumption and extends NB from a 0-dependence tree to a 1-dependence maximum weighted spanning tree [12]. The joint probability for TAN turns out to be

$$P(\mathbf{x},c)=P(c)\prod_{i=1}^{n}P(x_i|\pi_i,c), \qquad (4)$$

where $\pi_i$ is the parent attribute of $X_i$. The constraint on the number of parents strictly requires that only the most significant, i.e., $n-1$, conditional dependencies are allowed to be represented. By comparing CMI, edges between $X_i$ and $X_j$ are added to the network in turn to build a maximum weighted spanning tree. When the conditional independence assumption does not hold, TAN is supposed to achieve better classification performance than NB. An example of TAN is shown in Figure 4.
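The tree-construction step can be sketched as follows. We use Prim's algorithm to maximize the total CMI weight, one common realization of the maximum weighted spanning tree step; the CMI matrix below is a toy input rather than values estimated from data:

```python
# Build a maximum weighted spanning tree over attributes using CMI weights,
# then read edges as parent links oriented away from a chosen root.
def tan_tree(cmi, root=0):
    n = len(cmi)
    in_tree, parents = {root}, {root: None}
    while len(in_tree) < n:
        # pick the heaviest edge leaving the current tree
        u, v = max(((u, v) for u in in_tree for v in range(n)
                    if v not in in_tree), key=lambda e: cmi[e[0]][e[1]])
        parents[v] = u
        in_tree.add(v)
    return parents  # parents[i] is X_i's attribute parent (C is implicit)

cmi = [[0.0, 0.47, 0.10],
       [0.47, 0.0, 0.30],
       [0.10, 0.30, 0.0]]
print(tan_tree(cmi))   # {0: None, 1: 0, 2: 1}
```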
KDB can represent an arbitrary degree of dependence and control its bias/variance trade-off with a single parameter, $k$. By comparing the mutual information (MI) [15]

$$I(X_i;C)=\sum_{x_i,c}P(x_i,c)\log\frac{P(x_i,c)}{P(x_i)P(c)}, \qquad (5)$$

attributes are sorted in descending order and enter the network structure in turn.
To control the structure complexity, each attribute is required to have no more than $k$ parent attributes. Thus, any attribute with at most $k$ predecessors in the order indiscriminately selects all the attributes already in the model as its parents. Each of the remaining attributes selects the $k$ parent attributes $X_j$ that correspond to the highest values of $I(X_i;X_j|C)$, where $X_j$ ranks before $X_i$.
Suppose that the attribute order is $\{X_1,X_2,\dots,X_n\}$; the joint probability for KDB then turns out to be

$$P(\mathbf{x},c)=P(c)\prod_{i=1}^{n}P(x_i|\Pi_i,c), \qquad (6)$$

where $\Pi_i$ denotes the $j$ parent attributes of $X_i$ in the structure and $j=\min(i-1,k)$. KDB can represent $nk-k(k+1)/2$ conditional dependencies. When $k=1$, KDB represents the same number of conditional dependencies as TAN. As $k$ increases, KDB can represent increasingly many conditional dependencies. Figure 5 shows an example of KDB with $k=2$.
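The KDB structure step described above can be sketched as follows; the MI and CMI values are toy inputs, and the function names are ours:

```python
# KDB structure learning: sort attributes by MI with C, then give each
# attribute at most k parents, chosen by the largest CMI with attributes
# already in the model.
def kdb_structure(mi, cmi, k):
    order = sorted(range(len(mi)), key=lambda i: -mi[i])   # descending MI
    parents = {}
    for pos, xi in enumerate(order):
        earlier = order[:pos]                   # attributes already added
        # top-k earlier attributes by CMI with xi (all of them if pos <= k)
        parents[xi] = sorted(earlier, key=lambda xj: -cmi[xi][xj])[:k]
    return order, parents

mi  = [0.9, 0.5, 0.7, 0.2]
cmi = [[0.0, 0.3, 0.6, 0.1],
       [0.3, 0.0, 0.4, 0.2],
       [0.6, 0.4, 0.0, 0.5],
       [0.1, 0.2, 0.5, 0.0]]
order, parents = kdb_structure(mi, cmi, k=2)
print(order)    # [0, 2, 1, 3]
print(parents)  # each attribute keeps its strongest earlier parents
```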
Since KDB can be extended to describe dependence relationships of arbitrary degree, and thus demonstrates its flexibility, researchers have proposed many important refinements to improve its performance [18,19,20,21]. Pernkopf and Bilmes [22] proposed a greedy heuristic strategy that determines the attribute order by comparing, for each attribute $X_i$, its dependence on the candidate parents $X_j$ that rank higher in the order, i.e., $j<i$. Taheri et al. [23] proposed to build a dynamic structure without specifying $k$ a priori, and they proved that the resulting BNC is optimal.
3. The UKDB Algorithm
According to the generative approach, restricted BNCs, which take the class variable $C$ as the common parent of all predictive attributes, define a unique joint probability distribution $P(\mathbf{x},c)$ in the form of a chain rule of lower-order conditional probabilities:

$$P(\mathbf{x},c)=P(c)\prod_{i=1}^{n}P(x_i|\Pi_i,c). \qquad (7)$$

The corresponding classification rule is

$$c^{*}=\arg\max_{c}P(c)\prod_{i=1}^{n}P(x_i|\Pi_i,c). \qquad (8)$$

To maximize $P(\mathbf{x},c)$, an ideal condition is that each factor $P(x_i|\Pi_i,c)$ is maximized. In other words, $X_i$ should be strongly dependent on its parents, especially on the class variable $C$. Given a limited number of training instances, the reliability of the conditional probability estimate $P(x_i|\Pi_i,c)$ increases as the dependence between $X_i$ and its parent attributes increases. To achieve a trade-off between classification performance and structure complexity, only a limited number of dependence relationships are represented by BNCs such as KDB. The classification rule for KDB then turns out to be

$$c^{*}=\arg\max_{c}P(c)\prod_{i=1}^{n}P(x_i|\Pi_i,c),\quad \Pi_i\subseteq\{X_1,\dots,X_{i-1}\},\ |\Pi_i|\le k, \qquad (9)$$

i.e., $\Pi_i$ is one subset of $\{X_1,\dots,X_{i-1}\}$ and contains at most $k$ attributes. Obviously, $\Pi_i\subset\{X_1,\dots,X_{i-1}\}$ whenever $i-1>k$. No matter what the attribute order is, the full BNC represents the same joint distribution, i.e., $P(\mathbf{x},c)$ in Equation (2).
In contrast, from Equation (8) we can see that, for different attribute orders, the candidate parents of $X_i$ may differ greatly. The joint distributions $P(\mathbf{x},c)$ represented by KDBs learned from different attribute orders may not be the same. The key issue in structure learning of a restricted BNC is how to describe the most significant conditional dependence relationships among the predictive attributes or, more precisely, the relationships between $X_i$ and its parent attributes $\Pi_i$. For KDB, however, the attributes are sorted in descending order of $I(X_i;C)$, which only considers the dependence between $X_i$ and the class variable $C$ while neglecting the conditional dependence between $X_i$ and its parents. If the first few attributes in the order are relatively independent of each other, the robustness of the network structure is damaged from the very beginning of structure learning. To address this issue, UKDB selects a set of parents of the variable $C$, denoted $\mathbf{X}_S$, whose members are also the parents of the other attributes from the viewpoint of the Markov blanket. In addition, there should exist strong conditional dependence relationships between $\mathbf{X}_S$ and the other attributes. On the other hand, $k$ corresponds to the maximum allowable degree of attribute dependence; thus, the number of attributes in $\mathbf{X}_S$ is $k$.
Suppose that the attribute set $\mathbf{X}_S$ contains the $k$ attributes $\{X_1,\dots,X_k\}$ and that the order of the attributes in $\mathbf{X}$ is $\{X_1,X_2,\dots,X_n\}$; Formula (7) can then be rewritten in another form:

$$P(\mathbf{x},c)=P(\mathbf{x}_S)P(c|\mathbf{x}_S)\prod_{i=k+1}^{n}P(x_i|\Pi_i,c), \qquad (10)$$

where $\mathbf{x}_S=\{x_1,\dots,x_k\}$. The relationships between $X_i$ and its parents corresponding to Equations (7) and (10) are shown in Table 2.
Since $P(\mathbf{x}_S)$ is irrelevant to the classification,

$$\arg\max_{c}P(\mathbf{x},c)=\arg\max_{c}P(c|\mathbf{x}_S)\prod_{i=k+1}^{n}P(x_i|\Pi_i,c). \qquad (11)$$

Thus, UKDB uses the following formula for classification:

$$c^{*}=\arg\max_{c}P(c|\mathbf{x}_S)\prod_{i=k+1}^{n}P(x_i|\Pi_i,c), \qquad (12)$$
where $\Pi_i$ is one subset of $\{\mathbf{X}_S,X_{k+1},\dots,X_{i-1}\}$ and contains $k$ attributes. Any attribute $X_j\in\mathbf{X}_S$ ($1\le j\le k$) is a parent of the other attributes, so there should exist strong conditional dependencies, or tight coupling, between $X_j$ and the rest. To this end, we sort the attributes by comparing the sum of CMI (SCMI), $\sum_{j\neq i}I(X_i;X_j|C)$. The first $k$ attributes in the order, i.e., those with the largest SCMI, are selected as $\mathbf{X}_S$. To control the structure complexity, UKDB also requires that each remaining attribute $X_i$ select at most $k$ parents, as shown in Table 2. The parent sets $\Pi_i$, and with them the whole structure, are determined thereafter.
Figure 6 shows two examples of UKDB when $k=1$ and $k=2$.
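The SCMI-based selection of $\mathbf{X}_S$ can be sketched as follows (toy CMI matrix; illustrative only, not the authors' code):

```python
# Score each attribute by the sum of its CMI values against all other
# attributes (SCMI) and take the top k as X_S, the common parents of C
# and of the remaining attributes.
def select_xs(cmi, k):
    n = len(cmi)
    scmi = [sum(cmi[i][j] for j in range(n) if j != i) for i in range(n)]
    order = sorted(range(n), key=lambda i: -scmi[i])   # descending SCMI
    return order[:k], order[k:], scmi

cmi = [[0.0, 0.47, 0.10, 0.05],
       [0.47, 0.0, 0.30, 0.20],
       [0.10, 0.30, 0.0, 0.15],
       [0.05, 0.20, 0.15, 0.0]]
xs, rest, scmi = select_xs(cmi, k=2)
print(xs, rest, [round(s, 2) for s in scmi])  # X_S = two most coupled attributes
```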
In the real world, the same dependence relationship between attributes may lead to a wrong diagnosis or therapy when the attributes take different values. Consider the attributes Sex and Pregnant: Sex = "Female" and Pregnant = "Yes" are highly related, whereas Sex = "Female" and Pregnant = "No" also holds for some instances. Obviously, the treatment of breast cancer during pregnancy should differ from that during non-pregnancy. CMI can weigh the conditional dependency between Sex and Pregnant, but it cannot discriminately weigh the dependencies when these two attributes take different values. Target learning takes each testing instance $\mathbf{x}$ as a target and tries to mine the dependence relationships between its attribute values [16]. From Equations (1) and (5), we have the following equations:

$$I(X_i;C)=\sum_{x_i}I(x_i;C),\qquad I(X_i;X_j|C)=\sum_{x_i,x_j}I(x_i;x_j|C), \qquad (13)$$
where

$$I(x_i;C)=\sum_{c}P(x_i,c)\log\frac{P(x_i,c)}{P(x_i)P(c)},\qquad I(x_i;x_j|C)=\sum_{c}P(x_i,x_j,c)\log\frac{P(x_i,x_j|c)}{P(x_i|c)P(x_j|c)}. \qquad (14)$$

MI and CMI, by definition, measure the average dependence between attributes implicated in the training data. In contrast, the local mutual information (LMI) $I(x_i;C)$ and the conditional local mutual information (CLMI) $I(x_i;x_j|C)$ can weigh the direct dependence and the conditional dependence between the attribute values implicated in each single instance [16,24]. Similarly, we sort the attribute values by comparing the sum of CLMI (SCLMI), $\sum_{j\neq i}I(x_i;x_j|C)$.
For Bayesian inference, LMI refers to the event $X_i=x_i$ and can be used to measure the expected mutual dependence between $x_i$ and $C$ after observing that $X_i=x_i$. CLMI can be used to weigh the conditional dependence between the attribute values $x_i$ and $x_j$ while considering all possible values of the class variable $C$.
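For concreteness, the following sketch computes LMI for a single attribute value from co-occurrence counts; CLMI is analogous, using the triple counts over $(x_i,x_j,c)$. The counts are toy numbers:

```python
from math import log

def lmi(n_xc, n_x, n_c, N):
    """I(x_i; C) = sum_c P(x_i,c) log [ P(x_i,c) / (P(x_i) P(c)) ]."""
    # P(x,c)/(P(x)P(c)) reduces to N_xc * N / (N_x * N_c) with raw counts.
    return sum((n_xc[c] / N) * log((n_xc[c] * N) / (n_x * n_c[c]))
               for c in n_c if n_xc[c] > 0)

# Toy counts for one value x_i: per-class joint counts, the value's
# marginal count, class counts, and the total number of instances.
print(lmi(n_xc={0: 30, 1: 5}, n_x=35, n_c={0: 40, 1: 60}, N=100))
```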
From Equations (1) and (5), to compute $I(X_i;C)$ or $I(X_i;X_j|C)$, all possible values of attribute $X_i$ need to be considered. If there exist missing or unknown values for attributes $X_i$ and $X_j$ in any instance, they will be replaced by some values, and noise may thus be artificially introduced into the computation of $I(X_i;C)$ or $I(X_i;X_j|C)$. These missing or unknown values are regarded as noisy because the conditional dependence relationships between them and the other, non-noisy attribute values may be incorrectly measured. If the noisy part accounts for only a small portion of the data, the dependence relationships learned from the training data may still be of a high confidence level and the network structure of UKDB$_T$ may still be robust. In contrast, from the definitions of LMI and CLMI (Equation (14)) we can see that, for a specific instance $\mathbf{x}$, computing $I(x_i;C)$ or $I(x_i;x_j|C)$ requires considering only the attribute values that appear in $\mathbf{x}$; the computation of $I(x_i;C)$ or $I(x_i;x_j|C)$ over noisy values is not needed. Thus, neglecting these noisy conditional dependence relationships may make the network structure of UKDB$_\mathbf{x}$ more robust.
We propose to use the Markov blanket and target learning to build an ensemble of two unrestricted BNCs, i.e., UKDB$_T$ and UKDB$_\mathbf{x}$. UKDB$_T$ and UKDB$_\mathbf{x}$ learn from different parts of the data space, and their learning procedures are almost the same; thus, they are complementary in nature. In the training phase, by calculating MI and CMI, UKDB$_T$ describes the global conditional dependencies implicated in the training data $T$. Correspondingly, in the classification phase, by calculating LMI and CLMI, UKDB$_\mathbf{x}$ describes the local conditional dependencies implicated in the unlabeled testing instance $\mathbf{x}$. Breiman [25] revealed that ensemble learning brings improvements in accuracy only to "unstable" learning algorithms, in the sense that small variations in the training set would lead them to produce very different models. UKDB$_T$ and UKDB$_\mathbf{x}$ are such algorithms. UKDB$_T$ tries to learn the certain domain knowledge implicated in the training dataset, but that domain knowledge may fail to describe the conditional dependencies in the testing instance $\mathbf{x}$; it may cause overfitting on the training set and underfitting on the testing instance. In contrast, UKDB$_\mathbf{x}$ can describe the conditional dependencies implicated in the testing instance $\mathbf{x}$, but that personalized knowledge is uncertain since the class label of $\mathbf{x}$ is unknown; it may cause underfitting on the training set and overfitting on the testing instance. Thus, an ensemble of UKDB$_T$ and UKDB$_\mathbf{x}$ may be much more appropriate for making the final prediction.
The learning procedure of UKDB$_T$ is described by Algorithm 1 as follows:
Algorithm 1: The UKDB$_T$ algorithm.
Since the class label of the testing instance $\mathbf{x}$ is unknown, we can get all possible class labels from the training set $T$. Assume that the testing instance $\mathbf{x}$ may belong to each class $c\in\{c_1,\dots,c_m\}$ in turn; there will then be $m$ "pseudo" instances. By adding these $m$ "pseudo" instances to the training set $T$, we can estimate the joint or conditional probabilities between arbitrary attribute-value pairs by using Equation (14), and thus achieve the aim of learning conditional independence from a testing instance $\mathbf{x}$.
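The pseudo-instance expansion can be sketched as follows; the attribute values and class labels are illustrative:

```python
def pseudo_instances(x_test, class_labels):
    """Return one labeled copy of x_test per candidate class label."""
    return [(tuple(x_test), c) for c in class_labels]

# '?' marks a missing value; the local statistics later simply skip it,
# so no artificial noise is introduced for that attribute.
print(pseudo_instances((5, 1, '?', 2), ['benign', 'malignant']))
```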
The learning procedure of UKDB$_\mathbf{x}$ is shown in Algorithm 2, where "?" represents a missing value in the dataset. To estimate the marginal and joint probabilities $P(x_i,c)$ and $P(x_i,x_j,c)$, at training time UKDB$_T$ needs one pass through the training data to collect the base statistics of co-occurrence counts. Calculating MI and CMI respectively needs $O(mnv)$ and $O(m(nv)^2)$ time, where $N$ is the number of training instances, $m$ is the number of classes, $n$ is the number of attributes and $v$ is the average number of values that a discrete attribute may take. The procedure of parent assignment for the attributes needs $O(kn^2)$ time. Thus, the time complexity for UKDB$_T$ to build the actual network structure is $O(Nn^2+m(nv)^2)$. Since UKDB$_\mathbf{x}$ only needs to consider the attribute values in the testing instance, calculating LMI and CLMI respectively needs only $O(mn)$ and $O(mn^2)$ time. The procedure of parent assignment in UKDB$_\mathbf{x}$ needs the same time, $O(kn^2)$. Thus, the time complexity for UKDB$_\mathbf{x}$ is only $O(mn^2+kn^2)$. UKDB$_T$ and UKDB$_\mathbf{x}$ use different variations of Equation (12) to classify each single instance, and the corresponding time complexities are the same, $O(mnk)$.
Algorithm 2: The UKDB$_\mathbf{x}$ algorithm.
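The single-pass count collection shared by both sub-models can be sketched as follows; the data structures are our own choices and may differ from those used in Algorithms 1 and 2:

```python
from collections import Counter

# One scan of the data fills the co-occurrence tables from which all
# marginal/joint probabilities (and MI, CMI, LMI, CLMI) follow.
def collect_counts(data):
    """data: iterable of (attribute-value tuple, class label)."""
    c_counts, xc_counts, xxc_counts = Counter(), Counter(), Counter()
    for x, c in data:
        c_counts[c] += 1
        for i, xi in enumerate(x):
            if xi == '?':                        # skip missing values
                continue
            xc_counts[(i, xi, c)] += 1
            for j in range(i + 1, len(x)):
                if x[j] != '?':
                    xxc_counts[(i, xi, j, x[j], c)] += 1
    return c_counts, xc_counts, xxc_counts

train = [((1, 1, 0), 'neg'), ((2, '?', 1), 'pos')]
print(collect_counts(train)[1])
```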
UKDB$_T$, learned from the training data $T$, describes the general conditional dependencies; thus, UKDB$_T$ corresponds to the domain knowledge that may be suitable for most cases. In contrast, UKDB$_\mathbf{x}$, learned from the testing instance $\mathbf{x}$, describes local conditional dependencies with uncertainty, because all class labels are considered; thus, UKDB$_\mathbf{x}$ corresponds to the personalized knowledge that may be suitable for $\mathbf{x}$ only [16].
When facing an unexpected case, it is difficult to judge which kind of knowledge should be given priority. Personalized knowledge may provide some statistical information that the expert does not recognize and help him use the domain knowledge to confirm or rule out a decision. For different cases, the weights of UKDB$_T$ and UKDB$_\mathbf{x}$ may differ greatly. In this paper, without any prior knowledge, we simply use the uniformly weighted average instead of a nonuniformly weighted one. The final probability estimate for the ensemble of UKDB$_T$ and UKDB$_\mathbf{x}$ is

$$\hat{P}(c|\mathbf{x})=\frac{1}{2}\left[\hat{P}_{T}(c|\mathbf{x})+\hat{P}_{\mathbf{x}}(c|\mathbf{x})\right].$$
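The uniformly weighted combination can be sketched as follows; the posterior values are placeholders for the outputs of the two sub-models:

```python
def ensemble_predict(post_T, post_x):
    """Uniformly average the two posterior estimates; return the best class."""
    avg = {c: 0.5 * (post_T[c] + post_x[c]) for c in post_T}
    return max(avg, key=avg.get), avg

post_T = {'benign': 0.80, 'malignant': 0.20}   # from UKDB_T (training data)
post_x = {'benign': 0.40, 'malignant': 0.60}   # from UKDB_x (testing instance)
print(ensemble_predict(post_T, post_x))        # ('benign', {...: 0.6, ...})
```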
5. Conclusions
In this paper, we have proposed to extend KDB from a restricted BNC to an unrestricted one by applying the Markov blanket. The final classifier, called UKDB, demonstrates better classification performance with high expressivity, enhanced robustness and tight coupling. For each testing instance $\mathbf{x}$, an appropriate local Bayesian classifier UKDB$_\mathbf{x}$ is built using the same learning strategy as that of the UKDB$_T$ learned from the training data $T$. Compared with other state-of-the-art BNCs, the novelty of UKDB is that it can use the information mined from labeled and unlabeled data to make joint decisions. From the case study we can see that, given two testing instances $\mathbf{x}$ and $\mathbf{x}'$, the weights of the dependence relationships between the same pair of attribute values may differ, which makes the topology of one UKDB$_\mathbf{x}$ distinguishable from that of another. Besides, the model is learned directly from data in some field, and it can only express part of the domain knowledge, i.e., datasets cover only part of the field, and the statistical knowledge mined from them may be contrary to expert knowledge. Some of the mined knowledge does not conform to the knowledge of medical experts, and discriminating such cases requires expert knowledge. Thus, given expertise in medicine, the network structures of UKDB$_T$ and UKDB$_\mathbf{x}$ could be further improved.
Given a limited number of instances, the accuracy of probability estimation determines the robustness of the dependence relationships, and in turn the structure complexity of BNCs. The characteristic of tight coupling helps UKDB improve the probability estimation. UKDB has been compared experimentally with some state-of-the-art BNCs of different structure complexities. Although KDB and UKDB are of the same structure complexity, UKDB presents a clear advantage over KDB in terms of classification accuracy (zero-one loss) and robustness (bias and variance). The independence assumption of NB rarely holds for all instances but may hold for a specific instance. However, high-dependence BNCs, e.g., TAN, KDB and UKDB, focus on the interdependence between attributes but disregard the independence between attribute values. If the independence in a testing instance can be measured and identified, UKDB can provide a much more competitive representation.
Target learning is related to dependence evaluation when attributes take specific values. Because the proposed UKDB$_\mathbf{x}$ is based on UKDB$_T$, it needs enough data to learn accurate conditional probabilities during structure learning. Thus, in practical applications, an inaccurate estimate of the conditional probability for some attribute values, e.g., $P(x_i|\Pi_i,c)$, may lead to noise propagation in the estimate of the joint probability $P(\mathbf{x},c)$. This situation is more pronounced when dealing with datasets with few attributes. Therefore, our further research is to determine the appropriate estimates of conditional probability needed for this purpose and to seek alternative methods, e.g., Laplace correction.