
Naive Bayes Classification: The Great Way Is Simplicity

Given m samples, x is the feature variable and y is the corresponding category.

A model function h is required such that $h(x)$ predicts y as accurately as possible for new samples.

Many machine learning algorithms construct the model function h from the perspective of error: first assume an h, then define an error between h(x) and y, and gradually reduce that error to obtain a fitted model h.

Now let’s think about it from a probability perspective.

Suppose y has K categories, that is, $y \in \{C_1, C_2, \ldots, C_K\}$.

For a sample x, if we can calculate the conditional probability of each category given x, then the category with the highest probability can be taken as the category to which x belongs. (For conditional probability and Bayes' theorem, please refer to Understanding Bayes' Theorem.) That is,

$$y = \arg\max_{C_k} P(C_k \mid x)$$

The m samples are known:

$$(x^{(1)}, y^{(1)}),\ (x^{(2)}, y^{(2)}),\ \ldots,\ (x^{(m)}, y^{(m)})$$

x is an n-dimensional feature variable, that is, $x = (x_1, x_2, \ldots, x_n)$.

y is the corresponding category, and there are K categories, that is, $y \in \{C_1, C_2, \ldots, C_K\}$.

For any given x, we need to calculate the probability $P(C_k \mid x)$ that x belongs to each category $C_k$. If $P(C_i \mid x)$ is the largest of these, then the sample x belongs to category $C_i$.

Now we need to calculate $P(C_k \mid x)$. Applying Bayes' theorem:

$$P(C_k \mid x) = \frac{P(C_k)\, P(x \mid C_k)}{P(x)} \tag{4}$$

Here $P(x \mid C_k)$ is a conditional joint probability: the probability that, within category $C_k$, the features take a specific set of values (namely, the feature values of the sample x to be predicted). This probability is not easy to calculate directly, so for convenience Naive Bayes makes its grand debut. What "naive" means here is that each feature of x is assumed to be conditionally independent given the category (see Wikipedia - Conditional independence). Therefore

$$P(x \mid C_k) = \prod_{j=1}^{n} P(x_j \mid C_k) \tag{5}$$

This transformation is essentially "the joint distribution of independent variables equals the product of the marginal distributions of the individual variables" (see Wikipedia - Joint distribution). Here the probabilities are conditional rather than marginal, but because the same condition $C_k$ appears on both sides of the transformation, from the perspective of the sample space it is still a joint distribution being converted into a product of individual distributions. (For the sample-space view, please refer to Understanding Bayes' Theorem.)

Substituting (5) back into (4) gives

$$P(C_k \mid x) = \frac{P(C_k) \prod_{j=1}^{n} P(x_j \mid C_k)}{P(x)} \tag{6}$$

For any given sample, the value of x is fixed, and P(x) does not depend on $C_k$, so P(x) can be regarded as a constant and ignored when comparing categories. This leaves

$$y = \arg\max_{C_k} P(C_k) \prod_{j=1}^{n} P(x_j \mid C_k) \tag{7}$$

This is the model function of Naive Bayes classification.
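As a minimal sketch of how formula (7) is used in code, assuming the prior and per-feature conditional probabilities have already been estimated (the dictionary layout and names below are my own, not from the original article):

```python
import math

def nb_predict(x, priors, cond_prob):
    """Pick the class that maximizes P(C_k) * prod_j P(x_j | C_k), i.e. formula (7).

    priors:    dict mapping class -> P(C_k)
    cond_prob: dict mapping (class, feature_index, feature_value) -> P(x_j | C_k)
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        # sum of logs instead of a product, to avoid floating-point underflow
        score = math.log(prior)
        for j, value in enumerate(x):
            score += math.log(cond_prob[(c, j, value)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Working in log space does not change the arg max, since the logarithm is monotonically increasing.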

There are two main terms in the above formula: the prior $P(C_k)$ and the conditional probabilities $P(x_j \mid C_k)$. Let's examine how to calculate each of them.

In the above formula, $P(C_k)$ is relatively easy to calculate: use frequency to estimate probability, i.e. simply count how many of the m samples belong to category $C_k$. Assuming that among the m samples there are $m_k$ whose category is $C_k$, then

$$P(C_k) = \frac{m_k}{m} \tag{8}$$
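For example, these frequency estimates can be obtained from a list of class labels in a couple of lines (a toy sketch; the label list is made up):

```python
from collections import Counter

labels = ["yes", "no", "yes", "yes", "no"]     # toy class labels
counts = Counter(labels)                        # m_k for each class C_k
m = len(labels)
priors = {c: counts[c] / m for c in counts}     # P(C_k) = m_k / m, formula (8)
print(priors)                                   # {'yes': 0.6, 'no': 0.4}
```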

Calculating $P(x_j \mid C_k)$ requires assuming in advance how the sample features are distributed. These assumptions about the feature distribution are called event models, and usually one of the following three is adopted: the multinomial distribution, the Bernoulli distribution, or the Gaussian distribution.

When the features take discrete values, the multinomial model applies and the conditional probability is again estimated by frequency. Let $m_{k,j,s}$ be the number of samples whose category is $C_k$ and whose j-th feature takes the value s; then

$$P(x_j = s \mid C_k) = \frac{m_{k,j,s}}{m_k} \tag{9}$$

Sometimes the number of samples in which a certain feature takes a specific value is 0 for some category, which makes the entire product in (7) become 0 and seriously distorts the probability estimate. Laplace smoothing is therefore usually applied to avoid this situation. That is,

$$P(x_j = s \mid C_k) = \frac{m_{k,j,s} + \lambda}{m_k + S_j \lambda} \tag{10}$$

where $S_j$ is the number of distinct values that feature j can take.

$\lambda$ usually takes the value 1.

Substituting (8) and (10) into the Bayes classifier (7) gives

$$y = \arg\max_{C_k} \frac{m_k}{m} \prod_{j=1}^{n} \frac{m_{k,j,s_j} + \lambda}{m_k + S_j \lambda} \tag{11}$$

where $s_j$ is the value taken by the j-th feature of the sample x to be classified.
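The following is a compact, from-scratch sketch of formula (11) for discrete features (the function and variable names are my own, not from the original article):

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(X, y, feature_values, lam=1.0):
    """Collect the counts needed by formula (11).

    X: list of samples, each a list of discrete feature values
    y: list of class labels
    feature_values: for each feature j, the collection of its possible values (S_j = len)
    """
    m = len(y)
    m_k = Counter(y)                               # m_k: samples per class
    m_kjs = defaultdict(int)                       # m_{k,j,s}: (class, feature, value) counts
    for xi, yi in zip(X, y):
        for j, s in enumerate(xi):
            m_kjs[(yi, j, s)] += 1
    return m, m_k, m_kjs, feature_values, lam

def predict_multinomial_nb(model, x):
    m, m_k, m_kjs, feature_values, lam = model
    scores = {}
    for k in m_k:
        score = math.log(m_k[k] / m)               # prior, formula (8)
        for j, s in enumerate(x):
            S_j = len(feature_values[j])
            # Laplace-smoothed conditional probability, formula (10)
            score += math.log((m_kjs[(k, j, s)] + lam) / (m_k[k] + S_j * lam))
        scores[k] = score
    return max(scores, key=scores.get)
```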

A rough schematic diagram in the original article (not reproduced here) illustrates how the conditional probabilities are estimated from the sample set when the features take discrete values: the samples are first grouped by category, and within each category the frequency of every feature value is counted.

Example: Deciding whether to play tennis based on weather conditions

This case is from Naive Bayes Classifier.

The table in that case records a student's decisions about whether to play tennis under different weather conditions; the counts used below are tallied from it.

Suppose today's weather conditions are Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong. Will the student go to play tennis?

The features here (Outlook, Temperature, Humidity and Wind) are all discrete variables, so the multinomial Naive Bayes method above is suitable. Formula (11) is repeated here for easy viewing:

$$y = \arg\max_{C_k} \frac{m_k}{m} \prod_{j=1}^{n} \frac{m_{k,j,s_j} + \lambda}{m_k + S_j \lambda}$$

We need to evaluate this estimated probability for the two cases k = yes and k = no.

Counting the samples for the various situations in the table gives:

The total number of samples: m = 14

Number of samples playing (k = yes): 9

Number of samples not playing (k = no): 5

Outlook takes 3 values (Sunny/Overcast/Rain), so $S_{Outlook} = 3$

Number of samples playing on sunny days (k = yes, j = Outlook, s = sunny): 2

Number of samples not playing on sunny days (k = no, j = Outlook, s = sunny): 3

Temperature takes 3 values (Hot/Mild/Cool), so $S_{Temperature} = 3$

Number of samples playing in cool weather (k = yes, j = Temperature, s = cool): 3

Number of samples not playing in cool weather (k = no, j = Temperature, s = cool): 1

Humidity takes 2 values (High/Normal), so $S_{Humidity} = 2$

Number of samples playing on high-humidity days (k = yes, j = Humidity, s = high): 3

Number of samples not playing on high-humidity days (k = no, j = Humidity, s = high): 4

Wind takes 2 values (Strong/Weak), so $S_{Wind} = 2$

Number of samples playing on strong-wind days (k = yes, j = Wind, s = strong): 3

Number of samples not playing on strong-wind days (k = no, j = Wind, s = strong): 3

Substituting the above counts into formula (11), the unnormalized probability of playing (k = yes) for this sample is

$$\frac{9}{14} \times \frac{2+1}{9+3} \times \frac{3+1}{9+3} \times \frac{3+1}{9+2} \times \frac{3+1}{9+2} \approx 0.007084$$

and the unnormalized probability of not playing (k = no) is

$$\frac{5}{14} \times \frac{3+1}{5+3} \times \frac{1+1}{5+3} \times \frac{4+1}{5+2} \times \frac{3+1}{5+2} \approx 0.01822$$

Since 0.01822 > 0.007084, the student will probably not go to play tennis. After normalization,

The probability of not playing = 0.01822 / (0.01822 + 0.007084) = 72%

(Note: the result here differs from the value in the original case because Laplace smoothing is applied here and not in the original. In this example no feature value actually has a sample count of 0, so the smoothing is not strictly necessary; it is done simply to follow the formula.)
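As a quick sanity check on the arithmetic, the two unnormalized probabilities can be reproduced directly (λ = 1, counts as listed above):

```python
# P(yes) * prod_j P(x_j | yes), Laplace-smoothed as in formula (11)
p_yes = (9/14) * ((2+1)/(9+3)) * ((3+1)/(9+3)) * ((3+1)/(9+2)) * ((3+1)/(9+2))
# P(no) * prod_j P(x_j | no)
p_no = (5/14) * ((3+1)/(5+3)) * ((1+1)/(5+3)) * ((4+1)/(5+2)) * ((3+1)/(5+2))
print(round(p_yes, 6))                   # 0.007084
print(round(p_no, 6))                    # 0.018222
print(round(p_no / (p_yes + p_no), 2))   # 0.72, the normalized probability of not playing
```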

When every feature of x takes only the values 0 and 1, the feature distribution is a Bernoulli distribution, and the conditional probability can be written as

$$P(x_j \mid C_k) = P(x_j = 1 \mid C_k)\, x_j + \big(1 - P(x_j = 1 \mid C_k)\big)(1 - x_j) \tag{12}$$

Also note that the Bernoulli distribution is actually a special case of the multinomial distribution, so we can use formula (12), or we can use the earlier multinomial formula (11).

The Bernoulli model can be used for text tasks such as spam classification. For example, a vector of 5,000 different words can be constructed as the input feature x: for a given piece of text, the positions corresponding to words that appear in the text are set to 1 and all other positions are set to 0, so that each feature (word) of x takes the value 1 or 0, consistent with a Bernoulli distribution.
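A small sketch of that encoding (the five-word vocabulary here is a made-up stand-in for the 5,000-word list):

```python
# Hypothetical miniature vocabulary standing in for the 5,000-word list
vocabulary = ["free", "meeting", "prize", "tomorrow", "winner"]

def to_bernoulli_features(text, vocabulary):
    """Encode a text as a 0/1 vector: 1 if the word appears in the text, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

print(to_bernoulli_features("claim your free prize now", vocabulary))  # [1, 0, 1, 0, 0]
```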

When a feature takes continuous values, it is usually assumed to follow a Gaussian distribution, that is,

$$P(x_j \mid C_k) = \frac{1}{\sqrt{2\pi \sigma_{k,j}^2}} \exp\!\left( -\frac{(x_j - \mu_{k,j})^2}{2 \sigma_{k,j}^2} \right)$$

where $\mu_{k,j}$ and $\sigma_{k,j}^2$ are the mean and variance of feature j among the samples of category $C_k$. For a worked case, please refer to Wikipedia - Naive Bayes classifier - Sex classification example.
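A short sketch of that Gaussian likelihood (the mean and variance values below are made up for illustration):

```python
import math

def gaussian_likelihood(x_j, mu, var):
    """Density of x_j under a Gaussian with the class-conditional mean and variance."""
    return math.exp(-(x_j - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# e.g. a height feature within one class, with hypothetical statistics
print(gaussian_likelihood(5.9, mu=5.8, var=0.04))
```

Note that this is a probability density rather than a probability, so it can exceed 1; that does not matter for the arg max comparison in formula (7).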

Another common technique for dealing with continuous features is to discretize the continuous values. Usually, when the number of training samples is small, or the exact distribution of the feature is known, estimating a probability distribution as above is the better choice.

The discretization method performs better when the number of samples is large, because a large sample can reveal the actual distribution of the data without any "naive" assumption about its shape. Since many tasks do provide large numbers of samples, discretization is then often a better choice than estimating a probability distribution; a small sketch of such a discretization follows.
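Here is one simple way to bucket a continuous value into discrete bins that can then be fed to the multinomial model of formula (11) (the bin edges are made up):

```python
import bisect

# Hypothetical bin edges for a continuous feature, e.g. a temperature in degrees Celsius
bin_edges = [10, 20, 30]          # four buckets: <10, [10, 20), [20, 30), >=30

def discretize(value, edges):
    """Map a continuous value to the index of its bucket."""
    return bisect.bisect_right(edges, value)

print([discretize(v, bin_edges) for v in [5.0, 14.2, 26.7, 35.1]])  # [0, 1, 2, 3]
```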

By the way, every time I see the word "naive" (a translation of the Chinese 朴素, which also means plain and unadorned), I seem to see Bayes wearing plain, patched clothes. And "naive" also means inexperienced, childish, ignorant, gullible. Judging from the derivation above, the Naive Bayes classifier does adopt some simplifying assumptions, such as assuming that the features of x are conditionally independent and that the feature data follows a multinomial, Bernoulli or Gaussian distribution. These assumptions may not be entirely true to reality, as if naive assumptions were made out of ignorance of the treacherous real world.

However, the word also suggests plainness and purity. In this sense, the Naive Bayes classifier can be regarded as simplicity at its finest: great wisdom that looks like foolishness.

The main advantages of Naive Bayes are:

1) The algorithm is simple and has stable classification efficiency.

2) It performs very well on small-scale data, can handle multi-class tasks, and is suitable for incremental training, which is especially useful when the data is too large to fit in memory and must be trained batch by batch (see the sketch after this list).

3) Not very sensitive to missing data.
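For instance, scikit-learn's Naive Bayes estimators expose a partial_fit method for this kind of batch-by-batch training (a rough sketch; the toy arrays and the two-batch split are made up):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
classes = np.array([0, 1])

# Pretend the data arrives in two batches that cannot be held in memory together
X1, y1 = np.array([[1, 0, 1], [0, 1, 0]]), np.array([1, 0])
X2, y2 = np.array([[1, 1, 0], [0, 0, 1]]), np.array([1, 0])

clf.partial_fit(X1, y1, classes=classes)   # the first call must list all classes
clf.partial_fit(X2, y2)                    # later batches just update the counts
print(clf.predict(np.array([[1, 0, 0]])))
```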

The main shortcomings of Naive Bayes are:

1) If the "naive" independence assumption does not hold in the actual data, the model's performance suffers.

2) The way the input features are represented (continuous, discrete or binary) affects the probability calculations and therefore the classification performance of the model.

References:

Summary of the Principles of the Naive Bayes Algorithm

Naive Bayes Classifier

Wikipedia - Naive Bayes classifier

Understanding Bayes' Theorem