Most articles that introduce neural networks pack in too many concepts and mix them with many mathematical formulas. They are a headache to read, and readers without a background rarely grasp the ideas the author wants to express. This article attempts a zero-formula explanation (though it still needs a little mathematical knowledge) of what a neural network is and what it can do. Other articles like to introduce “machine learning / deep learning / neural network / supervised learning” with examples such as “predicting the value of a house from historical transaction data” or “predicting whether it will rain in the next few days from historical data”. In those examples both the input X and the output Y are numerical values, and a mapping from numbers to numbers is simple and easy to understand. Many real-world applications are different, however. In the “image classification” example later in this article, the input is a picture rather than a simple numeric value.


Convolutional neural network output


Classification and regression

The machine learning, deep learning, and neural networks that we usually discuss mostly fall under “supervised learning”. Supervised learning is the most widely used kind of learning and also the field in which neural networks play the largest role, so everything in this article is based on it. Supervised learning means learning “experience” from labeled sample data and then applying that experience to data outside the sample to get a predicted result. Supervised learning mainly solves two types of problems, classification and regression.


Classification is easy to understand: given the input features, the model predicts the corresponding category, and the output is a discrete value, such as whether it will rain tomorrow (rain/no rain), whether a text message is spam (yes/no), or whether the animal in an image is a cat, a dog, or a monkey (cat/dog/monkey). The output of a classification model is generally an N-dimensional vector (N is the number of categories), and each value in the vector represents the probability of belonging to the corresponding category.


Figure 1 Classification task

As shown above, based on the sample data (yellow circles, blue squares, green diamonds), supervised learning can determine two boundary lines. Any data point outside the sample (the gray square in the figure) can then be predicted to belong to class B. The corresponding predicted output could be [0.04, 0.90, 0.06], meaning a probability of 0.04 for class A, 0.90 for class B, and 0.06 for class C. Class B has the highest probability, so we treat the point as class B.

Please note that the two dotted lines that divide the regions in the figure do not separate the samples perfectly: a yellow circle falls into the class B region and a blue square falls into the class A region. This happens because the experience gained from supervised learning must generalize to some degree, so certain errors are allowed during learning. Such learning is still effective.
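As a minimal sketch of how an output vector such as [0.04, 0.90, 0.06] becomes a decision, we can simply pick the category with the highest probability. The helper name `predict_class` is made up for this illustration and does not come from any library:

```python
# Class labels and the probability vector from the example above.
classes = ["A", "B", "C"]
probabilities = [0.04, 0.90, 0.06]

def predict_class(probs, labels):
    """Return the label whose probability is the highest (hypothetical helper)."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best_index]

predicted = predict_class(probabilities, classes)
print(predicted)  # prints B
```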


In contrast to classification, regression solves problems whose answer is a specific numerical value, such as tomorrow’s temperature (20, 21, 30, etc.) or tomorrow’s opening stock price (100, 102, 200, etc.). The output of a regression model is typically a concrete value (or a vector of such values).


Figure 2 regression task

As shown above, based on the sample data (the blue squares in the figure, which are points in the plane coordinate system), supervised learning can determine a straight line y=1.5*x+1. For any input Xn outside the sample, the corresponding output Y can be predicted as 1.5*Xn+1.

Please note that not every sample falls exactly on the learned line y=1.5*x+1; most of them are distributed around it. The reason is the same as above: supervised learning allows certain errors during the learning process, and the learning is still effective.

Learning process

Whether it is classification or regression, supervised learning learns from sample data and then applies the experience to data outside the sample. So what does this experience specifically mean? What is the nature of learning?

Taking the above regression problem as an example, the process of getting the line y=1.5*x+1 is as follows:

(1) Determine that the sample data roughly follows a straight line;

(2) Set the target function: y=w*x+b;

(3) Adjust the values of w and b so that the sample data points lie as close as possible to the straight line (the least squares method can be used);

(4) Obtain the optimal values of w and b.

The four steps above complete the learning, and this is also the simplest supervised learning process. Questions such as “how to determine that the samples are linearly distributed”, “how to decide that the target function is y=w*x+b”, and “how to adjust w and b so that the sample data points lie as close as possible to the line” will be introduced one by one later.
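The four steps above can be sketched in a few lines of plain Python. This is only an illustration: the sample points are invented (they are not the points in Figure 2), and `fit_line` is a hypothetical helper that applies the standard closed-form least squares solution for a straight line.

```python
# Hypothetical sample points scattered around the line y = 1.5*x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.4, 4.1, 5.4, 7.0]

def fit_line(xs, ys):
    """Least squares fit of y = w*x + b (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w is covariance(x, y) divided by variance(x);
    # b is chosen so that the line passes through the point of means.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

w, b = fit_line(xs, ys)
print(round(w, 2), round(b, 2))  # close to the w=1.5 and b=1 of the line above

def predict(x):
    """Apply the learned "experience" (w and b) to a new input."""
    return w * x + b

print(predict(10.0))
```

The learned w and b are the whole “model” here: once they are known, predicting for a new input is a single multiplication and addition.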

The “model training” that we often hear about in deep learning refers to this learning process. The model that is finally output mainly contains the values of w and b. In other words, training mainly determines the values of w and b, and these parameters can be called the “experience”.

The process of supervised learning is to find the mapping X->Y. The input X is called the “feature”, and it can be multi-dimensional. In fact, X is usually a multi-dimensional vector such as [1, 1.002, 0.2, …]. The output Y is called the “predicted value”, and it can also be multi-dimensional, such as [0.90, 0.08, 0.02]. For example, in the classification problem mentioned above, the output Y is a multi-dimensional vector in which each value represents the predicted probability of the corresponding category.


Fully connected neural network

A fully connected neural network is made up of many connected “neurons”. Each neuron receives multiple inputs and produces one output, similar to the X->Y mapping mentioned earlier. If the input is multi-dimensional, the mapping has the form [x1, x2, …, xn]->Y (for a single neuron, the output is a single value). Many neurons connected to each other form a neural network: the input of a neuron can be the output of other neurons, and the output of a neuron can serve as part of the input of other neurons. The figure below shows the structure of a neuron:


Figure 3 Neuron structure

As shown in the figure above, a neuron receives [x1, x2, …, xn] as input. Each input Xi is multiplied by a weight Wi, the products are added up, and the sum is processed by a function f to produce the output Y. After multiple neurons are connected to each other, we get a neural network:


Figure 4 fully connected neural network

As shown above, multiple neurons are connected to each other to form a fully connected neural network (the figure only shows the w parameters and omits b). The neural network in the figure contains 3 layers (Layer1, Layer2 and Layer3), and the output of every neuron in one layer is used as an input to every neuron in the next layer. Such a network is called a “fully connected neural network”, and the name says it all. The yellow parts of the figure are two complete neuron structures. The first neuron has three inputs (x1, x2, and x3), which are multiplied by the corresponding weights w31, w32, and w33. The second neuron has four inputs (the 4 outputs of Layer1). The network accepts a 3-dimensional vector as input (in the format [x1, x2, x3]), computes from left to right, and finally outputs a 2-dimensional vector (in the format [y1, y2]). With the experience it has learned, it can handle “classification” or “regression” problems that match the following mapping:

Figure 5 Neural network mapping

A fully connected neural network is the simplest neural network: every neuron in one layer is connected to every neuron in the adjacent layer. Because its structure is the simplest, it is usually used as the starting point for introducing more complex networks. Note that in most neural networks not every pair of neurons is connected, and some networks do not strictly follow the “data moves from left to right” order.

Matrix calculation in neural networks

The calculation performed by a single neuron is very simple and takes three steps:

(1) Multiply each input Xi by its corresponding weight Wi;

(2) Add up the products, plus an offset value b;

(3) Apply the function f to the result of (2) to obtain the output of the neuron.
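The three steps above map almost line by line onto code. In this sketch the input values, weights, and offset are invented, and the function f is passed in as a parameter; `max(0.0, z)` is used purely as a stand-in for f.

```python
def neuron(inputs, weights, bias, f):
    """Compute the output of a single neuron."""
    # (1) multiply each input Xi by its weight Wi
    products = [x * w for x, w in zip(inputs, weights)]
    # (2) add up the products, plus the offset value b
    total = sum(products) + bias
    # (3) apply the function f to the result
    return f(total)

def stand_in_f(z):
    """Placeholder for f, chosen only for this example."""
    return max(0.0, z)

y = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], 0.2, stand_in_f)
print(y)  # approximately 0.5
```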

But for a neural network that contains a large number of neurons, how can this be expressed more conveniently and concisely in code? The answer is to use matrices. If you have forgotten your linear algebra, don’t worry: we only need the rules of “matrix multiplication” and “matrix addition”.

Matrix addition

Matrix addition requires the two matrices to have the same dimensions. The corresponding elements are added together to produce a new matrix with the same dimensions:

Figure 6 Matrix addition
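A minimal sketch of matrix addition in plain Python, with matrices stored as lists of rows. The helper name `mat_add` and the example values are made up for illustration.

```python
def mat_add(a, b):
    """Element-wise sum of two matrices given as lists of rows."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same dimensions")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

a = [[1, 2], [3, 4]]
b = [[10, 20], [30, 40]]
total = mat_add(a, b)
print(total)  # prints [[11, 22], [33, 44]]
```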

Matrix multiplication

Matrix multiplication requires that the number of columns of the first matrix equals the number of rows of the second matrix. An M*N matrix multiplied by an N*T matrix gives a new M*T matrix:


Figure 7 Matrix Multiplication

Each element of the first row of the first matrix A is multiplied by the corresponding element of the first column of the second matrix B, and the products are added up to give the element in the first row and first column of the result matrix C. Each element of the first row of A is multiplied by the corresponding element of the second column of B, and the products are added up to give the element in the first row and second column of C, and so on. In the figure above, a 3*3 matrix multiplied by a 3*2 matrix gives a new 3*2 matrix. If the matrix A in Figure 7 is replaced with the parameters W (W11, W12, W22…) of the neural network, and the matrix B is replaced with the input features X (X1, X2, X3…), then the calculation for each layer of the fully connected neural network (which can contain multiple neurons) can be represented with matrices:


Figure 8 Matrix usage in neural networks

As shown above, matrices let us operate in batches. The calculation for all the neurons of the first layer (Layer1) in Fig. 4 can be completed in a single pass as in Fig. 8. In the figure, the W matrix is first multiplied by the X matrix, and the offset matrix B is added to obtain an intermediate result (also a matrix). The intermediate result is then passed to the function f, which outputs a new matrix Y. This Y is the output of Layer1 and serves as the input of the next layer, Layer2, and so on. Note that the function f takes a matrix as its argument and acts on each element of the matrix, returning a new matrix of the same dimensions; it is discussed further below. As you can see, we previously needed to calculate f(w*x+b) 4 times, and now we only need to do it once.
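The whole-layer calculation f(W*X+B) can be sketched in plain Python as follows. The numbers are invented; the layer has 4 neurons with 3 inputs each, so W is 4*3, X is 3*1, and B is 4*1. The helper names are hypothetical, and `max(0, v)` is again only a stand-in for f. A real project would normally use a numerical library rather than nested lists.

```python
def mat_mul(a, b):
    """Multiply an M*N matrix by an N*T matrix (lists of rows)."""
    if len(a[0]) != len(b):
        raise ValueError("columns of a must equal rows of b")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def mat_add(a, b):
    """Element-wise sum of two matrices with the same dimensions."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def apply_elementwise(f, m):
    """Apply f to every element and return a matrix of the same dimensions."""
    return [[f(v) for v in row] for row in m]

def layer_forward(W, X, B, f):
    """Compute f(W*X + B) for all neurons of one layer at once."""
    return apply_elementwise(f, mat_add(mat_mul(W, X), B))

W = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]  # one row of weights per neuron
X = [[2], [-3], [4]]                               # input features as a column
B = [[1], [1], [1], [1]]                           # one offset per neuron
Y = layer_forward(W, X, B, lambda v: max(0, v))
print(Y)  # prints [[3], [0], [5], [4]]
```

The output Y has one row per neuron, so it can be passed directly as the X of the next layer.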

From the introduction so far, we can see that training a neural network means finding the most suitable W matrices and b matrices (there are several of each), so that the output of the neural network is as close as possible to the real value (the label). This process is also called model training or parameter tuning (of course, model training involves much more than this, such as choosing hyperparameters).

Nonlinear transformation

Even if the input is a high-dimensional vector, after a simple W*X+b step the output is still a linear function of the input. But most real-world problems are not linear, so we need to add a nonlinear transformation between the input and the output. This is the function f (also called the activation function) mentioned many times above. For various reasons (related to how neural networks are trained and how weight updates are back-propagated, which we will not explain for now), there are not many commonly used activation functions. Here are two of them:

Sigmoid function

The Sigmoid function maps any real number to a value in (0, 1). Its graph is as follows:


Figure 9 Sigmoid function image

The Sigmoid function in the figure above maps any input to a value in (0, 1). It is also called the logistic function and is often used for two-class prediction problems. Suppose there are two categories A and B. For any input feature X, the closer the Sigmoid return value is to 1, the more likely the prediction is category A; the closer it is to 0, the more likely it is category B.
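For readers who prefer code to a graph, here is a small sketch of the Sigmoid function and of a two-class decision built on it. The 0.5 threshold and the `classify` helper are assumptions made for this example, not something fixed by the function itself.

```python
import math

def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-5.0, 0.0, 5.0):
    print(x, sigmoid(x))  # small for negative x, 0.5 at zero, near 1 for positive x

def classify(x, threshold=0.5):
    """Predict category A when the Sigmoid value reaches the threshold, else B."""
    return "A" if sigmoid(x) >= threshold else "B"

print(classify(2.0), classify(-2.0))  # prints A B
```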

ReLu function

The ReLu function is very simple: it returns max(x, 0). Its graph is as follows:


Figure 10 ReLu function image

In the figure above, the ReLu function converts any negative input to 0 and outputs all other inputs unchanged. ReLu is the most widely used activation function in deep learning; the specific reasons are not explained here. It is worth mentioning that some things in deep learning and neural networks do not have a very solid theoretical basis and rest entirely on accumulated experience. Examples include why the seemingly simple ReLu function works best on most occasions, and how the neurons should be organized for the network to train most accurately.
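The ReLu function described above is a one-liner in code. The input values below are arbitrary and only show that negative numbers become 0 while the rest pass through unchanged.

```python
def relu(x):
    """Return max(x, 0): negative inputs become 0, others are unchanged."""
    return max(x, 0.0)

outputs = [relu(v) for v in [-2.0, -0.5, 0.0, 1.5, 3.0]]
print(outputs)  # prints [0.0, 0.0, 0.0, 1.5, 3.0]
```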

Neural network solves classification problems

It is not difficult to conclude from the previous introduction that the neural network can solve the “classification” problem of complex mapping relationships. The features are input to the neural network and the output is obtained through a series of calculations. The following figure shows an example of how the neural network solves the classification problem:



Figure 11 Predicting the liquor brand

The figure above shows a network of pipes with the same structure as a fully connected neural network. From top to bottom there are multiple valves that adjust the liquid flow (1 in the figure). After training in advance with several liquid samples (liquor of different brands, different alcohol content, and different sub-models), we have adjusted the valves to their best settings. We then pour a glass of liquor into the network from the top, and all the liquid eventually flows through the pipes into the three glasses (3 in the figure). If we pour a glass of Wuliangye into the pipes, in theory all of the liquid should flow into the first glass (on the left side of the figure). In practice, because a neural network generalizes, its output for any input (including the training samples) is usually not 100% consistent with the correct result. In the end the first glass simply holds the most liquid (for example, 85%), and the other two glasses also hold a small amount (on the right side of the figure).

This raises a question. The final output of the neural network is a numerical value (or a multi-dimensional vector of numerical values). How does that result express the concept of “classification”? As mentioned at the beginning of this article, a classification result is ultimately expressed as probabilities: the input belongs to whichever category has the highest probability. The following figure shows how to convert the values into probabilities:


Figure 12 Numerical to probability conversion

As shown in the figure above, for a two-class problem we usually use the Sigmoid function mentioned above to convert the output into a probability value in (0, 1), and then classify according to that probability. For an N-class problem (N can also be 2), we need another function, Softmax, which takes a vector as its argument and returns a new vector of the same dimension. Each value of the new vector lies in (0, 1), and all the values sum to 1. Note that this function acts on the entire vector, so every value in the vector influences the others. Interested readers can look up the formula online.
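For readers who would rather see Softmax as code than as a formula, here is a minimal sketch. The input scores are invented. Subtracting the maximum before exponentiating is a common numerical precaution and does not change the result.

```python
import math

def softmax(values):
    """Turn a vector of real numbers into probabilities that sum to 1."""
    m = max(values)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    # Every output depends on all inputs through the shared total.
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs)       # roughly [0.66, 0.24, 0.10]
print(sum(probs))  # 1 up to floating point rounding
```

The largest score always receives the largest probability, so picking the most probable category gives the same answer as picking the largest raw score.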


Image classification task

Image classification is also called image recognition. Given a picture, the task is to output the type of object it contains, as in the familiar “Microsoft Flower” or “cat or dog recognition”. This is the most typical “classification” problem in computer vision. Image classification is the basis for other tasks such as “object detection” and “object segmentation”.

Image definition

A digital image is essentially a multi-dimensional matrix. A common RGB image can be regarded as three two-dimensional matrices, in which each value is the intensity on the corresponding color channel (0~255). A grayscale image can be regarded as a single two-dimensional matrix, in which each value is the gray level of the pixel (0~255).


Figure 13 color digital image matrix

As shown in the figure above, an RGB full-color digital picture of size 180*200 corresponds to 3 matrices, each of size 180*200, with values ranging from 0 to 255. A single-channel grayscale image of the same size corresponds to one 180*200 matrix:


Figure 14 grayscale matrix

Image classification using fully connected neural networks

We have already seen how to use a fully connected neural network to solve “classification” problems. Image classification is also a classification problem, so it can be solved with a neural network as well. The only difference is that the examples above used numerical feature inputs [x1, x2, x3, …]. So what is the input to the neural network for an image? The answer is the image matrix, which contains numerical values. Expanding an M*N two-dimensional matrix gives a vector with M*N elements, which is fed into the neural network, and the network computes and outputs the classification probabilities. Let’s take “handwritten digit recognition” as an example of classifying images with a fully connected neural network. Handwritten digit recognition is the HelloWorld-level task of deep learning, and most tutorials use it to explain image recognition. The following picture shows some handwritten digit pictures:


Figure 15 handwritten digital image

The picture above shows 4 handwritten digit pictures: “5”, “0”, “4”, and “1”. Each picture is 28*28, that is, 28 pixels in both height and width, and the pictures are grayscale. In other words, each picture corresponds to a 28*28 matrix, which is expanded into a vector with 28*28 = 784 elements and fed directly into the fully connected neural network. There are 10 categories, from 0 to 9, so the output of the neural network is a 10-dimensional vector.
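Expanding the image matrix into a vector is a simple row-by-row copy. The sketch below uses a made-up blank 28*28 image (all pixels 0) instead of a real handwritten digit, and `flatten` is a hypothetical helper name.

```python
def flatten(matrix):
    """Expand a 2-D matrix (list of rows) into a 1-D vector, row by row."""
    return [value for row in matrix for value in row]

# Hypothetical blank 28*28 grayscale image; real pixels would be 0~255.
image = [[0 for _ in range(28)] for _ in range(28)]
vector = flatten(image)
print(len(vector))  # prints 784
```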


Figure 16 The whole process of handwritten digital picture recognition

As shown in the figure above, the original input picture is 28*28 and is expanded into a feature X of size [784*1] that is fed into the neural network. The neural network consists of two layers. The W matrix of the first layer has size [1000*784], so W*X gives an output of size [1000*1]. This output is the input X of the second layer, whose W matrix has size [10*1000], so W*X gives an output of size [10*1]. This output (after Softmax) gives the probabilities of the digits 0~9.

Note that the neural network defined above contains only two layers (blue and green in the figure; the yellow part is not counted). The W matrix of the first layer has size [1000*784], where 1000 is set arbitrarily; it could be 500 or even 2000, and it equals the number of neurons in that layer. The W matrix of the second layer has size [10*1000], where 1000 matches the first layer and 10 is the number of categories. There are 10 categories here, so the value is 10; with 100 categories it would be 100. Both the number of layers and the number of neurons in each layer can be adjusted, and this is what we often call “modifying the network structure”.
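The sizes above can be checked with a few lines of code. This sketch only tracks matrix shapes, not actual values; `output_shape` is a hypothetical helper, and the offsets b are left out of the weight count to match the description.

```python
def output_shape(w_shape, x_shape):
    """Shape of W*X, where each shape is a (rows, columns) pair."""
    if w_shape[1] != x_shape[0]:
        raise ValueError("columns of W must equal rows of X")
    return (w_shape[0], x_shape[1])

hidden_units = 1000  # arbitrary, as noted above; 500 or 2000 would also work
num_classes = 10     # one output per digit 0~9

layer_shapes = [(hidden_units, 784), (num_classes, hidden_units)]

shape = (784, 1)  # the flattened 28*28 picture
for w_shape in layer_shapes:
    shape = output_shape(w_shape, shape)
    print(w_shape, "->", shape)

num_weights = sum(rows * cols for rows, cols in layer_shapes)
print(num_weights)  # prints 794000
```

Changing `hidden_units` or appending another entry to `layer_shapes` is the shape-level equivalent of “modifying the network structure”.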

The accuracy of handwritten digit recognition with the above method may not be high (I have not tested it), and even if it is high, this is not a very good approach. It may work for handwritten digits, but is it still effective for other images, such as telling cats from dogs? The answer is no, and the reason is simple: if the entire image is fed directly into the neural network, the features it contains are too complicated, or contain too much noise. This may work on simple pictures such as handwritten digits, but it may stop working once the pictures become more complicated. So what do we need to do for a general image classification task before we pass the data to the neural network for classification?

Image feature

Image features are a very, very important concept in computer vision. To a certain extent, they can be used as specific identifiers for images. Each image contains features that are invisible to the human eye. For an introduction to image features, you can refer to one of my previous blogs:

Before using the neural network to classify pictures, we need to extract the image features first, and then input the extracted features into the fully connected neural network for classification. Therefore, the correct neural network structure to solve the image classification problem should be like this:


Figure 17 Neural network with feature extraction

As shown in the figure above, a module is added in front of the fully connected neural network. This module is also part of the neural network and is also composed of many neurons, but it may not be fully connected. It extracts image features automatically, and the features are then fed into the fully connected network for classification. We usually refer to the fully connected network here as a “classifier” (sounds familiar?). In this way, the input size of the fully connected network is no longer [784*1] (the yellow part in the figure) but is determined by the output of the preceding module.

In Fig. 17, the network formed by combining the feature extraction part with the fully connected neural network (the classifier) has a proper name: “convolutional neural network”. It is called “convolutional” because the feature extraction uses the convolution operation, as described below.


Convolutional neural network

A convolutional neural network contains a feature extraction structure, which is mainly responsible for feature extraction, abstraction, and dimensionality reduction of the original input data (such as images; note that it can also be other kinds of data). It mainly consists of the following parts:

Convolution layer

The convolutional layer is mainly responsible for feature extraction. It slides a convolution kernel (a small matrix) over the original input matrix from left to right and top to bottom, and generates one (or more) new matrices, which we call feature maps. The specific operation is as follows:


Figure 18 Convolution operation process

As shown in the figure above, the green part is the original input matrix and the yellow matrix is the convolution kernel (a 3*3 matrix). The convolution operation generates a new matrix (pink), which is called a feature map. There can be multiple convolution kernels, each with different values. When the same input matrix is processed by different convolution kernels, different feature maps are obtained. A convolutional layer with multiple convolution kernels therefore generates multiple feature maps, each representing a certain feature.
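The sliding-window operation can be sketched in plain Python. Several things here are assumptions made for the example: the 5*5 input and 3*3 kernel values are invented, the kernel moves one position at a time, the input is not padded, and (as is usual in deep learning frameworks) the kernel is applied as-is without being flipped.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (step of 1, no padding)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    feature_map = []
    for r in range(out_rows):      # top to bottom
        row = []
        for c in range(out_cols):  # left to right
            # Multiply the kernel with the window under it and add up.
            total = sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(k) for j in range(k))
            row.append(total)
        feature_map.append(row)
    return feature_map

image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]
feature_map = convolve2d(image, kernel)
print(feature_map)  # prints [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

Running the same image through a second kernel with different values would produce a second, different feature map.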

If the original input matrix is an image, the feature maps generated by the convolution kernels are still matrices, but they can no longer be treated as ordinary images. The following figure shows two feature maps generated when an image is processed by two different convolution kernels. We use a tool to display the two feature maps as images:


Figure 19 Visualization of feature map

As shown in the figure above, after an original image goes through one convolution, the resulting feature map can still be displayed as an image and recognized by the human eye. After multiple convolutions, however, the final feature map can no longer be recognized by the human eye. The figure also shows that when different convolution kernels process the same input picture, the generated feature maps differ.

To repeat: although the feature maps obtained through convolution can still be displayed as pictures, they are not “pictures” in the usual sense. They may no longer mean anything to the human eye, but they are very meaningful to a computer. There can be multiple convolutional layers: one convolutional layer can be followed by another, with the output of the previous layer serving as the input of the next. Some parameters in a convolutional layer, such as the specific values in the convolution kernel matrix, need to be trained. This is the same as the W and b parameters mentioned above, which are fitted through training.

Nonlinear transformation (activation function)

As with the fully connected neural network above, the feature maps generated by the convolutional layer still need a nonlinear transformation. As before, a common activation function is used. For example, applying the ReLu function to a feature map looks like this:


Figure 20 Nonlinear transformation of the feature map

As shown above, after the feature map is processed by the activation function, we obtain another matrix, which we call the Rectified feature map. From the introduction of ReLu we know that the activation function (max(0, x)) turns all negative numbers in the original feature map matrix into zero.

Pooling layer

Convolution and activation alone are not enough, because the (Rectified) feature maps still contain too much feature data. To give the model some generalization ability, we need to reduce the dimensions of the feature maps. This is pooling:


Figure 21 Maximum pooling operation

As shown above, the pooling layer operates on the original feature maps. It again selects sub-matrices in “left to right, top to bottom” order (the circled part in the figure is 2*2, similar to the earlier convolution kernel), takes the largest value within each sub-matrix as the value in the new matrix, and processes the sub-matrices in turn to form a new matrix that is smaller than the original. Besides taking the maximum, one can also take the average or the sum, but practice has shown that taking the maximum (max pooling) works best.
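Max pooling can be sketched in a few lines. The example assumes 2*2 windows that do not overlap, and the 4*4 feature map values are invented; `max_pool` is a hypothetical helper name.

```python
def max_pool(feature_map, size=2):
    """Max pooling with non-overlapping size*size windows."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):  # top to bottom
        pooled.append([
            # Keep only the largest value inside each window.
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, cols - size + 1, size)  # left to right
        ])
    return pooled

feature_map = [[1, 1, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]]
pooled = max_pool(feature_map)
print(pooled)  # prints [[6, 8], [3, 4]]
```

The 4*4 input shrinks to 2*2. Replacing `max` with a sum or an average gives the other pooling variants mentioned above.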

After pooling, the feature maps can still be displayed as pictures. As before, the human eye can hardly make sense of them, but they are meaningful to the computer.


Figure 22 Pooling operation

As shown in the figure above, a feature map is pooled in two ways, taking the maximum and taking the sum, which produces two different new matrices. Displaying the new matrices as pictures shows that the difference is still large (although the human eye can no longer make out the content).

Normally, a convolutional layer does not have to be followed immediately by a pooling layer. The data can pass through multiple convolutional layers before a pooling layer is added; that is, convolution and pooling do not have to be combined in a 1:1 ratio. The feature extraction part of a convolutional neural network is a combination of convolutional layers, activation processing, pooling layers, and so on, and the number of layers can be modified as needed (commonly referred to as “adjusting the network structure”). The output of the last pooling layer is the image feature we have extracted. For example, if the last pooling layer outputs T feature maps, each of size M*N, then expanding them gives a vector with T*M*N elements, and this vector is the image feature. At this point it should be clear that if we pass this feature vector to a “classifier”, the classifier gives us the final classification result. The classifier can be the fully connected neural network described above.

Fully connected layer (classifier)

In fact, if you have understood the content so far, this part is not difficult. Once the image feature has been obtained, it can be fed directly into the fully connected neural network to obtain the final classification result. The following figure shows a handwritten digit picture passing through a convolutional neural network. It first passes through two convolutional layers and two pooling layers (alternating; other operations such as activation are omitted in the figure). The output of the last pooling layer is then expanded and used as the input of the fully connected network. After two fully connected layers, a 10*1 output is finally obtained.


Figure 23 Computation process of convolutional neural network

The figures in the convolutional neural network section are from:
An Intuitive Explanation of Convolutional Neural Networks

About model training

Deep learning frameworks do the specific work of model training for us, such as determining the w and b mentioned above, that is, finding the most suitable w and b to minimize the error between the predicted value and the true value. Here is an example of using tensorflow to minimize the function loss=4*(w-1)^2, that is, to find the w that makes loss smallest:


Figure 24 Image of loss=4*(w-1)^2 function

As shown in the figure above, the mathematics we have learned tells us that the loss is minimal when w equals 1, which can be derived by taking the derivative and setting it to 0. So what does it look like when tensorflow determines it for us? The following code uses tensorflow to optimize the function and find the best value of w:

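import tensorflow as tf  # written against the TensorFlow 1.x API
tf.enable_eager_execution()  # lets minimize() take a callable loss
w = tf.get_variable("w", initializer=3.0)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
for i in range(5):
    optimizer.minimize(lambda: 4 * (w - 1) * (w - 1))  # one gradient descent step
    print(w.numpy())  # current value of w

This uses the gradient descent optimization algorithm to find the most suitable w. The final output is: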

1.4 1.0799999 1.016 1.0032 1.00064

We can see that after 5 iterations we get w=1.00064, which is very close to the optimum of 1. This process is in effect a simplified version of how a deep learning framework trains a model.
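To see roughly what the optimizer does at each step, the same search can be sketched in plain Python without tensorflow. This is an illustration, not tensorflow’s actual implementation: the derivative 8*(w-1) of loss=4*(w-1)^2 is written out by hand, and each step moves w against the derivative, scaled by the learning rate.

```python
def gradient(w):
    """Derivative of loss = 4*(w-1)^2 with respect to w."""
    return 8.0 * (w - 1.0)

w = 3.0              # same starting value as above
learning_rate = 0.1  # same learning rate as above
history = []
for _ in range(5):
    # One gradient descent step: move against the slope.
    w = w - learning_rate * gradient(w)
    history.append(w)

print(history)  # close to 1.4, 1.08, 1.016, 1.0032, 1.00064
```

Each step shrinks the distance between w and 1 by a factor of five, which is why the values approach 1 so quickly.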



(1) This article does not cover the principles of model training, that is, the specific process of finding the W and b matrices. That process is complex and involves many mathematical formulas. The reader only needs to know that the essence of model training is to use a large amount of labeled sample data to find relatively suitable W and b matrices, which can then be applied to data outside the sample.

(2) Many practices in deep learning lack a solid theoretical basis and rely mostly on experience, for example how many layers are appropriate or which activation function works better. Sometimes different data sets (or problems) have different answers.

(3) Apart from the similar name, neural networks in deep learning have nothing to do with how the neural networks of the human brain work. They were originally thought to be related, hence the similar name. Scientists later found that the two seem to have nothing in common, because the human brain is far too complicated.

