Update images and references
Signed-off-by: Riccardo Finotello <riccardo.finotello@gmail.com>
@@ -11,7 +11,7 @@ A linear model learns a function
 \end{equation}
 where $w$ and $b$ are the \emph{weights} and \emph{intercept} of the fit.
 
-One of the key assumptions behind a linear fit is the independence of the residual error between the predicted point and the value of the model, which can therefore be assumed to be sampled from a normal distribution peaked at the average value~\cite{Lista:2017:StatisticalMethodsData, Caffo::DataScienceSpecialization}.
+One of the key assumptions behind a linear fit is the independence of the residual error between the predicted point and the value of the model, which can therefore be assumed to be sampled from a normal distribution peaked at the average value~\cite{Skiena:2017:DataScienceDesign, Caffo::DataScienceSpecialization}.
 The parameters of the fit are then chosen to maximise their \emph{likelihood} function, or equivalently to minimise its negative logarithm (the $\chi^2$ function).
 A related task is to minimise the mean squared error without assuming a statistical distribution of the residual error: \ml algorithms for regression usually implement this as the loss function of the estimators.
 In this sense, loss functions for regression are more general than a likelihood approach, but the two are nonetheless related.
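
To connect the two pictures concretely, here is a minimal Python sketch (the data and variable names are illustrative, not taken from the text): the least-squares solution of a toy linear fit, which minimises the mean squared error, also minimises the Gaussian negative log-likelihood, i.e. the chi^2 function up to additive constants.

    # Minimal sketch (illustrative): the least-squares fit that minimises the
    # mean squared error coincides with the maximum-likelihood estimate under
    # the assumption of Gaussian residuals.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)    # synthetic data

    # Closed-form solution of the normal equations (minimises the MSE).
    X = np.column_stack([x, np.ones_like(x)])                 # design matrix [x, 1]
    (w, b), *_ = np.linalg.lstsq(X, y, rcond=None)

    # Up to constants, the Gaussian negative log-likelihood is the chi^2
    # function, so the same (w, b) minimises it as well.
    def chi2(params):
        w_, b_ = params
        return np.sum((y - (w_ * x + b_)) ** 2)

    print(w, b, chi2((w, b)))
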
@@ -147,9 +147,9 @@ They are called \textit{support vectors} (accessible using the attribute \texttt
 As a consequence any sum involving $\alpha^{(i)}$ or $\beta^{(i)}$ can be restricted to the subset of support vectors.
 Using the kernel notation, the predictions will therefore be
 \begin{equation}
-  y_{pred}^{(i)}
+  y_{\text{pred}}^{(i)}
   =
-  y_{pred}\qty(x^{(i)})
+  y_{\text{pred}}\qty(x^{(i)})
   =
   \finitesum{n}{1}{F'} w_n \phi_n\qty(x^{(i)}) + b
   =
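
As an illustration of the kernel expansion above, the following is a minimal Python sketch (the dataset and hyperparameters are made up; the attribute names are those of scikit-learn's SVR estimator) that rebuilds the prediction from the support vectors, their dual coefficients and the intercept; the manually reconstructed value matches the library prediction.

    # Minimal sketch (illustrative): reconstructing an SVR prediction as a sum
    # over support vectors only, mirroring the kernel expansion above.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(200, 1))
    y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=200)

    gamma = 0.5
    model = SVR(kernel="rbf", C=1.0, epsilon=0.05, gamma=gamma).fit(X, y)

    def manual_predict(x_new):
        # dual_coef_ plays the role of the (signed) dual coefficients and
        # intercept_ that of b; the sum runs over support_vectors_ only.
        k = np.exp(-gamma * np.sum((model.support_vectors_ - x_new) ** 2, axis=1))
        return float(model.dual_coef_[0] @ k + model.intercept_[0])

    x_test = np.array([0.3])
    print(manual_predict(x_test), model.predict(x_test.reshape(1, -1))[0])
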
@@ -215,7 +215,7 @@ In regression tasks it is usually given by the $l_1$ and $l_2$ norms of the devi
   \begin{equation}
     H^{[l]}_n\qty(x;\, t_{j,\, n})
     =
-    \frac{1}{\abs{\cM^{[l]}_n( t_{j,\, n} )}} \sum\limits_{i \in A^{[l]}_n} \abs{y^{(i)} - \tilde{y}^{[l]}_{pred,\, n}( x )},
+    \frac{1}{\abs{\cM^{[l]}_n( t_{j,\, n} )}} \sum\limits_{i \in A^{[l]}_n} \abs{y^{(i)} - \tilde{y}^{[l]}_{\text{pred},\, n}( x )},
     \quad
     \qty( x^{(i)},\, y^{(i)} ) \in \cM_n\qty( t_{j,\, n} ),
   \end{equation}
@@ -224,20 +224,20 @@ In regression tasks it is usually given by the $l_1$ and $l_2$ norms of the devi
   \begin{equation}
   H^{[l]}_n\qty(x;\, t_{j,\, n})
   =
-  \frac{1}{\abs{\cM^{[l]}_n( t_{j,\, n} )}} \sum\limits_{i \in A^{[l]}_n} \qty( y^{(i)} - \bar{y}^{[l]}_{pred,\, n}( x ) )^2,
+  \frac{1}{\abs{\cM^{[l]}_n( t_{j,\, n} )}} \sum\limits_{i \in A^{[l]}_n} \qty( y^{(i)} - \bar{y}^{[l]}_{\text{pred},\, n}( x ) )^2,
   \quad
   \qty( x^{(i)}, y^{(i)} ) \in \cM_n( t_{j,\, n} ),
   \end{equation}
 \end{itemize}
 where $\abs{\cM^{[l]}_n\qty( t_{j,\, n} )}$ is the cardinality of the set $\cM^{[l]}_n\qty( t_{j,\, n} )$ for $l = 1, 2$ and
 \begin{equation}
-  \tilde{y}^{[l]}_{pred,\, n}( x )
+  \tilde{y}^{[l]}_{\text{pred},\, n}( x )
   =
-  \underset{i \in A^{[l]}_n}{\mathrm{median}}~ y_{pred}\qty(x^{(i)}),
+  \underset{i \in A^{[l]}_n}{\mathrm{median}}~ y_{\text{pred}}\qty(x^{(i)}),
   \qquad
-  \bar{y}^{[l]}_{pred,\, n}( x )
+  \bar{y}^{[l]}_{\text{pred},\, n}( x )
   =
-  \frac{1}{\abs{A^{[l]}_n}} \sum\limits_{i \in A^{[l]}_n} y_{pred}\qty(x^{(i)}),
+  \frac{1}{\abs{A^{[l]}_n}} \sum\limits_{i \in A^{[l]}_n} y_{\text{pred}}\qty(x^{(i)}),
 \end{equation}
 where $A_n^{[l]} \subset A_n$ are the subsets of labels in the left and right splits ($l = 1$ and $l = 2$, respectively) of the node $n$.
 
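
The two impurity measures above can be evaluated directly on a toy dataset. The following minimal Python sketch (function and variable names are illustrative) computes the median-based ($l_1$) and the mean-based ($l_2$) criterion for the two subsets induced by a candidate threshold; a greedy tree builder would scan candidate thresholds and keep the one minimising the size-weighted sum of these impurities.

    # Minimal sketch (illustrative): l1 (MAE around the median) and l2 (MSE
    # around the mean) impurities of the left/right subsets of a threshold split.
    import numpy as np

    def split_impurities(x, y, t):
        """Return the (l1, l2) impurities of the subsets x <= t and x > t."""
        out = []
        for mask in (x <= t, x > t):
            y_sub = y[mask]
            h_l1 = np.mean(np.abs(y_sub - np.median(y_sub)))   # median-based criterion
            h_l2 = np.mean((y_sub - np.mean(y_sub)) ** 2)      # mean-based criterion
            out.append((h_l1, h_l2))
        return out

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, 100)
    y = np.where(x < 0.4, 1.0, 3.0) + 0.1 * rng.normal(size=100)

    print(split_impurities(x, y, t=0.4))
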
@@ -280,7 +280,7 @@ Also random forests of trees provide a variable ranking system by averaging the
 
 As a reference, \textit{random forests} of decision trees (as in \texttt{ensemble.RandomForestRegressor} in \texttt{scikit-learn}) are ensemble learning algorithms based on fully grown (deep) decision trees.
 They were created to overcome the issues related to overfitting and variability of the input data and are based on random sampling of the training data~\cite{Ho:1995:RandomDecisionForests}.
-The idea is to take $K$ random partitions of the training data and train a different decision tree for each of them and combine the results: for a classification task this would resort to averaging the \textit{a posteriori} (or conditional) probability of predicting the class $c$ given an input $x$ (i.e.\ the Bayesan probability $P\qty(c \mid x)$) over the $K$ trees, while for regression this amount to averaging the predictions of the trees $y_{pred,\, \hatn}^{(i)\, \lbrace k \rbrace}$ where $k = 1, 2, \dots, K$ and $\hatn$ is the final node (i.e. the node containing the final predictions).
+The idea is to take $K$ random partitions of the training data, train a different decision tree on each of them and combine the results: for a classification task this amounts to averaging the \textit{a posteriori} (or conditional) probability of predicting the class $c$ given an input $x$ (i.e.\ the Bayesian probability $P\qty(c \mid x)$) over the $K$ trees, while for regression it amounts to averaging the predictions of the trees $y_{\text{pred},\, \hatn}^{(i)\, \lbrace k \rbrace}$, where $k = 1, 2, \dots, K$ and $\hatn$ is the final node (i.e.\ the node containing the final predictions).
 This defines what is called a \textit{random forest} of trees, which usually improves the predictions by reducing the variance due to individual trees adapting too closely to their training sets.
 
 \textit{Boosting} methods are another implementation of ensemble learning algorithms in which several \textit{weak learners}, in this case shallow decision trees, are trained over the training dataset~\cite{Friedman:2001:GreedyFunctionApproximation, Friedman:2002:StochasticGradientBoosting}.
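
A minimal Python sketch of the averaging idea using scikit-learn's RandomForestRegressor (the dataset and hyperparameters are illustrative; in this implementation each tree is grown on a bootstrap resample of the training set): the forest prediction is the mean of the predictions of its individual trees, accessible through the estimators_ attribute.

    # Minimal sketch (illustrative): a random forest regression prediction is
    # the average of the predictions of its K trees.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(300, 2))
    y = X[:, 0] ** 2 - X[:, 1] + 0.1 * rng.normal(size=300)

    forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

    x_test = np.array([[0.2, -0.5]])
    per_tree = np.array([tree.predict(x_test)[0] for tree in forest.estimators_])
    print(per_tree.mean(), forest.predict(x_test)[0])   # the two values coincide
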
@@ -343,11 +343,13 @@ In \fc networks the input of layer $l$ is a feature vector $a^{(i)\, \qty{l}} \i
 }
 In other words, each entry of the vectors $a^{(i)\, \qty{l}}_j$ (for $j = 1, 2, \dots, n_l$) is mapped through a function $\psi$ to all the components of the following layer $a^{\qty{l+1}} \in \R^{n_{l+1}}$:
 \begin{equation}
-  \begin{split}
-  \psi\colon & \R^{n_l} \quad \longrightarrow \quad \R^{n_{l+1}}
-  \\
-  & a^{(i)\, \qty{l}} \quad \longmapsto \quad a^{(i)\, \qty{l+1}} = \psi_j( a^{(i)\, \qty{l}} ),
-  \end{split}
+  \centering
+  \begin{tabular}{@{}rlll@{}}
+    $\psi\colon$ & $\R^{n_l}$ & $\longrightarrow$ & $\R^{n_{l+1}}$
+    \\
+                 & $a^{\qty(i)\, \qty{l}}$ & $\longmapsto$ & $a^{\qty(i)\, \qty{l+1}} = \psi_j\qty( a^{\qty(i)\, \qty{l}} )$,
+    \\
+  \end{tabular}
 \end{equation}
 such that
 \begin{equation}
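
A minimal Python sketch of the map psi for a single fully connected layer (dimensions, weights and the choice of activation are illustrative): an affine transformation followed by an element-wise non-linearity maps a vector of dimension n_l to one of dimension n_{l+1}.

    # Minimal sketch (illustrative): one fully connected layer as the map
    # psi: R^{n_l} -> R^{n_{l+1}}, i.e. an affine transformation followed by
    # an element-wise activation function.
    import numpy as np

    n_l, n_lp1 = 4, 3
    rng = np.random.default_rng(0)
    W = rng.normal(size=(n_lp1, n_l))    # weights of the layer
    b = rng.normal(size=n_lp1)           # biases of the layer

    def psi(a_l, activation=np.tanh):
        """Map the activations of layer l to those of layer l + 1."""
        return activation(W @ a_l + b)

    a_l = rng.normal(size=n_l)           # input feature vector in R^{n_l}
    a_lp1 = psi(a_l)                     # output in R^{n_{l+1}}
    print(a_lp1.shape)                   # (3,)
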
@@ -367,7 +369,7 @@ A common choice is the \textit{rectified linear unit} ($\mathrm{ReLU}$) function
 \end{equation}
 which has been proven to be better at training deep learning architectures~\cite{Glorot:2011:DeepSparseRectifier}, or its modified version $\mathrm{LeakyReLU}( z ) = \max( \alpha z, z )$, which introduces a slope $\alpha > 0$ to improve the behaviour of the gradient near the non-differentiable point at the origin.
 
-\cnn architectures were born in the context of computer vision and object localisation~\cite{Tompson:2015:EfficientObjectLocalization}.
+\cnn architectures rose to fame in the context of computer vision and object localisation~\cite{Tompson:2015:EfficientObjectLocalization}.
 As one may suspect by looking at~\Cref{fig:nn:lenet}, for instance, the fundamental difference from \fc networks is that they use a convolution operation $K^{\qty{l}} * a^{(i)\, \qty{l}}$ instead of a linear map to transform the output of the layers, before applying the activation function.\footnotemark{}
 \footnotetext{%
   In general the input of each layer can be a generic tensor with an arbitrary number of axes.
 
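
A minimal Python sketch of the two ingredients just introduced (shapes and values are illustrative): the ReLU and LeakyReLU activations, and a naive "valid" two-dimensional convolution of a feature map with a kernel, which is the operation a CNN layer applies in place of the dense linear map of a fully connected layer.

    # Minimal sketch (illustrative): ReLU / LeakyReLU activations and a naive
    # 2D convolution K * a, the operation a CNN layer uses in place of the
    # dense linear map of a fully connected layer.
    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def leaky_relu(z, alpha=0.01):
        return np.maximum(alpha * z, z)

    def conv2d(a, K):
        """Valid cross-correlation of a 2D input a with a 2D kernel K."""
        h, w = K.shape
        H, W = a.shape
        out = np.zeros((H - h + 1, W - w + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(a[i:i + h, j:j + w] * K)
        return out

    rng = np.random.default_rng(0)
    a = rng.normal(size=(8, 8))          # input feature map of a layer
    K = rng.normal(size=(3, 3))          # learnable convolution kernel
    print(relu(conv2d(a, K)).shape)      # (6, 6): convolution, then activation
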