Update images and references

Signed-off-by: Riccardo Finotello <riccardo.finotello@gmail.com>
2020-10-20 19:29:13 +02:00
parent 06e27a3702
commit 1eb7136ead
16 changed files with 414 additions and 1301 deletions


@@ -11,7 +11,7 @@ A linear model learns a function
\end{equation}
where $w$ and $b$ are the \emph{weights} and \emph{intercept} of the fit.
One of the key assumptions behind a linear fit is the independence of the residual error between the predicted point and the value of the model, which can therefore be assumed to be sampled from a normal distribution peaked at the average value~\cite{Lista:2017:StatisticalMethodsData, Caffo::DataScienceSpecialization}.
One of the key assumptions behind a linear fit is the independence of the residual error between the predicted point and the value of the model, which can therefore be assumed to be sampled from a normal distribution peaked at the average value~\cite{Skiena:2017:DataScienceDesign, Caffo::DataScienceSpecialization}.
The parameters of the fit are then chosen to maximise the \emph{likelihood} function or, equivalently, to minimise its negative logarithm (the $\chi^2$ function).
A related task is to minimise the mean squared error without assuming a statistical distribution of the residual error: \ml for regression usually implements this as the loss function of the estimators.
In this sense, loss functions for regression are more general than a likelihood approach, but the two are nonetheless related.
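As a concrete illustration (a sketch added here, not part of the original text), ordinary least squares in \texttt{scikit-learn} minimises exactly this mean squared error, which for normally distributed residuals coincides with maximising the likelihood; the data below are invented toy values:
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: y = 2 x + 1 plus Gaussian noise on the residuals
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=100)

# ordinary least squares: minimises the mean squared error, i.e.
# maximises the Gaussian likelihood of the residuals
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)   # fitted weights w and intercept b
\end{verbatim}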
@@ -147,9 +147,9 @@ They are called \textit{support vectors} (accessible using the attribute \texttt
As a consequence any sum involving $\alpha^{(i)}$ or $\beta^{(i)}$ can be restricted to the subset of support vectors.
Using the kernel notation, the predictions will therefore be
\begin{equation}
y_{pred}^{(i)}
y_{\text{pred}}^{(i)}
=
y_{pred}\qty(x^{(i)})
y_{\text{pred}}\qty(x^{(i)})
=
\finitesum{n}{1}{F'} w_n \phi_n\qty(x^{(i)}) + b
=
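As an aside (an added sketch, not in the original), \texttt{scikit-learn} exposes both the support vectors and the dual coefficients entering this sum, so one can check that only a subset of the training points contributes to the prediction; the data and hyperparameters below are arbitrary:
\begin{verbatim}
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# epsilon-insensitive support vector regression with an RBF kernel
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)

# only the support vectors enter the prediction sum
print(svr.support_vectors_.shape)  # subset of the training points
print(svr.dual_coef_.shape)        # corresponding dual coefficients
\end{verbatim}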
@@ -215,7 +215,7 @@ In regression tasks it is usually given by the $l_1$ and $l_2$ norms of the devi
\begin{equation}
H^{[l]}_n\qty(x;\, t_{j,\, n})
=
\frac{1}{\abs{\cM^{[l]}_n( t_{j,\, n} )}} \sum\limits_{i \in A^{[l]}_n} \abs{y^{(i)} - \tilde{y}^{[l]}_{pred,\, n}( x )},
\frac{1}{\abs{\cM^{[l]}_n( t_{j,\, n} )}} \sum\limits_{i \in A^{[l]}_n} \abs{y^{(i)} - \tilde{y}^{[l]}_{\text{pred},\, n}( x )},
\quad
\qty( x^{(i)},\, y^{(i)} ) \in \cM_n\qty( t_{j,\, n} ),
\end{equation}
@@ -224,20 +224,20 @@ In regression tasks it is usually given by the $l_1$ and $l_2$ norms of the devi
\begin{equation}
H^{[l]}_n\qty(x;\, t_{j,\, n})
=
\frac{1}{\abs{\cM^{[l]}_n( t_{j,\, n} )}} \sum\limits_{i \in A^{[l]}_n} \qty( y^{(i)} - \bar{y}^{[l]}_{pred,\, n}( x ) )^2,
\frac{1}{\abs{\cM^{[l]}_n( t_{j,\, n} )}} \sum\limits_{i \in A^{[l]}_n} \qty( y^{(i)} - \bar{y}^{[l]}_{\text{pred},\, n}( x ) )^2,
\quad
\qty( x^{(i)}, y^{(i)} ) \in \cM_n( t_{j,\, n} ),
\end{equation}
\end{itemize}
where $\abs{\cM^{[l]}_n\qty( t_{j,\, n} )}$ is the cardinality of the set $\cM^{[l]}_n\qty( t_{j,\, n} )$ for $l = 1, 2$ and
\begin{equation}
\tilde{y}^{[l]}_{pred,\, n}( x )
\tilde{y}^{[l]}_{\text{pred},\, n}( x )
=
\underset{i \in A^{[l]}_n}{\mathrm{median}}~ y_{pred}\qty(x^{(i)}),
\underset{i \in A^{[l]}_n}{\mathrm{median}}~ y_{\text{pred}}\qty(x^{(i)}),
\qquad
\bar{y}^{[l]}_{pred,\, n}( x )
\bar{y}^{[l]}_{\text{pred},\, n}( x )
=
\frac{1}{\abs{A^{[l]}_n}} \sum\limits_{i \in A^{[l]}_n} y_{pred}\qty(x^{(i)}),
\frac{1}{\abs{A^{[l]}_n}} \sum\limits_{i \in A^{[l]}_n} y_{\text{pred}}\qty(x^{(i)}),
\end{equation}
where $A_n^{[l]} \subset A_n$ is the subset of labels in the left ($l = 1$) or right ($l = 2$) split of the node $n$.
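Purely as an illustration (an added sketch, not in the original), both impurity measures are selected in \texttt{scikit-learn} through the \texttt{criterion} argument of \texttt{tree.DecisionTreeRegressor}; note that the option names depend on the library version (\texttt{"squared\_error"}/\texttt{"absolute\_error"} in recent releases, \texttt{"mse"}/\texttt{"mae"} in older ones):
\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# l2 criterion: each split minimises the squared deviation from the
# mean prediction of the node
tree_l2 = DecisionTreeRegressor(criterion="squared_error",
                                max_depth=4).fit(X, y)

# l1 criterion: each split minimises the absolute deviation from the
# median prediction of the node
tree_l1 = DecisionTreeRegressor(criterion="absolute_error",
                                max_depth=4).fit(X, y)
\end{verbatim}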
@@ -280,7 +280,7 @@ Also random forests of trees provide a variable ranking system by averaging the
As a reference, \textit{random forests} of decision trees (as in \texttt{ensemble.RandomForestRegressor} in \texttt{scikit-learn}) are ensemble learning algorithms based on fully grown (deep) decision trees.
They were created to overcome the issues related to overfitting and variability of the input data and are based on random sampling of the training data~\cite{Ho:1995:RandomDecisionForests}.
The idea is to take $K$ random partitions of the training data, train a different decision tree on each of them and combine the results: for a classification task this amounts to averaging the \textit{a posteriori} (or conditional) probability of predicting the class $c$ given an input $x$ (i.e.\ the Bayesian probability $P\qty(c \mid x)$) over the $K$ trees, while for regression it amounts to averaging the predictions of the trees $y_{pred,\, \hatn}^{(i)\, \lbrace k \rbrace}$, where $k = 1, 2, \dots, K$ and $\hatn$ is the final node (i.e.\ the node containing the final predictions).
The idea is to take $K$ random partitions of the training data, train a different decision tree on each of them and combine the results: for a classification task this amounts to averaging the \textit{a posteriori} (or conditional) probability of predicting the class $c$ given an input $x$ (i.e.\ the Bayesian probability $P\qty(c \mid x)$) over the $K$ trees, while for regression it amounts to averaging the predictions of the trees $y_{\text{pred},\, \hatn}^{(i)\, \lbrace k \rbrace}$, where $k = 1, 2, \dots, K$ and $\hatn$ is the final node (i.e.\ the node containing the final predictions).
This defines what has been called a \textit{random forest} of trees, which usually helps improve the predictions by reducing the variance due to trees adapting too much to the training set.
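A minimal sketch (added here for illustration, on invented toy data) of such a forest in \texttt{scikit-learn}, where the prediction is the average over the $K$ fully grown trees:
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# K = 100 deep trees, each trained on a bootstrap sample of the data;
# the forest prediction averages the individual tree predictions
forest = RandomForestRegressor(n_estimators=100).fit(X, y)
print(forest.predict([[0.5]]))
\end{verbatim}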
\textit{Boosting} methods are another implementation of ensemble learning algorithms in which multiple \textit{weak learners}, in this case shallow decision trees, are trained over the training dataset~\cite{Friedman:2001:GreedyFunctionApproximation, Friedman:2002:StochasticGradientBoosting}.
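Again purely as an illustration (an added sketch with arbitrary hyperparameters and toy data), a boosted ensemble of shallow trees in \texttt{scikit-learn}:
\begin{verbatim}
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# sequence of shallow trees (weak learners); with the squared-error
# loss each new tree is fitted to the residuals left by the previous ones
boost = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                  learning_rate=0.1).fit(X, y)
print(boost.predict([[0.5]]))
\end{verbatim}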
@@ -343,11 +343,13 @@ In \fc networks the input of layer $l$ is a feature vector $a^{(i)\, \qty{l}} \i
}
In other words, each entry of the vectors $a^{(i)\, \qty{l}}_j$ (for $j = 1, 2, \dots, n_l$) is mapped through a function $\psi$ to all the components of the following layer $a^{\qty{l+1}} \in \R^{n_{l+1}}$:
\begin{equation}
\begin{split}
\psi\colon & \R^{n_l} \quad \longrightarrow \quad \R^{n_{l+1}}
\\
& a^{(i)\, \qty{l}} \quad \longmapsto \quad a^{(i)\, \qty{l+1}} = \psi_j( a^{(i)\, \qty{l}} ),
\end{split}
\centering
\begin{tabular}{@{}rlll@{}}
$\psi\colon$ & $\R^{n_l}$ & $\longrightarrow$ & $\R^{n_{l+1}}$
\\
& $a^{(i)\, \qty{l}}$ & $\longmapsto$ & $a^{(i)\, \qty{l+1}} = \psi_j\qty( a^{(i)\, \qty{l}} )$,
\\
\end{tabular}
\end{equation}
such that
\begin{equation}
@@ -367,7 +369,7 @@ A common choice is the \textit{rectified linear unit} ($\mathrm{ReLU}$) function
\end{equation}
which has been proven to be better at training deep learning architectures~\cite{Glorot:2011:DeepSparseRectifier}, or its modified version $\mathrm{LeakyReLU}( z ) = \max( \alpha z, z )$, which introduces a small slope $\alpha > 0$ to improve the computational performance near the non-differentiable point at the origin.
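A \texttt{numpy} sketch (added for illustration, with hypothetical layer sizes) of a single fully connected layer and of the two activation functions just introduced:
\begin{verbatim}
import numpy as np

def relu(z):
    # ReLU(z) = max(0, z), applied elementwise
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # LeakyReLU(z) = max(alpha * z, z) with a small slope alpha > 0
    return np.maximum(alpha * z, z)

def dense_layer(a_l, W, b, activation=relu):
    # one fully connected layer: affine map followed by the activation,
    # sending R^{n_l} to R^{n_{l+1}}
    return activation(W @ a_l + b)

# hypothetical layer sizes n_l = 4, n_{l+1} = 3
rng = np.random.default_rng(0)
a_l = rng.normal(size=4)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
a_next = dense_layer(a_l, W, b)                    # ReLU activation
a_next_leaky = dense_layer(a_l, W, b, leaky_relu)  # LeakyReLU activation
\end{verbatim}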
\cnn architectures were born in the context of computer vision and object localisation~\cite{Tompson:2015:EfficientObjectLocalization}.
\cnn architectures rose to fame in the context of computer vision and object localisation~\cite{Tompson:2015:EfficientObjectLocalization}.
As one may suspect by looking at~\Cref{fig:nn:lenet} for instance, the fundamental difference from \fc networks is that they use a convolution operation $K^{\qty{l}} * a^{(i)\, \qty{l}}$ instead of a linear map to transform the output of the layers, before applying the activation function.\footnotemark{}
\footnotetext{%
In general the input of each layer can be a generic tensor with an arbitrary number of axes.