Stop adding papers
Signed-off-by: Riccardo Finotello <riccardo.finotello@gmail.com>
@@ -64,6 +64,7 @@
|
||||
\providecommand{\mse}{\textsc{mse}\xspace}
|
||||
\providecommand{\mae}{\textsc{mae}\xspace}
|
||||
\providecommand{\ann}{\textsc{ann}\xspace}
|
||||
\providecommand{\nn}{\textsc{nn}\xspace}
|
||||
\providecommand{\cnn}{\textsc{cnn}\xspace}
|
||||
\providecommand{\ap}{\ensuremath{\alpha'}\xspace}
|
||||
\providecommand{\sgn}{\ensuremath{\mathrm{sign}}}
|
||||
@@ -1396,6 +1397,10 @@
|
||||
\providecommand{\uffY}{\ensuremath{\underline{\mathfrak{Y}}}\xspace}
|
||||
\providecommand{\uffZ}{\ensuremath{\underline{\mathfrak{Z}}}\xspace}
|
||||
|
||||
%---- functions
|
||||
|
||||
\providecommand{\fid}{\mathrm{id}}
|
||||
|
||||
%---- groups
|
||||
|
||||
\providecommand{\OO}[1]{\ensuremath{\mathrm{O}(#1)}\xspace}
|
||||
|
||||
411
sec/app/ml.tex
Normal file
@@ -0,0 +1,411 @@
|
||||
In this appendix we give a brief review and definition of the main \ml algorithms used in the text.
|
||||
We highlight the specific characteristics of interest in the analysis.
|
||||
|
||||
\subsection{Linear regression}
|
||||
\label{sec:app:linreg}
|
||||
|
||||
Consider a set of $F$ features $\qty{ x_n }$ where $n = 1, \ldots, F$.
|
||||
A linear model learns a function
|
||||
\begin{equation}
|
||||
f(x) = \finitesum{n}{1}{F} w_n x_n + b,
|
||||
\end{equation}
|
||||
where $w_n$ are the \emph{weights} and $b$ is the \emph{intercept} of the fit.
|
||||
|
||||
One of the key assumptions behind a linear fit is the independence of the residual errors between the observed values and the predictions of the model, which can therefore be assumed to be sampled from a normal distribution peaked at their mean~\cite{Lista:2017:StatisticalMethodsData, Caffo::DataScienceSpecialization}.
The parameters of the fit are then chosen to maximise their \emph{likelihood} function, or equivalently to minimise its logarithm with a reversed sign (the $\chi^2$ function).
A related task is to minimise the mean squared error without assuming a statistical distribution of the residual errors: \ml for regression usually implements this as the loss function of the estimators.
In this sense loss functions for regression are more general than a likelihood approach, but the two are nonetheless related.
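For instance, assuming independent Gaussian residuals with constant variance $\sigma^2$, the negative logarithm of the likelihood reduces, up to an additive constant, to the sum of squared errors:
\begin{equation}
- \log \prod\limits_{i = 1}^{N} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\qty( - \frac{\qty( y^{(i)} - f\qty( x^{(i)} ) )^2}{2 \sigma^2} )
=
\frac{1}{2 \sigma^2} \finitesum{i}{1}{N} \qty( y^{(i)} - f\qty( x^{(i)} ) )^2
+
\frac{N}{2} \log\qty( 2 \pi \sigma^2 ),
\end{equation}
so that maximising the likelihood and minimising the squared error select the same parameters.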
|
||||
For plain linear regression the associated loss is:
|
||||
\begin{equation}
|
||||
\cL(w,\, b)
|
||||
=
|
||||
\frac{1}{2N}\,
|
||||
\finitesum{i}{1}{N}
\qty( y^{(i)} - \finitesum{n}{1}{F} w_n x_n^{(i)} - b )^2,
|
||||
\end{equation}
|
||||
where $N$ is the number of samples and $x_n^{(i)}$ is the $n$-th feature of the $i$-th sample.
|
||||
The values of the parameters will therefore be:
|
||||
\begin{equation}
|
||||
\qty(w, b) = \underset{w,\, b}{\mathrm{argmin}}~ \cL(w, b).
|
||||
\end{equation}
|
||||
This usually requires looping over all samples and all features, thus the \emph{least squares} method has a time complexity of $\order{ F \times N }$: while a growing number of samples might be an issue, the number of engineered features and matrix components usually does not change and does not represent a huge effort when rescaling the algorithm.
|
||||
|
||||
There are however different regularisation terms which we might add to constrain the parameters of the fit and avoid adapting too closely to the training set.
|
||||
In particular we may be interested in adding a $\ell_1$ regularisation:
|
||||
\begin{equation}
|
||||
\cL_1(w) = \finitesum{n}{1}{F} \abs{w_n},
|
||||
\end{equation}
|
||||
or the $\ell_2$ version (the squared $\ell_2$ norm):
|
||||
\begin{equation}
|
||||
\cL_2(w) = \finitesum{n}{1}{F} w_n^2.
|
||||
\end{equation}
|
||||
Notice that in general we do not regularise the intercept.
|
||||
These terms can be added to the plain loss function to prevent large parameters from dominating the predictions and to retain better generalisation properties:
|
||||
\begin{itemize}
|
||||
\item add both $\ell_1$ and $\ell_2$ regularisation (this is called \emph{elastic net}):
|
||||
\begin{equation}
|
||||
\cL_{\textsc{en}}(w, b;~\alpha_{\textsc{en}}, L) = \cL(w,b) + \alpha_{\textsc{en}} \cdot L \cdot \cL_1(w) + \frac{\alpha_{\textsc{en}}}{2} \cdot (1 - L) \cdot \cL_2(w),
|
||||
\end{equation}
|
||||
|
||||
\item keep only $\ell_1$ regularisation (i.e.\ the \emph{lasso} regression):
|
||||
\begin{equation}
|
||||
\cL_{\textsc{lss}}(w, b;~\alpha_{\textsc{lss}}) = \cL(w,b) + \alpha_{\textsc{lss}} \cdot \cL_1(w),
|
||||
\end{equation}
|
||||
|
||||
\item keep only $\ell_2$ regularisation (\emph{ridge} regression):
|
||||
\begin{equation}
|
||||
\cL_{\textsc{rdg}}(w, b;~\alpha_{\textsc{rdg}}) = \cL(w,b) + \alpha_{\textsc{rdg}} \cdot \cL_2(w).
|
||||
\label{eq:ridge:loss}
|
||||
\end{equation}
|
||||
\end{itemize}
|
||||
The role of the hyperparameter $L$ is to balance the contribution of the additional terms.
|
||||
For larger values of the hyperparameter $\alpha$, the weights $w$ assume smaller values and the fit adapts less to the particular training set.
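As a minimal sketch of how these variants can be used in practice (hypothetical toy data, assuming the \texttt{scikit-learn} linear model classes, which are not referenced explicitly in the text; there \texttt{alpha} plays the role of $\alpha$ and \texttt{l1\_ratio} the role of $L$):
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# toy data: N = 1000 samples with F = 10 features (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

models = {
    "plain":   LinearRegression(),                    # no regularisation
    "ridge":   Ridge(alpha=1.0),                      # l2 penalty
    "lasso":   Lasso(alpha=1e-2),                     # l1 penalty
    "elastic": ElasticNet(alpha=1e-2, l1_ratio=0.5),  # l1 and l2 penalties
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.coef_[:3], model.intercept_)
\end{verbatim}
Larger values of \texttt{alpha} shrink the fitted weights, in agreement with the discussion above.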
|
||||
|
||||
|
||||
\subsection{Support Vector Machines for Regression}
|
||||
\label{sec:app:svr}
|
||||
|
||||
This family of supervised \ml algorithms was created with classification tasks in mind~\cite{Cortes:1995:SupportvectorNetworks} but has proven to be effective also for regression problems~\cite{Drucker:1997:SupportVectorRegression}.
|
||||
Differently from linear regression, instead of minimising the squared residual of each sample, the algorithm assigns a penalty to predictions of samples $x^{(i)} \in \R^F$ (for $i = 1, 2, \dots, N$) which are further away than a certain hyperparameter $\varepsilon$ from their true value $y$, allowing however a \textit{soft margin} of tolerance represented by the slack variables $\zeta$ above and $\xi$ below.
|
||||
This is achieved by minimising the following function with respect to $w,\, b,\, \zeta$ and $\xi$:\footnotemark{}
|
||||
\footnotetext{%
|
||||
In a classification task the training objective would be the minimisation of the opposite of the log-likelihood function of predicting a positive class, that is $y^{(i)}\, \qty( w_n \phi_n\qty(x^{(i)}) + b )$, which should equal unity for good predictions (we can consider $\varepsilon = 1$), instead of the regression objective $y^{(i)} - w_n \phi_n\qty(x^{(i)}) - b$.
The differences between \svm for classification and for regression follow from this.
|
||||
}
|
||||
\begin{equation}
|
||||
\begin{split}
|
||||
\cL\qty(w, b, \zeta, \xi)
|
||||
& =
|
||||
\frac{1}{2} \finitesum{n}{1}{F'} w_n^2
|
||||
+
|
||||
C \finitesum{i}{1}{N} \qty( \zeta^{(i)} + \xi^{(i)} )
|
||||
\\
|
||||
& +
|
||||
\finitesum{i}{1}{N} \alpha^{(i)}
\qty( y^{(i)} - \finitesum{n}{1}{F'} w_n \phi_n\qty(x^{(i)}) - b - \varepsilon - \zeta^{(i)} )
|
||||
\\
|
||||
& +
|
||||
\finitesum{i}{1}{N} \beta^{(i)}
\qty( \finitesum{n}{1}{F'} w_n \phi_n\qty(x^{(i)}) + b - y^{(i)} - \varepsilon - \xi^{(i)} )
|
||||
\\
|
||||
& -
|
||||
\finitesum{i}{1}{N} \qty( \rho^{(i)} \zeta^{(i)} + \sigma^{(i)} \xi^{(i)} )
|
||||
\end{split}
|
||||
\label{eq:svr:loss}
|
||||
\end{equation}
|
||||
where $\alpha^{(i)},\, \beta^{(i)},\, \rho^{(i)},\, \sigma^{(i)} \ge 0$ such that the previous expression encodes the constraints
|
||||
\begin{equation}
|
||||
\begin{cases}
|
||||
y^{(i)} - \finitesum{n}{1}{F'} w_n \phi_n\qty(x^{(i)}) - b & \le \varepsilon + \zeta^{(i)},
|
||||
\qquad
|
||||
\varepsilon \ge 0,
|
||||
\quad
|
||||
\zeta^{(i)} \ge 0,
|
||||
\quad
|
||||
i = 1, 2, \dots, N
|
||||
\\
|
||||
\finitesum{n}{1}{F'} w_n \phi_n\qty(x^{(i)}) + b - y^{(i)} & \le \varepsilon + \xi^{(i)},
|
||||
\qquad
|
||||
\varepsilon \ge 0,
|
||||
\quad
|
||||
\xi^{(i)} \ge 0,
|
||||
\quad
|
||||
i = 1, 2, \dots, N
|
||||
\end{cases}
|
||||
\label{eq:svr:constraints}
|
||||
\end{equation}
|
||||
and where $\phi\qty(x^{(i)}) \in \R^{F'}$ is a function mapping the feature vector $x^{(i)} \in \R^F$ into a higher dimensional space ($F' > F$), whose interpretation will become clear shortly.
|
||||
The minimisation problem leads to
|
||||
\begin{equation}
|
||||
\begin{cases}
|
||||
w_n - \finitesum{i}{1}{N} \qty( \alpha^{(i)} - \beta^{(i)} ) \phi_n\qty(x^{(i)}) = 0
|
||||
\\
|
||||
\finitesum{i}{1}{N} \qty( \alpha^{(i)} - \beta^{(i)} ) = 0
|
||||
\\
|
||||
\finitesum{i}{1}{N} \qty( \alpha^{(i)} + \rho^{(i)} )
|
||||
=
|
||||
\finitesum{i}{1}{N} \qty( \beta^{(i)} + \sigma^{(i)} )
|
||||
=
|
||||
C
|
||||
\end{cases}
|
||||
\end{equation}
|
||||
such that $0 \le \alpha^{(i)},\, \beta^{(i)} \le C,~\forall\, i = 1, 2, \dots, N$. This can be reformulated as a \textit{dual} problem in finding the extrema of $\alpha^{(i)}$ and $\beta^{(i)}$ in
|
||||
\begin{equation}
|
||||
W(\alpha, \beta)
|
||||
=
|
||||
\frac{1}{2} \sum\limits_{i, j = 1}^N \theta^{(i)} \theta^{(j)} \rK( x^{(i)}, x^{(j)} )
|
||||
-
|
||||
\varepsilon \finitesum{i}{1}{N} \qty( \alpha^{(i)} + \beta^{(i)} )
|
||||
+
|
||||
\finitesum{i}{1}{N} y^{(i)} \theta^{(i)},
|
||||
\label{eq:svr:loss-v2}
|
||||
\end{equation}
|
||||
where $\theta = \alpha - \beta$ are called \textit{dual coefficients} (accessible through the attribute \texttt{dual\_coef\_} of \texttt{svm.SVR} in \texttt{scikit-learn}) and $\rK\qty( x^{(i)}, x^{(j)} ) = \finitesum{n}{1}{F'} \phi_n\qty(x^{(i)}) \phi_n\qty( x^{(j)} )$ is the \textit{kernel} function.
|
||||
Notice that the Lagrange multipliers $\alpha^{(i)}$ and $\beta^{(i)}$ are non-vanishing only for particular sets of vectors $l^{(i)}$ which lie outside the $\varepsilon$-dependent bounds of \eqref{eq:svr:constraints} and operate as landmarks for the others.
They are called \textit{support vectors} (accessible using the attribute \texttt{support\_vectors\_} in \texttt{svm.SVR}), hence the name of the algorithm. There can be at most $N$ of them in the limit $\varepsilon \to 0^+$.
|
||||
As a consequence any sum involving $\alpha^{(i)}$ or $\beta^{(i)}$ can be restricted to the subset of support vectors.
|
||||
Using the kernel notation, the predictions will therefore be
|
||||
\begin{equation}
|
||||
y_{pred}^{(i)}
|
||||
=
|
||||
y_{pred}\qty(x^{(i)})
|
||||
=
|
||||
\finitesum{n}{1}{F'} w_n \phi_n\qty(x^{(i)}) + b
|
||||
=
|
||||
\sum\limits_{a \in A} \theta^{(a)} \rK\qty( x^{(i)}, l^{(a)} ) + b,
|
||||
\end{equation}
|
||||
where $A \subset \lbrace 1, 2, \dots, N \rbrace$ is the subset of labels of the support vectors.
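As a rough numerical check of this last relation (a sketch with toy data, assuming the \texttt{scikit-learn} interface and the attributes \texttt{dual\_coef\_} and \texttt{support\_vectors\_} mentioned above), the prediction of a fitted Gaussian-kernel \texttt{svm.SVR} can be rebuilt by hand from its dual coefficients and support vectors:
\begin{verbatim}
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # toy samples (hypothetical)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

gamma = 0.5
model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=gamma).fit(X, y)

def rbf(x, landmarks):
    # K(x, l) = exp(-gamma * ||x - l||^2)
    return np.exp(-gamma * np.sum((landmarks - x) ** 2, axis=1))

# y_pred(x) = sum_a theta^(a) K(x, l^(a)) + b
manual = model.dual_coef_[0] @ rbf(X[0], model.support_vectors_) \
         + model.intercept_[0]
print(np.isclose(manual, model.predict(X[:1])[0]))   # expected: True
\end{verbatim}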
|
||||
|
||||
In~\Cref{sec:res:svr} we consider two different implementations of the \svm algorithm:
|
||||
\begin{itemize}
|
||||
\item the \textit{linear kernel}, namely the case when $K \equiv \fid$ and the loss, in the \texttt{scikit-learn} implementation of \texttt{svm.LinearSVR}, can be simplified to
|
||||
\begin{equation}
|
||||
\cL(w, b)
|
||||
=
|
||||
\rC \finitesum{i}{1}{N} \max\qty( 0, \abs{ y^{(i)} - \finitesum{n}{1}{F'} w_n \phi_n\qty( x^{(i)} ) - b } - \varepsilon ) + \frac{1}{2} \finitesum{n}{1}{F'} w_n^2,
|
||||
\end{equation}
|
||||
without resorting to the dual formulation of the problem.
|
||||
|
||||
\item the Gaussian kernel (called \texttt{rbf}, from \textit{radial basis function}) in which
|
||||
\begin{equation}
|
||||
\rK\qty(x^{(i)},\, l^{(a)})
|
||||
=
|
||||
\exp\qty( - \gamma \finitesum{n}{1}{F} \qty( x^{(i)}_n - l^{(a)}_n )^2 ).
|
||||
\end{equation}
|
||||
\end{itemize}
|
||||
From the definition of the loss function in~\eqref{eq:svr:loss} and the kernels, we can appreciate the role of the main hyperparameters of the algorithm.
While the interpretation of $\varepsilon$ is straightforward as the margin allowed without penalty for the prediction, $\gamma$ is related to the inverse width of the Gaussian function used to map the features into the higher dimensional space.
Furthermore, $C$ plays a similar role to the $\ell_2$ term in~\eqref{eq:ridge:loss} by controlling the size of the penalty for samples outside the $\varepsilon$-dependent bound; its relation to the linear regularisation is however $\alpha_{\textsc{rdg}} = C^{-1}$, thus $C > 0$ by definition.
|
||||
|
||||
Given the nature of the algorithm, support vector machines are powerful tools which usually grant better results in both classification and regression tasks with respect to logistic and linear regression, but they scale poorly with the number of samples used during training.
|
||||
In particular the time complexity is at worst $\order{F \times N^3}$ due to the quadratic nature of~\eqref{eq:svr:loss-v2} and the computation of the kernel function for all samples: for large datasets ($N \gtrsim 10^4$) they are usually outperformed by neural networks.\footnotemark{}
|
||||
\footnotetext{%
|
||||
In general the time complexity can be closer to $\order{F \times N^2}$ for implementations with a good caching strategy in the algorithm.
|
||||
}
|
||||
|
||||
|
||||
\subsection{Decision Trees, Random Forests and Gradient Boosting}
|
||||
\label{sec:app:trees}
|
||||
|
||||
Decision trees are supervised \ml algorithms which model simple decision rules based on the input data~\cite{Quinlan:1986:InductionDecisionTrees, Wittkowski:1986:ClassificationRegressionTrees}.
|
||||
They are informally referred to by the acronym CART (as in \textit{Classification And Regression Trees}) and their name derives from the binary tree structure built by such decision functions, which separate the input data at each iteration (\textit{node}), thus creating a bifurcating structure with \textit{branches} (the different paths, or decisions made) and \textit{leaves} (the samples in each branch): the basic idea behind them is an \textit{if\dots then\dots else} structure.
|
||||
In \texttt{scikit-learn} this is implemented in the classes \texttt{tree.DecisionTreeClassifier} and \texttt{tree.DecisionTreeRegressor}.
|
||||
|
||||
The idea behind it is to take input samples $x^{(i)} \in \R^F$ (for $i = 1, 2, \dots, N$) and partition the space in such a way that data with the same label $y^{(i)} \in \R$ ends up in the same subset of samples (while for classification this may be natural to visualise, for regression this amounts to approximating the input data with a step function whose value is constant inside each partition).
|
||||
Let in fact $j = 1, 2, \dots, F$ label a feature and $x^{(i)}_j$ be the corresponding value for the sample $i$: at each node $n$ of the tree we partition the set of input data $\cM_n$ into two subsets:
|
||||
\begin{equation}
|
||||
\begin{split}
|
||||
\cM^{[1]}_n\qty( t_{j,\, n} )
|
||||
& =
|
||||
\qty{ \qty(x^{(i)},\, y^{(i)}) \in \R^F \times \R \quad \vert \quad x^{(i)}_j < t_{j,\, n} \quad \forall i \in A_n },
|
||||
\\
|
||||
\cM^{[2]}_n\qty( t_{j,\, n} )
|
||||
& =
|
||||
\cM_n \setminus \cM^{[1]}_n\qty( t_{j,\, n} ),
|
||||
\end{split}
|
||||
\end{equation}
|
||||
where $A_n$ is the full set of labels of the data samples in the node $n$ and $t_{j,\, n} \in \R$ is a threshold value for the feature $j$ at node $n$.
|
||||
|
||||
The measure of the ability of the split to reach the objective (classifying or creating a regression model to predict the labels) is modelled through an \textit{impurity} function (i.e. the measure of how often a random data point would be badly classified or how much it would be badly predicted).
|
||||
Common choices in classification tasks are the Gini impurity, a special quadratic case of the Tsallis entropy (which in turn is a generalisation of the Boltzmann--Gibbs entropy, recovered when its deformation parameter tends to one), and the information theoretic definition of the entropy.
|
||||
In regression tasks it is usually given by the $\ell_1$ and $\ell_2$ norms of the deviation from different estimators (the median and the mean, respectively) for each node $n$:
|
||||
\begin{itemize}
|
||||
\item \textit{mean absolute error}
|
||||
\begin{equation}
|
||||
H^{[l]}_n\qty(x;\, t_{j,\, n})
|
||||
=
|
||||
\frac{1}{\abs{\cM^{[l]}_n( t_{j,\, n} )}} \sum\limits_{i \in A^{[l]}_n} \abs{y^{(i)} - \tilde{y}^{[l]}_{pred,\, n}( x )},
|
||||
\quad
|
||||
\qty( x^{(i)},\, y^{(i)} ) \in \cM_n\qty( t_{j,\, n} ),
|
||||
\end{equation}
|
||||
|
||||
\item \textit{mean squared error}:
|
||||
\begin{equation}
|
||||
H^{[l]}_n\qty(x;\, t_{j,\, n})
|
||||
=
|
||||
\frac{1}{\abs{\cM^{[l]}_n( t_{j,\, n} )}} \sum\limits_{i \in A^{[l]}_n} \qty( y^{(i)} - \bar{y}^{[l]}_{pred,\, n}( x ) )^2,
|
||||
\quad
|
||||
\qty( x^{(i)}, y^{(i)} ) \in \cM_n( t_{j,\, n} ),
|
||||
\end{equation}
|
||||
\end{itemize}
|
||||
where $\abs{\cM^{[l]}_n\qty( t_{j,\, n} )}$ is the cardinality of the set $\cM^{[l]}_n\qty( t_{j,\, n} )$ for $l = 1, 2$ and
|
||||
\begin{equation}
|
||||
\tilde{y}^{[l]}_{pred,\, n}( x )
|
||||
=
|
||||
\underset{i \in A^{[l]}_n}{\mathrm{median}}~ y_{pred}\qty(x^{(i)}),
|
||||
\qquad
|
||||
\bar{y}^{[l]}_{pred,\, n}( x )
|
||||
=
|
||||
\frac{1}{\abs{A^{[l]}_n}} \sum\limits_{i \in A^{[l]}_n} y_{pred}\qty(x^{(i)}),
|
||||
\end{equation}
|
||||
where $A_n^{[l]} \subset A_n$ are the subsets of labels in the left and right splits ($l = 1$ and $l = 2$ respectively) of the node $n$.
|
||||
|
||||
The full measure of the impurity of the node $n$ and for a feature $j$ is then:
|
||||
\begin{equation}
|
||||
G_{j,\, n}(\cM;\, t_{j,\, n})
|
||||
=
|
||||
\frac{\abs{\cM_n^{[1]}( t_{j,\, n} )}}{\abs{\cM_n}} H^{[1]}_n( x;\, t_{j,\, n} )
|
||||
+
|
||||
\frac{\abs{\cM_n^{[2]}( t_{j,\, n} )}}{\abs{\cM_n}} H^{[2]}_n( x;\, t_{j,\, n} ),
|
||||
\end{equation}
|
||||
from which we select the parameters
|
||||
\begin{equation}
|
||||
\hatt_{j,\, n}
|
||||
=
|
||||
\underset{t_{j,\, n}}{\mathrm{argmin}}~ G_{j,\, n}\qty( \cM_n;\, t_{j,\, n} ).
|
||||
\label{eq:trees:lossmin}
|
||||
\end{equation}
|
||||
We then recurse over all $\cM_n^{[l]}\qty( \hatt_{j,\, n} )$ (for $l = 1, 2$) until we reach the maximum allowed depth of the tree (at most until $\abs{\cM_n} = 1$).
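As a toy illustration of the split search in~\eqref{eq:trees:lossmin} (a sketch on hypothetical one dimensional data, using the \mse impurity), one can scan the candidate thresholds of a single feature and keep the one minimising the weighted impurity of the two children:
\begin{verbatim}
import numpy as np

def best_split(x, y):
    """Scan candidate thresholds t for one feature and return the one
    minimising the weighted MSE impurity of the two children."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_g = None, np.inf
    for t in (x[1:] + x[:-1]) / 2:        # midpoints as candidate thresholds
        left, right = y[x < t], y[x >= t]
        if len(left) == 0 or len(right) == 0:
            continue
        g = (len(left) * np.var(left) + len(right) * np.var(right)) / len(y)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = np.where(x < 0.3, 1.0, 5.0) + 0.1 * rng.normal(size=100)
print(best_split(x, y))                   # threshold expected near 0.3
\end{verbatim}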
|
||||
|
||||
Other than just predicting a class or a numeric value, decision trees provide a criterion to assign the importance of each feature appearing in the nodes.
|
||||
The implementation of the procedure can however vary between different libraries: in \texttt{scikit-learn} the importance of a feature is computed by the total reduction in the objective function due to the presence of the feature, normalised over all nodes.
|
||||
Namely it is defined as the difference between the total impurity normalised by the total amount of samples in the node and the sum of the separate impurities of the left and right split normalised over the number of samples in the respective splits, summed over all the nodes.
|
||||
Thus features with a high \textit{variable ranking} (or \textit{variable importance}) are those with a higher impact in reducing the loss of the algorithm and can be expected to be seen in the initial branches of the tree.
|
||||
A measure of the variable importance is in general extremely useful for feature engineering and feature selection since it gives a natural way to pick features with a higher chance to provide a good prediction of the labels.
|
||||
|
||||
By nature decision trees have a query time complexity of $\order{ \log(N) }$, like most binary search algorithms.
|
||||
However training them requires running over all $F$ features to find the best split of the $N$ samples, thus increasing the time complexity to $\order{ F \times N \log( N ) }$.
|
||||
Summing over all samples in the whole node structure leads to the worst case scenario of a time complexity $\order{ F \times N^2 \log( N ) }$.
|
||||
Well balanced trees (that is, nodes are approximately symmetric with the same amount of data samples inside) can usually reduce that time by a factor $N$, but it may not always be the case.
|
||||
|
||||
Decision trees have the advantage of being very good at classifying or creating regression relations in the presence of ``well separable'' data samples and they usually provide very good predictions in a reasonable amount of time (especially when balanced).
However if $F$ is very large, a small variation of the data will almost always lead to a huge change in the decision thresholds, and they are usually prone to overfitting.
|
||||
There are however smart ways to compensate for this behaviour based on \textit{ensemble} learning, such as \textit{bagging} and \textit{boosting}, as well as \textit{pruning} methods such as limiting the depth of the tree or the number of splits and introducing a dropout parameter to remove certain nodes of the tree.\footnotemark{}
|
||||
\footnotetext{%
|
||||
The term \textit{bagging} comes from the contraction of \textit{bootstrap} and \textit{aggregating}: predictions are in fact made over randomly sampled partitions of the training set with replacement (i.e.\ samples can appear in different partitions, the \textit{bootstrap} approach) and then averaged together (\textit{aggregating}).
|
||||
Random forests are an improvement to this simple idea and work best for decision trees: while it is possible to bag simple trees and take their predictions, using the random subsampling as described usually leads to better performance and results.
|
||||
}
|
||||
Random forests of trees also provide a variable ranking system by averaging the importance of each feature across all base estimators in the bagging aggregator.
|
||||
|
||||
As a reference, \textit{random forests} of decision trees (as in \texttt{ensemble.RandomForestRegressor} in \texttt{scikit-learn}) are ensemble learning algorithms based on fully grown (deep) decision trees.
|
||||
They were created to overcome the issues related to overfitting and variability of the input data and are based on random sampling of the training data~\cite{Ho:1995:RandomDecisionForests}.
|
||||
The idea is to take $K$ random partitions of the training data, train a different decision tree on each of them and combine the results: for a classification task this amounts to averaging the \textit{a posteriori} (or conditional) probability of predicting the class $c$ given an input $x$ (i.e.\ the Bayesian probability $P\qty(c \mid x)$) over the $K$ trees, while for regression it amounts to averaging the predictions of the trees $y_{pred,\, \hatn}^{(i)\, \lbrace k \rbrace}$, where $k = 1, 2, \dots, K$ and $\hatn$ is the final node (i.e.\ the node containing the final predictions).
|
||||
This defines what has been called a \textit{random forest} of trees which can usually help in improving the predictions by reducing the variance due to trees adapting too much to training sets.
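A minimal sketch of this procedure (hypothetical data, assuming the \texttt{scikit-learn} class named above) also shows the averaged variable ranking:
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=500)

# K fully grown trees trained on bootstrap samples of the data
forest = RandomForestRegressor(n_estimators=100, bootstrap=True).fit(X, y)

# impurity-based variable ranking, averaged over the base estimators
print(forest.feature_importances_)    # features 0 and 3 should dominate
\end{verbatim}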
|
||||
|
||||
\textit{Boosting} methods are another implementation of ensemble learning algorithms in which several \textit{weak learners}, in this case shallow decision trees, are trained over the training dataset~\cite{Friedman:2001:GreedyFunctionApproximation, Friedman:2002:StochasticGradientBoosting}.
|
||||
In general the parameters $\hatt_{j,\, n}$ in~\eqref{eq:trees:lossmin} can be approximated by an expansion
|
||||
\begin{equation}
|
||||
t_{j,\, n}( x )
|
||||
=
|
||||
\finitesum{m}{0}{M} t^{\qty{m}}_{j,\, n}( x )
|
||||
=
|
||||
\finitesum{m}{0}{M} \beta^{\qty{m}}_{j,\, n} g( x;\, a^{\qty{m}}_{j,\, n} ),
|
||||
\label{eq:trees:par}
|
||||
\end{equation}
|
||||
where $g( x;\, a^{\qty{m}}_{j,\, n})$ are called \textit{base learners} and $M$ is the number of iterations.\footnotemark{}
|
||||
\footnotetext{%
|
||||
Different implementations of the algorithm refer to the number of iterations in different ways.
|
||||
For instance \texttt{scikit-learn} calls them \texttt{n\_estimators} in the class \texttt{ensemble.GradientBoostingRegressor} in analogy to the random forest where the same name is given to the number of trained decision trees, while \texttt{XGBoost} prefers \texttt{num\_boost\_rounds} and \texttt{num\_parallel\_tree} to name the number of boosting rounds (the iterations) and the number of trees trained in parallel in a forest.
|
||||
}
|
||||
The values of $a^{\qty{m}}_{j,\, n}$ and $\beta^{\qty{m}}_{j,\, n}$ are enough to specify the value of $t_{j,\, n}( x )$ and can be computed by iterating \eqref{eq:trees:lossmin}:
|
||||
\begin{equation}
|
||||
\qty( a^{\qty{m}}_{j,\, n},\, \beta^{\qty{m}}_{j,\, n} )
|
||||
=
|
||||
\underset{\qty{a_{j,\, n};\, \beta_{j,\, n}}}{\mathrm{argmin}}~
|
||||
G_{j,\, n}\qty( \cM_n;\, t^{\qty{m-1}}_{j,\, n}( x ) + \beta_{j,\, n} g\qty( x;\, a_{j,\, n} ) ).
|
||||
\label{eq:trees:iter}
|
||||
\end{equation}
|
||||
The specific case of boosted trees is simpler since the base learner predicts a constant value $g\qty( x;\, a^{\qty{m}}_{j,\, n} )$, thus~\eqref{eq:trees:iter} simplifies to
|
||||
\begin{equation}
|
||||
\gamma^{\qty{m}}_{j,\, n}
|
||||
=
|
||||
\underset{\gamma_{j,\, n}}{\mathrm{argmin}}~
|
||||
G_{j,\, n}\qty( \cM_n;\, t^{\qty{m-1}}_{j,\, n}( x ) + \gamma_{j,\, n} ).
|
||||
\end{equation}
|
||||
Ultimately the values of the parameters in~\eqref{eq:trees:par} are updated using gradient descent as
|
||||
\begin{equation}
|
||||
t^{\qty{m}}_{j,\, n}( x )
|
||||
=
|
||||
t^{\qty{m-1}}_{j,\, n}( x ) + \nu\, \gamma_{j,\, n}^{\qty{m}},
|
||||
\end{equation}
|
||||
where $0 \le \nu \le 1$ is the \textit{learning rate} which controls the magnitude of the update.
|
||||
Through this procedure, boosted trees can usually vastly improve the predictions of very small decision trees by reducing their bias, at the price of a possible increase in variance.
|
||||
Another way to prevent overfitting the training set is to randomly \textit{subsample} the features vector by taking a subset of them (in \texttt{scikit-learn} it is represented as a percentage of the total number of features).
|
||||
Moreover \texttt{scikit-learn} introduces various ways to control the loss of gradient boosting: apart from the aforementioned \textit{least squares} and \textit{least absolute deviation}, we can have hybrid versions of these such as the \textit{huber} loss, which combines the two previous losses through an additional hyperparameter $\alpha$~\cite{Fawcett:2001:UsingRuleSets}. While more implementations are available, boosted trees also provide a way to measure the importance of the variables, as any decision tree algorithm does.
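A compact sketch of such a configuration (hypothetical hyperparameter values, assuming the \texttt{scikit-learn} class quoted in the footnote above) could read:
\begin{verbatim}
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

booster = GradientBoostingRegressor(
    n_estimators=200,    # number of boosting iterations M
    learning_rate=0.1,   # shrinkage factor nu
    max_depth=3,         # shallow trees as weak learners
    subsample=0.8,       # random subsampling of the training samples
    max_features=0.8,    # random subsampling of the features
    loss="huber",        # hybrid l1/l2 loss
    alpha=0.9,           # quantile hyperparameter of the huber loss
).fit(X, y)

print(booster.feature_importances_)   # variable ranking, as for any tree model
\end{verbatim}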
|
||||
|
||||
|
||||
\subsection{Artificial Neural Networks}
|
||||
\label{sec:app:nn}
|
||||
|
||||
\ann are state-of-the-art algorithms in \ml.
|
||||
They usually outperform any other algorithm in very large datasets (the size of our dataset is roughly at the threshold) and can learn very complicated decision boundaries and functions.\footnotemark{}
|
||||
\footnotetext{%
|
||||
Despite their fame among the general public, even small networks can prove to be extremely good at learning complicated functions in a small amount of time.
|
||||
}
|
||||
In the main text we used two types of neural networks: \textit{fully connected} (\fc) networks and \textit{convolutional neural networks} (\cnn).
|
||||
They both rely on being built in a layered structure, starting from the input layers (e.g.\ the configuration matrix of CY manifolds, an RGB image, or several engineered features) and going towards the output layers (e.g.\ the Hodge numbers or the class of the image).
|
||||
|
||||
In \fc networks the input of layer $l$ is a feature vector $a^{(i)\, \qty{l}} \in \R^{n_l}$ (for $i = 1, 2, \dots, N$) and, as shown in~\Cref{fig:nn:dense}, each layer is densely connected to the following.\footnotemark{}
|
||||
\footnotetext{%
|
||||
The input vector $x \in \R^F$ is equivalent to the vector $a^{\qty{0}}$ and $n_0 = F$.
|
||||
Inputs to each layer are here represented as a matrix $a^{\qty{l}}$ whose columns are made by samples and whose rows are filled with the values of the features.
|
||||
}
|
||||
In other words, each entry of the vectors $a^{(i)\, \qty{l}}_j$ (for $j = 1, 2, \dots, n_l$) is mapped through a function $\psi$ to all the components of the following layer $a^{\qty{l+1}} \in \R^{n_{l+1}}$:
|
||||
\begin{equation}
|
||||
\begin{split}
|
||||
\psi\colon & \R^{n_l} \quad \longrightarrow \quad \R^{n_{l+1}}
|
||||
\\
|
||||
& a^{(i)\, \qty{l}} \quad \longmapsto \quad a^{(i)\, \qty{l+1}} = \psi\qty( a^{(i)\, \qty{l}} ),
|
||||
\end{split}
|
||||
\end{equation}
|
||||
such that
|
||||
\begin{equation}
|
||||
a^{(i)\, \qty{l+1}}_j
|
||||
=
|
||||
\psi_j( a^{(i)\, \qty{l}} )
|
||||
=
|
||||
\phi\qty( \finitesum{k}{1}{n_l} a^{(i)\, \qty{l}}_k W^{\qty{l}}_{kj} + b^{\qty{l}}\, \1_{j} ),
|
||||
\end{equation}
|
||||
where $\1 \in \R^{n_{l+1}}$ is a vector of ones.
The matrix $W^{\qty{l}}$ is the \textit{weight matrix} and $b^{\qty{l}}$ is the \textit{bias} term.
The function $\phi$ is a non linear function and plays a fundamental role: without it the successive application of the linear maps $a^{\qty{l}} \cdot W^{\qty{l}} + b^{\qty{l}}\, \1$ would prevent the network from learning more complicated decision boundaries or functions, as the \ann would only be capable of reproducing linear relations.
$\phi$ is known as \textit{activation function} and can assume different forms, as long as its non linearity is preserved (e.g.\ a \textit{sigmoid} function in the output layer of a network squeezes the results in the interval $[0, 1]$, thus reproducing the probabilities of a classification).
|
||||
A common choice is the \textit{rectified linear unit} ($\mathrm{ReLU}$) function
|
||||
\begin{equation}
|
||||
\phi( z ) = \mathrm{ReLU}( z ) = \max( 0, z ),
|
||||
\end{equation}
|
||||
which has been proven to be better at training deep learning architectures~\cite{Glorot:2011:DeepSparseRectifier}, or its modified version $\mathrm{LeakyReLU}( z ) = \max( \alpha z, z )$ which introduces a small slope $\alpha > 0$ to improve the computational performance near the non differentiable point at the origin.
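As a minimal numerical sketch of the map defined above (hypothetical layer sizes), a single densely connected layer with $\mathrm{ReLU}$ activation reads:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_l, n_lp1 = 8, 4                     # layer widths n_l and n_{l+1}
a_l = rng.normal(size=(32, n_l))      # batch of 32 inputs a^{l}
W = rng.normal(size=(n_l, n_lp1))     # weight matrix W^{l}
b = rng.normal(size=n_lp1)            # bias term b^{l}

relu = lambda z: np.maximum(0.0, z)   # phi(z) = max(0, z)
a_lp1 = relu(a_l @ W + b)             # a^{l+1} = phi(a^{l} W^{l} + b^{l})
print(a_lp1.shape)                    # (32, 4)
\end{verbatim}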
|
||||
|
||||
\cnn architectures were born in the context of computer vision and object localisation~\cite{Tompson:2015:EfficientObjectLocalization}.
|
||||
As one can suspect looking at~\Cref{fig:nn:lenet} for instance, the fundamental difference with \fc networks is that they use a convolution operation $K^{\qty{l}} * a^{(i)\, \qty{l}}$ instead of a linear map to transform the output of the layers, before applying the activation function.\footnotemark{}
|
||||
\footnotetext{%
|
||||
In general the input of each layer can be a generic tensor with an arbitrary number of axes.
|
||||
For instance, an RGB image can be represented by a three dimensional tensor with indices representing the width of the image, its height and the number of filters (in this case $3$, one for each colour channel).
|
||||
}
|
||||
This way the network is no longer densely connected, as the result of the convolution (the \textit{feature map}) depends only on a restricted neighbourhood of the original feature, determined by the size of the \textit{kernel} window $K^{\qty{l}}$ used and the shape of the input $a^{(i)\, \qty{l}}$, which is no longer limited to flattened vectors.
|
||||
In turn its size influences the convolution operation we can compute: one way to see this is to visualise an image being scanned by a smaller window function over all pixels or by skipping a certain number of them (the length of the \textit{stride} of the kernel).
|
||||
In general the output will therefore have a different size from the input, unless the latter is \textit{padded} (usually with zeros) before the convolution. The size of the output is:
|
||||
\begin{equation}
|
||||
O_n = \frac{I_n - k_n + 2 p_n}{S_n} + 1, \qquad n = 1, 2, \dots,
|
||||
\end{equation}
|
||||
where $O$ is the output size, $I$ the input size, $k$ the size of the kernel used, $p$ the amount of padding (symmetric at the start and end of the axis considered) and $S$ the stride.
|
||||
In the formula, $n$ runs over the number of components of the input tensor.
|
||||
While any padding is possible, we are usually interested in two kinds of possible convolutions:
|
||||
\begin{itemize}
|
||||
\item ``same'' convolutions for which $O_n = I_n$, thus $p_n = \frac{I_n ( S_n - 1 ) - S_n + k_n}{2}$,
|
||||
|
||||
\item ``valid'' convolutions for which $O_n < I_n$ and $p_n = 0$.
|
||||
\end{itemize}
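A small helper function (a sketch of the relation above) makes the two cases explicit:
\begin{verbatim}
def conv_output_size(i, k, p=0, s=1):
    """Output size along one axis: O = (I - k + 2 p) / S + 1."""
    return (i - k + 2 * p) // s + 1

# "valid" convolution: no padding, the output shrinks
print(conv_output_size(12, 3))        # 10
# "same" convolution with stride 1: pad with p = (k - 1) / 2
print(conv_output_size(12, 3, p=1))   # 12
\end{verbatim}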
|
||||
|
||||
In both cases the learning process aims to minimise the loss function defined for the task: in our regression implementation of the architecture we used the mean squared error of the predictions.
|
||||
The objective is to find the best possible values of the weight and bias terms $W^{\qty{l}}$ and $b^{\qty{l}}$, or to build the best filter kernel $K^{\qty{l}}$, through \textit{backpropagation}~\cite{Rumelhart:1986:LearningRepresentationsBackpropagating}, that is by reconstructing the gradient of the loss function climbing back the network from the output layer to the input, and then using the usual gradient descent procedure to select the optimal parameters.
|
||||
For instance, in the case of \fc networks we need to find
|
||||
\begin{equation}
|
||||
\qty( \hatW^{\qty{l}},\, \hatb^{\qty{l}} )
|
||||
=
|
||||
\underset{W^{\qty{l}},\, b^{\qty{l}}}{\mathrm{argmin}} \frac{1}{2 N} \finitesum{i}{1}{N} \qty( y^{(i)} - a^{(i)\, \qty{L}} )^2
|
||||
\quad
|
||||
\forall l = 1, 2, \dots, L,
|
||||
\end{equation}
|
||||
where $L$ is the total number of layers in the network.
|
||||
A similar relation holds in the case of CNN architectures.
|
||||
In the main text we use the \textit{Adam}~\cite{Kingma:2017:AdamMethodStochastic} implementation of gradient descent and add batch normalisation layers to improve the convergence of the algorithm.
|
||||
|
||||
As we can see from their definition, neural networks are capable of learning very complex structures at the cost of having a large number of parameters to tune.
|
||||
The risk of overfitting the training set is therefore quite evident.
|
||||
There are in general several techniques to counteract the tendency to adapt to the training set, one of them being the introduction of regularisation ($\ell_2$ and $\ell_1$) in the same fashion as for a linear model (as shown in~\Cref{sec:app:linreg}).
|
||||
Another successful way is to introduce \textit{dropout} layers~\cite{Srivastava:2014:DropoutSimpleWay}, in which connections are randomly switched off according to a certain retention probability (or its complement, the dropout \textit{rate}): this regularisation technique helps to retain good generalisation properties since the predictions cannot rely too heavily on the particular architecture, which is randomly modified during training (dropout layers however act as the identity during inference to avoid producing random results).
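A minimal sketch of such a regularised \fc architecture (hypothetical layer widths, assuming the Keras \texttt{Sequential} interface), combining dense layers, batch normalisation and dropout and trained with \textit{Adam} on a \mse loss, could read:
\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(512, activation="relu", input_shape=(180,)),
    layers.BatchNormalization(),
    layers.Dropout(0.2),                # dropout rate = 1 - retention probability
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(1),                    # single regression output
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="mse")
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=300, batch_size=32)  # training call left as a placeholder
\end{verbatim}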
|
||||
|
||||
|
||||
% vim: ft=tex
|
||||
28
sec/part1/conclusion.tex
Normal file
@@ -0,0 +1,28 @@
|
||||
We thus showed that the specific geometry of the intersecting D-branes leads to different results when computing the value of the classical action, that is the leading contribution to the Yukawa couplings in string theory.
|
||||
In particular in the Abelian case the value of the action is exactly the area formed by the intersecting D-branes in the $\R^2$ plane, i.e.\ the string worldsheet is completely contained in the polygon on the plane.
|
||||
The difference between the \SO{4} case and \SU{2} is more subtle as in the latter there are complex coordinates in $\R^4$ for which the classical string solution is holomorphic in the upper half plane.
|
||||
In the generic case presented so far this is in general no longer true.
|
||||
The reason can probably be traced back to supersymmetry, even though we only dealt with the bosonic string.
|
||||
In fact when considering \SU{2} rotated D-branes part of the spacetime supersymmetry is preserved, while this is not the case for \SO{4} rotations.
|
||||
|
||||
In the general case there does not seem to be any possible way of computing the action~\eqref{eq:action_with_imaginary_part} in terms of the global data.
|
||||
Most probably the value of the action is larger than in the holomorphic case since the string is no longer confined to a plane.
|
||||
Given the nature of the rotation its worldsheet has to bend in order to be attached to the D-brane as pictorially shown in~\Cref{fig:brane3d} in the case of a $3$-dimensional space.
|
||||
The general case we considered then differs from the known factorized case by an additional contribution in the on-shell action which can be intuitively understood as a small ``bump'' of the string worldsheet in proximity of the boundary.
|
||||
|
||||
We thus showed that the specific geometry of the intersecting D-branes leads to different results when computing the value of the classical action, that is the leading contribution to the Yukawa couplings in string theory.
|
||||
In particular in the Abelian case the value of the action is exactly the area formed by the intersecting D-branes in the $\R^2$ plane, i.e.\ the string worldsheet is completely contained in the polygon on the plane.
|
||||
The difference between the \SO{4} case and \SU{2} is more subtle as in the latter there are complex coordinates in $\R^4$ for which the classical string solution is holomorphic in the upper half plane.
|
||||
In the generic case presented so far this is in general no longer true.
|
||||
The reason can probably be traced back to supersymmetry, even though we only dealt with the bosonic string.
|
||||
In fact when considering \SU{2} rotated D-branes part of the spacetime supersymmetry is preserved, while this is not the case for \SO{4} rotations.
|
||||
|
||||
|
||||
In a technical and direct way we showed the computation of amplitudes involving an arbitrary number of Abelian spin and matter fields.
|
||||
The approach we introduced does not generally rely on \cft techniques and can be seen as an alternative to bosonization and old methods based on the Reggeon vertex.
|
||||
Starting from this work the future direction may involve the generalisation to non Abelian spin fields and the application to twist fields.
|
||||
In this sense this approach might be the only way to compute the amplitudes involving these complicated scenarios.
|
||||
This analytical approach may also shed some light on the non existence of a technique similar to bosonisation for twist fields.
|
||||
|
||||
|
||||
% vim: ft=tex
|
||||
@@ -2552,9 +2552,6 @@ A visual reference can be found in~\Cref{fig:branes_at_angles}.
|
||||
For the \SU{2} case we can use a rotation to map $(f_{(t-1)} - f_{(t)})^i$ to the form $\norm{f_{(t-1)} - f_{(t)}} \delta^i_1$.
|
||||
Each term of the action can be interpreted again as an area of a triangle where the distance between the interaction points is the base.
|
||||
|
||||
|
||||
\subsubsection{Generalisation and Summary}
|
||||
|
||||
\begin{figure}[tbp]
|
||||
\centering
|
||||
\def\svgwidth{0.35\textwidth}
|
||||
@@ -2566,16 +2563,12 @@ Each term of the action can be interpreted again as an area of a triangle where
|
||||
\label{fig:brane3d}
|
||||
\end{figure}
|
||||
|
||||
\subsubsection{General Case and Intuitive Explanation}
|
||||
|
||||
In the general case there does not seem to be any possible way of computing the action~\eqref{eq:action_with_imaginary_part} in term of the global data.
|
||||
Most probably the value of the action is larger than in the holomorphic case since the string is no longer confined to a plane.
|
||||
Given the nature of the rotation its worldsheet has to bend in order to be attached to the D-brane as pictorially shown in~\Cref{fig:brane3d} in the case of a $3$-dimensional space.
|
||||
The general case we considered then differs from the known factorized case by an additional contribution in the on-shell action which can be intuitively understood as a small ``bump'' of the string worldsheet in proximity of the boundary.
|
||||
|
||||
We thus showed that the specific geometry of the intersecting D-branes leads to different results when computing the value of the classical action, that is the leading contribution to the Yukawa couplings in string theory.
|
||||
In particular in the Abelian case the value of the action is exactly the area formed by the intersecting D-branes in the $\R^2$ plane, i.e.\ the string worldsheet is completely contained in the polygon on the plane.
|
||||
The difference between the \SO{4} case and \SU{2} is more subtle as in the latter there are complex coordinates in $\R^4$ for which the classical string solution is holomorphic in the upper half plane.
|
||||
In the generic case presented so far this is in general no longer true.
|
||||
The reason can probably be traced back to supersymmetry, even though we only dealt with the bosonic string.
|
||||
In fact when considering \SU{2} rotated D-branes part of the spacetime supersymmetry is preserved, while this is not the case for \SO{4} rotations.
|
||||
|
||||
% vim: ft=tex
|
||||
|
||||
@@ -2865,12 +2865,4 @@ using Wick's theorem since the algebra and the action of creation and annihilati
|
||||
In particular taking one $\Psi(z)$ and one $\Psi^*(w)$ we get the Green function which is nothing else but the contraction in equation~\eqref{eq:gen_Radial_order}.
|
||||
|
||||
|
||||
\subsubsection{Summary and Conclusions}
|
||||
|
||||
In a technical and direct way we showed the computation of amplitudes involving an arbitrary number of Abelian spin and matter fields.
|
||||
The approach we introduced does not generally rely on \cft techniques and can be seen as an alternative to bosonization and old methods based on the Reggeon vertex.
|
||||
Starting from this work the future direction may involve the generalisation to non Abelian spin fields and the application to twist fields.
|
||||
In this sense this approach might be the only way to compute the amplitudes involving these complicated scenarios.
|
||||
This analytical approach may also shed some light on the non existence of a technique similar to bosonisation for twist fields.
|
||||
|
||||
% vim: ft=tex
|
||||
|
||||
11
sec/part2/conclusion.tex
Normal file
@@ -0,0 +1,11 @@
|
||||
In the previous analysis it seems that string theory cannot do better than field theory when the latter does not exist, at least at the perturbative level where one deals with particles.
|
||||
Moreover when spacetime becomes singular, the string massive modes are no longer spectators.
|
||||
Everything seems to suggest that issues with spacetime singularities are hidden into contact terms and interactions with massive states.
|
||||
This would explain in an intuitive way why the eikonal approach to gravitational scattering works well: the eikonal is indeed concerned with three point massless interactions.
|
||||
In fact it appears that the classical and quantum scattering on an electromagnetic wave~\cite{Jackiw:1992:ElectromagneticFieldsMassless} or gravitational wave~\cite{tHooft:1987:GravitonDominanceUltrahighenergy} in \bo and \nbo are well behaved.
|
||||
From this point of view the ACV approach~\cite{Soldate:1987:PartialwaveUnitarityClosedstring,Amati:1987:SuperstringCollisionsPlanckian} may be more sensible, especially when considering massive external states~\cite{Black:2012:HighEnergyString}.
|
||||
Finally it seems that all issues are related with the Laplacian associated with the space-like subspace with vanishing volume at the singularity.
|
||||
If there is a discrete zero eigenvalue the theory develops divergences.
|
||||
|
||||
|
||||
% vim: ft=tex
|
||||
@@ -4609,15 +4609,4 @@ The two terms add together because of sign of the covariant derivative to give:
|
||||
which is divergent for the physical polarisation $\cS_{t\,t} = \cS_{\varphi\,\varphi} = -\hsigma_- \sigma_- \cS_{t\,\varphi} = -\frac{1}{2} \hsigma_- \sigma_- \cS_{2 2}$.
|
||||
|
||||
|
||||
\subsection{Summary and Conclusions}
|
||||
|
||||
In the previous analysis it seems that string theory cannot do better than field theory when the latter does not exist, at least at the perturbative level where one deals with particles.
|
||||
Moreover when spacetime becomes singular, the string massive modes are not anymore spectators.
|
||||
Everything seems to suggest that issues with spacetime singularities are hidden into contact terms and interactions with massive states.
|
||||
This would explain in an intuitive way why the eikonal approach to gravitational scattering works well: the eikonal is indeed concerned with three point massless interactions.
|
||||
In fact it appears that the classical and quantum scattering on an electromagnetic wave~\cite{Jackiw:1992:ElectromagneticFieldsMassless} or gravitational wave~\cite{tHooft:1987:GravitonDominanceUltrahighenergy} in \bo and \nbo are well behaved.
|
||||
From this point of view the ACV approach~\cite{Soldate:1987:PartialwaveUnitarityClosedstring,Amati:1987:SuperstringCollisionsPlanckian} may be more sensible, especially when considering massive external states~\cite{Black:2012:HighEnergyString}.
|
||||
Finally it seems that all issues are related with the Laplacian associated with the space-like subspace with vanishing volume at the singularity.
|
||||
If there is a discrete zero eigenvalue the theory develops divergences.
|
||||
|
||||
% vim: ft=tex
|
||||
|
||||
22
sec/part3/conclusion.tex
Normal file
@@ -0,0 +1,22 @@
|
||||
We have shown how a proper data analysis can lead to improvements in the predictions of the Hodge numbers \hodge{1}{1} and \hodge{2}{1} for \cicy $3$-folds.
|
||||
Moreover considering more complex neural networks inspired by the computer vision applications~\cite{Szegedy:2015:GoingDeeperConvolutions, Szegedy:2016:RethinkingInceptionArchitecture, Szegedy:2016:Inceptionv4InceptionresnetImpact} allowed us to reach close to \SI{100}{\percent} accuracy for \hodge{1}{1} with much less data and less parameters than in previous works.
|
||||
While our analysis improved the accuracy for \hodge{2}{1} over what can be expected from a simple sequential neural network, we barely reached \SI{50}{\percent}.
|
||||
Hence it would be interesting to push further our study to improve the accuracy.
|
||||
Possible solutions would be to use a deeper Inception network, find a better architecture including engineered features, and refine the ensembling.
|
||||
|
||||
Another interesting question to probe is related to representation learning, i.e.\ finding a better description of the \cy.
|
||||
Indeed one of the main difficulties in making predictions is the redundancy of the possible descriptions of a single manifold.
|
||||
For instance we could try to set up a map from any matrix to its favourable representation (if it exists).
|
||||
This could be the basis for the use of adversarial networks~\cite{Goodfellow:2014:GenerativeAdversarialNets} capable of generating the favourable embedding from the first.
|
||||
Or on the contrary one could generate more matrices for the same manifold in order to increase the size of the training set.
|
||||
Another possibility is to use the graph representation of the configuration matrix, which is automatically invariant under permutations~\cite{Hubsch:1992:CalabiyauManifoldsBestiary} (another graph representation has been decisive in~\cite{Krippendorf:2020:DetectingSymmetriesNeural} to get a good accuracy).
|
||||
Techniques such as (variational) autoencoders~\cite{Kingma:2014:AutoEncodingVariationalBayes, Rezende:2014:StochasticBackpropagationApproximate, Salimans:2015:MarkovChainMonte}, cycle GAN~\cite{Zhu:2017:UnpairedImagetoimageTranslation}, invertible neural networks~\cite{Ardizzone:2019:AnalyzingInverseProblems}, graph neural networks~\cite{Gori:2005:NewModelLearning, Scarselli:2004:GraphicalbasedLearningEnvironments} or more generally techniques from geometric deep learning~\cite{Monti:2017:GeometricDeepLearning} could be helpful.
|
||||
|
||||
Finally, our techniques apply directly to \cicy $4$-folds~\cite{Gray:2013:AllCompleteIntersection, Gray:2014:TopologicalInvariantsFibration}.
|
||||
However there are many more manifolds in this case, so that one can expect to reach a better accuracy for the different Hodge numbers (the different learning curves for the $3$-folds indicate that the model training would benefit from more data).
|
||||
Another interesting class of manifolds to explore with our techniques are generalized \cicy $3$-folds~\cite{Anderson:2016:NewConstructionCalabiYau}.
|
||||
|
||||
These and others will indeed be grounds for future investigations.
|
||||
|
||||
|
||||
% vim: ft=tex
|
||||
289
sec/part3/ml.tex
@@ -620,7 +620,7 @@ We rounded the predictions to the floor for the original dataset and to the next
|
||||
\end{tabular}%
|
||||
}
|
||||
\caption{%
|
||||
Hyperparameter choices of the linear \svm regression.
|
||||
Hyperparameter choices of the linear support vector regression.
|
||||
The parameter \texttt{intercept\_scaling} is clearly only relevant when the intercept is used.
|
||||
The different losses used simply distinguish between the $\ell_1$ norm of the $\epsilon$-dependent boundary where no penalty is assigned and its $\ell_2$ norm.
|
||||
}
|
||||
@@ -974,7 +974,7 @@ Differently from the previous algorithms, we do not perform a cross-validation s
|
||||
Thus we use \SI{80}{\percent} of the samples for training, \SI{10}{\percent} for evaluation, and \SI{10}{\percent} as a test set.
|
||||
For the same reason, the optimisation of the algorithm has been performed manually.
|
||||
|
||||
We always use the Adam optimiser with default learning rate $\num{e-3}$ to perform the gradient descent and a fix batch size of $32$.
|
||||
We always use the Adam optimiser with default learning rate \num{e-3} to perform the gradient descent and a fixed batch size of $32$.
|
||||
The network is trained for a large number of epochs to avoid missing possible local optima.
|
||||
In order to avoid overshooting the minimum of the loss function, we dynamically reduce the learning rate both using the \emph{Adam} optimiser, which implements learning rate decay, and through the callback \texttt{callbacks.ReduceLROnPlateau} in Keras, which scales the learning rate by a given factor when the monitored quantity (in our case the validation loss) does not decrease: we choose to reduce it by a factor $0.3$ when the validation loss does not improve for at least $75$ epochs.
|
||||
Moreover we stop training when the validation loss does not improve during $200$ epochs.
|
||||
@@ -991,8 +991,8 @@ First we reproduce the analysis in~\cite{Bull:2018:MachineLearningCICY} for the
|
||||
\paragraph{Model}
|
||||
|
||||
The neural network presented in~\cite{Bull:2018:MachineLearningCICY} for the regression task contains $5$ hidden layers with $876$, $461$, $437$, $929$ and $404$ units (\Cref{fig:nn:dense}).
|
||||
All layers (including the output layer) are followed by a ReLU activation and by a dropout layer with a rate of $\num{0.2072}$.
|
||||
This network contains roughly $\num{1.58e6}$ parameters.
|
||||
All layers (including the output layer) are followed by a ReLU activation and by a dropout layer with a rate of \num{0.2072}.
|
||||
This network contains roughly \num{1.58e6} parameters.
|
||||
|
||||
The other hyperparameters (like the optimiser, batch size, number of epochs, regularisation, etc.) are not mentioned.
|
||||
In order to reproduce the results, we fill the gap as follows:
|
||||
@@ -1033,7 +1033,7 @@ Using the same network we also achieve \SI{97}{\percent} of accuracy in the favo
|
||||
\end{subfigure}
|
||||
\caption{%
|
||||
Fully connected network for the prediction of \hodge{1}{1}.
|
||||
For simplicity we do not draw the dropout and batch normalisation layers present after every FC layer.
|
||||
For simplicity we do not draw the dropout and batch normalisation layers present after every densely connected layer.
|
||||
}
|
||||
\label{fig:nn:fcnetwork}
|
||||
\end{figure}
|
||||
@@ -1113,12 +1113,12 @@ The convolution layers have $180$, $100$, $40$ and $20$ units each.
|
||||
With this setup, we were able to achieve an accuracy of \SI{94}{\percent} on both the development and the test sets for the ``old'' database and \SI{99}{\percent} for the favourable dataset in both validation and test sets (results are briefly summarised in \Cref{tab:res:ann}).
|
||||
We thus improved the results of the densely connected network and proved that convolutional networks can be valuable assets when dealing with the extraction of a good representation of the input data: not only are convolutional networks very good at recognising patterns and rotationally invariant objects inside pictures or general matrices of data, but deep architectures are also capable of transforming the input using non linear transformations~\cite{Mallat:2016:UnderstandingDeepConvolutional} to create new patterns which can then be used for predictions.
|
||||
|
||||
Even though the convolution operation is very time consuming another advantage of \cnn is the extremely reduced number of parameters with respect to FC networks.\footnotemark{}
|
||||
Even though the convolution operation is very time consuming another advantage of \cnn is the extremely reduced number of parameters with respect to fully connected (\fc) networks.\footnotemark{}
|
||||
\footnotetext{%
|
||||
It took around 4 hours of training (and no optimisation) for each Hodge number in each dataset.
|
||||
The use of modern generation GPUs with tensor cores can however speed up the training by orders of magnitude.
|
||||
}
|
||||
The architectures we used were in fact made of approximately $\num{5.8e5}$ parameters: way less than half the number of parameters used in the FC network.
|
||||
The architectures we used were in fact made of approximately \num{5.8e5} parameters: way less than half the number of parameters used in the \fc network.
|
||||
Ultimately, this leads to a smaller number of training epochs necessary to achieve good predictions (see~\Cref{fig:cnn:class-ccnn}).
|
||||
|
||||
\begin{figure}[tbp]
|
||||
@@ -1145,11 +1145,11 @@ Ultimately, this leads to a smaller number of training epochs necessary to achie
|
||||
Using this classic setup we tried different architectures.
|
||||
The network for the original dataset seems to work best in the presence of larger kernels, dropping by roughly \SI{5}{\percent} in accuracy when a more ``classical'' $3 \times 3$ kernel is used.
|
||||
We also tried to use to set the padding to \texttt{valid}, reducing the input from a $12 \times 15$ matrix to a $1 \times 1$ feature map over the course of $5$ layers with $180$, $100$, $75$, $40$ and $20$ filters.
|
||||
The advantage is the reduction of the number of parameters (namely $\sim \num{4.9e5}$) mainly due to the small FC network at the end, but accuracy dropped to \SI{87}{\percent}.
|
||||
The advantage is the reduction of the number of parameters (namely $\sim \num{4.9e5}$) mainly due to the small \fc network at the end, but accuracy dropped to \SI{87}{\percent}.
|
||||
The favourable dataset seems instead to be more independent of the specific architecture retaining accuracy also with smaller kernels.
|
||||
|
||||
The analysis for \hodge{2}{1} follows the same prescriptions.
|
||||
For both the original and favourable dataset, we opted for 4 convolutional layers with \numlist{250;150;100;50} filters and no FC network for a total amount of $\num{2.1e6}$ parameters.
|
||||
For both the original and favourable dataset, we opted for 4 convolutional layers with \numlist{250;150;100;50} filters and no \fc network for a total amount of \num{2.1e6} parameters.
|
||||
In this scenario we were able to achieve \SI{36}{\percent} of accuracy in the development set and \SI{40}{\percent} on the test set for \hodge{2}{1} in the ``old'' dataset and \SI{31}{\percent} in both development and test sets in the favourable set (see~\Cref{tab:res:ann}).
|
||||
The learning curves for both Hodge numbers are given in \Cref{fig:lc:class-ccnn}.
|
||||
This model uses the same architecture as the one for predicting \hodge{1}{1} only, which explains why it is less accurate as it needs to also adapt to compute \hodge{2}{1} (see for example \Cref{fig:lc:inception}).
|
||||
@@ -1164,81 +1164,79 @@ This model uses the same architecture as the one for predicting \hodge{1}{1} onl
|
||||
\end{figure}
|
||||

%%% TODO %%%
\subsubsection{Inception-like Neural Network}
\label{sec:ml:nn:inception}

In the effort to find a better architecture, we took inspiration from Google's winning \cnn in the annual \href{https://image-net.org/challenges/LSVRC/}{\emph{ImageNet challenge}} in 2014~\cite{Szegedy:2015:GoingDeeperConvolutions, Szegedy:2016:RethinkingInceptionArchitecture, Szegedy:2016:Inceptionv4InceptionresnetImpact}.
The architecture in its original presentation uses \emph{inception} modules in which separate $1 \times 1$, $3 \times 3$ and $5 \times 5$ convolutions are performed side by side (together with \emph{max pooling} operations) before recombining the outputs.
The modules are then repeated until the output layer is reached.
This has two evident advantages: users can avoid taking a completely arbitrary decision on the type of convolution to use, since the network will take care of it by tuning the weights, and the number of parameters is extremely restricted as the network can learn complicated functions using fewer layers.
As a consequence the architecture of such models can be made very deep while keeping the number of parameters contained, thus being able to learn very difficult representations of the input and producing accurate predictions.
Moreover while the training phase might become very long due to the complicated convolutional operations, the small number of parameters is such that predictions can be generated in a very small amount of time, making inception-like models extremely appropriate whenever quick predictions are necessary.
Another advantage of the architecture is the presence of different kernel sizes inside each module: the network automatically learns features at different scales and different positions, thus leveraging the advantages of a deep architecture with the ability to learn different representations at the same time and compare them.

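Schematically, a single module of this kind could be sketched as follows in a Keras-like framework; the filter counts and the pooling window are purely illustrative assumptions.
\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import layers


def inception_module(x, filters=32):
    """One GoogLeNet-style module: parallel convolutions plus max pooling."""
    branch1 = layers.Conv2D(filters, (1, 1), padding="same", activation="relu")(x)
    branch3 = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    branch5 = layers.Conv2D(filters, (5, 5), padding="same", activation="relu")(x)
    pool = layers.MaxPooling2D((3, 3), strides=(1, 1), padding="same")(x)
    # recombine the parallel branches along the filter (channel) dimension
    return layers.Concatenate(axis=-1)([branch1, branch3, branch5, pool])
\end{verbatim}
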
\paragraph{Model}

In~\Cref{fig:nn:inception} we show a schematic of our implementation.
Differently from the image classification task, we drop the pooling operation and implement two side-by-side convolutions: one over rows ($12 \times 1$ kernel for the original dataset, $15 \times 1$ for the favourable) and one over columns ($1 \times 15$ and $1 \times 18$ respectively).\footnotemark{}
\footnotetext{%
Pooling operations are used to shrink the size of the input.
Similar to convolutions, they use a window of a given size to scan the input and select particular values inside.
For instance, we could select the average value inside the small portion selected, performing an \emph{average pooling} operation, or the maximum value, a \emph{max pooling} operation.
This usually improves image classification and object detection tasks as it can be used to sharpen edges and borders.
}%
We use \texttt{same} as the padding option.
The outputs of the convolutions are then concatenated along the filter dimension before repeating the ``inception'' module.
The results from the last module are directly connected to the output layer through a flatten layer.
In both datasets we use batch normalisation layers (with momentum $0.99$) after each concatenation layer and a dropout layer (with rate $0.2$) before the \fc network.\footnotemark{}
\footnotetext{%
The position of the batch normalisation is extremely important as the parameters computed by such a layer directly influence the following batch.
We however opted to wait for the scan over rows and columns to finish before normalising the outcome, to avoid biasing the resulting activation function.
}%

For both \hodge{1}{1} and \hodge{2}{1} (in both datasets) we used 3 modules made of \numlist{32;64;32} filters for the first Hodge number, and \numlist{128;128;64} filters for the second.
We also included $\ell_1$ and $\ell_2$ regularisation of magnitude \num{e-4} in all cases.
The number of parameters was thus restricted to \num{2.3e5} for \hodge{1}{1} in the original dataset and \num{2.9e5} in the favourable set, and \num{1.1e6} for \hodge{2}{1} in the original dataset and \num{1.4e6} in the favourable dataset.
In all cases the number of parameters has decreased by a significant amount: in the case of \hodge{1}{1} they are roughly $\frac{1}{3}$ of the parameters used in the classical \cnn and around $\frac{1}{6}$ of those used in the \fc network.
During training we used the \emph{Adam} optimiser with an initial learning rate of $10^{-3}$ and a batch size of $32$.
The callbacks helped to keep the training time (without optimisation) under 5 hours for each Hodge number in each dataset.

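For concreteness, a minimal Keras-style sketch consistent with the description above (here for \hodge{1}{1} on the original dataset) could look as follows; the mean squared error loss, the number of epochs and the callback settings in the commented training call are assumptions for illustration, not the exact configuration used in the analysis.
\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import layers, regularizers


def inception_block(x, filters):
    """Parallel row/column convolutions, concatenation, batch normalisation."""
    reg = regularizers.l1_l2(l1=1e-4, l2=1e-4)
    rows = layers.Conv2D(filters, (12, 1), padding="same", activation="relu",
                         kernel_regularizer=reg)(x)
    cols = layers.Conv2D(filters, (1, 15), padding="same", activation="relu",
                         kernel_regularizer=reg)(x)
    x = layers.Concatenate(axis=-1)([rows, cols])
    return layers.BatchNormalization(momentum=0.99)(x)


inputs = layers.Input(shape=(12, 15, 1))     # original dataset configuration matrix
x = inputs
for filters in (32, 64, 32):                 # three modules for h^{1,1}
    x = inception_block(x, filters)
x = layers.Dropout(0.2)(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation="relu")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="mse")                    # assumed loss, for illustration
# model.fit(X_train, y_train, batch_size=32, epochs=300,
#           validation_data=(X_val, y_val),
#           callbacks=[tf.keras.callbacks.EarlyStopping(
#               patience=50, restore_best_weights=True)])
\end{verbatim}
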
\begin{figure}[tbp]
\centering
\includegraphics[width=0.9\linewidth]{img/icnn}
\caption{%
In each concatenation module (here shown for the ``old'' dataset) we operate with separate convolution operations over rows and columns, then concatenate the results.
The overall architecture is composed of 3 ``inception'' modules made by two separate convolutions, a concatenation layer and a batch normalisation layer (strictly in this order), followed by a dropout layer, a flatten layer and the output layer with ReLU activation (in this order).
}
\label{fig:nn:inception}
\end{figure}

\paragraph{Results}

With these architectures we were able to achieve more than \SI{99}{\percent} of accuracy for \hodge{1}{1} in the test set (same for the development set) and \SI{50}{\percent} of accuracy for \hodge{2}{1} (a slightly smaller value for the development set).
We report the results in~\Cref{tab:res:ann}.

We therefore increased the accuracy for both Hodge numbers (especially \hodge{2}{1}) compared to what a simple sequential network can achieve, while at the same time significantly reducing the number of parameters of the network.\footnotemark{}
\footnotetext{%
In an attempt to improve the results for \hodge{2}{1} even further, we also considered first predicting $\ln( 1 + \hodge{2}{1} )$ and then transforming it back.
However, the predictions dropped by almost \SI{10}{\percent} in accuracy even using the ``inception'' network: the network seems to approximate the results quite well (not better nor worse than when predicting \hodge{2}{1} directly), but the subsequent exponentiation drives predictions and true values apart.
Choosing a correct rounding strategy then becomes almost impossible.
}%
This increases the robustness of the method and its generalisation properties.

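Concretely, the transformation mentioned in the footnote amounts to encoding the label and decoding the prediction as in the short sketch below; the use of \texttt{numpy.rint} for the final rounding is only one of the possible strategies discussed.
\begin{verbatim}
import numpy as np


def encode_target(h21):
    return np.log1p(h21)               # train on ln(1 + h21)


def decode_prediction(y_pred):
    return np.rint(np.expm1(y_pred))   # exp(y) - 1, then round to an integer
\end{verbatim}
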
In~\Cref{fig:nn:inception_errors} we show the distribution of the residuals and their scatter plot.
The distribution of the errors does not present pathological behaviour and the variance of the residuals is well distributed over the predictions.

In fact this neural network is much more powerful than the previous networks we considered, as can be seen by studying the learning curves in~\Cref{fig:lc:inception}.
When predicting only \hodge{1}{1} it surpasses \SI{97}{\percent} accuracy using only \SI{30}{\percent} of the data for training.
While it seems that the predictions suffer when using a single network for both Hodge numbers, this remains much better than any other algorithm.
It may seem counter-intuitive that convolutions work well on this data, since the data is not translation or rotation invariant but only permutation invariant.
However convolution alone is not sufficient to ensure invariance under these transformations: it must be supplemented with pooling operations~\cite{Bengio:2017:DeepLearning}, which we do not use.
Moreover convolution layers do more than just taking translation properties into account: they build highly complicated combinations of the inputs and share weights among components, which allows them to find subtler patterns than standard fully connected layers.
This network is studied in more detail in~\cite{Erbin:2020:InceptionNeuralNetwork}.

\begin{figure}[tbp]
\centering
\begin{subfigure}[c]{0.45\linewidth}
\centering

@@ -1251,155 +1249,139 @@ This network is more studied in more details in~\cite{Erbin:2020:InceptionNeural

\includegraphics[width=\linewidth, trim={0 0 6in 0}, clip]{img/loss-lr_icnn_h21_orig}
\caption{Loss of \hodge{2}{1}.}
\end{subfigure}
\caption{%
The loss functions of the ``inception'' network for \hodge{1}{1} and \hodge{2}{1} in the original dataset show that the number of epochs required for training is definitely larger than for simpler architectures, despite the reduced number of parameters.
}
\label{fig:cnn:inception-loss}
\end{figure}

\begin{figure}[tbp]
\centering
\begin{subfigure}[c]{\linewidth}
\centering
\includegraphics[width=0.8\linewidth]{img/errors_icnn_h11_orig}
\caption{Residuals of \hodge{1}{1}.}
\end{subfigure}
\hfill
\begin{subfigure}[c]{\linewidth}
\centering
\includegraphics[width=0.8\linewidth]{img/errors_icnn_h21_orig}
\caption{Residuals of \hodge{2}{1}.}
\end{subfigure}
\caption{%
Histograms of the residual errors and residual plots of the Inception network.
}
\label{fig:nn:inception_errors}
\end{figure}

\begin{figure}[tbp]
\centering
\begin{subfigure}[c]{0.45\linewidth}
\centering
\includegraphics[width=\linewidth]{img/inc_nn_learning_curve}
\caption{predicting both \hodge{1}{1} and \hodge{2}{1}}
\end{subfigure}
\hfill
\begin{subfigure}[c]{0.45\linewidth}
\centering
\includegraphics[width=\linewidth]{img/inc_nn_learning_curve_h11}
\caption{predicting \hodge{1}{1} only}
\end{subfigure}
\caption{Learning curves for the Inception neural network (original dataset).}
\label{fig:lc:inception}
\end{figure}

\begin{table}[htb]
\centering
\begin{tabular}{@{}ccccccc@{}}
\toprule
& \multicolumn{2}{c}{\textbf{DenseNet}}
& \multicolumn{2}{c}{\textbf{classic ConvNet}}
& \multicolumn{2}{c}{\textbf{inception ConvNet}}
\\
& \emph{old} & \emph{fav.}
& \emph{old} & \emph{fav.}
& \emph{old} & \emph{fav.}
\\
\midrule
\hodge{1}{1}
& \SI{77}{\percent} & \SI{97}{\percent}
& \SI{94}{\percent} & \SI{99}{\percent}
& \SI{99}{\percent} & \SI{99}{\percent}
\\
\hodge{2}{1}
& - & -
& \SI{36}{\percent} & \SI{31}{\percent}
& \SI{50}{\percent} & \SI{48}{\percent}
\\
\bottomrule
\end{tabular}
\caption{%
Accuracy using \emph{rint} rounding on the predictions of the ANNs on \hodge{1}{1} and \hodge{2}{1} on the test set.
}
\label{tab:res:ann}
\end{table}

\subsubsection{Boosting the Inception-like Model}

To further improve the accuracy of \hodge{2}{1} we modify the network by adding engineered features as auxiliary inputs.
This can be done by adding inputs to the inception neural network and merging the different branches at different stages.
There are two possibilities to train such a network: train the whole network directly, or train the inception network alone, then freeze its weights and connect it to the additional inputs, training only the new layers.
We found that the architectures we tried did not improve the accuracy, but we briefly describe our attempts for completeness.

We focused in particular on the number of projective spaces, the vector of dimensions of the projective spaces and the vector of dimensions of the principal cohomology group, predicting \hodge{1}{1} and \hodge{2}{1} at the same time.
The core of the neural network is the Inception network described earlier in~\Cref{sec:ml:nn:inception}.
The engineered features are processed using fully connected layers and merged with the predictions from the Inception branch using a concatenation layer.
Obviously output layers for \hodge{1}{1} and \hodge{2}{1} can be located on different branches, which allows for different processing of the features.

As mentioned earlier, a possible approach is to first train the Inception branch alone, before freezing its weights and connecting it to the rest of the network.
This can prevent spoiling the already good predictions and speed up the new learning process.
This is a common technique called \emph{transfer learning}: we can use a model previously trained on a slightly different task and use its weights as part of the new architecture.

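A minimal sketch of this transfer learning setup, assuming a Keras-like framework, could read as follows; the function name, the sizes of the fully connected layers and the single merged branch feeding both outputs are illustrative assumptions rather than the exact architecture we used.
\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import layers


def add_engineered_inputs(inception_model, n_features):
    """Freeze a pre-trained inception branch and merge engineered features."""
    inception_model.trainable = False   # transfer learning: keep learned weights

    features = layers.Input(shape=(n_features,), name="engineered_features")
    f = layers.Dense(64, activation="relu")(features)

    # concatenate the frozen predictions with the processed features
    merged = layers.Concatenate()([inception_model.output, f])
    h11 = layers.Dense(1, activation="relu", name="h11")(merged)
    h21 = layers.Dense(1, activation="relu", name="h21")(merged)

    return tf.keras.Model([inception_model.input, features], [h11, h21])
\end{verbatim}
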
Our trials involved shallow fully connected layers ($1$ to $3$ layers with $10$ to $150$ units) between the engineered features and after the concatenation layer.
Since the \eda analysis in~\Cref{sec:data:eda} shows a correlation between both Hodge numbers, we tried architectures where the result for \hodge{1}{1} is used to predict \hodge{2}{1}.

For the training phase we also tried an alternative to the canonical choice of optimising the sum of the losses.
We first train the network and stop the process when the validation loss for \hodge{1}{1} no longer improves, load back the best weights and save the results, then keep training and stop when the loss for \hodge{2}{1} reaches a plateau.

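Assuming a Keras model with two named outputs, this two-stage schedule can be sketched with two early stopping callbacks; the monitored quantities, the patience values and the commented training calls are assumptions for illustration.
\begin{verbatim}
import tensorflow as tf

# Monitor names follow the Keras convention "val_<output name>_loss" for a
# model whose outputs are named "h11" and "h21"; patience is an assumption.
stop_h11 = tf.keras.callbacks.EarlyStopping(monitor="val_h11_loss",
                                            patience=30,
                                            restore_best_weights=True)
stop_h21 = tf.keras.callbacks.EarlyStopping(monitor="val_h21_loss",
                                            patience=30,
                                            restore_best_weights=True)

# Stage 1: stop when the validation loss for h11 no longer improves.
# model.fit(X, y, validation_split=0.1, epochs=1000, callbacks=[stop_h11])
# model.save_weights("stage1.weights.h5")   # keep the best weights for h11

# Stage 2: keep training until the loss for h21 reaches a plateau.
# model.fit(X, y, validation_split=0.1, epochs=1000, callbacks=[stop_h21])
\end{verbatim}
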
With this setup we were able to slightly improve the predictions of \hodge{1}{1} in the original dataset, reaching almost \SI{100}{\percent} of accuracy, while the favourable dataset stayed at around \SI{99}{\percent} of accuracy.
The few missed predictions (\num{4} manifolds out of \num{786} in the test set) lie in very peculiar regions of the distribution of the Hodge number.
For \hodge{2}{1} no improvement was observed.

\subsection{Ensemble Learning: Stacking}

We conclude the \ml analysis by describing a method very popular in \ml competitions: ensembling.
This consists in taking several \ml algorithms and combining the predictions of each individual model to obtain more precise predictions.
Using this technique it is possible to decrease the variance and improve generalisation by compensating the weaknesses of some algorithms with the strengths of others.
Indeed the idea is to put together algorithms which perform best in different zones of the label distribution in order to combine them into an algorithm better than any individual component.

The simplest such algorithm is \emph{stacking}, whose principle is summarised in~\Cref{fig:stack:def}.
First the original training set is split into two parts (not necessarily even).
Second a certain number of \emph{first-level learners} is trained over the first split and used to generate predictions over the second split.
Third a ``meta learner'' is trained on the second split to combine the predictions from the first-level learners.
Predictions for the test set are obtained by applying both levels of models one after the other.

We have selected the following models for the first level: linear regression, \svm with the Gaussian kernel, the random forest and the ``inception'' neural network.
The meta-learner is a simple linear regression with $\ell_1$ regularisation (Lasso).
The motivation for this choice of first-level algorithms is that stacking works best with a group of algorithms whose ways of working are as diverse as possible.

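Leaving aside the neural network and all hyper-parameter optimisation, the principle can be sketched with \texttt{scikit-learn} estimators as follows; the default hyper-parameters and the even split are placeholders.
\begin{verbatim}
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor


def stack_predict(X_train, y_train, X_test):
    """Two-level stacking: first-level learners on one split, Lasso on the other."""
    X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.5)

    first_level = [ElasticNet(), SVR(kernel="rbf"), RandomForestRegressor()]
    for model in first_level:
        model.fit(X1, y1)                      # first split: fit the learners

    # predictions on the second split become the meta-learner's features
    meta_train = np.column_stack([m.predict(X2) for m in first_level])
    meta = Lasso().fit(meta_train, y2)

    meta_test = np.column_stack([m.predict(X_test) for m in first_level])
    return meta.predict(meta_test)
\end{verbatim}
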
Also in this case we use a cross-validation strategy with 5 splits for each level of the training: from the \SI{90}{\percent} of samples forming the total training set, we split into two halves each containing \SI{45}{\percent} of the total samples and then use 5 splits to grade the algorithms (thus using \SI{9}{\percent} of each split for cross-validation at each iteration), together with Bayes optimisation for all algorithms but the \ann (50 iterations for elastic net, \svm and lasso, and 25 for the random forests).
The \ann was trained using a holdout validation set containing the same number of samples as each cross-validation fold, namely \SI{9}{\percent} of the total set.
The accuracy is then computed as usual using \texttt{numpy.rint} for \svm, neural networks, the meta learner and \hodge{1}{1} in the original dataset in general, and \texttt{numpy.floor} in the other cases.

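In other words, the accuracy is simply the fraction of predictions matching the true value after rounding, as in the short sketch below.
\begin{verbatim}
import numpy as np


def accuracy(y_true, y_pred, rounding=np.rint):
    """Fraction of predictions equal to the label after rounding
    (np.rint or np.floor, depending on the case listed above)."""
    return np.mean(rounding(y_pred) == y_true)
\end{verbatim}
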
In~\Cref{tab:res:stack}, we show the accuracy of the ensemble learning.
We notice that accuracy improves slightly only for \hodge{2}{1} (original dataset) compared to the first-level learners.
However this is much lower than what has been achieved in~\Cref{sec:ml:nn:inception}.
The reason is that the learning suffers from the reduced size of the training set.
Another reason is that the different algorithms may perform similarly well in the same regions.

\begin{figure}[tbp]
\centering
\includegraphics[width=0.65\linewidth]{img/stacking}
\caption{Stacking ensemble learning with two-level learning.}
\label{fig:stack:def}
\end{figure}

\begin{table}[tbp]
\centering
\begin{tabular}{@{}cccccc@{}}
\toprule
&
& \multicolumn{2}{c}{\hodge{1}{1}}

@@ -1411,7 +1393,7 @@ Another reason is that the different algorithms may perform similarly well in th

\\
\midrule
\multirow{4}{*}{\emph{1st level}}
& \textsc{en}
& \SI{65}{\percent} & \SI{100}{\percent}
& \SI{19}{\percent} & \SI{19}{\percent}
\\

@@ -1419,11 +1401,11 @@ Another reason is that the different algorithms may perform similarly well in th

& \SI{70}{\percent} & \SI{100}{\percent}
& \SI{30}{\percent} & \SI{34}{\percent}
\\
& \textsc{rf}
& \SI{61}{\percent} & \SI{98}{\percent}
& \SI{18}{\percent} & \SI{24}{\percent}
\\
& \ann
& \SI{98}{\percent} & \SI{98}{\percent}
& \SI{33}{\percent} & \SI{30}{\percent}
\\

@@ -1434,9 +1416,12 @@ Another reason is that the different algorithms may perform similarly well in th

& \SI{36}{\percent} & \SI{33}{\percent}
\\
\bottomrule
\end{tabular}
\caption{%
Accuracy of the first and second level predictions of the stacking ensemble for elastic net regression (\textsc{en}), support vector with \texttt{rbf} kernel (\svm), random forest (\textsc{rf}) and the artificial neural network (\ann) as first level learners and lasso regression as meta learner.
}
\label{tab:res:stack}
\end{table}

% vim: ft=tex
18
thesis.tex
@@ -7,7 +7,7 @@

\usepackage{import}

\author{Riccardo Finotello}
\title{String Theory and Phenomenology: \\ Theoretical and Computational Aspects From D-branes to Artificial Intelligence}
\advisor{Igor Pesando}
\institution{Università degli Studi di Torino}
\school{Scuola di Dottorato}

@@ -20,12 +20,13 @@

\fancyhead[R]{\rightmark}

\hypersetup{%
pdftitle={String Theory and Phenomenology: Theoretical and Computational Aspects From D-branes to Artificial Intelligence},
pdfauthor={Riccardo Finotello}
}

%---- abbreviations
\newcommand{\ml}{\textsc{ml}\xspace}
\newcommand{\nn}{\textsc{nn}\xspace}
\newcommand{\fc}{\textsc{fc}\xspace}
\newcommand{\eda}{\textsc{eda}\xspace}
\newcommand{\pca}{\textsc{pca}\xspace}
\newcommand{\svm}{\textsc{svm}\xspace}

@@ -131,6 +132,8 @@

\input{sec/part1/dbranes.tex}
\section{Fermions With Boundary Defects}
\input{sec/part1/fermions.tex}
\section{Summary and Conclusion}
\input{sec/part1/conclusion.tex}

%---- COSMOLOGY
\thesispart{Cosmological Backgrounds and Divergences}

@@ -138,6 +141,8 @@

\input{sec/part2/introduction.tex}
\section{Time Dependent Orbifolds}
\input{sec/part2/divergences.tex}
\section{Summary and Conclusion}
\input{sec/part2/conclusion.tex}

%---- DEEP LEARNING
\thesispart{Deep Learning the Geometry of String Theory}

@@ -146,6 +151,10 @@

\input{sec/part3/introduction.tex}
\section{Machine and Deep Learning for CICY Manifolds}
\input{sec/part3/ml.tex}
\section{Summary and Conclusion}
\label{sec:conclusion}
\input{sec/part3/conclusion.tex}


%---- APPENDIX
\thesispart{Appendix}

@@ -170,10 +179,13 @@

\label{sec:NO_full_TTS}
\input{sec/app/massive.tex}

\section{Machine Learning Algorithms}
\label{app:ml-algo}
\input{sec/app/ml.tex}


%---- BIBLIOGRAPHY
\cleardoubleplainpage{}
\small
\printbibliography[heading=bibintoc]

\end{document}