Introduction


When confronted with a machine learning task, some of the first questions a data scientist will ponder are: What is our instance space? What is our label space? How do we define success? Are there computational issues associated with the task or data? How can we achieve generalization without overfitting? What kind of features are we going to use? After answering these questions, a data scientist or team will settle on a learning algorithm and an appropriate loss function and evaluation metric. Unfortunately, in my experience, what is missing from the initial discussion of a data science task is a seemingly trivial discussion of the hypothesis space. Since it is a theoretical framework rather than a tangible manifestation of a learning algorithm or code base, it is rarely discussed or included in a project work plan. Ultimately, however, a thorough discussion of the hypothesis space and its constraints is a vital component of framing a modeling task, whether machine learning or deep learning, and of driving a project toward successful and reproducible results.
When learning a model $g(x)$, we must choose what kind of function we expect $g(x)$ to be. For example, consider an input with four binary features $(x = \left[ x_1 x_2 x_3 x_4 \right],\ x_i \in \left\{ 0, 1 \right\})$ and an unknown function $f(x)$ that returns $y$. With four binary features there are $2^4 = 16$ possible instances, so in the binary classification task there are $2^{16}$ possible functions to describe our data. We are therefore confronted with two issues. First, without restrictions on the set of functions $g(x)$, learning is not feasible. Second, even if it were feasible, assuming there is a deterministic target hypothesis $g_0$ relating $x$ to $y$, choosing a $g$ from a large enough space of functions would certainly achieve very good performance on the training data. In fact, if we wanted to, we could always make the training error exactly zero. A simple hypothesis would do just that:
$$ g(x) = \begin{cases} +1 & \exists\, i \in \left\{ 1, 2, \ldots, m \right\} : x_i = x,\ y_i = +1 \\ -1 & \text{otherwise} \end{cases} $$
Obviously, however, this hypothesis would not generalize to data outside of our training set. Rather, we just created a model that simply memorizes labels. So all we ended up doing is minimizing the empirical error at the expense of the true error, which is called overfitting. What a good model does is minimize the true error $R_D(g) = \mathbb{E}_D\left[ L(g(x), y) \right]$, where the expectation is taken over $(x, y)$ pairs drawn from some distribution $D$.
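To make the danger concrete, here is a minimal sketch of the memorization hypothesis in Python; the class name `MemorizeLabels` and the setup are illustrative, not from any library. It achieves zero empirical error on anything it has seen and is useless everywhere else.

```python
import numpy as np
from itertools import product

# A minimal sketch of the label-memorization hypothesis above.
class MemorizeLabels:
    def fit(self, X, y):
        # Remember every training instance together with its label.
        self.memory = {tuple(x): label for x, label in zip(X, y)}
        return self

    def predict(self, X):
        # Return the memorized label if x was seen in training; -1 otherwise.
        return np.array([self.memory.get(tuple(x), -1) for x in X])

X_all = np.array(list(product([0, 1], repeat=4)))   # all 16 possible instances
y_all = np.random.default_rng(0).choice([-1, 1], size=16)

model = MemorizeLabels().fit(X_all[:8], y_all[:8])      # train on half
print((model.predict(X_all[:8]) == y_all[:8]).mean())   # 1.0: zero empirical error
print((model.predict(X_all[8:]) == y_all[8:]).mean())   # roughly chance on unseen instances
```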

The Formal Model

The above considerations give rise to the formal model of our learning problem. Let $X$ be the input space, $Y$ be the output space, $D$ an unknown probability distribution on $X \times Y$, and let our hypothesis space $G$ be a class of functions $g: X \to Y$. For many practical algorithms, it is often observed that the true error is not too far from the empirical error. Real algorithms are not as ill-behaved as the label-memorization algorithm above, which was an extreme example. The key observation is that the discrepancy between $R_D$ and $R_{emp}$ is related to the size of $G$, that is, how flexible our model is in terms of the size and shape of the hypothesis space.
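As a quick illustration of this relationship (a simulation of my own, not part of the formal model), fitting polynomials of increasing degree to a small noisy sample shows the gap between $R_{emp}$ and an estimate of $R_D$ widening as the hypothesis space grows:

```python
import numpy as np

# Illustrative simulation: the hypothesis space is indexed by polynomial
# degree, and a large held-out sample stands in for the true error under D.
rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.3 * rng.normal(size=n)   # noisy target

x_train, y_train = sample(15)
x_test, y_test = sample(1000)

for degree in (1, 3, 9, 14):
    coeffs = np.polyfit(x_train, y_train, degree)        # fit within this class
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # As degree grows, R_emp shrinks while the estimate of R_D blows up.
    print(f"degree {degree:2d}: R_emp={train_mse:.3f}  R_D~{test_mse:.3f}")
```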
What made the label-memorization algorithm so bad was that the class of possible hypotheses was so huge, permitting us to fit anything we wanted. That is an invitation for disastrous overfitting. Learning algorithms used in practice usually have access to a much more limited set of hypotheses (such as linear discriminators in the case of the perceptron, the SVM, etc.), so they have less opportunity to overfit. One way to do this is by explicitly restricting the hypothesis space $G$ to “simple” hypotheses, as in Structural Risk Minimization. Another way is to introduce a penalty functional $\Omega$ that measures the complexity of each hypothesis $f$, and to minimize the sum $$ R_{reg}(f) = R_{emp}(f) + \Omega(f) $$
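For a concrete instance of this recipe, consider a linear hypothesis class with $\Omega(w) = \lambda \left\| w \right\|^2$, which gives ridge regression. The sketch below is my own illustration (names and scaling conventions are choices, not canonical); it minimizes the regularized risk in closed form:

```python
import numpy as np

# Minimizing R_reg = R_emp + Omega for a linear hypothesis class with
# squared loss and Omega(w) = lam * ||w||^2 (ridge regression). Under this
# scaling the closed-form minimizer solves (X^T X / m + lam*I) w = X^T y / m.
def ridge_fit(X, y, lam):
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

for lam in (0.0, 0.1, 10.0):
    # Larger lam penalizes complex (large-norm) hypotheses more heavily,
    # shrinking the learned weights toward zero.
    print(lam, np.round(ridge_fit(X, y, lam), 2))
```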
We can now begin to construct our hypothesis class $G$. Naturally, we want $G$ to be a linear function space in the sense that for any $f \in G$ and any real number $\lambda$, $\lambda f$ is in $G$, and for any $f_1, f_2 \in G$ the sum $f_1 + f_2$ is in $G$. We also want $G$ to be related to the regularizer $\Omega$, so we define a norm on $G$ and set $\Omega\left[ f \right] = \left\| f \right\|^2$. We further require that the norm be derived from an inner product.
This leads us to the notion of a Hilbert space, a complete inner product space, which will allow us to make the connection between the abstract structure of our hypothesis space $G$ and what the elements of $G$ actually are (discriminant functions or regressors). Within a Hilbert space, any bounded linear functional $L$ has a corresponding element $f_L \in G$ such that $L(f) = \left\langle f_L, f \right\rangle$; this is the Riesz representation theorem, and evaluation at a point is exactly such a functional. Therefore, for any $x \in X$ we have a special function $k_x$ in our Hilbert space, called the representer of $x$, satisfying $f(x) = \left\langle k_x, f \right\rangle\ \forall f \in G$. Ultimately, we can rewrite our entire regularized risk minimization problem as:
$$ \hat{f} = \underset{f \in G}{\arg\min}\left[ \frac{1}{m}\sum_{i=1}^{m} L\left(\left\langle k_{x_i}, f \right\rangle, y_i\right) + \left\langle f, f \right\rangle \right] $$
The hypothesis $f$ only features in this equation in the form of inner products with other functions in the hypothesis space $G$. Once we know the form of the inner product and what the $k_x$ are, we can do everything we want with simple math. Moreover, anything outside the span of $\left\{ k_x \right\}_{x \in X}$ is uninteresting, since it does not affect what $f \in G$ evaluates to at any point of the input space, so we can leave it out of the hypothesis space altogether. The result of the whole construction is driven entirely by the inner product $k(x, x') = \left\langle k_x, k_{x'} \right\rangle$. This is called the kernel.
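As a concrete example (my choice of kernel, not mandated by the theory), the Gaussian RBF kernel $k(x, x') = \exp(-\gamma \left\| x - x' \right\|^2)$ can be computed directly, and its Gram matrix of pairwise inner products is all the algorithm ever needs:

```python
import numpy as np

# The whole construction is driven by k(x, x') = <k_x, k_x'>. Here is the
# Gaussian (RBF) kernel, a standard positive definite choice.
def rbf_kernel(X1, X2, gamma=1.0):
    # Pairwise squared Euclidean distances, then exponentiate.
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel(X, X)                  # Gram matrix: K[i, j] = k(x_i, x_j)
print(K.shape, np.allclose(K, K.T))   # (5, 5) True: symmetric, as an inner product must be
```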
In fact, we can reverse the whole procedure and construct $G$ starting from the kernel. Given a positive definite function $k$ on the input space $X$, we define $G$ to be the minimal complete space of functions that includes all $\left\{ k_x \right\}_{x \in X}$ and that has an inner product defined in the fashion above. This defines $G$ uniquely, and formally $G$ is called the Reproducing Kernel Hilbert Space associated with the kernel $k$. Finally, we have reduced the learning problem to that of defining the loss function $L$ and the kernel $k$.
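A quick numerical sanity check (not a proof) of the positive definiteness the construction requires: every Gram matrix the kernel produces should have nonnegative eigenvalues, which is what lets $k$ induce a valid inner product on $G$.

```python
import numpy as np

# Recompute an RBF Gram matrix on random points and verify its spectrum is
# nonnegative up to floating-point error, consistent with positive definiteness.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True
```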
One more interesting point. By looking at our new formulation of the regularized risk minimization problem above, it is clear that $\hat{f}$ is going to lie in the span of the representers of the training data $k_{x_1}, k_{x_2}, \ldots, k_{x_m}$. We can tell because the loss term only depends on the inner products of $f$ with $k_{x_1}, k_{x_2}, \ldots, k_{x_m}$, while the regularization term penalizes $f$ in all directions. If $f$ has any component orthogonal to the subspace spanned by $k_{x_1}, k_{x_2}, \ldots, k_{x_m}$, the loss term is not sensitive to that component, but the regularization term will still penalize it. Hence, the optimal $f$ will be entirely contained in the subspace spanned by the representers. This is the meaning of the representer theorem, and it means the optimal hypothesis $\hat{f}$ can be expressed as $$ \hat{f} = b + \sum_{i=1}^{m}\alpha_i k_{x_i} $$ for some real coefficients $\alpha_1, \alpha_2, \ldots, \alpha_m$ and bias $b$. Or, as it is better known, $$ \hat{f}(x) = b + \sum_{i=1}^{m}\alpha_i k(x_i, x) $$
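To close the loop, here is the representer theorem in action. For the squared loss with the RBF kernel, the coefficients have a closed form, $(K + \lambda I)\alpha = y$ under a common scaling convention, which is kernel ridge regression. This is a sketch of my own (the bias $b$ is dropped for simplicity, and all names are illustrative):

```python
import numpy as np

def rbf(X1, X2, gamma=1.0):
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=30)

# Fit: the optimal hypothesis lives in the span of the training representers,
# so we only need the m coefficients alpha, found by solving (K + lam*I) alpha = y.
lam = 0.1
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict: f_hat(x) = sum_i alpha_i * k(x_i, x), exactly the expansion above.
X_new = np.array([[0.0], [0.5]])
print(np.round(rbf(X_new, X) @ alpha, 3))
```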