Introduction to Machine Learning

Week 1

CONTEXT:

We have a set of objects $\mathcal{O}$ in the real world, their equivalent in the model world, the model set $\mathcal{M}$, and a result set $\mathcal{R}$.

  • $a: \mathcal{O} \to \mathcal{M}$, $t: \mathcal{M} \to \mathcal{R}$, $p: \mathcal{O} \to \mathcal{R}$ → Our aim is to find $t$ such that $t \circ a = p$.
  • Denote a training set as a finite sequence/vector $D = (D_i) \in (\mathcal{M} \times \mathcal{R})^N =: D^*$.
  • We mostly ignore $\mathcal{O}, a, p$, so in a learning task we consider only $(\mathcal{M}, t, \mathcal{R})$.

DEF.

  • Supervised Learning: an algorithm $\mathcal{A}: D^* \to \{h: \mathcal{M} \to \mathcal{R}\}$ that uses training data $D \in D^*$ to generate a target map for our model (see the sketch after this list).
  • Unsupervised Learning: an algorithm $\mathcal{A}$ that generates labels by looking at unlabelled data $D \in \mathcal{M}^N$.
  • Reinforcement Learning: an iterative version of supervised learning, i.e. the algorithm $\mathcal{A}$ is initially given a random dataset $D_0 \in D^*$ and then uses the resulting target function $\mathcal{A}(D_0) =: h$ to generate the next dataset $D_1 \in D^*$.
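A minimal Python sketch of the supervised-learning signature $\mathcal{A}: D^* \to \{h: \mathcal{M} \to \mathcal{R}\}$, assuming models are feature tuples and labels are integers; the memorizing learner and its default label for unseen models are illustrative choices, not part of the notes.

```python
from typing import Callable, Hashable, List, Tuple

Model = Tuple[Hashable, ...]                 # an element of M (a feature tuple)
Label = int                                  # an element of R, here {0, 1}
TrainingData = List[Tuple[Model, Label]]     # D = (D_i) in (M x R)^N
Hypothesis = Callable[[Model], Label]        # h: M -> R

def memorizing_learner(data: TrainingData) -> Hypothesis:
    """A trivial supervised learner A: D* -> {h: M -> R}.

    It memorizes the training pairs and answers 0 for unseen models
    (the default label is an arbitrary illustrative choice)."""
    table = dict(data)

    def h(m: Model) -> Label:
        return table.get(m, 0)

    return h

# Usage: learn from two labelled models, then query the hypothesis.
D = [(("sunny", "warm"), 1), (("rainy", "cold"), 0)]
h = memorizing_learner(D)
print(h(("sunny", "warm")))  # 1 (seen in D)
print(h(("sunny", "cold")))  # 0 (unseen -> default)
```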

DEF. $\ker(f) := \{(x,y) \mid f(x) = f(y)\}$ for some map $f$
DEF. $\operatorname{ran}(f) := \{f(x) \mid x \in X\} = f(X)$, the range of $f$
DEF. $\operatorname{supp}(f) := \{m \in \mathcal{M} \mid f(m) \neq 0\}$ for some target map $f: \mathcal{M} \to \mathcal{R}$

DEF. A learning task $(\mathcal{M}, t, \mathcal{R})$ is a Classification Problem (CP) $:\Leftrightarrow$ $\operatorname{ran}(t)$ is finite.
↪ in particular when $\mathcal{R}$ is finite!
↪ called binary CP $:\Leftrightarrow$ $|\operatorname{ran}(t)| = 2$
↪ called multiclass CP $:\Leftrightarrow$ $|\operatorname{ran}(t)| > 2$

CONTEXT:

Consider from now on only binary CPs $(\mathcal{M}, t, \mathcal{R})$.

DEF. The model features of the CP are sets $M_1, \ldots, M_k$ such that $M_1 \times \cdots \times M_k = \mathcal{M}$.
↪ the model features are nominal $:\Leftrightarrow$ $M_1, \ldots, M_k$ are all finite ($\Leftrightarrow \mathcal{M}$ finite)

DEF. Any map $h: \mathcal{M} \to \mathcal{R}$ is a hypothesis.
↪ Because $t$ is unknown. (Remember that $t$ is the optimal target function with $t \circ a = p$, which we try to approximate with our learning algorithms.)

CONTEXT:

Let $(\mathcal{M}, t, \mathcal{R})$ be a binary CP ($\mathcal{R} = \{0,1\}$) with a supervised learning algorithm $\mathcal{A}: D^* \to \{h: \mathcal{M} \to \mathcal{R}\}$. Remember that $D = (D_i) \in (\mathcal{M} \times \mathcal{R})^N =: D^*$ is the notation for training data.

DEF. Hypothesis space $\mathcal{H}_\mathcal{A} := \mathcal{A}(D^*) = \{h: \mathcal{M} \to \mathcal{R} \mid \exists D \in D^*: \mathcal{A}(D) = h\}$
↪ All maps that can result from applying the algorithm $\mathcal{A}$ to some training data.

DEF. A hypothesis $h$ is said to fit training data $D \in D^*$ $:\Leftrightarrow$ $\forall i \in [N],\ D_i =: (m, r):\ h(m) = r$
↪ A hypothesis fits iff it at least agrees with every training example.
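A minimal Python sketch of the fit check, assuming models are feature tuples and labels are 0/1 as in this binary CP; the example hypothesis is an illustrative assumption.

```python
from typing import Callable, Hashable, List, Tuple

Model = Tuple[Hashable, ...]
TrainingData = List[Tuple[Model, int]]
Hypothesis = Callable[[Model], int]

def fits(h: Hypothesis, D: TrainingData) -> bool:
    """h fits D  :<=>  h(m) = r for every training pair (m, r) in D."""
    return all(h(m) == r for (m, r) in D)

# Usage: a hypothesis that labels a model 1 iff its first feature is "sunny".
h: Hypothesis = lambda m: 1 if m[0] == "sunny" else 0
D = [(("sunny", "warm"), 1), (("rainy", "cold"), 0)]
print(fits(h, D))  # True
```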

DEF. Version space $\mathcal{V}_\mathcal{A}(D) := \mathcal{H}_\mathcal{A} \setminus \{h: \mathcal{M} \to \mathcal{R} \mid h \text{ does not fit } D\}$, the image of $D$ under the map $\mathcal{V}_\mathcal{A}: D^* \to \mathcal{P}(\mathcal{H}_\mathcal{A})$

DEF. $D^+ := \{m_i \mid (m_i, r_i) = D_i,\ r_i = 1,\ i \in [N]\} \subseteq \mathcal{M}$ is the set of all positive examples,
$D^- := \{m_i \mid (m_i, r_i) = D_i,\ r_i = 0,\ i \in [N]\} \subseteq \mathcal{M}$ is the set of all negative examples.

DEF. $D_\mathcal{M} := \{m \mid (m, \_) = D_i \text{ for some } i \in [N]\}$ is the set of models in $\mathcal{M}$ that appear in the sequential training data $D$.
Analogously: $D_\mathcal{R} := \{r \mid (\_, r) = D_i,\ i \in [N]\}$

DEF. For $\mathcal{M}_{\setminus D} := \mathcal{M} \setminus D_\mathcal{M}$ and some hypothesis $h$, we have the sets of positively/negatively biased models under $h$:
$B_+ := \{m \in \mathcal{M}_{\setminus D} \mid h(m) = 1\}$
$B_- := \{m \in \mathcal{M}_{\setminus D} \mid h(m) = 0\}$
↪ $|B_+| + |B_-| = |\mathcal{M}_{\setminus D}|$
↪ $(B_+, B_-)$ is the inductive bias of $h$
↪ The inductive bias can be measured by $\frac{|B_+|}{|\mathcal{M}_{\setminus D}|} \in [0,1] \cap \mathbb{Q}$
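A small Python sketch that enumerates $\mathcal{M} \setminus D_\mathcal{M}$ for nominal features and computes the bias ratio $|B_+| / |\mathcal{M}_{\setminus D}|$; the concrete feature sets, training models and hypothesis are illustrative assumptions.

```python
from itertools import product

# Illustrative nominal model features: M = M_1 x M_2.
M1 = {"sunny", "rainy"}
M2 = {"warm", "cold"}
M = set(product(M1, M2))

# Models that appear in the training data D (labels are irrelevant here).
D_M = {("sunny", "warm"), ("rainy", "cold")}
M_without_D = M - D_M

# An illustrative hypothesis h: M -> {0, 1}.
h = lambda m: 1 if m[0] == "sunny" else 0

B_plus = {m for m in M_without_D if h(m) == 1}
B_minus = {m for m in M_without_D if h(m) == 0}

assert len(B_plus) + len(B_minus) == len(M_without_D)
print(len(B_plus) / len(M_without_D))  # bias ratio in [0, 1], here 0.5
```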

Week 2

CONTEXT:

Let $(\mathcal{M}, t, \mathcal{R})$ be a binary CP ($\mathcal{R} = \{0,1\}$) with nominal model features (remember: $\mathcal{M} = M_1 \times \cdots \times M_k$). In a binary CP, $\operatorname{supp}(f) = \{m \in \mathcal{M} \mid f(m) = 1\}$ is exactly the set of models evaluated positively by $f$.

DEF. The tuple $(\theta_1, \ldots, \theta_k) =: \theta$ is called a conjunctive clause $:\Leftrightarrow$ $\forall i \in [k]: \theta_i \in M_i \cup \{\star, \bot\}$
↪ obviously $\star, \bot \notin M_i$
↪ Elements of $M_i$ are called literals, $\star$ the wildcard, $\bot$ the contradiction

DEF. A conjunctive clause $\theta$ yields the induced hypothesis $h_\theta: \mathcal{M} \to \mathcal{R}$ with
$$h_\theta((m_1, \ldots, m_k)) := \begin{cases} 1 & \text{if } \theta_i \in \{m_i, \star\}\ \forall i \in [k] \\ 0 & \text{else} \end{cases}$$
DEF. We order hypotheses $h, \mu$ by $h \preceq \mu$ $:\Leftrightarrow$ $\operatorname{supp}(h) \subseteq \operatorname{supp}(\mu)$
↪ $h$ is more specific than $\mu$ (and $\mu$ more general than $h$)
↪ $h \prec \mu$ $:\Leftrightarrow$ $\operatorname{supp}(h) \subsetneq \operatorname{supp}(\mu)$
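A minimal Python sketch of the induced hypothesis $h_\theta$, encoding the wildcard $\star$ as "*" and the contradiction $\bot$ as "!" (both encodings are illustrative assumptions).

```python
from typing import Tuple

WILDCARD = "*"       # stands for the wildcard ★
CONTRADICTION = "!"  # stands for the contradiction ⊥

def h_theta(theta: Tuple[str, ...], m: Tuple[str, ...]) -> int:
    """Induced hypothesis: 1 iff every theta_i is the wildcard or equals m_i."""
    return int(all(t in (m_i, WILDCARD) for t, m_i in zip(theta, m)))

# Usage: the clause ("sunny", "*") accepts every sunny model.
print(h_theta(("sunny", "*"), ("sunny", "cold")))  # 1
print(h_theta(("sunny", "*"), ("rainy", "cold")))  # 0
print(h_theta(("!", "*"), ("sunny", "cold")))      # 0, ⊥ never matches
```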

Satz.

  • $\operatorname{supp}(h_{(\bot,\ldots,\bot)}) = \emptyset$ ($\Rightarrow$ for every hypothesis $h$: $h_{(\bot,\ldots,\bot)} \preceq h$)
  • $\operatorname{supp}(h_{(\star,\ldots,\star)}) = \mathcal{M}$ ($\Rightarrow$ for every hypothesis $h$: $h \preceq h_{(\star,\ldots,\star)}$)

Satz. The Find-S Algorithm yields the most specific conjunctive clause that fits the training data $D$ (a sketch follows the steps below):

  1. Start with $\theta_1 := (m_1, \ldots, m_k) = m$ where $(m, 1) = D_i$ is the first positive example of our training data $D$.
    If there are no positive examples, return $\theta_0 = (\bot, \ldots, \bot)$.
  2. Otherwise, run over all other positive examples $((m_1, \ldots, m_k), 1) = D_i$ iteratively. Let $d$ be the number of positive examples and call their models $m^+_1, \ldots, m^+_d$ (with $m^+_1 = m$ from step 1).
    Then, in each iteration $i \in [d] \setminus \{1\}$, "flip" every coordinate of $\theta_{i-1}$ that does NOT match the corresponding coordinate of the positive model $m^+_i$ to the wildcard $\star$, obtaining $\theta_i$.
  3. Result: $\theta_d$ induces a hypothesis $h_{\theta_d}$ that fits all positive examples and, if $D$ is realisable by a conjunctive clause, the entire training data $D$.
    ↪ A result like $\theta_d = (a, b, \star, c, \star, \star, \star)$ would mean that all positive models have $a$ at the first position, $b$ at the second and $c$ at the fourth. Also, at every other position, at least two positive models exist that have different entries.
    ↪ $\theta_d$ induces the most specific conjunctive-clause hypothesis that fits $D$.
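A Python sketch of Find-S under the same conventions ($\star$ as "*", $\bot$ as "!"; models as feature tuples, labels 0/1); the example data are illustrative.

```python
from typing import List, Tuple

WILDCARD = "*"       # ★
CONTRADICTION = "!"  # ⊥
Clause = Tuple[str, ...]
TrainingData = List[Tuple[Tuple[str, ...], int]]

def find_s(D: TrainingData, k: int) -> Clause:
    """Most specific conjunctive clause fitting all positive examples in D."""
    positives = [m for (m, r) in D if r == 1]
    if not positives:
        return tuple([CONTRADICTION] * k)      # theta_0 = (⊥, ..., ⊥)
    theta = list(positives[0])                 # theta_1 = first positive model
    for m in positives[1:]:                    # flip mismatching coordinates to ★
        theta = [t if t == m_i else WILDCARD for t, m_i in zip(theta, m)]
    return tuple(theta)

# Usage: both positives agree only on the first feature.
D = [(("sunny", "warm"), 1), (("rainy", "cold"), 0), (("sunny", "cold"), 1)]
print(find_s(D, k=2))  # ('sunny', '*')
```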

DEF. A disjunctive normal form (DNF) is a finite set $\Theta$ of conjunctive clauses.
↪ $\Theta \subseteq (M_1 \cup \{\star, \bot\}) \times \cdots \times (M_k \cup \{\star, \bot\})$

DEF. A DNF yields the induced hypothesis $h_\Theta: \mathcal{M} \to \mathcal{R}$ with
$$h_\Theta(m) := \begin{cases} 1 & \text{if } \exists \theta \in \Theta: h_\theta(m) = 1 \\ 0 & \text{else} \end{cases}$$
↪ We can filter for multiple patterns at once (or-combined).
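The DNF hypothesis is simply an or over the clause hypotheses; a short Python sketch (clause encoding as above, the example patterns are illustrative).

```python
def h_theta(theta, m):
    """Conjunctive-clause hypothesis as sketched earlier ("*" encodes ★)."""
    return int(all(t in (m_i, "*") for t, m_i in zip(theta, m)))

def h_Theta(Theta, m):
    """DNF hypothesis: 1 iff at least one conjunctive clause in Theta accepts m."""
    return int(any(h_theta(theta, m) for theta in Theta))

# Usage: accept sunny models OR cold models.
Theta = {("sunny", "*"), ("*", "cold")}
print(h_Theta(Theta, ("rainy", "cold")))  # 1
print(h_Theta(Theta, ("rainy", "warm")))  # 0
```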

DEF. Remember that $\mathcal{A}$ is an algorithm, then

  • $\mathcal{V}_\mathcal{A}^{\bot}(D) := \{h \in \mathcal{V}_\mathcal{A}(D) \mid \nexists \mu \in \mathcal{V}_\mathcal{A}(D): h \prec \mu\}$ is the set of maximally general hypotheses
  • $\mathcal{V}_\mathcal{A}^{\top}(D) := \{h \in \mathcal{V}_\mathcal{A}(D) \mid \nexists \mu \in \mathcal{V}_\mathcal{A}(D): \mu \prec h\}$ is the set of maximally specific hypotheses

Satz. The version space of $D$ under the algorithm $\mathcal{A}$ is uniquely determined by its general and specific boundaries $U := \mathcal{V}_\mathcal{A}^{\bot}(D)$ and $L := \mathcal{V}_\mathcal{A}^{\top}(D)$, that is,
$$\mathcal{V}_\mathcal{A}(D) = \{h \in \mathcal{H}_\mathcal{A} \mid \exists h_U \in U,\ h_L \in L: h_L \preceq h \preceq h_U\}$$

Satz. The Candidate Elimination Algorithm yields $L, U$, if they exist (a sketch of its core repair step follows the list below):

  1. Start with $L = \{\theta_\bot\}$, $U = \{\theta_\star\}$, where $\theta_\bot := (\bot, \ldots, \bot)$ and $\theta_\star := (\star, \ldots, \star)$.
  2. Run through the training data $D$; for each $(m, r)$ consider:
    1. If $r = 1$:
      1. Remove all $\theta \in U$ for which $h_\theta(m) \neq 1$.
      2. For each $\theta \in L$:
        1. If $h_\theta(m) = 1$ → ok. (The most specific hypothesis does not break on this training item; it fits it.)
        2. Else: minimally generalise $\theta$ so that $h_\theta$ fits this training item (use the Find-S step) → replace $\theta$ with the new generalised $\theta'$ in $L$.
    2. If $r = 0$:
      1. For each $\theta \in U$:
        1. If $h_\theta(m) = 0$ → ok. (Even the generalised versions reject this item; they are not too general.)
        2. Else: find all minimally specialised versions of $\theta$ such that each $\theta'$ now fits $m$ (that is, labels it 0) → replace $\theta$ with the new specialised versions $\theta_1, \ldots, \theta_n$ in $U$.
  3. Return $L, U$.
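A Python sketch of the minimal-generalisation repair used in step 2.1.2.2, i.e. the same coordinate-wise "flip to ★" operation as in Find-S; the encodings of $\star$/$\bot$ as "*"/"!" and the example models are illustrative.

```python
WILDCARD = "*"       # ★
CONTRADICTION = "!"  # ⊥

def minimally_generalise(theta, m):
    """Smallest generalisation of the clause theta that accepts the positive model m.

    A coordinate that is ⊥ (has never matched a positive) becomes the literal m_i;
    a literal that disagrees with m_i becomes the wildcard ★."""
    new = []
    for t, m_i in zip(theta, m):
        if t in (m_i, WILDCARD):
            new.append(t)           # already covers m_i
        elif t == CONTRADICTION:
            new.append(m_i)         # ⊥ -> literal
        else:
            new.append(WILDCARD)    # conflicting literal -> ★
    return tuple(new)

# Usage: generalising the most specific clause towards two positive models.
theta = (CONTRADICTION, CONTRADICTION)
theta = minimally_generalise(theta, ("sunny", "warm"))  # ('sunny', 'warm')
theta = minimally_generalise(theta, ("sunny", "cold"))  # ('sunny', '*')
print(theta)
```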

Satz. The Candidate Elimination Algorithm in a binary CP can also be simplified to:

  1. Start with $L = \{\theta^{(0)}\}$, $U = \{\theta_\star\}$, where $\theta^{(0)} := \text{Find-S}(D)$ (Find-S started from $\theta_\bot$) is the most specific CC that fits all positive training data entries.
  2. Run through all negative training data entries $(m, 0) \in D$:
    1. For each $\theta \in U$: if $h_\theta(m) = 1$, then $\theta$ does NOT fit the data, so we need to specialise it:
      1. $U$.remove($\theta$)
      2. $\Theta :=$ Find-Minimally-Specialised-Versions-of($\theta$) such that $h_{\theta^*}(m) = 0$ for every $\theta^* \in \Theta$
      3. For every new $\theta^* \in \Theta$ we need to check that it does not accidentally unfit some positive training data entry:
        1. So for every $(m', 1) \in D$ check also: $h_{\theta^*}(m') = 1$
        2. If that holds for all of them, then:
          $U$.add($\theta^*$)

Here Find-Minimally-Specialised-Versions-of($\theta$) ("Find-M") is defined by sacrificing a $\star$ in $\theta$: replace it with every possible value $m' \in M_i$ with $m' \neq m_i$, where $((m_1, \ldots, m_k), 0) \in D$ is the negative example at hand (a sketch follows the example below).
↪ E.g. $M_i = \{0,1\}$, $((0,0,1,1), 0) \in D$ and $\theta = (\star, \star, 1, \star)$ $\Rightarrow$ $\Theta = \{(1, \star, 1, \star),\ (\star, 1, 1, \star),\ (\star, \star, 1, 0)\}$
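A Python sketch of Find-M for nominal features ($\star$ encoded as "*"; the feature domains are passed in explicitly, an illustrative assumption). It reproduces the example above.

```python
WILDCARD = "*"  # ★

def find_m(theta, m_neg, domains):
    """All minimal specialisations of theta whose hypothesis rejects the negative model m_neg.

    Each specialisation replaces exactly one ★ with a literal different from the
    corresponding coordinate of m_neg; domains[i] is the finite nominal set M_i."""
    specialisations = []
    for i, (t, m_i) in enumerate(zip(theta, m_neg)):
        if t != WILDCARD:
            continue
        for value in sorted(domains[i]):
            if value != m_i:
                specialisations.append(theta[:i] + (value,) + theta[i + 1:])
    return specialisations

# Usage: reproduces the example above (all M_i = {0, 1}).
domains = [{0, 1}] * 4
print(find_m(("*", "*", 1, "*"), (0, 0, 1, 1), domains))
# [(1, '*', 1, '*'), ('*', 1, 1, '*'), ('*', '*', 1, 0)]
```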