Introduction to Machine Learning

Week 1

CONTEXT:

We have a set of objects $\mathcal{O}$ in the real world, their equivalent in the model world, the model set $\mathcal{M}$, and a result set $\mathcal{R}$.

  • $a: \mathcal{O} \to \mathcal{M}$, $t: \mathcal{M} \to \mathcal{R}$, $p: \mathcal{O} \to \mathcal{R}$ → Our aim is to find $t$ such that $t \circ a = p$.
  • Denote a training set as a finite sequence/vector $D = (D_i) \in (\mathcal{M} \times \mathcal{R})^N =: D^*$.
  • We mostly ignore $\mathcal{O}, a, p$, so in a learning task we consider only $(\mathcal{M}, t, \mathcal{R})$.

DEF.

  • Supervised Learning: an algorithm $\mathcal{A}: D^* \to \{h: \mathcal{M} \to \mathcal{R}\}$ that uses training data $D \in D^*$ to generate a target map for our model (see the sketch after this list).
  • Unsupervised Learning: an algorithm $\mathcal{A}$ that generates labels by looking at unlabelled data $D \in \mathcal{M}^N$.
  • Reinforcement Learning: an iterative version of supervised learning, i.e. the algorithm $\mathcal{A}$ is initially given a random dataset $D_0 \in D^*$ and then uses the resulting target function $\mathcal{A}(D_0) =: h$ to generate the next dataset $D_1 \in D^*$.
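A minimal Python sketch of the supervised-learning signature $\mathcal{A}: D^* \to \{h: \mathcal{M} \to \mathcal{R}\}$, assuming models are feature tuples and labels are integers; the memorizing learner and its default label for unseen models are illustrative choices, not part of the notes.

```python
from typing import Callable, Hashable, List, Tuple

Model = Tuple[Hashable, ...]                 # an element of M (a feature tuple)
Label = int                                  # an element of R, here {0, 1}
TrainingData = List[Tuple[Model, Label]]     # D = (D_i) in (M x R)^N
Hypothesis = Callable[[Model], Label]        # h: M -> R

def memorizing_learner(data: TrainingData) -> Hypothesis:
    """A trivial supervised learner A: D* -> {h: M -> R}.

    It memorizes the training pairs and answers 0 for unseen models
    (the default label is an arbitrary illustrative choice)."""
    table = dict(data)

    def h(m: Model) -> Label:
        return table.get(m, 0)

    return h

# Usage: learn from two labelled models, then query the hypothesis.
D = [(("sunny", "warm"), 1), (("rainy", "cold"), 0)]
h = memorizing_learner(D)
print(h(("sunny", "warm")))  # 1 (seen in D)
print(h(("sunny", "cold")))  # 0 (unseen -> default)
```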

DEF. $\ker(f) := \{(x,y) \mid f(x) = f(y)\}$ for some map $f$
DEF. $\operatorname{ran}(f) := \{f(x) \mid x \in X\} = f(X)$, the range of $f$
DEF. $\operatorname{supp}(f) := \{m \in \mathcal{M} \mid f(m) \neq 0\}$ for some target map $f: \mathcal{M} \to \mathcal{R}$

DEF. A learning task $(\mathcal{M}, t, \mathcal{R})$ is a Classification Problem (CP) $:\Leftrightarrow$ $\operatorname{ran}(t)$ is finite.
↪ in particular when $\mathcal{R}$ is finite!
↪ called binary CP $:\Leftrightarrow$ $|\operatorname{ran}(t)| = 2$
↪ called multiclass CP $:\Leftrightarrow$ $|\operatorname{ran}(t)| > 2$

CONTEXT:

Consider from now on only binary CPs $(\mathcal{M}, t, \mathcal{R})$.

DEF. The model features of the CP are sets $M_1, \ldots, M_k$ such that $M_1 \times \cdots \times M_k = \mathcal{M}$.
↪ the model features are nominal $:\Leftrightarrow$ $M_1, \ldots, M_k$ are all finite ($\Leftrightarrow \mathcal{M}$ finite)

DEF. Any map $h: \mathcal{M} \to \mathcal{R}$ is a hypothesis.
↪ Because $t$ is unknown. (Remember that $t$ is the optimal target function with $t \circ a = p$, which we try to approximate with our learning algorithms.)

CONTEXT:

Let $(\mathcal{M}, t, \mathcal{R})$ be a binary CP ($\mathcal{R} = \{0,1\}$) with a supervised learning algorithm $\mathcal{A}: D^* \to \{h: \mathcal{M} \to \mathcal{R}\}$. Remember that $D = (D_i) \in (\mathcal{M} \times \mathcal{R})^N =: D^*$ is the notation for training data.

DEF. Hypothesis space $\mathcal{H}_\mathcal{A} := \mathcal{A}(D^*) = \{h: \mathcal{M} \to \mathcal{R} \mid \exists D \in D^*: \mathcal{A}(D) = h\}$
↪ All maps that can result from applying the algorithm $\mathcal{A}$ to some training data.

DEF. A hypothesis $h$ is said to fit training data $D \in D^*$ $:\Leftrightarrow$ $\forall i \in [N],\ D_i =: (m, r):\ h(m) = r$
↪ A hypothesis fits iff it at least agrees with every training example.
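A minimal Python sketch of the fit check, assuming models are feature tuples and labels are 0/1 as in this binary CP; the example hypothesis is an illustrative assumption.

```python
from typing import Callable, Hashable, List, Tuple

Model = Tuple[Hashable, ...]
TrainingData = List[Tuple[Model, int]]
Hypothesis = Callable[[Model], int]

def fits(h: Hypothesis, D: TrainingData) -> bool:
    """h fits D  :<=>  h(m) = r for every training pair (m, r) in D."""
    return all(h(m) == r for (m, r) in D)

# Usage: a hypothesis that labels a model 1 iff its first feature is "sunny".
h: Hypothesis = lambda m: 1 if m[0] == "sunny" else 0
D = [(("sunny", "warm"), 1), (("rainy", "cold"), 0)]
print(fits(h, D))  # True
```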

DEF. Version space $\mathcal{V}_\mathcal{A}(D) := \mathcal{H}_\mathcal{A} \setminus \{h: \mathcal{M} \to \mathcal{R} \mid h \text{ does not fit } D\}$, the image of $D$ under the map $\mathcal{V}_\mathcal{A}: D^* \to \mathcal{P}(\mathcal{H}_\mathcal{A})$

DEF. $D^+ := \{m_i \mid (m_i, r_i) = D_i,\ r_i = 1,\ i \in [N]\} \subseteq \mathcal{M}$ is the set of all positive examples,
$D^- := \{m_i \mid (m_i, r_i) = D_i,\ r_i = 0,\ i \in [N]\} \subseteq \mathcal{M}$ is the set of all negative examples.

DEF. $D_\mathcal{M} := \{m \mid (m, \_) = D_i \text{ for some } i \in [N]\}$ is the set of models in $\mathcal{M}$ that appear in the sequential training data $D$.
Analogously: $D_\mathcal{R} := \{r \mid (\_, r) = D_i,\ i \in [N]\}$

DEF. For $\mathcal{M}_{\setminus D} := \mathcal{M} \setminus D_\mathcal{M}$ and some hypothesis $h$, we have the sets of positively/negatively biased models under $h$:
$B_+ := \{m \in \mathcal{M}_{\setminus D} \mid h(m) = 1\}$
$B_- := \{m \in \mathcal{M}_{\setminus D} \mid h(m) = 0\}$
↪ $|B_+| + |B_-| = |\mathcal{M}_{\setminus D}|$
↪ $(B_+, B_-)$ is the inductive bias of $h$
↪ The inductive bias can be measured by $\frac{|B_+|}{|\mathcal{M}_{\setminus D}|} \in [0,1] \cap \mathbb{Q}$
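A small Python sketch that enumerates $\mathcal{M} \setminus D_\mathcal{M}$ for nominal features and computes the bias ratio $|B_+| / |\mathcal{M}_{\setminus D}|$; the concrete feature sets, training models and hypothesis are illustrative assumptions.

```python
from itertools import product

# Illustrative nominal model features: M = M_1 x M_2.
M1 = {"sunny", "rainy"}
M2 = {"warm", "cold"}
M = set(product(M1, M2))

# Models that appear in the training data D (labels are irrelevant here).
D_M = {("sunny", "warm"), ("rainy", "cold")}
M_without_D = M - D_M

# An illustrative hypothesis h: M -> {0, 1}.
h = lambda m: 1 if m[0] == "sunny" else 0

B_plus = {m for m in M_without_D if h(m) == 1}
B_minus = {m for m in M_without_D if h(m) == 0}

assert len(B_plus) + len(B_minus) == len(M_without_D)
print(len(B_plus) / len(M_without_D))  # bias ratio in [0, 1], here 0.5
```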

Week 2

CONTEXT:

Let $(\mathcal{M}, t, \mathcal{R})$ be a binary CP ($\mathcal{R} = \{0,1\}$) with nominal model features (remember: $\mathcal{M} = M_1 \times \cdots \times M_k$). In a binary CP, $\operatorname{supp}(f) = \{m \in \mathcal{M} \mid f(m) = 1\}$ is exactly the set of models evaluated positively by $f$.

DEF. The tuple $(\theta_1, \ldots, \theta_k) =: \theta$ is called a conjunctive clause $:\Leftrightarrow$ $\forall i \in [k]: \theta_i \in M_i \cup \{\star, \bot\}$
↪ obviously $\star, \bot \notin M_i$
↪ Elements of $M_i$ are called literals, $\star$ the wildcard, $\bot$ the contradiction

DEF. A conjunctive clause $\theta$ yields the induced hypothesis $h_\theta: \mathcal{M} \to \mathcal{R}$ with
$$h_\theta((m_1, \ldots, m_k)) := \begin{cases} 1 & \text{if } \theta_i \in \{m_i, \star\}\ \forall i \in [k] \\ 0 & \text{else} \end{cases}$$
DEF. We order hypotheses $h, \mu$ by $h \preceq \mu$ $:\Leftrightarrow$ $\operatorname{supp}(h) \subseteq \operatorname{supp}(\mu)$
↪ $h$ is more specific than $\mu$ (and $\mu$ more general than $h$)
↪ $h \prec \mu$ $:\Leftrightarrow$ $\operatorname{supp}(h) \subsetneq \operatorname{supp}(\mu)$
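A minimal Python sketch of the induced hypothesis $h_\theta$, encoding the wildcard $\star$ as "*" and the contradiction $\bot$ as "!" (both encodings are illustrative assumptions).

```python
from typing import Tuple

WILDCARD = "*"       # stands for the wildcard ★
CONTRADICTION = "!"  # stands for the contradiction ⊥

def h_theta(theta: Tuple[str, ...], m: Tuple[str, ...]) -> int:
    """Induced hypothesis: 1 iff every theta_i is the wildcard or equals m_i."""
    return int(all(t in (m_i, WILDCARD) for t, m_i in zip(theta, m)))

# Usage: the clause ("sunny", "*") accepts every sunny model.
print(h_theta(("sunny", "*"), ("sunny", "cold")))  # 1
print(h_theta(("sunny", "*"), ("rainy", "cold")))  # 0
print(h_theta(("!", "*"), ("sunny", "cold")))      # 0, ⊥ never matches
```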

Satz.

  • $\operatorname{supp}(h_{(\bot,\ldots,\bot)}) = \emptyset$ ($\Rightarrow$ for every hypothesis $h$: $h_{(\bot,\ldots,\bot)} \preceq h$)
  • $\operatorname{supp}(h_{(\star,\ldots,\star)}) = \mathcal{M}$ ($\Rightarrow$ for every hypothesis $h$: $h \preceq h_{(\star,\ldots,\star)}$)

Satz. The Find-S Algorithm yields the most specific conjunctive clause that fits the training data $D$ (a sketch follows the steps below):

  1. Start with $\theta_1 := (m_1, \ldots, m_k) = m$ where $(m, 1) = D_i$ is the first positive example of our training data $D$.
    If there are no positive examples, return $\theta_0 = (\bot, \ldots, \bot)$.
  2. Otherwise, run over all other positive examples $((m_1, \ldots, m_k), 1) = D_i$ iteratively. Let $d$ be the number of positive examples and call their models $m^+_1, \ldots, m^+_d$ (with $m^+_1 = m$ from step 1).
    Then, in each iteration $i \in [d] \setminus \{1\}$, "flip" every coordinate of $\theta_{i-1}$ that does NOT match the corresponding coordinate of the positive model $m^+_i$ to the wildcard $\star$, obtaining $\theta_i$.
  3. Result: $\theta_d$ induces a hypothesis $h_{\theta_d}$ that fits all positive examples and, if $D$ is realisable by a conjunctive clause, the entire training data $D$.
    ↪ A result like $\theta_d = (a, b, \star, c, \star, \star, \star)$ would mean that all positive models have $a$ at the first position, $b$ at the second and $c$ at the fourth. Also, at every other position, at least two positive models exist that have different entries.
    ↪ $\theta_d$ induces the most specific conjunctive-clause hypothesis that fits $D$.
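A Python sketch of Find-S under the same conventions ($\star$ as "*", $\bot$ as "!"; models as feature tuples, labels 0/1); the example data are illustrative.

```python
from typing import List, Tuple

WILDCARD = "*"       # ★
CONTRADICTION = "!"  # ⊥
Clause = Tuple[str, ...]
TrainingData = List[Tuple[Tuple[str, ...], int]]

def find_s(D: TrainingData, k: int) -> Clause:
    """Most specific conjunctive clause fitting all positive examples in D."""
    positives = [m for (m, r) in D if r == 1]
    if not positives:
        return tuple([CONTRADICTION] * k)      # theta_0 = (⊥, ..., ⊥)
    theta = list(positives[0])                 # theta_1 = first positive model
    for m in positives[1:]:                    # flip mismatching coordinates to ★
        theta = [t if t == m_i else WILDCARD for t, m_i in zip(theta, m)]
    return tuple(theta)

# Usage: both positives agree only on the first feature.
D = [(("sunny", "warm"), 1), (("rainy", "cold"), 0), (("sunny", "cold"), 1)]
print(find_s(D, k=2))  # ('sunny', '*')
```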

DEF. A disjunctive normal form (DNF) is a finite set $\Theta$ of conjunctive clauses.
↪ $\Theta \subseteq (M_1 \cup \{\star, \bot\}) \times \cdots \times (M_k \cup \{\star, \bot\})$

DEF. A DNF yields the induced hypothesis $h_\Theta: \mathcal{M} \to \mathcal{R}$ with
$$h_\Theta(m) := \begin{cases} 1 & \text{if } \exists \theta \in \Theta: h_\theta(m) = 1 \\ 0 & \text{else} \end{cases}$$
↪ We can filter for multiple patterns at once (or-combined).
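The DNF hypothesis is simply an or over the clause hypotheses; a short Python sketch (clause encoding as above, the example patterns are illustrative).

```python
def h_theta(theta, m):
    """Conjunctive-clause hypothesis as sketched earlier ("*" encodes ★)."""
    return int(all(t in (m_i, "*") for t, m_i in zip(theta, m)))

def h_Theta(Theta, m):
    """DNF hypothesis: 1 iff at least one conjunctive clause in Theta accepts m."""
    return int(any(h_theta(theta, m) for theta in Theta))

# Usage: accept sunny models OR cold models.
Theta = {("sunny", "*"), ("*", "cold")}
print(h_Theta(Theta, ("rainy", "cold")))  # 1
print(h_Theta(Theta, ("rainy", "warm")))  # 0
```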

DEF. Remember that $\mathcal{A}$ is an algorithm, then

  • $\mathcal{V}_\mathcal{A}^{\bot}(D) := \{h \in \mathcal{V}_\mathcal{A}(D) \mid \nexists \mu \in \mathcal{V}_\mathcal{A}(D): h \prec \mu\}$ is the set of maximally general hypotheses
  • $\mathcal{V}_\mathcal{A}^{\top}(D) := \{h \in \mathcal{V}_\mathcal{A}(D) \mid \nexists \mu \in \mathcal{V}_\mathcal{A}(D): \mu \prec h\}$ is the set of maximally specific hypotheses

Satz. The version space of $D$ under the algorithm $\mathcal{A}$ is uniquely determined by its general and specific boundaries $U := \mathcal{V}_\mathcal{A}^{\bot}(D)$ and $L := \mathcal{V}_\mathcal{A}^{\top}(D)$, that is,
$$\mathcal{V}_\mathcal{A}(D) = \{h \in \mathcal{H}_\mathcal{A} \mid \exists h_U \in U,\ h_L \in L: h_L \preceq h \preceq h_U\}$$

Satz. The Candidate Elimination Algorithm yields $L, U$, if they exist (a sketch of its core repair step follows the list below):

  1. Start with $L = \{\theta_\bot\}$, $U = \{\theta_\star\}$, where $\theta_\bot := (\bot, \ldots, \bot)$ and $\theta_\star := (\star, \ldots, \star)$.
  2. Run through the training data $D$; for each $(m, r)$ consider:
    1. If $r = 1$:
      1. Remove all $\theta \in U$ for which $h_\theta(m) \neq 1$.
      2. For each $\theta \in L$:
        1. If $h_\theta(m) = 1$ → ok. (The most specific hypothesis does not break on this training item; it fits it.)
        2. Else: minimally generalise $\theta$ so that $h_\theta$ fits this training item (use the Find-S step) → replace $\theta$ with the new generalised $\theta'$ in $L$.
    2. If $r = 0$:
      1. For each $\theta \in U$:
        1. If $h_\theta(m) = 0$ → ok. (Even the generalised versions reject this item; they are not too general.)
        2. Else: find all minimally specialised versions of $\theta$ such that each $\theta'$ now fits $m$ (that is, labels it 0) → replace $\theta$ with the new specialised versions $\theta_1, \ldots, \theta_n$ in $U$.
  3. Return $L, U$.
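A Python sketch of the minimal-generalisation repair used in step 2.1.2.2, i.e. the same coordinate-wise "flip to ★" operation as in Find-S; the encodings of $\star$/$\bot$ as "*"/"!" and the example models are illustrative.

```python
WILDCARD = "*"       # ★
CONTRADICTION = "!"  # ⊥

def minimally_generalise(theta, m):
    """Smallest generalisation of the clause theta that accepts the positive model m.

    A coordinate that is ⊥ (has never matched a positive) becomes the literal m_i;
    a literal that disagrees with m_i becomes the wildcard ★."""
    new = []
    for t, m_i in zip(theta, m):
        if t in (m_i, WILDCARD):
            new.append(t)           # already covers m_i
        elif t == CONTRADICTION:
            new.append(m_i)         # ⊥ -> literal
        else:
            new.append(WILDCARD)    # conflicting literal -> ★
    return tuple(new)

# Usage: generalising the most specific clause towards two positive models.
theta = (CONTRADICTION, CONTRADICTION)
theta = minimally_generalise(theta, ("sunny", "warm"))  # ('sunny', 'warm')
theta = minimally_generalise(theta, ("sunny", "cold"))  # ('sunny', '*')
print(theta)
```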

Satz. The Candidate Elimination Algorithm in a binary CP can also be simplified to:

  1. Start with $L = \{\theta^{(0)}\}$, $U = \{\theta_\star\}$, where $\theta^{(0)} := \text{Find-S}(D)$ (Find-S started from $\theta_\bot$) is the most specific CC that fits all positive training data entries.
  2. Run through all negative training data entries $(m, 0) \in D$:
    1. For each $\theta \in U$: if $h_\theta(m) = 1$, then $\theta$ does NOT fit the data, so we need to specialise it:
      1. $U$.remove($\theta$)
      2. $\Theta :=$ Find-Minimally-Specialised-Versions-of($\theta$) such that $h_{\theta^*}(m) = 0$ for every $\theta^* \in \Theta$
      3. For every new $\theta^* \in \Theta$ we need to check that it does not accidentally unfit some positive training data entry:
        1. So for every $(m', 1) \in D$ check also: $h_{\theta^*}(m') = 1$
        2. If that holds for all of them, then:
          $U$.add($\theta^*$)

Here Find-Minimally-Specialised-Versions-of($\theta$) ("Find-M") is defined by sacrificing a $\star$ in $\theta$: replace it with every possible value $m' \in M_i$ with $m' \neq m_i$, where $((m_1, \ldots, m_k), 0) \in D$ is the negative example at hand (a sketch follows the example below).
↪ E.g. $M_i = \{0,1\}$, $((0,0,1,1), 0) \in D$ and $\theta = (\star, \star, 1, \star)$ $\Rightarrow$ $\Theta = \{(1, \star, 1, \star),\ (\star, 1, 1, \star),\ (\star, \star, 1, 0)\}$
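A Python sketch of Find-M for nominal features ($\star$ encoded as "*"; the feature domains are passed in explicitly, an illustrative assumption). It reproduces the example above.

```python
WILDCARD = "*"  # ★

def find_m(theta, m_neg, domains):
    """All minimal specialisations of theta whose hypothesis rejects the negative model m_neg.

    Each specialisation replaces exactly one ★ with a literal different from the
    corresponding coordinate of m_neg; domains[i] is the finite nominal set M_i."""
    specialisations = []
    for i, (t, m_i) in enumerate(zip(theta, m_neg)):
        if t != WILDCARD:
            continue
        for value in sorted(domains[i]):
            if value != m_i:
                specialisations.append(theta[:i] + (value,) + theta[i + 1:])
    return specialisations

# Usage: reproduces the example above (all M_i = {0, 1}).
domains = [{0, 1}] * 4
print(find_m(("*", "*", 1, "*"), (0, 0, 1, 1), domains))
# [(1, '*', 1, '*'), ('*', 1, 1, '*'), ('*', '*', 1, 0)]
```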