We have a set of objects O in the real world, their representation in the model world (the model set M), and a result set R.
a:O→M, t:M→R, p:O→R → Our aim is to find t such that t∘a=p
Denote a training set as a finite sequence/vector D=(Di)∈(M×R)^N≕D∗.
We mostly ignore O, a, p, so in a learning task we consider only (M,t,R).
DEF.
Supervised Learning: An algorithm A:D∗→{h:M→R} that uses the training data D∈D∗ to generate a target map for our model.
Unsupervised Learning: An algorithm A that generates labels by looking at unlabelled data D′∈M^N.
Reinforcement Learning: An iterative version of supervised learning: the algorithm A is initially given a random dataset D0∈D∗ and then uses the resulting target function A(D0)≕h to generate the next dataset D1∈D∗.
DEF. ker(f):={(x,y)∣f(x)=f(y)} for some map f
DEF. ran(f):={f(x)∣x∈X}=f(X), the range of a map f:X→Y
DEF. supp(f):={m∈M∣f(m)≠0} for some target map f:M→R
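A small Python sketch of these three set constructions, assuming domains are given as finite iterables; the function names simply mirror the definitions and are not from the lecture:

```python
def ker(f, X):
    """ker(f): all pairs of domain elements that f identifies."""
    return {(x, y) for x in X for y in X if f(x) == f(y)}

def ran(f, X):
    """ran(f) = f(X): the image of the domain under f."""
    return {f(x) for x in X}

def supp(f, M):
    """supp(f): the models that f does not map to 0."""
    return {m for m in M if f(m) != 0}
```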
DEF. Classification Problem (CP) :⇔ a learning task (M,t,R) such that ran(t) is finite. ↪ especially when R is finite! ↪ called binary CP :⇔ ∣ran(t)∣=2 ↪ called multiclass CP :⇔ ∣ran(t)∣>2
CONTEXT:
Consider now only binary CPs (M,t,R).
DEF. The model features of the CP are some sets M1,…,Mk such that M1×…×Mk=M. ↪ model features nominal :⇔ M1,…,Mk all finite (⇔ M finite)
DEF. Any map h:M→R is a hypothesis ↪ Because t is unknown. (Remember that t was the optimal target function such that t∘a=p, which we try to approximate with our learning algorithms.)
CONTEXT:
Let (M,t,R) be a binary CP (R={0,1}) with a supervised learning algorithm A:D∗→{h:M→R}. Remember D=(Di)∈(M×R)^N≕D∗ is the notation for training data.
DEF. Hypothesis space HA:=A(D∗)={h:M→R∣∃D∈D∗:A(D)=h} ↪ All possible maps that could result from applying the algorithm A to any data.
DEF. Any hypothesis h is said to fit training data D∈D∗:⇔∀i∈[N],Di≕(m,r):h(m)=r ↪ A hypothesis fits iff it is at least in alignment with the training data.
DEF. Version space VA(D):=HA∖{h:M→R∣h does not fit D} is an image of D under the map VA:D∗→P(HA)
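To make the last two definitions concrete, here is a minimal Python sketch; the names fits and version_space are my own, and the hypothesis space is passed in explicitly as a finite list of callables (an assumption, since HA need not be finite in general):

```python
def fits(h, D):
    """h fits the training data D iff it reproduces every labelled example."""
    return all(h(m) == r for (m, r) in D)

def version_space(H, D):
    """V_A(D): the hypotheses in H (here a finite list) that fit D."""
    return [h for h in H if fits(h, D)]
```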
DEF. D+:={mi∣(mi,ri)=Di, ri=1, i∈[N]}⊂M is the set of all positive examples
D−:={mi∣(mi,ri)=Di, ri=0, i∈[N]}⊂M is the set of all negative examples
DEF. D{M}:={m∣(m,_)=Di for some i∈[N]} is the set of models in M that appear in the sequential training data D.
Analogously: D{R}:={r∣(_,r)=Di, i∈[N]}
DEF. For M∖D:=M∖D{M} and some hypothesis h, we have the set of positively/negatively biased models under h: B+:={m∈M∖D∣h(m)=1}, B−:={m∈M∖D∣h(m)=0} ↪ ∣B+∣+∣B−∣=∣M∖D∣ ↪ (B+,B−) is the inductive bias of h ↪ The inductive bias can be measured by ∣B+∣/∣M∖D∣ ∈ [0,1]∩ℚ
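A minimal sketch of this inductive-bias split, assuming M is a finite iterable of models; the name inductive_bias is an ad-hoc choice:

```python
def inductive_bias(h, M, D):
    """Split the unseen models M \\ D{M} by the label h assigns to them."""
    seen = {m for (m, _) in D}
    unseen = [m for m in M if m not in seen]
    b_plus  = [m for m in unseen if h(m) == 1]
    b_minus = [m for m in unseen if h(m) == 0]
    ratio = len(b_plus) / len(unseen) if unseen else 0.0  # |B+| / |M \ D|
    return b_plus, b_minus, ratio
```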
Week 2
CONTEXT:
Let (M,t,R) be a binary CP (R={0,1}) with nominal model features (remember: M=M1×…×Mk). In a binary CP, supp(f)={m∈M∣f(m)=1} is exactly the set of models evaluated positively by f.
DEF. The tuple (θ1,…,θk)≕θ is called conjunctive clause :⇔∀i∈[k]:θi∈Mi∪{★,⊥} ↪ obviously ★,⊥∉Mi ↪ Elements of Mi are called literals, ★ wildcard, ⊥ contradiction
DEF. A conjunctive clause θ yields the induced hypothesis hθ:M→R with hθ((m1,…,mk)) := 1 if θi∈{mi,★} ∀i∈[k], else 0.
DEF. We order hypotheses h,μ with h≼μ :⇔ supp(h)⊆supp(μ) ↪ h is more specific than μ (and μ more general than h) ↪ h≺μ :⇔ supp(h)⊂supp(μ)
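A sketch of the induced hypothesis, assuming literals are plain Python values and using the placeholder strings "*" and "!" for ★ and ⊥ (my choice, not the lecture's):

```python
STAR, BOT = "*", "!"  # stand-ins for the wildcard and the contradiction

def h_clause(theta):
    """Hypothesis induced by the conjunctive clause theta."""
    def h(m):
        return 1 if all(t in (mi, STAR) for t, mi in zip(theta, m)) else 0
    return h

# e.g. h_clause((1, STAR, 0))((1, 7, 0)) == 1, but the same clause maps (0, 7, 0) to 0
```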
Satz.
supp(h(⊥,…,⊥))=∅ (⇒ for every hypothesis h: h(⊥,…,⊥)≼h)
supp(h(★,…,★))=M (⇒ for every hypothesis h: h≼h(★,…,★))
Satz. The Find-S Algorithm yields the most specific conjunctive clause that fits the training data D:
Start with θ1:=(m1,…,mk)=m where (m,1)=Di is the first positive example of our training data D.
If there are no positives, return θ0=(⊥,…,⊥).
Otherwise, run over all remaining positive examples ((m1,…,mk),1)=Di iteratively. Let d be the total number of positive examples and call their models m1+:=m, m2+,…,md+.
Then, in each iteration i∈[d]∖{1}, "flip" every coordinate of θi−1 that does NOT match the corresponding coordinate of the positive model mi+ to the wildcard ★, yielding θi.
Result: θd induces a hypothesis hθd that fits the entire training data D. ↪ A result like θd=(a,b,★,c,★,★,★) would mean that all positive models have a at the first position, b at the second and c at the fourth. Also, in every other position, at least two positive models exist that have different entries. ↪ θd induces the most specific conjunctive-clause hypothesis that fits D. (A Python sketch follows below.)
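A minimal Python sketch of Find-S under the assumptions above (nominal features, binary labels, non-empty D); the placeholders "*" and "!" for ★ and ⊥ and the name find_s are my own:

```python
STAR, BOT = "*", "!"  # stand-ins for the wildcard and the contradiction

def find_s(D):
    """Most specific conjunctive clause that fits the positive examples in D."""
    positives = [m for (m, r) in D if r == 1]
    if not positives:
        return tuple(BOT for _ in D[0][0])    # theta_0 = (bot, ..., bot)
    theta = list(positives[0])                # theta_1 = first positive model
    for m in positives[1:]:
        for i, (ti, mi) in enumerate(zip(theta, m)):
            if ti != mi:                      # mismatch -> generalise this coordinate
                theta[i] = STAR
    return tuple(theta)

# e.g. find_s([((1, 0, 1), 1), ((1, 1, 1), 1), ((0, 0, 0), 0)]) == (1, "*", 1)
```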
DEF. A disjunctive normal form (DNF) is a finite set Θ of conjunctive clauses ↪Θ⊂(M1∪{★,⊥})×…×(Mk∪{★,⊥})
DEF. A DNF yields the induced hypothesis hΘ:M→R with hΘ(m) := 1 if ∃θ∈Θ: hθ(m)=1, else 0. ↪ We can filter for multiple patterns at once (or-combined).
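The corresponding sketch for a DNF-induced hypothesis, again with "*" standing in for ★ and the name h_dnf being an ad-hoc choice:

```python
STAR = "*"  # stand-in for the wildcard

def h_dnf(Theta):
    """Hypothesis induced by a DNF Theta: fires iff at least one clause fires."""
    def clause_fires(theta, m):
        return all(t in (mi, STAR) for t, mi in zip(theta, m))
    def h(m):
        return 1 if any(clause_fires(theta, m) for theta in Theta) else 0
    return h
```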
DEF. Remember that A was an algorithm, then
VA⊥(D):={h∈VA(D)∣∄μ∈VA(D):h≺μ} set of maximally general hypotheses
VA⊤(D):={h∈VA(D)∣∄μ∈VA(D):μ≺h} set of maximally specific hypotheses
Satz. The version space of D under algorithm A is uniquely defined by its upper and lower bounds U:=VA⊥ and L:=VA⊤, that is VA(D)={h∈HA∣∃hU∈U,hL∈L:hL≼h≼hU}. (A small sketch of this membership check follows after the algorithm below.)
Satz. The Candidate Elimination Algorithm yields U,L, if they exist.
Start with L={θ⊥}, U={θ★}, where θ⊥:=(⊥,…,⊥) and θ★:=(★,…,★).
Run through the training data D, for each (m,r) consider:
If r=1:
Remove all θ∈U for which hθ(m)=0 (they reject this positive example, so they do not fit the data).
For each θ∈L:
If hθ(m)=1 -> ok. (That means, the minimal hypothesis does not break on this training data item, it fits it.)
Else: Minimally generalise θ so that hθ fits this training data item (use Find-S methods) → replace θ with the new generalised θ′ in L.
If r=0:
For each θ∈U:
If hθ(m)=0 → ok. (That means the general hypotheses still reject this negative item; they are not too general.)
Else: Find all the minimally specialised versions of θ such that each θ′ now fits m (that is, labels it 0) → replace θ in U with the new specialised versions θ1,…,θn.
Return L,U
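Relating back to the first Satz above: for conjunctive clauses over nominal features, the "h lies between the bounds" check can be done directly by comparing supports. A minimal sketch, with ad-hoc names and itertools.product to enumerate the finite model set:

```python
from itertools import product

STAR = "*"  # stand-in for the wildcard

def support(theta, feature_sets):
    """supp(h_theta): all models that the clause theta labels with 1."""
    return {m for m in product(*feature_sets)
            if all(t in (mi, STAR) for t, mi in zip(theta, m))}

def more_specific(theta, mu, feature_sets):
    """theta ≼ mu  iff  supp(h_theta) ⊆ supp(h_mu)."""
    return support(theta, feature_sets) <= support(mu, feature_sets)

def in_version_space(theta, L, U, feature_sets):
    """theta lies between some maximally specific (L) and some maximally general (U) clause."""
    return (any(more_specific(l, theta, feature_sets) for l in L) and
            any(more_specific(theta, u, feature_sets) for u in U))
```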
Satz. The Candidate Elimination Algorithm in a binary CP can also be simplified to:
Start with L={θ(0)}, U={θ★} where θ(0):=Find-S(D) is the most specific conjunctive clause that fits all positive training data entries.
Run through all negative training data entries (m,0)∈D:
For each θ∈U, if hθ(m)=1 then θ does NOT fit the data, so we need to specialise it:
U.remove(θ)
Θ:= Find-Minimally-Specialised-Versions-of(θ) such that hΘ(m)=0
For every new θ∗∈Θ we need to check that it does not accidentally reject some positive training data entry.
So for every (m',1)∈D check also: hθ∗(m')=1
If that is so for all of them, then: U.add(θ∗)
Where Find-Minimally-Specialised-Versions-of(θ) ("Find-M") is defined by replacing a single ★ in θ at position i by every possible literal m'∈Mi with m'≠mi, where ((m1,…,mk),0)∈D is the negative example being processed (see the sketch below). ↪ E.g. Mi={0,1}, ((0,0,1,1),0)∈D and θ=(★,★,1,★) ⇒ Θ={(1,★,1,★),(★,1,1,★),(★,★,1,0)}
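A minimal sketch of Find-M and of one negative-example step of the simplified Candidate Elimination, assuming nominal features given as a list of finite sets; all function names here are my own shorthand:

```python
STAR = "*"  # stand-in for the wildcard

def matches(theta, m):
    """h_theta(m) == 1: every clause entry is the model's literal or a wildcard."""
    return all(t in (mi, STAR) for t, mi in zip(theta, m))

def find_m(theta, m_neg, feature_sets):
    """Minimally specialised versions of theta that reject the negative model m_neg."""
    out = []
    for i, (ti, vi) in enumerate(zip(theta, m_neg)):
        if ti == STAR:                        # sacrifice one wildcard at a time ...
            for v in feature_sets[i]:
                if v != vi:                   # ... for any literal other than the negative entry
                    out.append(theta[:i] + (v,) + theta[i + 1:])
    return out

def specialise_u(U, m_neg, positive_models, feature_sets):
    """One pass of the simplified Candidate Elimination for a negative example (m_neg, 0)."""
    new_U = []
    for theta in U:
        if not matches(theta, m_neg):         # already rejects the negative example: keep it
            new_U.append(theta)
            continue
        for theta_star in find_m(theta, m_neg, feature_sets):
            if all(matches(theta_star, p) for p in positive_models):
                new_U.append(theta_star)      # keep only specialisations that still fit all positives
    return new_U

# Reproduces the example above:
# find_m(("*", "*", 1, "*"), (0, 0, 1, 1), [{0, 1}] * 4)
#   -> [(1, "*", 1, "*"), ("*", 1, 1, "*"), ("*", "*", 1, 0)]
```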