
Neural-Symbolic Integration

Steffen Hölldobler, International Center for Computational Logic, Technische Universität Dresden, Germany


Introduction & Motivation: Overview

I Introduction & Motivation

I Propositional Logic

. Existing Approaches

. Propositional Logic Programs and the Core Method

I First-Order Logic

. Existing Approaches

. First-Order Logic Programs and the Core Method

I The Neural-Symbolic Learning Cycle

I Challenge Problems


Introduction & Motivation: Connectionist Systems

I Well-suited to learn, to adapt to new environments, to degrade gracefully etc.

I Many successful applications.

I Approximate functions.

. Hardly any knowledge about the functions is needed.

. Trained using incomplete data.

I Declarative semantics is not available.

I Recursive networks are hardly understood.

I McCarthy 1988: We still observe a propositional fixation.

I Structured objects are difficult to represent.

. Smolensky 1987: Can we instantiate the power of symbolic computation within fully connectionist systems?


Introduction & Motivation: Logic Systems

I Well-suited to represent and reason about structured objects and structure-sensitive processes.

I Many successful applications.

I Direct implementation of relations and functions.

I Explicit expert knowledge is required.

I Highly recursive structures.

I Well understood declarative semantics.

I Logic systems are brittle.

I Expert knowledge may not be available.

. Can we instantiate the power of connectionist computation within a logic system?


Introduction & Motivation: Objective

I Seek the best of both paradigms!

I Understanding the relation between connectionist and logic systems.

I Contribute to the open research problems of both areas.

I Well developed for propositional case.

I Hard problem: going beyond.

I In this lecture:

. Overview on existing approaches.

. Logic programs and recurrent networks.

. Semantic operators for logic programs can be computed by connectionist systems.

. Semantic operators can be learned.

. Logic programs can be extracted.


Neural-Symbolic Integration using the Core Method


Connectionist Networks

I A connectionist network consists of

. a set U of units and

. a set W ⊆ U × U of connections.

I Each connection is labeled by a weight w ∈ R.

I If there is a connection from unit uj to uk, then wkj is its associated weight.


I A unit is specified by

. an input vector ~i = (i1, . . . , im), ij ∈ R, 1 ≤ j ≤ m,

. an activation function Φ mapping ~i to a potential p ∈ R,

. an output function Ψ mapping p to an (output) value v ∈ R.


I If there is a connection from uj to uk, then wkjvj is the input received by uk along this connection.

I The potential and value of a unit are synchronously recomputed (or updated).

I Often a linear time t is added as parameter to input, potential and value.

I The state of a network with units u1, . . . , un at time t is (v1(t), . . . , vn(t)).


A Simple Connectionist Network

[Figure: four units; u1 feeds u3 with weight w31 and u2 feeds u4 with weight w42, while u3 and u4 are mutually connected with weights w34 and w43; v3 and v4 are the observable values.]

w34 = w43 = −0.5, w31 = w42 = 1

pi(t + 1) = pi(t) + ∑_{j=1}^{4} wij vj(t)

vi(t) = round(pi(t))

v1(t) = 6 if t = 0, 2 otherwise

v2(t) = 5 if t = 0, 2 otherwise

I What happens if the network is synchronously updated?
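To see the answer concretely, here is a small simulation (an illustrative sketch, not part of the slides; it assumes u1 and u2 are input units whose values are clamped to the given sequence):

```python
import numpy as np

W = np.zeros((5, 5))                 # 1-based indexing: W[k, j] holds wkj
W[3, 1] = W[4, 2] = 1.0
W[3, 4] = W[4, 3] = -0.5

p = np.zeros(5)                      # potentials (index 0 unused)
v = np.zeros(5)                      # values

def clamp_inputs(t):
    v[1] = 6 if t == 0 else 2        # v1(t)
    v[2] = 5 if t == 0 else 2        # v2(t)

for t in range(8):
    clamp_inputs(t)
    # synchronous update: p_i(t+1) = p_i(t) + sum_j w_ij * v_j(t)
    p[3:5] = p[3:5] + W[3:5, 1:5] @ v[1:5]
    v[3:5] = np.round(p[3:5])
    print(t + 1, v[3], v[4])         # values v3, v4 at time t+1
```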


I A winner-take-all network is a synchronously updated connectionist network of n units (not counting input units) such that, after each unit receives an initial input at t = 0, eventually only the unit with the highest initial input produces a value greater than 0, whereas the value of all other units is 0.

I Exercise Construct a winner-take-all network of 3 units.


Literature

I Feldman, Ballard 1982: Connectionist Models and Their Properties. Cognitive Science 6 (3), 205-254.

I McCarthy 1988: Epistemological Challenges for Connectionism. Behavioural and Brain Sciences 11, 44.

I Smolensky 1987: On Variable Binding and the Representation of Symbolic Structures in Connectionist Systems. Report No. CU-CS-355-87, Department of Computer Science & Institute of Cognitive Science, University of Colorado, Boulder.


Propositional Logic

I Existing Approaches

. Finite Automata and McCulloch-Pitts Networks

. Weighted Automata and Semiring Artificial Neural Networks

. Propositional Reasoning and Symmetric/Stochastic Networks

. Other Approaches

I Propositional Logic Programs and the Core Method

. The Very Idea

. Logic Programs

. Propositional Core Method

. Backpropagation

. Knowledge-Based Artificial Neural Networks

. Propositional Core Method using Sigmoidal Units

. Further Extensions


McCulloch-Pitts Networks

I McCulloch, Pitts 1943: Can the activities of nervous systems be modelled by a logical calculus?

I A McCulloch-Pitts network consists of a set U of binary threshold unitsand a set W ⊆ U × U of weighted connections.

I The set UI of input units is defined as UI = {uk ∈ U | (∀uj ∈ U) wkj = 0}.

I The set UO of output units is defined as UO = {uj ∈ U | (∀uk ∈ U) wkj = 0}.

[Figure: a McCulloch-Pitts network drawn as a black box, with the input units UI on the left and the output units UO on the right.]


Binary Threshold Units

I uk is a binary threshold unit if

Φ(~ik) = pk = ∑_{j=1}^{m} wkj vj

Ψ(pk) = vk = 1 if pk ≥ θk, 0 otherwise,

where θk ∈ R is a threshold.

I Three binary threshold units:

. Negation: w21 = −1 and θ2 = −0.5 yield v2 = ¬v1.

. Disjunction: w31 = w32 = 1 and θ3 = 0.5 yield v3 = v1 ∨ v2.

. Conjunction: w31 = w32 = 1 and θ3 = 1.5 yield v3 = v1 ∧ v2.
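These three units are easy to test exhaustively; the following is a minimal sketch (my own code, not from the slides):

```python
from itertools import product

def threshold_unit(weights, theta):
    """Binary threshold unit: outputs 1 iff sum_j w_j * v_j >= theta."""
    return lambda *vs: int(sum(w * v for w, v in zip(weights, vs)) >= theta)

NOT = threshold_unit([-1], -0.5)      # w21 = -1, theta2 = -0.5
OR  = threshold_unit([1, 1], 0.5)     # w31 = w32 = 1, theta3 = 0.5
AND = threshold_unit([1, 1], 1.5)     # w31 = w32 = 1, theta3 = 1.5

for v1 in (0, 1):
    assert NOT(v1) == 1 - v1
for v1, v2 in product((0, 1), repeat=2):
    assert OR(v1, v2) == (v1 or v2)
    assert AND(v1, v2) == (v1 and v2)
print("all truth tables check out")
```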


A Simple McCulloch-Pitts Network

I Example Consider the following network of logical threshold units:

"!#

"!#

.5 .5u1 u3

-

-

1

1"!#

"!#

.5 .5

u2 u4

����

��������*HHHH

HHHHHH

HHj

-1-1

"!#

.5u5������������:

XXXXXXXXXXXXz

1

1

I Exercise

. Specify UI and UO.

. What is computed by the network if all units are updated synchronously?

. Specify the states of the network ignoring input and output units.


Finite Automata

I A finite automaton consists of:

. Σ, a finite set of input symbols,

. Φ, a finite set of output symbols,

. Q, a finite set of states,

. q0 ∈ Q, an initial state,

. F ⊂ Q, a set of final states,

. δ : Q × Σ → Q, a state transition function,

. ρ : Q→ Φ, an output function.

I Exercise Let Σ = Φ = {1, 2}, Q = {p, q, r}, F = {r}, q0 = p,

ρ: ρ(p) = 1, ρ(q) = 1, ρ(r) = 2,

δ: δ(p, 1) = q, δ(p, 2) = p, δ(q, 1) = r, δ(q, 2) = q, δ(r, 1) = r, δ(r, 2) = r.

What is computed by this automaton?
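One way to answer the exercise is to simulate the automaton; the sketch below is my own encoding (the convention that an output symbol is emitted for the successor state after each transition is an assumption):

```python
# The exercise automaton as dictionaries.
delta = {('p', 1): 'q', ('p', 2): 'p',
         ('q', 1): 'r', ('q', 2): 'q',
         ('r', 1): 'r', ('r', 2): 'r'}
rho = {'p': 1, 'q': 1, 'r': 2}

def run(inputs, q0='p'):
    """Feed the input symbols one by one; emit rho(state) after each step."""
    state, outputs = q0, []
    for b in inputs:
        state = delta[(state, b)]
        outputs.append(rho[state])
    return outputs, state

outs, final = run([1, 2, 1, 1])
print(outs, final, final in {'r'})    # [1, 1, 2, 2] r True
```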


Finite Automata and McCulloch-Pitts Networks

I Theorem McCulloch-Pitts networks are finite automata and vice versa.

I Proof

⇒ Exercise.

⇐ Let T = (Σ, Φ, Q, q0, F, δ, ρ) be an automaton with

• Σ = {b1, . . . , bm},
• Φ = {c1, . . . , cr},
• Q = {q0, . . . , qk−1}.

To show: there exists a network N with

• inputs {b′1, . . . , b′m},
• outputs {c′1, . . . , c′r},
• states {q′0, . . . , q′k−1}

such that if T generates cj1, . . . , cjn given bj1, . . . , bjn, then N generates c′j1, . . . , c′jn given b′j1, . . . , b′jn.


Construction of the Network: Inputs and Outputs

I Remember |Σ| = m, |Φ| = r.

I Inputs x1, . . . , xm with b′j = ~x where xi = 1 if i = j, 0 otherwise.

I Outputs y1, . . . , yr with c′j = ~y where yi = 1 if i = j, 0 otherwise.


Construction of the Network: Units and Connections

I Remember |Σ| = m, |Φ| = r, |Q| = k.

I qb-units represent that T in state q receives input b (k × m units).

I c-units represent output c (r units).

I Connections

. Let {k1, . . . , kn(k)} = {(q, b) | δ(q, b) = q∗}; then

vuq∗b∗(t + 1) = 1 if xb∗(t) ∧ [k1(t) ∨ . . . ∨ kn(k)(t)], 0 otherwise.

. Let {l1, . . . , ln(l)} = {(q, b) | ρ(q) = c}; then

vuc(t + 1) = 1 if l1(t) ∨ . . . ∨ ln(l)(t), 0 otherwise.

I The theorem follows by induction on the length of the input sequence.
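The construction can be played through in a few lines. The sketch below is my own set-based encoding of the qb- and c-units, reusing the exercise automaton from above:

```python
delta = {('p', 1): 'q', ('p', 2): 'p', ('q', 1): 'r', ('q', 2): 'q',
         ('r', 1): 'r', ('r', 2): 'r'}
rho = {'p': 1, 'q': 1, 'r': 2}
states, symbols = ['p', 'q', 'r'], [1, 2]

def step(active_qb, b_star):
    """One synchronous update; x_{b*} is on, all other input lines are off."""
    next_qb = set()
    for q_star in states:
        for b in symbols:
            # qb-unit (q*, b*) fires iff some active (q, a) has delta(q, a) = q*
            if b == b_star and any(delta[(q, a)] == q_star
                                   for (q, a) in active_qb):
                next_qb.add((q_star, b))
    outputs = {rho[q] for (q, b) in active_qb}   # the c-unit for rho(q) fires
    return next_qb, outputs

active = {('p', 1)}                  # encodes: T in state q0 = p reading 1
for b in [2, 1, 1]:
    active, out = step(active, b)
    print(active, out)
```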


Exercises

I Specify the automaton corresponding to the sample network.

I Specify the network corresponding to the sample finite automaton.

I Complete the proof of the theorem.


Some Remarks on McCulloch-Pitts Networks

I McCulloch-Pitts networks are not just simple reactive systems, but their behavior depends on previous inputs as well as the activity within the network.

. Example

[Figure: input x drives two threshold-0.5 units connected in series with weight 1, producing output y.]

I Literature

. Arbib: Brains, Machines and Mathematics. Springer, 2nd edition (1987).

. McCulloch & Pitts: A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics 5, 115-133 (1943).


Weighted Automata and Semiring Artificial Neural Networks

I Bader, Hölldobler, Scalzitti 2004: Can the result by McCulloch and Pitts be extended to weighted automata?

I Let (K, ⊕, ⊙, 0K, 1K) be a semiring.

I uk is a ⊕-unit if Φ(~ik) = pk = ⊕_{j=1}^{m} (wkj ⊙ vj) and Ψ(pk) = vk = pk.

I uk is a ⊙-unit if Φ(~ik) = pk = ⊙_{j=1}^{m} (wkj ⊙ vj) and Ψ(pk) = vk = pk.

I A semiring artificial neural network consists of a set U of ⊕- and ⊙-units and a set W ⊆ U × U of K-weighted connections.

I Theorem Weighted automata are semiring artificial neural networks.

I Literature Bader, Hölldobler, Scalzitti 2004: Semiring Artificial Neural Networks and Weighted Automata – and an Application to Digital Image Encoding. In: KI 2004: Advances in Artificial Intelligence, Lecture Notes in Artificial Intelligence 3238, 281-294.


Symmetric Networks

I Hopfield 1982: Can statistical models for magnetic materials explain the behavior of certain classes of networks?

I Original application: associative memory.

I A symmetric network consists of a set U of binary threshold units and a set W ⊆ U × U of weighted connections such that wkj = wjk for all k, j with k ≠ j.

I Asynchronous update procedure: while state ~v is unstable: update an arbitrary unit.

[Figure: a small symmetric network with unit thresholds 0, 0, and 5 and connection weights 2, 2, and −1, followed by a trace of asynchronous updates until a stable state is reached.]


Energy Minimization

I What happens precisely when a symmetric network is updated?

I Consider the energy function

E(t) = −(1/2) ∑_{k,j} wkj vj(t) vk(t) + ∑_k θk vk(t)
     = −∑_{k<j} wkj vj(t) vk(t) + ∑_k θk vk(t)

describing the state of the network at time t.

I We assume wii = 0 for all units i in the network.

I Exercise

. Specify E(t) for the symmetric networks on the previous page.

. How does an update change the energy of a symmetric network (you may assume that θk = 0 for all k)?

I Theorem E is monotonically decreasing, i.e., E(t + 1) ≤ E(t).

I Exercise Does this theorem still hold if we drop the assumption that wij = wji?

I Exercise How plausible is the assumption that wij = wji?
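The theorem can be checked numerically. The sketch below is illustrative code with a hand-picked symmetric weight matrix; it performs asynchronous updates and asserts that E never increases:

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[0., 2., -1.],
              [2., 0., 2.],
              [-1., 2., 0.]])          # symmetric, zero diagonal
theta = np.array([0., 0., 5.])

def energy(v):
    return -0.5 * v @ W @ v + theta @ v

v = rng.integers(0, 2, size=3).astype(float)
for _ in range(20):                    # update an arbitrary unit at a time
    k = rng.integers(0, 3)
    old_e = energy(v)
    v[k] = 1.0 if W[k] @ v >= theta[k] else 0.0   # threshold update
    assert energy(v) <= old_e + 1e-12             # E(t+1) <= E(t)
print("stable state:", v, "energy:", energy(v))
```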


Stochastic Networks or Boltzmann Machines

I Hinton, Sejnowski 1983: Can we escape local minima?

I A stochastic network is a symmetric network, but the values are computed probabilistically:

P(vk = 1) = 1 / (1 + e^{(θk − pk)/T})

where T is called the pseudo temperature.

I In equilibrium, stochastic networks are more likely to be in a state with low energy.

I Kirkpatrick et al. 1983: Can we compute a global minimum?

I Simulated annealing: decrease T gradually.

I Theorem (Geman, Geman 1984) A global minimum is reached if T is decreased in infinitesimally small steps.

I Applications Combinatorial optimization problems like the travelling salesman problem or the graph bipartitioning problem.
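A corresponding sketch of the probabilistic update with annealing (illustrative; the network is the same hand-picked one as in the previous snippet and the annealing schedule is my choice):

```python
import numpy as np

W = np.array([[0., 2., -1.],
              [2., 0., 2.],
              [-1., 2., 0.]])
theta = np.array([0., 0., 5.])
energy = lambda v: -0.5 * v @ W @ v + theta @ v

rng = np.random.default_rng(1)
v = rng.integers(0, 2, size=3).astype(float)

def stochastic_step(v, T):
    k = rng.integers(0, len(v))
    # P(v_k = 1) = 1 / (1 + exp((theta_k - p_k) / T))
    prob_on = 1.0 / (1.0 + np.exp((theta[k] - W[k] @ v) / T))
    v[k] = 1.0 if rng.random() < prob_on else 0.0
    return v

# simulated annealing: decrease the pseudo temperature gradually
for T in np.geomspace(5.0, 0.05, num=300):
    v = stochastic_step(v, T)
print("final state:", v, "energy:", energy(v))
```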


Literature

I Geman, Geman 1984: Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721-741.

I Hinton, Sejnowski 1983: Optimal Perceptual Inference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 448-453.

I Hopfield 1982: Neural Networks and Physical Systems with Emergent Collective Computational Abilities. In: Proceedings of the National Academy of Sciences USA, 2554-2558.

I Kirkpatrick et al. 1983: Optimization by Simulated Annealing. Science 220, 671-680.



Propositional Logic

I Variables are p1, . . . , pn.

I Connectives are ¬,∨,∧.

I Atoms are variables.

I Literals are atoms and negated atoms.

I Clauses are (generalized) disjunctions of literals.

I Formulas in clause form are (generalized) conjunctions of clauses.

I Notation Sometimes variables are denoted by different letters if there is a bijection between these letters and p1, . . . , pn.

I Example

(¬o ∨ m) ∧ (¬s ∨ ¬m) ∧ (¬c ∨ m) ∧ (¬c ∨ s) ∧ (¬v ∨ ¬m)

which is abbreviated by

〈[¬o, m], [¬s, ¬m], [¬c, m], [¬c, s], [¬v, ¬m]〉.



Interpretations and Models

I Notation (all symbols may be indexed)

. A denotes an atom.

. L denotes a literal.

. F, G denote formulas.

. C denotes a clause.

I Interpretations are mappings from {p1, . . . , pn} to {0, 1}.

. They can be encoded as ~v.

. They are extended to formulas as follows:

pi(~v) = vi
(¬F)(~v) = 1 − F(~v)
(F ∧ G)(~v) = F(~v) × G(~v)
(F ∨ G)(~v) = F(~v) + G(~v) − F(~v) × G(~v)

I ~v is a model for F iff F (~v) = 1.

I F is satisfiable if it has a model.
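The arithmetic semantics translates directly into code; a small sketch (the nested-tuple formula representation is my own choice):

```python
def ev(F, v):
    """F is a variable index, ('not', F), ('and', F, G) or ('or', F, G);
    v is the 0/1 vector encoding the interpretation."""
    if isinstance(F, int):
        return v[F - 1]                      # p_i(v) = v_i
    op = F[0]
    if op == 'not':
        return 1 - ev(F[1], v)
    if op == 'and':
        return ev(F[1], v) * ev(F[2], v)
    if op == 'or':
        a, b = ev(F[1], v), ev(F[2], v)
        return a + b - a * b

# F = <[~p1, p2], [p3, ~p2]> and v = 101:
F = ('and', ('or', ('not', 1), 2), ('or', 3, ('not', 2)))
print(ev(F, [1, 0, 1]))                      # 0, so v is not a model of F
```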


Interpretations and Models – Example

I Let F = 〈[¬p1, p2], [p3, ¬p2]〉 and ~v = ~101, then:

F(~v) = [¬p1, p2](~v) × [p3, ¬p2](~v)
= ((¬p1)(~v) + p2(~v) − (¬p1)(~v) × p2(~v)) × (p3(~v) + (¬p2)(~v) − p3(~v) × (¬p2)(~v))
= ((1 − p1(~v)) + p2(~v) − (1 − p1(~v)) × p2(~v)) × (p3(~v) + (1 − p2(~v)) − p3(~v) × (1 − p2(~v)))
= ((1 − 1) + 0 − (1 − 1) × 0) × (1 + (1 − 0) − 1 × (1 − 0))
= 0 × 1
= 0

I Hence, ~v is not a model for F, but is a model for [p3, ¬p2].

I Exercise

. Is F satisfiable? Prove your claim.

. Is 〈[¬p], [p,¬q], [q]〉 satisfiable? Prove your claim.

. Find all models of 〈[¬o, m], [¬s,¬m], [¬c, m], [¬c, s], [¬v,¬m]〉.


Propositional Reasoning and Energy Minimization

I Pinkas 1991: Is there a link between propositional logic and symmetric networks?

I Let F = 〈C1, . . . , Cm〉 be a propositional formula in clause form.

I We define

τ(C) = 0 if C = [ ],
τ(C) = A if C = [A],
τ(C) = 1 − A if C = [¬A],
τ(C) = τ(C1) + τ(C2) − τ(C1)τ(C2) if C = (C1 ∨ C2),

τ(F) = ∑_{i=1}^{m} (1 − τ(Ci)).

I Example τ(〈[¬o, m], [¬s, ¬m], [¬c, m], [¬c, s], [¬v, ¬m]〉) = vm − cm − cs + sm − om + 2c + o.

I Exercise Compute τ (〈[¬p], [p,¬q], [q]〉).
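Since τ(F)(~v) = 0 exactly at the models of F, the global minima can be found by brute force. A sketch (numeric evaluation over all interpretations instead of symbolic expansion; the clause encoding is mine):

```python
from itertools import product

def tau_clause(clause, val):
    """clause: list of (atom, positive) literals; val: dict atom -> 0/1."""
    t = 0
    for atom, positive in clause:
        lit = val[atom] if positive else 1 - val[atom]
        t = t + lit - t * lit            # tau(C1 v C2)
    return t

def tau(F, val):
    return sum(1 - tau_clause(c, val) for c in F)

F = [[('o', False), ('m', True)], [('s', False), ('m', False)],
     [('c', False), ('m', True)], [('c', False), ('s', True)],
     [('v', False), ('m', False)]]
atoms = ['o', 'm', 's', 'c', 'v']
for bits in product((0, 1), repeat=5):
    val = dict(zip(atoms, bits))
    if tau(F, val) == 0:                 # global minima = models of F
        print(val)
```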


Propositional Reasoning and Symmetric Networks

I Theorem F(~v) = 1 iff τ(F) has a global minimum at ~v and τ(F)(~v) = 0.

I Compare τ(F) = vm − cm − cs + sm − om + 2c + o

with E(~v) = −∑_{k<j} wkj vj vk + ∑_k θk vk.

[Figure: a symmetric network with units u1 = o, u2 = m, u3 = s, u4 = c, u5 = v; matching τ(F) with E(~v) gives connection weights −1 between v and m, 1 between c and m, 1 between c and s, −1 between s and m, 1 between o and m, and thresholds 2 for c, 1 for o, and 0 for the remaining units.]


Propositional Non-Monotonic Reasoning

I Pinkas 1991a: Can the above-mentioned approach be extended to non-monotonic reasoning?

I Consider F = 〈(C1, k1), . . . , (Cm, km)〉, where Ci are clauses and ki ∈ R+.

I The penalty of ~v for (C, k) is k if C(~v) = 0 and 0 otherwise.

I The penalty of ~v for F is the sum of the penalties for (Ci, ki).

I ~v is preferred over ~w wrt F

if the penalty of ~v for F is smaller than the penalty of ~w for F .

I Modify τ to become τ(F) = ∑_{i=1}^{m} ki (1 − τ(Ci)), e.g.,

τ(〈([¬o, m], 1), ([¬s, ¬m], 2), ([¬c, m], 4), ([¬c, s], 4), ([¬v, ¬m], 4)〉) = 4vm − 4cm − 4cs + 2sm − om + 8c + o.

I The corresponding stochastic network computes most preferred interpretations.


Exercises and Literature

I Exercise Consider

F = 〈([¬o, m], 1), ([¬s,¬m], 2), ([¬c, m], 4), ([¬c, s], 4), ([¬v,¬m], 4)〉.

. Compute the most preferred interpretations of F .

. What happens if we add (o, 100) to F ?

. What happens if we add (o, 100) and (s, 100) to F ?

I Literature

. Pinkas 1991: Symmetric Neural Networks and Logic Satisfiability. Neural Computation 3, 282-291.

. Pinkas 1991a: Propositional Non-Monotonic Reasoning and Inconsistency in Symmetrical Neural Networks. In: Proceedings International Joint Conference on Artificial Intelligence, 525-530.


Propositional Logic Programs and the Core Method

I The Very Idea

I Logic Programs

I Propositional Core Method

I Backpropagation

I Knowledge-Based Artificial Neural Networks

I Propositional Core Method using Sigmoidal Units

I Further Extensions


The Very Idea

I Various semantics for logic programs coincide with fixed points of associated immediate consequence operators (e.g., Apt, van Emden 1982).

I Banach Contraction Mapping Theorem A contraction mapping f defined on a complete metric space (X, d) has a unique fixed point. The sequence y, f(y), f(f(y)), . . . converges to this fixed point for any y ∈ X.

. Fitting 1994: Consider logic programs whose immediate consequence operator is a contraction.

I Funahashi 1989: Every continuous function on the reals can be uniformly approximated by feedforward connectionist networks.

. Hölldobler, Kalinke, Störr 1999: Consider logic programs whose immediate consequence operator is continuous on the reals.


Metrics

I A metric on a space M is a mapping d : M ×M → R such that

. d(x, y) = 0 iff x = y,

. d(x, y) = d(y, x), and

. d(x, y) ≤ d(x, z) + d(z, y).

I Let (M, d) be a metric space and S = (si | si ∈ M) a sequence.

. S converges if (∃s ∈ M)(∀ε > 0)(∃N)(∀n ≥ N) d(sn, s) ≤ ε.

. S is Cauchy if (∀ε > 0)(∃N)(∀n, m ≥ N) d(sn, sm) ≤ ε.

. (M, d) is complete if every Cauchy sequence converges.

I A mapping f : M → M is a contraction on (M, d) if (∃ 0 < k < 1)(∀x, y ∈ M) d(f(x), f(y)) ≤ k · d(x, y).


Propositional Logic Programs

I A propositional logic program P over a propositional language L is a finite set of clauses

A ← L1 ∧ . . . ∧ Ln,

where A is an atom, the Li are literals and n ≥ 0. P is definite if all Li, 1 ≤ i ≤ n, are atoms.

I Let V be the set of all propositional variables occurring in L.

I An interpretation I is a mapping V → {⊤, ⊥}.

I I can be represented by the set of atoms which are mapped to ⊤ under I.

I 2V is the set of all interpretations.

I Immediate consequence operator TP : 2V → 2V:

TP(I) = {A | there is a clause A ← L1 ∧ . . . ∧ Ln ∈ P such that I |= L1 ∧ . . . ∧ Ln}.

I I is a supported model iff TP(I) = I.
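TP and its iteration are a few lines of code. A sketch (the (head, positive body, negative body) clause representation is mine):

```python
def T_P(program, I):
    """Immediate consequence operator for propositional programs."""
    return {head for head, pos, neg in program
            if all(a in I for a in pos) and all(a not in I for a in neg)}

# P = {p, q <- p, r <- q}
P = [('p', [], []), ('q', ['p'], []), ('r', ['q'], [])]

I = set()
while True:                      # iterate T_P from the empty interpretation
    J = T_P(P, I)
    print(sorted(J))             # {p}, {p,q}, {p,q,r}, {p,q,r}
    if J == I:
        break                    # fixed point, i.e., a supported model
    I = J
```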


Exercises

I Consider P = {p, q ← p, r ← q}.

. Draw the lattice of all interpretations of P wrt the ⊆ ordering.

. Mark the models of P.

. Compute TP(∅), TP(TP(∅)), . . ..

. Mark the supported models of P.

I Let P be a definite program.

. Show that if M1 and M2 are models of P then so is M1 ∩ M2.

. Let M be the least model of P. Show that M is a supported model.


The Core Method

I Let L be a logic language.

I Given a logic program P together with its immediate consequence operator TP.

I Let I be the set of interpretations for P.

I Find a mapping R : I → Rn.

I Construct a feed-forward network computing fP : Rn → Rn, called the core, such that the following holds:

. If TP(I) = J then fP(R(I)) = R(J), where I, J ∈ I.

. If fP(~s) = ~t then TP(R−1(~s)) = R−1(~t), where ~s,~t ∈ Rn.

I Connect the units in the output layer recursively to the units in the input layer.

I Show that the following holds

. I = lfp (TP) iff the recurrent network converges to or approximates R(I).


Connectionist model generation using recurrent networks with a feed-forward core.


3-Layer Recurrent Networks

[Figure: a 3-layer feed-forward core — input layer, hidden layer, output layer — with the units of the output layer connected recurrently back to the units of the input layer.]

I At each point in time all units do:

. apply the activation function to obtain the potential,

. apply the output function to obtain the output.


Propositional Core Method using Binary Threshold Units

I Let L be the language of propositional logic over a set V of variables.

I Let P be a propositional logic program, e.g.,

P = {p, r ← p ∧ ¬q, r ← ¬p ∧ q}.

I I = 2V is the set of interpretations for P.

I TP(I) = {A | A ← L1 ∧ . . . ∧ Lm ∈ P such that I |= L1 ∧ . . . ∧ Lm}.

TP(∅) = {p}
TP({p}) = {p, r}
TP({p, r}) = {p, r} = lfp(TP)


Representing Interpretations

I I = 2V.

I Let n = |V| and identify V with {1, . . . , n}.

I Define R : I → Rn such that for all 1 ≤ j ≤ n we find:

R(I)[j] = 1 if j ∈ I, 0 if j ∉ I.

E.g., if V = {p, q, r} = {1, 2, 3} and I = {p, r} then R(I) = (1, 0, 1).

I Other encodings are possible, e.g.,

R(I)[j] = 1 if j ∈ I, −1 if j ∉ I.


Computing the Core

I Consider again P = {p, r ← p ∧ ¬q, r ← ¬p ∧ q}.

I A translation algorithm translates P into a core of binary threshold units:

[Figure: input layer units p, q, r with thresholds 1/2, one hidden unit per clause, and output layer units p, q, r with thresholds ω/2. Positive body literals connect with weight ω, negative ones with weight −ω; the hidden unit of the fact p has threshold −ω/2, the hidden units of the two rules have threshold ω/2, and each hidden unit connects to the output unit of its head with weight ω.]

I Exercise Specify the core for {p1 ← p2, p1 ← p3 ∧ p4, p1 ← p5 ∧ p6}.
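A compact rendering of the translation plus the recurrent iteration (a sketch; the hidden-unit threshold (|positive body| − 1/2) · ω follows the figure above, and the clause encoding is the same as in the earlier snippet):

```python
import numpy as np

# P = {p, r <- p & ~q, r <- ~p & q} as (head, positives, negatives)
P = [('r', ['p'], ['q']), ('r', ['q'], ['p']), ('p', [], [])]
V = ['p', 'q', 'r']
omega = 1.0

def core(I_vec):
    """One pass through the core: one hidden threshold unit per clause."""
    out = np.zeros(len(V))
    for head, pos, neg in P:
        # hidden unit: threshold (|pos| - 1/2) * omega, weights +-omega
        potential = sum(omega * I_vec[V.index(a)] for a in pos) \
                  - sum(omega * I_vec[V.index(a)] for a in neg)
        if potential >= (len(pos) - 0.5) * omega:
            out[V.index(head)] = 1.0   # output unit: threshold omega/2, weight omega
    return out

v = np.zeros(3)                        # R(empty interpretation)
for _ in range(4):                     # recurrent connections
    v = core(v)
    print({V[j] for j in range(3) if v[j] == 1})   # {p}, {p,r}, {p,r}, ...
```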


Some Results

I Proposition 2-layer networks cannot compute TP for definite P.

I Theorem For each program P, there exists a core computing TP.

I Recall P = {p, r ← p ∧ ¬q, r ← ¬p ∧ q}.

I Adding recurrent connections:

[Figure: the core for P from the previous slide with each output unit fed back to the corresponding input unit with weight 1; the trace of network states corresponds to the iteration TP(∅) = {p}, TP({p}) = {p, r}, TP({p, r}) = {p, r}.]


Strongly Determined Programs

I A logic program P is said to be strongly determined if there exists a metric d on the set of all Herbrand interpretations for P such that TP is a contraction wrt d.

I Exercise Are the following programs strongly determined?

. {p, q ← p, r ← q},

. {p1 ← p2, p1 ← p3 ∧ p4, p1 ← p5 ∧ p6},

. {p ← ¬p}.

I Corollary Let P be a strongly determined program. Then there exists a core with recurrent connections such that the computation with an arbitrary initial input converges and yields the unique fixed point of TP.


Time and Space Complexity

I Let n be the number of clauses and m be the number of propositional variables occurring in P.

. 2m + n units, 2mn connections in the core.

. TP(I) is computed in 2 steps.

. The parallel computational model to compute TP(I) is optimal.

. The recurrent network settles down in 3n steps in the worst case.

I Exercise Give an example of a program with worst case time behavior.


Rule Extraction (1)

I Proposition For each core C there exists a program P such that C computes TP.

[Figure: a core with input units u1, u2, hidden units u3, u4, u5, and output units u6, u7, whose weights and thresholds are arbitrary reals (thresholds −0.2 and 0.2 on the output units, −0.4, 0.3 and 0.6 on the hidden units; connection weights 2, 0.7, 0, −1, −0.2, 1, 1, −2, −0.5, 1.5, 0.3, 0.8).]

v1 v2 | p3 v3 | p4 v4 | p5 v5 | p6 v6 | p7 v7
0  0  | 0   0 | 0   1 | 0   0 | 0   1 | −1  0
0  1  | 1.5 1 | .3  1 | .8  1 | 1.8 1 | .7  1
1  0  | 1   1 | −1  0 | −.5 0 | 2   1 | .7  1
1  1  | 2.5 1 | −.7 0 | .3  0 | 2   1 | .7  1


Rule Extraction (2)

I Extracted program:

P = { q1 ← ¬q1 ∧ ¬q2,
q1 ← ¬q1 ∧ q2, q2 ← ¬q1 ∧ q2,
q1 ← q1 ∧ ¬q2, q2 ← q1 ∧ ¬q2,
q1 ← q1 ∧ q2, q2 ← q1 ∧ q2 }.

I Simplified form: P = {q1, q2 ← q1, q2 ← ¬q1 ∧ q2}.

I One can do much better than this simple approach (see Mayer-Eichberger 2006).


Literature

I Apt, van Emden 1982: Contributions to the Theory of Logic Programming. Journal of the ACM 29, 841-862.

I Fitting 1994: Metric Methods – Three Examples and a Theorem. Journal of Logic Programming 21, 113-127.

I Funahashi 1989: On the Approximate Realization of Continuous Mappings by Neural Networks. Neural Networks 2, 183-192.

I Hitzler, Hölldobler, Seda 2004: Logic Programs and Connectionist Networks. Journal of Applied Logic 2, 245-272.

I Hölldobler, Kalinke 1994: Towards a Massively Parallel Computational Model for Logic Programming. In: Proceedings of the ECAI94 Workshop on Combining Symbolic and Connectionist Processing, 68-77.

I Hölldobler, Kalinke, Störr 1999: Approximating the Semantics of Logic Programs by Recurrent Neural Networks. Applied Intelligence 11, 45-59.

I Mayer-Eichberger 2006: Extracting Propositional Logic Programs from Neural Networks: A Decompositional Approach. Bachelor Thesis, TU Dresden.


3-Layer Feed-Forward Networks Revisited

I Theorem (Funahashi 1989) Suppose that Ψ : R → R is non-constant, bounded, monotone increasing and continuous. Let K ⊆ Rn be compact, let f : K → R be continuous, and let ε > 0. Then there exists a 3-layer feed-forward network with output function Ψ for the hidden layer and linear output function for the input and output layer whose input-output mapping f̄ : K → R satisfies

max_{x∈K} |f(x) − f̄(x)| < ε.

. Every continuous function f : K → R can be uniformly approximated by input-output functions of 3-layer feed-forward networks.

I uk is a sigmoidal unit if

Φ(~ik) = pk = ∑_{j=1}^{m} wkj vj

Ψ(pk) = vk = 1 / (1 + e^{β(θk − pk)})

where θk ∈ R is a threshold (or bias) and β > 0 a steepness parameter.


Backpropagation

I Bryson, Ho 1969, Werbos 1974, Parker 1985, Rumelhart et al. 1986: Can 3-layer feed-forward networks learn a particular function?

I Training set of input-output pairs {(~il, ~ol) | 1 ≤ l ≤ n}.

I Minimize E = ∑_l El where El = (1/2) ∑_k (olk − vlk)².

I Gradient descent algorithm to learn appropriate weights.

I Backpropagation

. Initialize weights arbitrarily.

. Do until all input-output patterns are correctly classified:

1 Present input pattern ~il at time t.
2 Compute output pattern ~vl at time t + 2.
3 Change weights according to ∆wlij = η δli vlj, where

δli = Ψ′i(pli) × (oli − vli) if i is an output unit,
δli = Ψ′i(pli) × ∑_k δlk wki if i is a hidden unit,

and η > 0 is called the learning rate.
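A minimal backpropagation sketch with sigmoidal units (illustrative only; XOR as the training set and all hyperparameters are my choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
O = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 0.5, (2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(0, 0.5, (4, 1)); b2 = np.zeros(1)   # hidden -> output
sig = lambda p: 1.0 / (1.0 + np.exp(-p))
eta = 2.0                                            # learning rate

for epoch in range(5000):
    h = sig(X @ W1 + b1)                 # hidden values
    v = sig(h @ W2 + b2)                 # output values
    delta_out = v * (1 - v) * (O - v)    # Psi'(p) * (o - v)
    delta_hid = h * (1 - h) * (delta_out @ W2.T)
    W2 += eta * h.T @ delta_out; b2 += eta * delta_out.sum(0)
    W1 += eta * X.T @ delta_hid; b1 += eta * delta_hid.sum(0)

print(np.round(v, 2).ravel())            # typically close to [0, 1, 1, 0]
```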


Output Functions Revisited

I Remember the sigmoidal function (with β = 1):

vi = 1 / (1 + e^{−(∑_j wij vj + θi)})

I We find

dvi / d(∑_j wij vj + θi) = vi (1 − vi).

I Hence

δli = vli (1 − vli) (oli − vli) if ui is an output unit,
δli = vli (1 − vli) ∑_k δlk wki if ui is a hidden unit.

I Units are active if vi ≥ 0.9 and passive if vi ≤ 0.1.


Properties

I Learning rate η:

. If η is large, then the system learns rapidly but may oscillate.

. If η is small, then the system learns slowly but will not oscillate.

. In the ideal case η should be adapted during learning:

∆wij(t + 1) = η δi(t) vj(t) + α ∆wij(t)

where α is a constant and α ∆wij(t) is called the momentum term.

I Almost all functions can be learned.

I Learning is NP-hard.

I Literature Rumelhart et al. 1986: Parallel Distributed Processing. MIT Press.


Level Mappings and Hierarchical Logic Programs

I Let V be a set of propositional variables and P be a propositional logic program wrt V.

I A level mapping for P is a function l : V → N.

. We define l(¬A) = l(A).

I P is hierarchical if for all clauses A ← L1 ∧ . . . ∧ Ln ∈ P we find l(A) > l(Li) for all 1 ≤ i ≤ n.


Knowledge Based Artificial Neural Networks

I Towell, Shavlik 1994: Can we do better than empirical learning?

I Sets of hierarchical logic programs, e.g.,

P = {A ← B ∧ C ∧ ¬D, A ← D ∧ ¬E, H ← F ∧ G, K ← A ∧ ¬H}.

[Figure: the network obtained from P; input units B, C, D, E, F, G at the bottom. Each clause becomes a hidden unit with weight ω for positive and −ω for negative body literals (thresholds 3ω/2, ω/2, 3ω/2 and ω/2 for the four clause bodies), and the units for A, H and K collect the hidden units of their clauses with weight ω and threshold ω/2.]


Knowledge Based Artificial Neural Networks – Learning

I Given hierarchical sets of propositional rules as background knowledge (a code sketch of the following steps is given after this list).

I Map rules into multi-layer feed forward networks with sigmoidal units.

I Add hidden units (optional).

I Add units for known input features that are not referenced in the rules.

I Fully connect layers.

I Add near-zero random numbers to all links and thresholds.

I Apply backpropagation.

. Empirical evaluation: the system performs better than purely empirical and purely hand-built classifiers.
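A sketch of the rule-to-network translation under the weight scheme recoverable from these slides (weight ω per condition, −ω for negated conditions, threshold (2p − 1)ω/2 for a rule unit with p positive conditions, ω/2 for a unit combining several rules); the clause representation and the value ω = 4 are assumptions.

```python
OMEGA = 4.0   # assumed weight magnitude

def kbann_init(program):
    """Map (head, body) rules to initial weights {(unit, source): w}
    and thresholds {unit: theta} of a feed-forward network."""
    weights, theta, rule_units = {}, {}, {}
    for idx, (head, body) in enumerate(program):
        unit = f"{head}#{idx}"                    # one unit per rule
        rule_units.setdefault(head, []).append(unit)
        p = sum(not lit.startswith('~') for lit in body)
        for lit in body:
            weights[(unit, lit.lstrip('~'))] = -OMEGA if lit.startswith('~') else OMEGA
        theta[unit] = (2 * p - 1) * OMEGA / 2
    for head, units in rule_units.items():        # combine rules with the same head
        for u in units:
            weights[(head, u)] = OMEGA
        theta[head] = OMEGA / 2
    return weights, theta

P = [('A', ['B', 'C', '~D']), ('A', ['D', '~E']),
     ('H', ['F', 'G']), ('K', ['A', '~H'])]
W, theta = kbann_init(P)
print(theta['A#0'], theta['A#1'], theta['A'])     # 6.0 2.0 2.0, i.e. 3w/2, w/2, w/2
```

Afterwards the layers are fully connected, near-zero random numbers are added to all links and thresholds, and backpropagation is applied, exactly as listed above.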


Knowledge Based Artificial Neural Networks – A Problem

I Works if rules have few conditions and there are few rules with the same head.

[Figure: units A and B each represent a rule with ten conditions (A1, . . . , A10 and B1, . . . , B10), all connections with weight ω and thresholds 19ω/2 = 9.5ω; unit C represents the disjunction of A and B with weights ω and threshold ω/2.]

I Suppose nine of the ten conditions of each rule are satisfied. Then pA = pB = 9ω and vA = vB = 1/(1 + e^{β(9.5ω−9ω)}) ≈ 0.46 with β = 1.

I pC = 0.92ω and vC = 1/(1 + e^{β(0.5ω−0.92ω)}) ≈ 0.6 with β = 1.

I Hence C is closer to active than to passive although neither rule is satisfied.

I Literature Towell, Shavlik 1994: Knowledge Based Artificial Neural Networks. Artificial Intelligence 70, 119-165.


Propositional Core Method using Bipolar Sigmoidal Units

I d’Avila Garcez, Zaverucha, Carvalho 1997: Can we combine the ideas in Holldobler, Kalinke 1994 and Towell, Shavlik 1994 while avoiding the above-mentioned problem?

I Consider a propositional logic language.

I Let I be an interpretation and a ∈ [0, 1].

R(I)[j] = v ∈ [a, 1] if j ∈ I,  w ∈ [−1, −a] if j ∉ I.

I Replace threshold and sigmoidal units by bipolar sigmoidal ones, i.e., units with

Φ(~i_k) = p_k = Σ_{j=1}^{m} w_kj v_j,
Ψ(p_k) = v_k = 2/(1 + e^{β(θ_k − p_k)}) − 1,

where θ_k ∈ R is a threshold (or bias) and β > 0 a steepness parameter.
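A direct transcription of such a unit (a sketch; the default β = 1 is an assumption):

```python
import math

def bipolar_unit(weights, inputs, theta, beta=1.0):
    """p_k = sum_j w_kj v_j;  Psi(p_k) = 2 / (1 + e^{beta (theta_k - p_k)}) - 1."""
    p = sum(w * v for w, v in zip(weights, inputs))
    return 2.0 / (1.0 + math.exp(beta * (theta - p))) - 1.0
```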


The Task

I How should a, ω and θi be selected such that:

. vi ∈ [a, 1] or vi ∈ [−1,−a] and

. the core computes the immediate consequence operator?


Hidden Layer Units

I Consider A ← L1 ∧ . . . ∧ Ln.

I Let u be the hidden layer unit for this rule.

. Suppose I |= L1 ∧ . . . ∧ Ln.

• u receives input ≥ ωa from each unit representing an Li.
• pu ≥ nωa = p_u^+.

. Suppose I ⊭ L1 ∧ . . . ∧ Ln.

• u receives input ≤ −ωa from at least one unit representing an Li.
• pu ≤ (n − 1)ω · 1 − ωa = p_u^−.

I θu = (nωa + (n − 1)ω − ωa)/2 = (na + n − 1 − a)ω/2 = (n − 1)(a + 1)ω/2.


Output Layer Units

I Let µ be the number of clauses with head A.

I Consider A ← L1 ∧ . . . ∧ Ln.

I Suppose I |= L1 ∧ . . . ∧ Ln.

. pA ≥ ωa + (µ − 1)ω(−1) = ωa − (µ − 1)ω = p_A^+.

I Suppose for all rules of the form A ← L1 ∧ . . . ∧ Ln we find I ⊭ L1 ∧ . . . ∧ Ln.

. pA ≤ −µωa = p_A^−.

I θA = (ωa − (µ − 1)ω − µωa)/2 = (a − µ + 1 − µa)ω/2 = (1 − µ)(a + 1)ω/2.


Computing a Value for a

I p_u^+ > p_u^−:

. nωa > (n − 1)ω − ωa.
. nωa + ωa > (n − 1)ω.
. a(n + 1)ω > (n − 1)ω.
. a > (n − 1)/(n + 1).

I p_A^+ > p_A^−:

. ωa − (µ − 1)ω > −µaω.
. ωa + µaω > (µ − 1)ω.
. a(1 + µ)ω > (µ − 1)ω.
. a > (µ − 1)/(µ + 1).

I Considering all rules yields the minimum value for a.


Computing a Value for ω

I Ψ(p) = 2/(1 + e^{β(θ−p)}) − 1 ≥ a.

I 2/(1 + e^{β(θ−p)}) ≥ 1 + a.

I 2/(1 + a) ≥ 1 + e^{β(θ−p)}.

I 2/(1 + a) − 1 = 2/(1 + a) − (1 + a)/(1 + a) = (1 − a)/(1 + a) ≥ e^{β(θ−p)}.

I ln((1 − a)/(1 + a)) ≥ β(θ − p).

I (1/β) ln((1 − a)/(1 + a)) ≥ θ − p.

I Consider a hidden layer unit:

. (1/β) ln((1 − a)/(1 + a)) ≥ (n − 1)(a + 1)ω/2 − nωa = (na + n − a − 1 − 2na)ω/2 = (n − 1 − a(n + 1))ω/2.

. ω ≥ 2 ln((1 − a)/(1 + a)) / ((n − 1 − a(n + 1))β), because a > (n − 1)/(n + 1) makes the factor n − 1 − a(n + 1) negative and flips the inequality.

I Considering all hidden and output layer units as well as the case Ψ(p) ≤ −a yields the

minimum value for ω.
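Putting the last three slides together, suitable values can be computed mechanically. A minimal sketch (clause representation as before; the small safety margin on a and the helper names are assumptions), demonstrated on the program from the exercises below:

```python
import math

def core_parameters(program, beta=1.0):
    """Compute a, omega and the thresholds theta for the bipolar-sigmoid core."""
    n_max = max(len(body) for _, body in program)      # longest rule body
    mu = {}
    for head, _ in program:
        mu[head] = mu.get(head, 0) + 1
    mu_max = max(mu.values())                          # most rules per head
    # a > (n-1)/(n+1) and a > (mu-1)/(mu+1) for all rules and heads:
    a = max((n_max - 1) / (n_max + 1), (mu_max - 1) / (mu_max + 1)) + 0.01
    # omega >= 2 ln((1-a)/(1+a)) / ((k - 1 - a(k+1)) beta) for k = n and k = mu:
    bound = lambda k: 2 * math.log((1 - a) / (1 + a)) / ((k - 1 - a * (k + 1)) * beta)
    omega = max(bound(n_max), bound(mu_max))
    theta = {}
    for idx, (head, body) in enumerate(program):
        n = len(body)
        theta[f"{head}#{idx}"] = (n - 1) * (a + 1) * omega / 2   # hidden unit
        theta[head] = (1 - mu[head]) * (a + 1) * omega / 2       # output unit
    return a, omega, theta

P = [('r', ['p', '~q']), ('r', ['~p', 'q']), ('p', ['s', 't'])]
a, omega, theta = core_parameters(P)
```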


Exercises

I Show that hierarchical programs are strongly determined.

I Consider P = {r ← p ∧ ¬q, r ← ¬p ∧ q, p ← s ∧ t}.

. Compute values for a, ω and θi.

. Specify the core for P.

. How can the approach be extended to handle facts like s and t?

I Consider now P′ = P ∪ {s, t}, where P is as before.

. Show that P′ is strongly determined.

. Show that the recurrent network computes the least model of P′.


Results

I Relation to logic programs is preserved.

I The core is trainable by backpropagation.

I Many interesting applications, e.g.:

. DNA sequence analysis.

. Power system fault diagnosis.

I Empirical evaluation: the system performs better than well-known machine learning systems.

I See d’Avila Garcez, Broda, Gabbay 2002 for details.

I Literature

. d’Avila Garcez, Zaverucha, Carvalho 1997: Logic Programming and Inductive Inference in Artificial Neural Networks. In: Knowledge Representation in Neural Networks, Logos, Berlin, 33-46.

. d’Avila Garcez, Broda, Gabbay 2002: Neural-Symbolic Learning Systems: Foundations and Applications, Springer.


Further Extensions

I Many-valued logic programs

I Modal logic programs

I Answer set programming

I Metalevel priorities

I Rule extraction


Propositional Core Method – Three-Valued Logic Programs

I Kalinke 1994: Consider truth values >, ⊥, u.

I Interpretations are pairs I = 〈I+, I−〉.

I Immediate consequence operator ΦP(I) = 〈J+, J−〉, where

J+ = {A | A ← L1 ∧ . . . ∧ Lm ∈ P and I(L1 ∧ . . . ∧ Lm) = >},
J− = {A | for all A ← L1 ∧ . . . ∧ Lm ∈ P : I(L1 ∧ . . . ∧ Lm) = ⊥}.

I Let n = |V| and identify V with {1, . . . , n}.

I Define R : I → R^2n as follows:

R(I)[2j − 1] = 1 if j ∈ I+, 0 if j ∉ I+;   R(I)[2j] = 1 if j ∈ I−, 0 if j ∉ I−.
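A sketch of ΦP and the truth evaluation it relies on (representation assumed: clauses as (head, body) pairs, V the set of variables, and I given by the two sets I+ and I−):

```python
def lit_value(lit, Ipos, Ineg):
    """Three-valued value of a literal under <I+, I->: 'T', 'F' or 'U'."""
    A, neg = lit.lstrip('~'), lit.startswith('~')
    if A in Ipos: return 'F' if neg else 'T'
    if A in Ineg: return 'T' if neg else 'F'
    return 'U'

def body_value(body, Ipos, Ineg):
    vals = [lit_value(l, Ipos, Ineg) for l in body]
    if all(v == 'T' for v in vals): return 'T'
    if any(v == 'F' for v in vals): return 'F'
    return 'U'

def phi(program, V, Ipos, Ineg):
    """Phi_P(<I+, I->) = <J+, J->."""
    Jpos = {h for h, b in program if body_value(b, Ipos, Ineg) == 'T'}
    Jneg = {A for A in V
            if all(body_value(b, Ipos, Ineg) == 'F'
                   for h, b in program if h == A)}
    return Jpos, Jneg
```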


Propositional Core Method – Multi-Valued Logic Programs

I For each program P, there exists a core computing ΦP, e.g.,

P = {C ← A ∧ ¬B, D ← C ∧ E, D ← ¬C}.

[Figure: the core for P, built up incrementally. The input and output layers each contain the unit pairs A, ¬A, B, ¬B, C, ¬C, D, ¬D, E, ¬E with thresholds 1/2; the hidden layer contains one unit per clause with thresholds ω/2 and 3ω/2 as indicated, connected to the corresponding input and output units, including the ±ω/2 links handling negation.]

I Lane, Seda 2004: Extension to finitely determined sets of truth values.


Propositional Core Method – Modal Logic Programs

I d’Avila Garcez, Lamb, Gabbay 2002.

I Let L be a propositional logic language plus

. the modalities □ and ◇, and

. a finite set of labels w1, . . . , wk denoting worlds.

I Let B be an atom; then □B and ◇B are modal atoms.

I A modal definite logic program P is a set of clauses of the form

wi : A ← A1 ∧ . . . ∧ Am

together with a finite set of relations wi I wj, where wi, wj, 1 ≤ i, j ≤ k, are labels and A, A1, . . . , Am are atoms or modal atoms.

I P = ∪_{i=1}^{k} Pi, where Pi consists of all clauses labelled with wi.


Modal Logic Programs – Semantics

I Example: P = {w1 : A, w1 : ◇C ← A} ∪ {w2 : B} ∪ {w3 : B} ∪ {w4 : B} ∪ {w1 I w2, w1 I w3, w1 I w4, w2 I w4}

I Kripke semantics:

[Figure: worlds w1, w2, w3, w4 with accessibility arrows w1 → w2, w1 → w3, w1 → w4 and w2 → w4. Successive builds annotate the worlds with derived (modal) atoms, e.g. A, □B, ◇B and ◇C at w1, B at w2, w3, w4, and vacuous □-atoms at the worlds without successors, until C is placed at w4, witnessed by the choice fC(w1) = w4.]


Modal Immediate Consequence Operator

I Interpretations are tuples I = 〈I1, . . . , Ik〉.

I Immediate consequence operator MTP(I) = 〈J1, . . . , Jk〉, where

Ji = {A | there exists A ← A1 ∧ . . . ∧ Am ∈ Pi such that {A1, . . . , Am} ⊆ Ii}
 ∪ {◇A | there exists wi I wj ∈ P and A ∈ Ij}
 ∪ {□A | for all wi I wj ∈ P we find A ∈ Ij}
 ∪ {A | there exists wj I wi ∈ P and □A ∈ Ij}
 ∪ {A | there exists wj I wi ∈ P, ◇A ∈ Ij and fA(wj) = wi}
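A sketch of one application of MTP over a finite atom universe (the data layout, i.e. clauses per world, accessibility as a set of pairs and the choice functions fA as nested dicts, is an assumption):

```python
def mtp(atoms, clauses, acc, f, I):
    """MT_P(<I_1, ..., I_k>) = <J_1, ..., J_k>.
    clauses: dict world -> [(head, body)], elements 'A', ('box','A') or ('dia','A');
    acc: set of pairs (wi, wj); f: dict atom -> {wj: chosen world};
    I: dict world -> set of (modal) atoms."""
    J = {w: set() for w in I}
    for wi in I:
        # local rules whose body holds in I_i
        J[wi] |= {h for h, body in clauses.get(wi, []) if set(body) <= I[wi]}
        succ = {wj for v, wj in acc if v == wi}
        pred = {wj for wj, v in acc if v == wi}
        # dia A if A holds in some accessible world, box A if in all of them
        J[wi] |= {('dia', A) for A in atoms if any(A in I[wj] for wj in succ)}
        J[wi] |= {('box', A) for A in atoms if all(A in I[wj] for wj in succ)}
        # A propagates from box A in a predecessor, or from dia A with f_A(wj) = wi
        J[wi] |= {A for A in atoms if any(('box', A) in I[wj] for wj in pred)}
        J[wi] |= {A for A in atoms
                  if any(('dia', A) in I[wj] and f.get(A, {}).get(wj) == wi
                         for wj in pred)}
    return J
```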


Modal Logic Programs – The Translation Algorithm

I Let n = |V| and identify V with {1, . . . , n}.

I Let a ∈ [0, 1].

I Define R : I → R^3n as follows:

R(I)[3j − 2] = v ∈ [a, 1] if j ∈ Ii, w ∈ [−1, −a] if j ∉ Ii
R(I)[3j − 1] = v ∈ [a, 1] if □j ∈ Ii, w ∈ [−1, −a] if □j ∉ Ii
R(I)[3j] = v ∈ [a, 1] if ◇j ∈ Ii, w ∈ [−1, −a] if ◇j ∉ Ii

I Translation algorithm such that

. for each world the “local” part of MTP is computed by a core,

. the cores are turned into recurrent networks, and

. the cores are connected with respect to the given set of relations.


The Example Network

[Figure: one core per world w1, . . . , w4, each with input and output units for A, □A, ◇A, B, □B, ◇B, C, □C, ◇C; the cores are made recurrent, and they are connected across worlds by ∧- and ∨-units implementing the □ and ◇ cases of MTP along the given accessibility relation.]


First-Order Logic

I Existing Approaches

. Reflexive Reasoning and SHRUTI

. Connectionist Term Representations

• Holographic Reduced Representations, Plate 1991
• Recursive Auto-Associative Memory, Pollack 1988

. Horn logic and CHCL Holldobler 1990, Holldobler, Kurfess 1992

. Other Approaches

I First-Order Logic Programs and the Core Method

. Initial Approach

. Construction of Approximating Networks

. Topological Analysis and Generalisations

. Employing Iterated Function Systems


Literature

I Holldobler 1990: A Structured Connectionist Unification Algorithm. In: Proceedings of the AAAI National Conference on Artificial Intelligence, 587-593.

I Holldobler, Kurfess 1992: CHCL – A Connectionist Inference System. In: Parallelization in Inference Systems, Lecture Notes in Artificial Intelligence 590, 318-342.

I Plate 1991: Holographic Reduced Representations. In: Proceedings of the International Joint Conference on Artificial Intelligence, 30-35.

I Pollack 1988: Recursive Auto-Associative Memory: Devising Compositional Distributed Representations. In: Proceedings of the Annual Conference of the Cognitive Science Society, 33-39.


Reflexive Reasoning

I Humans are capable of performing a wide variety of cognitive tasks with extreme ease and efficiency.

I For traditional AI systems, the same problems turn out to be intractable.

I Human consensus knowledge: about 10^8 rules and facts.

I Wanted: “Reflexive” decisions within sublinear time.

I Shastri, Ajjanagadde 1993: SHRUTI.


SHRUTI – Knowledge Base

I Finite set of constants C, finite set of variables V .

I Rules:

. (∀X1 . . . Xm) (p1(. . .) ∧ . . . ∧ pn(. . .) → (∃Y1 . . . Yk) p(. . .)).

. p, pi, 1 ≤ i ≤ n, are multi-place predicate symbols.

. Arguments of the pi: variables from {X1, . . . , Xm} ⊆ V .

. Arguments of p are from {X1, . . . , Xm} ∪ {Y1, . . . , Yk} ∪ C.

. {Y1, . . . , Yk} ⊆ V .

. {X1, . . . , Xm} ∩ {Y1, . . . , Yk} = ∅.

I Facts and queries (goals):

. (∃Z1 . . . Zl) q(. . .).

. Multi-place predicate symbol q.

. Arguments of q are from {Z1, . . . , Zl} ∪ C.

. {Z1, . . . , Zl} ⊆ V .


Further Restrictions

I Restrictions to rules, facts, and goals:

. No function symbols except constants.

. Only universally bound variables may occur as arguments in the conditions of a rule.

. All variables occurring in a fact or goal occur only once and are existentially bound.

. An existentially quantified variable is only unified with variables.

. A variable which occurs more than once in the conditions of a rule must occur in the conclusion of the rule and must be bound when the conclusion is unified with a goal.

. A rule is used only a fixed number of times.

Incompleteness.


SHRUTI – Example

I Rules P = { owns(Y, Z) ← gives(X, Y, Z),
  owns(X, Y) ← buys(X, Y),
  can-sell(X, Y) ← owns(X, Y),
  gives(john, josephine, book),
  (∃X) buys(john, X),
  owns(josephine, ball) }.

I Queries:
  can-sell(josephine, book) ⇒ yes
  (∃X) owns(josephine, X) ⇒ yes, {X ↦ book}, {X ↦ ball}


SHRUTI : The Network

[Figure: the SHRUTI network for the example, built up incrementally. Each of the predicates gives, buys, owns and can-sell has a cluster with an enabler and a collector unit plus one unit per argument; the argument units are wired according to the rules (can-sell to owns, owns to gives and buys), and the constants john, josephine, ball and book feed into the argument units ("from john", "from jos.", "from book"). Successive builds trace the query can-sell(josephine, book): activation propagates through the rule connections, and bindings are represented by units firing in the same phase.]


Solving the Variable Binding Problem

[Figure: activation traces over time for the constants book, john, ball and josephine and for the enabler, collector and argument units of can-sell, owns, gives and buys; a constant is bound to an argument by firing in the same phase.]


SHRUTI – Remarks

I Answers are derived in time proportional to depth of search space.

I The number of units as well as of connections is linear in the size of the knowledge base.

I Extensions:

. compute answer substitutions

. allow a fixed number of copies of rules

. allow multiple literals in the body of a rule

. built in a taxonomy

I ROBIN (Lange, Dyer 1989): signatures instead of phases.

I Biological plausibility.

I Trading expressiveness for time and size.

I Logical reconstruction by Beringer, Holldobler 1993:

. Reflexive reasoning is reasoning by reduction.


Literature

I Beringer, Holldobler 1993: On the Adequateness of the Connection Method. In: Proceedings of the AAAI National Conference on Artificial Intelligence, 9-14.

I Shastri, Ajjanagadde 1993: From Associations to Systematic Reasoning: A Connectionist Representation of Rules, Variables and Dynamic Bindings using Temporal Synchrony. Behavioural and Brain Sciences 16, 417-494.

I Lange, Dyer 1989: High-Level Inferencing in a Connectionist Network. Connection Science 1, 181-217.


First-Order Logic Programs and the Core Method

I Initial Approach

I Construction of Approximating Networks

I Topological Analysis and Generalisations

I Employing Iterated Function Systems


Logic Programs

I A logic program P over a first-order language L is a finite set of clauses

A← L1 ∧ . . . ∧ Ln,

where A is an atom, Li are literals and n ≥ 0.

I BL is the set of all ground atoms over L, called the Herbrand base.

I A Herbrand interpretation I is a mapping BL → {>, ⊥}.

I 2^BL is the set of all Herbrand interpretations.

I ground(P) is the set of all ground instances of clauses in P.

I Immediate consequence operator TP : 2^BL → 2^BL:

TP(I) = {A | there is a clause A ← L1 ∧ . . . ∧ Ln ∈ ground(P) such that I |= L1 ∧ . . . ∧ Ln}.

I I is a supported model iff TP(I) = I.
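For a finite set of ground clauses, TP can be computed directly; a minimal sketch (clause representation as in the propositional part; the example program is illustrative):

```python
def tp(ground_program, I):
    """One application of T_P to a set of ground atoms I.
    A literal 'A' holds iff A in I; '~A' holds iff A not in I."""
    def holds(lit):
        return (lit.lstrip('~') in I) != lit.startswith('~')
    return {head for head, body in ground_program if all(map(holds, body))}

# Iterating from the empty interpretation, e.g. for P = {p, q <- p, r <- q ^ ~s}:
P = [('p', []), ('q', ['p']), ('r', ['q', '~s'])]
I = set()
for _ in range(4):
    I = tp(P, I)
print(sorted(I))   # ['p', 'q', 'r'], a supported model: tp(P, I) == I
```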


The Initial Approach

I Holldobler, Kalinke, Storr 1999: Can the core method be extended to first-order logic programs?

I Problem

. Given a logic program P over a first order language L together with TP : 2^BL → 2^BL.

. BL is countably infinite.

. The method used to relate propositional logic and connectionist systems is not applicable.

. How can the gap between the discrete, symbolic setting of logic and the continuous, real-valued setting of connectionist networks be closed?


The Goal

I Find R : 2^BL → R and fP : R → R such that the following conditions hold.

. TP(I) = I′ implies fP(R(I)) = R(I′), and fP(x) = x′ implies TP(R^−1(x)) = R^−1(x′).

fP is a sound and complete encoding of TP.

. TP is a contraction on 2^BL iff fP is a contraction on R.

The contraction property and fixed points are preserved.

. fP is continuous on R.

A connectionist network approximating fP is known to exist.


Acyclic Logic Programs

I Let P be a program over a first order language L.

I A level mapping for P is a function l : BL → N.

. We define l(¬A) = l(A).

I We can associate a metric dL with L and l. Let I, J ∈ 2^BL:

dL(I, J) = 0 if I = J, and dL(I, J) = 2^−n if n is the smallest level on which I and J differ.

I Proposition (Fitting 1994) (2^BL, dL) is a complete metric space.

I P is said to be acyclic wrt a level mapping l if for every A ← L1 ∧ . . . ∧ Ln ∈ ground(P) we find l(A) > l(Li) for all i.

I Proposition Let P be an acyclic logic program wrt l and dL the metric associated with L and l; then TP is a contraction on (2^BL, dL).
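Under a bijective level mapping, dL can be evaluated to any finite depth; a sketch (the truncation depth is an assumption):

```python
def d_L(I, J, atom_of_level, max_level=64):
    """2^-n for the smallest level n on which I and J differ, 0 if none is found."""
    for n in range(1, max_level + 1):
        if (atom_of_level(n) in I) != (atom_of_level(n) in J):
            return 2.0 ** -n
    return 0.0   # I and J agree on all levels up to max_level
```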


Mapping Interpretations to Real Numbers

I Let D = {r ∈ R | r = Σ_{i=1}^∞ a_i 4^−i, where a_i ∈ {0, 1} for all i}.

I Let l be a bijective level mapping.

I {>, ⊥} can be identified with {0, 1}.

I The set of all mappings BL → {>, ⊥} can be identified with the set of all mappings N → {0, 1}.

I Let IL be the set of all mappings from BL to {0, 1}.

I Let R : IL → D be defined as

R(I) = Σ_{i=1}^∞ I(l^−1(i)) 4^−i.

I Proposition R is a bijection.

We have a sound and complete encoding of interpretations.
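A finite-precision sketch of R, using the level mapping l(q(s^n(0))) = n + 1 from the example below; the truncation depth is an assumption:

```python
def encode(I, atom_of_level, depth=20):
    """R(I) = sum_{i>=1} I(l^{-1}(i)) 4^{-i}, truncated after `depth` levels."""
    return sum(4.0 ** -i for i in range(1, depth + 1)
               if atom_of_level(i) in I)

# l(q(s^n(0))) = n + 1, hence l^{-1}(i) = q(s^{i-1}(0)):
atom_of_level = lambda i: "q(" + "s(" * (i - 1) + "0" + ")" * (i - 1) + ")"
I = {atom_of_level(i) for i in range(1, 21)}   # all of B_L, up to depth 20
print(encode(I, atom_of_level))                # ~ 1/3 = R(B_L)
```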


Mapping Immediate Consequence Operators to Functions on the Reals

I We define fP : D → D : r ↦ R(TP(R^−1(r))).

[Diagram: the square commutes, i.e. r ↦ r′ under fP corresponds to I ↦ I′ under TP via the encoding R.]

We have a sound and complete encoding of TP.

I Proposition Let P be an acyclic program wrt a bijective level mapping. Then fP is a contraction on D.

Contraction property and fixed points are preserved.


Approximating Continuous Functions

I Corollary fP is continuous.

I Recall Funahashi’s theorem:

. Every continuous function f : K → R can be uniformly approximated byinput-output functions of 3-layer feed forward networks.

I Theorem fP can be uniformly approximated by input-output functions of 3-layer feed forward networks.

. TP can be approximated as well by applying R^−1.

A connectionist network approximating the immediate consequence operator exists.


An Example

I Consider P = {q(0), q(s(X)) ← q(X)} and let l(q(s^n(0))) = n + 1.

. P is acyclic wrt l, l is bijective, R(BL) = 1/3.

. fP(R(I)) = 4^−l(q(0)) + Σ_{q(X)∈I} 4^−l(q(s(X))) = 4^−l(q(0)) + Σ_{q(X)∈I} 4^−(l(q(X))+1) = (1 + R(I))/4.

I Approximation of fP to accuracy ε yields

f(x) ∈ [(1 + x)/4 − ε, (1 + x)/4 + ε].

I Starting with some x and iterating f yields in the limit a value

r ∈ [(1 − 4ε)/3, (1 + 4ε)/3].

I Applying R^−1 to r we find

q(s^n(0)) ∈ R^−1(r) if n < −log_4 ε − 1.
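The convergence claim can be checked numerically; a sketch with an illustrative accuracy ε:

```python
import math

eps = 1e-4                          # assumed approximation accuracy
f = lambda x: (1 + x) / 4 + eps     # worst-case approximation of f_P(x) = (1+x)/4

x = 0.0
for _ in range(50):                 # iterate the approximated operator
    x = f(x)
print(x, (1 + 4 * eps) / 3)         # both ~ 0.33347: the limit lies in the interval
print(-math.log(eps, 4) - 1)        # ~ 5.64: q(s^n(0)) is recovered for n <= 5
```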


Approximation of Interpretations

I Let P be a logic program over a first order language L and l a level mapping.

I An interpretation I approximates an interpretation J to a degree n ∈ N if for all atoms A ∈ BL with l(A) < n we find I(A) = > iff J(A) = >.

. I approximates J to a degree n iff dL(I, J) ≤ 2^−n.


Approximation of Supported Models

I Given an acyclic logic program P with bijective level mapping.

I Let TP be the immediate consequence operator associated with P and MP the least supported model of P.

I We can approximate TP by a 3-layer feed forward network.

I We can turn this network into a recurrent one.

Does the recurrent network approximate the supported model of P?

I Theorem For an arbitrary m ∈ N there exists a recurrent network with sigmoidal activation functions for the hidden layer units and linear activation functions for the input and output layer units computing a function fP such that there exists an n0 ∈ N such that for all n ≥ n0 and for all x ∈ [−1, 1] we find

dL(R^−1(fP^n(x)), MP) ≤ 2^−m.


First Order Core Method – Extensions

I Detailed study of the (topological) continuity of semantic operators, Hitzler, Seda 2003 and Hitzler, Holldobler, Seda 2004:

. many-valued logics,

. larger class of logic programs,

. other approximation theorems.

I A core method for reflexive reasoning Holldobler, Kalinke, Wunderlich 2000.

I The graph of fP is an attractor of some iterated function system, Bader 2003 and Bader, Hitzler 2004:

. representation theorems,

. fractal interpolation,

. core with units computing radial basis functions.

I Finitely determined sets of truth values Lane, Seda 2004.


Constructive Approaches: Fibring Artificial Neural Networks

I Fibring function Φ associated with neuron i maps some weights w of a network to new values depending on w and the input x of i (Garcez, Gabbay 2004).

[Figure: a network in which the fibring function Φ rewrites the weights w, depending on the input x and output y of the neuron.]

I Idea: approximate fP by computing the values of atoms with level n = 1, 2, . . .

[Figure: one subnetwork per clause (Clause 1, Clause 2, . . .), a fibring function Φ and a level counter n (incremented by +1), jointly mapping I to TP(I).]

I Works well for acyclic logic programs with bijective level mapping (Bader, Garcez, Hitzler 2004).


Constructive Approaches: Approximating Piecewise Constant Functions

I Consider the graph of fP.

I Approximate fP up to a given level l.

I Construct a core computing the piecewise constant function.

. Step activation functions.
. Sigmoidal activation functions.
. Radial basis functions.

[Figure: the graph of fP approximated by a piecewise constant function on [0, 0.5]; the final build additionally plots sigmoidal and radial basis activation functions on [−3, 3].]

I Bader, Hitzler, Witzel 2005.


Open Problems

I How can first order terms be represented and manipulated in a connectionist system? Pollack 1990, Holldobler 1990, Plate 1994.

I Can the mapping R be learned? Gust, Kuhnberger 2004.

I How can first order rules be extracted from a connectionist system?

I How can multiple instances of first order rules be represented in a connectionist system? Shastri 1990.

I What does a theory for the integration of logic and connectionist systems look like?

I Can such a theory be applied in real domains outperforming conventional approaches?

I How does the core method relate to model-based reasoning approaches in cognitive science (e.g. Barnden 1989, Johnson-Laird, Byrne 1993)?


Recommended