Logic Tensor Networks

Samy Badreddine a,b,*, Artur d'Avila Garcez c, Luciano Serafini d, Michael Spranger a,b
a Sony Computer Science Laboratories Inc., 3-14-13 Higashigotanda, 141-0022, Tokyo, Japan
b Sony AI Inc., 1-7-1 Konan, 108-0075, Tokyo, Japan
c City, University of London, Northampton Square, EC1V 0HB, London, United Kingdom
d Fondazione Bruno Kessler, Via Sommarive 18, 38123, Trento, Italy

Abstract

Attempts at combining logic and neural networks into neurosymbolic approaches have been on the increase in recent years. In a neurosymbolic system, symbolic knowledge assists deep learning, which typically uses a sub-symbolic distributed representation, to learn and reason at a higher level of abstraction. We present Logic Tensor Networks (LTN), a neurosymbolic framework that supports querying, learning and reasoning with both rich data and abstract knowledge about the world. LTN introduces a fully differentiable logical language, called Real Logic, whereby the elements of a first-order logic signature are grounded onto data using neural computational graphs and first-order fuzzy logic semantics. We show that LTN provides a uniform language to represent and compute efficiently many of the most important AI tasks such as multi-label classification, relational learning, data clustering, semi-supervised learning, regression, embedding learning and query answering. We implement and illustrate each of the above tasks with several simple explanatory examples using TensorFlow 2. The results indicate that LTN can be a general and powerful framework for neurosymbolic AI.
Keywords: Neurosymbolic AI, Deep Learning and Reasoning, Many-valued Logics.

1. Introduction

Artificial Intelligence (AI) agents are required to learn from their surroundings and to reason about what has been learned in order to make decisions, act in the world, or react to various stimuli. The latest Machine Learning (ML) has mostly adopted a purely sub-symbolic learning approach. Using distributed representations of entities, the latest ML performs quick decision-making without building a comprehensible model of the world. While achieving impressive results in computer vision, natural language, game playing, and multimodal learning, such approaches are known to be data inefficient and to struggle at out-of-distribution generalization. Although the use of appropriate inductive biases can alleviate such shortcomings, in general, sub-symbolic models lack comprehensibility. By contrast, symbolic AI is based on rich, high-level representations of the world that use human-readable symbols. By rich knowledge, we refer to logical representations which are more expressive than propositional logic or propositional probabilistic approaches, and which can express knowledge using full first-order logic, including universal and existential quantification (∀x and ∃y), arbitrary n-ary relations over variables, e.g. R(x,y,z,…), and function symbols, e.g. fatherOf(x), x+y, etc. Symbolic AI has achieved success at theorem proving, logical inference, and verification. However, it also has shortcomings when dealing with incomplete knowledge. It can be inefficient with large amounts of inaccurate data and lack robustness to outliers. Purely symbolic decision algorithms usually have high computational complexity, making them impractical for the real world. It is now clear that the predominant approach to ML, where learning is based on recognizing the latent structures hidden in the data, is insufficient and may benefit from symbolic AI [17]. In this context, neurosymbolic AI, which stems from neural networks and symbolic AI, attempts to combine the strengths of both paradigms: that is to say, to combine reasoning with complex representations of knowledge (knowledge-bases, semantic networks, ontologies, trees, and graphs) with learning from complex data (images, time series, sensorimotor data, natural language) (see [16, 40, 54] for recent surveys). Consequently, a main challenge for neurosymbolic AI is the grounding of symbols, including constants, functional and relational symbols, into real data, which is akin to the longstanding symbol grounding problem [30].

*Corresponding author
Email addresses: badreddine.[email protected] (Samy Badreddine), a.[email protected] (Artur d'Avila Garcez), [email protected] (Luciano Serafini), [email protected] (Michael Spranger)

Logic Tensor Networks (LTN) are a neurosymbolic framework and computational model that supports learning and reasoning about data with rich knowledge. In LTN, one can represent and effectively compute the most important tasks of deep learning with a fully differentiable first-order logic language, called Real Logic, which adopts infinitely many truth-values in the interval [0,1] [22, 25]. In particular, LTN supports the specification and computation of the following AI tasks uniformly using the same language: data clustering, classification, relational learning, query answering, semi-supervised learning, regression, and embedding learning.
LTN and Real Logic were first introduced in [62]. Since then, LTN has been applied to different AI tasks involving perception, learning, and reasoning about relational knowledge. In [18, 19], LTN was applied to semantic image interpretation, whereby relational knowledge about objects was injected into deep networks for object relationship detection. In [6], LTN was evaluated on its capacity to perform reasoning about ontological knowledge. Furthermore, [7] shows how LTN can be used to learn an embedding of concepts into a latent real space by taking into consideration ontological knowledge about such concepts. In [3], LTN is used to annotate a reinforcement learning environment with prior knowledge and incorporate latent information into an agent. In [42], the authors embed LTN in a state-of-the-art convolutional object detector. Extensions and generalizations of LTN have also been proposed in the past years, such as LYRICS [47] and Differentiable Fuzzy Logic (DFL) [68, 69]. LYRICS provides an input language allowing one to define background knowledge using a first-order logic where predicate and function symbols are grounded onto any computational graph. DFL analyzes how a large collection of fuzzy logic operators behave in a differentiable learning setting. DFL also introduces new semantics for fuzzy logic implications, called sigmoidal implications, and it shows that such semantics outperform other semantics in several semi-supervised machine learning tasks.
This paper provides a thorough description of the full formalism and several extensions of LTN. We show, using an extensive set of explanatory examples, how LTN can be applied to solve many ML tasks with the help of logical knowledge. In particular, the earlier versions of LTN have been extended with: (1) Explicit domain declaration: constants, variables, functions and predicates are now domain-typed (e.g. the constants John and Paris can be from the domains of person and city, respectively). The definition of structured domains is also possible (e.g. the domain couple can be defined as the Cartesian product of two domains of persons); (2) Guarded quantifiers: guarded universal and existential quantifiers now allow the user to limit the quantification to the elements that satisfy some Boolean condition, e.g. ∀x : age(x) < 10 (playsPiano(x) → enfantProdige(x)) restricts the quantification to the cases where age is lower than 10; (3) Diagonal quantification: diagonal quantification allows the user to write statements about specific tuples extracted in order from n variables. For example, if the variables capital and country both have k instances, such that the i-th instance of capital corresponds to the i-th instance of country, one can write ∀ Diag(capital, country) capitalOf(capital, country).
Inspired by the work of [69], this paper also extends the product t-norm configuration of LTN with the generalized mean aggregator, and it introduces solutions to the vanishing or exploding gradient problems. Finally, the paper formally defines a semantic approach to refutation-based reasoning in Real Logic to verify if a statement is a logical consequence of a knowledge base. Example 4.8 proves that this new approach can better capture logical consequences compared to simply querying unknown formulas after learning (as done in [6]).
The new version of LTN has been implemented in TensorFlow 2 [1]. Both the LTN library and the code for the examples used in this paper are available at https://github.com/logictensornetworks/logictensornetworks.
The remainder of the paper is organized as follows: In Section 2, we define and illustrate Real Logic as a fully-differentiable first-order logic. In Section 3, we specify learning and reasoning in Real Logic and its modeling into deep networks with Logic Tensor Networks (LTN). In Section 4 we illustrate the reach of LTN by investigating a range of learning problems from clustering to embedding learning. In Section 5, we place LTN in the context of the latest related work in neurosymbolic AI. In Section 6 we conclude and discuss directions for future work. The Appendix contains information about the implementation of LTN in TensorFlow 2, experimental set-ups, the different options for the differentiable logic operators, and a study of their relationship with gradient computations.

2. Real Logic

2.1. Syntax

Real Logic forms the basis of Logic Tensor Networks. Real Logic is defined on a first-order language L with a signature that contains a set C of constant symbols (objects), a set F of functional symbols, a set P of relational symbols (predicates), and a set X of variable symbols. L-formulas allow us to specify relational knowledge with variables, e.g. the atomic formula is_friend(v1,v2) may state that the person v1 is a friend of the person v2, the formula ∀x∀y(is_friend(x,y) → is_friend(y,x)) states that the relation is_friend is symmetric, and the formula ∀x(∃y(Italian(y) ∧ is_friend(x,y))) states that every person has a friend that is Italian. Since we are interested in learning and reasoning in real-world scenarios where degrees of truth are often fuzzy and exceptions are present, formulas can be partially true, and therefore we adopt fuzzy semantics.
Objects can be of different types. Similarly, functions and predicates are typed. Therefore, we assume there exists a non-empty set of symbols D, called domain symbols. To assign types to the elements of L, we introduce the functions D, Din and Dout such that:
  • D : X ∪ C → D. Intuitively, D(x) and D(c) return the domain of a variable x or a constant c.
  • Din : F ∪ P → D*, where D* is the Kleene star of D, that is, the set of all finite sequences of symbols in D. Intuitively, Din(f) and Din(p) return the domains of the arguments of a function f or a predicate p. If f takes two arguments (for example, f(x,y)), Din(f) returns two domains, one per argument.
  • Dout : F → D. Intuitively, Dout(f) returns the range of a function symbol.
Real Logic may also contain propositional variables, as follows: if P is a 0-ary predicate with Din(P) = ⟨⟩ (the empty sequence of domains), then P is a propositional variable (an atom with truth-value in the interval [0,1]).
A term is constructed recursively in the usual way from constant symbols, variables, and function symbols. An expression formed by applying a predicate symbol to an appropriate number of terms with appropriate domains is called an atomic formula, which evaluates to true or false in classical logic and a number in [0,1] in the case of Real Logic. We define the set of terms of the language as follows:
  • each element t of X ∪ C is a term of the domain D(t);
  • if ti is a term of domain D(ti) for 1 ≤ i ≤ n, then t1t2…tn (the sequence composed of t1 followed by t2 and so on, up to tn) is a term of the domain D(t1)D(t2)…D(tn);
  • if t is a term of the domain Din(f), then f(t) is a term of the domain Dout(f).
We allow the following set of formulas in L:
  • t1 = t2 is an atomic formula for any terms t1 and t2 with D(t1) = D(t2);
  • p(t) is an atomic formula if D(t) = Din(p);
  • if ϕ and ψ are formulas and x1,…,xn are n distinct variable symbols, then ∗ϕ, ϕ∘ψ and Q x1…xn ϕ are formulas, where ∗ is a unary connective, ∘ is a binary connective and Q is a quantifier.
We use ∗ ∈ {¬} (negation), ∘ ∈ {∧, ∨, →, ↔} (conjunction, disjunction, implication and biconditional, respectively) and Q ∈ {∀, ∃} (universal and existential quantification, respectively).
Example 1. Let Town denote the domain of towns in the world and People denote the domain of living people. Suppose that L contains the constant symbols Alice, Bob and Charlie of domain People, and Rome and Seoul of domain Town. Let x be a variable of domain People and u be a variable of domain Town. The term x,u (i.e. the sequence x followed by u) has domain People,Town, which denotes the Cartesian product of People and Town (People × Town). Alice,Rome is interpreted as an element of the domain People,Town. Let lives_in be a predicate with input domain Din(lives_in) = People,Town. lives_in(Alice,Rome) is a well-formed expression, whereas lives_in(Bob,Charlie) is not.

2.2. Semantics of Real Logic

The semantics of Real Logic departs from the standard abstract semantics of First-order Logic (FOL). In Real Logic, domains are interpreted concretely by tensors in the real field.1 Every object denoted by constants, variables, and terms is interpreted as a tensor of real values. Functions are interpreted as real functions or tensor operations. Predicates are interpreted as functions or tensor operations projecting onto a value in the interval [0,1].

1 In the rest of the paper,we commonly use "tensor" to designate "tensor in the real field".

To emphasize the fact that in Real Logic symbols are grounded onto real-valued features, we use the term grounding, denoted by G, in place of interpretation.2 Notice that this is different from the common use of the term grounding in logic, which indicates the operation of replacing the variables of a term or formula with constants or terms containing no variables. To avoid confusion, we use the synonym instantiation for this purpose. G associates a tensor of real numbers to any term of L, and a real number in the interval [0,1] to any formula ϕ of L. Intuitively, G(t) are the numeric features of the objects denoted by t, and G(ϕ) represents the system's degree of confidence in the truth of ϕ; the higher the value, the higher the confidence.

2.2.1. Grounding domains and the signature

A grounding for a logical language L on the set of domains D provides the interpretation of both the domain symbols in D and the non-logical symbols in L .
Definition 1. A grounding G associates to each domain D ∈ D a set G(D) ⊆ ⋃_{n1…nd ∈ ℕ*} ℝ^{n1×…×nd}.
For every D1…Dn ∈ D*, G(D1…Dn) = ⨉_{i=1}^{n} G(Di), that is, G(D1) × G(D2) × … × G(Dn).
Notice that the elements in G(D) may be tensors of any rank d and any dimensions n1×…×nd, as ℕ* denotes the Kleene star of ℕ.3
Example 2. Let digit_images denote a domain of images of handwritten digits. If we use images of 256×256 RGB pixels, then G(digit_images) = ℝ^{256×256×3}. Let us consider the atomic formula is_digit(Z, 8), where the constant Z denotes an image of a handwritten digit. The terms Z, 8 have domains digit_images, digits. Any input to the predicate is a tuple in G(digit_images, digits) = G(digit_images) × G(digits).
A grounding assigns to each constant symbol c a tensor G(c) in the domain G(D(c)); it assigns to a variable x a finite sequence of tensors ⟨d1,…,dk⟩, each in G(D(x)). These tensors represent the instances of x. Differently from FOL, where a variable is assigned a single value of the domain of interpretation at a time, in Real Logic a variable is assigned a sequence of values in its domain, the k examples of x. A grounding assigns to a function symbol f a function taking tensors from G(Din(f)) as input and producing a tensor in G(Dout(f)) as output. Finally, a grounding assigns to a predicate symbol p a function taking tensors from G(Din(p)) as input and producing a truth-value in the interval [0,1] as output.
Definition 2. A grounding G of L is a function defined on the signature of L that satisfies the following conditions:
  1. G(x) = ⟨d1,…,dk⟩ ∈ ⨉_{i=1}^{k} G(D(x)) for every variable symbol x ∈ X, with k ∈ ℕ⁺. Notice that G(x) is a sequence and not a set, meaning that the same value of G(D(x)) can occur multiple times in G(x), as is usual in a Machine Learning data set with "attributes" and "values";

2 An interpretation is an assignment of truth-values true or false, or in the case of Real Logic a value in [0,1], to a formula. A model is an interpretation that maps a formula to true.
3 A tensor of rank 0 corresponds to a scalar,a tensor of rank 1 to a vector,a tensor of rank 2 to a matrix and so forth,in the usual way.

  2. G(f) ∈ G(Din(f)) → G(Dout(f)) for every function symbol f ∈ F;
  3. G(p) ∈ G(Din(p)) → [0,1] for every predicate symbol p ∈ P.
If a grounding depends on a set of parameters θ, we denote it as Gθ(·) or G(· | θ), interchangeably. Section 4 describes how such parameters can be learned using the concept of satisfiability.

2.2.2. Grounding terms and atomic formulas

We now extend the definition of grounding to all first-order terms and atomic formulas. Before formally defining these groundings, we describe at a high level what happens when grounding terms that contain free variables.4
Let x be a variable that denotes people. As explained in Definition 2, x is grounded as an explicit sequence of k instances (k = |G(x)|). Consequently, a term height(x) is also grounded as k height values, each corresponding to one instance. We can generalize to expressions with multiple free variables, as shown in Example 3.
In the formal definition below, instead of considering a single term at a time, it is convenient to consider sequences of terms t = t1t2…tm and to define the grounding on t (with the definition of the grounding of a single term being derived as a special case). The fact that the sequence of terms t contains n distinct variables x1,…,xn is denoted by t(x1,…,xn). The grounding of t(x1,…,xn), denoted by G(t(x1,…,xn)), is a tensor with n corresponding axes, one for each free variable, defined as follows:
Definition 3. Let t(x1,…,xn) be a sequence t1…tm of m terms containing n distinct variables x1,…,xn. Let each term ti in t contain ni of those variables, denoted x_{j(i,1)},…,x_{j(i,ni)}.
  • G(t) is a tensor with dimensions (|G(x1)|,…,|G(xn)|) such that the element of this tensor indexed by k1,…,kn, written as G(t)_{k1…kn}, is equal to the concatenation of G(ti)_{k_{j(i,1)}…k_{j(i,ni)}} for 1 ≤ i ≤ m;
  • G(f(t))_{i1…in} = G(f)(G(t)_{i1…in}), i.e. the element-wise application of G(f) to G(t);
  • G(p(t))_{i1…in} = G(p)(G(t)_{i1…in}), i.e. the element-wise application of G(p) to G(t).
If a term ti contains ni variables x_{j(1)},…,x_{j(ni)} selected from x1,…,xn, then G(ti)_{k_{j(1)}…k_{j(ni)}} can be obtained from G(t)_{i1…in} with an appropriate mapping of the indices i to k.

4 We assume the usual syntactic definition of free and bound variables in FOL. A variable is free if it is not bound by a quantifier (∀, ∃).

Figure 1: Illustration of Example 3 - x and y indicate dimensions associated with the free variables x and y . A tensor representing a term that includes a free variable x will have an axis x . One can index x to obtain results calculated using each of the v1,v2 or v3 values of x . In our graphical convention,the depth of the boxes indicates that the tensor can have feature dimensions (refer to the end of Example 3).
Example 3. Suppose that L contains the variables x and y, the function symbol f, the predicate symbol p and the set of domains D = {V, W}. Let D(x) = V, D(y) = W, Din(f) = V,W, Dout(f) = W and Din(p) = V,W. In what follows, we give an example of a grounding of L and D; the groundings of some possible terms and atomic formulas are discussed below and illustrated in Figure 1.
G(V) = ℝ⁺
G(W) = ℝ
G(x) = ⟨v1, v2, v3⟩
G(y) = ⟨w1, w2⟩
G(p) : x,y ↦ σ(x + y)
G(f) : x,y ↦ x − y
Notice the dimensions of the results. G(f(x,y)) and G(p(x,f(x,y))) return |G(x)| × |G(y)| = 3 × 2 values, one for each combination of individuals that occur in the variables. For functions, there can be additional dimensions associated with the output domain. Suppose a different grounding such that G(Dout(f)) = ℝ^m. Then the dimensions of G(f(x,y)) would have been |G(x)| × |G(y)| × m, where |G(x)| × |G(y)| are the dimensions for indexing the free variables and m are the dimensions associated with the output domain of f. Let us call the latter feature dimensions, as captioned in Figure 1. Notice that G(p(x,f(x,y))) will always return a tensor with the exact dimensions |G(x)| × |G(y)| × 1 because, under any grounding, a predicate always returns a value in [0,1]. Therefore, as the "feature dimension" of predicates is always 1, we choose to "squeeze" it and not to represent it in our graphical convention (see Figure 1; the box output by the predicate has no depth).
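To make the indexing concrete, the following NumPy sketch computes the groundings of f(x,y) and p(x,f(x,y)) from Example 3 by broadcasting one axis per free variable; the concrete values chosen for v1, v2, v3, w1, w2 are assumptions for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    g_x = np.array([1.0, 2.0, 3.0])   # G(x) = <v1, v2, v3>, values in R+
    g_y = np.array([-0.5, 0.5])       # G(y) = <w1, w2>, values in R

    # G(f(x,y)) with G(f): x,y -> x - y; axis 0 indexes x, axis 1 indexes y.
    g_f = g_x[:, None] - g_y[None, :]        # shape (3, 2)

    # G(p(x, f(x,y))) with G(p): x,y -> sigmoid(x + y), applied element-wise.
    g_p = sigmoid(g_x[:, None] + g_f)        # shape (3, 2), values in [0, 1]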
Figure 2: Illustration of an element-wise operator implementing conjunction (p(x) ∧ q(y)). We assume that x and y are two different variables. The result associates one number in the interval [0,1] with every combination of individuals from G(x) and G(y).

2.2.3. Connectives and Quantifiers

The semantics of the connectives is defined according to the semantics of first-order fuzzy logic [28]. Conjunction (∧), disjunction (∨), implication (→) and negation (¬) are associated, respectively, with a t-norm (T), a t-conorm (S), a fuzzy implication (I) and a fuzzy negation (N) operation, FuzzyOp ∈ {T, S, I, N}. Definitions of some common fuzzy operators are presented in Appendix B. Let ϕ and ψ be two formulas with free variables x1,…,xm and y1,…,yn, respectively. Let us assume that the first k variables are common to ϕ and ψ. Recall that ∗ and ∘ denote the sets of unary and binary connectives, respectively. Formally:
(1) G(∗ϕ)_{i1,…,im} = FuzzyOp(∗)(G(ϕ)_{i1,…,im})
(2) G(ϕ∘ψ)_{i1,…,i_{m+n−k}} = FuzzyOp(∘)(G(ϕ)_{i1,…,ik,i_{k+1},…,im}, G(ψ)_{i1,…,ik,i_{m+1},…,i_{m+n−k}})
In (2), (i1,…,ik) denote the indices of the k common variables, (i_{k+1},…,im) denote the indices of the m−k variables appearing only in ϕ, and (i_{m+1},…,i_{m+n−k}) denote the indices of the n−k variables appearing only in ψ. Intuitively, G(ϕ∘ψ) is a tensor whose elements are obtained by applying FuzzyOp(∘) element-wise to every combination of individuals from x1,…,xm and y1,…,yn (see Figure 2).
The semantics of the quantifiers (Q ∈ {∀, ∃}) is defined with the use of aggregation. Let Agg be a symmetric and continuous aggregation operator, Agg : ⋃_{n∈ℕ} [0,1]^n → [0,1]. An analysis of suitable aggregation operators is presented in Appendix B. For every formula ϕ containing the free variables x1,…,xn, suppose, without loss of generality, that quantification applies to the first h variables. We shall therefore apply Agg to the first h axes of G(ϕ), as follows:
(3) G(Q x1,…,xh (ϕ))_{i_{h+1},…,in} = Agg(Q)_{i1=1,…,|G(x1)|; …; ih=1,…,|G(xh)|} G(ϕ)_{i1,…,ih,i_{h+1},…,in}
where Agg(Q) is the aggregation operator associated with the quantifier Q. Intuitively, we obtain G(Q x1,…,xh (ϕ)) by reducing the dimensions associated with x1,…,xh using the operator Agg(Q) (see Figure 3).
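As a sketch of Equations (2) and (3), conjoining two formulas with different free variables broadcasts their truth-values over one axis per variable, and quantification then reduces the corresponding axes; in this NumPy illustration, the product t-norm implements the conjunction and the arithmetic mean stands in for the aggregator (both choices are assumptions).

    import numpy as np

    g_px = np.array([0.2, 0.7, 0.9])        # G(p(x)), |G(x)| = 3
    g_qy = np.array([0.4, 0.8])             # G(q(y)), |G(y)| = 2

    # Eq. (2) with the product t-norm: G(p(x) ∧ q(y)) has shape (3, 2),
    # one truth-value per combination of instances (cf. Figure 2).
    g_and = g_px[:, None] * g_qy[None, :]

    # Eq. (3): quantifying over x reduces axis 0 with an aggregator.
    g_forall_x = g_and.mean(axis=0)         # shape (2,): one value per instance of y
    g_forall_xy = g_and.mean(axis=(0, 1))   # scalar: both variables quantified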
Notice that the above grounded semantics can assign different meanings to the three formulas:
∀xy(ϕ(x,y))   ∀x(∀y(ϕ(x,y)))   ∀y(∀x(ϕ(x,y)))
Figure 3: Illustration of an aggregation operation implementing quantification (∃y∀x) over variables x and y. We assume that x and y have different domains. The result is a single number in the interval [0,1].
The semantics of the three formulas will coincide if the aggregation operator is bi-symmetric. LTN also allows the following form of quantification, here called diagonal quantification (Diag):
(4) G(Q Diag(x1,…,xh)(ϕ))_{i_{h+1},…,in} = Agg(Q)_{i=1,…,min_{1≤j≤h}|G(xj)|} G(ϕ)_{i,…,i,i_{h+1},…,in}
Diag(x1,…,xh) quantifies over specific tuples such that the i-th tuple contains the i-th instance of each of the variables in the argument of Diag, under the assumption that all variables in the argument are grounded onto sequences with the same number of instances. Diag(x1,…,xh) is called diagonal quantification because it quantifies over the diagonal of G(ϕ) along the axes associated with x1…xh, although in practice only the diagonal is built and not the entire G(ϕ), as shown in Figure 4. For example, given a data set with samples x and target labels y, if looking to write a statement p(x,y) that holds true for each pair of sample and label, one can write ∀ Diag(x,y) p(x,y), given that |G(x)| = |G(y)|. As another example, given two variables x and y whose groundings contain 10 instances of x and y each, the expression Diag(x,y) p(x,y) produces 10 results, such that the i-th result corresponds to the i-th instances of each grounding. Without Diag, the expression would be evaluated for all 10 × 10 combinations of the elements in G(x) and G(y).5 Diag will find much application in the examples and experiments to follow.
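The following NumPy sketch contrasts the two evaluations, using a hypothetical 10 × 10 table of truth-values for p(x,y); the full table is built here only for illustration, whereas LTN would compute just the diagonal.

    import numpy as np

    rng = np.random.default_rng(0)
    truth = rng.uniform(size=(10, 10))   # hypothetical G(p(x,y)) for all combinations

    # Without Diag: quantifying over x and y aggregates all 10 x 10
    # combinations (the mean stands in for Agg(Q)).
    sat_all = truth.mean()

    # With Diag(x,y): only the i-th instance of x is paired with the i-th
    # instance of y, i.e. the diagonal of the table (Eq. (4)).
    sat_diag = np.diagonal(truth).mean()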

2.3. Guarded Quantifiers

In many situations, one may wish to quantify over a set of elements of a domain whose groundings satisfy some condition. In particular, one may wish to express such a condition using formulas of the language, of the form:

(5) ∀y (∃x : age(x) > age(y) (parent(x,y)))
The grounding of such a formula is obtained by aggregating the values of parent (x,y) only for the instances of x that satisfy the condition age(x)>age(y) ,that is:
Agg(∀)_{j=1,…,|G(y)|} Agg(∃)_{i=1,…,|G(x)| s.t. G(age(x))_i > G(age(y))_j} G(parent(x,y))_{i,j}

5 Notice how Diag is not simply "syntactic sugar" for creating a new variable pairs_xy by stacking pairs of examples from G(x) and G(y) . If the groundings of x and y have incompatible ranks (for instance,if x denotes images and y denotes their labels),stacking them in a tensor G (pairs_xy) is non-trivial,requiring several reshaping operations.

Figure 4: Diagonal Quantification: Diag(x1,x2) quantifies over specific tuples only, such that the i-th tuple contains the i-th instances of the variables x1 and x2 in the groundings G(x1) and G(x2), respectively. Diag(x1,x2) assumes, therefore, that x1 and x2 have the same number of instances, as in the case of samples x1 and their labels x2 in a typical supervised learning task.
The evaluation of which tuples are safe, that is, satisfy the condition, is purely symbolic and non-differentiable; guarded quantifiers aggregate over only the subset of instances that satisfy the condition, and they apply when such symbolic knowledge is crisp and available. More generally, in what follows, m is a symbol representing the condition, which we shall call a mask, and G(m) associates to m a function6 returning a Boolean.
(6) G(Q x1,…,xh : m(x1,…,xn) (ϕ))_{i_{h+1},…,in} =def Agg(Q)_{i1=1,…,|G(x1)|; …; ih=1,…,|G(xh)|, s.t. G(m)(G(x1)_{i1},…,G(xn)_{in})} G(ϕ)_{i1,…,ih,i_{h+1},…,in}
Notice that the semantics of a guarded sentence ∀x : m(x) (ϕ(x)) is different from the semantics of ∀x (m(x) → ϕ(x)). In crisp and traditional FOL, the two statements would be equivalent. In Real Logic, they can give different results. Let G(x) be a sequence of 3 values, with G(m(x)) = (0, 1, 1) and G(ϕ(x)) = (0.2, 0.7, 0.8). Only the second and third instances of x are safe, that is, are in the masked subset. Let → be defined using the Reichenbach operator IR(a,b) = 1 − a + ab, and let ∀ be defined using the mean operator. We have G(∀x (m(x) → ϕ(x))) = (1 + 0.7 + 0.8)/3 ≈ 0.833, whereas G(∀x : m(x) (ϕ(x))) = (0.7 + 0.8)/2 = 0.75. Also, in the computational graph of the guarded sentence, there are no gradients attached to the instances that do not verify the mask. Similarly, the semantics of ∃x : m(x) (ϕ(x)) is not equivalent to that of ∃x (m(x) ∧ ϕ(x)).
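The numbers in this example can be reproduced with a few lines of NumPy; note how the guarded version filters instances out before aggregating, so masked-out instances contribute neither to the truth-value nor to the gradients.

    import numpy as np

    phi = np.array([0.2, 0.7, 0.8])       # G(phi(x)) for the three instances of x
    mask = np.array([False, True, True])  # G(m(x)): which instances are safe

    def reichenbach(a, b):                # I_R(a, b) = 1 - a + a*b
        return 1.0 - a + a * b

    # forall x (m(x) -> phi(x)): the implication is computed on every instance.
    v1 = reichenbach(mask.astype(float), phi).mean()  # (1 + 0.7 + 0.8)/3

    # forall x : m(x) (phi(x)): unsafe instances are removed before aggregation.
    v2 = phi[mask].mean()                             # (0.7 + 0.8)/2

    print(round(v1, 3), round(v2, 3))     # 0.833 0.75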

6 In some edge cases, a masking may produce an empty sequence, e.g. if for some value of G(y) there is no value in G(x) that satisfies age(x) > age(y). In such cases, we resort to the concept of an empty semantics: ∀ returns 1 and ∃ returns 0.

Figure 5: Example of Guarded Quantification: One can filter out elements of the various domains that do not satisfy some condition before the aggregation operators for ∀ and ∃ are applied.

2.4. Stable Product Real Logic

It has been shown in [69] that not all first-order fuzzy logic semantics are equally suited for gradient-descent optimization. Many fuzzy logic operators can lead to vanishing or exploding gradients. Some operators are also single-passing, in that they propagate gradients to only one input at a time.
In general, the best performing symmetric configuration7 for the connectives uses the product t-norm TP for conjunction, its dual t-conorm SP for disjunction, standard negation NS, and the Reichenbach implication IR (the S-implication corresponding to the above operators). This subset of Real Logic, where the grounding of the connectives is restricted to the product configuration, is called Product Real Logic in [69]. Given two truth-values a and b in [0,1]:
(7) ¬ : NS(a) = 1 − a
(8) ∧ : TP(a,b) = ab
(9) ∨ : SP(a,b) = a + b − ab
(10) → : IR(a,b) = 1 − a + ab
Appropriate aggregators for ∃ and ∀ are the generalized mean ApM with p ≥ 1, to approximate the existential quantification, and the generalized mean w.r.t. the error ApME with p ≥ 1, to approximate the universal quantification. They can be understood as a smooth maximum and a smooth minimum, respectively. Given n truth-values a1,…,an, all in [0,1]:
(11) ∃ : ApM(a1,…,an) = ((1/n) ∑_{i=1}^{n} ai^p)^{1/p},  p ≥ 1
(12) ∀ : ApME(a1,…,an) = 1 − ((1/n) ∑_{i=1}^{n} (1 − ai)^p)^{1/p},  p ≥ 1
ApME measures the power of the deviation of each value from the ground truth 1. With p = 2, it is equivalent to 1 − RMSE(a, 1), where RMSE is the root-mean-square error, a is the vector of truth-values and 1 is a vector of 1s.

7 We define a symmetric configuration as a set of fuzzy operators such that conjunction and disjunction are defined by a t-norm and its dual t-conorm, respectively, and the implication operator is derived from such conjunction or disjunction operators and standard negation (c.f. Appendix B for details). In [69], van Krieken et al. also analyze non-symmetric configurations and even operators that do not strictly verify fuzzy logic semantics.

The intuition behind the choice of p is that the higher p is, the more weight ApM (resp. ApME) gives to true (resp. false) truth-values, converging to the max (resp. min) operator. Therefore, the value of p can be seen as a hyper-parameter, as it offers flexibility to account for outliers in the data depending on the application.
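A quick numeric check of this behaviour, with hypothetical truth-values, shows both operators collapsing to the arithmetic mean at p = 1 and drifting towards max and min respectively as p grows:

    import numpy as np

    def ApM(a, p):    # Eq. (11): smooth maximum, approximates "exists"
        return np.mean(a ** p) ** (1.0 / p)

    def ApME(a, p):   # Eq. (12): smooth minimum, approximates "forall"
        return 1.0 - np.mean((1.0 - a) ** p) ** (1.0 / p)

    a = np.array([0.1, 0.9, 0.95])   # hypothetical truth-values
    for p in (1, 2, 10):
        print(p, round(ApM(a, p), 3), round(ApME(a, p), 3))
    # With p = 1 both reduce to the arithmetic mean (0.65); as p grows,
    # ApM drifts towards max(a) = 0.95 and ApME towards min(a) = 0.1.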
Nevertheless, Product Real Logic still has the following gradient problems: TP(a,b) has vanishing gradients on the edge case a = b = 0; SP(a,b) has vanishing gradients on the edge case a = b = 1; IR(a,b) has vanishing gradients on the edge case a = 0, b = 1; ApM(a1,…,an) has exploding gradients when ∑_i (ai)^p tends to 0; ApME(a1,…,an) has exploding gradients when ∑_i (1 − ai)^p tends to 0 (see Appendix C for details).
To address these problems, we define the projections π0 and π1 below, with ε an arbitrarily small positive real number:
(13) π0 : [0,1] → ]0,1] : a ↦ (1 − ε)a + ε
(14) π1 : [0,1] → [0,1[ : a ↦ (1 − ε)a
We then derive the following stable operators to produce what we call the Stable Product Real Logic configuration:
(15) NS′(a) = NS(a)
(16) TP′(a,b) = TP(π0(a), π0(b))
(17) SP′(a,b) = SP(π1(a), π1(b))
(18) IR′(a,b) = IR(π0(a), π1(b))
(19) ApM′(a1,…,an) = ApM(π0(a1),…,π0(an)),  p ≥ 1
(20) ApME′(a1,…,an) = ApME(π1(a1),…,π1(an)),  p ≥ 1
It is important to note that the conjunction operator in the stable product semantics is not a t-norm:8 TP′(a,b) does not satisfy identity in [0,1[, since for any 0 ≤ a < 1, TP′(a,1) = (1 − ε)a + ε ≠ a, although ε can be chosen arbitrarily small. In the experimental evaluations reported in Section 4, we find that the adoption of the stable product semantics is an important practical step to improve the numerical stability of the learning system.
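A minimal NumPy sketch of Equations (13)-(20) follows, assuming ε = 1e-4 (the exact value of ε is a hyper-parameter choice):

    import numpy as np

    eps = 1e-4   # an arbitrarily small positive number (assumed value)

    def pi0(a):   # Eq. (13): maps [0,1] into ]0,1]
        return (1.0 - eps) * a + eps

    def pi1(a):   # Eq. (14): maps [0,1] into [0,1[
        return (1.0 - eps) * a

    def not_stable(a):         # Eq. (15): standard negation is unchanged
        return 1.0 - a

    def and_stable(a, b):      # Eq. (16): stable product t-norm
        return pi0(a) * pi0(b)

    def or_stable(a, b):       # Eq. (17): stable product t-conorm
        a, b = pi1(a), pi1(b)
        return a + b - a * b

    def implies_stable(a, b):  # Eq. (18): stable Reichenbach implication
        a, b = pi0(a), pi1(b)
        return 1.0 - a + a * b

    def exists_stable(a, p=2):   # Eq. (19): stable ApM
        return np.mean(pi0(a) ** p) ** (1.0 / p)

    def forall_stable(a, p=2):   # Eq. (20): stable ApME
        return 1.0 - np.mean((1.0 - pi1(a)) ** p) ** (1.0 / p)

    # The edge case a = b = 0, which has vanishing gradients under the plain
    # product t-norm, is now mapped strictly inside the unit interval:
    print(and_stable(0.0, 0.0))   # eps**2 = 1e-08 > 0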

3. Learning, Reasoning, and Querying in Real Logic

In Real Logic, one can define the tasks of learning, reasoning and query-answering. Given a Real Logic theory that represents the knowledge of an agent at a given time, learning is the task of making generalizations from specific observations obtained from data. This is often called inductive inference. Reasoning is the task of deriving what knowledge follows from the facts which are currently known. Query answering is the task of evaluating the truth value of a certain logical expression (called a query), or finding the set of objects in the data that evaluate a certain expression to true. In what follows, we define and exemplify each of these tasks. To do so, we first need to specify which types of knowledge can be represented in Real Logic.

8 Recall that a t-norm is a function T : [0,1] × [0,1] → [0,1] satisfying commutativity, monotonicity, associativity and identity, that is, T(a,1) = a.

3.1. Representing Knowledge with Real Logic

In logic-based knowledge representation systems, knowledge is represented by logical formulas whose intended meanings are propositions about a domain of interest. The connection between the symbols occurring in the formulas and what holds in the domain is not represented in the knowledge base and is left implicit since it does not have any effect on the logic computations. In Real Logic, by contrast, the connection between the symbols and the domain is represented explicitly in the language by the grounding G ,which plays an important role in both learning and reasoning. G is an integral part of the knowledge represented by Real Logic. A Real Logic knowledge base is therefore defined by the formulas of the logical language and knowledge about the domain in the form of groundings obtained from data. The following types of knowledge can be represented in Real Logic.

3.1.1. Knowledge through symbol groundings

Boundaries for domain grounding. These are constraints specifying that the value of a certain logical expression must be within a certain range. For instance, one may specify that the domain D must be interpreted in the [0,1] hyper-cube or in the standard n-simplex, i.e. the set of ⟨d1,…,dn⟩ ∈ (ℝ⁺)^n such that ∑_i di = 1. Other intuitive examples of range constraints include the elements of the domain "colour" grounded onto points in [0,1]³, such that every element is associated with a triplet of values (R,G,B) with R,G,B ∈ [0,1], or the range of a function age(x) as an integer between 0 and 100.
Explicit definition of grounding for symbols. Knowledge can be more strictly incorporated by fixing the grounding of some symbols. If a constant c denotes an object with known features vc ∈ ℝ^n, we can fix its grounding G(c) = vc. Training data that consists in a set of n data items, such as n images (or tuples known as training examples), can be specified in Real Logic by n constants, e.g. img1, img2, …, imgn, and by their groundings, e.g. by setting G(img1), G(img2), …, G(imgn) to the corresponding image tensors. These can be gathered in a variable imgs. A binary predicate sim that measures the similarity of two objects can be grounded as, e.g., the cosine similarity of two vectors v and w, (v,w) ↦ (v·w)/(‖v‖‖w‖). The output layer of the neural network associated with a multi-class single-label predicate P(x,class) can be a softmax function normalizing the output such that it guarantees exclusive classification, i.e. ∑_i P(x,i) = 1.9 Grounding of constants and functions allows the computation of the grounding of their results. If, for example, G(transp) is the function that transposes a matrix, then G(transp(img1)) is the transpose of the image tensor G(img1).
Parametric definition of grounding for symbols. Here, the exact grounding of a symbol σ is not known, but it is known that it can be obtained by finding a set of real-valued parameters, that is, via learning. To emphasize this fact, we adopt the notation G(σ) = G(σ | θσ), where θσ is the set of parameter values that determines the value of G(σ). The typical example of parametric grounding for constants is the learning of an embedding. Let emb(word | θemb) be a word embedding with parameters θemb, which takes as input a word and returns its embedding in ℝ^n. If the words of a vocabulary W = {w1,…,w|W|} are constant symbols, their groundings G(wi | θemb) are defined parametrically w.r.t. θemb as emb(wi | θemb). An example of parametric grounding for a function symbol f is to assume that G(f) is a linear function such that G(f) : ℝ^m → ℝ^n maps each v ∈ ℝ^m into Af·v + bf, with Af a matrix and bf a vector of real numbers. In this case, G(f) = G(f | θf), where θf = {Af, bf}. Finally, the grounding of a predicate symbol can be given, for example, by a neural network N with parameters θN. As an example, consider a neural network N trained for image classification into n classes: cat, dog, horse, etc. N takes as input a vector v of pixel values and produces as output a vector y = (y_cat, y_dog, y_horse, …) in [0,1]^n, such that y = N(v | θN), where yc is the probability that the input image v is of class c. In case the classes are, alternatively, chosen to be represented by unary predicate symbols such as cat(v), dog(v), horse(v), then G(cat(v)) = N(v | θN)_cat, G(dog(v)) = N(v | θN)_dog, G(horse(v)) = N(v | θN)_horse, etc.

9 Notice that softmax is often used as the last layer in neural networks to turn logits into a probability distribution. However, we do not use the softmax function as such here. Instead, we use it here to enforce an exclusivity constraint on satisfiability scores.

3.1.2. Knowledge through formulas

Factual propositions. Knowledge about the properties of specific objects in the domain is represented, as usual, by logical propositions, as exemplified below. Suppose that it is known that img1 is a number nine, img2 is a number eight, and imgn is a number two. This can be represented by adding the following facts to the knowledge-base: nine(img1), eight(img2), …, two(imgn). Supervised learning, that is, learning with the use of training examples which include target values (labelled data), is specified in Real Logic by combining grounding definitions and factual propositions. For example, the fact that an image Z is a positive example for the class nine and a negative example for the class eight is specified by defining G(img1) = Z alongside the propositions nine(img1) and ¬eight(img1). Notice how semi-supervision can be specified naturally in Real Logic by adding propositions containing disjunctions, e.g. eight(img1) ∨ nine(img1), which states that img1 is either an eight or a nine (or both). Finally, relational learning can be achieved by relating logically multiple objects (defined as constants or variables or even as more complex sequences of terms), such as e.g.: nine(img1) → ¬nine(img2) (if img1 is a nine then img2 is not a nine) or ∀img (nine(img) → ¬eight(img)) (if an image is a nine then it is not an eight). The use of more complex knowledge including the use of variables such as img above is the topic of generalized propositions, discussed next.
Generalized propositions. General knowledge about all or some of the objects of some domains can be specified in Real Logic by using first-order logic formulas with quantified variables. This general type of knowledge allows one to specify arbitrary constraints on the groundings independently from the specific data available. It allows one to specify, in a concise way, knowledge that holds true for all the objects of a domain. This is especially useful in Machine Learning in the semi-supervised and unsupervised settings, where there is no specific knowledge about a single individual. For example, as part of a task of multi-label classification with constraints on the labels [12], a positive label constraint may express that if an example is labelled with l1,…,lk, then it should also be labelled with l_{k+1}. This can be specified in Real Logic with a universally quantified formula: ∀x (l1(x) ∧ … ∧ lk(x) → l_{k+1}(x)).10 Another example of soft constraints used in Statistical Relational Learning associates the labels of related examples. For instance, in Markov Logic Networks [55], as part of the well-known Smokers and Friends example, people who are smokers are associated by the friendship relation. In Real Logic, the formula ∀x∀y((smokes(x) ∧ friend(x,y)) → smokes(y)) would be used to encode the soft constraint that friends of smokers are normally smokers.

10 This can also be specified using a guarded quantifier ∀x : ((l1(x) ∧ … ∧ lk(x)) > th) (l_{k+1}(x)), where th is a threshold value in [0,1].

3.1.3. Knowledge through fuzzy semantics

Definition for operators. The grounding of a formula ϕ depends on the operators approximating the connectives and quantifiers that appear in ϕ. Different operators give different interpretations of the satisfaction associated with the formula. For instance, the operator ApME(a1,…,an) that approximates universal quantification can be understood as a smooth minimum. It depends on a hyper-parameter p (the exponent used in the generalized mean). If p = 1 then ApME(a1,…,an) corresponds to the arithmetic mean. As p increases, given the same input, the value of the universally quantified formula will decrease as ApME converges to the min operator. To define how strictly the universal quantification should be interpreted in each proposition, one can use different values of p for different propositions of the knowledge base. For instance, a formula ∀x P(x) where ApME is used with a low value for p will in fact denote that P holds for some x, whereas a formula ∀x Q(x) with a higher p may denote that Q holds for most x.

3.1.4. Satisfiability

In summary, a Real Logic knowledge-base has three components: the first describes knowledge about the grounding of symbols (domains, constants, variables, functions, and predicate symbols); the second is a set of closed logical formulas describing factual propositions and general knowledge; the third lies in the operators and the hyperparameters used to evaluate each formula. The definition that follows formalizes this notion.
Definition 4 (Theory/Knowledge-base). A theory of Real Logic is a triple T = ⟨K, G(· | θ), Θ⟩, where K is a set of closed first-order logic formulas defined on the set of symbols S = D ∪ X ∪ C ∪ F ∪ P denoting, respectively, domains, variables, constants, function and predicate symbols; G(· | θ) is a parametric grounding for all the symbols s ∈ S and all the logical operators; and Θ = {Θs}_{s∈S} is the hypothesis space for each set of parameters θs associated with symbol s.
Learning and reasoning in a Real Logic theory are both associated with searching for and applying the set of values of parameters θ from the hypothesis space Θ that maximize the satisfaction of the formulas in K. We use the term grounded theory, denoted by ⟨K, Gθ⟩, to refer to a Real Logic theory with a specific set of learned parameter values. This idea shares some similarity with the weighted MAX-SAT problem [43], where the weights for the formulas in K are given by their fuzzy truth-values, obtained by choosing the parameter values of the grounding. To define this optimization problem, we aggregate the truth-values of all the formulas in K by selecting a formula aggregating operator SatAgg : [0,1]* → [0,1].
Definition 5. The satisfiability of a theory T = ⟨K, Gθ⟩ with respect to the aggregating operator SatAgg is defined as SatAgg_{ϕ∈K} Gθ(ϕ).

3.2. Learning

Given a Real Logic theory T = ⟨K, G(· | θ), Θ⟩, learning is the process of searching for the set of parameter values θ* that maximize the satisfiability of T w.r.t. a given aggregator:

θ* = argmax_{θ∈Θ} SatAgg_{ϕ∈K} Gθ(ϕ)
Notice that with this general formulation, one can learn the grounding of constants, functions, and predicates. The learning of the grounding of constants corresponds to the learning of embeddings. The learning of the grounding of functions corresponds to the learning of generative models or to a regression task. Finally, the learning of the grounding of predicates corresponds to a classification task in Machine Learning.
In some cases, it is useful to impose some regularization (as done customarily in ML) on the set of parameters θ ,thus encoding a preference on the hypothesis space Θ ,such as a preference for smaller parameter values. In this case, learning is defined as follows:
θ* = argmax_{θ∈Θ} (SatAgg_{ϕ∈K} Gθ(ϕ) − λR(θ))

where λ ∈ ℝ⁺ is the regularization parameter and R is a regularization function, e.g. L1 or L2 regularization, that is, L1(θ) = ∑_{θi∈θ} |θi| and L2(θ) = ∑_{θi∈θ} θi².
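As an illustration, a training step for this objective can be written in TensorFlow 2 roughly as follows; this is a sketch, not the LTN library API, and the names theta and sat_agg_fn (the trainable variables and a function computing SatAgg_{ϕ∈K} Gθ(ϕ) as a scalar in [0,1]) are assumptions that depend on the theory at hand.

    import tensorflow as tf

    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
    lam = 0.01   # regularization weight lambda (an assumed value)

    def train_step(theta, sat_agg_fn):
        # theta: list of tf.Variable; sat_agg_fn() -> scalar satisfiability in [0, 1]
        with tf.GradientTape() as tape:
            sat = sat_agg_fn()
            reg = tf.add_n([tf.reduce_sum(v ** 2) for v in theta])  # L2 regularizer R(theta)
            loss = 1.0 - sat + lam * reg   # maximizing SatAgg - lambda * R(theta)
        grads = tape.gradient(loss, theta)
        optimizer.apply_gradients(zip(grads, theta))
        return loss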
LTN can generalize and extrapolate when querying formulas grounded with unseen data (for example, new individuals from a domain), using knowledge learned with previous groundings (for example, re-using a trained predicate). This is explained in Section 3.3.

3.3. Querying

Given a grounded theory T=(K,Gθ) ,query answering allows one to check if a certain fact is true (or, more precisely, by how much it is true since in Real Logic truth-values are real numbers in the interval [0,1]) . There are various types of queries that can be asked of a grounded theory.
A first type of query is called truth queries. Any formula in the language of T can be a truth query. The answer to a truth query ϕq is the truth value of ϕq obtained by computing its grounding, i.e. Gθ(ϕq). Notice that, if ϕq is a closed formula, the answer is a scalar in [0,1] denoting the truth-value of ϕq according to Gθ. If ϕq contains n free variables x1,…,xn, the answer to the query is a tensor of order n such that the component indexed by i1…in is the truth-value of ϕq evaluated in Gθ(x1)_{i1},…,Gθ(xn)_{in}.
The second type of query is called value queries. Any term in the language of T can be a value query. The answer to a value query tq is a tensor of real numbers obtained by computing the grounding of the term,i.e. Gθ(tq) . Analogously to truth queries,the answer to a value query is a "tensor of tensors" if tq contains variables. Using value queries,one can inspect how a constant or a term, more generally, is embedded in the manifold.
The third type of query is called generalization truth queries. With generalization truth queries, we are interested in knowing the truth-values of formulas when these are applied to a new (unseen) set of objects of a domain, such as a validation or a test set of examples typically used in the evaluation of machine learning systems. A generalization truth query is a pair (ϕq(x), U), where ϕq is a formula with a free variable x and U = (u⁽¹⁾,…,u⁽ᵏ⁾) is a set of unseen examples whose dimensions are compatible with those of the domain of x. The answer to the query (ϕq(x), U) is Gθ(ϕq(x)) for x taking each value u⁽ⁱ⁾, 1 ≤ i ≤ k, in U. The result of this query is therefore a vector of |U| truth-values corresponding to the evaluation of ϕq on the new data u⁽¹⁾,…,u⁽ᵏ⁾.
The fourth and final type of query is generalization value queries. These are analogous to generalization truth queries with the difference that they evaluate a term tq(x) ,and not a formula,on new data U . The result,therefore,is a vector of |U| values corresponding to the evaluation of the trained model on a regression task using test data U .
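All four query types reduce to evaluating groundings on known or unseen data. A small NumPy sketch with a hypothetical trained predicate A and function f over ℝ² (the parameter values are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w, b = np.array([4.0, -4.0]), 0.5   # hypothetical learned parameters

    def A(x):          # G_theta(A): R^2 -> [0, 1]
        return sigmoid(x @ w + b)

    def f(x):          # G_theta(f): R^2 -> R^2
        return 2.0 * x

    g_x = rng.uniform(size=(50, 2))   # G(x): the known examples
    U = rng.uniform(size=(10, 2))     # unseen examples

    truth = A(g_x)        # truth query A(x): 50 truth-values
    value = f(g_x)        # value query f(x): a 50 x 2 tensor
    gen_truth = A(U)      # generalization truth query (A(x), U)
    gen_value = f(U)      # generalization value query (f(x), U)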

3.4. Reasoning

3.4.1. Logical consequence in Real Logic

From a pure logic perspective, reasoning is the task of verifying if a formula is a logical consequence of a set of formulas. This can be achieved semantically using model theory (⊨) or syntactically via a proof theory (⊢). To characterize reasoning in Real Logic, we adapt the notion of logical consequence for fuzzy logic provided in [9]: a formula ϕ is a fuzzy logical consequence of a finite set of formulas Γ, in symbols Γ ⊨ ϕ, if for every fuzzy interpretation f, if all the formulas in Γ are true (i.e. evaluate to 1) in f then ϕ is true in f. In other words, every model of Γ is a model of ϕ. A direct application of this definition to Real Logic is not practical since, in most practical cases, the level of satisfiability of a grounded theory ⟨K, Gθ⟩ will not be equal to 1. We therefore define an interval [q, 1] with 1/2 < q < 1 and assume that a formula is true if its truth-value is in the interval [q, 1]. This leads to the following definition:
Definition 6. A closed formula ϕ is a logical consequence of a knowledge-base ⟨K, G(· | θ), Θ⟩, in symbols ⟨K, G(· | θ), Θ⟩ ⊨_q ϕ, if, for every grounded theory ⟨K, Gθ⟩, if SatAgg(K, Gθ) ≥ q then Gθ(ϕ) ≥ q.

3.4.2. Reasoning by optimization

Logical consequence by direct application of Definition 6 requires querying the truth value of ϕ for a potentially infinite set of groundings. Therefore, in practice, we consider the following directions:
Reasoning Option 1 (Querying after learning). This is approximate logical inference by considering only the grounded theories that maximally satisfy ⟨K, G(· | θ), Θ⟩. We therefore define that ϕ is a brave logical consequence of a Real Logic knowledge-base ⟨K, G(· | θ), Θ⟩ if Gθ*(ϕ) ≥ q for all θ* such that:

θ* = argmax_θ SatAgg(K, Gθ)  and  SatAgg(K, Gθ*) ≥ q

The objective is to find all θ* that optimally satisfy the knowledge base and to measure whether they also satisfy ϕ. One can search for such θ* by running multiple optimizations with the objective function of Section 3.2.
This approach is somewhat naive. Even if we run the optimization multiple times with multiple parameter initializations (to, hopefully, reach different optima in the search space), the obtained groundings may not be representative of other optimal or close-to-optimal groundings. In Section 4.8 we give an example that shows the limitations of this approach and motivates the next one.
Reasoning Option 2 (Proof by Refutation). Here, we reason by refutation and search for a counterexample to the logical consequence by introducing an alternative search objective. Normally, according to Definition 6, one tries to verify that:11

(21) for all θ ∈ Θ, if Gθ(K) ≥ q then Gθ(ϕ) ≥ q.
Instead, we solve the dual problem:
(22) there exists θ ∈ Θ such that Gθ(K) ≥ q and Gθ(ϕ) < q.
If Eq. (22) is true, then a counterexample to Eq. (21) has been found and the logical consequence does not hold. If Eq. (22) is false, then no counterexample to Eq. (21) has been found and the logical consequence is assumed to hold true. A search for such parameters θ (the counterexample) can be performed by minimizing Gθ(ϕ) while imposing a constraint that seeks to invalidate results where Gθ(K) < q. We therefore define:
penalty(Gθ, q) = { c if Gθ(K) < q; 0 otherwise },  where c > 1

11 For simplicity, we temporarily define the notation Gθ(K) := SatAgg_{ϕ∈K} Gθ(ϕ).

Given G* such that:

(23) G* = argmin_{Gθ} (Gθ(ϕ) + penalty(Gθ, q))

  • If G*(K) < q: then for all Gθ, Gθ(K) < q, and therefore ⟨K, G(· | θ), Θ⟩ ⊨_q ϕ.
  • If G*(K) ≥ q and G*(ϕ) ≥ q: then for all Gθ with Gθ(K) ≥ q, we have that Gθ(ϕ) ≥ G*(ϕ) ≥ q, and therefore ⟨K, G(· | θ), Θ⟩ ⊨_q ϕ.
  • If G*(K) ≥ q and G*(ϕ) < q: then ⟨K, G(· | θ), Θ⟩ ⊭_q ϕ.
Clearly, Equation (23) cannot be used as an objective function for gradient-descent due to null derivatives. Therefore, we propose to approximate the penalty function with the soft constraint:
elu(α, β(q − Gθ(K))) = { β(q − Gθ(K)) if Gθ(K) ≤ q; α(e^{q−Gθ(K)} − 1) otherwise },

where α ≥ 0 and β ≥ 0 are hyper-parameters (see Figure 6). When Gθ(K) < q, the penalty is linear in q − Gθ(K) with a slope of β. Setting β high, the gradients for Gθ(K) will be high in absolute value if the knowledge-base is not satisfied. When Gθ(K) > q, the penalty is a negative exponential that converges to −α. Setting α low but non-zero seeks to ensure that the gradients do not vanish when the penalty should not apply (that is, when the knowledge-base is satisfied). We obtain the following approximate objective function:
(24) G* = argmin_{Gθ} (Gθ(ϕ) + elu(α, β(q − Gθ(K))))
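A sketch of the resulting objective in TensorFlow 2 follows; the values of q, α and β and the names sat_phi and sat_K (standing for Gθ(ϕ) and Gθ(K), computed elsewhere) are assumptions for the sketch.

    import tensorflow as tf

    q, alpha, beta = 0.95, 0.05, 10.0   # assumed hyper-parameter values

    def elu_penalty(sat_K):
        # elu(alpha, beta * (q - sat_K)): linear with slope beta when the
        # knowledge-base is unsatisfied (sat_K <= q), a negative exponential
        # converging towards -alpha otherwise, keeping gradients alive.
        x = q - sat_K
        return tf.where(x >= 0.0, beta * x, alpha * (tf.exp(x) - 1.0))

    def refutation_loss(sat_phi, sat_K):
        # Eq. (24): search for a counterexample by minimizing G(phi)
        # while softly constraining G(K) >= q.
        return sat_phi + elu_penalty(sat_K)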
Section 4.8 will illustrate the use of reasoning by refutation with an example in comparison with reasoning as querying after learning. Of course, other forms of reasoning are possible, not least that adopted in [6], but a direct comparison is outside the scope of this paper and left as future work.

4. The Reach of Logic Tensor Networks

The objective of this section is to show how the language of Real Logic can be used to specify a number of tasks that involve learning from data and reasoning. Examples of such tasks are classification, regression, clustering, and link prediction. The solution of a problem specified in Real Logic is obtained by interpreting such a specification in Logic Tensor Networks. The LTN library implements Real Logic in TensorFlow 2 [1] and is available from GitHub.13 Every logical operator is grounded using TensorFlow primitives, such that LTN directly implements a TensorFlow graph. Due to TensorFlow's built-in optimization, LTN is relatively efficient while providing the expressive power of first-order logic. Details on the implementation of the examples described in this section are reported in Appendix A. The implementation of the examples presented here is also available from the LTN repository on GitHub. Except when stated otherwise, the results reported are the average over 10 runs using a 95% confidence interval. Every example uses the stable product configuration to approximate the Real Logic operators and the Adam optimizer [35] with a learning rate of 0.001. Table A.3 in the Appendix gives an overview of the network architectures used to obtain the results reported in this section.

12 In the objective function, G* should satisfy G*(K) ≥ q before reducing G*(ϕ), because the penalty c, which is greater than 1, is higher than any potential reduction in Gθ(ϕ), which is smaller than or equal to 1.
13 https://github.com/logictensornetworks/logictensornetworks

Figure 6: elu(α, βx), where α ≥ 0 and β ≥ 0 are hyper-parameters. The function elu(α, β(q − Gθ(K))), with α low and β high, is a soft constraint for penalty(Gθ, q) suitable for learning.

4.1. Binary Classification

The simplest machine learning task is binary classification. Suppose that one wants to learn a binary classifier A for a set of points in [0,1]2 . Suppose that a set of positive and negative training examples is given. LTN uses the following language and grounding:

Domains:

points (denoting the examples).

Variables:

x+ for the positive examples.
x− for the negative examples.
x for all examples.
D(x) = D(x+) = D(x−) = points.
Predicates:
A(x) for the trainable classifier.
Din(A) = points.

Axioms:

(25) ∀x+ A(x+)
(26) ∀x− ¬A(x−)

Grounding:

G(points) = [0,1]².
G(x) ∈ [0,1]^{m×2} (G(x) is a sequence of m points, that is, m examples).
G(x+) = ⟨d ∈ G(x) : ‖d − (0.5,0.5)‖ < 0.09⟩.14
G(x−) = ⟨d ∈ G(x) : ‖d − (0.5,0.5)‖ ≥ 0.09⟩.15
G(A | θ) : x ↦ sigmoid(MLPθ(x)), where MLP is a Multilayer Perceptron with a single output neuron, whose parameters θ are to be learned.16

Learning:

Let D be the data set of all examples. The objective function, with K = {∀x+ A(x+), ∀x− ¬A(x−)}, is given by argmax_{θ∈Θ} SatAgg_{ϕ∈K} G_{θ,x←D}(ϕ).17 In practice, the optimizer uses the following loss function:

L = 1 − SatAgg_{ϕ∈K} G_{θ,x←B}(ϕ)
where B is a mini-batch sampled from D.18 The objective and loss functions depend on the following hyper-parameters:
  • the choice of fuzzy logic operator semantics used to approximate each connective and quantifier,
  • the choice of hyper-parameters underlying the operators, such as the value of the exponent p in any generalized mean,
  • the choice of formula aggregator function.
Using the stable product configuration to approximate connectives and quantifiers, with p = 2 for every occurrence of ApME, and using ApME with p = 2 also for the formula aggregator, yields the following satisfaction equation:

SatAgg_{ϕ∈K} Gθ(ϕ) = 1 − ( (1/2) [ (1 − (1 − ((1/|G(x+)|) ∑_{v∈G(x+)} (1 − sigmoid(MLPθ(v)))²)^{1/2}))² + (1 − (1 − ((1/|G(x−)|) ∑_{v∈G(x−)} sigmoid(MLPθ(v))²)^{1/2}))² ] )^{1/2}
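The same satisfaction can be computed with a few lines of TensorFlow 2. The sketch below is not the LTN library API: the MLP architecture is an assumption, x_pos and x_neg stand for batches of positive and negative examples, and the stable projections of Section 2.4 are omitted for readability.

    import tensorflow as tf

    mlp = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="elu"),    # hidden layer (assumed size)
        tf.keras.layers.Dense(1, activation="sigmoid")  # single output neuron: G(A)
    ])

    def forall(truths, p=2.0):
        # ApME: generalized mean w.r.t. the error, a smooth minimum
        return 1.0 - tf.reduce_mean((1.0 - truths) ** p) ** (1.0 / p)

    def sat_agg(x_pos, x_neg):
        sat1 = forall(mlp(x_pos))         # forall x+ : A(x+)
        sat2 = forall(1.0 - mlp(x_neg))   # forall x- : not A(x-)
        # formula aggregator: ApME with p = 2 over the two axioms
        return forall(tf.stack([sat1, sat2]))

    # The loss is then 1 - sat_agg(batch_pos, batch_neg), minimized with Adam.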

14 G(x+) are, by definition in this example, the training examples with Euclidean distance to the center (0.5, 0.5) smaller than the threshold of 0.09.
15 G(x−) are, by definition, the training examples with Euclidean distance to the center (0.5, 0.5) larger than or equal to the threshold of 0.09.
16 sigmoid(x) = 1/(1 + e^{−x})
17 The notation G_{x←D}(ϕ(x)) means that the variable x is grounded with the data D (that is, G(x) := D) when grounding ϕ(x).
18 As usual in ML, while it is possible to compute the loss function and gradients over the entire data set, it is preferred to use mini-batches of the examples.

Figure 7: Symbolic Tensor Computational Graph for the Binary Classification Example. In the figure, G(x+) and G(x−) are inputs to the network Gθ(A), and the dotted lines indicate the propagation of activation from each input through the network, which produces two outputs.
The computational graph of Figure 7 shows SatAgg_{ϕ∈K} Gθ(ϕ) as used with the above loss function.
We are therefore interested in learning the parameters θ of the MLP used to model the binary classifier. We sample 100 data points uniformly from [0,1]2 to populate the data set of positive and negative examples. The data set was split into 50 data points for training and 50 points for testing. The training was carried out for a fixed number of 1000 epochs using backpropagation with the Adam optimizer [35] with a batch size of 64 examples. Figure 8 shows the classification accuracy and satisfaction level of the LTN on both training and test sets averaged over 10 runs using a 95% confidence interval. The accuracy shown is the ratio of examples correctly classified, with an example deemed as being positive if the classifier outputs a value higher than 0.5 .
Notice that a model can reach an accuracy of 100% while the satisfaction of the knowledge base is not yet maximized. For example, if the threshold for an example to be deemed positive is 0.7, all examples may be classified correctly with a confidence score of 0.7. In that case, while the accuracy is already maximized, the satisfaction of ∀x+ A(x+) would still be 0.7, and can still improve until the confidence for every sample reaches 1.0.
This first example, although straightforward, illustrates step-by-step the process of using LTN in a simple setting. Notice that, according to the nomenclature of Section 3.3, measuring accuracy amounts to issuing the truth query (respectively, the generalization truth query) A(x) for all the examples of the training set (respectively, test set) and comparing the results with the classification threshold. In Figure 9, we show the results of such queries A(x) after optimization. Next, we show how the LTN language can be used to solve progressively more complex problems by combining learning and reasoning.

4.2. Multi-Class Single-Label Classification

The natural extension of binary classification is a multi-class classification task. We first approach multi-class single-label classification, which assumes that each example is assigned to one and only one label.
For illustration purposes, we use the Iris flower data set [20], which consists of classification into three mutually exclusive classes; call these $A$, $B$, and $C$. While one could train three unary predicates $A(x)$, $B(x)$ and $C(x)$, it turns out to be more effective to model this problem with a single binary predicate $P(x, l)$, where $l$ is a variable denoting a multi-class label, in this case classes $A$, $B$ or $C$. This syntax allows one to write statements quantifying over the classes, e.g. $\forall x\, (\exists l\, (P(x, l)))$. Since the classes are mutually exclusive, the output layer of the MLP representing $P(x, l)$ will be a softmax layer, instead of a sigmoid function, to ensure the exclusivity constraint on satisfiability scores.¹⁹

Figure 8: Binary Classification task (training and test set performance): Average accuracy (left) and satisfiability (right). Due to the random initializations, accuracy and satisfiability start on average at 0.5, with performance increasing rapidly after a few epochs.

The problem can be specified as follows:

Domains:

items, denoting the examples from the Iris flower data set.
labels, denoting the class labels.

Variables:

$x_A$, $x_B$, $x_C$ for the positive examples of classes $A$, $B$, $C$.
$x$ for all examples.
$D(x_A) = D(x_B) = D(x_C) = D(x) =$ items.

Constants:

$l_A$, $l_B$, $l_C$, the labels of classes $A$ (Iris setosa), $B$ (Iris virginica), $C$ (Iris versicolor), respectively.
$D(l_A) = D(l_B) = D(l_C) =$ labels.

Predicates:

$P(x, l)$, denoting the fact that item $x$ is classified as $l$.
$D_{in}(P) =$ items, labels.
Axioms:
(27) $\forall x_A\, P(x_A, l_A)$
(28) $\forall x_B\, P(x_B, l_B)$
(29) $\forall x_C\, P(x_C, l_C)$
Notice that rules about exclusiveness, such as $\forall x\, (P(x, l_A) \rightarrow (\neg P(x, l_B) \wedge \neg P(x, l_C)))$, are not included since such constraints are already imposed by the grounding of $P$ below, more specifically by the softmax function.

¹⁹ $\operatorname{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$.

Figure 9: Binary Classification task (querying the trained predicate $A(x)$): it is interesting to see how $A(x)$ could be appropriately named as denoting the inside of the central region shown in the figure, so that $\neg A(x)$ represents the outside of the region.

Grounding:

$\mathcal{G}(\text{items}) = \mathbb{R}^4$; items are described by 4 features: the length and the width of the sepals and petals, in centimeters.
$\mathcal{G}(\text{labels}) = \mathbb{N}^3$; we use a one-hot encoding to represent classes.
$\mathcal{G}(x_A) \in \mathbb{R}^{m_1 \times 4}$, that is, $\mathcal{G}(x_A)$ is a sequence of $m_1$ examples of class $A$.
$\mathcal{G}(x_B) \in \mathbb{R}^{m_2 \times 4}$, $\mathcal{G}(x_B)$ is a sequence of $m_2$ examples of class $B$.
$\mathcal{G}(x_C) \in \mathbb{R}^{m_3 \times 4}$, $\mathcal{G}(x_C)$ is a sequence of $m_3$ examples of class $C$.
$\mathcal{G}(x) \in \mathbb{R}^{(m_1 + m_2 + m_3) \times 4}$, $\mathcal{G}(x)$ is a sequence of all the examples.
$\mathcal{G}(l_A) = [1,0,0]$, $\mathcal{G}(l_B) = [0,1,0]$, $\mathcal{G}(l_C) = [0,0,1]$.
$\mathcal{G}(P \mid \theta) : x, l \mapsto l \cdot \operatorname{softmax}(\operatorname{MLP}_\theta(x))$, where the MLP has three output neurons corresponding to as many classes, and $\cdot$ denotes the dot product as a way of selecting an output for $\mathcal{G}(P \mid \theta)$; multiplying the MLP's output by the one-hot vector $l$ gives the truth degree corresponding to the class denoted by $l$. A minimal sketch of this grounding follows.
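The sketch below shows one way to implement this grounding in TensorFlow 2, under the same illustrative assumptions as the previous sketch (the hidden-layer size is a guess):

```python
import tensorflow as tf

# Hypothetical grounding of the binary predicate P(x, l) for the Iris example.
mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="elu"),
    tf.keras.layers.Dense(3),  # one logit per class A, B, C
])

def P(x, l):
    # x: (batch, 4) Iris features; l: (batch, 3) one-hot label constant.
    probs = tf.nn.softmax(mlp(x), axis=-1)    # softmax enforces exclusivity
    return tf.reduce_sum(probs * l, axis=-1)  # dot product selects one output
```

With this grounding, $P(x, l_A)$, $P(x, l_B)$ and $P(x, l_C)$ share all of the MLP's weights and differ only in which softmax output is selected.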

Learning:

The logical operators and connectives are approximated using the stable product configuration with p=2 for ApME . For the formula aggregator, ApME is used also with p=2 .
The computational graph of Figure 10 illustrates how $\operatorname{SatAgg}_{\phi \in \mathcal{K}} \mathcal{G}_\theta(\phi)$ is obtained. If $B$ denotes a mini-batch sampled from the data set of all examples, the loss function (to minimize) is:
$$\mathcal{L} = 1 - \operatorname{SatAgg}_{\phi \in \mathcal{K}} \mathcal{G}_{\theta, x \leftarrow B}(\phi).$$
Figure 11 shows the result of training with the Adam optimizer with batches of 64 examples. Accuracy measures the ratio of examples correctly classified, with example $x$ labeled as $\operatorname{argmax}_l P(x, l)$.²⁰ Classification accuracy reaches an average value near 1.0 for both the training and test data after some 100 epochs. Satisfaction levels of the Iris flower predictions continue to increase for the rest of the training (500 epochs) to more than 0.8.
It is worth contrasting the choice of using a binary predicate (P(x,l)) in this example with the option of using multiple unary predicates (lA(x),lB(x),lC(x)) ,one for each class. Notice how each predicate is normally associated with an output neuron. In the case of the unary predicates, the networks would be disjoint (or modular), whereas weight-sharing takes place with the use of the binary predicate. Since l is instantiated into lA,lB,lC ,in practice P(x,l) becomes P(x,lA),P(x,lB),P(x,lC) ,which is implemented via three output neurons to which a softmax function applies.

4.3. Multi-Class Multi-Label Classification

We now turn to multi-label classification, whereby multiple labels can be assigned to each example. As a first example of the reach of LTN, we shall see how the previous example can be extended naturally using LTN to account for multiple labels, which is not always a trivial extension for most ML algorithms. The standard approach to the multi-label problem is to provide explicit negative examples for each class. By contrast, LTN can use background knowledge to relate classes directly to each other, thus becoming a powerful tool for the multi-label problem, where labeled data is typically scarce. We explore the Leptograpsus crabs data set [10], consisting of 200 examples of 5 morphological measurements of 50 crabs. The task is to classify the crabs according to their color and sex. There are four labels: blue, orange, male, and female. The color labels are mutually exclusive, and so are the labels for sex. LTN will be used to specify such information logically.

²⁰ This is also known as top-1 accuracy, as proposed in [39]. Cross-entropy results ($-t \log(y)$) could have been reported here, as is common with the use of softmax, although it is worth noting that, of course, the loss function used by LTN is different.

Figure 10: Symbolic Tensor Computational Graph for the Multi-Class Single-Label Problem. As before, the dotted lines in the figure indicate the propagation of activation from each input through the network, in this case producing three outputs.
Figure 11: Multi-Class Single-Label Classification: Classification accuracy (left) and satisfaction level (right).

Domains:

items denoting the examples from the crabs dataset.
labels denoting the class labels.

Variables:

$x_{blue}$, $x_{orange}$, $x_{male}$, $x_{female}$ for the positive examples of each class.
$x$, used to denote all the examples.
$D(x_{blue}) = D(x_{orange}) = D(x_{male}) = D(x_{female}) = D(x) =$ items.

Constants:

$l_{blue}$, $l_{orange}$, $l_{male}$, $l_{female}$ (the labels for each class).
$D(l_{blue}) = D(l_{orange}) = D(l_{male}) = D(l_{female}) =$ labels.

Predicates:

$P(x, l)$, denoting the fact that item $x$ is labelled as $l$.
$D_{in}(P) =$ items, labels.

Axioms:

(30) $\forall x_{blue}\, P(x_{blue}, l_{blue})$
(31) $\forall x_{orange}\, P(x_{orange}, l_{orange})$
(32) $\forall x_{male}\, P(x_{male}, l_{male})$
(33) $\forall x_{female}\, P(x_{female}, l_{female})$
(34) $\forall x\, \neg(P(x, l_{blue}) \wedge P(x, l_{orange}))$
(35) $\forall x\, \neg(P(x, l_{male}) \wedge P(x, l_{female}))$
Notice how logical rules (34) and (35) above represent the mutual exclusion of the labels on color and sex, respectively. As a result, negative examples are not used explicitly in this specification.

Grounding:

$\mathcal{G}(\text{items}) = \mathbb{R}^5$; the examples from the data set are described using 5 features.
$\mathcal{G}(\text{labels}) = \mathbb{N}^4$; one-hot vectors are used to represent class labels.²¹
$\mathcal{G}(x_{blue}) \in \mathbb{R}^{m_1 \times 5}$, $\mathcal{G}(x_{orange}) \in \mathbb{R}^{m_2 \times 5}$, $\mathcal{G}(x_{male}) \in \mathbb{R}^{m_3 \times 5}$, $\mathcal{G}(x_{female}) \in \mathbb{R}^{m_4 \times 5}$. These sequences are not mutually exclusive; one example can, for instance, be in both $x_{blue}$ and $x_{male}$.
$\mathcal{G}(l_{blue}) = [1,0,0,0]$, $\mathcal{G}(l_{orange}) = [0,1,0,0]$, $\mathcal{G}(l_{male}) = [0,0,1,0]$, $\mathcal{G}(l_{female}) = [0,0,0,1]$.
$\mathcal{G}(P \mid \theta) : x, l \mapsto l \cdot \operatorname{sigmoid}(\operatorname{MLP}_\theta(x))$, with the MLP having four output neurons corresponding to as many classes. As before, $\cdot$ denotes the dot product, which selects a single output. By contrast with the previous example, notice the use of a sigmoid function instead of a softmax function.

21 There are two possible approaches here: either each item is labeled with one multi-hot encoding or each item is labeled with several one-hot encodings. The latter approach was used in this example.

Learning:

As before, the fuzzy logic operators and connectives are approximated using the stable product configuration with p=2 for ApME ,and for the formula aggregator, ApME is also used with p=2 .
Figure 12 shows the result of training with the Adam optimizer using backpropagation with batches of 64 examples. This time, the accuracy is defined as $1 - \mathrm{HL}$, where HL is the average Hamming loss, i.e. the fraction of labels predicted incorrectly, with a classification threshold of 0.5 (given an example $u$, if the model outputs a value greater than 0.5 for class $C$, then $u$ is deemed as belonging to class $C$). The rightmost graph in Figure 12 illustrates how LTN learns the constraint that a crab cannot have both blue and orange color, which is discussed in more detail in what follows.

Querying:

To illustrate the learning of constraints by LTN, we have queried three formulas that were not explicitly part of the knowledge-base, over time during learning:
(36) $\phi_1: \forall x\, (P(x, l_{blue}) \rightarrow \neg P(x, l_{orange}))$
(37) $\phi_2: \forall x\, (P(x, l_{blue}) \rightarrow P(x, l_{orange}))$
(38) $\phi_3: \forall x\, (P(x, l_{blue}) \rightarrow P(x, l_{male}))$
For querying, we use $p = 5$ when approximating the universal quantifiers with $A_{pME}$. A higher $p$ denotes a stricter universal quantification with a stronger focus on outliers (see Section 2.4).²² We should expect $\phi_1$ to hold true (every blue crab cannot be orange, and vice-versa²³), and we should expect $\phi_2$ (every blue crab is also orange) and $\phi_3$ (every blue crab is male) to be false. The results are reported in the rightmost plot of Figure 12. Prior to training, the truth-values of $\phi_1$ to $\phi_3$ are non-informative. During training, one can see, with the maximization of the satisfaction of the knowledge base, a trend towards the satisfaction of $\phi_1$, and an opposite trend of $\phi_2$ and $\phi_3$ towards false. A minimal sketch of such a query is given below.
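The sketch below shows how $\phi_1$ could be queried, assuming a sigmoid-based grounding of this section's predicate (analogous to the softmax sketch of Section 4.2) and the product configuration's Reichenbach implication $I(a, b) = 1 - a + ab$ with standard negation $1 - a$; all names are illustrative:

```python
import tensorflow as tf

mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="elu"),
    tf.keras.layers.Dense(4),  # logits for blue, orange, male, female
])

def P(x, l):
    # Multi-label grounding: sigmoid instead of softmax (labels not exclusive).
    return tf.reduce_sum(tf.sigmoid(mlp(x)) * l, axis=-1)

def implies(a, b):
    # Reichenbach implication of the product configuration: I(a, b) = 1 - a + a*b.
    return 1.0 - a + a * b

def forall(truths, p):
    # A_pME; a stricter p = 5 is used at query time than during training.
    return 1.0 - tf.reduce_mean((1.0 - truths) ** p) ** (1.0 / p)

# phi_1: forall x (P(x, l_blue) -> not P(x, l_orange))
def query_phi1(x_all, l_blue, l_orange):
    return forall(implies(P(x_all, l_blue), 1.0 - P(x_all, l_orange)), p=5.0)
```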

4.4. Semi-Supervised Pattern Recognition

Let us now explore two more elaborate classification tasks, which showcase the benefit of using logical reasoning alongside machine learning. With these two examples, we also aim to provide a more direct comparison with a related neurosymbolic system, DeepProbLog [41]. The benchmark examples below were introduced in the DeepProbLog paper [41].

²² Training should usually not focus on outliers, as optimizers would struggle to generalize and tend to get stuck in local minima. However, when querying $\phi_1$, $\phi_2$, $\phi_3$, we wish to be more careful about the interpretation of our statements. See also Section 3.1.3.
²³ That is, $\forall x\, (P(x, l_{blue}) \rightarrow \neg P(x, l_{orange}))$.

Figure 12: Multi-Class Multi-Label Classification: Classification Accuracy (left), Satisfiability level (middle), and Querying of Constraints (right).
Single Digits Addition: Consider the predicate addition$(X, Y, N)$, where $X$ and $Y$ are images of digits (the MNIST data set will be used), and $N$ is a natural number corresponding to the sum of these digits. This predicate should return an estimate of the validity of the addition. For instance, if $x_3$ and $x_8$ denote MNIST images of the digits 3 and 8, addition$(x_3, x_8, 11)$ is a valid addition, whereas addition$(x_3, x_8, 5)$ is not.
Multi Digits Addition: The experiment is extended to numbers with more than one digit. Consider the predicate addition$([X_1, X_2], [Y_1, Y_2], N)$, where $[X_1, X_2]$ and $[Y_1, Y_2]$ are lists of images of digits, representing two multi-digit numbers, and $N$ is a natural number corresponding to the sum of the two multi-digit numbers. For instance, for images showing the numbers 35 and 72, addition$([x_3, x_5], [x_7, x_2], 107)$ is a valid addition, whereas addition$([x_3, x_8], [x_9, x_2], 26)$ is not.
A natural neurosymbolic approach is to seek to learn a single-digit classifier and to benefit from knowledge readily available about the properties of addition. For instance, suppose that a predicate digit$(x, d)$ gives the likelihood of an image $x$ being of digit $d$. A definition for addition$(x_3, x_8, 11)$ in LTN is:
$$\exists d_1, d_2 : d_1 + d_2 = 11\ \big(\operatorname{digit}(x_3, d_1) \wedge \operatorname{digit}(x_8, d_2)\big)$$
In [41], the above task is made more complicated by not providing labels for the single-digit images during training. Instead, training takes place on pairs of images with labels made available for the result only, that is, the sum of the individual labels. The single-digit classifier is not explicitly trained by itself; its output is a piece of latent information that is used by the logic. However, this does not pose a problem for end-to-end neurosymbolic systems such as LTN or DeepProbLog for which the gradients can propagate through the logical structures.
We start by illustrating an LTN theory that can be used to learn the predicate digit. The specification of the theory below is for the single digit addition example, although it extends easily to the multiple digits case.

Domains:

images, denoting the MNIST digit images,
results, denoting the integers that label the results of the additions,
digits, denoting the digits from 0 to 9 .

Variables:

x,y ,ranging over the MNIST images in the data,
n for the labels,i.e. the result of each addition,
d1,d2 ranging over digits.
D(x)=D(y)= images,
D(n)= results,
D(d1)=D(d2)= digits.

Predicates:

digit$(x, d)$ for the single-digit classifier, where $d$ is a term denoting a digit constant or a digit variable. The classifier should return the probability of an image $x$ being of digit $d$.
$D_{in}(\text{digit}) =$ images, digits.

Axioms:

Single Digit Addition:
(39) $\forall \operatorname{Diag}(x, y, n)\ \big(\exists d_1, d_2 : d_1 + d_2 = n\ (\operatorname{digit}(x, d_1) \wedge \operatorname{digit}(y, d_2))\big)$
Multiple Digit Addition:
(40) $\forall \operatorname{Diag}(x_1, x_2, y_1, y_2, n)\ \big(\exists d_1, d_2, d_3, d_4 : 10d_1 + d_2 + 10d_3 + d_4 = n\ (\operatorname{digit}(x_1, d_1) \wedge \operatorname{digit}(x_2, d_2) \wedge \operatorname{digit}(y_1, d_3) \wedge \operatorname{digit}(y_2, d_4))\big)$
Notice the use of Diag: when grounding $x$, $y$, $n$ with three sequences of values, the $i$-th examples of each variable are matched. That is, $(\mathcal{G}(x)_i, \mathcal{G}(y)_i, \mathcal{G}(n)_i)$ is a tuple from our dataset of valid additions. Using the diagonal quantification, LTN aggregates pairs of images and their corresponding result, rather than any combination of images and results.
Notice also the guarded quantification: by quantifying only over the latent "digit labels" (i.e. $d_1$, $d_2$) that can add up to the result label ($n$, given in the dataset), we incorporate symbolic information into the system. For example, in (39), if $n = 3$, the only valid tuples $(d_1, d_2)$ are $(0,3)$, $(3,0)$, $(1,2)$, $(2,1)$. Gradients will only backpropagate to these values. A sketch of this guarded aggregation is given below.
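The sketch below illustrates the guarded existential quantifier for axiom (39), assuming the two images have already been passed through the digit classifier; the function names and the pair-enumeration strategy are illustrative assumptions:

```python
import itertools
import tensorflow as tf

def exists_pM(truths, p):
    # A_pM(a_1, ..., a_n) = (mean(a_i^p))^(1/p), used to approximate the
    # existential quantifier.
    return tf.reduce_mean(truths ** p) ** (1.0 / p)

def sat_single_addition(digit_probs_x, digit_probs_y, n, p_exists):
    # digit_probs_x, digit_probs_y: shape (10,) softmax outputs for the images.
    # Guarded quantification: only enumerate pairs (d1, d2) with d1 + d2 = n.
    pairs = [(d1, d2) for d1, d2 in itertools.product(range(10), repeat=2)
             if d1 + d2 == n]
    truths = tf.stack([digit_probs_x[d1] * digit_probs_y[d2]  # product conjunction
                       for d1, d2 in pairs])
    return exists_pM(truths, p_exists)
```

Gradients flow only through the selected entries of the two softmax outputs, exactly as described above.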

Grounding:

$\mathcal{G}(\text{images}) = [0,1]^{28 \times 28 \times 1}$. The MNIST data set has images of 28 by 28 pixels. The images are grayscale and have just one channel. The grayscale pixel values from 0 to 255 are converted to the range $[0, 1]$.
$\mathcal{G}(\text{results}) = \mathbb{N}$.
$\mathcal{G}(\text{digits}) = \{0, 1, \dots, 9\}$.
$\mathcal{G}(x) \in [0,1]^{m \times 28 \times 28 \times 1}$, $\mathcal{G}(y) \in [0,1]^{m \times 28 \times 28 \times 1}$, $\mathcal{G}(n) \in \mathbb{N}^m$.²⁴
$\mathcal{G}(d_1) = \mathcal{G}(d_2) = \langle 0, 1, \dots, 9 \rangle$.
$\mathcal{G}(\operatorname{digit} \mid \theta) : x, d \mapsto \operatorname{onehot}(d) \cdot \operatorname{softmax}(\operatorname{CNN}_\theta(x))$, where CNN is a Convolutional Neural Network with 10 output neurons, one for each class. Notice that, in contrast with the previous examples, $d$ is an integer label; $\operatorname{onehot}(d)$ converts it into a one-hot label.

²⁴ Notice the use of the same number $m$ of examples for each of these variables, as they are supposed to match one-to-one due to the use of Diag.

Learning:

The computational graph of Figure 13 shows the objective function for the satisfiability of the knowledge base. The stable product configuration is used, with hyper-parameter $p = 2$ for the operator $A_{pME}$ approximating universal quantification ($\forall$). Let $p_\exists$ denote the exponent hyper-parameter used in the generalized mean $A_{pM}$ approximating existential quantification ($\exists$). Three scenarios are investigated and compared in the Multiple Digit experiment (Figure 15):
  1. $p_\exists = 1$ throughout the entire experiment,
  2. $p_\exists = 2$ throughout the entire experiment, or
  3. $p_\exists$ follows a schedule, changing from $p_\exists = 1$ to $p_\exists = 6$ gradually with the number of training epochs.
In the Single Digit experiment, only the last scenario above (the schedule) is investigated (Figure 14).
We train to maximize satisfiability by using batches of 32 examples of image pairs, labeled by the result of their addition. As done in [41], the experimental results vary the number of examples in the training set to emphasize the generalization abilities of a neurosymbolic approach. Accuracy is measured by predicting the digit values using the predicate digit and reporting the ratio of examples for which the addition is correct. A comparison is made with the same baseline method used in [41]: given a pair of MNIST images, a non-pre-trained CNN outputs embeddings for each image (Siamese neural network). The embeddings are provided as input to dense layers that classify the addition into one of the 19 (respectively, 199) possible results of the Single Digit Addition (respectively, Multiple Digit Addition) experiments. The baseline is trained using a cross-entropy loss between the labels and the predictions. As expected, such a standard deep learning approach struggles with the task without the provision of symbolic meaning about intermediate parts of the problem.
Experimentally, we find that the optimizer for the neurosymbolic system gets stuck in a local optimum at initialization in about 1 out of 5 runs. We therefore present the results as an average of the 10 best outcomes out of 15 runs of each algorithm (that is, for the baseline as well). The examples of digit pairs selected from the full MNIST data set are randomized at each run.
Figure 15 shows that the use of $p_\exists = 2$ from the start produces poor results. A higher value for $p$ in $A_{pM}$ weighs up the instances with a higher truth-value (see also Appendix C for a discussion). Starting already with a high value for $p_\exists$, the classes with a higher initial truth-value for a given example will have higher gradients and be prioritized for training, which does not make practical sense when the predicates are randomly initialized. Increasing $p_\exists$ by following a schedule is the most promising approach, as sketched below. In this particular example, $p_\exists = 1$ is also shown to be adequate purely from a learning perspective. However, $p_\exists = 1$ implements a simple average, which does not account well for the meaning of $\exists$; the resulting satisfaction value is not meaningful from a reasoning perspective.
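A minimal sketch of such a schedule, assuming a simple linear interpolation (the paper does not commit to a specific interpolation):

```python
def p_exists_schedule(epoch, num_epochs, p_start=1.0, p_end=6.0):
    # Gradually increase the existential exponent from p_start to p_end.
    frac = min(epoch / float(num_epochs), 1.0)
    return p_start + frac * (p_end - p_start)
```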
Table 1 shows that the training and test times of LTN are of the same order of magnitude as those of the CNN baselines. Table 2 shows that LTN reaches similar accuracy as that reported by DeepProbLog.
Figure 13: Symbolic Tensor Computational Graph for the Single Digit Addition task. Notice that the figure does not depict accurate dimensions for the tensors; $\mathcal{G}(x)$ and $\mathcal{G}(y)$ are in fact 4D tensors of dimensions $m \times 28 \times 28 \times 1$. Computing results with the variables $d_1$ or $d_2$ corresponds to the addition of a further axis of dimension 10.
Figure 14: Single Digit Addition Task: Accuracy and satisfiability results (top) and results in the presence of fewer examples (bottom) in comparison with standard Deep Learning using a CNN (blue lines).
Model    | Single Digits: Train | Single Digits: Test | Multi Digits: Train | Multi Digits: Test
baseline | 2.72±0.23 ms         | 1.45±0.21 ms        | 3.87±0.24 ms        | 2.10±0.30 ms
LTN      | 5.36±0.25 ms         | 3.44±0.39 ms        | 8.51±0.72 ms        | 5.72±0.57 ms
Table 1: The computation time of training and test steps on the single and multiple digit addition tasks, measured on a computer with a single Nvidia Tesla V100 GPU and averaged over 1000 steps. Each step operates on a batch of 32 examples. The computational efficiency of the LTN and the CNN baseline systems are of the same order of magnitude.
Model       | Single Digits: 30 000 examples | Single Digits: 3 000 examples | Multi Digits: 15 000 examples | Multi Digits: 1 500 examples
baseline    | 95.95±0.27                     | 70.59±1.45                    | 47.19±0.69                    | 2.07±0.12
LTN         | 98.04±0.13                     | 93.49±0.28                    | 95.37±0.29                    | 88.21±0.63
DeepProbLog | 97.20±0.45                     | 92.18±1.57                    | 95.16±1.70                    | 87.21±1.92
Table 2: Accuracy (in %) on the test set: comparison of the final results obtained with LTN and those reported for DeepProbLog [41]. Although it is difficult to compare the results over time directly (the frameworks are implemented in different libraries), while achieving computational efficiency similar to the CNN baseline, LTN also reaches accuracy similar to that reported for DeepProbLog.

4.5. Regression

Another important problem in Machine Learning is regression, where a relationship is estimated between one independent variable $X$ and a continuous dependent variable $Y$. The essence of regression is, therefore, to approximate an unknown function mapping $x$ to $y$ by a learnable function $f$, given examples $(x_i, y_i)$ such that $f(x_i) = y_i$. In LTN, one can model a regression task by defining $f$ as a learnable function whose parameter values are constrained by data. Additionally, a regression task requires a notion of equality. We therefore define the predicate eq as a smooth version of the symbol $=$ to turn the constraint $f(x_i) = y_i$ into a smooth optimization problem.
In this example, we explore regression using a problem from a real estate data set²⁵ with 414 examples, each described in terms of 6 real-numbered features: the transaction date (converted to a float), the age of the house, the distance to the nearest station, the number of convenience stores in the vicinity, and the latitude and longitude coordinates. The model has to predict the house price per unit area.

Domains:

samples, denoting the houses and their features.
prices, denoting the house prices.

Variables:

x for the samples.
y for the prices.
D(x) = samples.
D(y) = prices.


²⁵ https://www.kaggle.com/quantbruce/real-estate-price-prediction

Figure 15: Multiple Digit Addition Task: Accuracy and satisfiability results (top) and results in the presence of fewer examples (bottom) in comparison with standard Deep Learning using a CNN (blue lines).

Functions:

$f(x)$, the regression function to be learned.
$D_{in}(f) =$ samples, $D_{out}(f) =$ prices.

Predicates:

$\operatorname{eq}(y_1, y_2)$, a smooth equality predicate that measures how similar $y_1$ and $y_2$ are.
$D_{in}(\operatorname{eq}) =$ prices, prices.
Axioms:
(41) $\forall \operatorname{Diag}(x, y)\ \operatorname{eq}(f(x), y)$
Notice again the use of Diag: when grounding x and y onto sequences of values,this is done by obeying a one-to-one correspondence between the sequences. In other words, we aggregate pairs of corresponding samples and prices, instead of any combination thereof.

Grounding:

$\mathcal{G}(\text{samples}) = \mathbb{R}^6$.
$\mathcal{G}(\text{prices}) = \mathbb{R}$.
$\mathcal{G}(x) \in \mathbb{R}^{m \times 6}$, $\mathcal{G}(y) \in \mathbb{R}^{m \times 1}$. Notice that this specification refers to the same number $m$ of examples for $x$ and $y$, due to the above one-to-one correspondence obtained with the use of Diag.
$\mathcal{G}(\operatorname{eq}(u, v)) = \exp\big(-\alpha \sqrt{\sum_j (u_j - v_j)^2}\big)$, where the hyper-parameter $\alpha$ is a real number that scales how strict the smooth equality is.²⁶ In our experiments, we use $\alpha = 0.05$.
$\mathcal{G}(f \mid \theta) : x \mapsto \operatorname{MLP}_\theta(x)$, where $\operatorname{MLP}_\theta$ is a multilayer perceptron that ends in one neuron corresponding to a price prediction, with a linear output layer (no activation function).

Figure 17: Visualization of LTN solving a regression problem.

Learning:

The only trainable parameters of the theory are those of the model of $f$. LTN is used to estimate such parameters by maximizing the satisfaction of the knowledge base, in the usual way; a minimal sketch follows. Approximating $\forall$ using $A_{pME}$ with $p = 2$, as before, we randomly split the data set into 330 examples for training and 84 examples for testing. Figure 16 shows the satisfaction level over 500 epochs. We also plot the Root Mean Squared Error (RMSE) between the predicted prices and the labels (i.e. actual prices, also known as target values). We visualize in Figure 17 the strong correlation between actual and predicted prices at the end of one of the runs.
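The sketch below implements the smooth equality and axiom (41) in TensorFlow 2, under the same illustrative assumptions as the earlier sketches (layer sizes and names are guesses):

```python
import tensorflow as tf

ALPHA = 0.05  # strictness of the smooth equality

def eq(u, v):
    # Smooth equality: exp(-alpha * Euclidean distance), a truth degree in (0, 1].
    d = tf.sqrt(tf.reduce_sum((u - v) ** 2, axis=-1))
    return tf.exp(-ALPHA * d)

f = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="elu"),
    tf.keras.layers.Dense(1),  # linear output: the predicted price
])

def forall(truths, p=2.0):
    return 1.0 - tf.reduce_mean((1.0 - truths) ** p) ** (1.0 / p)

# Axiom (41): forall Diag(x, y) eq(f(x), y); Diag pairs x and y row-wise,
# so no cross-combinations of samples and prices are formed.
def sat_axiom(x, y):
    return forall(eq(f(x), y))
```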

4.6. Unsupervised Learning (Clustering)

In unsupervised learning, labels are either not available or are not used for learning. Clustering is a form of unsupervised learning whereby, without labels, the data is characterized by constraints

²⁶ Intuitively, the smooth equality is $\exp(-\alpha\, d(u, v))$, where $d(u, v)$ is the Euclidean distance between $u$ and $v$. It produces 1 if the distance is zero; as the distance increases, the result decreases exponentially towards 0. In case an exponential decrease is undesirable, one can adopt the following alternative: $\operatorname{eq}(u, v) = \frac{1}{1 + \alpha\, d(u, v)}$.

alone. LTN can formulate such constraints, such as:
  • clusters should be disjoint,
  • every example should be assigned to a cluster,
  • a cluster should not be empty,
  • if the points are near, they should belong to the same cluster,
  • if the points are far, they should belong to different clusters, etc.
Domains:
points, denoting the data to cluster.
points_pairs, denoting pairs of examples.
clusters, denoting the clusters.

Variables:

x, y for all points.
c ranging over the clusters.
D(x) = D(y) = points.
D(c) = clusters.

Predicates:

$C(x, c)$, the truth degree of a given point belonging to a given cluster.
$D_{in}(C) =$ points, clusters.
Axioms:
(42) $\forall x\, \exists c\ C(x, c)$
(43) $\forall c\, \exists x\ C(x, c)$
(44) $\forall c, x, y : |x - y| < th_{close}\ \big(C(x, c) \rightarrow C(y, c)\big)$
(45) $\forall c, x, y : |x - y| > th_{distant}\ \neg\big(C(x, c) \wedge C(y, c)\big)$
Notice the use of guarded quantifiers: all the pairs of points with Euclidean distance lower (resp. higher) than the value $th_{close}$ (resp. $th_{distant}$) should belong to the same cluster (resp. should not). $th_{close}$ and $th_{distant}$ are arbitrary threshold values that define some of the closest and most distant pairs of points. In our example, they are set to 0.2 and 1.0, respectively.
As done in the example of Section 4.2, the clustering predicate has mutually exclusive satisfiability scores for each cluster using a softmax layer. Therefore, there is no explicit constraint about clusters being disjoint. A sketch of the guarded axiom (44) is given after this paragraph.
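The guard can be implemented by masking, as in the following sketch; the pairwise broadcasting strategy and the Reichenbach implication are assumptions consistent with the product configuration:

```python
import tensorflow as tf

def sat_close_pairs(points, C_probs, th_close=0.2, p=4.0):
    # Axiom (44): for all c, x, y with |x - y| < th_close, C(x,c) -> C(y,c).
    # points: (m, 2) tensor; C_probs: (m, k) softmax cluster memberships.
    diff = points[:, None, :] - points[None, :, :]
    dist = tf.norm(diff, axis=-1)            # (m, m) pairwise distances
    guard = dist < th_close                  # boolean mask: guarded pairs only
    a = C_probs[:, None, :]                  # (m, 1, k)
    b = C_probs[None, :, :]                  # (1, m, k)
    impl = 1.0 - a + a * b                   # Reichenbach implication, (m, m, k)
    truths = tf.boolean_mask(impl, guard)    # keep only pairs inside the guard
    return 1.0 - tf.reduce_mean((1.0 - truths) ** p) ** (1.0 / p)  # A_pME
```

Pairs outside the guard contribute nothing to the aggregation, so no gradients flow through them.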

Grounding:

$\mathcal{G}(\text{points}) = [-1, 1]^2$.
$\mathcal{G}(\text{clusters}) = \mathbb{N}^4$; we use one-hot vectors to represent a choice of 4 clusters.
$\mathcal{G}(x) \in [-1, 1]^{m \times 2}$, that is, $x$ is a sequence of $m$ points. $\mathcal{G}(y) = \mathcal{G}(x)$.
$th_{close} = 0.2$, $th_{distant} = 1.0$.
$\mathcal{G}(c) = \langle [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1] \rangle$.
$\mathcal{G}(C \mid \theta) : x, c \mapsto c \cdot \operatorname{softmax}(\operatorname{MLP}_\theta(x))$, where the MLP has 4 output neurons corresponding to the 4 clusters.
Figure 18: LTN solving a clustering problem by constraint optimization: ground-truth (top) and querying of each cluster $C_0$, $C_1$, $C_2$ and $C_3$, in turn.

Learning:

We use the stable real product configuration to approximate the logical operators. For $\forall$, we use $A_{pME}$ with $p = 4$. For $\exists$, we use $A_{pM}$ with $p = 1$ during the first 100 epochs and $p = 6$ thereafter, as a simplified version of the schedule used in Section 4.4. The formula aggregator is approximated by $A_{pME}$ with $p = 2$. The model is trained for a total of 1000 epochs using the Adam optimizer, which is sufficient for LTN to solve the clustering problem shown in Figure 18. Ground-truth data for this task was generated artificially by creating 4 centers and generating 50 random samples from a multivariate Gaussian distribution around each center. The trained LTN achieves a satisfaction level of 0.857 on the clustering constraints.

4.7. Learning Embeddings with LTN

A classic example of Statistical Relational Learning is the smokers-friends-cancer example introduced in [55]. Below, we show how this example can be formalized in LTN using semi-supervised embedding learning.
There are 14 people divided into two groups, $\{a, b, \dots, h\}$ and $\{i, j, \dots, n\}$. Within each group, there is complete knowledge about smoking habits. In the first group, there is complete knowledge about who has and who does not have cancer. Knowledge about the friendship relation is complete within each group only if symmetry is assumed, that is, $\forall x, y\ (\operatorname{friends}(x, y) \rightarrow \operatorname{friends}(y, x))$.
Otherwise, knowledge about friendship is incomplete in that it may be known that, e.g., $a$ is a friend of $b$, while it is not known whether $b$ is a friend of $a$. Finally, there is general knowledge about smoking, friendship, and cancer, namely that smoking causes cancer, friendship is normally symmetric and anti-reflexive, everyone has a friend, and smoking propagates (actively or passively) among friends. All this knowledge is represented in the axioms further below.

Domains:

people, to denote the individuals.
Constants:
$a, b, \dots, h, i, j, \dots, n$, the 14 individuals. Our goal is to learn an adequate embedding for each constant.
$D(a) = D(b) = \dots = D(n) =$ people.
Variables:
x,y ranging over the individuals.
D(x)=D(y)= people.

Predicates:

S(x) for smokes, F(x,y) for friends, C(x) for cancer.
D(S)=D(C)= people. D(F)= people,people.

Axioms:

Let $X_1 = \{a, b, \dots, h\}$ and $X_2 = \{i, j, \dots, n\}$ be the two groups of individuals.
Let $S = \{a, e, f, g, j, n\}$ be the smokers; this knowledge is complete for both groups.
Let $C = \{a, e\}$ be the individuals with cancer; this knowledge is complete for $X_1$ only.
Let $F = \{(a,b), (a,e), (a,f), (a,g), (b,c), (c,d), (e,f), (g,h), (i,j), (j,m), (k,l), (m,n)\}$ be the set of friendship relations; this knowledge is complete if assuming symmetry.
These facts are illustrated in Figure 20a.
We have the following axioms:
(46) $F(u, v)$, for $(u, v) \in F$
(47) $\neg F(u, v)$, for $(u, v) \notin F$, $u > v$
(48) $S(u)$, for $u \in S$
(49) $\neg S(u)$, for $u \in (X_1 \cup X_2) \setminus S$
(50) $C(u)$, for $u \in C$
(51) $\neg C(u)$, for $u \in X_1 \setminus C$
(52) $\forall x\, \neg F(x, x)$
(53) $\forall x, y\ (F(x, y) \rightarrow F(y, x))$
(54) $\forall x\, \exists y\ F(x, y)$
(55) $\forall x, y\ ((F(x, y) \wedge S(x)) \rightarrow S(y))$
(56) $\forall x\ (S(x) \rightarrow C(x))$
(57) $\forall x\ (\neg C(x) \rightarrow \neg S(x))$
Notice that the knowledge base is not satisfiable in the strict logical sense of the word. For instance, $f$ is said to smoke but not to have cancer, which is inconsistent with the rule $\forall x\ (S(x) \rightarrow C(x))$. Hence, it is important to adopt a relaxed semantics, such as the probabilistic one of MLN or the many-valued fuzzy logic interpretation of LTN.

Grounding:

$\mathcal{G}(\text{people}) = \mathbb{R}^5$. The model is expected to learn embeddings in $\mathbb{R}^5$.
$\mathcal{G}(a \mid \theta) = v_\theta(a), \dots, \mathcal{G}(n \mid \theta) = v_\theta(n)$. Every individual is associated with a vector of 5 real numbers. The embeddings are initialized randomly and uniformly.
$\mathcal{G}(x \mid \theta) = \mathcal{G}(y \mid \theta) = \langle v_\theta(a), \dots, v_\theta(n) \rangle$.
$\mathcal{G}(S \mid \theta) : x \mapsto \operatorname{sigmoid}(\operatorname{MLP\_S}_\theta(x))$, where $\operatorname{MLP\_S}_\theta$ has 1 output neuron.
$\mathcal{G}(F \mid \theta) : x, y \mapsto \operatorname{sigmoid}(\operatorname{MLP\_F}_\theta(x, y))$, where $\operatorname{MLP\_F}_\theta$ has 1 output neuron.
$\mathcal{G}(C \mid \theta) : x \mapsto \operatorname{sigmoid}(\operatorname{MLP\_C}_\theta(x))$, where $\operatorname{MLP\_C}_\theta$ has 1 output neuron.
The MLP models for $S$, $F$, $C$ are kept simple, so that most of the learning is focused on the embeddings; a minimal sketch of this grounding follows.
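A minimal sketch of this grounding, assuming the embeddings are stored as one trainable variable and the pair $(x, y)$ is fed to $F$ as a concatenation (both are assumptions, not the paper's stated implementation):

```python
import tensorflow as tf

NUM_PEOPLE, EMB_DIM = 14, 5

# One trainable embedding row per individual a, ..., n, initialized uniformly.
embeddings = tf.Variable(tf.random.uniform((NUM_PEOPLE, EMB_DIM)))

def make_mlp():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="elu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

mlp_S, mlp_F, mlp_C = make_mlp(), make_mlp(), make_mlp()

def S(x):                                         # smokes
    return mlp_S(x)[:, 0]

def C(x):                                         # cancer
    return mlp_C(x)[:, 0]

def F(x, y):                                      # friends, pair concatenated
    return mlp_F(tf.concat([x, y], axis=-1))[:, 0]

# Example: rule (52), forall x not F(x, x), aggregated with A_pME and p = 6.
sat_rule_52 = 1.0 - tf.reduce_mean(F(embeddings, embeddings) ** 6.0) ** (1.0 / 6.0)
```

Gradients of the satisfaction objective flow into both the MLPs and the embedding table, so the embeddings themselves are learned.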

Learning:

We use the stable real product configuration to approximate the operators. For $\forall$, we use $A_{pME}$ with $p = 2$ for all the rules, except for rules (52) and (53), where we use $p = 6$. The intuition behind this choice of $p$ is that no outliers are to be accepted for the friendship relation, since it is expected to be symmetric and anti-reflexive, while outliers are accepted for the other rules. For $\exists$, we use $A_{pM}$ with $p = 1$ during the first 200 epochs of training and $p = 6$ thereafter, with the same motivation as that of the schedule used in Section 4.4. The formula aggregator is approximated by $A_{pME}$ with $p = 2$.
Figure 19 shows the satisfiability over 1000 epochs of training. At the end of one of these runs, we query $S(x)$, $F(x, y)$, $C(x)$ for each individual; the results are shown in Figure 20b. We also plot the principal components of the learned embeddings [51] in Figure 21. The friendship relations are learned as expected. Rule (56), "smoking implies cancer", is inferred for group 2 even though such information was not present in the knowledge base. For group 1, the given facts for smoking and cancer for the individuals $f$ and $g$ are slightly altered, as these were inconsistent with the rules (the rule (55) for smoking propagating via friendship is incompatible with many of the given facts). Increasing the satisfaction of this rule would require decreasing the overall satisfaction of the knowledge base, which explains why it is partly ignored by LTN during training. Finally, it is interesting to note that the principal components of the learned embeddings seem to be linearly separable for the smoking and cancer classifiers (cf. Figure 21, top right and bottom right plots).

Querying:

To illustrate querying in LTN, we query over time two formulas that are not present in the knowledge-base:
(58) $\phi_1: \forall p\ (C(p) \rightarrow S(p))$
(59) $\phi_2: \forall p, q\ ((C(p) \wedge C(q)) \rightarrow F(p, q))$
We use $p = 5$ when approximating $\forall$, since the impact of an outlier at querying time should be seen as more important than at learning time. It can be seen that, as the grounding approaches satisfiability of the knowledge base, $\phi_1$ approaches true, whereas $\phi_2$ approaches false (cf. Figure 19, right).
Figure 19: Smoker-Friends-Cancer example: Satisfiability levels during training (left) and truth-values of queries ϕ1 and ϕ2 over time (right).
(a) Incomplete facts in the knowledge-base: axioms for smokers and cancer for individuals a to n (left),friendship relations in group 1 (middle), and friendship relations in group 2 (right).
(b) Querying all the truth-values using LTN after training: smokers and cancer (left), friendship relations (middle and right).
Figure 20: Smoker-Friends-Cancer example: Illustration of the facts before and after training.
Figure 21: Smoker-Friends-Cancer example: learned embeddings showing the result of applying PCA on the individuals (top left); truth-values of smokes and cancer predicates for each embedding (top and bottom right); illustration of the friendship relations which are satisfied after learning (bottom left).

4.8. Reasoning in LTN

The essence of reasoning is to find out whether a closed formula $\phi$ is a logical consequence of a knowledge base $\langle \mathcal{K}, \mathcal{G}(\cdot \mid \theta), \Theta \rangle$. Section 3.4 introduced two approaches to this problem in LTN:
  • By simply querying after learning,²⁷ one seeks to verify whether, for the grounded theories that maximally satisfy $\mathcal{K}$, the grounding of $\phi$ gives a truth-value greater than a threshold $q$. This would in general require checking an infinite number of groundings. Instead, the user approximates the search for these grounded theories by running the optimization a fixed number of times only.
  • By reasoning by refutation, one seeks to find a counter-example: a grounding that satisfies the knowledge base $\mathcal{K}$ but not the formula $\phi$, given the threshold $q$. The search is performed here using a different objective function.
We now demonstrate that reasoning by refutation is the preferred option, using a simple example where we seek to find out whether $(A \vee B) \models_q A$.

Propositional Variables:

The symbols $A$ and $B$ denote two propositional variables.
Axioms:
(60) $A \vee B$

²⁷ Here, learning refers to Section 3.2, that is, optimizing with the satisfaction of the knowledge base as the objective.

Figure 22: Querying after learning: 10 runs of the optimizer with objective $\mathcal{G}^* = \operatorname{argmax}_{\mathcal{G}_\theta} \mathcal{G}_\theta(\mathcal{K})$.²⁸ All runs converge to the optimum $G_1$; the grid search misses the counter-example.

Grounding:

$\mathcal{G}(A) = a$, $\mathcal{G}(B) = b$, where $a$ and $b$ are two real-valued parameters. The set of parameters is therefore $\theta = \{a, b\}$. At initialization, $a = b = 0$.
We use the probabilistic sum $S_P$ to approximate $\vee$, resulting in the following satisfiability measure:
(61) $\mathcal{G}_\theta(\mathcal{K}) = \mathcal{G}_\theta(A \vee B) = a + b - ab.$
There are infinitely many global optima maximizing the satisfiability of the theory, as any $\mathcal{G}_\theta$ such that $\mathcal{G}_\theta(A) = 1$ (resp. $\mathcal{G}_\theta(B) = 1$) gives a satisfiability $\mathcal{G}_\theta(\mathcal{K}) = 1$ for any value of $\mathcal{G}_\theta(B)$ (resp. $\mathcal{G}_\theta(A)$). As expected, the following groundings are examples of global optima:
$G_1: G_1(A) = 1$, $G_1(B) = 1$, $G_1(\mathcal{K}) = 1$,
$G_2: G_2(A) = 1$, $G_2(B) = 0$, $G_2(\mathcal{K}) = 1$,
$G_3: G_3(A) = 0$, $G_3(B) = 1$, $G_3(\mathcal{K}) = 1$.

Reasoning:

Does $(A \vee B) \models_q A$? That is, given the threshold $q = 0.95$, does every $\mathcal{G}_\theta$ such that $\mathcal{G}_\theta(\mathcal{K}) \geq q$ verify $\mathcal{G}_\theta(\phi) \geq q$? One can immediately notice that this is not the case: for instance, the grounding $G_3$ is a counter-example.
If one simply reasons by querying multiple groundings after learning with the usual objective $\operatorname{argmax}_{\mathcal{G}_\theta} \mathcal{G}_\theta(\mathcal{K})$, the results will all converge to $G_1$, since $\frac{\partial \mathcal{G}_\theta(\mathcal{K})}{\partial a} = 1 - b$ and $\frac{\partial \mathcal{G}_\theta(\mathcal{K})}{\partial b} = 1 - a$. Every run of the optimizer will increase $a$ and $b$ simultaneously until they reach the optimum $a = b = 1$. Because the search always converges to the same point, no counter-example is found and the logical consequence is mistakenly assumed to be true. This is illustrated in Figure 22.
When reasoning by refutation, however, the objective function has an incentive to find a counter-example with $\neg A$, as illustrated in Figure 23 and sketched below. LTN converges to the optimum $G_3$, which refutes the logical consequence.
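The following sketch reproduces the refutation objective of Figure 23 on this two-parameter example; clipping the parameters to $[0, 1]$ is an assumption about how the truth-value constraint is maintained:

```python
import tensorflow as tf

a = tf.Variable(0.0)   # G(A)
b = tf.Variable(0.0)   # G(B)
q, alpha, beta = 0.95, 0.05, 10.0

def elu(x, alpha):
    # Penalty shaping from the refutation objective: linear when the knowledge
    # base is under-satisfied, saturating near -alpha when it is satisfied.
    return tf.where(x > 0, x, alpha * (tf.exp(x) - 1.0))

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
for _ in range(2000):
    with tf.GradientTape() as tape:
        sat_K = a + b - a * b        # G(K) = G(A or B), probabilistic sum
        sat_phi = a                  # G(phi) = G(A)
        # Minimize G(phi) while softly enforcing G(K) >= q.
        loss = sat_phi + elu(beta * (q - sat_K), alpha)
    grads = tape.gradient(loss, [a, b])
    optimizer.apply_gradients(zip(grads, [a, b]))
    a.assign(tf.clip_by_value(a, 0.0, 1.0))  # keep truth-values in [0, 1]
    b.assign(tf.clip_by_value(b, 0.0, 1.0))
```

Running this loop first pushes the satisfaction of the knowledge base above $q$, then drives $a$ towards 0 and $b$ towards 1, i.e. towards the counter-example $G_3$.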

²⁸ We use the notation $\mathcal{G}(\mathcal{K}) := \operatorname{SatAgg}_{\phi \in \mathcal{K}} \mathcal{G}(\phi)$.

Figure 23: Reasoning by refutation: one run of the optimizer with objective $\mathcal{G}^* = \operatorname{argmin}_{\mathcal{G}_\theta} \big( \mathcal{G}_\theta(\phi) + \operatorname{elu}(\alpha, \beta (q - \mathcal{G}_\theta(\mathcal{K}))) \big)$, with $q = 0.95$, $\alpha = 0.05$, $\beta = 10$. In the first training epochs, the directed search prioritizes the satisfaction of the knowledge base. Then, the minimization of $\mathcal{G}_\theta(\phi)$ starts to weigh in more and the search focuses on finding a counter-example. Eventually, the run converges to the optimum $G_3$, which refutes the logical consequence.

5. Related Work

The past years have seen considerable work aiming to integrate symbolic systems and neural networks. We shall focus on work whose objective is to build computational models that integrate deep learning and logical reasoning into a so-called end-to-end (fully differentiable) architecture. We summarize a categorization in Figure 24, where the class containing LTN is further expanded into three sub-classes; the sub-class that contains LTN is highlighted in red. The reasons why one may wish to combine symbolic AI and neural networks into a neurosymbolic AI system may vary; cf. [17] for a recent comprehensive overview of approaches and challenges for neurosymbolic AI.

5.1. Neural architectures for logical reasoning

These use neural networks to perform (probabilistic) inference on logical theories. Early work in this direction has shown correspondences between various logical-symbolic systems and neural network models [27, 32, 52, 63, 65]. It has also highlighted the limits of current neural networks as models for knowledge representation. In a nutshell, current neural networks (including deep learning) have been shown capable of representing propositional logic, nonmonotonic logic programming, propositional modal logic, and fragments of first-order logic, but not full first-order or higher-order logic. Recently, there has been a resurgence of interest in the topic, with many proposals emerging [13, 48, 53]. In [13], each clause of a Stochastic Logic Program is converted into a factor graph, with reasoning becoming differentiable so that it can be implemented by deep networks. In [49], a differentiable unification algorithm is introduced so that theorem proving can be carried out inside the neural network. Furthermore, in [11, 49], neural networks are used to learn reasoning strategies and logical rule induction.
Reasoning with LTN (Section 3.4) is reminiscent of this category, given that knowledge is not represented in a traditional logical language but in Real Logic.

5.2. Logical specification of neural network architectures

Here the goal is to use a logical language to specify the architecture of a neural network. Examples include [13, 24, 26, 56, 66]. In [26], the languages of extended logic programming (logic programs with negation by failure) and answer set programming are used as background knowledge to set up the initial architecture and set of weights of a recurrent neural network, which is subsequently trained from data using backpropagation. In [24], first-order logic programs in the form of Horn clauses are used to define a neural network that can solve Inductive Logic Programming tasks, starting from the most specific hypotheses covering the set of examples. Lifted relational neural networks [66] is a declarative framework where a Datalog program is used as a compact specification of a diverse range of existing advanced neural architectures, with a particular focus on Graph Neural Networks (GNNs) and their generalizations. In [56] a weighted Real Logic is introduced and used to specify neurons in a highly modular neural network that resembles a tree structure, whereby neurons with different activation functions are used to implement the different logic operators.
To some extent, it is also possible to specify neural architectures using logic in LTN. For example, a user can define a classifier $P(x, y)$ as the formula $P(x, y) = (Q(x, y) \wedge R(y)) \rightarrow S(x, y)$. $\mathcal{G}(P)$ becomes a computational graph that combines the sub-architectures $\mathcal{G}(Q)$, $\mathcal{G}(R)$, and $\mathcal{G}(S)$ according to the syntax of the logical formula, as sketched below.
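As an illustration, the sketch below composes three hypothetical sub-networks under the product configuration (product conjunction, Reichenbach implication); the sub-network architectures and the connectives follow the illustrative formula above:

```python
import tensorflow as tf

def make_net():
    # Hypothetical sub-architectures grounding Q, R, and S.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="elu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

net_Q, net_R, net_S = make_net(), make_net(), make_net()

def P(x, y):
    # G(P): the computational graph combining G(Q), G(R), G(S)
    # according to (Q(x, y) and R(y)) -> S(x, y).
    q = net_Q(tf.concat([x, y], axis=-1))[:, 0]
    r = net_R(y)[:, 0]
    s = net_S(tf.concat([x, y], axis=-1))[:, 0]
    conj = q * r                  # product t-norm for the conjunction
    return 1.0 - conj + conj * s  # Reichenbach implication
```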

5.3. Neurosymbolic architectures for the integration of inductive learning and deductive reasoning

These architectures seek to enable the integration of inductive and deductive reasoning in a unique fully differentiable framework [15, 23, 41, 46, 47]. The systems that belong to this class combine a neural component with a logical component. The former consists of one or more neural networks, the latter provides a set of algorithms for performing logical tasks such as model checking, satisfiability, and logical consequence. These two components are tightly integrated so that learning and inference in the neural component are influenced by reasoning in the logical component and vice versa. Logic Tensor Networks belong to this category. Neurosymbolic architectures for integrating learning and reasoning can be further separated into three sub-classes:
  1. Approaches that introduce additional layers to the neural network to encode logical constraints, which modify the predictions of the network. This sub-class includes Deep Logic Models [46] and Knowledge Enhanced Neural Networks [15].
  2. Approaches that integrate logical knowledge as additional constraints in the objective function or loss function used to train the neural network (LTN and [23, 33, 47]).
  3. Approaches that apply (differentiable) logical inference to compute the consequences of the predictions made by a set of base neural networks. Examples of this sub-class are DeepProbLog [41] and Abductive Learning [14].
In what follows, we review recent neurosymbolic architectures in the same class as LTN: architectures integrating learning and reasoning.
Systems that modify the predictions of a base neural network: Among the approaches that modify the predictions of the neural network using logical constraints are Deep Logic Models [46] and Knowledge Enhanced Neural Networks [15]. Deep Logic Models (DLM) are a general architecture for learning with constraints. Here, we will consider the special case where constraints are expressed by logical formulas. In this case, a DLM predicts the truth-values of a set of $n$ ground atoms of a domain $\Delta = \{a_1, \dots, a_k\}$. It consists of two models: a neural network $f(x \mid w)$, which takes as input the features $x$ of the elements of $\Delta$ and produces as output an evaluation $f$ of all the ground atoms, i.e. $f \in [0,1]^n$; and a probability distribution $p(y \mid f, \lambda)$, which is modeled by an undirected graphical model of the exponential family, with each logical constraint characterized by a clique that contains the ground atoms, rather similarly to GNNs. The model returns the assignment to the atoms that maximizes the weighted truth-value of the constraints and minimizes the difference between the prediction of the neural network and a target value $y$. Formally:
$$\operatorname{DLM}(x \mid \lambda, w) = \operatorname{argmax}_y \Big( \sum_c \lambda_c \Phi_c(y_c) - \frac{1}{2} \lVert y - f(x \mid w) \rVert^2 \Big)$$
Figure 24: Three classes of neurosymbolic approaches with Architectures Integrating Learning and Reasoning further subdivided into three sub-classes, with LTN belonging to the sub-class highlighted in red.
Each $\Phi_c(y_c)$ corresponds to a ground propositional formula that is evaluated w.r.t. the target truth assignment $y$, and $\lambda_c$ is the weight associated with formula $\Phi_c$. Intuitively, the upper model (the undirected graphical model) should modify the prediction of the lower model (the neural network) minimally in order to satisfy the constraints. $f$ and $y$ are truth-values of all the ground atoms obtained from the constraints appearing in the upper model, in the domain specified by the data input.
Similar to LTN, DLM evaluates constraints using fuzzy semantics. However, it considers only propositional connectives, whereas universal and existential quantifiers are supported in LTN.
Inference in DLM requires maximizing the prediction of the model, which might be prohibitive in the presence of a large number of instances. In LTN, inference involves only a forward pass through the neural component, which is rather simple and can be carried out in parallel. On the other hand, in DLM the weights associated with constraints can be learned, while in LTN they are specified in the background knowledge.
The approach taken in Knowledge Enhanced Neural Networks (KENN) [15] is similar to that of DLM. Starting from the predictions $y = f_{nn}(x \mid w)$ made by a base neural network $f_{nn}(\cdot \mid w)$, KENN adds a knowledge enhancer, which is a function that modifies $y$ based on a set of weighted constraints formulated in terms of clauses. The formal model can be specified as follows:
$$\operatorname{KENN}(x \mid \lambda, w) = \sigma\Big( f'_{nn}(x \mid w) + \sum_c \lambda_c \big( \operatorname{softmax}(\operatorname{sign}(c) \odot f'_{nn}(x \mid w)) \odot \operatorname{sign}(c) \big) \Big)$$
where $f'_{nn}(x \mid w)$ are the pre-activations of $f_{nn}(x \mid w)$; $\operatorname{sign}(c)$ is a vector of the same dimension as $y$ containing $1$, $-1$ and $0$, such that $\operatorname{sign}(c)_i = 1$ (resp. $\operatorname{sign}(c)_i = -1$) if the $i$-th atom occurs positively (resp. negatively) in $c$, and $0$ otherwise; and $\odot$ is the element-wise product. KENN learns the weights $\lambda$ of the clauses in the background knowledge and the base network parameters $w$ by minimizing some standard loss (e.g. cross-entropy) on a set of training data. If the training data is inconsistent with a constraint, the weight of the constraint will be close to zero. This intuitively implies that the latent knowledge present in the data is preferred to the knowledge specified in the constraints. In LTN, instead, training data and logical constraints are represented uniformly as formulas, and we require that they are all satisfied. A second difference between KENN and LTN is the language: while LTN supports constraints written in full first-order logic, constraints in KENN are limited to universally quantified clauses.
Systems that add knowledge to a neural network by adding a term to the loss function: In [33], a framework is proposed that learns simultaneously from labeled data and logical rules. The proposed architecture is made of a student network $f_{nn}$ and a teacher network, denoted by $q$. The student network is trained to do the actual predictions, while the teacher network encodes the information of the logical rules. The transfer of information from the teacher to the student network is done by defining a joint loss $\mathcal{L}$ for both networks as a convex combination of the loss of the student and that of the teacher. If $\tilde{y} = f_{nn}(x \mid w)$ is the prediction of the student network for input $x$, the loss is defined as:
$$(1 - \pi)\, \mathcal{L}(y, \tilde{y}) + \pi\, \mathcal{L}(q(\tilde{y} \mid x), \tilde{y})$$
where $q(\tilde{y} \mid x) \propto \exp\big( -\sum_c \lambda_c (1 - \phi_c(x, \tilde{y})) \big)$ measures how much the predictions $\tilde{y}$ satisfy the constraints encoded in the set of weighted clauses $\{(\lambda_c, \phi_c)\}_{c \in C}$. Training is iterative: at every iteration, the parameters of the student network are optimized to minimize the loss, which takes into account the feedback of the teacher network on the predictions from the previous step. The main difference between this approach and LTN is how the constraints are encoded in the loss. LTN integrates the constraints in the network and optimizes their satisfiability directly, with no need for additional training data. Furthermore, the constraints proposed in [33] are universally quantified formulas only.
The approach adopted by LYRICS [47] is analogous to the first version of LTN [61]. Logical constraints are translated into a loss function that measures the (negative) satisfiability level of the network. Differently from LTN, formulas in LYRICS can be associated with weights, which are hyper-parameters. In [47], a logarithmic loss function is also used when the product t-norm is adopted. Notice that weights can also be added (indirectly) to LTN by introducing a 0-ary predicate $p_w$ to represent a constraint of the form $p_w \rightarrow \phi$. An advantage of this approach would be that the weights could be learned.
In [72], a neural network computes the probability of some events being true. The neural network should satisfy a set of propositional logic constraints on its output. These constraints are compiled into arithmetic circuits for weighted model counting, which are then used to compute a loss function. The loss function then captures how close the neural network is to satisfying the propositional logic constraints.
Systems that apply logical reasoning on the predictions of a base neural network: The most notable architecture in this category is DeepProbLog [41]. DeepProbLog extends the ProbLog framework for probabilistic logic programming to allow the computation of probabilistic evidence from neural networks. A ProbLog program is a logic program in which facts and rules can be associated with probability values, and such values can be learned. Inference in ProbLog to answer a query $q$ is performed by knowledge compilation into a function $p(q \mid \lambda)$ that computes the probability that $q$ is true according to the logic program with relative frequencies $\lambda$. In DeepProbLog, a neural network $f_{nn}$ that outputs a probability distribution $t = (t_1, \dots, t_n)$ over a set of atoms $a = (a_1, \dots, a_n)$ is integrated into ProbLog by extending the logic program with $a$ and the respective probabilities $t$. The probability of a query $q$ is then given by $p(q \mid \lambda, f_{nn}(x \mid w))$, where $x$ is the input of $f_{nn}$ and $p$ is the function corresponding to the logic program extended with $a$. Given a set of queries $q$, input vectors $x$ and ground-truths $y$ for all the queries, training is performed by minimizing a loss function that measures the distance between the probabilities predicted by the logic program and the ground-truths, as follows:
$$\mathcal{L}\big(y, p(q \mid \lambda, f_{nn}(x \mid w))\big)$$
The most important difference between DeepProbLog and LTN concerns the logic on which they are based. DeepProbLog adopts probabilistic logic programming: the output of the base neural network is interpreted as the probability of certain atoms being true. LTN, instead, is based on many-valued logic: the predictions of the base neural network are interpreted as fuzzy truth-values (though previous work [67] also formalizes Real Logic as handling probabilities with relaxed constraints). This difference of logic leads to the second main difference between LTN and DeepProbLog: their inference mechanisms. DeepProbLog performs probabilistic inference (based on model counting), while LTN inference consists of computing the truth-value of a formula starting from the truth-values of its atomic components. The two types of inference are incomparable. However, computing the fuzzy truth-value of a formula is more efficient than model counting, resulting in a more scalable inference task that allows LTN to use full first-order logic with function symbols. In DeepProbLog, to perform probabilistic inference, a closed-world assumption is made and a function-free language is used. Typically, DeepProbLog clauses are compiled into Sentential Decision Diagrams (SDDs) to accelerate inference considerably [36], although the compilation step of clauses into the SDD circuit is still costly.
An approach that extends the predictions of a base neural network using abductive reasoning is [14]. Given a neural network $f_{nn}(x \mid w)$ that produces a crisp output $y \in \{0, 1\}^n$ for $n$ predicates $p_1, \dots, p_n$, and background knowledge in the form of a logic program $p$, the parameters $w$ of $f_{nn}$ are learned alongside a set of additional rules $\Delta_C$ that define a new concept $C$ w.r.t. $p_1, \dots, p_n$, such that, for every object $o$ with features $x_o$:
(62) $p \cup f_{nn}(x_o \mid w) \cup \Delta_C \models C(o)$ if $o$ is an instance of $C$
$p \cup f_{nn}(x_o \mid w) \cup \Delta_C \models \neg C(o)$ if $o$ is not an instance of $C$
The task is solved by iterating the following three steps:
  1. Given the predictions of the neural network $\{f_{nn}(x_o \mid w)\}_{o \in O}$ on the set $O$ of training objects, search for the best $\Delta_C$ that maximizes the number of objects for which (62) holds;
  2. For each object $o$, compute, by abduction on $p \cup \Delta_C$, the explanation $p(o)$;
  3. Retrain $f_{nn}$ with the training set $\{(x_o, p(o))\}_{o \in O}$.
Differently from LTN, in [14] the optimization is done separately, in an iterative way. The semantics of the logic is crisp, neither fuzzy nor probabilistic, and therefore not fully differentiable. Abductive reasoning is adopted, which is a potentially relevant addition for comparison with symbolic ML and Inductive Logic Programming approaches [50].
Various other loosely-coupled approaches have been proposed recently such as [44], where image classification is carried out by a neural network in combination with reasoning from text data for concept learning at a higher level of abstraction than what is normally possible with pixel data alone. The proliferation of such approaches has prompted Henry Kautz to propose a taxonomy for neurosymbolic AI in [34] (also discussed in [17]), including recent work combining neural networks with graphical models and graph neural networks [4, 40, 58], statistical relational learning [21, 55], and even verification of neural multi-agent systems [2, 8].

6. Conclusions and Future Work

In this paper, we have specified the theory and exemplified the reach of Logic Tensor Networks as a model and system for neurosymbolic AI. LTN is capable of combining approximate reasoning and deep learning, knowledge and data.
For ML practitioners, learning in LTN (see Section 3.2) can be understood as optimization under first-order logic constraints relaxed into a loss function. For logic practitioners, learning is similar to inductive inference: given a theory, learning makes generalizations from specific observations obtained from data. Compared to other neurosymbolic architectures (see Section 5), the LTN framework has useful properties for gradient-based optimization (see Section 2.4) and a syntax that supports many traditional ML tasks and their inductive biases (see Section 4), all while remaining computationally efficient (see Table 1).
Section 3.4 discussed reasoning in LTN. Reasoning is normally under-specified within neural networks. Logical reasoning is the task of proving whether some knowledge follows from the facts that are currently known. It is traditionally achieved semantically, using model theory, or syntactically, via a proof system. The current LTN framework approaches reasoning semantically, although it should be possible to use LTN and querying alongside a proof system. When reasoning by refutation in LTN, to find out if a statement $\phi$ is a logical consequence of given data and knowledge base $\mathcal{K}$, a proof by refutation attempts to find a semantic counter-example where $\neg\phi$ and $\mathcal{K}$ are satisfied. If the search fails, then $\phi$ is assumed to hold. This approach is efficient in LTN because it allows for a direct search for counter-examples via gradient-descent optimization. It is assumed that $\phi$, the statement to prove or disprove, is known. Future work could explore automatically inducing which statement $\phi$ to consider, possibly using syntactical reasoning in the process.
The paper formalizes Real Logic, the language supporting LTN. The semantics of Real Logic is close to the semantics of fuzzy FOL, with the following major differences: 1) Real Logic domains are typed and restricted to real numbers and real-valued tensors; 2) Real Logic variables are sequences of fixed length, whereas FOL variables are placeholders for any individual in a domain; 3) Real Logic relations are interpreted as mathematical functions, whereas fuzzy logic relations are interpreted as fuzzy set membership functions. Concerning the semantics of connectives and quantifiers, some LTN implementations correspond to semantics for t-norm fuzzy logic, but not all. For example, the conjunction operator in the stable product semantics is not a t-norm, as pointed out at the end of Section 2.4.
Integrative neurosymbolic approaches are known for either seeking to bring neurons into a symbolic system (neurons into symbols) [41] or to bring symbols into a neural network (symbols into neurons) [60]. LTN adopts the latter approach, but maintains a close link between the symbols and their grounding onto the neural network. The discussion around these two options, neurons into symbols vs. symbols into neurons, is likely to take center stage in the debate around neurosymbolic AI in the next decade. LTN and related approaches are well placed to play an important role in this debate by offering a rich logical language tightly coupled with an efficient distributed implementation in TensorFlow computational graphs.
The close connection between first-order logic and its implementation in LTN makes LTN very suitable as a model for the neural-symbolic cycle [27, 29], which seeks to translate between neural and symbolic representations. Such translations can take place at the level of the structure of a neural network, given a symbolic language [27], or at the level of the loss functions, as done by LTN and related approaches [13, 45, 46]. LTN opens up a number of promising avenues for further research.
Firstly, a continual learning approach might allow one to start with very little knowledge, build up and validate knowledge over time by querying the LTN network. Translations to and from neural and symbolic representations will enable reasoning also to take place at the symbolic level (e.g. alongside a proof system), as proposed recently in [70] with the goal of improving fairness of the network model.
Secondly, LTN should be compared in large-scale practical use cases with other recent efforts to add structure to neural networks such as the neuro-symbolic concept learner [44] and high-level capsules which were used recently to learn the part-of relation [38], similarly to how LTN was used for semantic image interpretation in [19].
Finally, LTN should also be compared with Tensor Product Representations, e.g. [59], which show that state-of-the-art recurrent neural networks may fail at simple question-answering tasks, despite achieving very high accuracy. Efforts in the area of transfer learning, mostly in computer vision, which seek to model systematicity could also be considered a benchmark [5]. Experiments using fewer data and therefore lower energy consumption, out-of-distribution extrapolation, and knowledge-based transfer are all potentially suitable areas of application for LTN as a framework for neurosymbolic AI based on learning from data and compositional knowledge.

Acknowledgement

We would like to thank Benedikt Wagner for his comments and a number of productive discussions on continual learning, knowledge extraction and reasoning in LTNs.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Michael Akintunde, Elena Botoeva, Panagiotis Kouvaros, and Alessio Lomuscio. Verifying strategic abilities of neural multi-agent systems. In Proceedings of 17th International Conference on Principles of Knowledge Representation and Reasoning, KR2020, Rhodes, Greece, September 2020.
[3] Samy Badreddine and Michael Spranger. Injecting Prior Knowledge for Transfer Learning into Reinforcement Learning Algorithms using Logic Tensor Networks. arXiv:1906.06576 [cs, stat], June 2019. arXiv: 1906.06576.
[4] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4509-4517, USA, 2016. Curran Associates Inc.
[5] Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations, 2020.
[6] Federico Bianchi and Pascal Hitzler. On the capabilities of logic tensor networks for deductive reasoning. In Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019) Stanford University, Palo Alto, California, USA, March 25-27, 2019., Stanford University, Palo Alto, California, USA, March 25-27, 2019., 2019.
[7] Federico Bianchi, Matteo Palmonari, Pascal Hitzler, and Luciano Serafini. Complementing logical reasoning with sub-symbolic commonsense. In International Joint Conference on Rules and Reasoning, pages 161-170. Springer, 2019.
[8] Rafael Borges, Artur d'Avila Garcez, and Luís Lamb. Learning and representing temporal knowledge in recurrent networks. IEEE Transactions on Neural Networks, 22(12):2409-2421, December 2011.
[9] Liber Běhounek, Petr Cintula, and Petr Hájek. Introduction to mathematical fuzzy logic. In Petr Cintula, Petr Hájek, and Carles Noguera, editors, Handbook of Mathematical Fuzzy Logic, Volume 1, volume 37 of Studies in Logic, Mathematical Logic and Foundations, pages 1-102. College Publications, 2011.
[10] N. A. Campbell and R. J. Mahon. A multivariate study of variation in two species of rock crab of the genus Leptograpsus. Australian Journal of Zoology, 22(3):417-425, 1974. Publisher: CSIRO PUBLISHING.
[11] Andres Campero, Aldo Pareja, Tim Klinger, Josh Tenenbaum, and Sebastian Riedel. Logical rule induction and theory learning using neural theorem proving. CoRR, abs/1809.02193, 2018.
[12] Benhui Chen, Xuefen Hong, Lihua Duan, and Jinglu Hu. Improving multi-label classification performance by label constraints. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1-5. IEEE, 2013.
[13] William W. Cohen, Fan Yang, and Kathryn Mazaitis. Tensorlog: A probabilistic database implemented using deep-learning infrastructure. J. Artif. Intell. Res., 67:285-325, 2020.
[14] W.-Z. Dai, Q. Xu, Y. Yu, and Z.-H. Zhou. Bridging machine learning and logical reasoning by abductive learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, NeurIPS'19, USA, 2019. Curran Associates Inc.
[15] Alessandro Daniele and Luciano Serafini. Knowledge enhanced neural networks. In Pacific Rim International Conference on Artificial Intelligence, pages 542-554. Springer, 2019.
[16] Artur d'Avila Garcez, Marco Gori, Luís C. Lamb, Luciano Serafini, Michael Spranger, and Son N. Tran. Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. FLAP, 6(4):611-632, 2019.
[17] Artur d'Avila Garcez and Luis C. Lamb. Neurosymbolic AI: The 3rd wave, 2020.
[18] Ivan Donadello and Luciano Serafini. Compensating supervision incompleteness with prior knowledge in semantic image interpretation. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1-8. IEEE, 2019.
[19] Ivan Donadello, Luciano Serafini, and Artur d'Avila Garcez. Logic tensor networks for semantic image interpretation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1596-1602, 2017.
[20] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[21] Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. J. Artif. Intell. Res., 61:1-64, 2018.
[22] Ronald Fagin, Ryan Riegel, and Alexander Gray. Foundations of reasoning with uncertainty via real-valued logics, 2020.
[23] Marc Fischer, Mislav Balunovic, Dana Drachsler-Cohen, Timon Gehr, Ce Zhang, and Martin Vechev. DL2: Training and querying neural networks with logic. In International Conference on Machine Learning, pages 1931-1941, 2019.
[24] Manoel Franca, Gerson Zaverucha, and Artur d'Avila Garcez. Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning, 94:81-104, January 2014.
[25] Dov M. Gabbay and John Woods, editors. The Many Valued and Nonmonotonic Turn in Logic, volume 8 of Handbook of the History of Logic. Elsevier, 2007.
[26] Artur d'Avila Garcez, Dov M. Gabbay, and Krysia B. Broda. Neural-Symbolic Learning System: Foundations and Applications. Springer-Verlag, Berlin, Heidelberg, 2002.
[27] Artur d'Avila Garcez, Lus C. Lamb, and Dov M. Gabbay. Neural-Symbolic Cognitive Reasoning. Springer Publishing Company, Incorporated, 1 edition, 2008.
[28] Petr Hajek. Metamathematics of Fuzzy Logic. Kluwer Academic Publishers, 1998.
[29] Barbara Hammer and Pascal Hitzler, editors. Perspectives of Neural-Symbolic Integration, volume 77 of Studies in Computational Intelligence. Springer, 2007.
[30] Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335- 346, 1990.
[31] Patrick Hohenecker and Thomas Lukasiewicz. Ontology reasoning with deep neural networks. Journal of Artificial Intelligence Research, 68:503-540, 2020.
[32] Steffen Hölldobler and Franz J. Kurfess. CHCL - A connectionist inference system. In Bertram Fronhöfer and Graham Wrightson, editors, Parallelization in Inference Systems, International Workshop, Dagstuhl Castle, Germany, December 17-18, 1990, Proceedings, volume 590 of Lecture Notes in Computer Science, pages 318-342. Springer, 1990.
[33] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2410-2420, Berlin, Germany, August 2016. Association for Computational Linguistics.
[34] Henry Kautz. The Third AI Summer, AAAI Robert S. Engelmore Memorial Lecture, Thirty-fourth AAAI Conference on Artificial Intelligence, New York, NY, February 10, 2020.
[35] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], January 2017. arXiv: 1412.6980.
[36] Doga Kisa, Guy Van den Broeck, Arthur Choi, and Adnan Darwiche. Probabilistic sentential decision diagrams. In Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning (KR), July 2014.
[37] Erich Peter Klement, Radko Mesiar, and Endre Pap. Triangular Norms, volume 8 of Trends in Logic. Springer Netherlands, Dordrecht, 2000.
[38] Adam Kosiorek, Sara Sabour, Yee Whye Teh, and Geoffrey E Hinton. Stacked capsule autoencoders. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 15512-15522. Curran Associates, Inc., 2019.
[39] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, page 1097-1105, Red Hook, NY, USA, 2012. Curran Associates Inc.
[40] Luís C. Lamb, Artur d'Avila Garcez, Marco Gori, Marcelo O. R. Prates, Pedro H. C. Avelar, and Moshe Y. Vardi. Graph neural networks meet neural-symbolic computing: A survey and perspective. In Christian Bessiere, editor, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020 [scheduled for July 2020, Yokohama, Japan, postponed due to the Corona pandemic], pages 4877-4884. ijcai.org, 2020.
[41] Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. Deepproblog: Neural probabilistic logic programming. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NeurIPS'18, pages 3753-3763, USA, 2018. Curran Associates Inc.
[42] Francesco Manigrasso, Filomeno Davide Miro, Lia Morra, and Fabrizio Lamberti. Faster-LTN: a neuro-symbolic, end-to-end object detection architecture. arXiv:2107.01877 [cs], July 2021.
[43] Vasco Manquinho, Joao Marques-Silva, and Jordi Planes. Algorithms for weighted boolean optimization. In International conference on theory and applications of satisfiability testing, pages 495-508. Springer, 2009.
[44] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. CoRR, abs/1904.12584, 2019.
[45] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori. Constraint-based visual generation. In Igor V. Tetko, Vera Kurková, Pavel Karpov, and Fabian J. Theis, editors, Artificial Neural Networks and Machine Learning - ICANN 2019: Image Processing - 28th International Conference on Artificial Neural Networks, Munich, Germany, September 17-19, 2019, Proceedings, Part III, volume 11729 of Lecture Notes in Computer Science, pages 565-577. Springer, 2019.
[46] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori. Integrating learning and reasoning with deep logic models. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2019, Würzburg, Germany, September 16-20, 2019, Proceedings, Part II, volume 11907 of Lecture Notes in Computer Science, pages 517-532. Springer, 2019.
[47] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori. Lyrics: A general interface layer to integrate logic inference and deep learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 283-298. Springer, 2019.
[48] Giuseppe Marra and Ondřej Kuželka. Neural markov logic networks. arXiv preprint arXiv:1905.13462, 2019.
[49] Pasquale Minervini, Sebastian Riedel, Pontus Stenetorp, Edward Grefenstette, and Tim Rocktäschel. Learning reasoning strategies in end-to-end differentiable proving, 2020.
[50] Stephen H. Muggleton, Dianhuan Lin, Niels Pahlavi, and Alireza Tamaddoni-Nezhad. Meta-interpretive learning: Application to grammatical inference. Mach. Learn., 94(1):25-49, January 2014.
[51] Karl Pearson. Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559-572, 1901.
[52] Gadi Pinkas. Reasoning, nonmonotonicity and learning in connectionist networks that capture propositional knowledge. Artif. Intell., 77(2):203-247, 1995.
[53] Meng Qu and Jian Tang. Probabilistic logic neural networks for reasoning. In Advances in Neural Information Processing Systems, pages 7712-7722, 2019.
[54] Luc De Raedt, Sebastijan Dumančić, Robin Manhaeve, and Giuseppe Marra. From statistical relational to neuro-symbolic artificial intelligence, 2020.
[55] Matthew Richardson and Pedro Domingos. Markov logic networks. Mach. Learn., 62(1-2):107- 136, February 2006.
[56] Ryan Riegel, Alexander Gray, Francois Luus, Naweed Khan, Ndivhuwo Makondo, Ismail Yunus Akhalwaya, Haifeng Qian, Ronald Fagin, Francisco Barahona, Udit Sharma, Shajith Ikbal, Hima Karanam, Sumit Neelam, Ankita Likhyani, and Santosh Srivastava. Logical Neural Networks. arXiv:2006.13155 [cs], June 2020. arXiv: 2006.13155.
[57] Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving. In Advances in Neural Information Processing Systems, pages 3788-3800, 2017.
[58] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. Trans. Neur. Netw., 20(1):61-80, January 2009.
[59] Imanol Schlag and Jürgen Schmidhuber. Learning to reason with third-order tensor products. CoRR, abs/1811.12143, 2018.
[60] Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and Jianfeng Gao. Enhancing the transformer with explicit relational encoding for math problem solving. CoRR, abs/1910.06611, 2019.
[61] Luciano Serafini and Artur d'Avila Garcez. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422, 2016.
[62] Luciano Serafini and Artur d’Avila Garcez. Learning and reasoning with logic tensor networks. In Conference of the Italian Association for Artificial Intelligence, pages 334–348. Springer, 2016.
[63] Lokendra Shastri. Advances in SHRUTI-A neurally motivated model of relational knowledge representation and rapid inference using temporal synchrony. Appl. Intell., 11(1):79-108, 1999.
[64] Yun Shi. A deep study of fuzzy implications. PhD thesis, Ghent University, 2009.
[65] Paul Smolensky and Géraldine Legendre. The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar, Volume I: Cognitive Architecture. Bradford Books. The MIT Press, 2006.
[66] Gustav Sourek, Vojtech Aschenbrenner, Filip Zelezny, Steven Schockaert, and Ondrej Kuzelka. Lifted relational neural networks: Efficient learning of latent relational structures. Journal of Artificial Intelligence Research, 62:69-100, 2018.
[67] Emile van Krieken, Erman Acar, and Frank van Harmelen. Semi-Supervised Learning using Differentiable Reasoning. arXiv:1908.04700 [cs], August 2019. arXiv: 1908.04700.
[68] Emile van Krieken, Erman Acar, and Frank van Harmelen. Analyzing Differentiable Fuzzy Implications. In Proceedings of the 17th International Conference on Principles of Knowledge Representation and Reasoning, pages 893-903, September 2020.
[69] Emile van Krieken, Erman Acar, and Frank van Harmelen. Analyzing Differentiable Fuzzy Logic Operators. arXiv:2002.06100 [cs], February 2020. arXiv: 2002.06100.
[70] Benedikt Wagner and Artur d'Avila Garcez. Neural-Symbolic Integration for Fairness in AI. Proceedings of the AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering 2021, page 14, 2021.
[71] Po-Wei Wang, Priya L Donti, Bryan Wilder, and Zico Kolter. Satnet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. arXiv preprint arXiv:1905.12149, 2019.
[72] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Broeck. A Semantic Loss Function for Deep Learning with Symbolic Knowledge. In International Conference on Machine Learning, pages 5502-5511. PMLR, July 2018. ISSN: 2640-3498.

Appendix A. Implementation Details

The LTN library is implemented in TensorFlow 2 [1] and is available from GitHub29. Every logical operator is grounded using TensorFlow primitives, so the LTN code directly implements a TensorFlow computational graph. Owing to TensorFlow's built-in optimizations, LTN is relatively efficient while providing the expressive power of FOL.
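To make the grounding of connectives concrete, the following minimal sketch (illustrative only; the function names are ours and not the library's actual API) shows how product-based operators can be written directly as TensorFlow functions, so that the truth value of a formula is an ordinary differentiable tensor:

import tensorflow as tf

# Illustrative grounding of fuzzy connectives as TensorFlow primitives.
def And_prod(x, y):
    return x * y                 # product t-norm T_P
def Or_prob_sum(x, y):
    return x + y - x * y         # probabilistic sum S_P
def Not_std(x):
    return 1.0 - x               # standard negation N_S

a, b = tf.constant(0.7), tf.constant(0.2)
print(And_prod(a, Not_std(b)).numpy())  # 0.56: truth value of a AND NOT b

Because the connectives are plain tensor operations, gradients flow from the truth value of a formula back to the parameters of any network that produced a or b.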
Table A.3 gives an overview of the network architectures used to obtain the results of the examples in Section 4. The LTN repository includes the code for these examples. Unless explicitly mentioned otherwise, the reported results are averaged over 10 runs using a 95% confidence interval. Every example uses the stable real product configuration to approximate the Real Logic operators, and the Adam optimizer [35] with a learning rate of 0.001 to train the parameters.
Task | Network | Architecture
4.1 | MLP | Dense(16)*, Dense(16)*, Dense(1)
4.2 | MLP | Dense(16)*, Dropout(0.2), Dense(16)*, Dropout(0.2), Dense(8)*, Dropout(0.2), Dense(1)
4.3 | MLP | Dense(16), Dense(16), Dense(8), Dense(1)
4.4 | CNN | MNISTConv, Dense(84)*, Dense(10)
4.4 | baseline - SD | MNISTConv×2, Dense(84), Dense(19), Softmax
4.4 | baseline - MD | MNISTConv×4, Dense(128), Dense(199), Softmax
4.5 | MLP | Dense(8)*, Dense(8)*, Dense(1)
4.6 | MLP | Dense(16)*, Dense(16)*, Dense(16)*, Dense(1)
4.7 | MLP_S | Dense(8)*, Dense(8)*, Dense(1)
4.7 | MLP_F | Dense(8)*, Dense(8)*, Dense(1)
4.7 | MLP_C | Dense(8)*, Dense(8)*, Dense(1)
* : layer ends with an elu activation
Dense(n) : regular fully-connected layer of n units
Dropout(r) : dropout layer with rate r
Conv(f,k) : 2D convolution layer with f filters and a kernel of size k
MP(w,h) : max pooling operation with a w×h pooling window
MNISTConv : Conv(6,5), MP(2,2), Conv(16,5), MP(2,2), Dense(100)
Table A.3: Overview of the neural network architectures used in each example. Notice that in the examples, the networks are usually used with some additional layer(s) to ground symbols. For instance, in experiment 4.2, in $\mathcal{G}(P) : x, l \mapsto l \cdot \mathrm{softmax}(\mathrm{MLP}(x))$, the softmax layer normalizes the raw predictions of MLP into probabilities in $[0,1]$, and the multiplication with the one-hot label $l$ selects the probability for one given class.
29 https://github.com/logictensornetworks/logictensornetworks

Appendix B. Fuzzy Operators and Properties

This appendix presents the most common operators used in the fuzzy logic literature and some of their noteworthy properties [28, 37, 64, 69].
Appendix B.1. Negation
Definition 7. A negation is a function $N : [0,1] \to [0,1]$ that at least satisfies:
N1. Boundary conditions: $N(0) = 1$ and $N(1) = 0$,
N2. Monotonically decreasing: $\forall (x,y) \in [0,1]^2, x \le y \Rightarrow N(x) \ge N(y)$.
Moreover, a negation is said to be strict if $N$ is continuous and strictly decreasing. A negation is said to be strong if $\forall x \in [0,1], N(N(x)) = x$.
We commonly use the standard negation $N_S(a) = 1 - a$, which is both strict and strong.
Appendix B.2. Conjunction
Definition 8. A conjunction is a function $C : [0,1]^2 \to [0,1]$ that at least satisfies:
C1. Boundary conditions: $C(0,0) = C(0,1) = C(1,0) = 0$ and $C(1,1) = 1$,
C2. Monotonically increasing: $\forall (x,y,z) \in [0,1]^3$, if $x \le y$, then $C(x,z) \le C(y,z)$ and $C(z,x) \le C(z,y)$.
In fuzzy logic, t-norms are widely used to model conjunction operators.
Definition 9. A t-norm (triangular norm) is a function $T : [0,1]^2 \to [0,1]$ that at least satisfies:
T1. Boundary conditions: $T(x,1) = x$,
T2. monotonically increasing,
T3. commutative,
T4. associative.
Example 4. Three commonly used t-norms are:
$T_M(x,y) = \min(x,y)$ (minimum)
$T_P(x,y) = xy$ (product)
$T_L(x,y) = \max(x+y-1, 0)$ (Łukasiewicz)
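The three t-norms of Example 4 translate directly into elementwise TensorFlow operations. The following is an illustrative sketch, with names of our own choosing:

import tensorflow as tf

def T_minimum(x, y):      # T_M(x, y) = min(x, y)
    return tf.minimum(x, y)

def T_product(x, y):      # T_P(x, y) = x * y
    return x * y

def T_lukasiewicz(x, y):  # T_L(x, y) = max(x + y - 1, 0)
    return tf.maximum(x + y - 1.0, 0.0)

x, y = tf.constant(0.8), tf.constant(0.6)
print(T_minimum(x, y).numpy(),      # 0.6
      T_product(x, y).numpy(),      # 0.48
      T_lukasiewicz(x, y).numpy())  # 0.4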
Appendix B.3. Disjunction
Name | $a \otimes b$ | $a \oplus b$ | $a \Rightarrow_R c$ | $a \Rightarrow_S c$
Gödel | $\min(a,b)$ | $\max(a,b)$ | $1$ if $a \le c$, $c$ otherwise | $\max(1-a, c)$
Goguen/Product | $ab$ | $a+b-ab$ | $1$ if $a \le c$, $c/a$ otherwise | $1-a+ac$
Łukasiewicz | $\max(a+b-1, 0)$ | $\min(a+b, 1)$ | $\min(1-a+c, 1)$ | $\min(1-a+c, 1)$
Table B.4: Common symmetric configurations.
Definition 10. A disjunction is a function $D : [0,1]^2 \to [0,1]$ that at least satisfies:
D1. Boundary conditions: $D(0,0) = 0$ and $D(0,1) = D(1,0) = D(1,1) = 1$,
D2. Monotonically increasing: $\forall (x,y,z) \in [0,1]^3$, if $x \le y$, then $D(x,z) \le D(y,z)$ and $D(z,x) \le D(z,y)$.
Disjunctions in fuzzy logic are often modeled with t-conorms.
Definition 11. A t-conorm (triangular conorm) is a function $S : [0,1]^2 \to [0,1]$ that at least satisfies:
S1. Boundary conditions: $S(x,0) = x$,
S2. monotonically increasing,
S3. commutative,
S4. associative.
Example 5. Three commonly used t-conorms are:
$S_M(x,y) = \max(x,y)$ (maximum)
$S_P(x,y) = x + y - xy$ (probabilistic sum)
$S_L(x,y) = \min(x+y, 1)$ (Łukasiewicz)
Note that the only pair of t-norm and t-conorm that is mutually distributive is $T_M$ and $S_M$, that is, the t-norm distributes over the t-conorm and vice versa.
Definition 12. The $N$-dual t-conorm $S$ of a t-norm $T$ w.r.t. a strict fuzzy negation $N$ is defined as:
(B.1) $\forall (x,y) \in [0,1]^2, \quad S(x,y) = N(T(N(x), N(y)))$.
If $N$ is a strong negation, we also get:
(B.2) $\forall (x,y) \in [0,1]^2, \quad T(x,y) = N(S(N(x), N(y)))$.
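For instance, the $N_S$-dual of the product t-norm recovers the probabilistic sum $S_P$, which the following small sketch (with helper names of our own choosing) verifies numerically:

def n_dual_conorm(T, N):
    # Definition 12: S(x, y) = N(T(N(x), N(y)))
    return lambda x, y: N(T(N(x), N(y)))

T_P = lambda x, y: x * y
N_S = lambda x: 1.0 - x

S = n_dual_conorm(T_P, N_S)
print(S(0.3, 0.5))  # 0.65 = 0.3 + 0.5 - 0.3*0.5, i.e. S_P(0.3, 0.5)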
Appendix B.4. Implication
Definition 13. An implication is a function $I : [0,1]^2 \to [0,1]$ that at least satisfies:
I1. Boundary conditions: $I(0,0) = I(0,1) = I(1,1) = 1$ and $I(1,0) = 0$.
Definition 14. There are two main classes of implications generated from the fuzzy logic operators for negation, conjunction and disjunction.
S-Implications: Strong implications are defined using $x \Rightarrow y = \neg x \lor y$ (material implication).
R-Implications: Residuated implications are defined using $x \Rightarrow y = \sup\{z \in [0,1] \mid x \otimes z \le y\}$. One way of understanding this approach is as a generalization of modus ponens: the consequent is at least as true as the (fuzzy) conjunction of the antecedent and the implication.
Example 6. Popular fuzzy implications and their classes are presented in Table B.5.
Name | $I(x,y) =$ | S-Implication | R-Implication
Kleene-Dienes $I_{KD}$ | $\max(1-x, y)$ | $S = S_M$, $N = N_S$ | -
Gödel $I_G$ | $1$ if $x \le y$, $y$ otherwise | - | $T = T_M$
Reichenbach $I_R$ | $1 - x + xy$ | $S = S_P$, $N = N_S$ | -
Goguen $I_P$ | $1$ if $x \le y$, $y/x$ otherwise | - | $T = T_P$
Łukasiewicz $I_{Luk}$ | $\min(1-x+y, 1)$ | $S = S_L$, $N = N_S$ | $T = T_L$
Table B.5: Popular fuzzy implications and their classes. Strong implications (S-implications) are defined using a fuzzy negation and a fuzzy disjunction. Residuated implications (R-implications) are defined using a fuzzy conjunction.
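As an illustration (names are ours), the closed forms in Table B.5 can be implemented with elementwise TensorFlow operations. Note that the $y/x$ branch of the Goguen implication already hints at the exploding-gradient edge case discussed in Appendix C:

import tensorflow as tf

def I_kleene_dienes(x, y):  # S-implication: max(1 - x, y)
    return tf.maximum(1.0 - x, y)

def I_goguen(x, y):         # R-implication: 1 if x <= y, else y / x
    return tf.where(x <= y, tf.ones_like(x), y / x)

x, y = tf.constant(0.9), tf.constant(0.3)
print(I_kleene_dienes(x, y).numpy())  # 0.3
print(I_goguen(x, y).numpy())         # 0.333...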

Appendix B.5. Aggregation

Definition 15. An aggregation operator is a function $A : \bigcup_{n \in \mathbb{N}} [0,1]^n \to [0,1]$ that at least satisfies:
A1. $A(x_1, \dots, x_n) \le A(y_1, \dots, y_n)$ whenever $x_i \le y_i$ for all $i \in \{1, \dots, n\}$,
A2. $A(x) = x$ for all $x \in [0,1]$,
A3. $A(0, \dots, 0) = 0$ and $A(1, \dots, 1) = 1$.
Example 7. Candidates for universal quantification can be obtained from t-norms with $A_T(x_i) = x_i$ and $A_T(x_1, \dots, x_n) = T(x_1, A_T(x_2, \dots, x_n))$:
$A_{T_M}(x_1, \dots, x_n) = \min(x_1, \dots, x_n)$ (minimum)
$A_{T_P}(x_1, \dots, x_n) = \prod_{i=1}^{n} x_i$ (product)
$A_{T_L}(x_1, \dots, x_n) = \max(\sum_{i=1}^{n} x_i - n + 1, 0)$ (Łukasiewicz)
Similarly, candidates for existential quantification can be obtained from t-conorms with $A_S(x_i) = x_i$ and $A_S(x_1, \dots, x_n) = S(x_1, A_S(x_2, \dots, x_n))$:
$A_{S_M}(x_1, \dots, x_n) = \max(x_1, \dots, x_n)$ (maximum)
$A_{S_P}(x_1, \dots, x_n) = 1 - \prod_{i=1}^{n} (1 - x_i)$ (probabilistic sum)
$A_{S_L}(x_1, \dots, x_n) = \min(\sum_{i=1}^{n} x_i, 1)$ (Łukasiewicz)
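A sketch of the aggregators of Example 7 over a tensor of truth values, using TensorFlow reductions (the function names are illustrative only):

import tensorflow as tf

def Forall_min(xs):          # A_{T_M}: min over all truth values
    return tf.reduce_min(xs)

def Forall_prod(xs):         # A_{T_P}: product over all truth values
    return tf.reduce_prod(xs)

def Forall_lukasiewicz(xs):  # A_{T_L}: max(sum - n + 1, 0)
    n = tf.cast(tf.size(xs), xs.dtype)
    return tf.maximum(tf.reduce_sum(xs) - n + 1.0, 0.0)

def Exists_prob_sum(xs):     # A_{S_P}: 1 - prod(1 - x_i)
    return 1.0 - tf.reduce_prod(1.0 - xs)

xs = tf.constant([0.9, 0.8, 0.95])
print(Forall_min(xs).numpy(),          # 0.8
      Forall_prod(xs).numpy(),         # 0.684
      Forall_lukasiewicz(xs).numpy(),  # 0.65
      Exists_prob_sum(xs).numpy())     # 0.999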
Property | $(T_M, S_M, N_S)$ |  | $(T_P, S_P, N_S)$ |  | $(T_L, S_L, N_S)$
 | $I_{KD}$ | $I_G$ | $I_R$ | $I_P$ | $I_{Luk}$
Commutativity of $\otimes$, $\oplus$ | ✓ | ✓ | ✓ | ✓ | ✓
Associativity of $\otimes$, $\oplus$ | ✓ | ✓ | ✓ | ✓ | ✓
Distributivity of $\otimes$ over $\oplus$ | ✓ | ✓ | ✗ | ✗ | ✗
Distributivity of $\oplus$ over $\otimes$ | ✓ | ✓ | ✗ | ✗ | ✗
Distrib. of $\Rightarrow$ over $\otimes$, $\oplus$ | ✓ | ✓ | ✗ | ✗ | ✗
Double negation $\neg\neg p = p$ | ✓ | ✓ | ✓ | ✓ | ✓
Law of excluded middle | ✗ | ✗ | ✗ | ✗ | ✓
Law of non-contradiction | ✗ | ✗ | ✗ | ✗ | ✓
De Morgan's laws | ✓ | ✓ | ✓ | ✓ | ✓
Material implication | ✓ | ✗ | ✓ | ✗ | ✓
Contraposition | ✓ | ✗ | ✓ | ✗ | ✓
Table B.6: Common properties for different configurations.
Following are other common aggregators:
$A_M(x_1, \dots, x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$ (mean)
$A_{pM}(x_1, \dots, x_n) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^p\right)^{\frac{1}{p}}$ (p-mean)
$A_{pME}(x_1, \dots, x_n) = 1 - \left(\frac{1}{n}\sum_{i=1}^{n} (1 - x_i)^p\right)^{\frac{1}{p}}$ (p-mean error)
where $A_{pM}$ is the generalized mean, and $A_{pME}$ can be understood as the generalized mean measured w.r.t. the errors. That is, $A_{pME}$ measures the power of the deviation of each value from the ground truth 1. A few particular values of $p$ yield special cases of aggregators. Notably:
$\lim_{p \to +\infty} A_{pM}(x_1, \dots, x_n) = \max(x_1, \dots, x_n)$,
$\lim_{p \to -\infty} A_{pM}(x_1, \dots, x_n) = \min(x_1, \dots, x_n)$,
$\lim_{p \to +\infty} A_{pME}(x_1, \dots, x_n) = \min(x_1, \dots, x_n)$,
$\lim_{p \to -\infty} A_{pME}(x_1, \dots, x_n) = \max(x_1, \dots, x_n)$.
These "smooth" min (resp. max) approximators are good candidates for $\forall$ (resp. $\exists$) in a fuzzy context. The value of $p$ leaves more or less room for outliers, depending on the use case and its needs. Note that $A_{pME}$ and $A_{pM}$ are related in the same way that $\forall$ and $\exists$ are related via the definition $\forall \equiv \neg\exists\neg$, where $\neg$ is approximated by the standard negation.
We propose to use $A_{pME}$ with $p \ge 1$ to approximate $\forall$ and $A_{pM}$ with $p \ge 1$ to approximate $\exists$. When $p \ge 1$, these operators resemble the $L^p$ norm of a vector $u = (u_1, u_2, \dots, u_n)$, where $\|u\|_p = (|u_1|^p + |u_2|^p + \dots + |u_n|^p)^{1/p}$. In this case, many properties of the $L^p$ norm apply to $A_{pM}$ (positive homogeneity, triangle inequality, ...).
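A minimal sketch of both aggregators (with illustrative names), showing how increasing $p$ moves $A_{pM}$ toward the maximum and $A_{pME}$ toward the minimum:

import tensorflow as tf

def A_pM(xs, p):   # generalized mean
    return tf.reduce_mean(xs ** p) ** (1.0 / p)

def A_pME(xs, p):  # generalized mean of the deviations from 1
    return 1.0 - tf.reduce_mean((1.0 - xs) ** p) ** (1.0 / p)

xs = tf.constant([0.1, 0.9, 0.95])
for p in (1.0, 2.0, 10.0):
    print(p, A_pM(xs, p).numpy(), A_pME(xs, p).numpy())
# As p grows, A_pM tends toward max(xs) = 0.95 (existential reading)
# and A_pME tends toward min(xs) = 0.1 (universal reading).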

Appendix C. Analyzing Gradients of Generalized Mean Aggregators

[69] show that some operators used in fuzzy logics are unsuitable for use in a differentiable learning setting. Three types of gradient problems commonly arise with fuzzy logic operators.
Single-Passing The derivatives of some operators are non-null for only one argument. The gradients propagate to only one input at a time.
Vanishing Gradients The gradients vanish on some part of the domain. The learning does not update inputs that are in the vanishing domain.
Exploding Gradients Large error gradients accumulate and result in unstable updates.
Tables C.7 and C.8 summarize their conclusions for the most common operators. We also underline exploding gradient issues that arise experimentally in $A_{pM}$ and $A_{pME}$, which are not covered in the original report. Given the truth values of $n$ propositions $(x_1, \dots, x_n) \in [0,1]^n$:
1. $A_{pM}(x_1, \dots, x_n) = \left(\frac{1}{n}\sum_i x_i^p\right)^{\frac{1}{p}}$
The partial derivatives are $\frac{\partial A_{pM}(x_1, \dots, x_n)}{\partial x_i} = \frac{1}{n^{\frac{1}{p}}}\left(\sum_{j=1}^{n} x_j^p\right)^{\frac{1}{p}-1} x_i^{p-1}$.
When $p > 1$, the operator weights inputs with a higher truth value more, i.e. their partial derivative is higher, and is therefore suited for existential quantification. When $p < 1$, the operator weights inputs with a lower truth value more and is suited for universal quantification.
Exploding Gradients: when $p > 1$, if $\sum_{j=1}^{n} x_j^p \to 0$, then $\left(\sum_{j=1}^{n} x_j^p\right)^{\frac{1}{p}-1} \to \infty$ and the gradients explode. When $p < 1$, if $x_i \to 0$, then $x_i^{p-1} \to \infty$.
2. $A_{pME}(x_1, \dots, x_n) = 1 - \left(\frac{1}{n}\sum_i (1-x_i)^p\right)^{\frac{1}{p}}$
The partial derivatives are $\frac{\partial A_{pME}(x_1, \dots, x_n)}{\partial x_i} = \frac{1}{n^{\frac{1}{p}}}\left(\sum_{j=1}^{n} (1-x_j)^p\right)^{\frac{1}{p}-1} (1-x_i)^{p-1}$. When $p > 1$, the operator weights inputs with a lower truth value more, i.e. their partial derivative is higher, and is therefore suited for universal quantification. When $p < 1$, the operator weights inputs with a higher truth value more and is suited for existential quantification.
Exploding Gradients: when $p > 1$, if $\sum_{j=1}^{n} (1-x_j)^p \to 0$, then $\left(\sum_{j=1}^{n} (1-x_j)^p\right)^{\frac{1}{p}-1} \to \infty$ and the gradients explode. When $p < 1$, if $1 - x_i \to 0$, then $(1-x_i)^{p-1} \to \infty$.
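The $p < 1$ case is easy to reproduce numerically. In the following sketch (our own naming), the partial derivative of $A_{pM}$ with respect to an input grows without bound as that input approaches 0:

import tensorflow as tf

def A_pM(xs, p):
    return tf.reduce_mean(xs ** p) ** (1.0 / p)

for tiny in (1e-2, 1e-4, 1e-6):
    xs = tf.constant([tiny, 0.5, 0.5], dtype=tf.float64)
    with tf.GradientTape() as tape:
        tape.watch(xs)
        out = A_pM(xs, p=0.5)
    # The gradient w.r.t. the first input scales like x^(p-1) = x^(-0.5).
    print(tiny, tape.gradient(out, xs).numpy()[0])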
Connective | Single-Passing | Vanishing | Exploding
Gödel (minimum) | | |
$T_M$, $S_M$ | x | |
$I_{KD}$ | x | |
$I_G$ | x | x |
Goguen (product) | | |
$T_P$, $S_P$ | | (x) |
$I_R$ | | (x) |
$I_P$ | | x | (x)
Łukasiewicz | | |
$T_L$, $S_L$ | | x |
$I_{Luk}$ | | x |
Table C.7: Gradient problems for some binary connectives. (x) means that the problem only appears on an edge case.

Aggregator | Single-Passing | Vanishing | Exploding
$A_{T_M}$ / $A_{S_M}$ | x | |
$A_{T_P}$ / $A_{S_P}$ | | x |
$A_{T_L}$ / $A_{S_L}$ | | x |
$A_{pM}$ | | | (x)
$A_{pME}$ | | | (x)
Table C.8: Gradient problems for some aggregators. (x) means that the problem only appears on an edge case.

We propose the following stable product configuration, which does not have any of the aforementioned gradient problems:
(C.1) $\pi_0(x) = (1 - \epsilon)x + \epsilon$
(C.2) $\pi_1(x) = (1 - \epsilon)x$
(C.3) $N_S(x) = 1 - x$
(C.4) $T_P(x,y) = \pi_0(x)\,\pi_0(y)$
(C.5) $S_P(x,y) = \pi_1(x) + \pi_1(y) - \pi_1(x)\,\pi_1(y)$
(C.6) $I_R(x,y) = 1 - \pi_0(x) + \pi_0(x)\,\pi_1(y)$
(C.7) $A_{pM}(x_1, \dots, x_n) = \left(\frac{1}{n}\sum_{i=1}^{n} \pi_0(x_i)^p\right)^{\frac{1}{p}}, \quad p \ge 1$
(C.8) $A_{pME}(x_1, \dots, x_n) = 1 - \left(\frac{1}{n}\sum_{i=1}^{n} (1 - \pi_1(x_i))^p\right)^{\frac{1}{p}}, \quad p \ge 1$
$N_S$ is the operator for negation, $T_P$ for conjunction, $S_P$ for disjunction, $I_R$ for implication, $A_{pM}$ for existential aggregation, and $A_{pME}$ for universal aggregation.
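The configuration translates directly into code. The following sketch (our own naming, with an illustrative choice of $\epsilon$) applies the projections $\pi_0$ and $\pi_1$ before the product operators, so that gradients stay finite even at truth values of exactly 0 or 1:

import tensorflow as tf

EPS = 1e-4  # illustrative value of epsilon

def pi_0(x):                # (C.1): maps [0,1] into [eps, 1]
    return (1.0 - EPS) * x + EPS

def pi_1(x):                # (C.2): maps [0,1] into [0, 1-eps]
    return (1.0 - EPS) * x

def N_S(x):                 # (C.3)
    return 1.0 - x

def T_P(x, y):              # (C.4)
    return pi_0(x) * pi_0(y)

def S_P(x, y):              # (C.5)
    return pi_1(x) + pi_1(y) - pi_1(x) * pi_1(y)

def I_R(x, y):              # (C.6)
    return 1.0 - pi_0(x) + pi_0(x) * pi_1(y)

def A_pME(xs, p=2.0):       # (C.8): stable universal aggregator
    return 1.0 - tf.reduce_mean((1.0 - pi_1(xs)) ** p) ** (1.0 / p)

xs = tf.constant([0.0, 1.0, 0.5])  # endpoints included on purpose
with tf.GradientTape() as tape:
    tape.watch(xs)
    out = A_pME(xs)
print(tape.gradient(out, xs).numpy())  # finite gradients everywhere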