Logic Tensor Networks

Samy Badreddine a,b,*, Artur d'Avila Garcez c, Luciano Serafini d, Michael Spranger a,b
a Sony Computer Science Laboratories Inc., 3-14-13 Higashigotanda, 141-0022, Tokyo, Japan
b Sony AI Inc., 1-7-1 Konan, 108-0075, Tokyo, Japan
c City, University of London, Northampton Square, EC1V 0HB, London, United Kingdom
d Fondazione Bruno Kessler, Via Sommarive 18, 38123, Trento, Italy

Abstract

Attempts at combining logic and neural networks into neurosymbolic approaches have been on the increase in recent years. In a neurosymbolic system, symbolic knowledge assists deep learning, which typically uses a sub-symbolic distributed representation, to learn and reason at a higher level of abstraction. We present Logic Tensor Networks (LTN), a neurosymbolic framework that supports querying, learning and reasoning with both rich data and abstract knowledge about the world. LTN introduces a fully differentiable logical language, called Real Logic, whereby the elements of a first-order logic signature are grounded onto data using neural computational graphs and first-order fuzzy logic semantics. We show that LTN provides a uniform language to represent and compute efficiently many of the most important AI tasks such as multi-label classification, relational learning, data clustering, semi-supervised learning, regression, embedding learning and query answering. We implement and illustrate each of the above tasks with several simple explanatory examples using TensorFlow 2. The results indicate that LTN can be a general and powerful framework for neurosymbolic AI.
Keywords: Neurosymbolic AI, Deep Learning and Reasoning, Many-valued Logics.

1. Introduction

Artificial Intelligence (AI) agents are required to learn from their surroundings and to reason about what has been learned in order to make decisions, act in the world, or react to various stimuli. The latest Machine Learning (ML) has mostly adopted a purely sub-symbolic learning approach. Using distributed representations of entities, the latest ML performs quick decision-making without building a comprehensible model of the world. While achieving impressive results in computer vision, natural language, game playing, and multimodal learning, such approaches are known to be data inefficient and to struggle at out-of-distribution generalization. Although the use of appropriate inductive biases can alleviate such shortcomings, in general, sub-symbolic models lack comprehensibility. By contrast, symbolic AI is based on rich, high-level representations of the world that use human-readable symbols. By rich knowledge, we refer to logical representations which are more expressive than propositional logic or propositional probabilistic approaches, and which can express knowledge using full first-order logic, including universal and existential quantification (∀x and ∃y), arbitrary n-ary relations over variables, e.g. R(x,y,z,…), and function symbols, e.g. fatherOf(x), x+y, etc. Symbolic AI has achieved success at theorem proving, logical inference, and verification. However, it also has shortcomings when dealing with incomplete knowledge. It can be inefficient with large amounts of inaccurate data and lack robustness to outliers. Purely symbolic decision algorithms usually have high computational complexity, making them impractical for the real world. It is now clear that the predominant approach to ML, where learning is based on recognizing the latent structures hidden in the data, is insufficient and may benefit from symbolic AI [17]. In this context, neurosymbolic AI, which stems from neural networks and symbolic AI, attempts to combine the strengths of both paradigms: that is to say, to combine reasoning with complex representations of knowledge (knowledge-bases, semantic networks, ontologies, trees, and graphs) with learning from complex data (images, time series, sensorimotor data, natural language) (see [16, 40, 54] for recent surveys). Consequently, a main challenge for neurosymbolic AI is the grounding of symbols, including constants, functional and relational symbols, into real data, which is akin to the longstanding symbol grounding problem [30].

*Corresponding author
Email addresses: badreddine.[email protected] (Samy Badreddine), a.[email protected] (Artur d'Avila Garcez), [email protected] (Luciano Serafini), [email protected] (Michael Spranger)

Logic Tensor Networks (LTN) are a neurosymbolic framework and computational model that supports learning and reasoning about data with rich knowledge. In LTN, one can represent and effectively compute the most important tasks of deep learning with a fully differentiable first-order logic language, called Real Logic, which adopts infinitely many truth-values in the interval [0,1] [22, 25]. In particular, LTN supports the specification and computation of the following AI tasks uniformly using the same language: data clustering, classification, relational learning, query answering, semi-supervised learning, regression, and embedding learning.
LTN and Real Logic were first introduced in [62]. Since then, LTN has been applied to different AI tasks involving perception, learning, and reasoning about relational knowledge. In [18, 19], LTN was applied to semantic image interpretation, whereby relational knowledge about objects was injected into deep networks for object relationship detection. In [6], LTN was evaluated on its capacity to perform reasoning about ontological knowledge. Furthermore, [7] shows how LTN can be used to learn an embedding of concepts into a latent real space by taking into consideration ontological knowledge about such concepts. In [3], LTN is used to annotate a reinforcement learning environment with prior knowledge and incorporate latent information into an agent. In [42], the authors embed LTN in a state-of-the-art convolutional object detector. Extensions and generalizations of LTN have also been proposed in the past years, such as LYRICS [47] and Differentiable Fuzzy Logic (DFL) [68, 69]. LYRICS provides an input language allowing one to define background knowledge using a first-order logic where predicate and function symbols are grounded onto any computational graph. DFL analyzes how a large collection of fuzzy logic operators behave in a differentiable learning setting. DFL also introduces new semantics for fuzzy logic implications, called sigmoidal implications, and it shows that such semantics outperform other semantics in several semi-supervised machine learning tasks.
This paper provides a thorough description of the full formalism and several extensions of LTN. We show, using an extensive set of explanatory examples, how LTN can be applied to solve many ML tasks with the help of logical knowledge. In particular, the earlier versions of LTN have been extended with: (1) Explicit domain declaration: constants, variables, functions and predicates are now domain-typed (e.g. the constants John and Paris can be from the domains of person and city, respectively). The definition of structured domains is also possible (e.g. the domain couple can be defined as the Cartesian product of two domains of persons); (2) Guarded quantifiers: guarded universal and existential quantifiers now allow the user to limit the quantification to the elements that satisfy some Boolean condition, e.g. ∀x : age(x) < 10 (playsPiano(x) → enfantProdige(x)) restricts the quantification to the cases where age is lower than 10; (3) Diagonal quantification: diagonal quantification allows the user to write statements about specific tuples extracted in order from n variables. For example, if the variables capital and country both have k instances, such that the i-th instance of capital corresponds to the i-th instance of country, one can write ∀ Diag(capital, country) capitalOf(capital, country).
Inspired by the work of [69], this paper also extends the product t-norm configuration of LTN with the generalized mean aggregator, and it introduces solutions to the vanishing or exploding gradient problems. Finally, the paper formally defines a semantic approach to refutation-based reasoning in Real Logic to verify if a statement is a logical consequence of a knowledge base. Example 4.8 proves that this new approach can better capture logical consequences compared to simply querying unknown formulas after learning (as done in [6]).
The new version of LTN has been implemented in TensorFlow 2 [1]. Both the LTN library and the code for the examples used in this paper are available at https://github.com/logictensornetworks/logictensornetworks.
The remainder of the paper is organized as follows: In Section 2, we define and illustrate Real Logic as a fully-differentiable first-order logic. In Section 3, we specify learning and reasoning in Real Logic and its modeling into deep networks with Logic Tensor Networks (LTN). In Section 4 we illustrate the reach of LTN by investigating a range of learning problems from clustering to embedding learning. In Section 5, we place LTN in the context of the latest related work in neurosymbolic AI. In Section 6 we conclude and discuss directions for future work. The Appendix contains information about the implementation of LTN in TensorFlow 2, experimental set-ups, the different options for the differentiable logic operators, and a study of their relationship with gradient computations.

2. Real Logic

2.1. Syntax

Real Logic forms the basis of Logic Tensor Networks. Real Logic is defined on a first-order language L with a signature that contains a set C of constant symbols (objects), a set F of functional symbols, a set P of relational symbols (predicates), and a set X of variable symbols. L-formulas allow us to specify relational knowledge with variables, e.g. the atomic formula is_friend(v1,v2) may state that the person v1 is a friend of the person v2, the formula ∀x∀y(is_friend(x,y) → is_friend(y,x)) states that the relation is_friend is symmetric, and the formula ∀x(∃y(Italian(y) ∧ is_friend(x,y))) states that every person has a friend that is Italian. Since we are interested in learning and reasoning in real-world scenarios where degrees of truth are often fuzzy and exceptions are present, formulas can be partially true, and therefore we adopt fuzzy semantics.
Objects can be of different types. Similarly, functions and predicates are typed. Therefore, we assume there exists a non-empty set of symbols D, called domain symbols. To assign types to the elements of L, we introduce the functions D, Din and Dout such that:
  • D : X ∪ C → D. Intuitively, D(x) and D(c) return the domain of a variable x or a constant c.
  • Din : F ∪ P → D*, where D* is the Kleene star of D, that is, the set of all finite sequences of symbols in D. Intuitively, Din(f) and Din(p) return the domains of the arguments of a function f or a predicate p. If f takes two arguments (for example, f(x,y)), Din(f) returns two domains, one per argument.
  • Dout : F → D. Intuitively, Dout(f) returns the range of a function symbol.
Real Logic may also contain propositional variables, as follows: if P is a 0-ary predicate with Din(P) = ⟨⟩ (the empty sequence of domains), then P is a propositional variable (an atom with truth-value in the interval [0,1]).
A term is constructed recursively in the usual way from constant symbols, variables, and function symbols. An expression formed by applying a predicate symbol to an appropriate number of terms with appropriate domains is called an atomic formula, which evaluates to true or false in classical logic and a number in [0,1] in the case of Real Logic. We define the set of terms of the language as follows:
  • each element t of X ∪ C is a term of the domain D(t);
  • if ti is a term of domain D(ti) for 1 ≤ i ≤ n, then t1t2…tn (the sequence composed of t1 followed by t2 and so on, up to tn) is a term of the domain D(t1)D(t2)…D(tn);
  • if t is a term of the domain Din(f), then f(t) is a term of the domain Dout(f).
We allow the following set of formulas in L:
  • t1 = t2 is an atomic formula for any terms t1 and t2 with D(t1) = D(t2);
  • p(t) is an atomic formula if D(t) = Din(p);
  • if ϕ and ψ are formulas and x1,…,xn are n distinct variable symbols, then ∗ϕ, ϕ∘ψ and Q x1…xn ϕ are formulas, where ∗ is a unary connective, ∘ is a binary connective and Q is a quantifier.
We use ∗ ∈ {¬} (negation), ∘ ∈ {∧, ∨, →, ↔} (conjunction, disjunction, implication and biconditional, respectively) and Q ∈ {∀, ∃} (universal and existential quantification, respectively).
Example 1. Let Town denote the domain of towns in the world and People denote the domain of living people. Suppose that L contains the constant symbols Alice, Bob and Charlie of domain People, and Rome and Seoul of domain Town. Let x be a variable of domain People and u be a variable of domain Town. The term x,u (i.e. the sequence x followed by u) has domain People,Town, which denotes the Cartesian product of People and Town (People × Town). Alice,Rome is interpreted as an element of the domain People,Town. Let lives_in be a predicate with input domain Din(lives_in) = People,Town. lives_in(Alice,Rome) is a well-formed expression, whereas lives_in(Bob,Charlie) is not.

2.2. Semantics of Real Logic

The semantics of Real Logic departs from the standard abstract semantics of First-order Logic (FOL). In Real Logic, domains are interpreted concretely by tensors in the real field.1 Every object denoted by constants, variables, and terms is interpreted as a tensor of real values. Functions are interpreted as real functions or tensor operations. Predicates are interpreted as functions or tensor operations projecting onto a value in the interval [0,1].

1 In the rest of the paper,we commonly use "tensor" to designate "tensor in the real field".

To emphasize the fact that in Real Logic symbols are grounded onto real-valued features, we use the term grounding, denoted by G, in place of interpretation.2 Notice that this is different from the common use of the term grounding in logic, which indicates the operation of replacing the variables of a term or formula with constants or terms containing no variables. To avoid confusion, we use the synonym instantiation for this purpose. G associates a tensor of real numbers to any term of L, and a real number in the interval [0,1] to any formula ϕ of L. Intuitively, G(t) are the numeric features of the objects denoted by t, and G(ϕ) represents the system's degree of confidence in the truth of ϕ; the higher the value, the higher the confidence.

2.2.1. Grounding domains and the signature

A grounding for a logical language L on the set of domains D provides the interpretation of both the domain symbols in D and the non-logical symbols in L .
Definition 1. A grounding G associates to each domain D ∈ D a set G(D) ⊆ ⋃_{n1…nd ∈ ℕ*} ℝ^{n1×…×nd}.
For every D1…Dn ∈ D*, G(D1…Dn) = ⨉_{i=1}^{n} G(Di), that is, G(D1) × G(D2) × … × G(Dn).
Notice that the elements in G(D) may be tensors of any rank d and any dimensions n1×…×nd, as ℕ* denotes the Kleene star of ℕ.3
Example 2. Let digit_images denote a domain of images of handwritten digits. If we use images of 256×256 RGB pixels, then G(digit_images) = ℝ^{256×256×3}. Let us consider the atomic formula is_digit(Z, 8), where the constant Z denotes an image of a handwritten digit. The terms Z, 8 have domains digit_images, digits. Any input to the predicate is a tuple in G(digit_images, digits) = G(digit_images) × G(digits).
A grounding assigns to each constant symbol c a tensor G(c) in the domain G(D(c)); it assigns to a variable x a finite sequence of tensors ⟨d1,…,dk⟩, each in G(D(x)). These tensors represent the instances of x. Differently from FOL, where a variable is assigned a single value of the domain of interpretation at a time, in Real Logic a variable is assigned a sequence of values in its domain, the k examples of x. A grounding assigns to a function symbol f a function taking tensors from G(Din(f)) as input and producing a tensor in G(Dout(f)) as output. Finally, a grounding assigns to a predicate symbol p a function taking tensors from G(Din(p)) as input and producing a truth-value in the interval [0,1] as output.
Definition 2. A grounding G of L is a function defined on the signature of L that satisfies the following conditions:
  1. G(x) = ⟨d1,…,dk⟩ ∈ ⨉_{i=1}^{k} G(D(x)) for every variable symbol x ∈ X, with k ∈ ℕ⁺. Notice that G(x) is a sequence and not a set, meaning that the same value of G(D(x)) can occur multiple times in G(x), as is usual in a Machine Learning data set with "attributes" and "values";

2 An interpretation is an assignment of truth-values true or false, or in the case of Real Logic a value in [0,1], to a formula. A model is an interpretation that maps a formula to true.
3 A tensor of rank 0 corresponds to a scalar,a tensor of rank 1 to a vector,a tensor of rank 2 to a matrix and so forth,in the usual way.

  2. G(f) ∈ G(Din(f)) → G(Dout(f)) for every function symbol f ∈ F;
  3. G(p) ∈ G(Din(p)) → [0,1] for every predicate symbol p ∈ P.
If a grounding depends on a set of parameters θ, we denote it as Gθ(·) or G(· | θ), interchangeably. Section 4 describes how such parameters can be learned using the concept of satisfiability.

2.2.2. Grounding terms and atomic formulas

We now extend the definition of grounding to all first-order terms and atomic formulas. Before formally defining these groundings, we describe at a high level what happens when grounding terms that contain free variables.4
Let x be a variable that denotes people. As explained in Definition 2, x is grounded as an explicit sequence of k instances (k = |G(x)|). Consequently, a term height(x) is also grounded as k height values, each corresponding to one instance. We can generalize to expressions with multiple free variables, as shown in Example 3.
In the formal definition below, instead of considering a single term at a time, it is convenient to consider sequences of terms t = t1t2…tm and to define the grounding on t (with the definition of the grounding of a single term being derived as a special case). The fact that the sequence of terms t contains n distinct variables x1,…,xn is denoted by t(x1,…,xn). The grounding of t(x1,…,xn), denoted by G(t(x1,…,xn)), is a tensor with n corresponding axes, one for each free variable, defined as follows:
Definition 3. Let t(x1,…,xn) be a sequence t1…tm of m terms containing n distinct variables x1,…,xn. Let each term ti in t contain ni of those variables, denoted x_{j(i,1)},…,x_{j(i,ni)}.
  • G(t) is a tensor with dimensions (|G(x1)|,…,|G(xn)|) such that the element of this tensor indexed by k1,…,kn, written as G(t)_{k1…kn}, is equal to the concatenation of G(ti)_{k_{j(i,1)}…k_{j(i,ni)}} for 1 ≤ i ≤ m;
  • G(f(t))_{i1…in} = G(f)(G(t)_{i1…in}), i.e. the element-wise application of G(f) to G(t);
  • G(p(t))_{i1…in} = G(p)(G(t)_{i1…in}), i.e. the element-wise application of G(p) to G(t).
If a term ti contains ni variables x_{j(1)},…,x_{j(ni)} selected from x1,…,xn, then G(ti)_{k_{j(1)}…k_{j(ni)}} can be obtained from G(t)_{i1…in} with an appropriate mapping of the indices i to k.

4 We assume the usual syntactic definition of free and bound variables in FOL. A variable is free if it is not bound by a quantifier (∀, ∃).

Figure 1: Illustration of Example 3 - x and y indicate dimensions associated with the free variables x and y . A tensor representing a term that includes a free variable x will have an axis x . One can index x to obtain results calculated using each of the v1,v2 or v3 values of x . In our graphical convention,the depth of the boxes indicates that the tensor can have feature dimensions (refer to the end of Example 3).
Example 3. Suppose that L contains the variables x and y, the function symbol f, the predicate symbol p and the set of domains D = {V, W}. Let D(x) = V, D(y) = W, Din(f) = V,W, Dout(f) = W and Din(p) = V,W. In what follows, we give an example of a grounding of L and D; the groundings of some possible terms and atomic formulas are discussed below and illustrated in Figure 1.
G(V) = ℝ⁺
G(W) = ℝ
G(x) = ⟨v1, v2, v3⟩
G(y) = ⟨w1, w2⟩
G(p) : x,y ↦ σ(x + y)
G(f) : x,y ↦ x − y
Notice the dimensions of the results. G(f(x,y)) and G(p(x,f(x,y))) return |G(x)| × |G(y)| = 3 × 2 values, one for each combination of individuals that occur in the variables. For functions, there can be additional dimensions associated with the output domain. Suppose a different grounding such that G(Dout(f)) = ℝ^m. Then the dimensions of G(f(x,y)) would have been |G(x)| × |G(y)| × m, where |G(x)| × |G(y)| are the dimensions for indexing the free variables and m are the dimensions associated with the output domain of f. Let us call the latter feature dimensions, as captioned in Figure 1. Notice that G(p(x,f(x,y))) will always return a tensor with the exact dimensions |G(x)| × |G(y)| × 1 because, under any grounding, a predicate always returns a value in [0,1]. Therefore, as the "feature dimension" of predicates is always 1, we choose to "squeeze" it and not to represent it in our graphical convention (see Figure 1; the box output by the predicate has no depth).
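To make the indexing concrete, the following NumPy sketch computes the groundings of f(x,y) and p(x,f(x,y)) from Example 3 by broadcasting one axis per free variable; the concrete values chosen for v1, v2, v3, w1, w2 are assumptions for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    g_x = np.array([1.0, 2.0, 3.0])   # G(x) = <v1, v2, v3>, values in R+
    g_y = np.array([-0.5, 0.5])       # G(y) = <w1, w2>, values in R

    # G(f(x,y)) with G(f): x,y -> x - y; axis 0 indexes x, axis 1 indexes y.
    g_f = g_x[:, None] - g_y[None, :]        # shape (3, 2)

    # G(p(x, f(x,y))) with G(p): x,y -> sigmoid(x + y), applied element-wise.
    g_p = sigmoid(g_x[:, None] + g_f)        # shape (3, 2), values in [0, 1]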
Figure 2: Illustration of an element-wise operator implementing conjunction (p(x) ∧ q(y)). We assume that x and y are two different variables. The result associates one number in the interval [0,1] with every combination of individuals from G(x) and G(y).

2.2.3. Connectives and Quantifiers

The semantics of the connectives is defined according to the semantics of first-order fuzzy logic [28]. Conjunction (∧), disjunction (∨), implication (→) and negation (¬) are associated, respectively, with a t-norm (T), a t-conorm (S), a fuzzy implication (I) and a fuzzy negation (N) operation, FuzzyOp ∈ {T, S, I, N}. Definitions of some common fuzzy operators are presented in Appendix B. Let ϕ and ψ be two formulas with free variables x1,…,xm and y1,…,yn, respectively. Let us assume that the first k variables are common to ϕ and ψ. Recall that ∗ and ∘ denote the sets of unary and binary connectives, respectively. Formally:
(1) G(∗ϕ)_{i1,…,im} = FuzzyOp(∗)(G(ϕ)_{i1,…,im})
(2) G(ϕ∘ψ)_{i1,…,i_{m+n−k}} = FuzzyOp(∘)(G(ϕ)_{i1,…,ik,i_{k+1},…,im}, G(ψ)_{i1,…,ik,i_{m+1},…,i_{m+n−k}})
In (2), (i1,…,ik) denote the indices of the k common variables, (i_{k+1},…,im) denote the indices of the m−k variables appearing only in ϕ, and (i_{m+1},…,i_{m+n−k}) denote the indices of the n−k variables appearing only in ψ. Intuitively, G(ϕ∘ψ) is a tensor whose elements are obtained by applying FuzzyOp(∘) element-wise to every combination of individuals from x1,…,xm and y1,…,yn (see Figure 2).
The semantics of the quantifiers (Q ∈ {∀, ∃}) is defined with the use of aggregation. Let Agg be a symmetric and continuous aggregation operator, Agg : ⋃_{n∈ℕ} [0,1]^n → [0,1]. An analysis of suitable aggregation operators is presented in Appendix B. For every formula ϕ containing the free variables x1,…,xn, suppose, without loss of generality, that quantification applies to the first h variables. We shall therefore apply Agg to the first h axes of G(ϕ), as follows:
(3) G(Q x1,…,xh (ϕ))_{i_{h+1},…,in} = Agg(Q)_{i1=1,…,|G(x1)|; …; ih=1,…,|G(xh)|} G(ϕ)_{i1,…,ih,i_{h+1},…,in}
where Agg(Q) is the aggregation operator associated with the quantifier Q. Intuitively, we obtain G(Q x1,…,xh (ϕ)) by reducing the dimensions associated with x1,…,xh using the operator Agg(Q) (see Figure 3).
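As a sketch of Equations (2) and (3), conjoining two formulas with different free variables broadcasts their truth-values over one axis per variable, and quantification then reduces the corresponding axes; in this NumPy illustration, the product t-norm implements the conjunction and the arithmetic mean stands in for the aggregator (both choices are assumptions).

    import numpy as np

    g_px = np.array([0.2, 0.7, 0.9])        # G(p(x)), |G(x)| = 3
    g_qy = np.array([0.4, 0.8])             # G(q(y)), |G(y)| = 2

    # Eq. (2) with the product t-norm: G(p(x) ∧ q(y)) has shape (3, 2),
    # one truth-value per combination of instances (cf. Figure 2).
    g_and = g_px[:, None] * g_qy[None, :]

    # Eq. (3): quantifying over x reduces axis 0 with an aggregator.
    g_forall_x = g_and.mean(axis=0)         # shape (2,): one value per instance of y
    g_forall_xy = g_and.mean(axis=(0, 1))   # scalar: both variables quantified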
Notice that the above grounded semantics can assign different meanings to the three formulas:
∀xy(ϕ(x,y))   ∀x(∀y(ϕ(x,y)))   ∀y(∀x(ϕ(x,y)))
Figure 3: Illustration of an aggregation operation implementing quantification (∃y∀x) over variables x and y. We assume that x and y have different domains. The result is a single number in the interval [0,1].
The semantics of the three formulas will coincide if the aggregation operator is bi-symmetric. LTN also allows the following form of quantification, here called diagonal quantification (Diag):
(4) G(Q Diag(x1,…,xh)(ϕ))_{i_{h+1},…,in} = Agg(Q)_{i=1,…,min_{1≤j≤h}|G(xj)|} G(ϕ)_{i,…,i,i_{h+1},…,in}
Diag(x1,…,xh) quantifies over specific tuples such that the i-th tuple contains the i-th instance of each of the variables in the argument of Diag, under the assumption that all variables in the argument are grounded onto sequences with the same number of instances. Diag(x1,…,xh) is called diagonal quantification because it quantifies over the diagonal of G(ϕ) along the axes associated with x1…xh, although in practice only the diagonal is built and not the entire G(ϕ), as shown in Figure 4. For example, given a data set with samples x and target labels y, if looking to write a statement p(x,y) that holds true for each pair of sample and label, one can write ∀ Diag(x,y) p(x,y), given that |G(x)| = |G(y)|. As another example, given two variables x and y whose groundings contain 10 instances of x and y each, the expression Diag(x,y) p(x,y) produces 10 results, such that the i-th result corresponds to the i-th instances of each grounding. Without Diag, the expression would be evaluated for all 10 × 10 combinations of the elements in G(x) and G(y).5 Diag will find much application in the examples and experiments to follow.
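The following NumPy sketch contrasts the two evaluations, using a hypothetical 10 × 10 table of truth-values for p(x,y); the full table is built here only for illustration, whereas LTN would compute just the diagonal.

    import numpy as np

    rng = np.random.default_rng(0)
    truth = rng.uniform(size=(10, 10))   # hypothetical G(p(x,y)) for all combinations

    # Without Diag: quantifying over x and y aggregates all 10 x 10
    # combinations (the mean stands in for Agg(Q)).
    sat_all = truth.mean()

    # With Diag(x,y): only the i-th instance of x is paired with the i-th
    # instance of y, i.e. the diagonal of the table (Eq. (4)).
    sat_diag = np.diagonal(truth).mean()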

2.3. Guarded Quantifiers

In many situations, one may wish to quantify over a set of elements of a domain whose groundings satisfy some condition. In particular, one may wish to express such a condition using formulas of the language, of the form:

(5) ∀y (∃x : age(x) > age(y) (parent(x,y)))
The grounding of such a formula is obtained by aggregating the values of parent (x,y) only for the instances of x that satisfy the condition age(x)>age(y) ,that is:
Agg(∀)_{j=1,…,|G(y)|} Agg(∃)_{i=1,…,|G(x)| s.t. G(age(x))_i > G(age(y))_j} G(parent(x,y))_{i,j}

5 Notice how Diag is not simply "syntactic sugar" for creating a new variable pairs_xy by stacking pairs of examples from G(x) and G(y) . If the groundings of x and y have incompatible ranks (for instance,if x denotes images and y denotes their labels),stacking them in a tensor G (pairs_xy) is non-trivial,requiring several reshaping operations.

Figure 4: Diagonal Quantification: Diag(x1,x2) quantifies over specific tuples only, such that the i-th tuple contains the i-th instances of the variables x1 and x2 in the groundings G(x1) and G(x2), respectively. Diag(x1,x2) assumes, therefore, that x1 and x2 have the same number of instances, as in the case of samples x1 and their labels x2 in a typical supervised learning task.
The evaluation of which tuples are safe, that is, satisfy the condition, is purely symbolic and non-differentiable; guarded quantifiers aggregate over only the subset of instances that satisfy the condition, and they apply when such symbolic knowledge is crisp and available. More generally, in what follows, m is a symbol representing the condition, which we shall call a mask, and G(m) associates to m a function6 returning a Boolean.
(6) G(Q x1,…,xh : m(x1,…,xn) (ϕ))_{i_{h+1},…,in} =def Agg(Q)_{i1=1,…,|G(x1)|; …; ih=1,…,|G(xh)|, s.t. G(m)(G(x1)_{i1},…,G(xn)_{in})} G(ϕ)_{i1,…,ih,i_{h+1},…,in}
Notice that the semantics of a guarded sentence ∀x : m(x) (ϕ(x)) is different from the semantics of ∀x (m(x) → ϕ(x)). In crisp and traditional FOL, the two statements would be equivalent. In Real Logic, they can give different results. Let G(x) be a sequence of 3 values, with G(m(x)) = (0, 1, 1) and G(ϕ(x)) = (0.2, 0.7, 0.8). Only the second and third instances of x are safe, that is, are in the masked subset. Let → be defined using the Reichenbach operator IR(a,b) = 1 − a + ab, and let ∀ be defined using the mean operator. We have G(∀x (m(x) → ϕ(x))) = (1 + 0.7 + 0.8)/3 ≈ 0.833, whereas G(∀x : m(x) (ϕ(x))) = (0.7 + 0.8)/2 = 0.75. Also, in the computational graph of the guarded sentence, there are no gradients attached to the instances that do not verify the mask. Similarly, the semantics of ∃x : m(x) (ϕ(x)) is not equivalent to that of ∃x (m(x) ∧ ϕ(x)).
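The numbers in this example can be reproduced with a few lines of NumPy; note how the guarded version filters instances out before aggregating, so masked-out instances contribute neither to the truth-value nor to the gradients.

    import numpy as np

    phi = np.array([0.2, 0.7, 0.8])       # G(phi(x)) for the three instances of x
    mask = np.array([False, True, True])  # G(m(x)): which instances are safe

    def reichenbach(a, b):                # I_R(a, b) = 1 - a + a*b
        return 1.0 - a + a * b

    # forall x (m(x) -> phi(x)): the implication is computed on every instance.
    v1 = reichenbach(mask.astype(float), phi).mean()  # (1 + 0.7 + 0.8)/3

    # forall x : m(x) (phi(x)): unsafe instances are removed before aggregation.
    v2 = phi[mask].mean()                             # (0.7 + 0.8)/2

    print(round(v1, 3), round(v2, 3))     # 0.833 0.75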

6 In some edge cases, a masking may produce an empty sequence, e.g. if for some value of G(y) there is no value in G(x) that satisfies age(x) > age(y). In such cases, we resort to the concept of an empty semantics: ∀ returns 1 and ∃ returns 0.

Figure 5: Example of Guarded Quantification: One can filter out elements of the various domains that do not satisfy some condition before the aggregation operators for ∀ and ∃ are applied.

2.4. Stable Product Real Logic

It has been shown in [69] that not all first-order fuzzy logic semantics are equally suited for gradient-descent optimization. Many fuzzy logic operators can lead to vanishing or exploding gradients. Some operators are also single-passing, in that they propagate gradients to only one input at a time.
In general, the best performing symmetric configuration7 for the connectives uses the product t-norm TP for conjunction, its dual t-conorm SP for disjunction, standard negation NS, and the Reichenbach implication IR (the S-implication corresponding to the above operators). This subset of Real Logic, where the grounding of the connectives is restricted to the product configuration, is called Product Real Logic in [69]. Given two truth-values a and b in [0,1]:
(7) ¬ : NS(a) = 1 − a
(8) ∧ : TP(a,b) = ab
(9) ∨ : SP(a,b) = a + b − ab
(10) → : IR(a,b) = 1 − a + ab
Appropriate aggregators for ∃ and ∀ are the generalized mean ApM with p ≥ 1, to approximate the existential quantification, and the generalized mean w.r.t. the error ApME with p ≥ 1, to approximate the universal quantification. They can be understood as a smooth maximum and a smooth minimum, respectively. Given n truth-values a1,…,an, all in [0,1]:
(11) ∃ : ApM(a1,…,an) = ((1/n) ∑_{i=1}^{n} ai^p)^{1/p},  p ≥ 1
(12) ∀ : ApME(a1,…,an) = 1 − ((1/n) ∑_{i=1}^{n} (1 − ai)^p)^{1/p},  p ≥ 1
ApME measures the power of the deviation of each value from the ground truth 1. With p = 2, it is equivalent to 1 − RMSE(a, 1), where RMSE is the root-mean-square error, a is the vector of truth-values and 1 is a vector of 1s.

7 We define a symmetric configuration as a set of fuzzy operators such that conjunction and disjunction are defined by a t-norm and its dual t-conorm, respectively, and the implication operator is derived from such conjunction or disjunction operators and standard negation (c.f. Appendix B for details). In [69], van Krieken et al. also analyze non-symmetric configurations and even operators that do not strictly verify fuzzy logic semantics.

The intuition behind the choice of p is that the higher p is, the more weight ApM (resp. ApME) gives to true (resp. false) truth-values, converging to the max (resp. min) operator. Therefore, the value of p can be seen as a hyper-parameter, as it offers flexibility to account for outliers in the data depending on the application.
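A quick numeric check of this behaviour, with hypothetical truth-values, shows both operators collapsing to the arithmetic mean at p = 1 and drifting towards max and min respectively as p grows:

    import numpy as np

    def ApM(a, p):    # Eq. (11): smooth maximum, approximates "exists"
        return np.mean(a ** p) ** (1.0 / p)

    def ApME(a, p):   # Eq. (12): smooth minimum, approximates "forall"
        return 1.0 - np.mean((1.0 - a) ** p) ** (1.0 / p)

    a = np.array([0.1, 0.9, 0.95])   # hypothetical truth-values
    for p in (1, 2, 10):
        print(p, round(ApM(a, p), 3), round(ApME(a, p), 3))
    # With p = 1 both reduce to the arithmetic mean (0.65); as p grows,
    # ApM drifts towards max(a) = 0.95 and ApME towards min(a) = 0.1.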
Nevertheless, Product Real Logic still has the following gradient problems: TP(a,b) has vanishing gradients on the edge case a = b = 0; SP(a,b) has vanishing gradients on the edge case a = b = 1; IR(a,b) has vanishing gradients on the edge case a = 0, b = 1; ApM(a1,…,an) has exploding gradients when ∑_i (ai)^p tends to 0; ApME(a1,…,an) has exploding gradients when ∑_i (1 − ai)^p tends to 0 (see Appendix C for details).
To address these problems, we define the projections π0 and π1 below, with ε an arbitrarily small positive real number:
(13) π0 : [0,1] → ]0,1] : a ↦ (1 − ε)a + ε
(14) π1 : [0,1] → [0,1[ : a ↦ (1 − ε)a
We then derive the following stable operators to produce what we call the Stable Product Real Logic configuration:
(15) NS′(a) = NS(a)
(16) TP′(a,b) = TP(π0(a), π0(b))
(17) SP′(a,b) = SP(π1(a), π1(b))
(18) IR′(a,b) = IR(π0(a), π1(b))
(19) ApM′(a1,…,an) = ApM(π0(a1),…,π0(an)),  p ≥ 1
(20) ApME′(a1,…,an) = ApME(π1(a1),…,π1(an)),  p ≥ 1
It is important to note that the conjunction operator in the stable product semantics is not a t-norm:8 TP′(a,b) does not satisfy identity in [0,1[, since for any 0 ≤ a < 1, TP′(a,1) = (1 − ε)a + ε ≠ a, although ε can be chosen arbitrarily small. In the experimental evaluations reported in Section 4, we find that the adoption of the stable product semantics is an important practical step to improve the numerical stability of the learning system.
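A minimal NumPy sketch of Equations (13)-(20) follows, assuming ε = 1e-4 (the exact value of ε is a hyper-parameter choice):

    import numpy as np

    eps = 1e-4   # an arbitrarily small positive number (assumed value)

    def pi0(a):   # Eq. (13): maps [0,1] into ]0,1]
        return (1.0 - eps) * a + eps

    def pi1(a):   # Eq. (14): maps [0,1] into [0,1[
        return (1.0 - eps) * a

    def not_stable(a):         # Eq. (15): standard negation is unchanged
        return 1.0 - a

    def and_stable(a, b):      # Eq. (16): stable product t-norm
        return pi0(a) * pi0(b)

    def or_stable(a, b):       # Eq. (17): stable product t-conorm
        a, b = pi1(a), pi1(b)
        return a + b - a * b

    def implies_stable(a, b):  # Eq. (18): stable Reichenbach implication
        a, b = pi0(a), pi1(b)
        return 1.0 - a + a * b

    def exists_stable(a, p=2):   # Eq. (19): stable ApM
        return np.mean(pi0(a) ** p) ** (1.0 / p)

    def forall_stable(a, p=2):   # Eq. (20): stable ApME
        return 1.0 - np.mean((1.0 - pi1(a)) ** p) ** (1.0 / p)

    # The edge case a = b = 0, which has vanishing gradients under the plain
    # product t-norm, is now mapped strictly inside the unit interval:
    print(and_stable(0.0, 0.0))   # eps**2 = 1e-08 > 0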

3. Learning, Reasoning, and Querying in Real Logic

In Real Logic, one can define the tasks of learning, reasoning and query-answering. Given a Real Logic theory that represents the knowledge of an agent at a given time, learning is the task of making generalizations from specific observations obtained from data. This is often called inductive inference. Reasoning is the task of deriving what knowledge follows from the facts which are currently known. Query answering is the task of evaluating the truth value of a certain logical expression (called a query), or finding the set of objects in the data that evaluate a certain expression to true. In what follows, we define and exemplify each of these tasks. To do so, we first need to specify which types of knowledge can be represented in Real Logic.

8 Recall that a t-norm is a function T : [0,1] × [0,1] → [0,1] satisfying commutativity, monotonicity, associativity and identity, that is, T(a,1) = a.

3.1. Representing Knowledge with Real Logic

In logic-based knowledge representation systems, knowledge is represented by logical formulas whose intended meanings are propositions about a domain of interest. The connection between the symbols occurring in the formulas and what holds in the domain is not represented in the knowledge base and is left implicit since it does not have any effect on the logic computations. In Real Logic, by contrast, the connection between the symbols and the domain is represented explicitly in the language by the grounding G ,which plays an important role in both learning and reasoning. G is an integral part of the knowledge represented by Real Logic. A Real Logic knowledge base is therefore defined by the formulas of the logical language and knowledge about the domain in the form of groundings obtained from data. The following types of knowledge can be represented in Real Logic.

3.1.1. Knowledge through symbol groundings

Boundaries for domain grounding. These are constraints specifying that the value of a certain logical expression must be within a certain range. For instance, one may specify that the domain D must be interpreted in the [0,1] hyper-cube or in the standard n-simplex, i.e. the set of ⟨d1,…,dn⟩ ∈ (ℝ⁺)^n such that ∑_i di = 1. Other intuitive examples of range constraints include the elements of the domain "colour" grounded onto points in [0,1]³, such that every element is associated with a triplet of values (R,G,B) with R,G,B ∈ [0,1], or the range of a function age(x) as an integer between 0 and 100.
Explicit definition of grounding for symbols. Knowledge can be more strictly incorporated by fixing the grounding of some symbols. If a constant c denotes an object with known features vc ∈ ℝ^n, we can fix its grounding G(c) = vc. Training data that consists in a set of n data items, such as n images (or tuples known as training examples), can be specified in Real Logic by n constants, e.g. img1, img2, …, imgn, and by their groundings, e.g. by setting G(img1), G(img2), …, G(imgn) to the corresponding image tensors. These can be gathered in a variable imgs. A binary predicate sim that measures the similarity of two objects can be grounded as, e.g., the cosine similarity of two vectors v and w, (v,w) ↦ (v·w)/(‖v‖‖w‖). The output layer of the neural network associated with a multi-class single-label predicate P(x,class) can be a softmax function normalizing the output such that it guarantees exclusive classification, i.e. ∑_i P(x,i) = 1.9 Grounding of constants and functions allows the computation of the grounding of their results. If, for example, G(transp) is the function that transposes a matrix, then G(transp(img1)) is the transpose of the image tensor G(img1).
Parametric definition of grounding for symbols. Here, the exact grounding of a symbol σ is not known, but it is known that it can be obtained by finding a set of real-valued parameters, that is, via learning. To emphasize this fact, we adopt the notation G(σ) = G(σ | θσ), where θσ is the set of parameter values that determines the value of G(σ). The typical example of parametric grounding for constants is the learning of an embedding. Let emb(word | θemb) be a word embedding with parameters θemb, which takes as input a word and returns its embedding in ℝ^n. If the words of a vocabulary W = {w1,…,w|W|} are constant symbols, their groundings G(wi | θemb) are defined parametrically w.r.t. θemb as emb(wi | θemb). An example of parametric grounding for a function symbol f is to assume that G(f) is a linear function such that G(f) : ℝ^m → ℝ^n maps each v ∈ ℝ^m into Af·v + bf, with Af a matrix and bf a vector of real numbers. In this case, G(f) = G(f | θf), where θf = {Af, bf}. Finally, the grounding of a predicate symbol can be given, for example, by a neural network N with parameters θN. As an example, consider a neural network N trained for image classification into n classes: cat, dog, horse, etc. N takes as input a vector v of pixel values and produces as output a vector y = (y_cat, y_dog, y_horse, …) in [0,1]^n, such that y = N(v | θN), where yc is the probability that the input image v is of class c. In case the classes are, alternatively, chosen to be represented by unary predicate symbols such as cat(v), dog(v), horse(v), then G(cat(v)) = N(v | θN)_cat, G(dog(v)) = N(v | θN)_dog, G(horse(v)) = N(v | θN)_horse, etc.

9 Notice that softmax is often used as the last layer in neural networks to turn logits into a probability distribution. However, we do not use the softmax function as such here. Instead, we use it here to enforce an exclusivity constraint on satisfiability scores.

3.1.2. Knowledge through formulas

Factual propositions. Knowledge about the properties of specific objects in the domain is represented, as usual, by logical propositions, as exemplified below. Suppose that it is known that img1 is a number nine, img2 is a number eight, and imgn is a number two. This can be represented by adding the following facts to the knowledge-base: nine(img1), eight(img2), …, two(imgn). Supervised learning, that is, learning with the use of training examples which include target values (labelled data), is specified in Real Logic by combining grounding definitions and factual propositions. For example, the fact that an image Z is a positive example for the class nine and a negative example for the class eight is specified by defining G(img1) = Z alongside the propositions nine(img1) and ¬eight(img1). Notice how semi-supervision can be specified naturally in Real Logic by adding propositions containing disjunctions, e.g. eight(img1) ∨ nine(img1), which states that img1 is either an eight or a nine (or both). Finally, relational learning can be achieved by relating logically multiple objects (defined as constants or variables or even as more complex sequences of terms), such as e.g.: nine(img1) → ¬nine(img2) (if img1 is a nine then img2 is not a nine) or ∀img (nine(img) → ¬eight(img)) (if an image is a nine then it is not an eight). The use of more complex knowledge including the use of variables such as img above is the topic of generalized propositions, discussed next.
Generalized propositions. General knowledge about all or some of the objects of some domains can be specified in Real Logic by using first-order logic formulas with quantified variables. This general type of knowledge allows one to specify arbitrary constraints on the groundings independently from the specific data available. It allows one to specify, in a concise way, knowledge that holds true for all the objects of a domain. This is especially useful in Machine Learning in the semi-supervised and unsupervised settings, where there is no specific knowledge about a single individual. For example, as part of a task of multi-label classification with constraints on the labels [12], a positive label constraint may express that if an example is labelled with l1,…,lk, then it should also be labelled with l_{k+1}. This can be specified in Real Logic with a universally quantified formula: ∀x (l1(x) ∧ … ∧ lk(x) → l_{k+1}(x)).10 Another example of soft constraints used in Statistical Relational Learning associates the labels of related examples. For instance, in Markov Logic Networks [55], as part of the well-known Smokers and Friends example, people who are smokers are associated by the friendship relation. In Real Logic, the formula ∀x∀y((smokes(x) ∧ friend(x,y)) → smokes(y)) would be used to encode the soft constraint that friends of smokers are normally smokers.

10 This can also be specified using a guarded quantifier ∀x : ((l1(x) ∧ … ∧ lk(x)) > th) (l_{k+1}(x)), where th is a threshold value in [0,1].

3.1.3. Knowledge through fuzzy semantics

Definition for operators. The grounding of a formula ϕ depends on the operators approximating the connectives and quantifiers that appear in ϕ. Different operators give different interpretations of the satisfaction associated with the formula. For instance, the operator ApME(a1,…,an) that approximates universal quantification can be understood as a smooth minimum. It depends on a hyper-parameter p (the exponent used in the generalized mean). If p = 1 then ApME(a1,…,an) corresponds to the arithmetic mean. As p increases, given the same input, the value of the universally quantified formula will decrease as ApME converges to the min operator. To define how strictly the universal quantification should be interpreted in each proposition, one can use different values of p for different propositions of the knowledge base. For instance, a formula ∀x P(x) where ApME is used with a low value for p will in fact denote that P holds for some x, whereas a formula ∀x Q(x) with a higher p may denote that Q holds for most x.

3.1.4. Satisfiability

In summary, a Real Logic knowledge-base has three components: the first describes knowledge about the grounding of symbols (domains, constants, variables, functions, and predicate symbols); the second is a set of closed logical formulas describing factual propositions and general knowledge; the third lies in the operators and the hyperparameters used to evaluate each formula. The definition that follows formalizes this notion.
Definition 4 (Theory/Knowledge-base). A theory of Real Logic is a triple T = ⟨K, G(· | θ), Θ⟩, where K is a set of closed first-order logic formulas defined on the set of symbols S = D ∪ X ∪ C ∪ F ∪ P denoting, respectively, domains, variables, constants, function and predicate symbols; G(· | θ) is a parametric grounding for all the symbols s ∈ S and all the logical operators; and Θ = {Θs}_{s∈S} is the hypothesis space for each set of parameters θs associated with symbol s.
Learning and reasoning in a Real Logic theory are both associated with searching for and applying the set of values of parameters θ from the hypothesis space Θ that maximize the satisfaction of the formulas in K. We use the term grounded theory, denoted by ⟨K, Gθ⟩, to refer to a Real Logic theory with a specific set of learned parameter values. This idea shares some similarity with the weighted MAX-SAT problem [43], where the weights for the formulas in K are given by their fuzzy truth-values, obtained by choosing the parameter values of the grounding. To define this optimization problem, we aggregate the truth-values of all the formulas in K by selecting a formula aggregating operator SatAgg : [0,1]* → [0,1].
Definition 5. The satisfiability of a theory T = ⟨K, Gθ⟩ with respect to the aggregating operator SatAgg is defined as SatAgg_{ϕ∈K} Gθ(ϕ).

3.2. Learning

Given a Real Logic theory T = ⟨K, G(· | θ), Θ⟩, learning is the process of searching for the set of parameter values θ* that maximize the satisfiability of T w.r.t. a given aggregator:

θ* = argmax_{θ∈Θ} SatAgg_{ϕ∈K} Gθ(ϕ)
Notice that with this general formulation, one can learn the grounding of constants, functions, and predicates. The learning of the grounding of constants corresponds to the learning of embeddings. The learning of the grounding of functions corresponds to the learning of generative models or to a regression task. Finally, the learning of the grounding of predicates corresponds to a classification task in Machine Learning.
In some cases, it is useful to impose some regularization (as done customarily in ML) on the set of parameters θ ,thus encoding a preference on the hypothesis space Θ ,such as a preference for smaller parameter values. In this case, learning is defined as follows:
θ* = argmax_{θ∈Θ} (SatAgg_{ϕ∈K} Gθ(ϕ) − λR(θ))

where λ ∈ ℝ⁺ is the regularization parameter and R is a regularization function, e.g. L1 or L2 regularization, that is, L1(θ) = ∑_{θi∈θ} |θi| and L2(θ) = ∑_{θi∈θ} θi².
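As an illustration, a training step for this objective can be written in TensorFlow 2 roughly as follows; this is a sketch, not the LTN library API, and the names theta and sat_agg_fn (the trainable variables and a function computing SatAgg_{ϕ∈K} Gθ(ϕ) as a scalar in [0,1]) are assumptions that depend on the theory at hand.

    import tensorflow as tf

    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
    lam = 0.01   # regularization weight lambda (an assumed value)

    def train_step(theta, sat_agg_fn):
        # theta: list of tf.Variable; sat_agg_fn() -> scalar satisfiability in [0, 1]
        with tf.GradientTape() as tape:
            sat = sat_agg_fn()
            reg = tf.add_n([tf.reduce_sum(v ** 2) for v in theta])  # L2 regularizer R(theta)
            loss = 1.0 - sat + lam * reg   # maximizing SatAgg - lambda * R(theta)
        grads = tape.gradient(loss, theta)
        optimizer.apply_gradients(zip(grads, theta))
        return loss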
LTN can generalize and extrapolate when querying formulas grounded with unseen data (for example, new individuals from a domain), using knowledge learned with previous groundings (for example, re-using a trained predicate). This is explained in Section 3.3.

3.3. Querying

Given a grounded theory T=(K,Gθ) ,query answering allows one to check if a certain fact is true (or, more precisely, by how much it is true since in Real Logic truth-values are real numbers in the interval [0,1]) . There are various types of queries that can be asked of a grounded theory.
A first type of query is called truth queries. Any formula in the language of T can be a truth query. The answer to a truth query ϕq is the truth value of ϕq obtained by computing its grounding, i.e. Gθ(ϕq). Notice that, if ϕq is a closed formula, the answer is a scalar in [0,1] denoting the truth-value of ϕq according to Gθ. If ϕq contains n free variables x1,…,xn, the answer to the query is a tensor of order n such that the component indexed by i1…in is the truth-value of ϕq evaluated in Gθ(x1)_{i1},…,Gθ(xn)_{in}.
The second type of query is called value queries. Any term in the language of T can be a value query. The answer to a value query tq is a tensor of real numbers obtained by computing the grounding of the term,i.e. Gθ(tq) . Analogously to truth queries,the answer to a value query is a "tensor of tensors" if tq contains variables. Using value queries,one can inspect how a constant or a term, more generally, is embedded in the manifold.
The third type of query is called generalization truth queries. With generalization truth queries, we are interested in knowing the truth-values of formulas when these are applied to a new (unseen) set of objects of a domain, such as a validation or a test set of examples typically used in the evaluation of machine learning systems. A generalization truth query is a pair (ϕq(x), U), where ϕq is a formula with a free variable x and U = (u⁽¹⁾,…,u⁽ᵏ⁾) is a set of unseen examples whose dimensions are compatible with those of the domain of x. The answer to the query (ϕq(x), U) is Gθ(ϕq(x)) for x taking each value u⁽ⁱ⁾, 1 ≤ i ≤ k, in U. The result of this query is therefore a vector of |U| truth-values corresponding to the evaluation of ϕq on the new data u⁽¹⁾,…,u⁽ᵏ⁾.
The fourth and final type of query is generalization value queries. These are analogous to generalization truth queries with the difference that they evaluate a term tq(x) ,and not a formula,on new data U . The result,therefore,is a vector of |U| values corresponding to the evaluation of the trained model on a regression task using test data U .
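All four query types reduce to evaluating groundings on known or unseen data. A small NumPy sketch with a hypothetical trained predicate A and function f over ℝ² (the parameter values are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w, b = np.array([4.0, -4.0]), 0.5   # hypothetical learned parameters

    def A(x):          # G_theta(A): R^2 -> [0, 1]
        return sigmoid(x @ w + b)

    def f(x):          # G_theta(f): R^2 -> R^2
        return 2.0 * x

    g_x = rng.uniform(size=(50, 2))   # G(x): the known examples
    U = rng.uniform(size=(10, 2))     # unseen examples

    truth = A(g_x)        # truth query A(x): 50 truth-values
    value = f(g_x)        # value query f(x): a 50 x 2 tensor
    gen_truth = A(U)      # generalization truth query (A(x), U)
    gen_value = f(U)      # generalization value query (f(x), U)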

3.4. Reasoning

3.4.1. Logical consequence in Real Logic

From a pure logic perspective, reasoning is the task of verifying if a formula is a logical consequence of a set of formulas. This can be achieved semantically using model theory (⊨) or syntactically via a proof theory (⊢). To characterize reasoning in Real Logic, we adapt the notion of logical consequence for fuzzy logic provided in [9]: a formula ϕ is a fuzzy logical consequence of a finite set of formulas Γ, in symbols Γ ⊨ ϕ, if for every fuzzy interpretation f, if all the formulas in Γ are true (i.e. evaluate to 1) in f then ϕ is true in f. In other words, every model of Γ is a model of ϕ. A direct application of this definition to Real Logic is not practical since, in most practical cases, the level of satisfiability of a grounded theory ⟨K, Gθ⟩ will not be equal to 1. We therefore define an interval [q, 1] with 1/2 < q < 1 and assume that a formula is true if its truth-value is in the interval [q, 1]. This leads to the following definition:
Definition 6. A closed formula ϕ is a logical consequence of a knowledge-base ⟨K, G(· | θ), Θ⟩, in symbols ⟨K, G(· | θ), Θ⟩ ⊨_q ϕ, if, for every grounded theory ⟨K, Gθ⟩, if SatAgg(K, Gθ) ≥ q then Gθ(ϕ) ≥ q.

3.4.2. Reasoning by optimization

Logical consequence by direct application of Definition 6 requires querying the truth value of ϕ for a potentially infinite set of groundings. Therefore, in practice, we consider the following directions:
Reasoning Option 1 (Querying after learning). This is approximate logical inference by considering only the grounded theories that maximally satisfy ⟨K, G(· | θ), Θ⟩. We therefore define that ϕ is a brave logical consequence of a Real Logic knowledge-base ⟨K, G(· | θ), Θ⟩ if Gθ*(ϕ) ≥ q for all θ* such that:

θ* = argmax_θ SatAgg(K, Gθ)  and  SatAgg(K, Gθ*) ≥ q

The objective is to find all θ* that optimally satisfy the knowledge base and to measure whether they also satisfy ϕ. One can search for such θ* by running multiple optimizations with the objective function of Section 3.2.
This approach is somewhat naive. Even if we run the optimization multiple times with multiple parameter initializations (to, hopefully, reach different optima in the search space), the obtained groundings may not be representative of other optimal or close-to-optimal groundings. In Section 4.8 we give an example that shows the limitations of this approach and motivates the next one.
Reasoning Option 2 (Proof by Refutation). Here, we reason by refutation and search for a counterexample to the logical consequence by introducing an alternative search objective. Normally, according to Definition 6, one tries to verify that:11

(21) for all θ ∈ Θ, if Gθ(K) ≥ q then Gθ(ϕ) ≥ q.
Instead, we solve the dual problem:
(22) there exists θ ∈ Θ such that Gθ(K) ≥ q and Gθ(ϕ) < q.
If Eq. (22) is true, then a counterexample to Eq. (21) has been found and the logical consequence does not hold. If Eq. (22) is false, then no counterexample to Eq. (21) has been found and the logical consequence is assumed to hold true. A search for such parameters θ (the counterexample) can be performed by minimizing Gθ(ϕ) while imposing a constraint that seeks to invalidate results where Gθ(K) < q. We therefore define:
penalty(Gθ, q) = { c if Gθ(K) < q; 0 otherwise },  where c > 1

11 For simplicity, we temporarily define the notation Gθ(K) := SatAgg_{ϕ∈K} Gθ(ϕ).

Given G* such that:

(23) G* = argmin_{Gθ} (Gθ(ϕ) + penalty(Gθ, q))

  • If G*(K) < q: then for all Gθ, Gθ(K) < q, and therefore ⟨K, G(· | θ), Θ⟩ ⊨_q ϕ.
  • If G*(K) ≥ q and G*(ϕ) ≥ q: then for all Gθ with Gθ(K) ≥ q, we have that Gθ(ϕ) ≥ G*(ϕ) ≥ q, and therefore ⟨K, G(· | θ), Θ⟩ ⊨_q ϕ.
  • If G*(K) ≥ q and G*(ϕ) < q: then ⟨K, G(· | θ), Θ⟩ ⊭_q ϕ.
Clearly, Equation (23) cannot be used as an objective function for gradient-descent due to null derivatives. Therefore, we propose to approximate the penalty function with the soft constraint:
elu(α, β(q − Gθ(K))) = { β(q − Gθ(K)) if Gθ(K) ≤ q; α(e^{q−Gθ(K)} − 1) otherwise },

where α ≥ 0 and β ≥ 0 are hyper-parameters (see Figure 6). When Gθ(K) < q, the penalty is linear in q − Gθ(K) with a slope of β. Setting β high, the gradients for Gθ(K) will be high in absolute value if the knowledge-base is not satisfied. When Gθ(K) > q, the penalty is a negative exponential that converges to −α. Setting α low but non-zero seeks to ensure that the gradients do not vanish when the penalty should not apply (that is, when the knowledge-base is satisfied). We obtain the following approximate objective function:
(24) G* = argmin_{Gθ} (Gθ(ϕ) + elu(α, β(q − Gθ(K))))
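A sketch of the resulting objective in TensorFlow 2 follows; the values of q, α and β and the names sat_phi and sat_K (standing for Gθ(ϕ) and Gθ(K), computed elsewhere) are assumptions for the sketch.

    import tensorflow as tf

    q, alpha, beta = 0.95, 0.05, 10.0   # assumed hyper-parameter values

    def elu_penalty(sat_K):
        # elu(alpha, beta * (q - sat_K)): linear with slope beta when the
        # knowledge-base is unsatisfied (sat_K <= q), a negative exponential
        # converging towards -alpha otherwise, keeping gradients alive.
        x = q - sat_K
        return tf.where(x >= 0.0, beta * x, alpha * (tf.exp(x) - 1.0))

    def refutation_loss(sat_phi, sat_K):
        # Eq. (24): search for a counterexample by minimizing G(phi)
        # while softly constraining G(K) >= q.
        return sat_phi + elu_penalty(sat_K)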
Section 4.8 will illustrate the use of reasoning by refutation with an example in comparison with reasoning as querying after learning. Of course, other forms of reasoning are possible, not least that adopted in [6], but a direct comparison is outside the scope of this paper and left as future work.

4. The Reach of Logic Tensor Networks

The objective of this section is to show how the language of Real Logic can be used to specify a number of tasks that involve learning from data and reasoning. Examples of such tasks are classification, regression, clustering, and link prediction. The solution of a problem specified in Real Logic is obtained by interpreting such a specification in Logic Tensor Networks. The LTN library implements Real Logic in TensorFlow 2 [1] and is available from GitHub.13 Every logical operator is grounded using TensorFlow primitives, such that LTN directly implements a TensorFlow graph. Due to TensorFlow's built-in optimization, LTN is relatively efficient while providing the expressive power of first-order logic. Details on the implementation of the examples described in this section are reported in Appendix A. The implementation of the examples presented here is also available from the LTN repository on GitHub. Except when stated otherwise, the results reported are the average over 10 runs using a 95% confidence interval. Every example uses the stable product configuration to approximate the Real Logic operators and the Adam optimizer [35] with a learning rate of 0.001. Table A.3 in the Appendix gives an overview of the network architectures used to obtain the results reported in this section.

12 In the objective function, G* should satisfy G*(K) ≥ q before reducing G*(ϕ), because the penalty c, which is greater than 1, is higher than any potential reduction in Gθ(ϕ), which is smaller than or equal to 1.
13 https://github.com/logictensornetworks/logictensornetworks

Figure 6: elu(α, βx), where α ≥ 0 and β ≥ 0 are hyper-parameters. The function elu(α, β(q − Gθ(K))), with α low and β high, is a soft constraint for penalty(Gθ, q) suitable for learning.

4.1. Binary Classification

The simplest machine learning task is binary classification. Suppose that one wants to learn a binary classifier A for a set of points in [0,1]2 . Suppose that a set of positive and negative training examples is given. LTN uses the following language and grounding:

Domains:

points (denoting the examples).

Variables:

x+ for the positive examples.
x− for the negative examples.
x for all examples.
D(x) = D(x+) = D(x−) = points.
Predicates:
A(x) for the trainable classifier.
Din(A) = points.

Axioms:

(25) ∀x+ A(x+)
(26) ∀x− ¬A(x−)

Grounding:

G(points) = [0,1]².
G(x) ∈ [0,1]^{m×2} (G(x) is a sequence of m points, that is, m examples).
G(x+) = ⟨d ∈ G(x) : ‖d − (0.5,0.5)‖ < 0.09⟩.14
G(x−) = ⟨d ∈ G(x) : ‖d − (0.5,0.5)‖ ≥ 0.09⟩.15
G(A | θ) : x ↦ sigmoid(MLPθ(x)), where MLP is a Multilayer Perceptron with a single output neuron, whose parameters θ are to be learned.16

Learning:

Let D be the data set of all examples. The objective function, with K = {∀x+ A(x+), ∀x− ¬A(x−)}, is given by argmax_{θ∈Θ} SatAgg_{ϕ∈K} G_{θ,x←D}(ϕ).17 In practice, the optimizer uses the following loss function:

L = 1 − SatAgg_{ϕ∈K} G_{θ,x←B}(ϕ)
where B is a mini-batch sampled from D.18 The objective and loss functions depend on the following hyper-parameters:
  • the choice of fuzzy logic operator semantics used to approximate each connective and quantifier,
  • the choice of hyper-parameters underlying the operators, such as the value of the exponent p in any generalized mean,
  • the choice of formula aggregator function.
Using the stable product configuration to approximate connectives and quantifiers, with p = 2 for every occurrence of ApME, and using ApME with p = 2 also for the formula aggregator, yields the following satisfaction equation:

SatAgg_{ϕ∈K} Gθ(ϕ) = 1 − ( (1/2) [ (1 − (1 − ((1/|G(x+)|) ∑_{v∈G(x+)} (1 − sigmoid(MLPθ(v)))²)^{1/2}))² + (1 − (1 − ((1/|G(x−)|) ∑_{v∈G(x−)} sigmoid(MLPθ(v))²)^{1/2}))² ] )^{1/2}
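The same satisfaction can be computed with a few lines of TensorFlow 2. The sketch below is not the LTN library API: the MLP architecture is an assumption, x_pos and x_neg stand for batches of positive and negative examples, and the stable projections of Section 2.4 are omitted for readability.

    import tensorflow as tf

    mlp = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="elu"),    # hidden layer (assumed size)
        tf.keras.layers.Dense(1, activation="sigmoid")  # single output neuron: G(A)
    ])

    def forall(truths, p=2.0):
        # ApME: generalized mean w.r.t. the error, a smooth minimum
        return 1.0 - tf.reduce_mean((1.0 - truths) ** p) ** (1.0 / p)

    def sat_agg(x_pos, x_neg):
        sat1 = forall(mlp(x_pos))         # forall x+ : A(x+)
        sat2 = forall(1.0 - mlp(x_neg))   # forall x- : not A(x-)
        # formula aggregator: ApME with p = 2 over the two axioms
        return forall(tf.stack([sat1, sat2]))

    # The loss is then 1 - sat_agg(batch_pos, batch_neg), minimized with Adam.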

14 G(x+) are, by definition in this example, the training examples with Euclidean distance to the center (0.5, 0.5) smaller than the threshold of 0.09.
15 G(x−) are, by definition, the training examples with Euclidean distance to the center (0.5, 0.5) larger than or equal to the threshold of 0.09.
16 sigmoid(x) = 1/(1 + e^{−x})
17 The notation G_{x←D}(ϕ(x)) means that the variable x is grounded with the data D (that is, G(x) := D) when grounding ϕ(x).
18 As usual in ML, while it is possible to compute the loss function and gradients over the entire data set, it is preferred to use mini-batches of the examples.

Figure 7: Symbolic Tensor Computational Graph for the Binary Classification Example. In the figure, G(x+) and G(x−) are inputs to the network Gθ(A), and the dotted lines indicate the propagation of activation from each input through the network, which produces two outputs.
The computational graph of Figure 7 shows SatAgg_{ϕ∈K} Gθ(ϕ) as used with the above loss function.
We are therefore interested in learning the parameters θ of the MLP used to model the binary classifier. We sample 100 data points uniformly from [0,1]2 to populate the data set of positive and negative examples. The data set was split into 50 data points for training and 50 points for testing. The training was carried out for a fixed number of 1000 epochs using backpropagation with the Adam optimizer [35] with a batch size of 64 examples. Figure 8 shows the classification accuracy and satisfaction level of the LTN on both training and test sets averaged over 10 runs using a 95% confidence interval. The accuracy shown is the ratio of examples correctly classified, with an example deemed as being positive if the classifier outputs a value higher than 0.5 .
Notice that a model can reach an accuracy of 100% while the satisfaction of the knowledge base is not yet maximized. For example, if the threshold for an example to be deemed positive is 0.7, all examples may be classified correctly with a confidence score of 0.7. In that case, while the accuracy is already maximized, the satisfaction of ∀x+ A(x+) would still be 0.7, and can still improve until the confidence for every sample reaches 1.0.
This first example, although straightforward, illustrates step-by-step the process of using LTN in a simple setting. Notice that, according to the nomenclature of Section 3.3, measuring accuracy amounts to issuing the truth query (respectively, the generalization truth query) A(x) for all the examples of the training set (respectively, test set) and comparing the results with the classification threshold. In Figure 9, we show the results of such queries A(x) after optimization. Next, we show how the LTN language can be used to solve progressively more complex problems by combining learning and reasoning.

4.2. Multi-Class Single-Label Classification

The natural extension of binary classification is a multi-class classification task. We first approach multi-class single-label classification, which assumes that each example is assigned to one and only one label.
For illustration purposes, we use the Iris flower data set [20], which consists of classification into three mutually exclusive classes; call these $A$, $B$, and $C$. While one could train three unary predicates $A(x)$, $B(x)$ and $C(x)$, it turns out to be more effective to model this problem with a single binary predicate $P(x, l)$, where $l$ is a variable denoting a multi-class label, in this case classes $A$, $B$ or $C$. This syntax allows one to write statements quantifying over the classes, e.g. $\forall x\, (\exists l\, (P(x, l)))$. Since the classes are mutually exclusive, the output layer of the MLP representing $P(x, l)$ will be a softmax layer, instead of a sigmoid function, to ensure the exclusivity constraint on satisfiability scores.¹⁹

Figure 8: Binary Classification task (training and test set performance): Average accuracy (left) and satisfiability (right). Due to the random initializations, accuracy and satisfiability start on average at 0.5, with performance increasing rapidly after a few epochs.

The problem can be specified as follows:

Domains:

items, denoting the examples from the Iris flower data set.
labels, denoting the class labels.

Variables:

$x_A$, $x_B$, $x_C$ for the positive examples of classes $A$, $B$, $C$.
$x$ for all examples.
$D(x_A) = D(x_B) = D(x_C) = D(x) =$ items.

Constants:

$l_A$, $l_B$, $l_C$, the labels of classes $A$ (Iris setosa), $B$ (Iris virginica), $C$ (Iris versicolor), respectively.
$D(l_A) = D(l_B) = D(l_C) =$ labels.

Predicates:

$P(x, l)$, denoting the fact that item $x$ is classified as $l$.
$D_{in}(P) =$ items, labels.
Axioms:
(27) $\forall x_A\, P(x_A, l_A)$
(28) $\forall x_B\, P(x_B, l_B)$
(29) $\forall x_C\, P(x_C, l_C)$
Notice that rules about exclusiveness, such as $\forall x\, (P(x, l_A) \rightarrow (\neg P(x, l_B) \wedge \neg P(x, l_C)))$, are not included since such constraints are already imposed by the grounding of $P$ below, more specifically by the softmax function.

¹⁹ $\operatorname{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$.

Figure 9: Binary Classification task (querying the trained predicate $A(x)$): it is interesting to see how $A(x)$ could be appropriately named as denoting the inside of the central region shown in the figure, so that $\neg A(x)$ represents the outside of the region.

Grounding:

$\mathcal{G}(\text{items}) = \mathbb{R}^4$; items are described by 4 features: the length and the width of the sepals and petals, in centimeters.
$\mathcal{G}(\text{labels}) = \mathbb{N}^3$; we use a one-hot encoding to represent classes.
$\mathcal{G}(x_A) \in \mathbb{R}^{m_1 \times 4}$, that is, $\mathcal{G}(x_A)$ is a sequence of $m_1$ examples of class $A$.
$\mathcal{G}(x_B) \in \mathbb{R}^{m_2 \times 4}$, $\mathcal{G}(x_B)$ is a sequence of $m_2$ examples of class $B$.
$\mathcal{G}(x_C) \in \mathbb{R}^{m_3 \times 4}$, $\mathcal{G}(x_C)$ is a sequence of $m_3$ examples of class $C$.
$\mathcal{G}(x) \in \mathbb{R}^{(m_1 + m_2 + m_3) \times 4}$, $\mathcal{G}(x)$ is a sequence of all the examples.
$\mathcal{G}(l_A) = [1,0,0]$, $\mathcal{G}(l_B) = [0,1,0]$, $\mathcal{G}(l_C) = [0,0,1]$.
$\mathcal{G}(P \mid \theta) : x, l \mapsto l \cdot \operatorname{softmax}(\operatorname{MLP}_\theta(x))$, where the MLP has three output neurons corresponding to as many classes, and $\cdot$ denotes the dot product as a way of selecting an output for $\mathcal{G}(P \mid \theta)$; multiplying the MLP's output by the one-hot vector $l$ gives the truth degree corresponding to the class denoted by $l$. A minimal sketch of this grounding follows.
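The sketch below shows one way to implement this grounding in TensorFlow 2, under the same illustrative assumptions as the previous sketch (the hidden-layer size is a guess):

```python
import tensorflow as tf

# Hypothetical grounding of the binary predicate P(x, l) for the Iris example.
mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="elu"),
    tf.keras.layers.Dense(3),  # one logit per class A, B, C
])

def P(x, l):
    # x: (batch, 4) Iris features; l: (batch, 3) one-hot label constant.
    probs = tf.nn.softmax(mlp(x), axis=-1)    # softmax enforces exclusivity
    return tf.reduce_sum(probs * l, axis=-1)  # dot product selects one output
```

With this grounding, $P(x, l_A)$, $P(x, l_B)$ and $P(x, l_C)$ share all of the MLP's weights and differ only in which softmax output is selected.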

Learning:

The logical operators and connectives are approximated using the stable product configuration with p=2 for ApME . For the formula aggregator, ApME is used also with p=2 .
The computational graph of Figure 10 illustrates how $\operatorname{SatAgg}_{\phi \in \mathcal{K}} \mathcal{G}_\theta(\phi)$ is obtained. If $B$ denotes a mini-batch sampled from the data set of all examples, the loss function (to minimize) is:
$$\mathcal{L} = 1 - \operatorname{SatAgg}_{\phi \in \mathcal{K}} \mathcal{G}_{\theta, x \leftarrow B}(\phi).$$
Figure 11 shows the result of training with the Adam optimizer with batches of 64 examples. Accuracy measures the ratio of examples correctly classified, with example $x$ labeled as $\operatorname{argmax}_l P(x, l)$.²⁰ Classification accuracy reaches an average value near 1.0 for both the training and test data after some 100 epochs. Satisfaction levels of the Iris flower predictions continue to increase for the rest of the training (500 epochs) to more than 0.8.
It is worth contrasting the choice of using a binary predicate (P(x,l)) in this example with the option of using multiple unary predicates (lA(x),lB(x),lC(x)) ,one for each class. Notice how each predicate is normally associated with an output neuron. In the case of the unary predicates, the networks would be disjoint (or modular), whereas weight-sharing takes place with the use of the binary predicate. Since l is instantiated into lA,lB,lC ,in practice P(x,l) becomes P(x,lA),P(x,lB),P(x,lC) ,which is implemented via three output neurons to which a softmax function applies.

4.3. Multi-Class Multi-Label Classification

We now turn to multi-label classification, whereby multiple labels can be assigned to each example. As a first example of the reach of LTN, we shall see how the previous example can be extended naturally using LTN to account for multiple labels, which is not always a trivial extension for most ML algorithms. The standard approach to the multi-label problem is to provide explicit negative examples for each class. By contrast, LTN can use background knowledge to relate classes directly to each other, thus becoming a powerful tool for the multi-label problem, where labeled data is typically scarce. We explore the Leptograpsus crabs data set [10], consisting of 200 examples of 5 morphological measurements of 50 crabs. The task is to classify the crabs according to their color and sex. There are four labels: blue, orange, male, and female. The color labels are mutually exclusive, and so are the labels for sex. LTN will be used to specify such information logically.

²⁰ This is also known as top-1 accuracy, as proposed in [39]. Cross-entropy results ($-t \log(y)$) could have been reported here, as is common with the use of softmax, although it is worth noting that, of course, the loss function used by LTN is different.

Figure 10: Symbolic Tensor Computational Graph for the Multi-Class Single-Label Problem. As before, the dotted lines in the figure indicate the propagation of activation from each input through the network, in this case producing three outputs.
Figure 11: Multi-Class Single-Label Classification: Classification accuracy (left) and satisfaction level (right).

Domains:

items denoting the examples from the crabs dataset.
labels denoting the class labels.

Variables:

$x_{blue}$, $x_{orange}$, $x_{male}$, $x_{female}$ for the positive examples of each class.
$x$, used to denote all the examples.
$D(x_{blue}) = D(x_{orange}) = D(x_{male}) = D(x_{female}) = D(x) =$ items.

Constants:

$l_{blue}$, $l_{orange}$, $l_{male}$, $l_{female}$ (the labels for each class).
$D(l_{blue}) = D(l_{orange}) = D(l_{male}) = D(l_{female}) =$ labels.

Predicates:

$P(x, l)$, denoting the fact that item $x$ is labelled as $l$.
$D_{in}(P) =$ items, labels.

Axioms:

(30) $\forall x_{blue}\, P(x_{blue}, l_{blue})$
(31) $\forall x_{orange}\, P(x_{orange}, l_{orange})$
(32) $\forall x_{male}\, P(x_{male}, l_{male})$
(33) $\forall x_{female}\, P(x_{female}, l_{female})$
(34) $\forall x\, \neg(P(x, l_{blue}) \wedge P(x, l_{orange}))$
(35) $\forall x\, \neg(P(x, l_{male}) \wedge P(x, l_{female}))$
Notice how logical rules (34) and (35) above represent the mutual exclusion of the labels on color and sex, respectively. As a result, negative examples are not used explicitly in this specification.

Grounding:

$\mathcal{G}(\text{items}) = \mathbb{R}^5$; the examples from the data set are described using 5 features.
$\mathcal{G}(\text{labels}) = \mathbb{N}^4$; one-hot vectors are used to represent class labels.²¹
$\mathcal{G}(x_{blue}) \in \mathbb{R}^{m_1 \times 5}$, $\mathcal{G}(x_{orange}) \in \mathbb{R}^{m_2 \times 5}$, $\mathcal{G}(x_{male}) \in \mathbb{R}^{m_3 \times 5}$, $\mathcal{G}(x_{female}) \in \mathbb{R}^{m_4 \times 5}$. These sequences are not mutually exclusive; one example can, for instance, be in both $x_{blue}$ and $x_{male}$.
$\mathcal{G}(l_{blue}) = [1,0,0,0]$, $\mathcal{G}(l_{orange}) = [0,1,0,0]$, $\mathcal{G}(l_{male}) = [0,0,1,0]$, $\mathcal{G}(l_{female}) = [0,0,0,1]$.
$\mathcal{G}(P \mid \theta) : x, l \mapsto l \cdot \operatorname{sigmoid}(\operatorname{MLP}_\theta(x))$, with the MLP having four output neurons corresponding to as many classes. As before, $\cdot$ denotes the dot product, which selects a single output. By contrast with the previous example, notice the use of a sigmoid function instead of a softmax function.

21 There are two possible approaches here: either each item is labeled with one multi-hot encoding or each item is labeled with several one-hot encodings. The latter approach was used in this example.

Learning:

As before, the fuzzy logic operators and connectives are approximated using the stable product configuration with p=2 for ApME ,and for the formula aggregator, ApME is also used with p=2 .
Figure 12 shows the result of training with the Adam optimizer using backpropagation with batches of 64 examples. This time, the accuracy is defined as $1 - \mathrm{HL}$, where HL is the average Hamming loss, i.e. the fraction of labels predicted incorrectly, with a classification threshold of 0.5 (given an example $u$, if the model outputs a value greater than 0.5 for class $C$, then $u$ is deemed as belonging to class $C$). The rightmost graph in Figure 12 illustrates how LTN learns the constraint that a crab cannot have both blue and orange color, which is discussed in more detail in what follows.

Querying:

To illustrate the learning of constraints by LTN, we have queried three formulas that were not explicitly part of the knowledge-base, over time during learning:
(36) $\phi_1: \forall x\, (P(x, l_{blue}) \rightarrow \neg P(x, l_{orange}))$
(37) $\phi_2: \forall x\, (P(x, l_{blue}) \rightarrow P(x, l_{orange}))$
(38) $\phi_3: \forall x\, (P(x, l_{blue}) \rightarrow P(x, l_{male}))$
For querying, we use $p = 5$ when approximating the universal quantifiers with $A_{pME}$. A higher $p$ denotes a stricter universal quantification with a stronger focus on outliers (see Section 2.4).²² We should expect $\phi_1$ to hold true (every blue crab cannot be orange, and vice-versa²³), and we should expect $\phi_2$ (every blue crab is also orange) and $\phi_3$ (every blue crab is male) to be false. The results are reported in the rightmost plot of Figure 12. Prior to training, the truth-values of $\phi_1$ to $\phi_3$ are non-informative. During training, one can see, with the maximization of the satisfaction of the knowledge base, a trend towards the satisfaction of $\phi_1$, and an opposite trend of $\phi_2$ and $\phi_3$ towards false. A minimal sketch of such a query is given below.
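The sketch below shows how $\phi_1$ could be queried, assuming a sigmoid-based grounding of this section's predicate (analogous to the softmax sketch of Section 4.2) and the product configuration's Reichenbach implication $I(a, b) = 1 - a + ab$ with standard negation $1 - a$; all names are illustrative:

```python
import tensorflow as tf

mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="elu"),
    tf.keras.layers.Dense(4),  # logits for blue, orange, male, female
])

def P(x, l):
    # Multi-label grounding: sigmoid instead of softmax (labels not exclusive).
    return tf.reduce_sum(tf.sigmoid(mlp(x)) * l, axis=-1)

def implies(a, b):
    # Reichenbach implication of the product configuration: I(a, b) = 1 - a + a*b.
    return 1.0 - a + a * b

def forall(truths, p):
    # A_pME; a stricter p = 5 is used at query time than during training.
    return 1.0 - tf.reduce_mean((1.0 - truths) ** p) ** (1.0 / p)

# phi_1: forall x (P(x, l_blue) -> not P(x, l_orange))
def query_phi1(x_all, l_blue, l_orange):
    return forall(implies(P(x_all, l_blue), 1.0 - P(x_all, l_orange)), p=5.0)
```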

4.4. Semi-Supervised Pattern Recognition

Let us now explore two more elaborate classification tasks, which showcase the benefit of using logical reasoning alongside machine learning. With these two examples, we also aim to provide a more direct comparison with a related neurosymbolic system, DeepProbLog [41]. The benchmark examples below were introduced in the DeepProbLog paper [41].

²² Training should usually not focus on outliers, as optimizers would struggle to generalize and tend to get stuck in local minima. However, when querying $\phi_1$, $\phi_2$, $\phi_3$, we wish to be more careful about the interpretation of our statements. See also Section 3.1.3.
²³ That is, $\forall x\, (P(x, l_{blue}) \rightarrow \neg P(x, l_{orange}))$.

Figure 12: Multi-Class Multi-Label Classification: Classification Accuracy (left), Satisfiability level (middle), and Querying of Constraints (right).
Single Digits Addition: Consider the predicate addition$(X, Y, N)$, where $X$ and $Y$ are images of digits (the MNIST data set will be used), and $N$ is a natural number corresponding to the sum of these digits. This predicate should return an estimate of the validity of the addition. For instance, if $x_3$ and $x_8$ denote MNIST images of the digits 3 and 8, addition$(x_3, x_8, 11)$ is a valid addition, whereas addition$(x_3, x_8, 5)$ is not.
Multi Digits Addition: The experiment is extended to numbers with more than one digit. Consider the predicate addition$([X_1, X_2], [Y_1, Y_2], N)$, where $[X_1, X_2]$ and $[Y_1, Y_2]$ are lists of images of digits, representing two multi-digit numbers, and $N$ is a natural number corresponding to the sum of the two multi-digit numbers. For instance, for images showing the numbers 35 and 72, addition$([x_3, x_5], [x_7, x_2], 107)$ is a valid addition, whereas addition$([x_3, x_8], [x_9, x_2], 26)$ is not.
A natural neurosymbolic approach is to seek to learn a single-digit classifier and to benefit from knowledge readily available about the properties of addition. For instance, suppose that a predicate digit$(x, d)$ gives the likelihood of an image $x$ being of digit $d$. A definition for addition$(x_3, x_8, 11)$ in LTN is:
$$\exists d_1, d_2 : d_1 + d_2 = 11\ \big(\operatorname{digit}(x_3, d_1) \wedge \operatorname{digit}(x_8, d_2)\big)$$
In [41], the above task is made more complicated by not providing labels for the single-digit images during training. Instead, training takes place on pairs of images with labels made available for the result only, that is, the sum of the individual labels. The single-digit classifier is not explicitly trained by itself; its output is a piece of latent information that is used by the logic. However, this does not pose a problem for end-to-end neurosymbolic systems such as LTN or DeepProbLog for which the gradients can propagate through the logical structures.
We start by illustrating an LTN theory that can be used to learn the predicate digit. The specification of the theory below is for the single digit addition example, although it extends easily to the multiple digits case.

Domains:

images, denoting the MNIST digit images,
results, denoting the integers that label the results of the additions,
digits, denoting the digits from 0 to 9 .

Variables:

x,y ,ranging over the MNIST images in the data,
n for the labels,i.e. the result of each addition,
d1,d2 ranging over digits.
D(x)=D(y)= images,
D(n)= results,
D(d1)=D(d2)= digits.

Predicates:

digit$(x, d)$ for the single-digit classifier, where $d$ is a term denoting a digit constant or a digit variable. The classifier should return the probability of an image $x$ being of digit $d$.
$D_{in}(\text{digit}) =$ images, digits.

Axioms:

Single Digit Addition:
(39) $\forall \operatorname{Diag}(x, y, n)\ \big(\exists d_1, d_2 : d_1 + d_2 = n\ (\operatorname{digit}(x, d_1) \wedge \operatorname{digit}(y, d_2))\big)$
Multiple Digit Addition:
(40) $\forall \operatorname{Diag}(x_1, x_2, y_1, y_2, n)\ \big(\exists d_1, d_2, d_3, d_4 : 10d_1 + d_2 + 10d_3 + d_4 = n\ (\operatorname{digit}(x_1, d_1) \wedge \operatorname{digit}(x_2, d_2) \wedge \operatorname{digit}(y_1, d_3) \wedge \operatorname{digit}(y_2, d_4))\big)$
Notice the use of Diag: when grounding $x$, $y$, $n$ with three sequences of values, the $i$-th examples of each variable are matched. That is, $(\mathcal{G}(x)_i, \mathcal{G}(y)_i, \mathcal{G}(n)_i)$ is a tuple from our dataset of valid additions. Using the diagonal quantification, LTN aggregates pairs of images and their corresponding result, rather than any combination of images and results.
Notice also the guarded quantification: by quantifying only over the latent "digit labels" (i.e. $d_1$, $d_2$) that can add up to the result label ($n$, given in the dataset), we incorporate symbolic information into the system. For example, in (39), if $n = 3$, the only valid tuples $(d_1, d_2)$ are $(0,3)$, $(3,0)$, $(1,2)$, $(2,1)$. Gradients will only backpropagate to these values. A sketch of this guarded aggregation is given below.
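The sketch below illustrates the guarded existential quantifier for axiom (39), assuming the two images have already been passed through the digit classifier; the function names and the pair-enumeration strategy are illustrative assumptions:

```python
import itertools
import tensorflow as tf

def exists_pM(truths, p):
    # A_pM(a_1, ..., a_n) = (mean(a_i^p))^(1/p), used to approximate the
    # existential quantifier.
    return tf.reduce_mean(truths ** p) ** (1.0 / p)

def sat_single_addition(digit_probs_x, digit_probs_y, n, p_exists):
    # digit_probs_x, digit_probs_y: shape (10,) softmax outputs for the images.
    # Guarded quantification: only enumerate pairs (d1, d2) with d1 + d2 = n.
    pairs = [(d1, d2) for d1, d2 in itertools.product(range(10), repeat=2)
             if d1 + d2 == n]
    truths = tf.stack([digit_probs_x[d1] * digit_probs_y[d2]  # product conjunction
                       for d1, d2 in pairs])
    return exists_pM(truths, p_exists)
```

Gradients flow only through the selected entries of the two softmax outputs, exactly as described above.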

Grounding:

$\mathcal{G}(\text{images}) = [0,1]^{28 \times 28 \times 1}$. The MNIST data set has images of 28 by 28 pixels. The images are grayscale and have just one channel. The grayscale pixel values from 0 to 255 are converted to the range $[0, 1]$.
$\mathcal{G}(\text{results}) = \mathbb{N}$.
$\mathcal{G}(\text{digits}) = \{0, 1, \dots, 9\}$.
$\mathcal{G}(x) \in [0,1]^{m \times 28 \times 28 \times 1}$, $\mathcal{G}(y) \in [0,1]^{m \times 28 \times 28 \times 1}$, $\mathcal{G}(n) \in \mathbb{N}^m$.²⁴
$\mathcal{G}(d_1) = \mathcal{G}(d_2) = \langle 0, 1, \dots, 9 \rangle$.
$\mathcal{G}(\operatorname{digit} \mid \theta) : x, d \mapsto \operatorname{onehot}(d) \cdot \operatorname{softmax}(\operatorname{CNN}_\theta(x))$, where CNN is a Convolutional Neural Network with 10 output neurons, one for each class. Notice that, in contrast with the previous examples, $d$ is an integer label; $\operatorname{onehot}(d)$ converts it into a one-hot label.

²⁴ Notice the use of the same number $m$ of examples for each of these variables, as they are supposed to match one-to-one due to the use of Diag.

Learning:

The computational graph of Figure 13 shows the objective function for the satisfiability of the knowledge base. The stable product configuration is used, with hyper-parameter $p = 2$ for the operator $A_{pME}$ approximating universal quantification ($\forall$). Let $p_\exists$ denote the exponent hyper-parameter used in the generalized mean $A_{pM}$ approximating existential quantification ($\exists$). Three scenarios are investigated and compared in the Multiple Digit experiment (Figure 15):
  1. $p_\exists = 1$ throughout the entire experiment,
  2. $p_\exists = 2$ throughout the entire experiment, or
  3. $p_\exists$ follows a schedule, changing from $p_\exists = 1$ to $p_\exists = 6$ gradually with the number of training epochs.
In the Single Digit experiment, only the last scenario above (the schedule) is investigated (Figure 14).
We train to maximize satisfiability by using batches of 32 examples of image pairs, labeled by the result of their addition. As done in [41], the experimental results vary the number of examples in the training set to emphasize the generalization abilities of a neurosymbolic approach. Accuracy is measured by predicting the digit values using the predicate digit and reporting the ratio of examples for which the addition is correct. A comparison is made with the same baseline method used in [41]: given a pair of MNIST images, a non-pre-trained CNN outputs embeddings for each image (Siamese neural network). The embeddings are provided as input to dense layers that classify the addition into one of the 19 (respectively, 199) possible results of the Single Digit Addition (respectively, Multiple Digit Addition) experiments. The baseline is trained using a cross-entropy loss between the labels and the predictions. As expected, such a standard deep learning approach struggles with the task without the provision of symbolic meaning about intermediate parts of the problem.
Experimentally, we find that the optimizer for the neurosymbolic system gets stuck in a local optimum at initialization in about 1 out of 5 runs. We therefore present the results as an average of the 10 best outcomes out of 15 runs of each algorithm (that is, for the baseline as well). The examples of digit pairs selected from the full MNIST data set are randomized at each run.
Figure 15 shows that the use of $p_\exists = 2$ from the start produces poor results. A higher value for $p$ in $A_{pM}$ weighs up the instances with a higher truth-value (see also Appendix C for a discussion). Starting already with a high value for $p_\exists$, the classes with a higher initial truth-value for a given example will have higher gradients and be prioritized for training, which does not make practical sense when the predicates are randomly initialized. Increasing $p_\exists$ by following a schedule is the most promising approach, as sketched below. In this particular example, $p_\exists = 1$ is also shown to be adequate purely from a learning perspective. However, $p_\exists = 1$ implements a simple average, which does not account well for the meaning of $\exists$; the resulting satisfaction value is not meaningful from a reasoning perspective.
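A minimal sketch of such a schedule, assuming a simple linear interpolation (the paper does not commit to a specific interpolation):

```python
def p_exists_schedule(epoch, num_epochs, p_start=1.0, p_end=6.0):
    # Gradually increase the existential exponent from p_start to p_end.
    frac = min(epoch / float(num_epochs), 1.0)
    return p_start + frac * (p_end - p_start)
```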
Table 1 shows that the training and test times of LTN are of the same order of magnitude as those of the CNN baselines. Table 2 shows that LTN reaches similar accuracy as that reported by DeepProbLog.
Figure 13: Symbolic Tensor Computational Graph for the Single Digit Addition task. Notice that the figure does not depict accurate dimensions for the tensors; $\mathcal{G}(x)$ and $\mathcal{G}(y)$ are in fact 4D tensors of dimensions $m \times 28 \times 28 \times 1$. Computing results with the variables $d_1$ or $d_2$ corresponds to the addition of a further axis of dimension 10.
Figure 14: Single Digit Addition Task: Accuracy and satisfiability results (top) and results in the presence of fewer examples (bottom) in comparison with standard Deep Learning using a CNN (blue lines).
Model    | Single Digits: Train | Single Digits: Test | Multi Digits: Train | Multi Digits: Test
baseline | 2.72±0.23 ms         | 1.45±0.21 ms        | 3.87±0.24 ms        | 2.10±0.30 ms
LTN      | 5.36±0.25 ms         | 3.44±0.39 ms        | 8.51±0.72 ms        | 5.72±0.57 ms
Table 1: The computation time of training and test steps on the single and multiple digit addition tasks, measured on a computer with a single Nvidia Tesla V100 GPU and averaged over 1000 steps. Each step operates on a batch of 32 examples. The computational efficiency of the LTN and the CNN baseline systems are of the same order of magnitude.
Model       | Single Digits: 30 000 examples | Single Digits: 3 000 examples | Multi Digits: 15 000 examples | Multi Digits: 1 500 examples
baseline    | 95.95±0.27                     | 70.59±1.45                    | 47.19±0.69                    | 2.07±0.12
LTN         | 98.04±0.13                     | 93.49±0.28                    | 95.37±0.29                    | 88.21±0.63
DeepProbLog | 97.20±0.45                     | 92.18±1.57                    | 95.16±1.70                    | 87.21±1.92
Table 2: Accuracy (in %) on the test set: comparison of the final results obtained with LTN and those reported for DeepProbLog [41]. Although it is difficult to compare the results over time directly (the frameworks are implemented in different libraries), while achieving computational efficiency similar to the CNN baseline, LTN also reaches accuracy similar to that reported for DeepProbLog.

4.5. Regression

Another important problem in Machine Learning is regression, where a relationship is estimated between one independent variable $X$ and a continuous dependent variable $Y$. The essence of regression is, therefore, to approximate an unknown function mapping $x$ to $y$ by a learnable function $f$, given examples $(x_i, y_i)$ such that $f(x_i) = y_i$. In LTN, one can model a regression task by defining $f$ as a learnable function whose parameter values are constrained by data. Additionally, a regression task requires a notion of equality. We therefore define the predicate eq as a smooth version of the symbol $=$ to turn the constraint $f(x_i) = y_i$ into a smooth optimization problem.
In this example, we explore regression using a problem from a real estate data set²⁵ with 414 examples, each described in terms of 6 real-numbered features: the transaction date (converted to a float), the age of the house, the distance to the nearest station, the number of convenience stores in the vicinity, and the latitude and longitude coordinates. The model has to predict the house price per unit area.

Domains:

samples, denoting the houses and their features.
prices, denoting the house prices.

Variables:

x for the samples.
y for the prices.
D(x) = samples.
D(y) = prices.


²⁵ https://www.kaggle.com/quantbruce/real-estate-price-prediction

Figure 15: Multiple Digit Addition Task: Accuracy and satisfiability results (top) and results in the presence of fewer examples (bottom) in comparison with standard Deep Learning using a CNN (blue lines).

Functions:

$f(x)$, the regression function to be learned.
$D_{in}(f) =$ samples, $D_{out}(f) =$ prices.

Predicates:

$\operatorname{eq}(y_1, y_2)$, a smooth equality predicate that measures how similar $y_1$ and $y_2$ are.
$D_{in}(\operatorname{eq}) =$ prices, prices.
Axioms:
(41) $\forall \operatorname{Diag}(x, y)\ \operatorname{eq}(f(x), y)$
Notice again the use of Diag: when grounding x and y onto sequences of values,this is done by obeying a one-to-one correspondence between the sequences. In other words, we aggregate pairs of corresponding samples and prices, instead of any combination thereof.

Grounding:

$\mathcal{G}(\text{samples}) = \mathbb{R}^6$.
$\mathcal{G}(\text{prices}) = \mathbb{R}$.
$\mathcal{G}(x) \in \mathbb{R}^{m \times 6}$, $\mathcal{G}(y) \in \mathbb{R}^{m \times 1}$. Notice that this specification refers to the same number $m$ of examples for $x$ and $y$, due to the above one-to-one correspondence obtained with the use of Diag.
$\mathcal{G}(\operatorname{eq}(u, v)) = \exp\big(-\alpha \sqrt{\sum_j (u_j - v_j)^2}\big)$, where the hyper-parameter $\alpha$ is a real number that scales how strict the smooth equality is.²⁶ In our experiments, we use $\alpha = 0.05$.
$\mathcal{G}(f \mid \theta) : x \mapsto \operatorname{MLP}_\theta(x)$, where $\operatorname{MLP}_\theta$ is a multilayer perceptron that ends in one neuron corresponding to a price prediction, with a linear output layer (no activation function).

Figure 17: Visualization of LTN solving a regression problem.

Learning:

The only trainable parameters of the theory are those of the model of $f$. LTN is used to estimate such parameters by maximizing the satisfaction of the knowledge base, in the usual way; a minimal sketch follows. Approximating $\forall$ using $A_{pME}$ with $p = 2$, as before, we randomly split the data set into 330 examples for training and 84 examples for testing. Figure 16 shows the satisfaction level over 500 epochs. We also plot the Root Mean Squared Error (RMSE) between the predicted prices and the labels (i.e. actual prices, also known as target values). We visualize in Figure 17 the strong correlation between actual and predicted prices at the end of one of the runs.
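The sketch below implements the smooth equality and axiom (41) in TensorFlow 2, under the same illustrative assumptions as the earlier sketches (layer sizes and names are guesses):

```python
import tensorflow as tf

ALPHA = 0.05  # strictness of the smooth equality

def eq(u, v):
    # Smooth equality: exp(-alpha * Euclidean distance), a truth degree in (0, 1].
    d = tf.sqrt(tf.reduce_sum((u - v) ** 2, axis=-1))
    return tf.exp(-ALPHA * d)

f = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="elu"),
    tf.keras.layers.Dense(1),  # linear output: the predicted price
])

def forall(truths, p=2.0):
    return 1.0 - tf.reduce_mean((1.0 - truths) ** p) ** (1.0 / p)

# Axiom (41): forall Diag(x, y) eq(f(x), y); Diag pairs x and y row-wise,
# so no cross-combinations of samples and prices are formed.
def sat_axiom(x, y):
    return forall(eq(f(x), y))
```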

4.6. Unsupervised Learning (Clustering)

In unsupervised learning, labels are either not available or are not used for learning. Clustering is a form of unsupervised learning whereby, without labels, the data is characterized by constraints

²⁶ Intuitively, the smooth equality is $\exp(-\alpha\, d(u, v))$, where $d(u, v)$ is the Euclidean distance between $u$ and $v$. It produces 1 if the distance is zero; as the distance increases, the result decreases exponentially towards 0. In case an exponential decrease is undesirable, one can adopt the following alternative: $\operatorname{eq}(u, v) = \frac{1}{1 + \alpha\, d(u, v)}$.

alone. LTN can formulate such constraints, such as:
  • clusters should be disjoint,
  • every example should be assigned to a cluster,
  • a cluster should not be empty,
  • if the points are near, they should belong to the same cluster,
  • if the points are far, they should belong to different clusters, etc.
Domains:
points, denoting the data to cluster.
points_pairs, denoting pairs of examples.
clusters, denoting the clusters.

Variables:

x, y for all points.
c ranging over the clusters.
D(x) = D(y) = points.
D(c) = clusters.

Predicates:

$C(x, c)$, the truth degree of a given point belonging to a given cluster.
$D_{in}(C) =$ points, clusters.
Axioms:
(42) $\forall x\, \exists c\ C(x, c)$
(43) $\forall c\, \exists x\ C(x, c)$
(44) $\forall c, x, y : |x - y| < th_{close}\ \big(C(x, c) \rightarrow C(y, c)\big)$
(45) $\forall c, x, y : |x - y| > th_{distant}\ \neg\big(C(x, c) \wedge C(y, c)\big)$
Notice the use of guarded quantifiers: all the pairs of points with Euclidean distance lower (resp. higher) than the value $th_{close}$ (resp. $th_{distant}$) should belong to the same cluster (resp. should not). $th_{close}$ and $th_{distant}$ are arbitrary threshold values that define some of the closest and most distant pairs of points. In our example, they are set to 0.2 and 1.0, respectively.
As done in the example of Section 4.2, the clustering predicate has mutually exclusive satisfiability scores for each cluster using a softmax layer. Therefore, there is no explicit constraint about clusters being disjoint. A sketch of the guarded axiom (44) is given after this paragraph.
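The guard can be implemented by masking, as in the following sketch; the pairwise broadcasting strategy and the Reichenbach implication are assumptions consistent with the product configuration:

```python
import tensorflow as tf

def sat_close_pairs(points, C_probs, th_close=0.2, p=4.0):
    # Axiom (44): for all c, x, y with |x - y| < th_close, C(x,c) -> C(y,c).
    # points: (m, 2) tensor; C_probs: (m, k) softmax cluster memberships.
    diff = points[:, None, :] - points[None, :, :]
    dist = tf.norm(diff, axis=-1)            # (m, m) pairwise distances
    guard = dist < th_close                  # boolean mask: guarded pairs only
    a = C_probs[:, None, :]                  # (m, 1, k)
    b = C_probs[None, :, :]                  # (1, m, k)
    impl = 1.0 - a + a * b                   # Reichenbach implication, (m, m, k)
    truths = tf.boolean_mask(impl, guard)    # keep only pairs inside the guard
    return 1.0 - tf.reduce_mean((1.0 - truths) ** p) ** (1.0 / p)  # A_pME
```

Pairs outside the guard contribute nothing to the aggregation, so no gradients flow through them.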

Grounding:

$\mathcal{G}(\text{points}) = [-1, 1]^2$.
$\mathcal{G}(\text{clusters}) = \mathbb{N}^4$; we use one-hot vectors to represent a choice of 4 clusters.
$\mathcal{G}(x) \in [-1, 1]^{m \times 2}$, that is, $x$ is a sequence of $m$ points. $\mathcal{G}(y) = \mathcal{G}(x)$.
$th_{close} = 0.2$, $th_{distant} = 1.0$.
$\mathcal{G}(c) = \langle [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1] \rangle$.
$\mathcal{G}(C \mid \theta) : x, c \mapsto c \cdot \operatorname{softmax}(\operatorname{MLP}_\theta(x))$, where the MLP has 4 output neurons corresponding to the 4 clusters.
Figure 18: LTN solving a clustering problem by constraint optimization: ground-truth (top) and querying of each cluster $C_0$, $C_1$, $C_2$ and $C_3$, in turn.

Learning:

We use the stable real product configuration to approximate the logical operators. For $\forall$, we use $A_{pME}$ with $p = 4$. For $\exists$, we use $A_{pM}$ with $p = 1$ during the first 100 epochs and $p = 6$ thereafter, as a simplified version of the schedule used in Section 4.4. The formula aggregator is approximated by $A_{pME}$ with $p = 2$. The model is trained for a total of 1000 epochs using the Adam optimizer, which is sufficient for LTN to solve the clustering problem shown in Figure 18. Ground-truth data for this task was generated artificially by creating 4 centers and generating 50 random samples from a multivariate Gaussian distribution around each center. The trained LTN achieves a satisfaction level of 0.857 on the clustering constraints.

4.7. Learning Embeddings with LTN

A classic example of Statistical Relational Learning is the smokers-friends-cancer example introduced in [55]. Below, we show how this example can be formalized in LTN using semi-supervised embedding learning.
There are 14 people divided into two groups, $\{a, b, \dots, h\}$ and $\{i, j, \dots, n\}$. Within each group, there is complete knowledge about smoking habits. In the first group, there is complete knowledge about who has and who does not have cancer. Knowledge about the friendship relation is complete within each group only if symmetry is assumed, that is, $\forall x, y\ (\operatorname{friends}(x, y) \rightarrow \operatorname{friends}(y, x))$.
Otherwise, knowledge about friendship is incomplete in that it may be known that, e.g., $a$ is a friend of $b$, while it is not known whether $b$ is a friend of $a$. Finally, there is general knowledge about smoking, friendship, and cancer, namely that smoking causes cancer, friendship is normally symmetric and anti-reflexive, everyone has a friend, and smoking propagates (actively or passively) among friends. All this knowledge is represented in the axioms further below.

Domains:

people, to denote the individuals.
Constants:
$a, b, \dots, h, i, j, \dots, n$, the 14 individuals. Our goal is to learn an adequate embedding for each constant.
$D(a) = D(b) = \dots = D(n) =$ people.
Variables:
x,y ranging over the individuals.
D(x)=D(y)= people.

Predicates:

S(x) for smokes, F(x,y) for friends, C(x) for cancer.
D(S)=D(C)= people. D(F)= people,people.

Axioms:

Let $X_1 = \{a, b, \dots, h\}$ and $X_2 = \{i, j, \dots, n\}$ be the two groups of individuals.
Let $S = \{a, e, f, g, j, n\}$ be the smokers; this knowledge is complete for both groups.
Let $C = \{a, e\}$ be the individuals with cancer; this knowledge is complete for $X_1$ only.
Let $F = \{(a,b), (a,e), (a,f), (a,g), (b,c), (c,d), (e,f), (g,h), (i,j), (j,m), (k,l), (m,n)\}$ be the set of friendship relations; this knowledge is complete if assuming symmetry.
These facts are illustrated in Figure 20a.
We have the following axioms:
(46) $F(u, v)$, for $(u, v) \in F$
(47) $\neg F(u, v)$, for $(u, v) \notin F$, $u > v$
(48) $S(u)$, for $u \in S$
(49) $\neg S(u)$, for $u \in (X_1 \cup X_2) \setminus S$
(50) $C(u)$, for $u \in C$
(51) $\neg C(u)$, for $u \in X_1 \setminus C$
(52) $\forall x\, \neg F(x, x)$
(53) $\forall x, y\ (F(x, y) \rightarrow F(y, x))$
(54) $\forall x\, \exists y\ F(x, y)$
(55) $\forall x, y\ ((F(x, y) \wedge S(x)) \rightarrow S(y))$
(56) $\forall x\ (S(x) \rightarrow C(x))$
(57) $\forall x\ (\neg C(x) \rightarrow \neg S(x))$
Notice that the knowledge base is not satisfiable in the strict logical sense of the word. For instance, $f$ is said to smoke but not to have cancer, which is inconsistent with the rule $\forall x\ (S(x) \rightarrow C(x))$. Hence, it is important to adopt a relaxed semantics, such as the probabilistic one of MLN or the many-valued fuzzy logic interpretation of LTN.

Grounding:

$\mathcal{G}(\text{people}) = \mathbb{R}^5$. The model is expected to learn embeddings in $\mathbb{R}^5$.
$\mathcal{G}(a \mid \theta) = v_\theta(a), \dots, \mathcal{G}(n \mid \theta) = v_\theta(n)$. Every individual is associated with a vector of 5 real numbers. The embeddings are initialized randomly and uniformly.
$\mathcal{G}(x \mid \theta) = \mathcal{G}(y \mid \theta) = \langle v_\theta(a), \dots, v_\theta(n) \rangle$.
$\mathcal{G}(S \mid \theta) : x \mapsto \operatorname{sigmoid}(\operatorname{MLP\_S}_\theta(x))$, where $\operatorname{MLP\_S}_\theta$ has 1 output neuron.
$\mathcal{G}(F \mid \theta) : x, y \mapsto \operatorname{sigmoid}(\operatorname{MLP\_F}_\theta(x, y))$, where $\operatorname{MLP\_F}_\theta$ has 1 output neuron.
$\mathcal{G}(C \mid \theta) : x \mapsto \operatorname{sigmoid}(\operatorname{MLP\_C}_\theta(x))$, where $\operatorname{MLP\_C}_\theta$ has 1 output neuron.
The MLP models for $S$, $F$, $C$ are kept simple, so that most of the learning is focused on the embeddings; a minimal sketch of this grounding follows.
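A minimal sketch of this grounding, assuming the embeddings are stored as one trainable variable and the pair $(x, y)$ is fed to $F$ as a concatenation (both are assumptions, not the paper's stated implementation):

```python
import tensorflow as tf

NUM_PEOPLE, EMB_DIM = 14, 5

# One trainable embedding row per individual a, ..., n, initialized uniformly.
embeddings = tf.Variable(tf.random.uniform((NUM_PEOPLE, EMB_DIM)))

def make_mlp():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="elu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

mlp_S, mlp_F, mlp_C = make_mlp(), make_mlp(), make_mlp()

def S(x):                                         # smokes
    return mlp_S(x)[:, 0]

def C(x):                                         # cancer
    return mlp_C(x)[:, 0]

def F(x, y):                                      # friends, pair concatenated
    return mlp_F(tf.concat([x, y], axis=-1))[:, 0]

# Example: rule (52), forall x not F(x, x), aggregated with A_pME and p = 6.
sat_rule_52 = 1.0 - tf.reduce_mean(F(embeddings, embeddings) ** 6.0) ** (1.0 / 6.0)
```

Gradients of the satisfaction objective flow into both the MLPs and the embedding table, so the embeddings themselves are learned.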

Learning:

We use the stable real product configuration to approximate the operators. For $\forall$, we use $A_{pME}$ with $p = 2$ for all the rules, except for rules (52) and (53), where we use $p = 6$. The intuition behind this choice of $p$ is that no outliers are to be accepted for the friendship relation, since it is expected to be symmetric and anti-reflexive, while outliers are accepted for the other rules. For $\exists$, we use $A_{pM}$ with $p = 1$ during the first 200 epochs of training and $p = 6$ thereafter, with the same motivation as that of the schedule used in Section 4.4. The formula aggregator is approximated by $A_{pME}$ with $p = 2$.
Figure 19 shows the satisfiability over 1000 epochs of training. At the end of one of these runs, we query $S(x)$, $F(x, y)$, $C(x)$ for each individual; the results are shown in Figure 20b. We also plot the principal components of the learned embeddings [51] in Figure 21. The friendship relations are learned as expected. Rule (56), "smoking implies cancer", is inferred for group 2 even though such information was not present in the knowledge base. For group 1, the given facts for smoking and cancer for the individuals $f$ and $g$ are slightly altered, as these were inconsistent with the rules (the rule (55) for smoking propagating via friendship is incompatible with many of the given facts). Increasing the satisfaction of this rule would require decreasing the overall satisfaction of the knowledge base, which explains why it is partly ignored by LTN during training. Finally, it is interesting to note that the principal components of the learned embeddings seem to be linearly separable for the smoking and cancer classifiers (cf. Figure 21, top right and bottom right plots).

Querying:

To illustrate querying in LTN, we query over time two formulas that are not present in the knowledge-base:
(58) $\phi_1: \forall p\ (C(p) \rightarrow S(p))$
(59) $\phi_2: \forall p, q\ ((C(p) \wedge C(q)) \rightarrow F(p, q))$
We use $p = 5$ when approximating $\forall$, since the impact of an outlier at querying time should be seen as more important than at learning time. It can be seen that, as the grounding approaches satisfiability of the knowledge base, $\phi_1$ approaches true, whereas $\phi_2$ approaches false (cf. Figure 19, right).
Figure 19: Smoker-Friends-Cancer example: Satisfiability levels during training (left) and truth-values of queries ϕ1 and ϕ2 over time (right).
(a) Incomplete facts in the knowledge-base: axioms for smokers and cancer for individuals a to n (left),friendship relations in group 1 (middle), and friendship relations in group 2 (right).
(b) Querying all the truth-values using LTN after training: smokers and cancer (left), friendship relations (middle and right).
Figure 20: Smoker-Friends-Cancer example: Illustration of the facts before and after training.
Figure 21: Smoker-Friends-Cancer example: learned embeddings showing the result of applying PCA on the individuals (top left); truth-values of smokes and cancer predicates for each embedding (top and bottom right); illustration of the friendship relations which are satisfied after learning (bottom left).

4.8. Reasoning in LTN

The essence of reasoning is to find out whether a closed formula $\phi$ is a logical consequence of a knowledge base $\langle \mathcal{K}, \mathcal{G}(\cdot \mid \theta), \Theta \rangle$. Section 3.4 introduced two approaches to this problem in LTN:
  • By simply querying after learning,²⁷ one seeks to verify whether, for the grounded theories that maximally satisfy $\mathcal{K}$, the grounding of $\phi$ gives a truth-value greater than a threshold $q$. This would in general require checking an infinite number of groundings. Instead, the user approximates the search for these grounded theories by running the optimization a fixed number of times only.
  • By reasoning by refutation, one seeks to find a counter-example: a grounding that satisfies the knowledge base $\mathcal{K}$ but not the formula $\phi$, given the threshold $q$. The search is performed here using a different objective function.
We now demonstrate that reasoning by refutation is the preferred option, using a simple example where we seek to find out whether $(A \vee B) \models_q A$.

Propositional Variables:

The symbols $A$ and $B$ denote two propositional variables.
Axioms:
(60) $A \vee B$

²⁷ Here, learning refers to Section 3.2, that is, optimizing with the satisfaction of the knowledge base as the objective.

Figure 22: Querying after learning: 10 runs of the optimizer with objective $\mathcal{G}^* = \operatorname{argmax}_{\mathcal{G}_\theta} \mathcal{G}_\theta(\mathcal{K})$.²⁸ All runs converge to the optimum $G_1$; the grid search misses the counter-example.

Grounding:

$\mathcal{G}(A) = a$, $\mathcal{G}(B) = b$, where $a$ and $b$ are two real-valued parameters. The set of parameters is therefore $\theta = \{a, b\}$. At initialization, $a = b = 0$.
We use the probabilistic sum $S_P$ to approximate $\vee$, resulting in the following satisfiability measure:
(61) $\mathcal{G}_\theta(\mathcal{K}) = \mathcal{G}_\theta(A \vee B) = a + b - ab.$
There are infinitely many global optima maximizing the satisfiability of the theory, as any $\mathcal{G}_\theta$ such that $\mathcal{G}_\theta(A) = 1$ (resp. $\mathcal{G}_\theta(B) = 1$) gives a satisfiability $\mathcal{G}_\theta(\mathcal{K}) = 1$ for any value of $\mathcal{G}_\theta(B)$ (resp. $\mathcal{G}_\theta(A)$). As expected, the following groundings are examples of global optima:
$G_1: G_1(A) = 1$, $G_1(B) = 1$, $G_1(\mathcal{K}) = 1$,
$G_2: G_2(A) = 1$, $G_2(B) = 0$, $G_2(\mathcal{K}) = 1$,
$G_3: G_3(A) = 0$, $G_3(B) = 1$, $G_3(\mathcal{K}) = 1$.

Reasoning:

Does $(A \vee B) \models_q A$? That is, given the threshold $q = 0.95$, does every $\mathcal{G}_\theta$ such that $\mathcal{G}_\theta(\mathcal{K}) \geq q$ verify $\mathcal{G}_\theta(\phi) \geq q$? One can immediately notice that this is not the case: for instance, the grounding $G_3$ is a counter-example.
If one simply reasons by querying multiple groundings after learning with the usual objective $\operatorname{argmax}_{\mathcal{G}_\theta} \mathcal{G}_\theta(\mathcal{K})$, the results will all converge to $G_1$, since $\frac{\partial \mathcal{G}_\theta(\mathcal{K})}{\partial a} = 1 - b$ and $\frac{\partial \mathcal{G}_\theta(\mathcal{K})}{\partial b} = 1 - a$. Every run of the optimizer will increase $a$ and $b$ simultaneously until they reach the optimum $a = b = 1$. Because the search always converges to the same point, no counter-example is found and the logical consequence is mistakenly assumed to be true. This is illustrated in Figure 22.
When reasoning by refutation, however, the objective function has an incentive to find a counter-example with $\neg A$, as illustrated in Figure 23 and sketched below. LTN converges to the optimum $G_3$, which refutes the logical consequence.
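The following sketch reproduces the refutation objective of Figure 23 on this two-parameter example; clipping the parameters to $[0, 1]$ is an assumption about how the truth-value constraint is maintained:

```python
import tensorflow as tf

a = tf.Variable(0.0)   # G(A)
b = tf.Variable(0.0)   # G(B)
q, alpha, beta = 0.95, 0.05, 10.0

def elu(x, alpha):
    # Penalty shaping from the refutation objective: linear when the knowledge
    # base is under-satisfied, saturating near -alpha when it is satisfied.
    return tf.where(x > 0, x, alpha * (tf.exp(x) - 1.0))

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
for _ in range(2000):
    with tf.GradientTape() as tape:
        sat_K = a + b - a * b        # G(K) = G(A or B), probabilistic sum
        sat_phi = a                  # G(phi) = G(A)
        # Minimize G(phi) while softly enforcing G(K) >= q.
        loss = sat_phi + elu(beta * (q - sat_K), alpha)
    grads = tape.gradient(loss, [a, b])
    optimizer.apply_gradients(zip(grads, [a, b]))
    a.assign(tf.clip_by_value(a, 0.0, 1.0))  # keep truth-values in [0, 1]
    b.assign(tf.clip_by_value(b, 0.0, 1.0))
```

Running this loop first pushes the satisfaction of the knowledge base above $q$, then drives $a$ towards 0 and $b$ towards 1, i.e. towards the counter-example $G_3$.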

²⁸ We use the notation $\mathcal{G}(\mathcal{K}) := \operatorname{SatAgg}_{\phi \in \mathcal{K}} \mathcal{G}(\phi)$.

Figure 23: Reasoning by refutation: one run of the optimizer with objective $\mathcal{G}^* = \operatorname{argmin}_{\mathcal{G}_\theta} \big( \mathcal{G}_\theta(\phi) + \operatorname{elu}(\alpha, \beta (q - \mathcal{G}_\theta(\mathcal{K}))) \big)$, with $q = 0.95$, $\alpha = 0.05$, $\beta = 10$. In the first training epochs, the directed search prioritizes the satisfaction of the knowledge base. Then, the minimization of $\mathcal{G}_\theta(\phi)$ starts to weigh in more and the search focuses on finding a counter-example. Eventually, the run converges to the optimum $G_3$, which refutes the logical consequence.

5. Related Work

The past years have seen considerable work aiming to integrate symbolic systems and neural networks. We shall focus on work whose objective is to build computational models that integrate deep learning and logical reasoning into a so-called end-to-end (fully differentiable) architecture. We summarize a categorization in Figure 24, where the class containing LTN is further expanded into three sub-classes; the sub-class that contains LTN is highlighted in red. The reasons why one may wish to combine symbolic AI and neural networks into a neurosymbolic AI system may vary; cf. [17] for a recent comprehensive overview of approaches and challenges for neurosymbolic AI.

5.1. Neural architectures for logical reasoning

These use neural networks to perform (probabilistic) inference on logical theories. Early work in this direction has shown correspondences between various logical-symbolic systems and neural network models [27, 32, 52, 63, 65]. It has also highlighted the limits of current neural networks as models for knowledge representation. In a nutshell, current neural networks (including deep learning) have been shown capable of representing propositional logic, nonmonotonic logic programming, propositional modal logic, and fragments of first-order logic, but not full first-order or higher-order logic. Recently, there has been a resurgence of interest in the topic, with many proposals emerging [13, 48, 53]. In [13], each clause of a Stochastic Logic Program is converted into a factor graph, with reasoning becoming differentiable so that it can be implemented by deep networks. In [49], a differentiable unification algorithm is introduced so that theorem proving can be carried out inside the neural network. Furthermore, in [11, 49], neural networks are used to learn reasoning strategies and logical rule induction.
Reasoning with LTN (Section 3.4) is reminiscent of this category, given that knowledge is not represented in a traditional logical language but in Real Logic.

5.2. Logical specification of neural network architectures

Here the goal is to use a logical language to specify the architecture of a neural network. Examples include [13, 24, 26, 56, 66]. In [26], the languages of extended logic programming (logic programs with negation by failure) and answer set programming are used as background knowledge to set up the initial architecture and set of weights of a recurrent neural network, which is subsequently trained from data using backpropagation. In [24], first-order logic programs in the form of Horn clauses are used to define a neural network that can solve Inductive Logic Programming tasks, starting from the most specific hypotheses covering the set of examples. Lifted relational neural networks [66] is a declarative framework where a Datalog program is used as a compact specification of a diverse range of existing advanced neural architectures, with a particular focus on Graph Neural Networks (GNNs) and their generalizations. In [56] a weighted Real Logic is introduced and used to specify neurons in a highly modular neural network that resembles a tree structure, whereby neurons with different activation functions are used to implement the different logic operators.
To some extent, it is also possible to specify neural architectures using logic in LTN. For example, a user can define a classifier $P(x, y)$ as the formula $P(x, y) = (Q(x, y) \wedge R(y)) \rightarrow S(x, y)$. $\mathcal{G}(P)$ becomes a computational graph that combines the sub-architectures $\mathcal{G}(Q)$, $\mathcal{G}(R)$, and $\mathcal{G}(S)$ according to the syntax of the logical formula, as sketched below.
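As an illustration, the sketch below composes three hypothetical sub-networks under the product configuration (product conjunction, Reichenbach implication); the sub-network architectures and the connectives follow the illustrative formula above:

```python
import tensorflow as tf

def make_net():
    # Hypothetical sub-architectures grounding Q, R, and S.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="elu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

net_Q, net_R, net_S = make_net(), make_net(), make_net()

def P(x, y):
    # G(P): the computational graph combining G(Q), G(R), G(S)
    # according to (Q(x, y) and R(y)) -> S(x, y).
    q = net_Q(tf.concat([x, y], axis=-1))[:, 0]
    r = net_R(y)[:, 0]
    s = net_S(tf.concat([x, y], axis=-1))[:, 0]
    conj = q * r                  # product t-norm for the conjunction
    return 1.0 - conj + conj * s  # Reichenbach implication
```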

5.3. Neurosymbolic architectures for the integration of inductive learning and deductive reasoning

These architectures seek to enable the integration of inductive and deductive reasoning in a unique fully differentiable framework [15, 23, 41, 46, 47]. The systems that belong to this class combine a neural component with a logical component. The former consists of one or more neural networks, the latter provides a set of algorithms for performing logical tasks such as model checking, satisfiability, and logical consequence. These two components are tightly integrated so that learning and inference in the neural component are influenced by reasoning in the logical component and vice versa. Logic Tensor Networks belong to this category. Neurosymbolic architectures for integrating learning and reasoning can be further separated into three sub-classes:
  1. Approaches that introduce additional layers to the neural network to encode logical constraints, which modify the predictions of the network. This sub-class includes Deep Logic Models [46] and Knowledge Enhanced Neural Networks [15].
  2. Approaches that integrate logical knowledge as additional constraints in the objective function or loss function used to train the neural network (LTN and [23, 33, 47]).
  3. Approaches that apply (differentiable) logical inference to compute the consequences of the predictions made by a set of base neural networks. Examples of this sub-class are DeepProbLog [41] and Abductive Learning [14].
In what follows, we review recent neurosymbolic architectures in the same class as LTN: architectures integrating learning and reasoning.
Systems that modify the predictions of a base neural network: Among the approaches that modify the predictions of the neural network using logical constraints are Deep Logic Models [46] and Knowledge Enhanced Neural Networks [15]. Deep Logic Models (DLM) are a general architecture for learning with constraints. Here, we will consider the special case where constraints are expressed by logical formulas. In this case, a DLM predicts the truth-values of a set of $n$ ground atoms of a domain $\Delta = \{a_1, \dots, a_k\}$. It consists of two models: a neural network $f(x \mid w)$, which takes as input the features $x$ of the elements of $\Delta$ and produces as output an evaluation $f$ of all the ground atoms, i.e. $f \in [0,1]^n$; and a probability distribution $p(y \mid f, \lambda)$, which is modeled by an undirected graphical model of the exponential family, with each logical constraint characterized by a clique that contains the ground atoms, rather similarly to GNNs. The model returns the assignment to the atoms that maximizes the weighted truth-value of the constraints and minimizes the difference between the prediction of the neural network and a target value $y$. Formally:
$$\operatorname{DLM}(x \mid \lambda, w) = \operatorname{argmax}_y \Big( \sum_c \lambda_c \Phi_c(y_c) - \frac{1}{2} \lVert y - f(x \mid w) \rVert^2 \Big)$$
Figure 24: Three classes of neurosymbolic approaches with Architectures Integrating Learning and Reasoning further subdivided into three sub-classes, with LTN belonging to the sub-class highlighted in red.
Each $\Phi_c(y_c)$ corresponds to a ground propositional formula that is evaluated w.r.t. the target truth assignment $y$, and $\lambda_c$ is the weight associated with formula $\Phi_c$. Intuitively, the upper model (the undirected graphical model) should modify the prediction of the lower model (the neural network) minimally in order to satisfy the constraints. $f$ and $y$ are truth-values of all the ground atoms obtained from the constraints appearing in the upper model, in the domain specified by the data input.
Similar to LTN, DLM evaluates constraints using fuzzy semantics. However, it considers only propositional connectives, whereas universal and existential quantifiers are supported in LTN.
Inference in DLM requires maximizing the prediction of the model, which might be prohibitive in the presence of a large number of instances. In LTN, inference involves only a forward pass through the neural component, which is rather simple and can be carried out in parallel. On the other hand, in DLM the weights associated with constraints can be learned, while in LTN they are specified in the background knowledge.
The approach taken in Knowledge Enhanced Neural Networks (KENN) [15] is similar to that of DLM. Starting from the predictions $y = f_{nn}(x \mid w)$ made by a base neural network $f_{nn}(\cdot \mid w)$, KENN adds a knowledge enhancer, which is a function that modifies $y$ based on a set of weighted constraints formulated in terms of clauses. The formal model can be specified as follows:
$$\operatorname{KENN}(x \mid \lambda, w) = \sigma\Big( f'_{nn}(x \mid w) + \sum_c \lambda_c \big( \operatorname{softmax}(\operatorname{sign}(c) \odot f'_{nn}(x \mid w)) \odot \operatorname{sign}(c) \big) \Big)$$
where $f'_{nn}(x \mid w)$ are the pre-activations of $f_{nn}(x \mid w)$; $\operatorname{sign}(c)$ is a vector of the same dimension as $y$ containing $1$, $-1$ and $0$, such that $\operatorname{sign}(c)_i = 1$ (resp. $\operatorname{sign}(c)_i = -1$) if the $i$-th atom occurs positively (resp. negatively) in $c$, and $0$ otherwise; and $\odot$ is the element-wise product. KENN learns the weights $\lambda$ of the clauses in the background knowledge and the base network parameters $w$ by minimizing some standard loss (e.g. cross-entropy) on a set of training data. If the training data is inconsistent with a constraint, the weight of the constraint will be close to zero. This intuitively implies that the latent knowledge present in the data is preferred to the knowledge specified in the constraints. In LTN, instead, training data and logical constraints are represented uniformly as formulas, and we require that they are all satisfied. A second difference between KENN and LTN is the language: while LTN supports constraints written in full first-order logic, constraints in KENN are limited to universally quantified clauses.
Systems that add knowledge to a neural network by adding a term to the loss function: In [33], a framework is proposed that learns simultaneously from labeled data and logical rules. The proposed architecture is made of a student network $f_{nn}$ and a teacher network, denoted by $q$. The student network is trained to do the actual predictions, while the teacher network encodes the information of the logical rules. The transfer of information from the teacher to the student network is done by defining a joint loss $\mathcal{L}$ for both networks as a convex combination of the loss of the student and that of the teacher. If $\tilde{y} = f_{nn}(x \mid w)$ is the prediction of the student network for input $x$, the loss is defined as:
$$(1 - \pi)\, \mathcal{L}(y, \tilde{y}) + \pi\, \mathcal{L}(q(\tilde{y} \mid x), \tilde{y})$$
where $q(\tilde{y} \mid x) \propto \exp\big( -\sum_c \lambda_c (1 - \phi_c(x, \tilde{y})) \big)$ measures how much the predictions $\tilde{y}$ satisfy the constraints encoded in the set of weighted clauses $\{(\lambda_c, \phi_c)\}_{c \in C}$. Training is iterative: at every iteration, the parameters of the student network are optimized to minimize the loss, which takes into account the feedback of the teacher network on the predictions from the previous step. The main difference between this approach and LTN is how the constraints are encoded in the loss. LTN integrates the constraints in the network and optimizes their satisfiability directly, with no need for additional training data. Furthermore, the constraints proposed in [33] are universally quantified formulas only.
The approach adopted by LYRICS [47] is analogous to the first version of LTN [61]. Logical constraints are translated into a loss function that measures the (negative) satisfiability level of the network. Differently from LTN, formulas in LYRICS can be associated with weights, which are hyper-parameters. In [47], a logarithmic loss function is also used when the product t-norm is adopted. Notice that weights can also be added (indirectly) to LTN by introducing a 0-ary predicate $p_w$ to represent a constraint of the form $p_w \rightarrow \phi$. An advantage of this approach would be that the weights could be learned.
In [72], a neural network computes the probability of some events being true. The neural network should satisfy a set of propositional logic constraints on its output. These constraints are compiled into arithmetic circuits for weighted model counting, which are then used to compute a loss function. The loss function then captures how close the neural network is to satisfying the propositional logic constraints.
Systems that apply logical reasoning on the predictions of a base neural network: The most notable architecture in this category is DeepProbLog [41]. DeepProbLog extends the ProbLog framework for probabilistic logic programming to allow the computation of probabilistic evidence from neural networks. A ProbLog program is a logic program in which facts and rules can be associated with probability values, and such values can be learned. Inference in ProbLog to answer a query $q$ is performed by knowledge compilation into a function $p(q \mid \lambda)$ that computes the probability that $q$ is true according to the logic program with relative frequencies $\lambda$. In DeepProbLog, a neural network $f_{nn}$ that outputs a probability distribution $t = (t_1, \dots, t_n)$ over a set of atoms $a = (a_1, \dots, a_n)$ is integrated into ProbLog by extending the logic program with $a$ and the respective probabilities $t$. The probability of a query $q$ is then given by $p(q \mid \lambda, f_{nn}(x \mid w))$, where $x$ is the input of $f_{nn}$ and $p$ is the function corresponding to the logic program extended with $a$. Given a set of queries $q$, input vectors $x$ and ground-truths $y$ for all the queries, training is performed by minimizing a loss function that measures the distance between the probabilities predicted by the logic program and the ground-truths, as follows:
$$\mathcal{L}\big(y, p(q \mid \lambda, f_{nn}(x \mid w))\big)$$
The most important difference between DeepProbLog and LTN concerns the logic on which they are based. DeepProbLog adopts probabilistic logic programming: the output of the base neural network is interpreted as the probability of certain atoms being true. LTN, instead, is based on many-valued logic: the predictions of the base neural network are interpreted as fuzzy truth-values (though previous work [67] also formalizes Real Logic as handling probabilities with relaxed constraints). This difference of logic leads to the second main difference between LTN and DeepProbLog: their inference mechanisms. DeepProbLog performs probabilistic inference (based on model counting), while LTN inference consists of computing the truth-value of a formula starting from the truth-values of its atomic components. The two types of inference are incomparable. However, computing the fuzzy truth-value of a formula is more efficient than model counting, resulting in a more scalable inference task that allows LTN to use full first-order logic with function symbols. In DeepProbLog, to perform probabilistic inference, a closed-world assumption is made and a function-free language is used. Typically, DeepProbLog clauses are compiled into Sentential Decision Diagrams (SDDs) to accelerate inference considerably [36], although the compilation step of clauses into the SDD circuit is still costly.
An approach that extends the predictions of a base neural network using abductive reasoning is [14]. Given a neural network $f_{nn}(x \mid w)$ that produces a crisp output $y \in \{0, 1\}^n$ for $n$ predicates $p_1, \dots, p_n$, and background knowledge in the form of a logic program $p$, the parameters $w$ of $f_{nn}$ are learned alongside a set of additional rules $\Delta_C$ that define a new concept $C$ w.r.t. $p_1, \dots, p_n$, such that, for every object $o$ with features $x_o$:
(62) $p \cup f_{nn}(x_o \mid w) \cup \Delta_C \models C(o)$ if $o$ is an instance of $C$
$p \cup f_{nn}(x_o \mid w) \cup \Delta_C \models \neg C(o)$ if $o$ is not an instance of $C$
The task is solved by iterating the following three steps:
  1. Given the predictions of the neural network $\{f_{nn}(x_o \mid w)\}_{o \in O}$ on the set $O$ of training objects, search for the best $\Delta_C$ that maximizes the number of objects for which (62) holds;
  2. For each object $o$, compute, by abduction on $p \cup \Delta_C$, the explanation $p(o)$;
  3. Retrain $f_{nn}$ with the training set $\{(x_o, p(o))\}_{o \in O}$.
Differently from LTN, in [14] the optimization is done separately, in an iterative way. The semantics of the logic is crisp, neither fuzzy nor probabilistic, and therefore not fully differentiable. Abductive reasoning is adopted, which is a potentially relevant addition for comparison with symbolic ML and Inductive Logic Programming approaches [50].
Various other loosely-coupled approaches have been proposed recently such as [44], where image classification is carried out by a neural network in combination with reasoning from text data for concept learning at a higher level of abstraction than what is normally possible with pixel data alone. The proliferation of such approaches has prompted Henry Kautz to propose a taxonomy for neurosymbolic AI in [34] (also discussed in [17]), including recent work combining neural networks with graphical models and graph neural networks [4, 40, 58], statistical relational learning [21, 55], and even verification of neural multi-agent systems [2, 8].

6. Conclusions and Future Work

In this paper, we have specified the theory and exemplified the reach of Logic Tensor Networks as a model and system for neurosymbolic AI. LTN is capable of combining approximate reasoning and deep learning, knowledge and data.
For ML practitioners, learning in LTN (see Section 3.2) can be understood as optimization under first-order logic constraints relaxed into a loss function. For logic practitioners, learning is similar to inductive inference: given a theory, learning makes generalizations from specific observations obtained from data. Compared to other neurosymbolic architectures (see Section 5), the LTN framework has useful properties for gradient-based optimization (see Section 2.4) and a syntax that supports many traditional ML tasks and their inductive biases (see Section 4), all while remaining computationally efficient (see Table 1).
Section 3.4 discussed reasoning in LTN. Reasoning is normally under-specified within neural networks. Logical reasoning is the task of proving whether some knowledge follows from the facts that are currently known. It is traditionally achieved semantically, using model theory, or syntactically, via a proof system. The current LTN framework approaches reasoning semantically, although it should be possible to use LTN and querying alongside a proof system. When reasoning by refutation in LTN, to find out if a statement $\phi$ is a logical consequence of given data and knowledge base $\mathcal{K}$, a proof by refutation attempts to find a semantic counter-example where $\neg\phi$ and $\mathcal{K}$ are satisfied. If the search fails, then $\phi$ is assumed to hold. This approach is efficient in LTN because it allows for a direct search for counter-examples via gradient-descent optimization. It is assumed that $\phi$, the statement to prove or disprove, is known. Future work could explore automatically inducing which statement $\phi$ to consider, possibly using syntactical reasoning in the process.
The paper formalizes Real Logic, the language supporting LTN. The semantics of Real Logic is close to the semantics of fuzzy FOL, with the following major differences: 1) Real Logic domains are typed and restricted to real numbers and real-valued tensors; 2) Real Logic variables are sequences of fixed length, whereas FOL variables are placeholders for any individual in a domain; 3) Real Logic relations are interpreted as mathematical functions, whereas fuzzy logic relations are interpreted as fuzzy set membership functions. Concerning the semantics of connectives and quantifiers, some LTN implementations correspond to semantics for t-norm fuzzy logic, but not all. For example, the conjunction operator in the stable product semantics is not a t-norm, as pointed out at the end of Section 2.4.
Integrative neurosymbolic approaches are known for either seeking to bring neurons into a symbolic system (neurons into symbols) [41] or to bring symbols into a neural network (symbols into neurons) [60]. LTN adopts the latter approach, but maintains a close link between the symbols and their grounding onto the neural network. The discussion around these two options, neurons into symbols vs. symbols into neurons, is likely to take center stage in the debate around neurosymbolic AI in the next decade. LTN and related approaches are well placed to play an important role in this debate by offering a rich logical language tightly coupled with an efficient distributed implementation in TensorFlow computational graphs.
The close connection between first-order logic and its implementation in LTN makes LTN very suitable as a model for the neural-symbolic cycle [27, 29], which seeks to translate between neural and symbolic representations. Such translations can take place at the level of the structure of a neural network, given a symbolic language [27], or at the level of the loss functions, as done by LTN and related approaches [13, 45, 46]. LTN opens up a number of promising avenues for further research.
Firstly, a continual learning approach might allow one to start with very little knowledge, build up and validate knowledge over time by querying the LTN network. Translations to and from neural and symbolic representations will enable reasoning also to take place at the symbolic level (e.g. alongside a proof system), as proposed recently in [70] with the goal of improving fairness of the network model.
Secondly, LTN should be compared in large-scale practical use cases with other recent efforts to add structure to neural networks such as the neuro-symbolic concept learner [44] and high-level capsules which were used recently to learn the part-of relation [38], similarly to how LTN was used for semantic image interpretation in [19].
Finally, LTN should also be compared with Tensor Product Representations, e.g. [59], which show that state-of-the-art recurrent neural networks may fail at simple question-answering tasks, despite achieving very high accuracy. Efforts in the area of transfer learning, mostly in computer vision, which seek to model systematicity could also be considered a benchmark [5]. Experiments using fewer data and therefore lower energy consumption, out-of-distribution extrapolation, and knowledge-based transfer are all potentially suitable areas of application for LTN as a framework for neurosymbolic AI based on learning from data and compositional knowledge.

Acknowledgement

We would like to thank Benedikt Wagner for his comments and a number of productive discussions on continual learning, knowledge extraction and reasoning in LTNs.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Michael Akintunde, Elena Botoeva, Panagiotis Kouvaros, and Alessio Lomuscio. Verifying strategic abilities of neural multi-agent systems. In Proceedings of 17th International Conference on Principles of Knowledge Representation and Reasoning, KR2020, Rhodes, Greece, September 2020.
[3] Samy Badreddine and Michael Spranger. Injecting Prior Knowledge for Transfer Learning into Reinforcement Learning Algorithms using Logic Tensor Networks. arXiv:1906.06576 [cs, stat], June 2019. arXiv: 1906.06576.
[4] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4509-4517, USA, 2016. Curran Associates Inc.
[5] Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations, 2020.
[6] Federico Bianchi and Pascal Hitzler. On the capabilities of logic tensor networks for deductive reasoning. In Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019) Stanford University, Palo Alto, California, USA, March 25-27, 2019., Stanford University, Palo Alto, California, USA, March 25-27, 2019., 2019.
[7] Federico Bianchi, Matteo Palmonari, Pascal Hitzler, and Luciano Serafini. Complementing logical reasoning with sub-symbolic commonsense. In International Joint Conference on Rules and Reasoning, pages 161-170. Springer, 2019.
[8] Rafael Borges, Artur d'Avila Garcez, and Luís Lamb. Learning and representing temporal knowledge in recurrent networks. IEEE Transactions on Neural Networks, 22(12):2409-2421, December 2011.
[9] Liber Běhounek, Petr Cintula, and Petr Hájek. Introduction to mathematical fuzzy logic. In Petr Cintula, Petr Hájek, and Carles Noguera, editors, Handbook of Mathematical Fuzzy Logic, Volume 1, volume 37 of Studies in Logic, Mathematical Logic and Foundations, pages 1-102. College Publications, 2011.
[10] N. A. Campbell and R. J. Mahon. A multivariate study of variation in two species of rock crab of the genus Leptograpsus. Australian Journal of Zoology, 22(3):417-425, 1974. Publisher: CSIRO PUBLISHING.
[11] Andres Campero, Aldo Pareja, Tim Klinger, Josh Tenenbaum, and Sebastian Riedel. Logical rule induction and theory learning using neural theorem proving. CoRR, abs/1809.02193, 2018.
[12] Benhui Chen, Xuefen Hong, Lihua Duan, and Jinglu Hu. Improving multi-label classification performance by label constraints. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1-5. IEEE, 2013.
[13] William W. Cohen, Fan Yang, and Kathryn Mazaitis. Tensorlog: A probabilistic database implemented using deep-learning infrastructure. J. Artif. Intell. Res., 67:285-325, 2020.
[14] W.-Z. Dai, Q. Xu, Y. Yu, and Z.-H. Zhou. Bridging machine learning and logical reasoning by abductive learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, NeurIPS'19, USA, 2019. Curran Associates Inc.
[15] Alessandro Daniele and Luciano Serafini. Knowledge enhanced neural networks. In Pacific Rim International Conference on Artificial Intelligence, pages 542-554. Springer, 2019.
[16] Artur d'Avila Garcez, Marco Gori, Luís C. Lamb, Luciano Serafini, Michael Spranger, and Son N. Tran. Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. FLAP, 6(4):611-632, 2019.
[17] Artur d'Avila Garcez and Luis C. Lamb. Neurosymbolic AI: The 3rd wave, 2020.
[18] Ivan Donadello and Luciano Serafini. Compensating supervision incompleteness with prior knowledge in semantic image interpretation. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1-8. IEEE, 2019.
[19] Ivan Donadello, Luciano Serafini, and Artur d'Avila Garcez. Logic tensor networks for semantic image interpretation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1596-1602, 2017.
[20] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[21] Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. J. Artif. Intell. Res., 61:1-64, 2018.
[22] Ronald Fagin, Ryan Riegel, and Alexander Gray. Foundations of reasoning with uncertainty via real-valued logics, 2020.
[23] Marc Fischer, Mislav Balunovic, Dana Drachsler-Cohen, Timon Gehr, Ce Zhang, and Martin Vechev. DL2: Training and querying neural networks with logic. In International Conference on Machine Learning, pages 1931-1941, 2019.
[24] Manoel Franca, Gerson Zaverucha, and Artur d'Avila Garcez. Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning, 94:81-104, January 2014.
[25] Dov M. Gabbay and John Woods, editors. The Many Valued and Nonmonotonic Turn in Logic, volume 8 of Handbook of the History of Logic. Elsevier, 2007.
[26] Artur d'Avila Garcez, Dov M. Gabbay, and Krysia B. Broda. Neural-Symbolic Learning System: Foundations and Applications. Springer-Verlag, Berlin, Heidelberg, 2002.
[27] Artur d'Avila Garcez, Lus C. Lamb, and Dov M. Gabbay. Neural-Symbolic Cognitive Reasoning. Springer Publishing Company, Incorporated, 1 edition, 2008.
[28] Petr Hajek. Metamathematics of Fuzzy Logic. Kluwer Academic Publishers, 1998.
[29] Barbara Hammer and Pascal Hitzler, editors. Perspectives of Neural-Symbolic Integration, volume 77 of Studies in Computational Intelligence. Springer, 2007.
[30] Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335- 346, 1990.
[31] Patrick Hohenecker and Thomas Lukasiewicz. Ontology reasoning with deep neural networks. Journal of Artificial Intelligence Research, 68:503-540, 2020.
[32] Steffen Hölldobler and Franz J. Kurfess. CHCL - A connectionist inference system. In Bertram Fronhöfer and Graham Wrightson, editors, Parallelization in Inference Systems, International Workshop, Dagstuhl Castle, Germany, December 17-18, 1990, Proceedings, volume 590 of Lecture Notes in Computer Science, pages 318-342. Springer, 1990.
[33] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2410-2420, Berlin, Germany, August 2016. Association for Computational Linguistics.
[34] Henry Kautz. The Third AI Summer, AAAI Robert S. Engelmore Memorial Lecture, Thirty-fourth AAAI Conference on Artificial Intelligence, New York, NY, February 10, 2020.
[35] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], January 2017. arXiv: 1412.6980.
[36] Doga Kisa, Guy Van den Broeck, Arthur Choi, and Adnan Darwiche. Probabilistic sentential decision diagrams. In Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning (KR), July 2014.
[37] Erich Peter Klement, Radko Mesiar, and Endre Pap. Triangular Norms, volume 8 of Trends in Logic. Springer Netherlands, Dordrecht, 2000.
[38] Adam Kosiorek, Sara Sabour, Yee Whye Teh, and Geoffrey E Hinton. Stacked capsule autoencoders. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 15512-15522. Curran Associates, Inc., 2019.
[39] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, page 1097-1105, Red Hook, NY, USA, 2012. Curran Associates Inc.
[40] Luís C. Lamb, Artur d'Avila Garcez, Marco Gori, Marcelo O. R. Prates, Pedro H. C. Avelar, and Moshe Y. Vardi. Graph neural networks meet neural-symbolic computing: A survey and perspective. In Christian Bessiere, editor, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020 [scheduled for July 2020, Yokohama, Japan, postponed due to the Corona pandemic], pages 4877-4884. ijcai.org, 2020.
[41] Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. Deepproblog: Neural probabilistic logic programming. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NeurIPS'18, pages 3753-3763, USA, 2018. Curran Associates Inc.
[42] Francesco Manigrasso, Filomeno Davide Miro, Lia Morra, and Fabrizio Lamberti. Faster-LTN: a neuro-symbolic, end-to-end object detection architecture. arXiv:2107.01877 [cs], July 2021.
[43] Vasco Manquinho, Joao Marques-Silva, and Jordi Planes. Algorithms for weighted boolean optimization. In International conference on theory and applications of satisfiability testing, pages 495-508. Springer, 2009.
[44] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. CoRR, abs/1904.12584, 2019.
[45] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori. Constraint-based visual generation. In Igor V. Tetko, Vera Kurková, Pavel Karpov, and Fabian J. Theis, editors, Artificial Neural Networks and Machine Learning - ICANN 2019: Image Processing - 28th International Conference on Artificial Neural Networks, Munich, Germany, September 17-19, 2019, Proceedings, Part III, volume 11729 of Lecture Notes in Computer Science, pages 565-577. Springer, 2019.
[46] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori. Integrating learning and reasoning with deep logic models. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2019, Würzburg, Germany, September 16-20, 2019, Proceedings, Part II, volume 11907 of Lecture Notes in Computer Science, pages 517-532. Springer, 2019.
[47] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori. Lyrics: A general interface layer to integrate logic inference and deep learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 283-298. Springer, 2019.
[48] Giuseppe Marra and Ondřej Kuželka. Neural markov logic networks. arXiv preprint arXiv:1905.13462, 2019.
[49] Pasquale Minervini, Sebastian Riedel, Pontus Stenetorp, Edward Grefenstette, and Tim Rocktäschel. Learning reasoning strategies in end-to-end differentiable proving, 2020.
[50] Stephen H. Muggleton, Dianhuan Lin, Niels Pahlavi, and Alireza Tamaddoni-Nezhad. Meta-interpretive learning: Application to grammatical inference. Mach. Learn., 94(1):25-49, January 2014.
[51] Karl Pearson. Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559-572, 1901.
[52] Gadi Pinkas. Reasoning, nonmonotonicity and learning in connectionist networks that capture propositional knowledge. Artif. Intell., 77(2):203-247, 1995.
[53] Meng Qu and Jian Tang. Probabilistic logic neural networks for reasoning. In Advances in Neural Information Processing Systems, pages 7712-7722, 2019.
[54] Luc De Raedt, Sebastijan Dumančić, Robin Manhaeve, and Giuseppe Marra. From statistical relational to neuro-symbolic artificial intelligence, 2020.
[55] Matthew Richardson and Pedro Domingos. Markov logic networks. Mach. Learn., 62(1-2):107- 136, February 2006.
[56] Ryan Riegel, Alexander Gray, Francois Luus, Naweed Khan, Ndivhuwo Makondo, Ismail Yunus Akhalwaya, Haifeng Qian, Ronald Fagin, Francisco Barahona, Udit Sharma, Shajith Ikbal, Hima Karanam, Sumit Neelam, Ankita Likhyani, and Santosh Srivastava. Logical Neural Networks. arXiv:2006.13155 [cs], June 2020. arXiv: 2006.13155.
[57] Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving. In Advances in Neural Information Processing Systems, pages 3788-3800, 2017.
[58] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. Trans. Neur. Netw., 20(1):61-80, January 2009.
[59] Imanol Schlag and Jürgen Schmidhuber. Learning to reason with third-order tensor products. CoRR, abs/1811.12143, 2018.
[60] Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and Jianfeng Gao. Enhancing the transformer with explicit relational encoding for math problem solving. CoRR, abs/1910.06611, 2019.
[61] Luciano Serafini and Artur d'Avila Garcez. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422, 2016.
[62] Luciano Serafini and Artur d’Avila Garcez. Learning and reasoning with logic tensor networks. In Conference of the Italian Association for Artificial Intelligence, pages 334–348. Springer, 2016.
[63] Lokendra Shastri. Advances in SHRUTI-A neurally motivated model of relational knowledge representation and rapid inference using temporal synchrony. Appl. Intell., 11(1):79-108, 1999.
[64] Yun Shi. A deep study of fuzzy implications. PhD thesis, Ghent University, 2009.
[65] Paul Smolensky and Géraldine Legendre. The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar, Volume I: Cognitive Architecture. Bradford Books. The MIT Press, 2006.
[66] Gustav Sourek, Vojtech Aschenbrenner, Filip Zelezny, Steven Schockaert, and Ondrej Kuzelka. Lifted relational neural networks: Efficient learning of latent relational structures. Journal of Artificial Intelligence Research, 62:69-100, 2018.
[67] Emile van Krieken, Erman Acar, and Frank van Harmelen. Semi-Supervised Learning using Differentiable Reasoning. arXiv:1908.04700 [cs], August 2019. arXiv: 1908.04700.
[68] Emile van Krieken, Erman Acar, and Frank van Harmelen. Analyzing Differentiable Fuzzy Implications. In Proceedings of the 17th International Conference on Principles of Knowledge Representation and Reasoning, pages 893-903, September 2020.
[69] Emile van Krieken, Erman Acar, and Frank van Harmelen. Analyzing Differentiable Fuzzy Logic Operators. arXiv:2002.06100 [cs], February 2020. arXiv: 2002.06100.
[70] Benedikt Wagner and Artur d'Avila Garcez. Neural-Symbolic Integration for Fairness in AI. Proceedings of the AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering 2021, page 14, 2021.
[71] Po-Wei Wang, Priya L Donti, Bryan Wilder, and Zico Kolter. Satnet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. arXiv preprint arXiv:1905.12149, 2019.
[72] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Broeck. A Semantic Loss Function for Deep Learning with Symbolic Knowledge. In International Conference on Machine Learning, pages 5502-5511. PMLR, July 2018. ISSN: 2640-3498.

Appendix A. Implementation Details

The LTN library is implemented in TensorFlow 2 [1] and is available from GitHub29. Every logical operator is grounded using TensorFlow primitives, so the LTN code directly implements a TensorFlow computational graph. Owing to TensorFlow's built-in optimizations, LTN is relatively efficient while providing the expressive power of FOL.
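To make the grounding of connectives concrete, the following minimal sketch (illustrative only; the function names are ours and not the library's actual API) shows how product-based operators can be written directly as TensorFlow functions, so that the truth value of a formula is an ordinary differentiable tensor:

import tensorflow as tf

# Illustrative grounding of fuzzy connectives as TensorFlow primitives.
def And_prod(x, y):
    return x * y                 # product t-norm T_P
def Or_prob_sum(x, y):
    return x + y - x * y         # probabilistic sum S_P
def Not_std(x):
    return 1.0 - x               # standard negation N_S

a, b = tf.constant(0.7), tf.constant(0.2)
print(And_prod(a, Not_std(b)).numpy())  # 0.56: truth value of a AND NOT b

Because the connectives are plain tensor operations, gradients flow from the truth value of a formula back to the parameters of any network that produced a or b.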
Table A.3 gives an overview of the network architectures used to obtain the results of the examples in Section 4. The LTN repository includes the code for these examples. Unless explicitly mentioned otherwise, the reported results are averaged over 10 runs using a 95% confidence interval. Every example uses the stable real product configuration to approximate the Real Logic operators, and the Adam optimizer [35] with a learning rate of 0.001 to train the parameters.
Task | Network | Architecture
4.1 | MLP | Dense(16)*, Dense(16)*, Dense(1)
4.2 | MLP | Dense(16)*, Dropout(0.2), Dense(16)*, Dropout(0.2), Dense(8)*, Dropout(0.2), Dense(1)
4.3 | MLP | Dense(16), Dense(16), Dense(8), Dense(1)
4.4 | CNN | MNISTConv, Dense(84)*, Dense(10)
4.4 | baseline - SD | MNISTConv×2, Dense(84), Dense(19), Softmax
4.4 | baseline - MD | MNISTConv×4, Dense(128), Dense(199), Softmax
4.5 | MLP | Dense(8)*, Dense(8)*, Dense(1)
4.6 | MLP | Dense(16)*, Dense(16)*, Dense(16)*, Dense(1)
4.7 | MLP_S | Dense(8)*, Dense(8)*, Dense(1)
4.7 | MLP_F | Dense(8)*, Dense(8)*, Dense(1)
4.7 | MLP_C | Dense(8)*, Dense(8)*, Dense(1)
* : layer ends with an elu activation
Dense(n) : regular fully-connected layer of n units
Dropout(r) : dropout layer with rate r
Conv(f,k) : 2D convolution layer with f filters and a kernel of size k
MP(w,h) : max pooling operation with a w×h pooling window
MNISTConv : Conv(6,5), MP(2,2), Conv(16,5), MP(2,2), Dense(100)
Table A.3: Overview of the neural network architectures used in each example. Notice that in the examples, the networks are usually used with some additional layer(s) to ground symbols. For instance, in experiment 4.2, in $\mathcal{G}(P) : x, l \mapsto l \cdot \mathrm{softmax}(\mathrm{MLP}(x))$, the softmax layer normalizes the raw predictions of MLP into probabilities in $[0,1]$, and the multiplication with the one-hot label $l$ selects the probability for one given class.
29 https://github.com/logictensornetworks/logictensornetworks

Appendix B. Fuzzy Operators and Properties

This appendix presents the most common operators used in the fuzzy logic literature and some of their noteworthy properties [28, 37, 64, 69].
Appendix B.1. Negation
Definition 7. A negation is a function $N : [0,1] \to [0,1]$ that at least satisfies:
N1. Boundary conditions: $N(0) = 1$ and $N(1) = 0$,
N2. Monotonically decreasing: $\forall (x,y) \in [0,1]^2, x \le y \Rightarrow N(x) \ge N(y)$.
Moreover, a negation is said to be strict if $N$ is continuous and strictly decreasing. A negation is said to be strong if $\forall x \in [0,1], N(N(x)) = x$.
We commonly use the standard negation $N_S(a) = 1 - a$, which is both strict and strong.
Appendix B.2. Conjunction
Definition 8. A conjunction is a function $C : [0,1]^2 \to [0,1]$ that at least satisfies:
C1. Boundary conditions: $C(0,0) = C(0,1) = C(1,0) = 0$ and $C(1,1) = 1$,
C2. Monotonically increasing: $\forall (x,y,z) \in [0,1]^3$, if $x \le y$, then $C(x,z) \le C(y,z)$ and $C(z,x) \le C(z,y)$.
In fuzzy logic, t-norms are widely used to model conjunction operators.
Definition 9. A t-norm (triangular norm) is a function $T : [0,1]^2 \to [0,1]$ that at least satisfies:
T1. Boundary conditions: $T(x,1) = x$,
T2. monotonically increasing,
T3. commutative,
T4. associative.
Example 4. Three commonly used t-norms are:
$T_M(x,y) = \min(x,y)$ (minimum)
$T_P(x,y) = xy$ (product)
$T_L(x,y) = \max(x+y-1, 0)$ (Łukasiewicz)
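The three t-norms of Example 4 translate directly into elementwise TensorFlow operations. The following is an illustrative sketch, with names of our own choosing:

import tensorflow as tf

def T_minimum(x, y):      # T_M(x, y) = min(x, y)
    return tf.minimum(x, y)

def T_product(x, y):      # T_P(x, y) = x * y
    return x * y

def T_lukasiewicz(x, y):  # T_L(x, y) = max(x + y - 1, 0)
    return tf.maximum(x + y - 1.0, 0.0)

x, y = tf.constant(0.8), tf.constant(0.6)
print(T_minimum(x, y).numpy(),      # 0.6
      T_product(x, y).numpy(),      # 0.48
      T_lukasiewicz(x, y).numpy())  # 0.4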
Appendix B.3. Disjunction
Name | $a \otimes b$ | $a \oplus b$ | $a \Rightarrow_R c$ | $a \Rightarrow_S c$
Gödel | $\min(a,b)$ | $\max(a,b)$ | $1$ if $a \le c$, $c$ otherwise | $\max(1-a, c)$
Goguen/Product | $ab$ | $a+b-ab$ | $1$ if $a \le c$, $c/a$ otherwise | $1-a+ac$
Łukasiewicz | $\max(a+b-1, 0)$ | $\min(a+b, 1)$ | $\min(1-a+c, 1)$ | $\min(1-a+c, 1)$
Table B.4: Common symmetric configurations.
Definition 10. A disjunction is a function $D : [0,1]^2 \to [0,1]$ that at least satisfies:
D1. Boundary conditions: $D(0,0) = 0$ and $D(0,1) = D(1,0) = D(1,1) = 1$,
D2. Monotonically increasing: $\forall (x,y,z) \in [0,1]^3$, if $x \le y$, then $D(x,z) \le D(y,z)$ and $D(z,x) \le D(z,y)$.
Disjunctions in fuzzy logic are often modeled with t-conorms.
Definition 11. A t-conorm (triangular conorm) is a function $S : [0,1]^2 \to [0,1]$ that at least satisfies:
S1. Boundary conditions: $S(x,0) = x$,
S2. monotonically increasing,
S3. commutative,
S4. associative.
Example 5. Three commonly used t-conorms are:
$S_M(x,y) = \max(x,y)$ (maximum)
$S_P(x,y) = x + y - xy$ (probabilistic sum)
$S_L(x,y) = \min(x+y, 1)$ (Łukasiewicz)
Note that the only pair of t-norm and t-conorm that is mutually distributive is $T_M$ and $S_M$, that is, the t-norm distributes over the t-conorm and vice versa.
Definition 12. The $N$-dual t-conorm $S$ of a t-norm $T$ w.r.t. a strict fuzzy negation $N$ is defined as:
(B.1) $\forall (x,y) \in [0,1]^2, \quad S(x,y) = N(T(N(x), N(y)))$.
If $N$ is a strong negation, we also get:
(B.2) $\forall (x,y) \in [0,1]^2, \quad T(x,y) = N(S(N(x), N(y)))$.
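For instance, the $N_S$-dual of the product t-norm recovers the probabilistic sum $S_P$, which the following small sketch (with helper names of our own choosing) verifies numerically:

def n_dual_conorm(T, N):
    # Definition 12: S(x, y) = N(T(N(x), N(y)))
    return lambda x, y: N(T(N(x), N(y)))

T_P = lambda x, y: x * y
N_S = lambda x: 1.0 - x

S = n_dual_conorm(T_P, N_S)
print(S(0.3, 0.5))  # 0.65 = 0.3 + 0.5 - 0.3*0.5, i.e. S_P(0.3, 0.5)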
Appendix B.4. Implication
Definition 13. An implication is a function $I : [0,1]^2 \to [0,1]$ that at least satisfies:
I1. Boundary conditions: $I(0,0) = I(0,1) = I(1,1) = 1$ and $I(1,0) = 0$.
Definition 14. There are two main classes of implications generated from the fuzzy logic operators for negation, conjunction and disjunction.
S-Implications: Strong implications are defined using $x \Rightarrow y = \neg x \lor y$ (material implication).
R-Implications: Residuated implications are defined using $x \Rightarrow y = \sup\{z \in [0,1] \mid x \otimes z \le y\}$. One way of understanding this approach is as a generalization of modus ponens: the consequent is at least as true as the (fuzzy) conjunction of the antecedent and the implication.
Example 6. Popular fuzzy implications and their classes are presented in Table B.5.
Name | $I(x,y) =$ | S-Implication | R-Implication
Kleene-Dienes $I_{KD}$ | $\max(1-x, y)$ | $S = S_M$, $N = N_S$ | -
Gödel $I_G$ | $1$ if $x \le y$, $y$ otherwise | - | $T = T_M$
Reichenbach $I_R$ | $1 - x + xy$ | $S = S_P$, $N = N_S$ | -
Goguen $I_P$ | $1$ if $x \le y$, $y/x$ otherwise | - | $T = T_P$
Łukasiewicz $I_{Luk}$ | $\min(1-x+y, 1)$ | $S = S_L$, $N = N_S$ | $T = T_L$
Table B.5: Popular fuzzy implications and their classes. Strong implications (S-implications) are defined using a fuzzy negation and a fuzzy disjunction. Residuated implications (R-implications) are defined using a fuzzy conjunction.
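As an illustration (names are ours), the closed forms in Table B.5 can be implemented with elementwise TensorFlow operations. Note that the $y/x$ branch of the Goguen implication already hints at the exploding-gradient edge case discussed in Appendix C:

import tensorflow as tf

def I_kleene_dienes(x, y):  # S-implication: max(1 - x, y)
    return tf.maximum(1.0 - x, y)

def I_goguen(x, y):         # R-implication: 1 if x <= y, else y / x
    return tf.where(x <= y, tf.ones_like(x), y / x)

x, y = tf.constant(0.9), tf.constant(0.3)
print(I_kleene_dienes(x, y).numpy())  # 0.3
print(I_goguen(x, y).numpy())         # 0.333...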

Appendix B.5. Aggregation

Definition 15. An aggregation operator is a function $A : \bigcup_{n \in \mathbb{N}} [0,1]^n \to [0,1]$ that at least satisfies:
A1. $A(x_1, \dots, x_n) \le A(y_1, \dots, y_n)$ whenever $x_i \le y_i$ for all $i \in \{1, \dots, n\}$,
A2. $A(x) = x$ for all $x \in [0,1]$,
A3. $A(0, \dots, 0) = 0$ and $A(1, \dots, 1) = 1$.
Example 7. Candidates for universal quantification can be obtained from t-norms with $A_T(x_i) = x_i$ and $A_T(x_1, \dots, x_n) = T(x_1, A_T(x_2, \dots, x_n))$:
$A_{T_M}(x_1, \dots, x_n) = \min(x_1, \dots, x_n)$ (minimum)
$A_{T_P}(x_1, \dots, x_n) = \prod_{i=1}^{n} x_i$ (product)
$A_{T_L}(x_1, \dots, x_n) = \max(\sum_{i=1}^{n} x_i - n + 1, 0)$ (Łukasiewicz)
Similarly, candidates for existential quantification can be obtained from t-conorms with $A_S(x_i) = x_i$ and $A_S(x_1, \dots, x_n) = S(x_1, A_S(x_2, \dots, x_n))$:
$A_{S_M}(x_1, \dots, x_n) = \max(x_1, \dots, x_n)$ (maximum)
$A_{S_P}(x_1, \dots, x_n) = 1 - \prod_{i=1}^{n} (1 - x_i)$ (probabilistic sum)
$A_{S_L}(x_1, \dots, x_n) = \min(\sum_{i=1}^{n} x_i, 1)$ (Łukasiewicz)
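A sketch of the aggregators of Example 7 over a tensor of truth values, using TensorFlow reductions (the function names are illustrative only):

import tensorflow as tf

def Forall_min(xs):          # A_{T_M}: min over all truth values
    return tf.reduce_min(xs)

def Forall_prod(xs):         # A_{T_P}: product over all truth values
    return tf.reduce_prod(xs)

def Forall_lukasiewicz(xs):  # A_{T_L}: max(sum - n + 1, 0)
    n = tf.cast(tf.size(xs), xs.dtype)
    return tf.maximum(tf.reduce_sum(xs) - n + 1.0, 0.0)

def Exists_prob_sum(xs):     # A_{S_P}: 1 - prod(1 - x_i)
    return 1.0 - tf.reduce_prod(1.0 - xs)

xs = tf.constant([0.9, 0.8, 0.95])
print(Forall_min(xs).numpy(),          # 0.8
      Forall_prod(xs).numpy(),         # 0.684
      Forall_lukasiewicz(xs).numpy(),  # 0.65
      Exists_prob_sum(xs).numpy())     # 0.999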
Property | $(T_M, S_M, N_S)$ |  | $(T_P, S_P, N_S)$ |  | $(T_L, S_L, N_S)$
 | $I_{KD}$ | $I_G$ | $I_R$ | $I_P$ | $I_{Luk}$
Commutativity of $\otimes$, $\oplus$ | ✓ | ✓ | ✓ | ✓ | ✓
Associativity of $\otimes$, $\oplus$ | ✓ | ✓ | ✓ | ✓ | ✓
Distributivity of $\otimes$ over $\oplus$ | ✓ | ✓ | ✗ | ✗ | ✗
Distributivity of $\oplus$ over $\otimes$ | ✓ | ✓ | ✗ | ✗ | ✗
Distrib. of $\Rightarrow$ over $\otimes$, $\oplus$ | ✓ | ✓ | ✗ | ✗ | ✗
Double negation $\neg\neg p = p$ | ✓ | ✓ | ✓ | ✓ | ✓
Law of excluded middle | ✗ | ✗ | ✗ | ✗ | ✓
Law of non-contradiction | ✗ | ✗ | ✗ | ✗ | ✓
De Morgan's laws | ✓ | ✓ | ✓ | ✓ | ✓
Material implication | ✓ | ✗ | ✓ | ✗ | ✓
Contraposition | ✓ | ✗ | ✓ | ✗ | ✓
Table B.6: Common properties for different configurations.
Following are other common aggregators:
$A_M(x_1, \dots, x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$ (mean)
$A_{pM}(x_1, \dots, x_n) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^p\right)^{\frac{1}{p}}$ (p-mean)
$A_{pME}(x_1, \dots, x_n) = 1 - \left(\frac{1}{n}\sum_{i=1}^{n} (1 - x_i)^p\right)^{\frac{1}{p}}$ (p-mean error)
where $A_{pM}$ is the generalized mean, and $A_{pME}$ can be understood as the generalized mean measured w.r.t. the errors. That is, $A_{pME}$ measures the power of the deviation of each value from the ground truth 1. A few particular values of $p$ yield special cases of aggregators. Notably:
$\lim_{p \to +\infty} A_{pM}(x_1, \dots, x_n) = \max(x_1, \dots, x_n)$,
$\lim_{p \to -\infty} A_{pM}(x_1, \dots, x_n) = \min(x_1, \dots, x_n)$,
$\lim_{p \to +\infty} A_{pME}(x_1, \dots, x_n) = \min(x_1, \dots, x_n)$,
$\lim_{p \to -\infty} A_{pME}(x_1, \dots, x_n) = \max(x_1, \dots, x_n)$.
These "smooth" min (resp. max) approximators are good candidates for $\forall$ (resp. $\exists$) in a fuzzy context. The value of $p$ leaves more or less room for outliers, depending on the use case and its needs. Note that $A_{pME}$ and $A_{pM}$ are related in the same way that $\forall$ and $\exists$ are related via the definition $\forall \equiv \neg\exists\neg$, where $\neg$ is approximated by the standard negation.
We propose to use $A_{pME}$ with $p \ge 1$ to approximate $\forall$ and $A_{pM}$ with $p \ge 1$ to approximate $\exists$. When $p \ge 1$, these operators resemble the $L^p$ norm of a vector $u = (u_1, u_2, \dots, u_n)$, where $\|u\|_p = (|u_1|^p + |u_2|^p + \dots + |u_n|^p)^{1/p}$. In this case, many properties of the $L^p$ norm apply to $A_{pM}$ (positive homogeneity, triangle inequality, ...).
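A minimal sketch of both aggregators (with illustrative names), showing how increasing $p$ moves $A_{pM}$ toward the maximum and $A_{pME}$ toward the minimum:

import tensorflow as tf

def A_pM(xs, p):   # generalized mean
    return tf.reduce_mean(xs ** p) ** (1.0 / p)

def A_pME(xs, p):  # generalized mean of the deviations from 1
    return 1.0 - tf.reduce_mean((1.0 - xs) ** p) ** (1.0 / p)

xs = tf.constant([0.1, 0.9, 0.95])
for p in (1.0, 2.0, 10.0):
    print(p, A_pM(xs, p).numpy(), A_pME(xs, p).numpy())
# As p grows, A_pM tends toward max(xs) = 0.95 (existential reading)
# and A_pME tends toward min(xs) = 0.1 (universal reading).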

Appendix C. Analyzing Gradients of Generalized Mean Aggregators

[69] show that some operators used in fuzzy logics are unsuitable for use in a differentiable learning setting. Three types of gradient problems commonly arise with fuzzy logic operators.
Single-Passing The derivatives of some operators are non-null for only one argument. The gradients propagate to only one input at a time.
Vanishing Gradients The gradients vanish on some part of the domain. The learning does not update inputs that are in the vanishing domain.
Exploding Gradients Large error gradients accumulate and result in unstable updates.
Tables C.7 and C.8 summarize their conclusions for the most common operators. We also underline exploding gradient issues that arise experimentally in $A_{pM}$ and $A_{pME}$, which are not covered in the original report. Given the truth values of $n$ propositions $(x_1, \dots, x_n) \in [0,1]^n$:
1. $A_{pM}(x_1, \dots, x_n) = \left(\frac{1}{n}\sum_i x_i^p\right)^{\frac{1}{p}}$
The partial derivatives are $\frac{\partial A_{pM}(x_1, \dots, x_n)}{\partial x_i} = \frac{1}{n^{\frac{1}{p}}}\left(\sum_{j=1}^{n} x_j^p\right)^{\frac{1}{p}-1} x_i^{p-1}$.
When $p > 1$, the operator weights inputs with a higher truth value more, i.e. their partial derivative is higher, and is therefore suited for existential quantification. When $p < 1$, the operator weights inputs with a lower truth value more and is suited for universal quantification.
Exploding Gradients: when $p > 1$, if $\sum_{j=1}^{n} x_j^p \to 0$, then $\left(\sum_{j=1}^{n} x_j^p\right)^{\frac{1}{p}-1} \to \infty$ and the gradients explode. When $p < 1$, if $x_i \to 0$, then $x_i^{p-1} \to \infty$.
2. $A_{pME}(x_1, \dots, x_n) = 1 - \left(\frac{1}{n}\sum_i (1-x_i)^p\right)^{\frac{1}{p}}$
The partial derivatives are $\frac{\partial A_{pME}(x_1, \dots, x_n)}{\partial x_i} = \frac{1}{n^{\frac{1}{p}}}\left(\sum_{j=1}^{n} (1-x_j)^p\right)^{\frac{1}{p}-1} (1-x_i)^{p-1}$. When $p > 1$, the operator weights inputs with a lower truth value more, i.e. their partial derivative is higher, and is therefore suited for universal quantification. When $p < 1$, the operator weights inputs with a higher truth value more and is suited for existential quantification.
Exploding Gradients: when $p > 1$, if $\sum_{j=1}^{n} (1-x_j)^p \to 0$, then $\left(\sum_{j=1}^{n} (1-x_j)^p\right)^{\frac{1}{p}-1} \to \infty$ and the gradients explode. When $p < 1$, if $1 - x_i \to 0$, then $(1-x_i)^{p-1} \to \infty$.
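The $p < 1$ case is easy to reproduce numerically. In the following sketch (our own naming), the partial derivative of $A_{pM}$ with respect to an input grows without bound as that input approaches 0:

import tensorflow as tf

def A_pM(xs, p):
    return tf.reduce_mean(xs ** p) ** (1.0 / p)

for tiny in (1e-2, 1e-4, 1e-6):
    xs = tf.constant([tiny, 0.5, 0.5], dtype=tf.float64)
    with tf.GradientTape() as tape:
        tape.watch(xs)
        out = A_pM(xs, p=0.5)
    # The gradient w.r.t. the first input scales like x^(p-1) = x^(-0.5).
    print(tiny, tape.gradient(out, xs).numpy()[0])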
Connective | Single-Passing | Vanishing | Exploding
Gödel (minimum) | | |
$T_M$, $S_M$ | x | |
$I_{KD}$ | x | |
$I_G$ | x | x |
Goguen (product) | | |
$T_P$, $S_P$ | | (x) |
$I_R$ | | (x) |
$I_P$ | | x | (x)
Łukasiewicz | | |
$T_L$, $S_L$ | | x |
$I_{Luk}$ | | x |
Table C.7: Gradient problems for some binary connectives. (x) means that the problem only appears on an edge case.

Aggregator | Single-Passing | Vanishing | Exploding
$A_{T_M}$ / $A_{S_M}$ | x | |
$A_{T_P}$ / $A_{S_P}$ | | x |
$A_{T_L}$ / $A_{S_L}$ | | x |
$A_{pM}$ | | | (x)
$A_{pME}$ | | | (x)
Table C.8: Gradient problems for some aggregators. (x) means that the problem only appears on an edge case.

We propose the following stable product configuration, which does not have any of the aforementioned gradient problems:
(C.1) $\pi_0(x) = (1 - \epsilon)x + \epsilon$
(C.2) $\pi_1(x) = (1 - \epsilon)x$
(C.3) $N_S(x) = 1 - x$
(C.4) $T_P(x,y) = \pi_0(x)\,\pi_0(y)$
(C.5) $S_P(x,y) = \pi_1(x) + \pi_1(y) - \pi_1(x)\,\pi_1(y)$
(C.6) $I_R(x,y) = 1 - \pi_0(x) + \pi_0(x)\,\pi_1(y)$
(C.7) $A_{pM}(x_1, \dots, x_n) = \left(\frac{1}{n}\sum_{i=1}^{n} \pi_0(x_i)^p\right)^{\frac{1}{p}}, \quad p \ge 1$
(C.8) $A_{pME}(x_1, \dots, x_n) = 1 - \left(\frac{1}{n}\sum_{i=1}^{n} (1 - \pi_1(x_i))^p\right)^{\frac{1}{p}}, \quad p \ge 1$
$N_S$ is the operator for negation, $T_P$ for conjunction, $S_P$ for disjunction, $I_R$ for implication, $A_{pM}$ for existential aggregation, and $A_{pME}$ for universal aggregation.
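The configuration translates directly into code. The following sketch (our own naming, with an illustrative choice of $\epsilon$) applies the projections $\pi_0$ and $\pi_1$ before the product operators, so that gradients stay finite even at truth values of exactly 0 or 1:

import tensorflow as tf

EPS = 1e-4  # illustrative value of epsilon

def pi_0(x):                # (C.1): maps [0,1] into [eps, 1]
    return (1.0 - EPS) * x + EPS

def pi_1(x):                # (C.2): maps [0,1] into [0, 1-eps]
    return (1.0 - EPS) * x

def N_S(x):                 # (C.3)
    return 1.0 - x

def T_P(x, y):              # (C.4)
    return pi_0(x) * pi_0(y)

def S_P(x, y):              # (C.5)
    return pi_1(x) + pi_1(y) - pi_1(x) * pi_1(y)

def I_R(x, y):              # (C.6)
    return 1.0 - pi_0(x) + pi_0(x) * pi_1(y)

def A_pME(xs, p=2.0):       # (C.8): stable universal aggregator
    return 1.0 - tf.reduce_mean((1.0 - pi_1(xs)) ** p) ** (1.0 / p)

xs = tf.constant([0.0, 1.0, 0.5])  # endpoints included on purpose
with tf.GradientTape() as tape:
    tape.watch(xs)
    out = A_pME(xs)
print(tape.gradient(out, xs).numpy())  # finite gradients everywhere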