A Logical Language with a Prototypical Semantics


ABSTRACT: In a pair of papers from 1995 and 1997, I developed a computational theory of legal argument, but left open a question about the key concept of a "prototype." Contemporary trends in machine learning have now shed new light on the subject. In this talk, I will describe my recent work on "manifold learning," as well as some work in progress on "deep learning." Taken together, this work leads to a logical language grounded in a prototypical perceptual semantics, with implications for legal theory.


1. A Logical Language with a Prototypical Semantics
L. Thorne McCarty, Rutgers University

2. Background Reading
An Implementation of Eisner v. Macomber, in ICAIL-'95: a computational reconstruction of a corporate tax case, based on a theory of prototypes and deformations.
Some Arguments About Legal Arguments, in ICAIL-'97: a critical review of the literature, with a discussion of The Correct Theory in Section 5.

3. ICAIL-'97, Section 5
"Most machine learning algorithms assume that concepts have classical definitions, with necessary and sufficient conditions, but legal concepts tend to be defined by prototypes. When you first look at prototype models [Smith and Medin, 1981], they seem to make the learning problem harder, rather than easier, since the space of possible concepts seems to be exponentially larger in these models than it is in the classical model. But empirically, this is not the case. Somehow, the requirement that the exemplar of a concept must be similar to a prototype (a kind of horizontal constraint) seems to reinforce the requirement that the exemplar must be placed at some determinate level of the concept hierarchy (a kind of vertical constraint). How is this possible? This is one of the great mysteries of cognitive science. It is also one of the great mysteries of legal theory. ..."

4. Manifold Learning
S. Rifai, Y.N. Dauphin, P. Vincent, Y. Bengio, X. Muller, The Manifold Tangent Classifier, in NIPS 2011. Three hypotheses:
1. ...
2. The (unsupervised) manifold hypothesis, according to which real world data presented in high dimensional spaces is likely to concentrate in the vicinity of non-linear sub-manifolds of much lower dimensionality ... [citations omitted]
3. The manifold hypothesis for classification, according to which points of different classes are likely to concentrate along different sub-manifolds, separated by low density regions of the input space.

5. Manifold Learning
L.T. McCarty, Clustering, Coding and the Concept of Similarity, preprint, arXiv:1401.2411 [cs.LG] (10 Jan 2014). A theory of clustering and coding which combines a geometric model with a probabilistic model in a principled way.
Geometric model: a Riemannian manifold with a Riemannian metric, which is interpreted as a measure of dissimilarity.
Probabilistic model: a stochastic process with an invariant probability measure, which matches the density of the sample input data.
The models are linked by a potential function, U(x), and its gradient, ∇U(x). The dissimilarity metric is used to define a low-dimensional coordinate system on the embedded Riemannian manifold.

6. Probabilistic Model
Stochastic Process: Brownian motion with a drift term, ∇U(x).
Invariant Probability Measure: proportional to e^{U(x)}.
Example: ∇U(x, y, z) = (-ax, -by, -cz), with U(x, y, z) = -(1/2)(ax^2 + by^2 + cz^2). (A short simulation sketch of this model follows slide 9.)
[Plot of the density over the (x, y) plane.]

7. Probabilistic Model
Stochastic Process: Brownian motion with a drift term, ∇U(x).
Invariant Probability Measure: proportional to e^{U(x)}.
Examples with higher-degree potentials: U(x, y, z) with sixth-degree terms, x^6, y^6, z^6, and with fifth-degree terms, x^5, y^5, z^5.
[Plots of the densities over the (x, y) plane.]

8. Geometric Model
Prototype Coding: a radial coordinate, which follows ∇U(x), and directional coordinates (indexed 1, 2, ...), orthogonal to ∇U(x).
[3-D plots.]

9. Geometric Model
Prototype Coding (continued): the radial coordinate follows ∇U(x); the directional coordinates are orthogonal to ∇U(x).
[3-D plots.]
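The probabilistic model of slides 6 and 7 can be made concrete with a short simulation. The sketch below is only an illustration, not McCarty's implementation: it assumes the quadratic potential of slide 6 with arbitrary coefficients a, b, c, and it uses the standard drift convention (1/2)∇U(x), under which the stationary density of the diffusion is proportional to e^{U(x)}.

```python
# Minimal sketch: Euler-Maruyama simulation of Brownian motion with drift,
#     dX_t = (1/2) grad U(X_t) dt + dW_t,
# whose stationary density is proportional to exp(U(x)).
# Potential from slide 6 (coefficients a, b, c are illustrative assumptions):
#     U(x, y, z) = -(1/2) * (a*x^2 + b*y^2 + c*z^2)
import numpy as np

a, b, c = 1.0, 2.0, 4.0

def grad_U(p):
    x, y, z = p
    return np.array([-a * x, -b * y, -c * z])

rng = np.random.default_rng(0)
dt, n_steps, burn_in = 0.01, 100_000, 10_000
p = np.zeros(3)
samples = []
for t in range(n_steps):
    # The drift pulls the walk toward the prototype at the origin,
    # where the density exp(U) is highest.
    p = p + 0.5 * grad_U(p) * dt + np.sqrt(dt) * rng.standard_normal(3)
    if t >= burn_in:
        samples.append(p.copy())
samples = np.array(samples)

# For this potential, exp(U) is a Gaussian with variances 1/a, 1/b, 1/c,
# so the empirical variances of the samples should be close to those values.
print("empirical variances:  ", samples.var(axis=0))
print("theoretical variances:", np.array([1 / a, 1 / b, 1 / c]))
```

The samples concentrate near the prototype at the origin, which is the picture behind the prototype coding on slides 8 and 9: the radial coordinate measures how far an exemplar sits from the prototype along the gradient flow of U.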
10. Geometric Model
How to construct a low-dimensional manifold? Define a Riemannian metric, g_ij(x), which is interpreted as a measure of dissimilarity: g_00(x) = ||∇U(x)||^2, and g_ij(x) depends only on ∇U(x). The dissimilarity should be small in a region in which the probability density is high, and vice versa. Overall: minimize dissimilarity and maximize probability. Because of the prominent role of the Riemannian metric in the theory, we will refer to it as a theory of differential similarity.

11. Geometric Model
Principal Axis: find a point at a fixed Euclidean distance from the origin for which the Riemannian distance from the origin is minimal.
[3-D plot.]

12. Geometric Model
Principal Directions: diagonalize the Riemannian matrix, g_ij(x), at a point on the principal axis, and select the eigenvectors associated with the k-1 smallest eigenvalues. (A short sketch of this step appears after slide 22.)
[3-D plot.]

13. Geometric Model
Coordinate Curves: follow the geodesics of the Riemannian metric, g_ij(x), in each of the k-1 principal directions. The radial coordinate, along with the k-1 geodesics, defines a k-dimensional submanifold embedded in the original n-dimensional space.
[3-D plot.]

14. Prototypical Clusters
The probability density is a mixture: e^{U(x)} = p_1 e^{U_1(x)} + p_2 e^{U_2(x)}. U_1(x) and U_2(x) can be computed independently. These two clusters are exponentially far apart.

15. Deep Learning
S. Rifai, Y.N. Dauphin, P. Vincent, Y. Bengio, X. Muller, The Manifold Tangent Classifier, in NIPS 2011. Three hypotheses:
1. The semi-supervised learning hypothesis, according to which learning aspects of the input distribution p(x) can improve models of the conditional distribution of the supervised target p(y|x) ... [citation omitted]. This hypothesis underlies not only the strict semi-supervised setting where one has many more unlabeled examples at his disposal than labeled ones, but also the successful unsupervised pretraining approach for learning deep architectures [citations omitted].
2. ...
3. ...

16. Standard Example
Historically, MNIST has been used as a benchmark for supervised learning: Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, 86(11):2278-2324 (November 1998).
We will treat it as a problem in unsupervised learning: L.T. McCarty, Differential Similarity in Higher-Dimensional Spaces: Theory and Applications, forthcoming (2014).
MNIST Dataset: 28 x 28 pixels; 60,000 training set images; 10,000 test set images.

17. Standard Architecture
Sample → 7x7 patches from the 60,000 images (600,000 patches, 49 dimensions) → encode into 12 dimensions → scan → 14x14 patches (48 dimensions) → encode into 12 dimensions → scan → 48 dimensions → encode into 12 dimensions → Category: 4.
[Architecture diagram.]

18. Prototype Coding
Prototypes: points where ∇U(x) = 0. U(x) is estimated from the data and smoothed. Partition the space.

19. Prototype Coding
Principal Axes: find the point at a fixed Euclidean distance from the origin that has the minimal Riemannian distance from the origin.

20. Prototype Coding
Coordinate Curves: compute the principal eigenvectors at the extremal points on the principal axes, and then compute the geodesic curves in each principal direction.

21. Prototype Coding
Radial Coordinates: follow ∇U(x) outward, from the prototype through a point on one of the coordinate curves.

22. A Logical Language
[Slide shows a clause whose conditions are joined by AND ... AND ... AND.] This is a Horn Clause. The logic is intuitionistic, not classical.
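Slides 11-13 and slides 18-21 walk through the same prototype-coding construction, first in the abstract and then on the MNIST data: locate a point on a principal axis, diagonalize the Riemannian metric there, keep the eigenvectors with the k-1 smallest eigenvalues, and follow geodesics in those directions. The sketch below illustrates only the diagonalization step of slides 12 and 20. The slides say just that g_ij(x) depends on ∇U(x), so the concrete metric used here, g(x) = I + ∇U(x)∇U(x)^T, is a hypothetical stand-in, chosen so that movement along the gradient (the radial direction) counts as the most dissimilar direction; the potential coefficients and the sample point are likewise assumptions.

```python
# Minimal sketch of the principal-directions step (slides 12 and 20); not McCarty's code.
# Hypothetical metric: g(x) = I + grad U(x) grad U(x)^T, a stand-in for the
# (unspecified) metric g_ij(x), which the slides say depends only on grad U(x).
import numpy as np

a, b, c = 1.0, 2.0, 4.0                  # same illustrative potential as above

def grad_U(p):
    x, y, z = p
    return np.array([-a * x, -b * y, -c * z])

def metric(p):
    g = grad_U(p)
    return np.eye(3) + np.outer(g, g)    # hypothetical dissimilarity metric

point = np.array([2.0, 1.0, 0.5])        # an assumed point on a principal axis
eigvals, eigvecs = np.linalg.eigh(metric(point))   # eigenvalues in ascending order

k = 2                                    # dimension of the coded submanifold
principal_dirs = eigvecs[:, :k - 1]      # k-1 eigenvectors, smallest eigenvalues

# With this metric, the selected directions are orthogonal to grad U(x),
# so they play the role of the directional coordinates on slide 8.
print(principal_dirs.T @ grad_U(point))  # close to zero
```

In a full implementation, the geodesics of g_ij(x) would then be followed in each selected direction (slides 13 and 20), and the radial coordinate would follow ∇U(x) outward from the prototype (slide 21).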
23. Prototypes and Deformations
ICAIL-'97, Section 5: "Somehow, the requirement that the exemplar of a concept must be similar to a prototype (a kind of horizontal constraint) seems to reinforce the requirement that the exemplar must be placed at some determinate level of the concept hierarchy (a kind of vertical constraint). How is this possible? This is one of the great mysteries of cognitive science. It is also one of the great mysteries of legal theory."
Q: Is the mystery now solved?