Why Philosophers Should Care About Computational Complexity

Scott Aaronson*

Abstract

One might think that, once we know something is computable, how efficiently it can be computed is a practical question with little further philosophical importance. In this essay, I offer a detailed case that one would be wrong. In particular, I argue that computational complexity theory—the field that studies the resources (such as time, space, and randomness) needed to solve computational problems—leads to new perspectives on the nature of mathematical knowledge, the strong AI debate, computationalism, the problem of logical omniscience, Hume’s problem of induction, Goodman’s grue riddle, the foundations of quantum mechanics, economic rationality, closed timelike curves, and several other topics of philosophical interest. I end by discussing aspects of complexity theory itself that could benefit from philosophical analysis.

Contents

1 Introduction
1.1 What This Essay Won’t Cover
2 Complexity 101
3 The Relevance of Polynomial Time
3.1 The Entscheidungsproblem Revisited
3.2 Evolvability
3.3 Known Integers
3.4 Summary
4 Computational Complexity and the Turing Test
4.1 The Lookup-Table Argument
4.2 Relation to Previous Work
4.3 Can Humans Solve NP-Complete Problems Efficiently?
4.4 Summary
5 The Problem of Logical Omniscience
5.1 The Cobham Axioms
5.2 Omniscience Versus Infinity
5.3 Summary

*MIT. Email: aaronson@csail.mit.edu. This material is based upon work supported by the National Science Foundation under Grant No. 0844626. Also supported by a DARPA YFA grant, the Sloan Foundation, and a TIBCO Chair.

6 Computationalism and Waterfalls
6.1 “Reductions” That Do All The Work
7 PAC-Learning and the Problem of Induction
7.1 Drawbacks of the Basic PAC Model
7.2 Computational Complexity, Bleen, and Grue
8 Quantum Computing
8.1 Quantum Computing and the Many-Worlds Interpretation
9 New Computational Notions of Proof
9.1 Zero-Knowledge Proofs
9.2 Other New Notions
10 Complexity, Space, and Time
10.1 Closed Timelike Curves
10.2 The Evolutionary Principle
10.3 Closed Timelike Curve Computation
11 Economics
11.1 Bounded Rationality and the Iterated Prisoners’ Dilemma
11.2 The Complexity of Equilibria
12 Conclusions
12.1 Criticisms of Complexity Theory
12.2 Future Directions
13 Acknowledgments

1 Introduction

The view that machines cannot give rise to surprises is due, I believe, to a fallacy to which philosophers and mathematicians are particularly subject. This is the assumption that as soon as a fact is presented to a mind all consequences of that fact spring into the mind simultaneously with it. It is a very useful assumption under many circumstances, but one too easily forgets that it is false. —Alan M. Turing [126]

The theory of computing, created by Alan Turing, Alonzo Church, Kurt Gödel, and others in the 1930s, didn’t only change civilization; it also had a lasting impact on philosophy. Indeed, clarifying philosophical issues was the original point of their work; the technological payoffs only came later! Today, it would be hard to imagine a serious discussion about (say) the philosophy of mind, the foundations of mathematics, or the prospects of machine intelligence that was uninformed by this revolution in human knowledge three-quarters of a century ago.

However, as computers became widely available starting in the 1960s, computer scientists increasingly came to see computability theory as not asking quite the right questions. For almost all the problems we actually want to solve turn out to be computable in Turing’s sense; the real question is which problems are efficiently or feasibly computable. The latter question gave rise to a new field, called computational complexity theory (not to be confused with the “other” complexity theory, which studies complex systems such as cellular automata). Since the 1970s, computational complexity theory has witnessed some spectacular discoveries, which include NP-completeness, public-key cryptography, new types of mathematical proof (such as probabilistic, interactive, and zero-knowledge proofs), and the theoretical foundations of machine learning and quantum computation. To people who work on these topics, the work of Gödel and Turing may look in retrospect like just a warmup to the “big” questions about computation.

Because of this, I find it surprising that complexity theory has not influenced philosophy to anything like the extent computability theory has. The question arises: why hasn't it? Several possible answers spring to mind: maybe computability theory just had richer philosophical implications. (Though as we'll see, one can make a strong case for exactly the opposite.) Maybe complexity has essentially the same philosophical implications as computability, and computability got there first. Maybe outsiders are scared away from learning complexity theory by the “math barrier.” Maybe the explanation is social: the world where Gödel, Turing, Wittgenstein, and Russell participated in the same intellectual conversation vanished with World War II; after that, theoretical computer science came to be driven by technology and lost touch with its philosophical origins. Maybe recent advances in complexity theory simply haven't had enough time to enter philosophical consciousness.

However, I suspect that part of the answer is just complexity theorists' failure to communicate what they can add to philosophy's conceptual arsenal. Hence this essay, whose modest goal is to help correct that failure, by surveying some aspects of complexity theory that might interest philosophers, as well as some philosophical problems that I think a complexity perspective can clarify.

To forestall misunderstandings, let me add a note of humility before going further. This essay will touch on many problems that philosophers have debated for generations, such as strong AI, the problem of induction, the relation between syntax and semantics, and the interpretation of quantum mechanics. In none of these cases will I claim that computational complexity theory “dissolves” the philosophical problem—only that it contributes useful perspectives and insights. I'll often explicitly mention philosophical puzzles that I think a complexity analysis either leaves untouched or else introduces itself. But even where I don't do so, one shouldn't presume that I think there are no such puzzles! Indeed, one of my hopes for this essay is that computer scientists, mathematicians, and other technical people who read it will come away with a better appreciation for the subtlety of some of the problems considered in modern analytic philosophy.1

1.1 What This Essay Won't Cover

I won't try to discuss every possible connection between computational complexity and philosophy, or even every connection that's already been made. A small number of philosophers have long invoked computational complexity ideas in their work; indeed, the “philpapers archive” lists 32 papers under the heading Computational Complexity.2 The majority of those papers prove theorems about the computational complexities of various logical systems. Of the remaining papers, some use “computational complexity” in a different sense than I do—for example, to encompass computability theory—and some invoke the concept of computational complexity, but no particular results from the field devoted to it. Perhaps the closest in spirit to this essay are the interesting articles by Cherniak [40] and Morton [97]. In addition, many writers have made some version of the observations in Section 4, about computational complexity and the Turing Test: see for example Block [30], Parberry [101], Levesque [87], and Shieber [116].


1When I use the word “philosophy” in this essay, I'll mean philosophy within the analytic tradition. I don't understand Continental or Eastern philosophy well enough to say whether they have any interesting connections with computational complexity theory.

2See philpapers.org/browse/computational-complexity

In deciding which connections to include in this essay, I adopted the following ground rules:

  1. (1) The connection must involve a “properly philosophical” problem—for example, the justification for induction or the nature of mathematical knowledge—and not just a technical problem in logic or model theory.
  2. (2) The connection must draw on specific insights from the field of computational complexity theory: not just the idea of complexity, or the fact that there exist hard problems.

There are many philosophically-interesting ideas in modern complexity theory that this essay mentions only briefly or not at all. One example is pseudorandom generators (see Goldreich [63]): functions that convert a short random “seed” into a long string of bits that, while not truly random, is so “random-looking” that no efficient algorithm can detect any regularities in it. While pseudorandom generators in this sense are not yet proved to exist,3 there are many plausible candidates, and the belief that at least some of the candidates work is central to modern cryptography. (Section 7.1 will invoke the related concept of pseudorandom functions.) A second example is fully homomorphic encryption: an extremely exciting new class of methods, the first of which was announced by Gentry [60] in 2009, for performing arbitrary computations on encrypted data without ever decrypting the data. The output of such a computation will look like meaningless gibberish to the person who computed it, but it can nevertheless be understood (and even recognized as the correct output) by someone who knows the decryption key. What are the implications of pseudorandom generators for the foundations of probability, or of fully homomorphic encryption for debates about the semantic meaning of computations? I very much hope that this essay will inspire others to tackle these and similar questions.

Outside of computational complexity, there are at least three other major intersection points between philosophy and modern theoretical computer science. The first one is the semantics of programming languages, which has large and obvious connections to the philosophy of language.4 The second is distributed systems theory, which provides both an application area and a rich source of examples for philosophical work on reasoning about knowledge (see Fagin et al. [53] and Stalnaker [123]). The third is Kolmogorov complexity (see Li and Vitányi [89]) which studies the length of the shortest computer program that achieves some functionality, disregarding time, memory, and other resources used by the program.5

In this essay, I won’t discuss any of these connections, except in passing (for example, Section 5 touches on logics of knowledge in the context of the “logical omniscience problem,” and Section 7 touches on Kolmogorov complexity in the context of PAC-learning). In defense of these omissions, let me offer four excuses. First, these other connections fall outside my stated topic. Second, they would make this essay even longer than it already is. Third, I lack the requisite background. And fourth, my impression is that philosophers—at least some philosophers—are already more aware of these other connections than they are of the computational complexity connections that I want to explain.


3The conjecture that pseudorandom generators exist implies the $P \neq NP$ conjecture (about which more later), but might be even stronger: the converse implication is unknown.

4The Stanford Encyclopedia of Philosophy entry on “The Philosophy of Computer Science,” plato.stanford.edu/entries/computer-science, devotes most of its space to this connection.

5A variant, “resource-bounded Kolmogorov complexity,” does take time and memory into account, and is part of computational complexity theory proper.

2 Complexity 101

Computational complexity theory is a huge, sprawling field; naturally this essay will only touch on small parts of it. Readers who want to delve deeper into the subject are urged to consult one of the many outstanding textbooks, such as those of Sipser [122], Papadimitriou [100], Moore and Mertens [95], Goldreich [62], or Arora and Barak [15]; or survey articles by Wigderson [133, 134], Fortnow and Homer [58], or Stockmeyer [124].

One might think that, once we know something is computable, whether it takes 10 seconds or 20 seconds to compute is obviously the concern of engineers rather than philosophers. But that conclusion would not be so obvious, if the question were one of 10 seconds versus $10^{10^{10}}$ seconds! And indeed, in complexity theory, the quantitative gaps we care about are usually so vast that one has to consider them qualitative gaps as well. Think, for example, of the difference between reading a 400-page book and reading every possible such book, or between writing down a thousand-digit number and counting to that number.

More precisely, complexity theory asks the question: how do the resources needed to solve a problem scale with some measure $n$ of the problem size: “reasonably” (like $n$ or $n^2$, say), or “unreasonably” (like $2^n$ or $n!$)? As an example, two $n$-digit integers can be multiplied using $\sim n^2$ computational steps (by the grade-school method), or even $\sim n \log n \log \log n$ steps (by more advanced methods [112]). Either method is considered efficient. By contrast, the fastest known method for the reverse operation—factoring an $n$-digit integer into primes—uses $\sim 2^{n^{1/3}}$ steps, which is considered inefficient.6 Famously, this conjectured gap between the inherent difficulties of multiplying and factoring is the basis for most of the cryptography currently used on the Internet.
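
To make this contrast concrete, here is a toy Python sketch of my own (real implementations, and the number field sieve mentioned below, are far more sophisticated): grade-school multiplication performs on the order of $n^2$ digit operations, while naive trial division can need on the order of $10^{n/2}$ steps.

```python
# Toy contrast between polynomial-time multiplication and (naive,
# exponential-time) factoring.  Illustrative only.

def multiply(x: int, y: int) -> int:
    """Grade-school multiplication: ~n^2 digit operations on n-digit inputs."""
    result = 0
    for i, digit in enumerate(reversed(str(y))):
        result += x * int(digit) * 10**i
    return result

def smallest_factor(m: int) -> int:
    """Trial division: up to ~sqrt(m), i.e. ~10^(n/2) steps for an n-digit m."""
    d = 2
    while d * d <= m:
        if m % d == 0:
            return d        # found a nontrivial divisor
        d += 1
    return m                # m is prime

assert multiply(43, 37) == 1591       # fast even for thousand-digit inputs
assert smallest_factor(1591) == 37    # slow already for hundred-digit inputs
```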

Theoretical computer scientists generally call an algorithm “efficient” if its running time can be upper-bounded by some polynomial function of $n$, and “inefficient” if its running time can be lower-bounded by some exponential function of $n$.7 These criteria have the great advantage of theoretical convenience. While the exact complexity of a problem might depend on “low-level encoding details,” such as whether our Turing machine has one or two memory tapes, or how the inputs are encoded as binary strings, where a problem falls on the polynomial/exponential dichotomy can be shown to be independent of almost all such choices.8 Equally important are the closure properties of polynomial and exponential time: a polynomial-time algorithm that calls a polynomial-time subroutine still yields an overall polynomial-time algorithm, while a polynomial-time algorithm that calls an exponential-time subroutine (or vice versa) yields an exponential-time algorithm. There are also more sophisticated reasons why theoretical computer scientists focus on polynomial time (rather than, say, $n^2$ time or $n^{\log n}$ time); we’ll explore some of those reasons in Section 5.1.
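
To spell out the first closure property with a routine calculation of my own: if the outer algorithm takes at most $p(n)$ steps on inputs of length $n$, and each step may invoke a subroutine on an input of length at most $p(n)$, at a cost of at most $q(p(n))$ steps per call, then the total running time is at most $p(n) \cdot q(p(n))$, which is again a polynomial whenever $p$ and $q$ are. By contrast, if the subroutine takes $q(m) = 2^m$ steps, then even a modest $p(n) = n^2$ gives the exponential bound $n^2 \cdot 2^{n^2}$.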


6This method is called the number field sieve, and the quoted running time depends on plausible but unproved conjectures in number theory. The best proven running time is $\sim 2^{\sqrt{n}}$ . Both of these represent nontrivial improvements over the naïve method of trying all possible divisors, which takes $\sim 2^n$ steps. See Pomerance [105] for a good survey of factoring algorithms.

7In some contexts, “exponential” means $c^n$ for some constant $c > 1$ , but in most complexity-theoretic contexts it can also mean $c^{n^d}$ for constants $c > 1$ and $d > 0$ .

8This is not to say that no details of the computational model matter: for example, some problems are known to be solvable in polynomial time on a quantum computer, but not known to be solvable in polynomial time on a classical computer! But in my view, the fact that the polynomial/exponential distinction can “notice” a modelling choice of this magnitude is a feature of the distinction, not a bug.

The polynomial/exponential distinction is open to obvious objections: an algorithm that took $1.0000001^n$ steps would be much faster in practice than an algorithm that took $n^{10000}$ steps! Furthermore, there are many growth rates that fall between polynomial and exponential, such as $n^{\log n}$ and $2^{2^{\sqrt{\log n}}}$. But empirically, polynomial time turned out to correspond to “efficient in practice,” and exponential time to “inefficient in practice,” so often that complexity theorists became comfortable making the identification. Why the identification works is an interesting question in its own right, one to which we will return in Section 12.

A priori, insisting that programs terminate after reasonable amounts of time, that they use reasonable amounts of memory, etc. might sound like relatively-minor amendments to Turing's notion of computation. In practice, though, these requirements lead to a theory with a completely different character than computability theory. Firstly, complexity has much closer connections with the sciences: it lets us pose questions about (for example) evolution, quantum mechanics, statistical physics, economics, or human language acquisition that would be meaningless from a computability standpoint (since all the relevant problems are computable). Complexity also differs from computability in the diversity of mathematical techniques used: while initially complexity (like computability) drew mostly on mathematical logic, today it draws on probability, number theory, combinatorics, representation theory, Fourier analysis, and nearly every other subject about which yellow books are written. Of course, this contributes not only to complexity theory's depth but also to its perceived inaccessibility.

In this essay, I'll argue that complexity theory has direct relevance to major issues in philosophy, including syntax and semantics, the problem of induction, and the interpretation of quantum mechanics. Or that, at least, whether complexity theory does or does not have such relevance is an important question for philosophy! My personal view is that complexity will ultimately prove more relevant to philosophy than computability was, precisely because of the rich connections with the sciences mentioned earlier.

3 The Relevance of Polynomial Time

Anyone who doubts the importance of the polynomial/exponential distinction need only ponder how many basic intuitions in math, science, and philosophy already implicitly rely on that distinction. In this section I’ll give three examples.

3.1 The Entscheidungsproblem Revisited

The Entscheidungsproblem was the dream, enunciated by David Hilbert in the 1920s, of designing a mechanical procedure to determine the truth or falsehood of any well-formed mathematical statement. According to the usual story, Hilbert’s dream was irrevocably destroyed by the work of Gödel, Church, and Turing in the 1930s. First, the Incompleteness Theorem showed that no recursively-axiomatizable formal system can encode all and only the true mathematical statements. Second, Church’s and Turing’s results showed that, even if we settle for an incomplete system $F$, there is still no mechanical procedure to sort mathematical statements into the three categories “provable in $F$,” “disprovable in $F$,” and “undecidable in $F$.”

However, there is a catch in the above story, which was first pointed out by Gödel himself, in a 1956 letter to John von Neumann that has become famous in theoretical computer science since its rediscovery in the 1980s (see Sipser [121] for an English translation). Given a formal system $F$ (such as Zermelo-Fraenkel set theory), Gödel wrote, consider the problem of deciding whether a mathematical statement $S$ has a proof in $F$ with $n$ symbols or fewer. Unlike Hilbert’s original problem, this “truncated Entscheidungsproblem” is clearly decidable. For, if nothing else, we could always just program a computer to search through all $2^n$ possible bit-strings with $n$ symbols, and check whether any of them encodes a valid $F$ -proof of $S$ . The issue is “merely” that this approach takes an astronomical amount of time: if $n = 1000$ (say), then the universe will have degenerated into black holes and radiation long before a computer can check $2^{1000}$ proofs!
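
In program form, Gödel’s observation amounts to the following brute-force loop (a sketch of mine; `is_valid_proof` stands in for the formal system $F$’s proof-checking routine, which I leave abstract):

```python
from itertools import product

def truncated_entscheidungsproblem(statement: str, n: int) -> bool:
    """Decide whether `statement` has an F-proof of at most n symbols,
    by enumerating every candidate bit-string: correct, but ~2^n time."""
    for length in range(1, n + 1):
        for bits in product("01", repeat=length):
            if is_valid_proof("".join(bits), statement):
                return True
    return False

def is_valid_proof(candidate: str, statement: str) -> bool:
    """Hypothetical placeholder for F's (polynomial-time) proof checker."""
    raise NotImplementedError
```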

But as Gödel also pointed out, it’s far from obvious how to prove that there isn’t a much better approach: an approach that would avoid brute-force search, and find proofs of size $n$ in time polynomial in $n$ . Furthermore:

If there actually were a machine with [running time] $\sim Kn$ (or even only with $\sim Kn^2$ ) [for some constant $K$ independent of $n$ ], this would have consequences of the greatest magnitude. That is to say, it would clearly indicate that, despite the unsolvability of the Entscheidungsproblem, the mental effort of the mathematician in the case of yes-or-no questions could be completely [added in a footnote: apart from the postulation of axioms] replaced by machines. One would indeed have to simply select an $n$ so large that, if the machine yields no result, there would then also be no reason to think further about the problem.

If we replace the “ $\sim Kn$ or $\sim Kn^2$ ” in Gödel’s challenge by $\sim Kn^c$ for an arbitrary constant $c$ , then we get precisely what computer science now knows as the P versus NP problem. Here P (Polynomial-Time) is, roughly speaking, the class of all computational problems that are solvable by a polynomial-time algorithm. Meanwhile, NP (Nondeterministic Polynomial-Time) is the class of computational problems for which a solution can be recognized in polynomial time, even though a solution might be very hard to find.9 (Think, for example, of factoring a large number, or solving a jigsaw or Sudoku puzzle.) Clearly $P \subseteq NP$ , so the question is whether the inclusion is strict. If $P = NP$ , then the ability to check the solutions to puzzles efficiently would imply the ability to find solutions efficiently. An analogy would be if anyone able to appreciate a great symphony could also compose one themselves!
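
To see operationally what membership in NP means, here is a toy Python verifier of my own: it recognizes a correctly completed Sudoku grid in time polynomial in the grid size, even though finding a completion (for the generalized $n^2 \times n^2$ version of the puzzle) is NP-complete.

```python
def verify_sudoku(grid):
    """Check a completed 9x9 grid: NP's defining feature is that solutions
    are easy to *recognize* like this, even if they are hard to *find*.
    (A full verifier would also check agreement with the original clues.)"""
    units = []
    units += [[grid[r][c] for c in range(9)] for r in range(9)]        # rows
    units += [[grid[r][c] for r in range(9)] for c in range(9)]        # columns
    units += [[grid[r + i][c + j] for i in range(3) for j in range(3)]
              for r in (0, 3, 6) for c in (0, 3, 6)]                   # boxes
    return all(sorted(unit) == list(range(1, 10)) for unit in units)
```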

Given the intuitive implausibility of such a scenario, essentially all complexity theorists proceed (reasonably, in my opinion) on the assumption that $P \neq NP$, even if they publicly claim open-mindedness about the question. Proving or disproving $P \neq NP$ is one of the seven million-dollar Clay Millennium Prize Problems10 (alongside the Riemann Hypothesis, the Poincaré Conjecture proved in 2002 by Perelman, etc.), which should give some indication of the problem’s difficulty.11


9Contrary to a common misconception, NP does not stand for “Non-Polynomial”! There are computational problems that are known to require more than polynomial time (see Section 10), but the NP problems are not among those. Indeed, the classes NP and “Non-Polynomial” have a nonempty intersection exactly if $P \neq NP$ .

For detailed definitions of P, NP, and several hundred other complexity classes, see my Complexity Zoo website: www.complexityzoo.com.

10For more information see www.claymath.org/millennium/P_vs_NP/

My own view is that P versus NP is manifestly the most important of the seven problems! For if $P = NP$, then by Gödel’s argument, there is an excellent chance that we could program our computers to solve the other six problems as well.

Now return to the problem of whether a mathematical statement $S$ has a proof with $n$ symbols or fewer, in some formal system $F$ . A suitable formalization of this problem is easily seen to be in NP. For finding a proof might be intractable, but if we’re given a purported proof, we can certainly check in time polynomial in $n$ whether each line of the proof follows by a simple logical manipulation of previous lines. Indeed, this problem turns out to be NP-complete, which means that it belongs to an enormous class of NP problems, first identified in the 1970s, that “capture the entire difficulty of NP.” A few other examples of NP-complete problems are Sudoku and jigsaw puzzles, the Traveling Salesperson Problem, and the satisfiability problem for propositional formulas.12 Asking whether $P = NP$ is equivalent to asking whether any NP-complete problem can be solved in polynomial time, and is also equivalent to asking whether all of them can be.

In modern terms, then, Gödel is saying that if $P = NP$ , then whenever a theorem had a proof of reasonable length, we could find that proof in a reasonable amount of time. In such a situation, we might say that “for all practical purposes,” Hilbert’s dream of mechanizing mathematics had prevailed, despite the undecidability results of Gödel, Church, and Turing. If you accept this, then it seems fair to say that until $P$ versus $NP$ is solved, the story of Hilbert’s Entscheidungsproblem—its rise, its fall, and the consequences for philosophy—is not yet over.

3.2 Evolvability

Creationists often claim that Darwinian evolution is as vacuous an explanation for complex adaptations as “a tornado assembling a 747 airplane as it passes through a junkyard.” Why is this claim false? There are several related ways of answering the question, but to me, one of the most illuminating is the following. In principle, one could see a 747 assemble itself in a tornado-prone junkyard—but before that happened, one would need to wait for an expected number of tornadoes that grew exponentially with the number of pieces of self-assembling junk. (This is similar to how, in thermodynamics, $n$ gas particles in a box will eventually congregate themselves in one corner of the box, but only after $\sim c^n$ time for some constant $c$ .) By contrast, evolutionary processes can often be observed in simulations—and in some cases, even proved theoretically—to find interesting solutions to optimization problems after a number of steps that grows only polynomially with the number of variables.
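
A toy simulation of my own (in the spirit of familiar “cumulative selection” demonstrations, not of any formal model discussed below) makes the contrast vivid:

```python
import random

def blind_chance(n, rng=random.Random(0)):
    """Resample an n-bit string until it matches a target: ~2^n expected
    trials -- the tornado assembling the 747."""
    target = [1] * n
    trials = 0
    while True:
        trials += 1
        if [rng.randint(0, 1) for _ in range(n)] == target:
            return trials

def cumulative_selection(n, rng=random.Random(0)):
    """Retain beneficial single-bit mutations: ~n log n expected steps."""
    target = [1] * n
    state = [rng.randint(0, 1) for _ in range(n)]
    steps = 0
    while state != target:
        steps += 1
        i = rng.randrange(n)
        if state[i] != target[i]:
            state[i] = target[i]    # selection keeps the improvement
    return steps

print(cumulative_selection(100))    # a few hundred steps
# blind_chance(100) would need ~2^100 expected trials -- don't wait.
```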

Interestingly, in a 1972 letter to Hao Wang (see [130, p. 192]), Kurt Gödel expressed his own doubts about evolution as follows:

I believe that mechanism in biology is a prejudice of our time which will be disproved. In this case, one disproof, in my opinion, will consist in a mathematical theorem to the effect that the formation within geological time of a human body by the laws of physics (or any other laws of similar nature), starting from a random distribution of the elementary particles and the field, is as unlikely as the separation by chance of the atmosphere into its components.


11One might ask: can we explain what makes the $P \neq NP$ problem so hard, rather than just pointing out that many smart people have tried to solve it and failed? After four decades of research, we do have partial explanations for the problem’s difficulty, in the form of formal “barriers” that rule out large classes of proof techniques. Three barriers identified so far are relativization [21] (which rules out diagonalization and other techniques with a “computability” flavor), algebrization [8] (which rules out diagonalization even when combined with the main non-relativizing techniques known today), and natural proofs [108] (which shows that many “combinatorial” techniques, if they worked, could be turned around to get faster algorithms to distinguish random from pseudorandom functions).

12By contrast, and contrary to a common misconception, there is strong evidence that factoring integers is not NP-complete. It is known that if $P \neq NP$, then there are NP problems that are neither in $P$ nor NP-complete [85], and factoring is one candidate for such a problem. This point will become relevant when we discuss quantum computing.

Personally, I see no reason to accept Gödel’s intuition on this subject over the consensus of modern biology! But pay attention to Gödel’s characteristically-careful phrasing. He does not ask whether evolution can eventually form a human body (for he knows that it can, given exponential time); instead, he asks whether it can do so on a “merely” geological timescale. Just as Gödel’s letter to von Neumann anticipated the P versus NP problem, so Gödel’s letter to Wang might be said to anticipate a recent effort, by the celebrated computer scientist Leslie Valiant, to construct a quantitative “theory of evolvability” [128]. Building on Valiant’s earlier work in computational learning theory (discussed in Section 7), evolvability tries to formalize and answer questions about the speed of evolution. For example: “what sorts of adaptive behaviors can evolve, with high probability, after only a polynomial number of generations? what sorts of behaviors can be learned in polynomial time, but not via evolution?” While there are some interesting early results, it should surprise no one that evolvability is nowhere close to being able to calculate, from first principles, whether four billion years is a “reasonable” or “unreasonable” length of time for the human brain to evolve out of the primordial soup.

As I see it, this difficulty reflects a general point about Gödel’s “evolvability” question. Namely, even supposing Gödel was right, that the mechanistic worldview of modern biology was “as unlikely as the separation by chance of the atmosphere into its components,” computational complexity theory seems hopelessly far from being able to prove anything of the kind! In 1972, one could have argued that this merely reflected the subject’s newness: no one had thought terribly deeply yet about how to prove lower bounds on computation time. But by now, people have thought deeply about it, and have identified huge obstacles to proving even such “obvious” and well-defined conjectures as $P \neq NP$ .13 (Section 4 will make a related point, about the difficulty of proving nontrivial lower bounds on the time or memory needed by a computer program to pass the Turing Test.)

3.3 Known Integers

My last example of the philosophical relevance of the polynomial/exponential distinction concerns the concept of “knowledge” in mathematics.14 As of 2011, the “largest known prime number,” as reported by GIMPS (the Great Internet Mersenne Prime Search),15 is $p := 2^{43112609} - 1$ . But on reflection, what do we mean by saying that $p$ is “known”? Do we mean that, if we desired, we could literally print out its decimal digits (using about 30,000 pages)? That seems like too restrictive a criterion. For, given a positive integer $k$ together with a proof that $q = 2^k - 1$ was prime, I doubt most mathematicians would hesitate to call $q$ a “known” prime, even if $k$ were so large that printing out its decimal digits (or storing them in a computer memory) were beyond the Earth’s capacity. Should we call $2^{2^{1000}}$ an “unknown power of 2,” just because it has too many decimal digits to list before the Sun goes cold?

All that should really matter, one feels, is that

  • (a) the expression ‘$2^{43112609} - 1$’ picks out a unique positive integer, and
  • (b) that integer has been proven (in this case, via computer, of course) to be prime.


13Admittedly, one might be able to prove that Darwinian natural selection would require exponential time to produce some functionality, without thereby proving that any algorithm would require exponential time.

14This section was inspired by a question of A. Rupinski on the website MathOverflow. See mathoverflow.net/questions/62925/philosophical-question-related-to-largest-known-primes/

15www.mersenne.org


But wait! If those are the criteria, then why can’t we immediately beat the largest-known-prime record, like so?

$p' = \text{the first prime larger than } 2^{43112609} - 1.$

Clearly $p'$ exists, it is unambiguously defined, and it is prime. If we want, we can even write a program that is guaranteed to find $p'$ and output its decimal digits, using a number of steps that can be upper-bounded a priori.16 Yet our intuition stubbornly insists that $2^{43112609} - 1$ is a “known” prime in a sense that $p'$ is not. Is there any principled basis for such a distinction?

The clearest basis that I can suggest is the following. We know an algorithm that takes as input a positive integer $k$ , and that outputs the decimal digits of $p = 2^k - 1$ using a number of steps that is polynomial—indeed, linear—in the number of digits of $p$ . But we do not know any similarly-efficient algorithm that provably outputs the first prime larger than $2^k - 1$ .17
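
In code, the contrast looks something like this (a sketch of mine; `is_prime` is left abstract, standing in for a proven polynomial-time primality test such as AKS [9]):

```python
import sys

def mersenne_digits(k: int) -> str:
    """Output the decimal digits of 2^k - 1 in time polynomial in the
    number of digits -- the sense in which this prime is 'known'."""
    if hasattr(sys, "set_int_max_str_digits"):
        sys.set_int_max_str_digits(0)   # lift CPython 3.11+'s conversion cap
    return str(2**k - 1)

def first_prime_after(m: int) -> int:
    """No provably efficient analogue is known: each primality test below
    can be polynomial-time (e.g., AKS), but the best proven bound on how
    far we must search is exponential in the digit count (cf. footnote 17)."""
    candidate = m + 1
    while not is_prime(candidate):      # is_prime: abstract, assumed given
        candidate += 1
    return candidate
```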

3.4 Summary

The point of these examples was to illustrate that, beyond its utility for theoretical computer science, the polynomial/exponential gap is also a fertile territory for philosophy. I think of the polynomial/exponential gap as occupying a “middle ground” between two other sorts of gaps: on the one hand, small quantitative gaps (such as the gap between $n$ steps and $2n$ steps); and on the other hand, the gap between a finite number of steps and an infinite number. The trouble with small quantitative gaps is that they are too sensitive to “mundane” modeling choices and the details of technology. But the gap between finite and infinite has the opposite problem: it is serenely insensitive to distinctions that we actually care about, such as that between finding a solution and verifying it, or between classical and quantum physics.18 The polynomial/exponential gap avoids both problems.

4 Computational Complexity and the Turing Test

Can a computer think? For almost a century, discussions about this question have often conflated two issues. The first is the “metaphysical” issue:

Supposing a computer program passed the Turing Test (or as strong a variant of the Turing Test as one wishes to define),19 would we be right to ascribe to it “consciousness,” “qualia,” “aboutness,” “intentionality,” “subjectivity,” “personhood,” or whatever other charmed status we wish to ascribe to other humans and to ourselves?


16For example, one could use Chebyshev’s Theorem (also called Bertrand’s Postulate), which says that for all $N > 1$ there exists a prime between $N$ and $2N$ .

17Cramér’s Conjecture states that the spacing between two consecutive $n$ -digit primes never exceeds $\sim n^2$ . This conjecture appears staggeringly difficult: even assuming the Riemann Hypothesis, it is only known how to deduce the much weaker upper bound $\sim n2^{n/2}$ . But interestingly, if Cramér’s Conjecture is proved, expressions like “the first prime larger than $2^k - 1$ ” will then define “known primes” according to my criterion.

18In particular, it is easy to check that the set of computable functions does not depend on whether we define computability with respect to a classical or a quantum Turing machine, or a deterministic or nondeterministic one. At most, these choices can change a Turing machine’s running time by an exponential factor, which is irrelevant for computability theory.

The second is the “practical” issue:

Could a computer program that passed (a strong version of) the Turing Test actually be written? Is there some fundamental reason why it couldn’t be?

Of course, it was precisely in an attempt to separate these issues that Turing proposed the Turing Test in the first place! But despite his efforts, a familiar feature of anti-AI arguments to this day is that they first assert AI’s metaphysical impossibility, and then try to bolster that position with claims about AI’s practical difficulties. “Sure,” they say, “a computer program might mimic a few minutes of witty banter, but unlike a human being, it would never show fear or anger or jealousy, or compose symphonies, or grow old, or fall in love...”

The obvious followup question—and what if a program did do all those things?—is often left unasked, or else answered by listing more things that a computer program could self-evidently never do. Because of this, I suspect that many people who say they consider AI a metaphysical impossibility, really consider it only a practical impossibility: they simply have not carried the requisite thought experiment far enough to see the difference between the two.20 Incidentally, this is as clear-cut a case as I know of where people would benefit from studying more philosophy!

Thus, the anti-AI arguments that interest me most have always been the ones that target the practical issue from the outset, by proposing empirical “sword-in-the-stone tests” (in Daniel Dennett’s phrase [46]) that it is claimed humans can pass but computers cannot. The most famous such test is probably the one based on Gödel’s Incompleteness Theorem, as proposed by John Lucas [91] and elaborated by Roger Penrose in his books The Emperor’s New Mind [102] and Shadows of the Mind [103].

Briefly, Lucas and Penrose argued that, according to the Incompleteness Theorem, one thing that a computer making deductions via fixed formal rules can never do is to “see” the consistency of its own rules. Yet this, they assert, is something that human mathematicians can do, via some sort of intuitive perception of Platonic reality. Therefore humans (or at least, human mathematicians!) can never be simulated by machines.

Critics pointed out numerous holes in this argument,21 to which Penrose responded at length in Shadows of the Mind, in my opinion unconvincingly. However, even before we analyze some proposed sword-in-the-stone test, it seems to me that there is a much more basic question. Namely, what does one even mean in saying one has a task that “humans can perform but computers cannot”?


19The Turing Test, proposed by Alan Turing [126] in 1950, is a test where a human judge interacts with either another human or a computer conversation program, by typing messages back and forth. The program “passes” the Test if the judge can’t reliably distinguish the program from the human interlocutor.

By a “strong variant” of the Turing Test, I mean that besides the usual teletype conversation, one could add additional tests requiring vision, hearing, touch, smell, speaking, handwriting, facial expressions, dancing, playing sports and musical instruments, etc.—even though many perfectly-intelligent humans would then be unable to pass the tests!

20One famous exception is John Searle [113], who has made it clear that, if (say) his best friend turned out to be controlled by a microchip rather than a brain, then he would regard his friend as never having been a person at all.

21See Dennett [46] and Chalmers [37] for example. To summarize:

  1. (1) Why should we assume a computer operates within a knowably-sound formal system? If we grant a computer the same freedom to make occasional mistakes that we grant humans, then the Incompleteness Theorem is no longer relevant.
  2. (2) Why should we assume that human mathematicians have “direct perception of Platonic reality”? Human mathematicians (such as Frege) have been wrong before about the consistency of formal systems.

4.1 The Lookup-Table Argument

There is a fundamental difficulty here, which was noticed by others in a slightly different context [30, 101, 87, 116]. Let me first explain the difficulty, and then discuss the difference between my argument and the previous ones.

In practice, people judge each other to be conscious after interacting for a very short time, perhaps as little as a few seconds. This suggests that we can put a finite upper bound—to be generous, let us say $10^{20}$ —on the number of bits of information that two people $A$ and $B$ would ever realistically exchange, before $A$ had amassed enough evidence to conclude $B$ was conscious.22 Now imagine a lookup table that stores every possible history $H$ of $A$ and $B$ ’s conversation, and next to $H$ , the action $f_B(H)$ that $B$ would take next given that history. Of course, like Borges’ Library of Babel, the lookup table would consist almost entirely of meaningless nonsense, and it would also be much too large to fit inside the observed universe. But all that matters for us is that the lookup table would be finite, by the assumption that there is a finite upper bound on the conversation length. This implies that the function $f_B$ is computable (indeed, it can be recognized by a finite automaton!). From these simple considerations, we conclude that if there is a fundamental obstacle to computers passing the Turing Test, then it is not to be found in computability theory.23
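
Concretely, the lookup-table “program” is nothing more exotic than the following (a deliberately silly sketch of mine; the philosophical point is only that $f_B$ is a finite, computable object):

```python
# B's entire conversational behavior as a finite map from history to reply.
# A hypothetical fragment: the real table would have ~2^(10^20) entries and
# could not remotely fit inside the observable universe.

f_B = {
    (): "Hello!",
    ("Hello!", "How are you?"): "Fine, thanks. And you?",
    # ... one entry for every possible finite conversation history H ...
}

def respond(history: tuple) -> str:
    """Trivially computable; the only obstacle is the table's size."""
    return f_B.get(history, "Hmm, let me think about that.")
```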

In Shadows of the Mind [103, p. 83], Penrose recognizes this problem, but gives a puzzling and unsatisfying response:

One could equally well envisage computers that contain nothing but lists of totally false mathematical ‘theorems,’ or lists containing random jumbles of truths and falsehoods. How are we to tell which computer to trust? The arguments that I am trying to make here do not say that an effective simulation of the output of conscious human activity (here mathematics) is impossible, since purely by chance the computer might ‘happen’



  3. (3) A computer could, of course, be programmed to output “I believe that formal system $F$ is consistent”—and even to output answers to various followup questions about why it believes this. So in arguing that such affirmations “wouldn’t really count” (because they wouldn’t reflect “true understanding”), AI critics such as Lucas and Penrose are forced to retreat from their vision of an empirical “sword-in-the-stone test,” and fall back on other, unspecified criteria related to the AI’s internal structure. But then why put the sword in the stone in the first place?

22People interacting over the Internet, via email or instant messages, regularly judge each other to be humans rather than spam-bots after exchanging a much smaller number of bits! In any case, cosmological considerations suggest an upper bound of roughly $10^{122}$ bits in any observable process [34].

23Some readers might notice a tension here: I explained in Section 2 that complexity theorists care about the asymptotic behavior as the problem size $n$ goes to infinity. So why am I now saying that, for the purposes of the Turing Test, we should restrict attention to finite values of $n$ such as $10^{20}$ ? There are two answers to this question. The first is that, in contrast to mathematical problems like the factoring problem or the halting problem, it is unclear whether it even makes sense to generalize the Turing Test to arbitrary conversation lengths: for the Turing Test is defined in terms of human beings, and human conversational capacity is finite. The second answer is that, to whatever extent it does make sense to generalize the Turing Test to arbitrary conversation lengths $n$ , I am interested in whether the asymptotic complexity of passing the test grows polynomially or exponentially with $n$ (as the remainder of the section explains).to get it right—even without any understanding whatsoever. But the odds against this are absurdly enormous, and the issues that are being addressed here, namely how one decides which mathematical statements are true and which are false, are not even being touched...

The trouble with this response is that it amounts to a retreat from the sword-in-the-stone test, back to murkier internal criteria. If, in the end, we are going to have to look inside the computer anyway to determine whether it truly “understands” its answers, then why not dispense with computability theory from the beginning? For computability theory only addresses whether or not Turing machines exist to solve various problems, and we have already seen that that is not the relevant issue.

To my mind, there is one direction that Penrose could take from this point to avoid incoherence—though disappointingly, it is not the direction he chooses. Namely, he could point out that, while the lookup table “works,” it requires computational resources that grow exponentially with the length of the conversation! This would lead to the following speculation:

(*) Any computer program that passed the Turing Test would need to be exponentially-inefficient in the length of the test—as measured in some resource such as time, memory usage, or the number of bits needed to write the program down. In other words, the astronomical lookup table is essentially the best one can do.24

If true, speculation (*) would do what Penrose wants: it would imply that the human brain can’t even be simulated by computer, within the resource constraints of the observable universe. Furthermore, unlike the earlier computability claim, (*) has the advantage of not being trivially false!

On the other hand, to put it mildly, (*) is not trivially true either. For AI proponents, the lack of compelling evidence for (*) is hardly surprising. After all, if you believe that the brain itself is basically an efficient,25 classical Turing machine, then you have a simple explanation for why no one has proved that the brain can’t be simulated by such a machine! However, complexity theory also makes it clear that, even if we supposed (*) held, there would be little hope of proving it in our current state of mathematical knowledge. After all, we can’t even prove plausible, well-defined conjectures such as $P \neq NP$ .

4.2 Relation to Previous Work

As mentioned before, I’m far from the first person to ask about the computational resources used in passing the Turing Test, and whether they scale polynomially or exponentially with the conversation length. While many writers ignore this crucial distinction, Block [30], Parberry [101], Levesque [87], Shieber [116], and several others all discussed it explicitly. The main difference is that the previous discussions took place in the context of Searle’s Chinese Room argument [113].


24As Gil Kalai pointed out to me, one could speculate instead that an efficient computer program exists to pass the Turing Test, but that finding such a program would require exponential computational resources. In that situation, the human brain could indeed be simulated efficiently by a computer program, but maybe not by a program that humans could ever write!

25Here, by a Turing machine $M$ being “efficient,” we mean that $M$’s running time, memory usage, and program size are modest enough that there is no real problem of principle understanding how $M$ could be simulated by a classical physical system consisting of $\sim 10^{11}$ neurons and $\sim 10^{14}$ synapses. For example, a Turing machine containing a lookup table of size $10^{10^{20}}$ would not be efficient in this sense.

Briefly, Searle proposed a thought experiment—the details don’t concern us here—purporting to show that a computer program could pass the Turing Test, even though the program manifestly lacked anything that a reasonable person would call “intelligence” or “understanding.” In response, many critics said that Searle’s argument was deeply misleading, because it implicitly encouraged us to imagine a computer program that was simplistic in its internal operations—something like the giant lookup table described in Section 4.1. And while it was true, the critics went on, that a giant lookup table wouldn’t “truly understand” its responses, that point is also irrelevant. For the giant lookup table is a philosophical fiction anyway: something that can’t even fit in the observable universe! If we instead imagine a compact, efficient computer program passing the Turing Test, then the situation changes drastically. For now, in order to explain how the program can be so compact and efficient, we’ll need to posit that the program includes representations of abstract concepts, capacities for learning and reasoning, and all sorts of other internal furniture that we would expect to find in a mind.

Personally, I find this response to Searle extremely interesting—since if correct, it suggests that the distinction between polynomial and exponential complexity has metaphysical significance. According to this response, an exponential-sized lookup table that passed the Turing Test would not be sentient (or conscious, intelligent, self-aware, etc.), but a polynomially-bounded program with exactly the same input/output behavior would be sentient. Furthermore, the latter program would be sentient because it was polynomially-bounded.

Yet, as much as that criterion for sentience flatters my complexity-theoretic pride, I find myself reluctant to take a position on such a weighty matter. My point, in Section 4.1, was a simpler and (hopefully) less controversial one: namely, that if you want to claim that passing the Turing Test is flat-out impossible, then like it or not, you must talk about complexity rather than just computability. In other words, the previous writers [30, 101, 87, 116] and I are all interested in the computational resources needed to pass a Turing Test of length $n$ , but for different reasons. Where others invoked complexity considerations to argue with Searle about the metaphysical question, I’m invoking them to argue with Penrose about the practical question.

4.3 Can Humans Solve NP-Complete Problems Efficiently?

In that case, what can we actually say about the practical question? Are there any reasons to accept the claim I called (*)—the claim that humans are not efficiently simulable by Turing machines? In considering this question, we’re immediately led to some speculative possibilities. So for example, if it turned out that humans could solve arbitrary instances of NP-complete problems in polynomial time, then that would certainly constitute excellent empirical evidence for (*).26 However, despite occasional claims to the contrary, I personally see no reason to believe that humans can solve NP-complete problems in polynomial time, and excellent reasons to believe the opposite.27 Recall, for example, that the integer factoring problem is in NP. Thus, if humans could solve NP-complete problems, then presumably we ought to be able to factor enormous numbers as well! But factoring does not exactly seem like the most promising candidate for a sword-in-the-stone test: that is, a task that’s easy for humans but hard for computers. As far as anyone knows today, factoring is hard for humans and (classical) computers alike, although with a definite advantage on the computers’ side!


26And amusingly, if we could solve NP-complete problems, then we’d presumably find it much easier to prove that computers couldn’t solve them!

27Indeed, it is not even clear to me that we should think of humans as being able to solve all P problems efficiently, let alone NP-complete problems! Recall that P is the class of problems that are solvable in polynomial time by a deterministic Turing machine. Many problems are known to belong to P for quite sophisticated reasons: two examples are testing whether a number is prime (though not factoring it!) [9] and testing whether a graph has a perfect matching. In principle, of course, a human could laboriously run the polynomial-time algorithms for such problems using pencil and paper. But is the use of pencil and paper legitimate, where use of a computer would not be? What is the computational power of the “unaided” human intellect? Recent work of Drucker [51], which shows how to use a stock photography collection to increase the “effective memory” available for mental calculations, provides a fascinating empirical perspective on these questions.

The basic point can hardly be stressed enough: when complexity theorists talk about “intractable” problems, they generally mean mathematical problems that all our experience leads us to believe are at least as hard for humans as for computers. This suggests that, even if humans were not efficiently simulable by Turing machines, the “direction” in which they were hard to simulate would almost certainly be different from the directions usually considered in complexity theory. I see two (hypothetical) ways this could happen.

First, the tasks that humans were uniquely good at—like painting or writing poetry—could be incomparable with mathematical tasks like solving NP-complete problems, in the sense that neither was efficiently reducible to the other. This would mean, in particular, that there could be no polynomial-time algorithm even to recognize great art or poetry (since if such an algorithm existed, then the task of composing great art or poetry would be in NP). Within complexity theory, it’s known that there exist pairs of problems that are incomparable in this sense. As one plausible example, no one currently knows how to reduce the simulation of quantum computers to the solution of NP-complete problems or vice versa.

Second, humans could have the ability to solve interesting special cases of NP-complete problems faster than any Turing machine. So for example, even if computers were better than humans at factoring large numbers or at solving randomly-generated Sudoku puzzles, humans might still be better at search problems with “higher-level structure” or “semantics,” such as proving Fermat’s Last Theorem or (ironically) designing faster computer algorithms. Indeed, even in limited domains such as puzzle-solving, while computers can examine solutions millions of times faster, humans (for now) are vastly better at noticing global patterns or symmetries in the puzzle that make a solution either trivial or impossible. As an amusing example, consider the Pigeonhole Principle, which says that $n + 1$ pigeons can’t be placed into $n$ holes, with at most one pigeon per hole. It’s not hard to construct a propositional Boolean formula $\varphi$ that encodes the Pigeonhole Principle for some fixed value of $n$ (say, 1000). However, if you then feed $\varphi$ to current Boolean satisfiability algorithms, they’ll assiduously set to work trying out possibilities: “let’s see, if I put this pigeon here, and that one there ... darn, it still doesn’t work!” And they’ll continue trying out possibilities for an exponential number of steps, oblivious to the “global” reason why the goal can never be achieved. Indeed, beginning in the 1980s, the field of proof complexity—a close cousin of computational complexity—has been able to show that large classes of algorithms require exponential time to prove the Pigeonhole Principle and similar propositional tautologies (see Beame and Pitassi [24] for a survey).
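
For concreteness, here is a sketch of my own of the propositional encoding: already for modest $n$ it yields a formula that is unsatisfiable for the “global” reason just described, yet (by Haken’s theorem) requires exponential-size resolution refutations.

```python
from itertools import combinations

def pigeonhole_cnf(n):
    """CNF encoding of 'n+1 pigeons fit into n holes, one pigeon per hole'
    -- unsatisfiable for every n, yet resolution-based SAT solvers need
    exponential time to certify that.  Literals are DIMACS-style signed ints."""
    def var(pigeon, hole):               # variable x_{pigeon,hole}
        return pigeon * n + hole + 1
    clauses = []
    for p in range(n + 1):               # every pigeon sits in some hole
        clauses.append([var(p, h) for h in range(n)])
    for h in range(n):                   # no hole holds two pigeons
        for p1, p2 in combinations(range(n + 1), 2):
            clauses.append([-var(p1, h), -var(p2, h)])
    return clauses

# pigeonhole_cnf(1000): a formula current SAT solvers will choke on.
```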

Still, if we want to build our sword-in-the-stone test on the ability to detect “higher-level patterns” in combinatorial search problems, then the burden is on us to explain what we mean by higher-level patterns, and why we think that no polynomial-time Turing machine—even much more sophisticated ones than we can imagine today—could ever detect those patterns as well. For an initial attempt to understand NP-complete problems from a cognitive science perspective, see Baum [22].


4.4 Summary

My conclusion is that, if you oppose the possibility of AI in principle, then either

  • (i) you can take the “metaphysical route” (as Searle [113] does with the Chinese Room), conceding the possibility of a computer program passing every conceivable empirical test for intelligence, but arguing that that isn’t enough, or
  • (ii) you can conjecture an astronomical lower bound on the resources needed either to run such a program or to write it in the first place—but here there is little question of proof for the foreseeable future.

Crucially, because of the lookup-table argument, one option you do not have is to assert the flat-out impossibility of a computer program passing the Turing Test, with no mention of quantitative complexity bounds.

5 The Problem of Logical Omniscience

Giving a formal account of knowledge is one of the central concerns in modern analytic philosophy; the literature is too vast even to survey here (though see Fagin et al. [53] for a computer-science-friendly overview). Typically, formal accounts of knowledge involve conventional “logical” axioms, such as

  • If you know $P$ and you know $Q$, then you also know $P \wedge Q$

supplemented by “modal” axioms having to do with knowledge itself, such as

  • If you know $P$, then you also know that you know $P$
  • If you don’t know $P$, then you know that you don’t know $P$28

While the details differ, what most formal accounts of knowledge have in common is that they treat an agent’s knowledge as closed under the application of various deduction rules like the ones above. In other words, agents are considered logically omniscient: if they know certain facts, then they also know all possible logical consequences of those facts.

Sadly and obviously, no mortal being has ever attained or even approximated this sort of omniscience (recall the Turing quote from the beginning of Section 1). So for example, I can know the rules of arithmetic without knowing Fermat’s Last Theorem, and I can know the rules of chess without knowing whether White has a forced win. Furthermore, the difficulty is not (as sometimes claimed) limited to a few domains, such as mathematics and games. As pointed out by Stalnaker [123], if we assumed logical omniscience, then we couldn’t account for any contemplation of facts already known to us—and thus, for the main activity and one of the main subjects of philosophy itself!
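
One way to feel the computational force of the problem (a toy illustration of my own): the single conjunction axiom above, applied mechanically to $n$ atomic facts, already commits an agent to $2^n - 1$ distinct items of “knowledge.”

```python
from itertools import combinations

def close_under_conjunction(atoms):
    """Model each known item as the set of atomic facts it conjoins, and
    apply 'know P, know Q => know P-and-Q' to a fixpoint.  From n atomic
    facts the closure has 2^n - 1 members: exponential blowup from one
    innocuous-looking deduction rule."""
    known = {frozenset([a]) for a in atoms}
    while True:
        new = {p | q for p, q in combinations(known, 2)} - known
        if not new:
            return known
        known |= new

assert len(close_under_conjunction("ABCD")) == 2**4 - 1  # 15 items from 4 facts
```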

We can now loosely state what Hintikka [72] called the problem of logical omniscience:


Can we give some formal account of “knowledge” able to accommodate people learning new things without leaving their armchairs?

28Not surprisingly, this particular axiom has engendered controversy: it leaves no possibility for Rumsfeldian “unknown unknowns.”

Of course, one vacuous “solution” would be to declare that your knowledge is simply a list of all the true sentences29 that you “know”—and that, if the list happens not to be closed under logical deductions, so be it! But this “solution” is no help at all in explaining how or why you know things. Can’t we do better?

Intuitively, we want to say that your “knowledge” consists of various non-logical facts (“grass is green”), together with some simple consequences of those facts (“grass is not pink”), but not necessarily all the consequences, and certainly not all consequences that involve difficult mathematical reasoning. Unfortunately, as soon as we try to formalize this idea, we run into problems.

The most obvious problem is the lack of a sharp boundary between the facts you know right away, and those you “could” know, but only after significant thought. (Recall the discussion of “known primes” from Section 3.3.) A related problem is the lack of a sharp boundary between the facts you know “only if asked about them,” and those you know even if you’re not asked. Interestingly, these two boundaries seem to cut across each other. For example, while you’ve probably already encountered the fact that 91 is composite, it might take you some time to remember it; while you’ve probably never encountered the fact that 83190 is composite, once asked you can probably assent to it immediately.

But as discussed by Stalnaker [123], there’s a third problem that seems much more serious than either of the two above. Namely, you might “know” a particular fact if asked about it one way, but not if asked in a different way! To illustrate this, Stalnaker uses an example that we can recognize immediately from the discussion of the P versus NP problem in Section 3.1. If I asked you whether $43 \times 37 = 1591$ , you could probably answer easily (e.g., by using $(40 + 3)(40 - 3) = 40^2 - 3^2$ ). On the other hand, if I instead asked you what the prime factors of 1591 were, you probably couldn’t answer so easily.

But the answers to the two questions have the same content, even on a very fine-grained notion of content. Suppose that we fix the threshold of accessibility so that the information that 43 and 37 are the prime factors of 1591 is accessible in response to the second question, but not accessible in response to the first. Do you know what the prime factors of 1591 are or not? ... Our problem is that we are not just trying to say what an agent would know upon being asked certain questions; rather, we are trying to use the facts about an agent’s question answering capacities in order to get at what the agent knows, even if the questions are not asked. [123, p. 253]
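The computational asymmetry that Stalnaker’s example trades on is easy to exhibit directly. Here is a small illustration (naive trial division stands in for factoring purely for vividness; it is of course far from the best known algorithm):

```python
# Verifying a claimed factorization takes one multiplication; finding the
# factors by trial division takes time exponential in the number of digits.

def check_factorization(n: int, p: int, q: int) -> bool:
    return p * q == n                  # the "easy" direction

def smallest_factor(n: int) -> int:
    d = 2
    while n % d:                       # the "hard" direction, done naively
        d += 1
    return d

print(check_factorization(1591, 43, 37))   # True, in a single step
print(smallest_factor(1591))               # 37, after ~36 trial divisions
```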

To add another example: does a typical four-year-old child “know” that addition of reals is commutative? Certainly not if we asked her in those words—and if we tried to explain the words, she probably wouldn’t understand us. Yet if we showed her a stack of books, and asked her whether she could make the stack higher by shuffling the books, she probably wouldn’t make a mistake that involved imagining addition was non-commutative. In that sense, we might say she already “implicitly” knows what her math classes will later make explicit.

In my view, these and other examples strongly suggest that only a small part of what we mean by “knowledge” is knowledge about the truth or falsehood of individual propositions. And crucially, this remains so even if we restrict our attention to “purely verbalizable” knowledge—indeed, knowledge used for answering factual questions—and not (say) knowledge of how to ride a bike or swing a golf club, or knowledge of a person or a place.30 Many everyday uses of the word “know” support this idea:

29If we don’t require the sentences to be true, then presumably we’re talking about belief rather than knowledge.

Do you know calculus?
Do you know Spanish?
Do you know the rules of bridge?

Each of the above questions could be interpreted as asking: do you possess an internal algorithm, by which you can answer a large (and possibly-unbounded) set of questions of some form? While this is rarely made explicit, the examples of this section and of Section 3.3 suggest adding the proviso: ... answer in a reasonable amount of time?

But suppose we accept that “knowing how” (or “knowing a good algorithm for”) is a more fundamental concept than “knowing that.” How does that help us at all in solving the logical omniscience problem? You might worry that we’re right back where we started. After all, if we try to give a formal account of “knowing how,” then just like in the case of “knowing that,” it will be tempting to write down axioms like the following:

If you know how to compute $f(x)$ and $g(x)$ efficiently, then you also know how to compute $f(x) + g(x)$ efficiently.

Naturally, we’ll then want to take the logical closure of those axioms. But then, before we know it, won’t we have conjured into our imaginations a computationally-omniscient superbeing, who could efficiently compute anything at all?

5.1 The Cobham Axioms

Happily, the above worry turns out to be unfounded. We can write down reasonable axioms for “knowing how to compute efficiently,” and then go ahead and take the closure of those axioms, without getting the unwanted consequence of computational omniscience. Explaining this point will involve a digression into an old and fascinating corner of complexity theory—one that probably holds independent interest for philosophers.

As is well-known, in the 1930s Church and Kleene proposed definitions of the “computable functions” that turned out to be precisely equivalent to Turing’s definition, but that differed from Turing’s in making no explicit mention of machines. Rather than analyzing the process of computation, the Church-Kleene approach was simply to list axioms that the computable functions of natural numbers $f : \mathbb{N} \rightarrow \mathbb{N}$ ought to satisfy—for example, “if $f(x)$ and $g(x)$ are both computable, then so is $f(g(x))$ ”—and then to define “the” computable functions as the smallest set satisfying those axioms.

In 1965, Alan Cobham [42] asked whether the same could be done for the efficiently or feasibly computable functions. As an answer, he offered axioms that precisely characterize what today we call FP, or Function Polynomial-Time (though Cobham called it $\mathcal{L}$). The class FP consists of all functions of natural numbers $f : \mathbb{N} \rightarrow \mathbb{N}$ that are computable in polynomial time by a deterministic Turing machine. Note that FP is “morally” the same as the class P (Polynomial-Time) defined in Section 3.1: they differ only in that P is a class of decision problems (or equivalently, functions $f : \mathbb{N} \rightarrow \{0, 1\}$), whereas FP is a class of functions with integer range.

30For “knowing” a person suggests having actually met the person, while “knowing” a place suggests having visited the place. Interestingly, in Hebrew, one uses a completely different verb for “know” in the sense of “being familiar with” (makir) than for “know” in the intellectual sense (yodeya).

What was noteworthy about Cobham’s characterization of polynomial time was that it didn’t involve any explicit mention of either computing devices or bounds on their running time. Let me now list a version of Cobham’s axioms, adapted from Arora, Impagliazzo, and Vazirani [16]. Each of the axioms talks about which functions of natural numbers $f : \mathbb{N} \rightarrow \mathbb{N}$ are “efficiently computable.”

  • (1) Every constant function $f$ is efficiently computable, as is every function which is nonzero only finitely often.
  • (2) Pairing: If $f(x)$ and $g(x)$ are efficiently computable, then so is $\langle f(x), g(x) \rangle$, where $\langle , \rangle$ is some standard pairing function for the natural numbers.
  • (3) Composition: If $f(x)$ and $g(x)$ are efficiently computable, then so is $f(g(x))$.
  • (4) Grab Bag: The following functions are all efficiently computable:
    • the arithmetic functions $x + y$ and $x \times y$
    • $|x| = \lfloor \log_2 x \rfloor + 1$ (the number of bits in $x$’s binary representation)
    • the projection functions $\Pi_1(\langle x, y \rangle) = x$ and $\Pi_2(\langle x, y \rangle) = y$
    • $\text{bit}(\langle x, i \rangle)$ (the $i^{\text{th}}$ bit of $x$’s binary representation, or 0 if $i > |x|$)
    • $\text{diff}(\langle x, i \rangle)$ (the number obtained from $x$ by flipping its $i^{\text{th}}$ bit)
    • $2^{|x|^2}$ (called the “smash function”)
  • (5) Bounded Recursion: Suppose $f(x)$ is efficiently computable, and $|f(x)| \leq |x|$ for all $x \in \mathbb{N}$. Then the function $g(\langle x, k \rangle)$, defined by

$$g(\langle x, k \rangle) = \begin{cases} f(g(\langle x, \lfloor k/2 \rfloor \rangle)) & \text{if } k > 1 \\ x & \text{if } k = 1, \end{cases}$$

is also efficiently computable.

A few comments about the Cobham axioms might be helpful. First, the axiom that “does most of the work” is (5). Intuitively, given any natural number $k \in \mathbb{N}$ that we can generate starting from the original input $x \in \mathbb{N}$ , the Bounded Recursion axiom lets us set up a “computational process” that runs for $\log_2 k$ steps. Second, the role of the “smash function,” $2^{|x|^2}$ , is to let us map $n$ -bit integers to $n^2$ -bit integers to $n^4$ -bit integers and so on, and thereby (in combination with the Bounded Recursion axiom) set up computational processes that run for arbitrary polynomial numbers of steps. Third, although addition and multiplication are included as “efficiently computable functions,” it is crucial that exponentiation is not included. Indeed, if $x$ and $y$ are $n$ -bit integers, then $x^y$ might require exponentially many bits just to write down.
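For intuition, here is a toy rendering in ordinary Python—obviously not itself confined to the Cobham axioms—of how the smash function and Bounded Recursion conspire to yield polynomially long computations:

```python
def smash(x: int) -> int:
    # Maps an n-bit input to an output of roughly n^2 bits.
    return 2 ** (x.bit_length() ** 2)

def g(f, x: int, k: int):
    # Bounded Recursion, axiom (5): g(x, k) = f(g(x, floor(k/2))) for k > 1
    # and g(x, 1) = x, so f is applied about log2(k) times in total.
    return x if k <= 1 else f(g(f, x, k // 2))

x = 12345                         # a 14-bit input
k = smash(x)                      # equals 2**196, so log2(k) = 14^2 = 196
print(g(lambda t: t + 1, 0, k))   # 196: one application of f per halving of k
```

Starting from an $n$-bit input, repeated smashing generates counters $k$ with any polynomial number of bits, and hence computational processes running for any polynomial number of steps—but never exponentially many, precisely because exponentiation is absent from the Grab Bag.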

The basic result is then the following:

Theorem 1 ([42, 110]) The class FP, of functions $f : \mathbb{N} \rightarrow \mathbb{N}$ computable in polynomial time by a deterministic Turing machine, satisfies axioms (1)-(5), and is the smallest class that does so.

To prove Theorem 1, one needs to do two things, neither of them difficult: first, show that any function $f$ that can be defined using the Cobham axioms can also be computed in polynomial time; and second, show that the Cobham axioms are enough to simulate any polynomial-time Turing machine.

One drawback of the Cobham axioms is that they seem to “sneak in the concept of polynomial-time through the back door”—both through the “smash function,” and through the arbitrary-looking condition $|f(x)| \leq |x|$ in axiom (5). In the 1990s, however, Leivant [86] and Bellantoni and Cook [25] both gave more “elegant” logical characterizations of FP that avoid this problem. So for example, Leivant showed that a function $f$ belongs to FP if and only if $f$ is computed by a program that can be proved correct in second-order logic with comprehension restricted to positive quantifier-free formulas. Results like these provide further evidence—if any was needed—that polynomial-time computability is an extremely natural notion: a “wide target in conceptual space” that one hits even while aiming in purely logical directions.

Over the past few decades, the idea of defining complexity classes such as P and NP in “logical, machine-free” ways has given rise to an entire field called descriptive complexity theory, which has deep connections with finite model theory. While further discussion of descriptive complexity theory would take us too far afield, see the book of Immerman [77] for the definitive introduction, or Fagin [52] for a survey.

5.2 Omniscience Versus Infinity

Returning to our original topic, how exactly do axiomatic theories such as Cobham’s (or Church’s and Kleene’s, for that matter) escape the problem of omniscience? One straightforward answer is that, unlike the set of true sentences in some formal language, which is only countably infinite, the set of functions $f : \mathbb{N} \rightarrow \mathbb{N}$ is uncountably infinite. And therefore, even if we define the “efficiently-computable” functions $f : \mathbb{N} \rightarrow \mathbb{N}$ by taking a countably-infinite logical closure, we are sure to miss some functions $f$ (in fact, almost all of them!).

The observation above suggests a general strategy to tame the logical omniscience problem. Namely, we could refuse to define an agent’s “knowledge” in terms of which individual questions she can quickly answer, and insist on speaking instead about which infinite families of questions she can quickly answer. In slogan form, we want to “fight omniscience with infinity.”

Let’s see how, by taking this route, we can give semi-plausible answers to the puzzles about knowledge discussed earlier in this section. First, the reason why you can “know” that $1591 = 43 \times 37$ , but at the same time not “know” the prime factors of 1591, is that, when we speak about knowing the answers to these questions, we really mean knowing how to answer them. And as we saw, there need not be any contradiction in knowing a fast multiplication algorithm but not a fast factoring algorithm, even if we model your knowledge about algorithms as deductively closed. To put it another way, by embedding the two questions

Q1 = “Is $1591 = 43 \times 37$ ?”

Q2 = “What are the prime factors of 1591?”

into infinite families of related questions, we can break the symmetry between the knowledge entailed in answering them.

Similarly, we could think of a child as possessing an internal algorithm which, given any statement of the form $x + y = y + x$ (for specific $x$ and $y$ values), immediately outputs true, without even examining $x$ and $y$. However, the child does not yet have the ability to process quantified statements, such as “$\forall x, y \in \mathbb{R} \ x + y = y + x$.” In that sense, she still lacks the explicit knowledge that addition is commutative.

Although the “cure” for logical omniscience sketched above solves some puzzles, not surprisingly it raises many puzzles of its own. So let me end this section by discussing three major objections to the “infinity cure.”

The first objection is that we’ve simply pushed the problem of logical omniscience somewhere else. For suppose an agent “knows” how to compute every function in some restricted class such as FP. Then how can we ever make sense of the agent learning a new algorithm? One natural response is that, even if you have the “latent ability” to compute a function $f \in \text{FP}$ , you might not know that you have the ability—either because you don’t know a suitable algorithm, or because you do know an algorithm, but don’t know that it’s an algorithm for $f$ . Of course, if we wanted to pursue things to the bottom, we’d next need to tell a story about knowledge of algorithms, and how logical omniscience is avoided there. However, I claim that this represents progress! For notice that, even without such a story, we can already explain some failures of logical omniscience. For example, the reason why you don’t know the factors of a large number might not be your ignorance of a fast factoring method, but rather that no such method exists.

The second objection is that, when I advocated focusing on infinite families of questions rather than single questions in isolation, I never specified which infinite families. The difficulty is that the same question could be generalized in wildly different ways. As an example, consider the question

Q = “Is 432,150 composite?”

Q is an instance of a computational problem that humans find very hard: “given a large integer $N$ , is $N$ composite?” However, Q is also an instance of a computational problem that humans find very easy: “given a large integer $N$ ending in 0, is $N$ composite?” And indeed, we’d expect a person to know the answer to Q if she noticed that 432,150 ends in 0, but not otherwise. To me, what this example demonstrates is that, if we want to discuss an agent’s knowledge in terms of individual questions such as Q, then the relevant issue will be whether there exists a generalization G of Q, such that the agent knows a fast algorithm for answering questions of type G, and also recognizes that Q is of type G.
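In programming terms, the agent’s situation might be sketched like this (a deliberately toy illustration, with the generalization G hard-coded):

```python
# The agent knows a fast algorithm for the easy generalization G ("integers
# ending in 0") and answers Q only upon recognizing Q as an instance of G.

def knows_composite(n: int):
    if n % 10 == 0 and n > 10:      # Q recognized as an instance of G
        return True                 # divisible by 10, hence composite
    return None                     # no applicable generalization recognized

print(knows_composite(432150))      # True: the agent "knows" the answer
print(knows_composite(432151))      # None: same kind of question, no answer
```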

The third objection is just the standard one about the relationship between asymptotic complexity and finite statements. For example, if we model an agent’s knowledge using the Cobham axioms, then we can indeed explain why the agent doesn’t know how to play perfect chess on an $n \times n$ board, for arbitrary values of $n$.31 But on a standard $8 \times 8$ board, playing perfect chess would “merely” require (say) $\sim 10^{60}$ computational steps, which is a constant, and therefore certainly polynomial! So strictly on the basis of the Cobham axioms, what explanation could we possibly offer for why a rational agent, who knew the rules of $8 \times 8$ chess, didn’t also know how to play it optimally? While this objection might sound devastating, it’s important to understand that it’s no different from the usual objection leveled against complexity-theoretic arguments, and can be given the usual response. Namely: asymptotic statements are always vulnerable to being rendered irrelevant, if the constant factors turn out to be ridiculous. However, experience has shown that, for whatever reasons, that happens rarely enough that one can usually take asymptotic behavior as “having explanatory force until proven otherwise.” (Section 12 will say more about the explanatory force of asymptotic claims, as a problem requiring philosophical analysis.)

31For chess on an $n \times n$ board is known to be EXP-complete, and it is also known that $\text{P} \neq \text{EXP}$. See Section 10, and particularly footnote 60, for more details.

5.3 Summary

Because of the difficulties pointed out in Section 5.2, my own view is that computational complexity theory has not yet come close to “solving” the logical omniscience problem, in the sense of giving a satisfying formal account of knowledge that also avoids making absurd predictions. I have no idea whether such an account is even possible.32 However, what I’ve tried to show in this section is that complexity theory provides a well-defined “limiting case” where the logical omniscience problem is solvable, about as well as one could hope it to be. The limiting case is where the size of the questions grows without bound, and the solution there is given by the Cobham axioms: “axioms of knowing how” whose logical closure one can take without thereby inviting omniscience.

In other words, when we contemplate the omniscience problem, I claim that we’re in a situation similar to one often faced in physics—where we might be at a loss to understand some phenomenon (say, gravitational entropy), except in limiting cases such as black holes. In epistemology just like in physics, the limiting cases that we do more-or-less understand offer an obvious starting point for those wishing to tackle the general case.

6 Computationalism and Waterfalls

Over the past two decades, a certain argument about computation—which I’ll call the waterfall argument—has been widely discussed by philosophers of mind.33 Like Searle’s famous Chinese Room argument [113], the waterfall argument seeks to show that computations are “inherently syntactic,” and can never be “about” anything—and that for this reason, the doctrine of “computationalism” is false.34 But unlike the Chinese Room, the waterfall argument supplements the bare appeal to intuition by a further claim: namely, that the “meaning” of a computation, to whatever extent it has one, is always relative to some external observer.

More concretely, consider a waterfall (though any other physical system with a large enough state space would do as well). Here I do not mean a waterfall that was specially engineered to perform computations, but really a naturally-occurring waterfall: say, Niagara Falls. Being governed by laws of physics, the waterfall implements some mapping $f$ from a set of possible initial states to a set of possible final states. If we accept that the laws of physics are reversible, then $f$ must also be injective. Now suppose we restrict attention to some finite subset $S$ of possible initial states, with $|S| = n$. Then $f$ is just a one-to-one mapping from $S$ to some output set $T = f(S)$ with $|T| = n$. The “crucial observation” is now this: given any permutation $\sigma$ from the set of integers $\{1, \dots, n\}$ to itself, there is some way to label the elements of $S$ and $T$ by integers in $\{1, \dots, n\}$, such that we can interpret $f$ as implementing $\sigma$. For example, if we let $S = \{s_1, \dots, s_n\}$ and $f(s_i) = t_i$, then it suffices to label the initial state $s_i$ by $i$ and the final state $t_i$ by $\sigma(i)$. But the permutation $\sigma$ could have any “semantics” we like: it might represent a program for playing chess, or factoring integers, or simulating a different waterfall. Therefore “mere computation” cannot give rise to semantic meaning. Here is how Searle [114, p. 57] expresses the conclusion:

If we are consistent in adopting the Turing test or some other “objective” criterion for intelligent behavior, then the answer to such questions as “Can unintelligent bits of matter produce intelligent behavior?” and even, “How exactly do they do it” are ludicrously obvious. Any thermostat, pocket calculator, or waterfall produces “intelligent behavior,” and we know in each case how it works. Certain artifacts are designed to behave as if they were intelligent, and since everything follows laws of nature, then everything will have some description under which it behaves as if it were intelligent. But this sense of “intelligent behavior” is of no psychological relevance at all.

32Compare the pessimism expressed by Paul Graham [68] about knowledge representation more generally:

In practice formal logic is not much use, because despite some progress in the last 150 years we’re still only able to formalize a small percentage of statements. We may never do that much better, for the same reason 1980s-style “knowledge representation” could never have worked; many statements may have no representation more concise than a huge, analog brain state.

33See Putnam [106, appendix] and Searle [114] for two instantiations of the argument (though the formal details of either will not concern us here).

34“Computationalism” refers to the view that the mind is literally a computer, and that thought is literally a type of computation.

The waterfall argument has been criticized on numerous grounds: see Haugeland [71], Block [30], and especially Chalmers [37] (who parodied the argument by proving that a cake recipe, being merely syntactic, can never give rise to the semantic attribute of crumbliness). To my mind, though, perhaps the easiest way to demolish the waterfall argument is through computational complexity considerations.

Indeed, suppose we actually wanted to use a waterfall to help us calculate chess moves. How would we do that? In complexity terms, what we want is a reduction from the chess problem to the waterfall-simulation problem. That is, we want an efficient algorithm that somehow encodes a chess position $P$ into an initial state $s_P \in S$ of the waterfall, in such a way that a good move from $P$ can be read out efficiently from the waterfall’s corresponding final state, $f(s_P) \in T$ .35 But what would such an algorithm look like? We cannot say for sure—certainly not without detailed knowledge about $f$ (i.e., the physics of waterfalls), as well as the means by which the $S$ and $T$ elements are encoded as binary strings. But for any reasonable choice, it seems overwhelmingly likely that any reduction algorithm would just solve the chess problem itself, without using the waterfall in an essential way at all! A bit more precisely, I conjecture that, given any chess-playing algorithm $A$ that accesses a “waterfall oracle” $W$ , there is an equally-good chess-playing algorithm $A'$ , with similar time and space requirements, that does not access $W$ . If this conjecture holds, then it gives us a perfectly observer-independent way to formalize our intuition that the “semantics” of waterfalls have nothing to do with chess.36


35Technically, this describes a restricted class of reductions, called nonadaptive reductions. An adaptive reduction from chess to waterfalls might solve a chess problem by some procedure that involves initializing a waterfall and observing its final state, then using the results of that aquatic computation to initialize a second waterfall and observe its final state, and so on for some polynomial number of repetitions.

36The perceptive reader might suspect that we smuggled our conclusion into the assumption that the waterfall states $s_P \in S$ and $f(s_P) \in T$ were encoded as binary strings in a “reasonable” way (and not, for example, in a way that encodes the solution to the chess problem). But a crucial lesson of complexity theory is that, when we discuss “computational problems,” we always make an implicit commitment about the input and output encodings anyway! So for example, if positive integers were given as input via their prime factorizations, then the factoring problem would be trivial (just apply the identity function). But who cares? If, in mathematically defining the waterfall-simulation problem, we required input and output encodings that entailed solving chess problems, then it would no longer be reasonable to call our problem (solely) a “waterfall-simulation problem” at all.

6.1 “Reductions” That Do All The Work

Interestingly, the issue of “trivial” or “degenerate” reductions also arises within complexity theory, so it might be instructive to see how it is handled there. Recall from Section 3.1 that a problem is NP-complete if, loosely speaking, it is “maximally hard among all NP problems” (NP being the class of problems for which solutions can be checked in polynomial time). More formally, we say that $L$ is NP-complete if

  • (i) $L \in \text{NP}$ , and
  • (ii) given any other NP problem $L'$ , there exists a polynomial-time algorithm to solve $L'$ using access to an oracle that solves $L$ . (Or more succinctly, $L' \in \mathbf{P}^L$ , where $\mathbf{P}^L$ denotes the complexity class P augmented by an $L$ -oracle.)

The concept of NP-completeness had incredible explanatory power: it showed that thousands of seemingly-unrelated problems from physics, biology, industrial optimization, mathematical logic, and other fields were all identical from the standpoint of polynomial-time computation, and that not one of these problems had an efficient solution unless $\mathbf{P} = \text{NP}$ . Thus, it was natural for theoretical computer scientists to want to define an analogous concept of P-completeness. In other words: among all the problems that are solvable in polynomial time, which ones are “maximally hard”?

But how should P-completeness even be defined? To see the difficulty, suppose that, by analogy with NP-completeness, we say that $L$ is P-complete if

  • (i) $L \in \mathbf{P}$ and
  • (ii) $L' \in \mathbf{P}^L$ for every $L' \in \mathbf{P}$ .

Then it is easy to see that the second condition is vacuous: every P problem is P-complete! For in “reducing” $L'$ to $L$ , a polynomial-time algorithm can always just ignore the $L$ -oracle and solve $L'$ by itself, much like our hypothetical chess program that ignored its waterfall oracle. Because of this, condition (ii) must be replaced by a stronger condition; one popular choice is

  • (ii') $L' \in \text{LOGSPACE}^L$ for every $L' \in \mathbf{P}$ .

Here LOGSPACE means, informally, the class of problems solvable by a deterministic Turing machine with a read/write memory consisting of only $\log n$ bits, given an input of size $n$ .37 It’s not hard to show that $\text{LOGSPACE} \subseteq \mathbf{P}$ , and this containment is strongly believed to be strict (though just like with $\mathbf{P} \neq \text{NP}$ , there is no proof yet). The key point is that, if we want a non-vacuous notion of completeness, then the reducing complexity class needs to be weaker (either provably or conjecturally) than the class being reduced to. In fact complexity classes even smaller than LOGSPACE almost always suffice in practice.
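A toy example—with the problems chosen purely for illustration—shows why condition (ii) was vacuous: a polynomial-time “reduction” is free to ignore its oracle entirely.

```python
# L' = "does the bit-string contain an even number of 1s?" lies in P, so a
# polynomial-time "reduction" from L' to any oracle L need never consult the
# oracle: a degenerate reduction that does all the work itself.

def parity_reduction(x: str, L_oracle) -> bool:
    return x.count("1") % 2 == 0     # solves L' directly; L_oracle is unused

print(parity_reduction("1011", lambda s: True))   # False (three 1s)
```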

In my view, there is an important lesson here for debates about computationalism. Suppose we want to claim, for example, that a computation that plays chess is “equivalent” to some other computation that simulates a waterfall. Then our claim is only non-vacuous if it’s possible to exhibit the equivalence (i.e., give the reductions) within a model of computation that isn’t itself powerful enough to solve the chess or waterfall problems.


37Note that a LOGSPACE machine does not even have enough memory to store its input string! For this reason, we think of the input string as being provided on a special read-only tape.

7 PAC-Learning and the Problem of Induction

Centuries ago, David Hume [76] famously pointed out that learning from the past (and, by extension, science) seems logically impossible. For example, if we sample 500 ravens and every one of them is black, why does that give us any grounds—even probabilistic grounds—for expecting the 501st raven to be black also? Any modern answer to this question would probably refer to Occam’s razor, the principle that simpler hypotheses consistent with the data are more likely to be correct. So for example, the hypothesis that all ravens are black is “simpler” than the hypothesis that most ravens are green or purple, and that only the 500 we happened to see were black. Intuitively, it seems Occam’s razor must be part of the solution to Hume’s problem; the difficulty is that such a response leads to questions of its own:

  • (1) What do we mean by “simpler”?
  • (2) Why are simple explanations likely to be correct? Or, less ambitiously: what properties must reality have for Occam’s Razor to “work”?
  • (3) How much data must we collect before we can find a “simple hypothesis” that will probably predict future data? How do we go about finding such a hypothesis?

In my view, the theory of PAC (Probably Approximately Correct) learning, initiated by Leslie Valiant [127] in 1984, has made large enough advances on all of these questions that it deserves to be studied by anyone interested in induction.38 In this theory, we consider an idealized “learner,” who is presented with points $x_1, \dots, x_m$ drawn randomly from some large set $\mathcal{S}$, together with the “classifications” $f(x_1), \dots, f(x_m)$ of those points. The learner’s goal is to infer the function $f$, well enough to be able to predict $f(x)$ for most future points $x \in \mathcal{S}$. As an example, the learner might be a bank, $\mathcal{S}$ might be a set of people (represented by their credit histories), and $f(x)$ might represent whether or not person $x$ will default on a loan.

For simplicity, we often assume that $\mathcal{S}$ is a set of binary strings, and that the function $f$ maps each $x \in \mathcal{S}$ to a single bit, $f(x) \in \{0, 1\}$. Both assumptions can be removed without significantly changing the theory. The important assumptions are the following:

  • (1) Each of the sample points $x_1, \dots, x_m$ is drawn independently from some (possibly-unknown) “sample distribution” $\mathcal{D}$ over $\mathcal{S}$. Furthermore, the future points $x$ on which the learner will need to predict $f(x)$ are drawn from the same distribution.
  • (2) The function $f$ belongs to a known “hypothesis class” $\mathcal{H}$. This $\mathcal{H}$ represents “the set of possibilities the learner is willing to entertain” (and is typically much smaller than the set of all $2^{|\mathcal{S}|}$ possible functions from $\mathcal{S}$ to $\{0, 1\}$).

Under these assumptions, we have the following central result.


38See Kearns and Vazirani [82] for an excellent introduction to PAC-learning, and de Wolf [136] for previous work applying PAC-learning to philosophy and linguistics: specifically, to fleshing out Chomsky’s “poverty of the stimulus” argument. De Wolf also discusses several formalizations of Occam’s Razor other than the one based on PAC-learning.

Theorem 2 (Valiant [127]) Consider a finite hypothesis class $\mathcal{H}$, a Boolean function $f : \mathcal{S} \rightarrow \{0, 1\}$ in $\mathcal{H}$, and a sample distribution $\mathcal{D}$ over $\mathcal{S}$, as well as an error rate $\varepsilon > 0$ and failure probability $\delta > 0$ that the learner is willing to tolerate. Call a hypothesis $h : \mathcal{S} \rightarrow \{0, 1\}$ “good” if

$$\Pr_{x \sim \mathcal{D}} [h(x) = f(x)] \geq 1 - \varepsilon.$$

Also, call sample points $x_1, \dots, x_m$ “reliable” if any hypothesis $h \in \mathcal{H}$ that satisfies $h(x_i) = f(x_i)$ for all $i \in \{1, \dots, m\}$ is good. Then

$$m = \frac{1}{\varepsilon} \ln \frac{|\mathcal{H}|}{\delta}$$

sample points $x_1, \dots, x_m$ drawn independently from $\mathcal{D}$ will be reliable with probability at least $1 - \delta$ .

Intuitively, Theorem 2 says that the behavior of $f$ on a small number of randomly-chosen points probably determines its behavior on most of the remaining points. In other words, if, by some unspecified means, the learner manages to find any hypothesis $h \in \mathcal{H}$ that makes correct predictions on all its past data points $x_1, \dots, x_m$ , then provided $m$ is large enough (and as it happens, $m$ doesn’t need to be very large), the learner can be statistically confident that $h$ will also make the correct predictions on most future points.

The part of Theorem 2 that bears the unmistakable imprint of complexity theory is the bound on sample size, $m \geq \frac{1}{\varepsilon} \ln \frac{|\mathcal{H}|}{\delta}$ . This bound has three notable implications. First, even if the class $\mathcal{H}$ contains exponentially many hypotheses (say, $2^n$ ), one can still learn an arbitrary function $f \in \mathcal{H}$ using a linear amount of sample data, since $m$ grows only logarithmically with $|\mathcal{H}|$ : in other words, like the number of bits needed to write down an individual hypothesis. Second, one can make the probability that the hypothesis $h$ will fail to generalize exponentially small (say, $\delta = 2^{-n}$ ), at the cost of increasing the sample size $m$ by only a linear factor. Third, assuming the hypothesis does generalize, its error rate $\varepsilon$ decreases inversely with $m$ . It is not hard to show that each of these dependencies is tight, so that for example, if we demand either $\varepsilon = 0$ or $\delta = 0$ then no finite $m$ suffices. This is the origin of the name “PAC-learning”: the most one can hope for is to output a hypothesis that is “probably, approximately” correct.
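As a quick numerical illustration (with parameter values chosen arbitrarily), suppose each hypothesis takes 1000 bits to write down, so that $|\mathcal{H}| = 2^{1000}$:

```python
from math import ceil, log

def pac_sample_size(log2_num_hypotheses: float, eps: float, delta: float) -> int:
    # m = (1/eps) * ln(|H| / delta), the bound from Theorem 2
    return ceil((log2_num_hypotheses * log(2) + log(1 / delta)) / eps)

# 2^1000 hypotheses, 1% error rate, one-in-a-million failure probability:
print(pac_sample_size(1000, 0.01, 1e-6))   # 70697 samples: linear in 1000
```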

The proof of Theorem 2 is easy: consider any hypothesis $h \in \mathcal{H}$ that is bad, meaning that

$$\Pr_{x \sim \mathcal{D}} [h(x) = f(x)] < 1 - \varepsilon.$$

Then by the independence assumption,

$$\Pr_{x_1, \dots, x_m \sim \mathcal{D}} [h(x_1) = f(x_1) \wedge \dots \wedge h(x_m) = f(x_m)] < (1 - \varepsilon)^m.$$

Now, the number of bad hypotheses is no more than the total number of hypotheses, $|\mathcal{H}|$. So by the union bound, the probability that there exists a bad hypothesis that agrees with $f$ on all of $x_1, \dots, x_m$ is at most $|\mathcal{H}| \cdot (1 - \varepsilon)^m$. It therefore suffices to choose $m$ so that $|\mathcal{H}| \cdot (1 - \varepsilon)^m \leq \delta$, and all that remains is to solve for $m$.
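Explicitly, one standard way to finish the calculation uses the inequality $1 - \varepsilon \leq e^{-\varepsilon}$: it suffices to have

$$|\mathcal{H}| \cdot (1 - \varepsilon)^m \leq |\mathcal{H}| \cdot e^{-\varepsilon m} \leq \delta, \qquad \text{i.e.,} \qquad m \geq \frac{1}{\varepsilon} \ln \frac{|\mathcal{H}|}{\delta}.$$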

The relevance of Theorem 2 to Hume’s problem of induction is that the theorem describes a nontrivial class of situations where induction is guaranteed to work with high probability. Theorem 2 also illuminates the role of Occam’s Razor in induction. In order to learn using a “reasonable” number of sample points $m$, the hypothesis class $\mathcal{H}$ must have a sufficiently small cardinality. But that is equivalent to saying that every hypothesis $h \in \mathcal{H}$ must have a succinct description—since the number of bits needed to specify an arbitrary hypothesis $h \in \mathcal{H}$ is simply $\lceil \log_2 |\mathcal{H}| \rceil$. If the number of bits needed to specify a hypothesis is too large, then $\mathcal{H}$ will always be vulnerable to the problem of overfitting: some hypotheses $h \in \mathcal{H}$ surviving contact with the sample data just by chance.

As pointed out to me by Agustín Rayo, there are several possible interpretations of Occam’s Razor that have nothing to do with descriptive complexity: for example, we might want our hypotheses to be “simple” in terms of their ontological or ideological commitments. However, to whatever extent we interpret Occam’s Razor as saying that shorter or lower-complexity hypotheses are preferable, Theorem 2 comes closer than one might have thought possible to a mathematical justification for why the Razor works.

Many philosophers might be familiar with alternative formal approaches to Occam’s Razor. For example, within a Bayesian framework, one can choose a prior over all possible hypotheses that gives greater weight to “simpler” hypotheses (where simplicity is measured, for example, by the length of the shortest program that computes the predictions). However, while the PAC-learning and Bayesian approaches are related, the PAC approach has the advantage of requiring only a qualitative decision about which hypotheses one wants to consider, rather than a quantitative prior over hypotheses. Given the hypothesis class $\mathcal{H}$ , one can then seek learning methods that work for any $f \in \mathcal{H}$ . (On the other hand, the PAC approach requires an assumption about the probability distribution over observations, while the Bayesian approach does not.)

7.1 Drawbacks of the Basic PAC Model

I’d now like to discuss three drawbacks of Theorem 2, since I think the drawbacks illuminate philosophical aspects of induction as well as the advantages do.

The first drawback is that Theorem 2 works only for finite hypothesis classes. In science, however, hypotheses often involve continuous parameters, of which there is an uncountable infinity. Of course, one could solve this problem by simply discretizing the parameters, but then the number of hypotheses (and therefore the relevance of Theorem 2) would depend on how fine the discretization was. Fortunately, we can avoid such difficulties by realizing that the learner only cares about the “differences” between two hypotheses insofar as they lead to different predictions. This leads to the fundamental notion of VC-dimension (after its originators, Vapnik and Chervonenkis [129]).

Definition 3 (VC-dimension) A hypothesis class $\mathcal{H}$ shatters the sample points $\{x_1, \dots, x_k\} \subseteq \mathcal{S}$ if for all $2^k$ possible settings of $h(x_1), \dots, h(x_k)$, there exists a hypothesis $h \in \mathcal{H}$ compatible with those settings. Then $\text{VCdim}(\mathcal{H})$, the VC-dimension of $\mathcal{H}$, is the largest $k$ for which there exists a subset $\{x_1, \dots, x_k\} \subseteq \mathcal{S}$ that $\mathcal{H}$ shatters (or if no finite maximum exists, then $\text{VCdim}(\mathcal{H}) = \infty$).

Clearly any finite hypothesis class has finite VC-dimension: indeed, $\text{VCdim}(\mathcal{H}) \leq \log_2 |\mathcal{H}|$. However, even an infinite hypothesis class can have finite VC-dimension if it is “sufficiently simple.” For example, let $\mathcal{H}$ be the class of all functions $h_{a,b} : \mathbb{R} \rightarrow \{0, 1\}$ of the form

$$h_{a,b}(x) = \begin{cases} 1 & \text{if } a \leq x \leq b \\ 0 & \text{otherwise.} \end{cases}$$

Then it is easy to check that $\text{VCdim}(\mathcal{H}) = 2$ .
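Such small cases can even be verified by brute force; here is a quick sketch, where a finite grid of intervals stands in (as an illustrative approximation) for the full class $\mathcal{H}$:

```python
def shatters(points, hypotheses) -> bool:
    # H shatters the points iff every 0/1 labeling is realized by some h in H.
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

grid = [i / 10 for i in range(-20, 21)]
H = [lambda x, a=a, b=b: int(a <= x <= b) for a in grid for b in grid if a <= b]

print(shatters([0.3, 0.7], H))        # True: any two points can be shattered
print(shatters([0.2, 0.5, 0.8], H))   # False: labeling (1, 0, 1) is impossible
```

No interval can contain the two outer points while excluding the middle one, which is exactly why no three points are shattered and $\text{VCdim}(\mathcal{H}) = 2$.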

With the notion of VC-dimension in hand, we can state a powerful (and harder-to-prove!) generalization of Theorem 2, due to Blumer et al. [31].

Theorem 4 (Blumer et al. [31]) For some universal constant $K > 0$, the bound on $m$ in Theorem 2 can be replaced by

$$m = \frac{K \cdot \text{VCdim}(\mathcal{H})}{\varepsilon} \ln \frac{1}{\delta\varepsilon},$$

with the theorem now holding for any hypothesis class $\mathcal{H}$ , finite or infinite.

If $\mathcal{H}$ has infinite VC-dimension, then it is easy to construct a probability distribution $\mathcal{D}$ over sample points such that no finite number $m$ of samples from $\mathcal{D}$ suffices to PAC-learn a function $f \in \mathcal{H}$ : one really is in the unfortunate situation described by Hume, of having no grounds at all for predicting that the next raven will be black. In some sense, then, Theorem 4 is telling us that finite VC-dimension is a necessary and sufficient condition for scientific induction to be possible. Once again, Theorem 4 also has an interpretation in terms of Occam’s Razor, with the smallness of the VC-dimension now playing the role of simplicity.

The second drawback of Theorem 2 is that it gives us no clues about how to find a hypothesis $h \in \mathcal{H}$ consistent with the sample data. All it says is that, if we find such an $h$ , then $h$ will probably be close to the truth. This illustrates that, even in the simple setup envisioned by PAC-learning, induction cannot be merely a matter of seeing enough data and then “generalizing” from it, because immense computations might be needed to find a suitable generalization! Indeed, following the work of Kearns and Valiant [81], we now know that many natural learning problems—as an example, inferring the rules of a regular or context-free language from random examples of grammatical and ungrammatical sentences—are computationally intractable in an extremely strong sense:

Any polynomial-time algorithm for finding a hypothesis consistent with the data would imply a polynomial-time algorithm for breaking widely-used cryptosystems such as RSA!39

The appearance of cryptography in the above statement is far from accidental. In a sense that can be made precise, learning and cryptography are “dual” problems: a learner wants to find patterns in data, while a cryptographer wants to generate data whose patterns are hard to find. More concretely, one of the basic primitives in cryptography is called a pseudorandom function family. This is a family of efficiently-computable Boolean functions $f_s : \{0, 1\}^n \rightarrow \{0, 1\}$, parameterized by a short random “seed” $s$, that are virtually indistinguishable from random functions by a polynomial-time algorithm. Here, we imagine that the would-be distinguishing algorithm can query the function $f_s$ on various points $x$, and also that it knows the mapping from $s$ to $f_s$, and so is ignorant only of the seed $s$ itself. There is strong evidence in cryptography that pseudorandom function families exist: indeed, Goldreich, Goldwasser, and Micali [64] showed how to construct one starting from any pseudorandom generator (the latter was mentioned in Section 1.1).

Now, given a pseudorandom function family $\{f_s\}$, imagine a PAC-learner whose hypothesis class $\mathcal{H}$ consists of $f_s$ for all possible seeds $s$. The learner is provided some randomly-chosen sample points $x_1, \dots, x_m \in \{0, 1\}^n$, together with the values of $f_s$ on those points: $f_s(x_1), \dots, f_s(x_m)$. Given this “training data,” the learner’s goal is to figure out how to compute $f_s$ for itself—and thereby predict the values of $f_s(x)$ on new points $x$, points not in the training sample. Unfortunately, it’s easy to see that if the learner could do that, then it would thereby distinguish $f_s$ from a truly random function, contradicting our starting assumption that $\{f_s\}$ was pseudorandom! Our conclusion is that, if the basic assumptions of modern cryptography hold (and in particular, if there exist pseudorandom generators), then there must be situations where learning is impossible purely because of computational complexity (and not because of insufficient data).

39In the setting of “proper learning”—where the learner needs to output a hypothesis in some specified format—it is even known that many natural PAC-learning problems are NP-complete (see Pitt and Valiant [104] for example). But in the “improper” setting—where the learner can describe its hypothesis using any polynomial-time algorithm—it is only known how to show that PAC-learning problems are hard under cryptographic assumptions, and there seem to be inherent reasons for this (see Applebaum, Barak, and Xiao [14]).
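To see the obstruction concretely, here is a toy sketch in which HMAC-SHA256 stands in, heuristically, for a pseudorandom function family (nothing about this particular choice is canonical):

```python
import hashlib, hmac, os, random

def f(s: bytes, x: bytes) -> int:
    # Candidate PRF f_s: the low-order bit of HMAC-SHA256 keyed by the seed s.
    return hmac.new(s, x, hashlib.sha256).digest()[-1] & 1

s = os.urandom(16)                                  # the hidden seed
train = {x: f(s, x) for x in (os.urandom(8) for _ in range(1000))}

# A learner sees `train` but not `s`; if {f_s} really is pseudorandom, then no
# polynomial-time learner can do much better than coin-flipping on fresh points.
test = [os.urandom(8) for _ in range(1000)]
hits = sum(random.randrange(2) == f(s, x) for x in test)
print(f"blind guessing: {hits / len(test):.1%}")    # ~50%, conjecturally the best
                                                    # any efficient learner can do
```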

The third drawback of Theorem 2 is the assumption that the distribution $\mathcal{D}$ on which the learner is tested is the same as the distribution from which the sample points were drawn. To me, this is the most serious drawback, since it tells us that PAC-learning models the “learning” performed by an undergraduate cramming for an exam by solving last year’s problems, or an employer using a regression model to identify the characteristics of successful hires, or a cryptanalyst breaking a code from a collection of plaintexts and ciphertexts. It does not, however, model the “learning” of an Einstein or a Szilard, making predictions about phenomena that are different in kind from anything yet observed. As David Deutsch stresses in his recent book The Beginning of Infinity [49], the goal of science is not merely to summarize observations, and thereby let us make predictions about similar observations. Rather, the goal is to discover explanations with “reach,” meaning the ability to predict what would happen even in novel or hypothetical situations, like the Sun suddenly disappearing or a quantum computer being built. In my view, developing a compelling mathematical model of explanatory learning—a model that “is to explanation as the PAC model is to prediction”—is an outstanding open problem.40

7.2 Computational Complexity, Bleen, and Grue

In 1955, Nelson Goodman [67] proposed what he called the “new riddle of induction,” which survives the Occam’s Razor answer to Hume’s original induction problem. In Goodman’s riddle, we are asked to consider the hypothesis “All emeralds are green.” The question is, why do we favor that hypothesis over the following alternative, which is equally compatible with all our evidence of green emeralds?

“All emeralds are green before January 1, 2030, and then blue afterwards.”

The obvious answer is that the second hypothesis adds superfluous complications, and is therefore disfavored by Occam’s Razor. To that, Goodman replies that the definitions of “simple” and “complicated” depend on our language. In particular, suppose we had no words for green or blue, but we did have a word grue, meaning “green before January 1, 2030, and blue afterwards,” and a word bleen, meaning “blue before January 1, 2030, and green afterwards.” In that case, we could only express the hypothesis “All emeralds are green” by saying

“All emeralds are grue before January 1, 2030, and then bleen afterwards.”

—a manifestly more complicated hypothesis than the simple “All emeralds are grue”!


40Important progress toward this goal includes the work of Angluin [11] on learning finite automata from queries and counterexamples, and that of Angluin et al. [12] on learning a circuit by injecting values. Both papers study natural learning models that generalize the PAC model by allowing “controlled scientific experiments,” whose results confirm or refute a hypothesis and thereby provide guidance about which experiments to do next.

I confess that, when I contemplate the grue riddle, I can't help but recall the joke about the Anti-Inductivists, who, when asked why they continue to believe that the future won't resemble the past, when that false belief has brought their civilization nothing but poverty and misery, reply, "because anti-induction has never worked before!" Yes, if we artificially define our primitive concepts "against the grain of the world," then we shouldn't be surprised if the world's actual behavior becomes more cumbersome to describe, or if we make wrong predictions. It would be as if we were using a programming language that had no built-in function for multiplication, but only for $F(x, y) := 17x - y - x^2 + 2xy$. In that case, a normal person's first instinct would be either to switch programming languages, or else to define multiplication in terms of $F$, and forget about $F$ from that point onward.41 Now, there is a genuine philosophical problem here: why do grue, bleen, and $F(x, y)$ go "against the grain of the world," whereas green, blue, and multiplication go with the grain? But to me, that problem (like Wigner's puzzlement over "the unreasonable effectiveness of mathematics in the natural sciences" [135]) is more about the world itself than about human concepts, so we shouldn't expect any purely linguistic analysis to resolve it.

What about computational complexity, then? In my view, while computational complexity doesn't solve the grue riddle, it does contribute a useful insight. Namely, that when we talk about the simplicity or complexity of hypotheses, we should distinguish two issues:

  • (a) The asymptotic scaling of the hypothesis size, as the "size" $n$ of our learning problem goes to infinity.
  • (b) The constant-factor overheads.

In terms of the basic PAC model in Section 7, we can imagine a "hidden parameter" $n$, which measures the number of bits needed to specify an individual point in the set $\mathcal{S} = \mathcal{S}_n$. (Other ways to measure the "size" of a learning problem would also work, but this way is particularly convenient.) For convenience, we can identify $\mathcal{S}_n$ with the set $\{0, 1\}^n$ of $n$-bit strings, so that $n = \log_2 |\mathcal{S}_n|$. We then need to consider, not just a single hypothesis class, but an infinite family of hypothesis classes $\mathcal{H} = \{\mathcal{H}_1, \mathcal{H}_2, \mathcal{H}_3, \dots\}$, one for each positive integer $n$. Here $\mathcal{H}_n$ consists of hypothesis functions $h$ that map $\mathcal{S}_n = \{0, 1\}^n$ to $\{0, 1\}$.

Now let $L$ be a language for specifying hypotheses in $\mathcal{H}$: in other words, a mapping from (some subset of) binary strings $y \in \{0, 1\}^*$ to $\mathcal{H}$. Also, given a hypothesis $h \in \mathcal{H}$, let

$$\kappa_L(h) := \min \{|y| : L(y) = h\}$$

be the length of the shortest description of $h$ in the language $L$ . (Here $|y|$ just means the number of bits in $y$ .) Finally, let

$$\kappa_L(n) := \max \{\kappa_L(h) : h \in \mathcal{H}_n\}$$

be the number of bits needed to specify an arbitrary hypothesis in $\mathcal{H}_n$ using the language $L$ . Clearly $\kappa_L(n) \geq \lceil \log_2 |\mathcal{H}_n| \rceil$ , with equality if and only if $L$ is "optimal" (that is, if it represents


41Suppose that our programming language provides only multiplication by constants, addition, and the function $F(x, y) := ax^2 + bxy + cy^2 + dx + ey + f$ . We can assume without loss of generality that $d = e = f = 0$ . Then provided $ax^2 + bxy + cy^2$ factors into two independent linear terms, $px + qy$ and $rx + sy$ , we can express the product $xy$ as

$$\frac{F(sx - qy, -rx + py)}{(ps - qr)^2}.$$
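For the skeptical, the identity can be checked symbolically; taking the quadratic part of the $F$ from the main text ($a = -1$, $b = 2$, $c = 0$, which factors with $p = 1$, $q = 0$, $r = -1$, $s = 2$), a few lines of sympy confirm it:

```python
from sympy import simplify, symbols

x, y = symbols("x y")
a, b, c = -1, 2, 0          # quadratic part of F(x, y) = 17x - y - x^2 + 2xy
p, q, r, s = 1, 0, -1, 2    # since -x^2 + 2xy = (x)(-x + 2y)

def F(u, v):
    return a * u**2 + b * u * v + c * v**2

print(simplify(F(s * x - q * y, -r * x + p * y) / (p * s - q * r) ** 2))   # x*y
```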
