
One of these things is not like the others

Consider the three plots below:

[Figure 1: noisy simulated growth curves for three biological populations (population size vs. time)]

What you're looking at is simulated, noisy data describing the growth of three biological populations over time (population size is shown on the vertical axis with a shared scale, and time on the horizontal). One of those populations is governed by dynamics distinct from those governing the other two.

That last claim requires a little clarification. Roughly speaking, I mean that the way one of those systems evolves is described by a differential equation with a different form from that governing the others. A little more precisely, two of those systems share the same dynamical symmetries. A dynamical symmetry is, in this case, a change in population that commutes with its evolution through time. That is, it makes no difference whether you intervene and transform the population and then let it grow, or let it grow and then transform the population. Two and only two of these three populations share the same set of dynamical symmetries.

Why is the sharing of dynamical symmetries an interesting criterion of sameness? Why are the categories or kinds picked out this way important? Because categories of this sort are 'natural kinds' in that they support induction -- many features of one member generalize to the others (see this paper for a full discussion and careful definitions of the terms used above). I won't give much of an argument here except to point out that many of the most important scientific kinds are kinds of this sort: orbital systems, first-order chemical reactions, and quasi-isolated mechanical systems are all kinds of this sort, and all are central theoretical categories in scientific practice. If we want to do science in a new domain of phenomena, we want to identify such categories to study.
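In symbols (my notation, not necessarily that of the paper): write Λ_t for the map that carries a population state forward by a duration t, and σ for an intervention that transforms the population. Then σ is a dynamical symmetry just in case σ(Λ_t(x)) = Λ_t(σ(x)) for every state x and every duration t: transforming and then growing gives the same result as growing and then transforming.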

This raises an interesting question: Can we find natural kinds of this sort prior to having a theoretical understanding of a domain? Can we spot the categories directly and use them to focus the inquiry that lets us build fully predictive or explanatory theories? In answer to that question, consider the plots below:

[Figure 2: the same three growth curves, colored according to the kinds assigned by EUGENE]

The coloring reflects the categories chosen by EUGENE, an algorithm for automated discovery of natural kinds (see this post). EUGENE groups the first and third into the same kind. And this is in fact correct. The model used to simulate the leftmost and rightmost systems is the classic "logistic equation":

dN/dt = r N (1 - N/K)

The only difference is that the growth rate, r, is much lower in the rightmost system.
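To make the setup concrete, here is a minimal sketch of how noisy data of the kind plotted above might be simulated for the two logistic systems. The integration scheme, noise model, and parameter values (including the two growth rates) are illustrative assumptions on my part, not the values actually used to generate the figures.

```python
import numpy as np

def simulate_logistic(r, K=1000.0, N0=10.0, dt=0.1, steps=500, noise=0.02, seed=0):
    """Euler-integrate dN/dt = r*N*(1 - N/K), then add multiplicative observation noise."""
    rng = np.random.default_rng(seed)
    N = np.empty(steps)
    N[0] = N0
    for i in range(1, steps):
        N[i] = N[i - 1] + dt * r * N[i - 1] * (1.0 - N[i - 1] / K)
    return N * (1.0 + noise * rng.standard_normal(steps))

fast = simulate_logistic(r=0.5)  # like the leftmost system
slow = simulate_logistic(r=0.1)  # same dynamical form, much lower growth rate (rightmost system)
```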

The middle system, on the other hand, the one that EUGENE marked in green, is described by the following equation:

Taken together, these systems exemplify just two varieties of a large family of models of interest to biologists. They are of interest in large part because it's so hard to tell which is correct. That is, it is remarkably difficult to determine experimentally whether a system is described by one or another set of parameters in the general equation:

And yet, accurately and reliably, with no prior knowledge or explicit hypotheses about the governing dynamics, EUGENE can sort them one from another! I think that's a pretty neat trick.

The EUGENE Project

In the spring of 2015, I was lucky enough to receive an NSF CAREER award for a project entitled "Automated scientific discovery and the philosophical problem of natural kinds." The aim of this project is to develop a new approach to automated scientific discovery based on the theory of natural kinds -- in the sense of projectible kinds -- that I've been elaborating for a while (see this paper). More specifically, the aim over the next five years is to produce algorithms that sort dynamic causal systems into natural kinds, as well as algorithms that construct novel variables useful for finding law-like causal relations and additional kinds. These algorithms are intended to be pitted directly against the real world; from the outset, they are being developed to communicate with physical systems via sensors and actuators rather than to work from data that has been preprocessed by a human.

Since the grant is a CAREER award, it funds extensive education and outreach components as well. I am excited to be offering a two-week graduate summer school in "Philosophy & Physical Computing" in July of 2016. I will also be putting on a two-day "Robot Scientist" event for middle school students that will be hosted at the Science Museum of Western Virginia.

My group of student researchers and I have already gotten some promising prototypes of the classifier algorithm -- an algorithm that finds kinds -- to work. I've also given the project a new name: I've begun calling the entire collection of automated discovery algorithms under development "EUGENE", largely in honor of Eugene Wigner, whose ideas were influential in shaping the theory of natural kinds being implemented (hence the title of this post).

In the next few posts, I'll explain the basic algorithm for kind discovery and why one might expect it to uncover useful categories. For now, in order to give a little more of an overview of the project, I'll provide the summary from my grant proposal:

CAREER: Automated scientific discovery and the philosophical problem of natural kinds

In the course of everyday research, scientists are confronted with a recurring problem: out of all the empirical quantities related to some phenomenon of interest, to which should we pay attention if we are to successfully discover the regularities or laws behind the phenomenon? For most ways of carving up the observable world with a choice of theoretical variables, no tractable patterns present themselves. It is only a special few that are 'projectible', that allow us to accurately generalize from a few particular facts to a great many not in evidence. And yet in the course of their work, scientists efficiently choose variables that support generalization. This presents a puzzle, the epistemic version of the philosophical problem of 'natural kinds': how we can know in advance which choices of variables are projectible. This project will clarify and test a new approach to solving this puzzle -- the Dynamical Kinds Theory (DKT) of natural kinds -- by constructing a series of computer algorithms that automatically carry out a process of variable choice in the service of autonomous scientific discovery. The inductive success of these algorithms when applied to genuine problems in current scientific settings will serve as tangible validation of the philosophical theory.

This project connects the philosophical problem of natural kinds with computational problems of automated discovery in artificial intelligence. It tests the DKT by deriving discovery algorithms from that theory's normative content, and then applying these algorithms to real-world phenomena. Successful algorithms imply that in fact the DKT at least captures an important subclass of the projectible kinds. More dramatically, these discovery algorithms have the potential to produce more than one equally effective but inconsistent classification of phenomena into kinds. The existence of such alternatives plays a central role in debates over scientific realism.

The automated discovery algorithms produced will be leveraged to introduce a generation of graduate students in philosophy and science to the deep connections between physical computing and philosophical epistemology. A recurring summer school will train graduate students in basic programming and formal epistemology, with hands-on development of automated discovery systems. Each summer school will culminate in a two-day outreach event at which the graduate students will assist a diverse group of area secondary school children in building their own 'robot scientist'. Students and teachers completing the summer school or outreach programs will leave with their own mini-computers configured for developing their own approaches to discovery. Outside of philosophy, the application of the discovery algorithms to open problems in areas of ecology, evolution, metagenomics, metabolomics, and systems biology has the potential to suggest previously unconceived theories of the fundamental ontology in these fields. In particular, the algorithms will be applied to agent-based models of evolutionary dynamics to search for population-level laws, and to publicly available long-term ecological data to search for stable dynamical kinds outside the standard set of ecological categories.

Some observations on the problem of conceptual novelty in automated discovery

Following a recent conversation with Richard Burian, I realized that both of us had assumed that a necessary if not sufficient condition for a new scientific variable to represent a genuinely novel concept is for it to allow a finer partitioning of possible states of the world than was previously possible. The idea is intuitively plausible. If I posit the existence of a new variable property of material bodies, then I can discriminate more possible states of the world. If, for instance, I posit the existence of an internal resistance for wires, then states that were previously indistinguishable when described in terms of current and voltage in a wire are now potentially distinguishable on the basis of resistance. If I posit the existence of a new kind of particle, then it seems I have recognized a greater variety of possible worlds. What was previously a single state of the world now corresponds to many possible states in which the new particles assume various positions, velocities, and so on. Recognizing a genuinely novel property (or class of properties) seems to entail admitting a finer-grained view of the world. But I'm no longer convinced that's the case.

Before I explain why I'm unconvinced, let me back up and explain the question at issue and where it came from. Since the heyday of logical positivism, the consensus in mainstream philosophy of science has been that there does not exist a "logic of discovery", a method for mechanically generating significant scientific hypotheses. The only serious argument to this effect turns on the notion of conceptual novelty. The key premise is that no algorithmic process can introduce variables (or associated concepts) that were not already present in the presentation of the data or observations for which we are seeking an explanatory hypothesis. So, for instance, Hempel (1966, p. 14) claimed that one cannot "...provide a mechanical routine for constructing, on the basis of the given data, a hypothesis or theory stated in terms of some quite novel concepts, which are nowhere used in the description of the data themselves." Laudan echoed the sentiment a decade and a half later. He conceded that, while machines can certainly carry out algebra and curve-fitting, the essence of scientific discovery is the introduction of explanatory theories "...some of whose central concepts have no observable analogue" (Laudan, 1981, p. 186). Though he makes no explicit argument to this effect, he takes it as obvious that no effective procedure could introduce the sorts of concepts far removed from observation that are at the heart of modern theories.

How much of a stumbling block for automated discovery is the required sort of novelty? That's rather difficult to answer without a more substantive account of conceptual novelty. However, Hempel's syntactic characterization suggests a plausible necessary condition that Laudan would presumably endorse: a new class of variables represents a novel concept just if the values of those variables are not functions of preexisting variables. Thus, if you already have concepts of mass and velocity, adding momentum or kinetic energy (both of which are defined as simple functions of mass and velocity) doesn't really introduce conceptual novelty. However, introducing a new variable m to represent a heretofore unacknowledged property of inertial mass into a theory involving only position and velocity is a sort of conceptual novelty.

Interestingly, introducing properties like inertial mass into theories previously lacking them is just the sort of conceptual invention that automated discovery algorithms were capable of by the end of the decade in which Laudan wrote. I'm thinking specifically of the third program in the BACON lineage developed by Herb Simon, Pat Langley, Gary Bradshaw, and Jan Zytkow (1987). If we take the above condition as genuinely necessary for conceptual novelty, then BACON.3 is at least a counterexample to the claim that the condition cannot be met by an algorithm. It does in fact introduce an inertial mass when given data from experiments with springs, and it introduces a variable for resistance when examining currents in various circuits. Of course, you might just take this as an indication that the proposed condition for conceptual novelty is not sufficient. That's not an argument I want to take up this time.

What I do want to do is scrutinize the notion that positing a novel concept must somehow increase the number of possible worlds we recognize. In the sense of logical possibility, the new variables allow a finer partitioning of the world; equivalently, they are not functions of existing variables. But if their introduction is well-motivated, it seems that enough of the additional logical possibilities are nomologically precluded that the number of ways the world might be remains the same. To see what I mean, it will help to consider in a little detail how BACON.3 introduces a variable. Consider the following table of data (adapted from Figure 4.1 in Langley et al., 1987):

Battery Wire Current (I) Conductance (c) Voltage (v)
A X 3.4763 3.4763 1.0000
A Y 4.8763 4.8763 1.0000
A Z 3.0590 3.0590 1.0000
B X 3.9781 3.4763 1.1444
B Y 5.5803 4.8763 1.1444
B Z 3.5007 3.0590 1.1444
C X 5.5629 3.4763 1.6003
C Y 7.8034 4.8763 1.6003
C Z 4.8952 3.0590 1.6003

BACON begins with the first three columns of data. Letters label distinct wires and batteries. The only variable measured is current, which is represented by a real number. Upon examining the first three rows of the table (corresponding to the same battery but different wires), BACON notes that current varies from wire to wire. The next step of the algorithm is, practically speaking, driven by the fact that BACON cannot relate non-numerical variables (e.g., the identifiers for distinct wires) to numerical variables. But we might give it a rather plausible methodological interpretation: if a variable changes from one circumstance to the next -- in this case, from one wire to the next -- it is reasonable to suppose that there exists a hidden, causally salient property which varies from wire to wire. Let's call that property conductance, and assume that it can be represented by a real number as well.

Following this maxim, BACON introduces a new variable whose values are shown in the fourth column. How were these values determined? As is clear from the table, BACON assigns each wire a conductance equal to the value of the previously known variable, current, as measured with battery A. The authors don't discuss this procedure much, but it is a simple way to ensure that the new variable explains the old in the sense that there is a unique conductance value for each resulting current.

So far, it's not clear that the "new" variable is very informative or novel. But things get interesting when we get to the next three rows of the table. Since each wire was already assigned a value for conductance, BACON uses those values again, and notes that for battery B, the conductance and the current are proportional to one another. Unlike the case for battery A, however, the constant of proportionality is now 1.1444. Similarly, for the last three rows (corresponding to battery C), BACON finds that conductance and current are related by a slope of 1.6003. How to explain this variation? Posit a new variable! This time, we suppose there is a property of batteries (the voltage) that explains the variation, and we assign values identical to the slopes in question. If we note that conductance is the reciprocal of resistance, we can see that BACON has just 'discovered' Ohm's law of resistance: I = v / r. Of course, that relation is tautological if we consider only the data on hand. But treated as a generalization, it is quite powerful and most definitely falsifiable. We might, for instance, find that a new wire, D, has a conductance of c as determined using battery B. But when connected to battery A, the new wire could show a current not equal in value to c. This would violate Ohm's law.
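To make the procedure concrete, here is a minimal sketch -- my own reconstruction in Python, not BACON's actual code -- of the two intrinsic-property moves just described, run on the data in the table above. The names and helper structure are mine; only the data and the logic of the moves come from the example.

```python
# A sketch of the intrinsic-property moves described above (not BACON itself).
data = [
    # (battery, wire, current)
    ("A", "X", 3.4763), ("A", "Y", 4.8763), ("A", "Z", 3.0590),
    ("B", "X", 3.9781), ("B", "Y", 5.5803), ("B", "Z", 3.5007),
    ("C", "X", 5.5629), ("C", "Y", 7.8034), ("C", "Z", 4.8952),
]

# Move 1: current varies from wire to wire for a fixed battery, so posit an
# intrinsic property of wires ("conductance") and set it equal to the current
# observed with the first battery examined (battery A).
conductance = {wire: current for battery, wire, current in data if battery == "A"}

# Move 2: for each battery, current is proportional to conductance; posit an
# intrinsic property of batteries ("voltage") equal to the constant of proportionality.
voltage = {}
for battery in {b for b, _, _ in data}:
    ratios = [current / conductance[wire] for b, wire, current in data if b == battery]
    voltage[battery] = sum(ratios) / len(ratios)  # constant across wires, up to rounding

# On this data the recovered relation I = c * v (Ohm's law with c = 1/r) holds tautologically.
for battery, wire, current in data:
    assert abs(current - conductance[wire] * voltage[battery]) < 1e-3
```

The closing assertion only confirms the tautology on the data in hand; the law acquires empirical content when the stored conductances and voltages are reused to predict the current for a new wire-battery pairing, as in the hypothetical wire D above.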

There are two lessons to draw from the procedure described above. First, it sure seems like positing previously unconsidered intrinsic properties like conductance and voltage amounts to producing novel theoretical concepts. Thus, it looks as though there is no real barrier to the algorithmic production of novelty, and the objections of Hempel, Laudan, and others are simply misguided. Second, the introduction of a novel concept does not entail recognizing a greater diversity of possible worlds, at least not in every sense. It is certainly the case that if we assume that a newly introduced variable can take on any value consistent with its representation (e.g., any real number), then as a matter of logical possibility, we have considered a finer partitioning of states of the world -- there are more ways the world might be for which we can provide mutually exclusive descriptions. But these logical possibilities are, as a rule, moot. The whole reason for introducing a novel variable is to explain previously unexplained variation. That means that a variable is likely to enter scientific consideration already bound up in a nomic relation with other variables. That law-like relationship precludes many logical possibilities. In fact, in cases like Ohm's law, those relationships will be such as to permit only those states of the world we already recognized as possible in terms of known variables.

Note that I am not suggesting there is no way to introduce new variables that allow for a finer discrimination of states of the world. It seems obvious that such a thing is possible. My point is just that it is not necessary. In fact, it seems like in most cases of scientific relevance, the new variables do not provide finer discrimination.

To sum up, variables are introduced to do a job: they are supposed to represent whatever hidden properties vary from one circumstance to the next and so explain a previously unexplained variation. But that means that they are generally introduced along with law-like relations to other variables. These relations generally (or at least often) restrict the values in such a way that no finer partitioning of the states of the world is achieved.

Works cited

Hempel, Carl G. 1966. Philosophy of Natural Science. Prentice-Hall Foundations of Philosophy Series. Englewood Cliffs, N.J: Prentice-Hall.

Langley, Pat, Herbert A. Simon, Gary Bradshaw, and Jan M. Zytkow. 1987. Scientific Discovery: Computational Explorations of the Creative Processes. Cambridge, Mass: MIT Press.

Laudan, Larry. 1981. Science and Hypothesis. Dordrecht, Holland: D. Reidel Publishing Company.