Here's a follow-up to an earlier posting, my first stab at blagging a piece of proper (peer-reviewed) research. The paper, by Pan et al., is published in Nature Neuroscience, and is called Reward prediction based on stimulus categorization in primate lateral prefrontal cortex.
As I said before, this paper is a welcome addition to our knowledge in the field of neuroeconomics, but the results are presented in a way that I regard as partly confused or misleading.
The subjects of the experiments were monkeys. The 'reward prediction' referred to in the title is their making choices, by means of saccades to specific targets, that correctly led to greater reward while the activity of lateral prefrontal cortex (LPFC) neurons was recorded. There's nothing new about this: monkeys with wires stuck in their LPFCs have been saccading to targets that predicted reward for a while. What's relatively new is the specific challenge they faced in identifying the correct target.
First they were trained that each of two different sequences of visual stimuli interspersed with saccade tasks (A1-A2-A3, and B1-B2-B3 - see figure) led to equal reward after the final saccade and stimulus. Then they were separately trained that A3 predicted greater reward than B3.
Then they were offered a 'novel' choice between A1 and B1, and 'correctly' chose A1 most of the time.
Again, as I said before, the behavioral result is cool if not surprising (it's old news in animal learning). The monkeys 'could' have shown no relative preference in the final task, given that the first stimulus in each sequence had not been directly associated with different relative reward at the end of the sequence. And the recordings from large numbers of neurons in the LPFCs of the monkeys found that some were preferentially active for rewards, some for stimuli (A1 or B1), and some for stimulus-reward interactions (e.g. associations between A1 and the differential reward associated with A3 over B3).
The title of the paper, the abstract, and various remarks in it, suggest that we should read this as having something to do with 'categorization', and that the work presents some kind of problem for temporal-difference learning approaches to reward prediction. I don't think either point is correct, even though the work is interesting.
On the first point, categorization has an established meaning in cognitive science. According to the first sentence of the entry on 'Categorization' in the MIT Encyclopedia of the Cognitive Sciences, "Categorization, the process by which distinct entities are treated as equivalent, is one of the most fundamental and pervasive cognitive activities." But this paper isn't about treating distinct entities as equivalent except in the minimal sense that numerically distinct instances of A1 are treated as equivalent. If that's all it takes to make a categorization experiment, just about every neuroeconomics experiment has been a categorization experiment. (Pan and colleagues don't give a definition of categorization, so it's not clear what they mean by it.)
On the second point, predictor-valuation (or temporal difference) models can cope with sequences of predictors where reward only follows the final predictor. (Montague and Berns (2002) report a sequence with two stimuli, where the middle one was sometimes unpredictably timed relative to the first and the reward.) Predictors are predictors of the values of states, and associations establish informational links between states. So making a state have a higher value, which is what the middle stage of the Pan et al study did by making A3 worth more than B3, should raise the values of predictors of ways of getting to those states. This new paper sheds interesting light on how the brain handles predictor valuation in cases where predictors are chained together, but it isn't reason to think that we've found something that temporal difference learning can't deal with.
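To make the chaining argument concrete, here is a minimal sketch of a TD(0)-style value learner run through the three training stages described above. Everything specific in it is my assumption rather than anything from Pan et al.: the reward sizes, the learning rate and discount factor, the names NEXT, td_update, run_chain and simulate, and above all the final stage, which replays the learned A1-A2-A3 and B1-B2-B3 links without any new reward so that the changed values of A3 and B3 can propagate back. That replay is one way a temporal-difference scheme could do it, not a claim about what the brain or the paper's model does.

```python
GAMMA = 0.9  # discount factor (assumed value)
ALPHA = 0.2  # learning rate (assumed value)

# Learned associations: each stimulus predicts the next one in its sequence.
NEXT = {"A1": "A2", "A2": "A3", "A3": None,
        "B1": "B2", "B2": "B3", "B3": None}


def td_update(values, state, reward):
    """One TD(0) step: move V(state) toward reward + GAMMA * V(next state)."""
    nxt = NEXT[state]
    bootstrap = GAMMA * values[nxt] if nxt is not None else 0.0
    values[state] += ALPHA * (reward + bootstrap - values[state])


def run_chain(values, start, final_reward):
    """Walk one full sequence; reward arrives only at the final stimulus."""
    state = start
    while state is not None:
        reward = final_reward if NEXT[state] is None else 0.0
        td_update(values, state, reward)
        state = NEXT[state]


def simulate(episodes=200):
    values = {s: 0.0 for s in NEXT}

    # Stage 1: both sequences lead to the same reward.
    for _ in range(episodes):
        run_chain(values, "A1", 1.0)
        run_chain(values, "B1", 1.0)
    after_stage1 = dict(values)

    # Stage 2: only the final stimuli are shown; A3 now pays more than B3.
    # The reward sizes 2.0 and 0.5 are hypothetical.
    for _ in range(episodes):
        td_update(values, "A3", 2.0)
        td_update(values, "B3", 0.5)

    # Stage 3 (assumption): replay the stored links with no reward,
    # letting the new values of A3 and B3 flow back to A1 and B1.
    for _ in range(episodes):
        for state in ("A2", "A1", "B2", "B1"):
            td_update(values, state, 0.0)

    return after_stage1, values


if __name__ == "__main__":
    before, after = simulate()
    print("after stage 1:", before["A1"], before["B1"])
    print("after stage 3:", after["A1"], after["B1"])
```

In this toy version A1 and B1 are valued equally after the first stage, and A1 ends up valued above B1 once the revalued final stimuli have been propagated back along the chains. Without something like the replay stage, the stored value of A1 would not move, so how the propagation actually happens is exactly the part where the neural data could be informative.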
Or am I missing something?