Babaioff; Characterizing Truthful Multi-Armed Bandit Mechanisms
Moshe Babaioff presented an interesting paper.
Suppose that you're conducting an auction for adwords, where you want to rank the bidders based on expected revenue in order to allocate slots and determine prices for slots based on bids. But suppose you don't know what the clickthrough rate will be for the items.
In a multi-armed bandit model, there are multiple bandit slot machines and you have to decide which arms to pull. There is an explore/exploit tradeoff-- you need to explore (experiment) to estimate the clickthrough rates, including some experimentation with those you have low estimates for, in case that estimate is wrong. But over time you switch to more exploitation, where you pull the arm of the highest expected value.
The new twist in this paper is that you want advertisers to truthfully reveal their valuation for a click. If clickthrough rates are known, you can set price essentially using a second-price mechanism based on bid*clickthrough. But if you're using a multi-armed bandit algorithm to determined clickthrough rates, the correct prices would depend on estimated clickthrough rates that you don't necessarily see because you don't test them.
It's a theory paper. They prove that, with the requirement that the mechanism induce trruthful bidding, there's always a pure exploration phase, where the selection of the winners can depend on previous clickthroughs but *not* on the bids; and then a pure exploitation phase, where the clickthroughs no longer affect allocation of slots in the next round. The best multi-armed bandit algorithms without the truth-telling requirement don't have that separation of phases. And, it turns out that the best algorithms without the truth-telling requirement have less "regret" relative to the best you could do if you magically knew the clickthrough rates at the beginning.
So now I'm curious what the best algorithms are without the truth-telling requirement. My guess is that they put more exploration into things that the best estimate so far has higher value for. We actually need to use an algorithm like this for the next version of our "converation double pivots" work on drupal.org, where we're going to dynamically change the set of recommended items based on a combination of prior generated from recommender algorithms and actual observed clicks. But we don't have any truthful revelation requirement, so we should be able to use the standard algorithms.