A Jet (Lag) Induced Exercise in Binomial Probability
The mathematical thrill in particle physics of correcting your Monte Carlo simulation efficiencies to match what’s actually measured in data is rarely documented in popular science media, probably with good reason. But as I’m awake at 5am with jet lag I’ve made it my mission to write something to help insomniacs everywhere get back to sleep.
Imagine you have an algorithm to identify a physics process in your detector. For sake of argument let’s say this algorithm distinguishes jets containing b hadrons from those which don’t. Your simulation suggests that the probability to successfully identify a so-called b-jet is 50%. Now let’s suppose this simulation is a sample of 1000 events, each containing exactly 3 b-jets. In how many of those events would you expect your algorithm to identify all three jets? Only two of the three? One of the three? None of them? Fortunately this is a simple exercise in binomial probability with N=3 and p=0.5; using the binomial formula (a fun exercise for the reader) we calculate:
- Events with 3 b-jets identified = 125
- Events with 2 b-jets identified = 375
- Events with 1 b-jet identified = 375
- Events with no b-jets identified = 125
Nice and symmetric. I probably want to measure a quantity something like: events with at least one b-jet identified. The simulation tells me I expect to see 875 such events in data. If alternatively my measurement was counting the number of b-jets identified then the prediction of the simulation is (3*125)+(2*375)+(1*375)+(0*125)=1500 b-jets. Not surprising for an algorithm efficiency of 50% in a simulation of 3000 b-jets!
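For anyone who would rather let a computer do the fun exercise, here is a minimal Python sketch of the counts above. The names (`expected_events`, `N_EVENTS` and so on) are made up for illustration, and it assumes Python 3.8 or later for `math.comb`.

```python
from math import comb

N_EVENTS = 1000  # simulated events
N_JETS = 3       # b-jets per event
P_SIM = 0.5      # b-jet identification efficiency in the simulation


def expected_events(k, n=N_JETS, p=P_SIM, n_events=N_EVENTS):
    """Expected number of events with exactly k of the n b-jets identified."""
    return n_events * comb(n, k) * p**k * (1 - p) ** (n - k)


# Map "number of b-jets identified" -> expected number of events
counts = {k: expected_events(k) for k in range(N_JETS + 1)}

# Events with at least one b-jet identified
at_least_one = sum(c for k, c in counts.items() if k >= 1)

# Total number of identified b-jets across the whole sample
tagged_jets = sum(k * c for k, c in counts.items())

print(counts)
print(at_least_one, tagged_jets)
```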
But then a complication. Someone using a pure sample of b-jets in data from top quark decays has measured the actual efficiency of the algorithm to identify b-jets in data to be 60+/-10%, not the same as the simulation: the prediction is wrong. So what we now need is a simple way to correct the simulation so that it correctly predicts the number of identified b-jets while keeping other properties of the simulation the same: 1000 events and 3000 b-jets in total.
Fortunately binomial theory makes this simple. We want to correct the efficiency from p=0.5 to p=0.6+/-0.1 (and the corresponding inefficiency from q=0.5 to q=(1-p)=0.4-/+0.1). Note that the change in the inefficiency is completely anti-correlated with the change in the efficiency, to keep the total probability equal to one. For every p in the binomial formula we need to multiply by a factor of 0.6/0.5=1.2+/-0.2 and for every q by a factor of 0.4/0.5=0.8-/+0.2, so we get an event weight depending on the number of b-jets identified in the original simulation:
- Weight for events with 3 b-jets identified = 1.2*1.2*1.2 = 1.728
- Weight for events with 2 b-jets identified = 1.2*1.2*0.8 = 1.152
- Weight for events with 1 b-jet identified = 1.2*0.8*0.8 = 0.768
- Weight for events with no b-jets identified = 0.8*0.8*0.8 = 0.512
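The same weights can be written as a tiny function: one factor per identified jet and one factor per missed jet. This is only a sketch with illustrative names (`sf_tag`, `sf_miss`, `event_weight`), using the central values and ignoring the uncertainty.

```python
P_SIM = 0.5   # efficiency in the simulation
P_DATA = 0.6  # efficiency measured in data (central value only)
N_JETS = 3    # b-jets per event

# Correction factors for an identified jet and for a missed jet
sf_tag = P_DATA / P_SIM               # about 1.2
sf_miss = (1 - P_DATA) / (1 - P_SIM)  # about 0.8


def event_weight(k, n=N_JETS):
    """Weight for a simulated event with k of its n b-jets identified."""
    return sf_tag**k * sf_miss ** (n - k)


for k in range(N_JETS, -1, -1):
    print(k, round(event_weight(k), 3))
```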
Which gives us a new prediction for what we expect in data:
- Events with 3 b-jets identified = 125*1.728 = 216
- Events with 2 b-jets identified = 375*1.152 = 432
- Events with 1 b-jet identified = 375*0.768 = 288
- Events with no b-jets identified = 125*0.512 = 64
Still 1000 events, each with 3 b-jets! You can also cross-check that these are exactly the numbers expected from the binomial formula with N=3 and p=0.6. We now revise our simulation-based prediction for events with at least one b-jet identified to 936 and for counting the number of b-jets identified to 1800 – or 0.6*3000 as we would expect, the system works 🙂
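If you don't fancy doing the cross-check by hand, the sketch below reweights the p=0.5 simulation and compares it with the binomial formula evaluated directly at p=0.6. Function and variable names are again just illustrative.

```python
from math import comb, isclose

N_EVENTS, N_JETS = 1000, 3
P_SIM, P_DATA = 0.5, 0.6


def binomial_counts(p, n=N_JETS, n_events=N_EVENTS):
    """Expected events with k = 0..n identified b-jets, as a list indexed by k."""
    return [n_events * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


# Original simulation and the per-event weights, both indexed by k
sim = binomial_counts(P_SIM)
weights = [
    (P_DATA / P_SIM) ** k * ((1 - P_DATA) / (1 - P_SIM)) ** (N_JETS - k)
    for k in range(N_JETS + 1)
]

# Reweighted simulation versus the binomial formula at the data efficiency
reweighted = [c * w for c, w in zip(sim, weights)]
direct = binomial_counts(P_DATA)
matches = all(isclose(r, d) for r, d in zip(reweighted, direct))

total_events = sum(reweighted)
at_least_one = sum(reweighted[1:])
tagged_jets = sum(k * c for k, c in enumerate(reweighted))

print(matches)
print([round(c) for c in reweighted])
print(round(total_events), round(at_least_one), round(tagged_jets))
```

The agreement is no accident: multiplying p^k q^(3-k) by (p'/p)^k (q'/q)^(3-k) simply swaps in the new efficiency term by term, while the binomial coefficient is untouched.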
Very eager readers are welcome to propagate the uncertainty on the efficiency correction too… Which coincidentally is exactly what I’ll be doing today with the ATLAS simulation and b-tagging algorithms now that it’s 7am in Berkeley and a more reasonable time to go to work.
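For the very eager, here is one rough way to start on that propagation. The big assumption (mine, for illustration) is that the +/-0.1 is treated as a single shift applied coherently to every jet, so the efficiency is simply moved to 0.5 and 0.7 and the predictions recomputed. Because the reweighted simulation reproduces the binomial formula at the new efficiency, the sketch evaluates that formula directly. This is not a description of what the real ATLAS tools do.

```python
N_EVENTS, N_JETS = 1000, 3
P_NOMINAL = 0.6  # measured efficiency, central value
P_UNC = 0.1      # its uncertainty, treated here as one coherent shift (assumption)


def predictions(p):
    """Return (events with at least one b-jet identified, total identified b-jets)."""
    at_least_one = N_EVENTS * (1 - (1 - p) ** N_JETS)
    tagged_jets = N_EVENTS * N_JETS * p
    return at_least_one, tagged_jets


results = {
    label: predictions(p)
    for label, p in [
        ("down", P_NOMINAL - P_UNC),
        ("nominal", P_NOMINAL),
        ("up", P_NOMINAL + P_UNC),
    ]
}

for label, (at_least_one, tagged_jets) in results.items():
    print(f"{label:8s} at least one: {at_least_one:6.1f}   identified b-jets: {tagged_jets:6.1f}")
```

Under that assumption the jet count shifts symmetrically (1800 +/- 300) while the at-least-one count does not (roughly 875 to 973 around 936), because the latter depends on the efficiency through (1-p)^3 rather than linearly.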