Optimising the analysis
Overview
Teaching: 30 min
Exercises: 60 min
Questions
What aspects of our analysis do we want to optimize?
How can we quantify selections to help decide SR/CR definitions?
Do some variables affect signal and BG differently/similarly?
Are there any correlated variables?
What final selections are going to be applied to the analysis?
Objectives
Identify optimizable parts of the analysis.
Use Punzi significance and other measures to optimise selections.
Obtain a close-to-optimal selection to define SR.
Recordings of this session are available on CERNBox.
Setup Ahead of Session
To help with debugging more efficiently, we are going to use tmate for remote access to your terminal. Once you start it (by running ‘tmate’), it opens a session of your terminal that can be viewed and edited online. The linked website shows what that looks like. If you feel comfortable letting the facilitators use this with you, follow the steps below to install tmate.
To install tmate you will need Homebrew (a package manager). To check whether you have Homebrew installed, run
brew help
If you get some output, you’re set. Otherwise you will get an error saying the ‘brew’ command doesn’t exist. In that case, install Homebrew with
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
Once it is installed, use Homebrew to install tmate with
brew install tmate
In addition, pull all the latest changes from the repositories:
cd ~/CMSVDAS2020/b2g-long-exercise/
git fetch --all
git pull origin gh-pages
cd ../CMSSW_11_0_1/src/
rm -rf timber-env
cmsenv
virtualenv timber-env
source timber-env/bin/activate
cd TIMBER/
git fetch --all
git checkout cmsdas_dev
python setup.py install
source setup.sh
cd ../BstarToTW_CMSDAS2020
git fetch --all
git pull origin master
Wrapping-Up From Preselection Episode
Last session, we talked about how a preselection is useful to cut down the size of the ntuples produced/read. Generally your preselection should include the union of your signal and control regions, cutting out unnecessary data. Later on, we will make further selections to optimize our signal regions and estimate the background in control regions. For this stage, using the plots we were able to make from BstarToTW_CMSDAS2020/examples/ex4.py, let’s decide on a preselection.
Question: What should our preselection be?
Discuss for the next ~15 minutes what the preselection for our analysis should be. Feel free to use your plots as evidence supporting your argument. Think about what the preselection is supposed to be cutting on (e.g., remember to leave space for estimating the BG).
Solution
Taking a Step Back: Analysis Strategy
With an idea of how we want to make our preselection, let’s take a moment to think ahead to how we want to organize our analysis.
Question: How would we expect to see our signal? Is there a specific, discriminating variable we should be using in our analysis to find our signal?
Think about what kind of signal we are looking for. Is it a resonance or not? What
Solution
Now that we know we will want to use mtW, we need to make a rough decision of how we are going to select for signal and estimate the background.
Question: What is our general signal-selection/background-estimation strategy going to be?
Think about what backgrounds we have, how we will estimate those, and how we can make the best selection for a signal region.
Solution
Optimizing: But How?
When deciding on the rough preselection cuts, you may already have asked yourself, ‘How do I make the best cuts on the variables available to me?’ A closely related question is, ‘How would I define what is best?’ There are a few ways to answer these questions, but first we must decide how we are going to define ‘optimal’ cuts.
Question: How will we define optimal?
What objective measure will we use to help us define an optimal selection?
Solution
Going forward with this exercise, we will use the ‘S/√B’ approximation for significance to guide our decisions.
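As a rough illustration of the two figures of merit named in this episode, the sketch below computes S/√B and the Punzi figure of merit, ε/(a/2 + √B), where ε is the signal efficiency and a is the target number of sigmas. The function names and the yields are hypothetical and are not taken from the exercise code or from real samples.

```python
import math


def s_over_sqrt_b(s, b):
    """Approximate significance S/sqrt(B); returns 0.0 when B is not positive."""
    return s / math.sqrt(b) if b > 0 else 0.0


def punzi(sig_eff, b, n_sigma=2.0):
    """Punzi figure of merit: sig_eff / (n_sigma/2 + sqrt(B))."""
    return sig_eff / (n_sigma / 2.0 + math.sqrt(b))


# Hypothetical yields for two candidate selections (illustration only).
loose = {"s": 50.0, "b": 400.0, "eff": 0.50}
tight = {"s": 30.0, "b": 100.0, "eff": 0.30}

for name, sel in (("loose", loose), ("tight", tight)):
    print(name, s_over_sqrt_b(sel["s"], sel["b"]), punzi(sel["eff"], sel["b"]))
```

With these made-up numbers the tighter selection scores higher on both measures even though it keeps less signal. Unlike S/√B, the Punzi measure uses the signal efficiency rather than the signal yield, so it does not depend on the assumed signal cross section.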
Tightening Selections
Now that we understand the minimal selection we want to apply to our signal and background, we need to think harder about the final (tighter) selections that will define our signal and control regions.
Question: What are some parts of the analysis you think could be optimized?
In addition to the preselection, we can make selections on the top quark and W boson candidates to ensure an enriched signal region.
Solution
‘N minus 1’ Plots
One powerful analysis tool for optimization is the ‘N minus 1’ plot. Given a series of N selections, each N-1 plot omits one selection at a time and shows the distribution of the variable that selection acts on, with the other N-1 selections applied. N-1 plots can help us understand the impact of tightening the cut on that variable. Normally, we ‘tighten’ selections and want to know how our significance estimate changes as a function of ‘tightening’. The direction of ‘tight’ depends on the observable at hand, for example toward 0 for τ32 and toward infinity for jet pT.
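The N-1 idea can be sketched in a few lines of plain Python. This is only a toy, not the implementation in exercises/nminus1.py: the events, variable names, and thresholds below are invented for illustration.

```python
# Toy events; the variable names and values are made up for illustration.
events = [
    {"tau32": 0.40, "jet_pt": 520.0, "mass": 172.0},
    {"tau32": 0.80, "jet_pt": 610.0, "mass": 168.0},
    {"tau32": 0.30, "jet_pt": 350.0, "mass": 175.0},
    {"tau32": 0.50, "jet_pt": 700.0, "mass": 90.0},
    {"tau32": 0.55, "jet_pt": 450.0, "mass": 180.0},
]

# One predicate per variable; the thresholds are placeholders, not tuned values.
cuts = {
    "tau32": lambda ev: ev["tau32"] < 0.65,
    "jet_pt": lambda ev: ev["jet_pt"] > 400.0,
    "mass": lambda ev: 105.0 < ev["mass"] < 220.0,
}


def n_minus_1(events, cuts, omitted):
    """Values of `omitted` for events passing every cut except the one on `omitted`."""
    others = [cut for name, cut in cuts.items() if name != omitted]
    return [ev[omitted] for ev in events if all(cut(ev) for cut in others)]


for name in cuts:
    print(name, n_minus_1(events, cuts, name))
```

Each returned list is what would be histogrammed in the corresponding N-1 plot: the full, uncut range of one variable for events that already satisfy all the other selections.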
Question: How would you define tight for a mass peak?
Solution
Give it a shot yourself: type the following into your terminal from the BstarToTW_CMSDAS2020 directory.
python exercises/nminus1.py -y 16 --select
This should create some example plots for the selections used by the B2G-19-003 analysis team. Change the selections applied to what you have decided upon today to check out the impact of your cuts. Is your selection optimal? Use the remaining time to produce and look into these plots to come up with a signal region selection.
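One way to judge whether a cut is close to optimal is to scan the threshold over the N-1 distribution and look for the peak of the significance curve. The sketch below does this for a τ32-like variable, where ‘tight’ means toward 0. It assumes unweighted toy values; scan_upper_cut and all numbers are hypothetical and are not part of exercises/nminus1.py. In a real analysis the counts would be weighted yields.

```python
import math


def scan_upper_cut(sig_values, bkg_values, thresholds):
    """For each threshold t, keep events with value < t; return (t, S, B, S/sqrt(B))."""
    results = []
    for t in thresholds:
        s = sum(1 for v in sig_values if v < t)
        b = sum(1 for v in bkg_values if v < t)
        fom = s / math.sqrt(b) if b > 0 else 0.0
        results.append((t, s, b, fom))
    return results


# Made-up N-1 values of a tau32-like variable for signal and background.
sig = [0.2, 0.3, 0.35, 0.4, 0.5, 0.7]
bkg = [0.3, 0.45, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]
thresholds = [0.4, 0.5, 0.6, 0.7, 0.8]

scan = scan_upper_cut(sig, bkg, thresholds)
best = max(scan, key=lambda row: row[3])  # row with the largest S/sqrt(B)

for t, s, b, fom in scan:
    print(f"cut < {t}: S={s} B={b} S/sqrt(B)={fom:.3f}")
print("best threshold:", best[0])
```

For a variable where ‘tight’ points toward infinity, such as jet pT, the comparison would be flipped to keep events above the threshold. With very few background events the estimate becomes unstable, so a peak at the edge of the scan deserves a second look.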
Key Points
Preselection alone is not enough to be sensitive to signal; we need to tighten the selection to increase the significance of signal over background.
The traditional way of optimizing selections is to apply N-1 (all-but-one) cuts and find the peak of the significance curve in the removed cut.
Boosted Decision Trees and other multivariate optimization techniques are also widely used.