Controlling the background

Overview

Teaching: 10 min
Exercises: 50 min

Questions

What is the background to the search?

How can we model the background processes affecting our measurement?

Do we have common resources to estimate the background?

How can we test our background modeling technique?

How and why should we enrich a particular background process when defining control regions?

Objectives

Identify background processes affecting current preselection.

Understand performance and limitations of background simulations.

Develop control regions in which one can estimate individual backgroud processes.

Understand the concept of blinding, and have a first look into data in CRs.

Recording files of this session are in cernbox

Backgrounds to the Analysis

We begin understanding our background by considering the final state of the signal. The excited b decays to a top quark and a W boson, where the decays of both are to several jets. bstarFeynman There are are a few Standard Model processes that can lead to a top+W-like experimental signature: – Multi-jet QCD: The nature of the LHC means hadronic background is given for any analysis that does not have a lepton or missing energy to flag. The reason we get large masses in W/t-candidates is that jets cluster any PF candidate within the R-parameter, and with enough momentum and/or distance between PF candidates these QCD events can create a large-mass jet. – W+Jets: This process is when a single W-boson is produced in association to QCD jets. Since the preselection identifies the true W, the process become a background to our signal through its associated jets. Thus, we can include the process as part of the ‘QCD’ (stochastic jets) background contibution. – ttbar: The simplest way this process can enter our signal region is a fully-hadronic decay, where one of the b-quarks is either soft or not identified with our b-tagging. It can also happen that the top has a large decay opening-angle and the W decay is captured in a large-R jet separate from the associated b-quark. – singletop: The Standard Model can have a pair of protons produce a single top in association to a (usually heavy) quark or W-boson. In our case the top produced with a W becomes an irreducible background, however less dominant than the above processes.

Control Regions

Just as signal regions are defined by selections meant to enrich a measurement in signal while eliminating as much background as possible, a control regions are defined by selections meanto to enrich a measurment with a particular background and rejecting as much signal as possible. The idea behind control regions are to provide a kinematic region which is orthogonal but ‘close’ to the signal region, where a particular background is dominant; the MC is then adjusted to match the data in that region and then an extrapolation is done to approximate the MC scalings needed in the signal region to appropriately model the data. The extrapolation could be a single ‘Scale Factor’, or a more sophisticated ‘Transfer Function’

Question: Can you come up with a rough idea for Control Region for any of our background?

Solution

ttbar is arguably the easiest to define a control region for: require having two top-tags. singletop is an irrducible background, it truly has a t and W in the final state. To develop a CR for it one would have to find fine differences in production kinematics, however its sub-dominance to the other BGs makes having a dedicated CR not time-efficient. Multi-Jet Production: The process has a falling distribution across m_top and m_tW. The MC is not well-modeled. The best way to model this is to fit a polynomial to the data that extends in both axes. We can measure all parts against data everywhere but the signal region, where we would extrapolate from our fit. W+Jets: Since the signal region is defined by a top-tag, the background in m_top-m_tW space behaves like multi-jet production. We could fit this background simultaneously to the QCD.

The figure below communicates how the various signal and background regions relate to each other though selections, and how (this particular) transfer function is defined from ratios of pass/fails in the regions. DataDrivenCartoon

Question: What is the variable represented here by the x-axis? What do the regions A/B/C represent, and what the are selections dividing them? What does ‘pass’ and ‘fail’ refer to?

Solution

We can start with the pass/fail, which refers to the top tag. The x-axis is m_top, meaning the regions A/B/C are different mass ranges: A is low-mass, B is the top-mass window, and C is high-mass.

Need for Data Driven Background Estimation

QCD modeling is notoriusly hard to model. One strategy to modeling processes that are not well simulated is to remove the effect of well-modeled processes from data, then creating a shape template on which to match the poor-modeled background. This approach to estimating the background is called data-driven estimation.

Question: What background processes to our signal would be good candidates for data-driven estimation?

Solution

Multi-jet QCD is almost always estimated using a data-driven technique. In addition, the W+Jets process has a QCD-formed top candidates. ttbar and singletop are relatively well-modeled, however the normalization of their distributions are allowed to float in the final parameter fit.

Validation Regions

When we obtain either a scale factor or transfer function to extrapolate from the Control Region to the Signal Region, it is often useful to have a Validation Region that lies somewhere in between these two regions where one can apply the correction to the MC and check against data to ensure good agreement. In the framework we are walking you through, the fit uses all available (non-blinded) regions to simultaneously fit the transfer function parameters, leaing no space ‘between’ the SRs and CRs to define a VR.

Blinding

One thing to keep in mind when using data before settling on final signal and control region selections is the possibility of biasing one’s methods by being exposed to the data. For example, one could imagine a situation where one is ‘testing’ methods in a signal region and then observes a ‘problem’ (lack of events, too many events, etc) which guides the analyzer to biased decision-making. A similar problem can occur if a control or validation region is chosen that has significant signal contamination. To avoid these problems, we ‘blind’ our data whenever possible. That is, we do not look at any data in the SRs, when appropriate we could also deicde to remove data if it passes a significance threshold, and ensuring the control/validation regions have little signal contamination.

Key Points

The backgrounds involved are processes producing jets with large pT that look like t+W.

All-hadronic ttbar can look like t+W if one of tops has a very soft (or untagged) b-jet. MC is reasonably modeled

Singletop production is sub-dominant, but irreducible. MC is reasonably modeled

LHC is a hadron collider, multi-jet production (QCD) is almost always a background in all hadronic analyses.

W/Z+Jets behaves a lot like multi-jet production since the top-tag it passes to make it into out signal selection comes from combinatoric combination of jets.

Control regions are used to study specific backgrounds in a kinematic region orthogonal to the signal regions; we test our background estimation in CRs and apply corrections needed there to the SR.

Validation regions are designed to lie orthgonal but between CRs and SRs, perhaps with lower target BG purity, to test the corrections extracted from the CR that will be applied to SRs.

Blinding is the convention of not looking at data in the SR (or any signal-enriched selection).

previous episode

CMSDAS CERN 2020: B2G Long Exercise

next episode

Controlling the background

Overview

Backgrounds to the Analysis

Control Regions

Question: Can you come up with a rough idea for Control Region for any of our background?

Solution

Question: What is the variable represented here by the x-axis? What do the regions A/B/C represent, and what the are selections dividing them? What does ‘pass’ and ‘fail’ refer to?

Solution

Need for Data Driven Background Estimation

Question: What background processes to our signal would be good candidates for data-driven estimation?

Solution

Validation Regions

Blinding

Key Points

previous episode

next episode