Create datasets

This guide demonstrates how to use the observable parsing system to create a dataset for future use. With this system, you can create and use observables not only through specific classes but also from simple strings.

Start by importing the necessary modules:

import uproot
from hml.generators import Madgraph5
from hml.representations import Image
from hml.datasets import SetDataset, ImageDataset
from hml.approaches import Cut

Load generated events

As the previous guide showed, HML can handle two cases. Here, we use the Madgraph5 class to fetch runs from the output directory:

sig = Madgraph5.from_output("data/pp2wz@10k", "mg5_aMC")
bkg = Madgraph5.from_output("data/pp2jj@10k", "mg5_aMC")

sig_events = uproot.open(sig.runs[0].events()[0])
bkg_events = uproot.open(bkg.runs[0].events()[0])

Preselection

For these processes, we choose three observables: the mass and n-subjettiness ratio of the leading fat jet, and the angular distance between the leading and subleading jets. To ensure the observables can be computed without missing values, we first preselect (filter) events based on the number of fat jets and jets:

preselection = Cut("fatjet.size > 0 and jet.size > 1")
preselection.read(sig_events)
preselection.value

[True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 False,
 True,
 ...,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 False,
 False]
------------------
type: 10000 * bool

fatjet.size is the observable Size associated with the physics object fatjet. It gives the number of objects and, in a ROOT file, applies to an entire branch. The physics object corresponds to the branch name (case-insensitive).
An observable is always linked to one or more physics objects. This idea is the basis of the observable parsing system in HML, which uses the pattern <physics_object>,<another>.<observable>. Here physics_object is any branch defined in your ROOT file, and multiple objects are separated by a comma. To select a single object, put its index directly after the object name, e.g., "jet0", "muon1".
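
To make the naming pattern concrete, here is a minimal sketch of how such a string could be split into physics objects and an observable name. The parse_observable helper and its return format are hypothetical and are not HML's actual parser; the sketch only handles the simple, non-nested names used so far.

```python
import re

# Hypothetical helper, not HML's actual parser: it only illustrates the
# "<physics_object>,<another>.<observable>" naming pattern.
_OBJECT_PATTERN = re.compile(r"^([A-Za-z_]+)(\d*)$")


def parse_observable(name):
    """Split an observable string into physics objects and an observable name."""
    # Everything after the last dot is the observable; the rest names the objects.
    objects_part, observable = name.rsplit(".", 1)
    objects = []
    for obj in objects_part.split(","):
        match = _OBJECT_PATTERN.match(obj.strip())
        if match is None:
            raise ValueError(f"Cannot parse physics object: {obj!r}")
        branch, index = match.groups()
        # An empty index means the whole branch, e.g. "fatjet" in "fatjet.size".
        # Branch names are lower-cased because they are case-insensitive.
        objects.append((branch.lower(), int(index) if index else None))
    return objects, observable


print(parse_observable("fatjet.size"))
print(parse_observable("jet0,jet1.delta_r"))
```

In this sketch, an index of None stands for the entire branch, while an integer index selects a single object from it.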

To extract the observable values, use the read method. For a cut, the result is a boolean array with one entry per event, stored in the value attribute; True marks an event that satisfies the condition.
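
The boolean result of a cut can be pictured with plain Python. The sketch below is not how HML evaluates cuts internally, and the per-event counts are made-up toy numbers; it only shows the logic of the condition "fatjet.size > 0 and jet.size > 1" applied event by event.

```python
# Toy per-event object counts; the numbers are made up for illustration.
n_fatjets = [1, 0, 2, 1, 0]
n_jets = [3, 2, 1, 2, 4]

# Same logic as the cut string "fatjet.size > 0 and jet.size > 1",
# evaluated event by event.
mask = [nf > 0 and nj > 1 for nf, nj in zip(n_fatjets, n_jets)]

# Keep only the indices of events that pass the preselection.
selected = [i for i, passed in enumerate(mask) if passed]

print(mask)      # [True, False, False, True, False]
print(selected)  # [0, 3]
```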

Create a set dataset

Now, we use the 1D data container SetDataset to hold these three observables for all events:

cut = "fatjet.size > 0 and jet.size > 1"

set_ds = SetDataset(["fatjet0.mass", "fatjet0.tau21", "jet0,jet1.delta_r"])
set_ds.read(sig_events, 1, [cut])
set_ds.read(bkg_events, 0, [cut])
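
For readers unfamiliar with two of these observables, the sketch below spells out their conventional definitions in collider physics: the angular distance ΔR = sqrt(Δη² + Δφ²), with Δφ wrapped into [-π, π), and the n-subjettiness ratio τ21 = τ2 / τ1. The function names are chosen for illustration, and it is an assumption that HML's delta_r and tau21 follow exactly these conventions.

```python
import math


def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance between two objects in the (eta, phi) plane."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


def tau21(tau1, tau2):
    """N-subjettiness ratio tau2 / tau1 (assumes tau1 is non-zero)."""
    return tau2 / tau1


# Two jets on opposite sides of the phi = +/- pi boundary are close in phi.
print(delta_r(0.0, 3.0, 0.0, -3.0))
```

The wrapping matters because φ is periodic: without it, the two jets in the example would appear to be 6 radians apart instead of about 0.28.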

To check that the chosen observables can distinguish signal from background, we use show to plot their three distributions:

set_ds.show()

[Figure: observable distributions]

Before saving the dataset to disk, we use the split method to divide it into training and testing sets:

set_ds.split(0.7, 0.3, seed=42)
print(set_ds.train.samples.shape)
print(set_ds.train.targets.shape)

set_ds.save("data/wjj_vs_qcd_set.ds")

(12644, 3)
(12644,)

The split method also supports a validation set: set_ds.split(0.7, 0.2, 0.1).
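
The idea behind a seeded, ratio-based split can be sketched with the standard library. This is not HML's implementation; split_indices is a hypothetical helper that shuffles sample indices reproducibly and cuts them into consecutive parts, one per ratio, in the order the ratios are given.

```python
import random


def split_indices(n_samples, *ratios, seed=None):
    """Shuffle sample indices and cut them into one part per ratio."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("Split ratios must sum to 1")

    indices = list(range(n_samples))
    # A dedicated, seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(indices)

    parts, start = [], 0
    for ratio in ratios[:-1]:
        stop = start + round(ratio * n_samples)
        parts.append(indices[start:stop])
        start = stop
    parts.append(indices[start:])  # the last part takes the remainder
    return parts


parts = split_indices(100, 0.7, 0.2, 0.1, seed=42)
print([len(part) for part in parts])
```

Fixing the seed means the same samples land in the same part on every run, which keeps later comparisons between models fair.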

Create an image dataset

Besides a set dataset, we can also represent each event as an image and then create a dataset of these images.

image_ds = ImageDataset(
    Image(
        height="fatjet0.constituents.phi",
        width="fatjet0.constituents.eta",
        channel="fatjet0.constituents.pt",
    )
    .with_subjets("fatjet0.constituents", "kt", 0.3, 0)
    .translate(origin="SubJet0")
    .rotate(axis="SubJet1", orientation=-90)
    .pixelate(size=(33, 33), range=[(-1.6, 1.6), (-1.6, 1.6)])
)

The Image class represents an event as an image. It takes three arguments: height, width, and channel. The height and width are the observables for the y-axis and x-axis, respectively, and the channel is the observable for the pixel intensity.
The with_subjets method adds subjets to the image; the later steps refer to them as SubJet0 and SubJet1.
The translate method moves the origin of the image to the given subjet (here SubJet0).
The rotate method rotates the image, here using SubJet1 as the axis.
The pixelate method bins the points into pixels (here 33 by 33 over the given range). This step makes the points discrete and produces the actual image.
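
The three preprocessing steps can be sketched in plain Python on a toy event. This is an illustration under stated assumptions, not HML's implementation: the helper functions are hypothetical, points are (eta, phi) pairs, rotate is read as "turn the image so that the axis point lies at the given angle", phi is not wrapped, and the constituent and subjet values are made up.

```python
import math


def translate(points, origin):
    """Shift (eta, phi) points so that `origin` sits at (0, 0)."""
    eta0, phi0 = origin
    return [(eta - eta0, phi - phi0) for eta, phi in points]


def rotate(points, axis, orientation_deg):
    """Rotate points about (0, 0) so that `axis` lies at `orientation_deg`."""
    current = math.atan2(axis[1], axis[0])
    angle = math.radians(orientation_deg) - current
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in points]


def pixelate(points, weights, size, value_range):
    """Sum weights into a grid of size[0] columns by size[1] rows."""
    (x_min, x_max), (y_min, y_max) = value_range
    n_x, n_y = size
    image = [[0.0] * n_x for _ in range(n_y)]
    for (x, y), w in zip(points, weights):
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # points outside the range are dropped
        col = int((x - x_min) / (x_max - x_min) * n_x)
        row = int((y - y_min) / (y_max - y_min) * n_y)
        image[row][col] += w
    return image


# Toy constituents as (eta, phi) with their pt; all values are made up.
constituents = [(0.5, 1.0), (0.9, 1.0), (0.5, 1.2)]
pts = [50.0, 30.0, 20.0]
subjet0, subjet1 = (0.5, 1.0), (0.9, 1.0)  # assumed leading and subleading subjets

shifted = translate(constituents, origin=subjet0)
axis = translate([subjet1], origin=subjet0)[0]
rotated = rotate(shifted, axis=axis, orientation_deg=-90)
image = pixelate(rotated, pts, size=(33, 33), value_range=[(-1.6, 1.6), (-1.6, 1.6)])

print(image[16][16], image[12][16], image[16][18])
```

After these steps the leading subjet sits in the central pixel and the subleading subjet lies directly below it on the phi axis, so images from different events share a common alignment.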

To create the image dataset, we still use the read method:

image_ds.read(sig_events, 1, [cut])
image_ds.read(bkg_events, 0, [cut])

#--------------------------------------------------------------------------
#                         FastJet release 3.4.1
#                 M. Cacciari, G.P. Salam and G. Soyez                  
#     A software package for jet finding and analysis at colliders      
#                           http://fastjet.fr                           
#                                                                         
# Please cite EPJC72(2012)1896 [arXiv:1111.6097] if you use this package
# for scientific work and optionally PLB641(2006)57 [hep-ph/0512210].   
#                                                                       
# FastJet is provided without warranty under the GNU GPL v2 or higher.  
# It uses T. Chan's closest pair algorithm, S. Fortune's Voronoi code,
# CGAL and 3rd party plugin jet algorithms. See COPYING file for details.
#--------------------------------------------------------------------------

To visualize the images, there is also a show method:

image_ds.show(norm="log", target=0, show_pixels=True)
image_ds.show(norm="log", target=1, show_pixels=True)

[Figures: background image, signal image]

Finally, we split and save the dataset:

image_ds.split(0.7, 0.3, seed=42)
image_ds.save("data/wjj_vs_qcd_image.ds")

See the documentation to learn more about observables, representations, and datasets.