eScholarship
Open Access Publications from the University of California

Artificial Intelligence for the Electron Ion Collider (AI4EIC)

(2024)

The Electron-Ion Collider (EIC), a state-of-the-art facility for studying the strong force, is expected to begin commissioning its first experiments in 2028. This is an opportune time to build artificial intelligence (AI) into the facility from the start, across all phases leading up to the experiments. The second annual workshop organized by the AI4EIC working group, which recently took place, centered on exploring all current and prospective application areas of AI for the EIC. The workshop not only benefits the EIC but also provides valuable insights for the newly established ePIC collaboration at the EIC. This paper summarizes the activities and R&D projects covered across the workshop sessions and provides an overview of the goals, approaches, and strategies regarding AI/ML in the EIC community, as well as cutting-edge techniques currently being studied in other experiments.

Real‐time XFEL data analysis at SLAC and NERSC: A trial run of nascent exascale experimental data analysis

(2024)

X-ray scattering experiments using Free Electron Lasers (XFELs) are a powerful tool to determine the molecular structure and function of unknown samples (such as COVID-19 viral proteins). XFEL experiments challenge computing in two ways: i) due to the high cost of running XFELs, a fast turnaround time from data acquisition to data analysis is essential to make informed decisions on experimental protocols; ii) data collection rates are growing exponentially, requiring new scalable algorithms. Here we report our experiences analyzing data from two experiments at the Linac Coherent Light Source (LCLS) during September 2020. Raw data were analyzed on NERSC's Cori XC40 system using the Superfacility paradigm: our workflow automatically moves raw data between LCLS and NERSC, where it is analyzed using the software package CCTBX. We achieved real-time data analysis, with a turnaround time from data acquisition to full molecular reconstruction of as little as 10 min -- fast enough for the experiment's operators to make informed decisions. By hosting the data analysis on Cori and automating LCLS-NERSC interoperability, we achieved a data analysis rate that matches the data acquisition rate. Completing data analysis within 10 min is a first for XFEL experiments and an important milestone if we are to keep up with data collection trends.
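
The automated move-and-analyze workflow described above can be sketched as a simple polling loop. Function names here are hypothetical; the production workflow uses site-specific transfer tooling and launches CCTBX jobs rather than a Python callback.

```python
# Minimal sketch of the automated LCLS -> NERSC dispatch loop: detect raw runs
# that have landed at the computing site and hand each to the analysis exactly
# once. `launch` stands in for submitting a CCTBX analysis job.
def new_runs(arrived, processed):
    """Runs that have been transferred but not yet analyzed."""
    return [r for r in arrived if r not in processed]

def dispatch(arrived, processed, launch):
    """One polling cycle: launch analysis for each newly arrived run."""
    for run in new_runs(arrived, processed):
        launch(run)          # e.g. submit a CCTBX job on the HPC system
        processed.add(run)
    return processed
```

Tracking processed runs in a set makes the loop idempotent, so repeated polling cycles never re-analyze a run even if the transfer directory keeps growing.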

Comparison of point cloud and image-based models for calorimeter fast simulation

(2024)

Score-based generative models are a new class of generative models that have been shown to accurately generate high-dimensional calorimeter datasets. Recent advances in generative models have used images with 3D voxels to represent and model complex calorimeter showers. Point clouds, however, are likely a more natural representation of calorimeter showers, particularly in calorimeters with high granularity. Point clouds preserve all of the information of the original simulation, deal more naturally with sparse datasets, and can be implemented with more compact models and data files. In this work, two state-of-the-art score-based models are trained on the same set of calorimeter simulations and directly compared.
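
The compactness argument above can be made concrete with a back-of-the-envelope count: a dense voxel image stores a value for every cell, while a point cloud stores only the cells that were actually hit. The grid size and hit count below are made up for illustration.

```python
# Storage comparison for a sparse calorimeter shower: dense voxel grid vs.
# point cloud. Real showers in high-granularity calorimeters typically hit
# only a small fraction of the cells, which is what the point cloud exploits.
def voxel_values(grid_side: int) -> int:
    """Stored values for a dense 3D voxel image: one energy per cell."""
    return grid_side ** 3

def point_cloud_values(n_hits: int) -> int:
    """Stored values for a point cloud: (x, y, z, energy) per hit."""
    return 4 * n_hits
```

For a hypothetical 30x30x30 grid with 500 hits, the voxel image stores 27,000 values while the point cloud stores 2,000, in line with the abstract's claim of more compact models and data files.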

Improving generative model-based unfolding with Schrödinger bridges

(2024)

Machine learning-based unfolding has enabled unbinned and high-dimensional differential cross section measurements. Two main approaches have emerged in this research area: one based on discriminative models and one based on generative models. The main advantage of discriminative models is that they learn a small correction to a starting simulation, while generative models scale better to regions of phase space with little data. We propose to use Schrödinger bridges and diffusion models to create sbunfold, an unfolding approach that combines the strengths of both discriminative and generative models. The key feature of sbunfold is that its generative model maps one set of events into another without having to pass through a known probability density, as is the case for normalizing flows and standard diffusion models. We show that sbunfold achieves excellent performance compared to state-of-the-art methods on a synthetic Z+jets dataset.
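
The event-to-event mapping that distinguishes this approach from prior-based diffusion can be illustrated with a one-dimensional Brownian bridge, which is pinned to both endpoints rather than starting from a Gaussian prior. This is a toy sketch of the bridge concept, not the paper's sbunfold model.

```python
import random

def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=random):
    """Sample a Brownian bridge at time t in [0, 1].

    The bridge is pinned at x0 (t=0) and x1 (t=1): the variance vanishes at
    both endpoints, so the path connects the two samples directly instead of
    routing through a known prior density.
    """
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * (t * (1.0 - t)) ** 0.5  # zero at t=0 and t=1
    return rng.gauss(mean, std)
```

Conceptually, x0 plays the role of a detector-level event and x1 a particle-level event; a Schrödinger bridge learns the stochastic process connecting the two distributions rather than a fixed interpolation like this one.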

Multi-differential Jet Substructure Measurement in High Q² ep collisions with HERA-II Data

(2024)

The radiation pattern within quark- and gluon-initiated jets (jet substructure) is used extensively as a precision probe of the strong force, as well as for optimizing event generators for nearly all tasks in high energy particle and nuclear physics. A detailed study of modern jet substructure observables, the jet angularities, in electron-proton collisions is presented using data recorded with the H1 detector at HERA. The measurement is unbinned and multi-dimensional, using machine learning to correct for detector effects. Training these networks was enabled by the use of a large number of GPUs in the Perlmutter supercomputer at Berkeley Lab. The particle jets are reconstructed in the laboratory frame using the kT jet clustering algorithm. Results are reported at high momentum transfer Q² > 150 GeV² and inelasticity 0.2 < y < 0.7. The analysis is also performed in sub-regions of Q², thus probing the scale dependence of the substructure variables. The data are compared with a variety of predictions and point towards possible improvements of such models.
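
The kT clustering mentioned above is driven by a standard pairwise distance measure; a minimal sketch of that measure follows. Production analyses use the FastJet library rather than hand-rolled code like this.

```python
import math

def kt_distance(pt_i, y_i, phi_i, pt_j, y_j, phi_j, R=1.0):
    """Pairwise kT distance: d_ij = min(pT_i, pT_j)^2 * dR_ij^2 / R^2,
    where dR^2 combines the rapidity and (wrapped) azimuthal separations."""
    dphi = abs(phi_i - phi_j)
    dphi = min(dphi, 2 * math.pi - dphi)  # wrap azimuth into [0, pi]
    dR2 = (y_i - y_j) ** 2 + dphi ** 2
    return min(pt_i, pt_j) ** 2 * dR2 / R ** 2

def beam_distance(pt_i):
    """Particle-beam distance d_iB = pT_i^2; when this is the smallest
    distance, the (pseudo)particle is declared a jet."""
    return pt_i ** 2
```

At each step the algorithm merges the pair with the smallest d_ij, or promotes a pseudo-jet to a final jet when d_iB is smallest, so soft particles cluster first.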

Discovering High Entropy Alloy Electrocatalysts in Vast Composition Spaces with Multiobjective Optimization.

(2024)

High entropy alloys (HEAs) are a highly promising class of materials for electrocatalysis, as their unique active site distributions break the scaling relations that limit the activity of conventional transition metal catalysts. Existing Bayesian optimization (BO)-based virtual screening approaches focus on catalytic activity as the sole objective and correspondingly tend to identify promising materials that are unlikely to be entropically stabilized. Here, we overcome this limitation with a multiobjective BO framework for HEAs that simultaneously targets activity, cost-effectiveness, and entropic stabilization. With diversity-guided batch selection further boosting its data efficiency, the framework readily identifies numerous promising candidates for the oxygen reduction reaction that strike a balance between all three objectives in hitherto uncharted HEA design spaces comprising up to 10 elements.
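
At its core, balancing several objectives means keeping only candidates that are not Pareto-dominated. The sketch below shows that selection step in isolation; the objective tuples are hypothetical, and the paper's framework embeds this inside Bayesian optimization with diversity-guided batch selection.

```python
# Pareto-front filter over candidates scored on multiple objectives
# (e.g. activity, cost-effectiveness, entropic stabilization), with
# higher values taken as better in every objective.
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly
    better in at least one objective."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Candidates not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]
```

A single-objective screen would keep only the top scorer on activity; the Pareto filter instead retains every trade-off that no other candidate strictly improves upon.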

What Deploying MFA Taught Us About Changing Infrastructure

(2019)

NERSC is not the first organization to implement multi-factor authentication (MFA) for its users. We had seen multiple talks by other supercomputing facilities that had deployed MFA, but as we planned and deployed our own implementation, we found that nobody had talked about the more interesting and difficult challenges, which were largely social rather than technical. Our MFA deployment was a success, but, more importantly, much of what we learned could apply to any infrastructure change. Additionally, we developed the sshproxy service, a key piece of infrastructure technology that lessens user and staff burden and has made our MFA implementation more amenable to scientific workflows. We found great value in using robust open-source components where we could and developing tailored solutions where necessary.
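
The abstract does not spell out the token mechanism, but time-based one-time passwords (TOTP, RFC 6238) are the usual building block for MFA rollouts of this kind; a self-contained sketch of TOTP code generation from the standard follows.

```python
import base64, hashlib, hmac, struct, time

def totp(secret_b32, t=None, step=30, digits=6):
    """RFC 6238 TOTP: HMAC-SHA1 over the current 30-second time step,
    truncated to a short numeric code the user types as a second factor."""
    key = base64.b32decode(secret_b32)
    counter = int((time.time() if t is None else t) // step)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation per RFC 4226
    code = (struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF)
    return str(code % 10 ** digits).zfill(digits)
```

Because the code depends only on a shared secret and the clock, the server can verify it without any network round trip to the client, which is part of what makes this scheme workable for SSH-based scientific workflows.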

Pin-pointing Node Failures in HPC Systems

(2017)

Efficient automated fault prediction and diagnosis are essential for HPC system resilience. At the scale required for exascale systems, accurate fault prediction that enables quick remediation is hard, and with changing supercomputer architectures, distilling fault data from noisy raw logs requires substantial effort. Predicting node failures from such voluminous system logs is challenging. To this end, we investigate a way to pin-point node failures in supercomputing systems. Our study of Cray system data with automated machine learning tools suggests that specific patterns of event messages about node unavailability can be indicators of node failures. This data extraction, coupled with correlation of system and job data, helps in devising a methodology to predict node failures and their locations over a specific time frame. This work aims to enable broader applicability of a generic fault prediction framework.
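
The log-distillation idea can be sketched as a pattern scan over event messages. The patterns, node-ID format, and threshold below are hypothetical stand-ins, not Cray's actual message catalog.

```python
import re
from collections import Counter

# Hypothetical unavailability patterns of the kind that precede failures.
FAILURE_PATTERNS = [re.compile(p) for p in (
    r"node \S+ unavailable",
    r"heartbeat lost",
)]

def suspect_nodes(log_lines, threshold=2):
    """Count unavailability-pattern hits per node and return the nodes
    whose hit count reaches the threshold within the scanned window."""
    hits = Counter()
    for line in log_lines:
        node = re.search(r"\bnid\d+\b", line)  # Cray-style node ID
        if node and any(p.search(line) for p in FAILURE_PATTERNS):
            hits[node.group()] += 1
    return {n for n, c in hits.items() if c >= threshold}
```

In the methodology described above, a scan like this would be one input among several, correlated with system and job data before predicting a failure location and time frame.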