Provenance-Aware Machine Learning Pipelines for Large Astronomical Surveys

Faculty Mentor

Chris Cain

Presentation Type

Poster

Start Date

4-14-2026 11:30 AM

End Date

4-14-2026 1:30 PM

Location

PUB NCR

Primary Discipline of Presentation

Computer Science

Abstract

Machine learning methods for astronomical object classification in large sky surveys are studied using data from the Legacy Survey of Space and Time (LSST), the wide-field survey of the Vera C. Rubin Observatory. Modern astronomical datasets present significant challenges for machine learning due to their scale, calibration and instrumental systematics, noisy measurements, observational biases, and high-dimensional feature spaces. To address these challenges, a provenance-aware extension to the DAG-based data abstraction layer underlying the Rubin Observatory LSST pipeline system is developed. The framework introduces a lineage-aware machine learning pipeline that tracks data transformations, including cross-matching, labeling, data cleaning, normalization, and feature extraction, across machine learning workflows. The schema-based graph representation enables reproducible workflows and transparent data lineage tracking. Using the LSST Data Preview 1 (DP1) release, which provides early survey data suitable for developing and evaluating machine learning methods, galaxies from the 3D-HST Hubble Space Telescope survey in the Chandra Deep Field South (CDFS) are cross-matched with LSST sources to construct a labeled dataset. Deep-field cross-survey datasets provide high-quality labels that enable development and evaluation of machine learning methods for wide-area survey data. The resulting dataset integrates photometric measurements and image cutouts with star formation rate (SFR) estimates for galaxies from the 3D-HST catalog. Machine learning models combining image and photometric features are trained to evaluate the predictive capability of LSST observations. A hybrid ensemble architecture integrates convolutional neural networks applied to image data with tree-based models trained on photometric features. Performance is assessed using supervised and unsupervised methods, including dimensionality reduction and clustering analyses of the learned feature space. These analyses evaluate the ability of models to infer galaxy properties from LSST data, using star formation rate as a case study, and demonstrate the utility of provenance-aware machine learning pipelines for large-scale astronomical surveys.
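
The lineage tracking described in the abstract can be illustrated with a minimal sketch. It is not the Rubin Observatory pipeline API or the authors' implementation: the names LineageGraph, fingerprint, cross_match, and normalize are hypothetical, the records are made-up toy values, and the toy cross-match joins on a shared id, whereas a real cross-match between surveys would typically match on sky position. Each node stores the step that produced it, its parameters, its parent nodes, and a content digest, so the full chain of transformations behind any dataset can be recovered.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(payload) -> str:
    """Stable SHA-256 digest of a JSON-serializable payload."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


@dataclass
class LineageGraph:
    """Hypothetical DAG: nodes are dataset versions, edges point to parents."""
    nodes: dict = field(default_factory=dict)

    def add_source(self, name, data):
        """Register raw input data that has no parents."""
        return self._record(name, "source", {}, [], data)

    def apply(self, name, step, params, parents, func):
        """Run func on the parents' data and record the result as a new node."""
        inputs = [self.nodes[p]["data"] for p in parents]
        return self._record(name, step, params, parents, func(*inputs, **params))

    def _record(self, name, step, params, parents, data):
        if name in self.nodes:
            raise ValueError(f"node {name!r} already exists")
        self.nodes[name] = {
            "step": step,
            "params": params,
            "parents": list(parents),
            "digest": fingerprint(data),
            "data": data,
        }
        return name

    def lineage(self, name):
        """Return (node, step) pairs for a node and its ancestors, parents first."""
        ordered, seen = [], set()

        def visit(node):
            if node in seen:
                return
            seen.add(node)
            for parent in self.nodes[node]["parents"]:
                visit(parent)
            ordered.append((node, self.nodes[node]["step"]))

        visit(name)
        return ordered


def cross_match(sources, labels):
    """Toy cross-match: attach a label to every source with a matching id."""
    sfr_by_id = {row["id"]: row["sfr"] for row in labels}
    return [{**s, "sfr": sfr_by_id[s["id"]]} for s in sources if s["id"] in sfr_by_id]


def normalize(rows, key):
    """Scale one column by its maximum value."""
    peak = max(row[key] for row in rows)
    return [{**row, key: row[key] / peak} for row in rows]


graph = LineageGraph()
graph.add_source("lsst_sources", [{"id": 1, "flux": 2.0}, {"id": 2, "flux": 4.0}, {"id": 3, "flux": 1.0}])
graph.add_source("hst_labels", [{"id": 1, "sfr": 0.5}, {"id": 2, "sfr": 3.0}])
graph.apply("matched", "cross_match", {}, ["lsst_sources", "hst_labels"], cross_match)
graph.apply("features", "normalize", {"key": "flux"}, ["matched"], normalize)

for node, step in graph.lineage("features"):
    print(node, step, graph.nodes[node]["digest"][:12])
```

Storing a digest per node means that rerunning the same steps on the same inputs can be checked for identical outputs, which is one simple way to make a workflow's reproducibility testable.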

