Provenance-Aware Machine Learning Pipelines for Large Astronomical Surveys
Faculty Mentor
Chris Cain
Presentation Type
Poster
Start Date
4-14-2026 11:30 AM
End Date
4-14-2026 1:30 PM
Location
PUB NCR
Primary Discipline of Presentation
Computer Science
Abstract
Machine learning methods for astronomical object classification in large sky surveys are studied using data from the Legacy Survey of Space and Time (LSST), the wide-field survey of the Vera C. Rubin Observatory. Modern astronomical datasets present significant challenges for machine learning due to their scale, calibration and instrumental systematics, noisy measurements, observational biases, and high-dimensional feature spaces. To address these challenges, a provenance-aware extension to the DAG-based data abstraction layer underlying the Rubin Observatory LSST pipeline system is developed. The framework introduces a lineage-aware machine learning pipeline that tracks data transformations including cross-matching, labeling, data cleaning, normalization, and feature extraction across machine learning workflows. The schema-based graph representation enables reproducible workflows and transparent data lineage tracking. Using the LSST Data Preview 1 (DP1) release, which provides early survey data suitable for developing and evaluating machine learning methods, galaxies from the 3D-HST Hubble Space Telescope survey in the Chandra Deep Field South (CDFS) are cross-matched with LSST sources to construct a labeled dataset. Deep-field cross-survey datasets provide high-quality labels that enable development and evaluation of machine learning methods for wide-area survey data. The resulting dataset integrates photometric measurements and image cutouts with star formation rate (SFR) estimates for galaxies from the 3D-HST catalog. Machine learning models combining image and photometric features are trained to evaluate the predictive capability of LSST observations. A hybrid ensemble architecture integrates convolutional neural networks applied to image data with tree-based models trained on photometric features. Performance is assessed using supervised and unsupervised methods, including dimensionality reduction and clustering analyses of the learned feature space. 
These analyses evaluate the ability of models to infer galaxy properties from LSST data, using star formation rate as a case study, and demonstrate the utility of provenance-aware machine learning pipelines for large-scale astronomical surveys.
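The lineage tracking described above can be illustrated with a minimal sketch. It is not the pipeline developed in this work and does not use the Rubin Observatory middleware: the `LineageGraph` class, the `cross_match` and `normalize` helpers, the column names, and all catalog values are hypothetical toy stand-ins. Each transformation is recorded as a node in a directed acyclic graph, keyed by a hash of the step name, its parameters, and its parent nodes, so that any derived dataset can be traced back to its sources.

```python
import hashlib
import json
import math
from dataclasses import dataclass, field


def fingerprint(obj) -> str:
    """Stable short hash of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class LineageGraph:
    """Hypothetical lineage store: nodes are datasets, edges point to parents."""
    nodes: dict = field(default_factory=dict)

    def add_source(self, name, data):
        """Register an input catalog; returns (node_id, data)."""
        node_id = fingerprint({"step": "source", "name": name, "data": data})
        self.nodes[node_id] = {"step": f"source:{name}", "params": {}, "parents": []}
        return node_id, data

    def apply(self, step, func, inputs, **params):
        """Run func on the input datasets and record the step in the graph."""
        parent_ids = [node_id for node_id, _ in inputs]
        result = func(*[data for _, data in inputs], **params)
        node_id = fingerprint({"step": step, "params": params, "parents": parent_ids})
        self.nodes[node_id] = {"step": step, "params": params, "parents": parent_ids}
        return node_id, result

    def lineage(self, node_id):
        """Step names from the sources down to node_id (parents listed first)."""
        seen, order = set(), []

        def visit(current):
            if current in seen:
                return
            seen.add(current)
            for parent in self.nodes[current]["parents"]:
                visit(parent)
            order.append(self.nodes[current]["step"])

        visit(node_id)
        return order


def cross_match(survey, reference, radius_arcsec):
    """Toy nearest-neighbor match using a flat-sky, small-field approximation."""
    matched = []
    for src in survey:
        best, best_sep = None, radius_arcsec
        for ref in reference:
            dra = (src["ra"] - ref["ra"]) * math.cos(math.radians(ref["dec"])) * 3600
            ddec = (src["dec"] - ref["dec"]) * 3600
            sep = math.hypot(dra, ddec)  # separation in arcseconds
            if sep <= best_sep:
                best, best_sep = ref, sep
        if best is not None:
            # Attach the reference label to the matched survey source.
            matched.append({**src, "log_sfr": best["log_sfr"]})
    return matched


def normalize(rows, column):
    """Subtract the column mean so the feature is centered on zero."""
    mean = sum(row[column] for row in rows) / len(rows)
    return [{**row, column: row[column] - mean} for row in rows]


# Made-up toy catalogs (not real DP1 or 3D-HST values).
graph = LineageGraph()
lsst = graph.add_source("lsst_dp1_toy", [
    {"ra": 53.1000, "dec": -27.8000, "mag_r": 22.0},
    {"ra": 53.2000, "dec": -27.9000, "mag_r": 24.0},
    {"ra": 53.3000, "dec": -27.7000, "mag_r": 23.0},
])
hst = graph.add_source("3dhst_toy", [
    {"ra": 53.1001, "dec": -27.8000, "log_sfr": 0.5},
    {"ra": 53.2000, "dec": -27.9001, "log_sfr": 1.2},
])

labeled = graph.apply("cross_match", cross_match, [lsst, hst], radius_arcsec=1.0)
features = graph.apply("normalize", normalize, [labeled], column="mag_r")

print(graph.lineage(features[0]))
```

Because node identifiers depend on the step parameters and parent identifiers, rerunning the cross-match with a different matching radius yields a distinct node, which keeps otherwise identical-looking training sets distinguishable. A production system would presumably also record software versions and persist the graph, which this sketch omits.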
Recommended Citation
Sergio, David, "Provenance-Aware Machine Learning Pipelines for Large Astronomical Surveys" (2026). 2026 Symposium. 30.
https://dc.ewu.edu/srcw_2026/ps_2026/p2_2026/30
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.