Poster Session 2

Provenance-Aware Machine Learning Pipelines for Large Astronomical Surveys

David Sergio, Eastern Washington University

Faculty Mentor

Chris Cain

Presentation Type

Poster

Start Date

4-14-2026 11:30 AM

End Date

4-14-2026 1:30 PM

Location

PUB NCR

Primary Discipline of Presentation

Computer Science

Abstract

Machine learning methods for astronomical object classification in large sky surveys are studied using data from the Legacy Survey of Space and Time (LSST), the wide-field survey of the Vera C. Rubin Observatory. Modern astronomical datasets present significant challenges for machine learning due to their scale, calibration and instrumental systematics, noisy measurements, observational biases, and high-dimensional feature spaces. To address these challenges, a provenance-aware extension to the DAG-based data abstraction layer underlying the Rubin Observatory LSST pipeline system is developed. The framework introduces a lineage-aware machine learning pipeline that tracks data transformations including cross-matching, labeling, data cleaning, normalization, and feature extraction across machine learning workflows. The schema-based graph representation enables reproducible workflows and transparent data lineage tracking. Using the LSST Data Preview 1 (DP1) release, which provides early survey data suitable for developing and evaluating machine learning methods, galaxies from the 3D-HST Hubble Space Telescope survey in the Chandra Deep Field South (CDFS) are cross-matched with LSST sources to construct a labeled dataset. Deep-field cross-survey datasets provide high-quality labels that enable development and evaluation of machine learning methods for wide-area survey data. The resulting dataset integrates photometric measurements and image cutouts with star formation rate (SFR) estimates for galaxies from the 3D-HST catalog. Machine learning models combining image and photometric features are trained to evaluate the predictive capability of LSST observations. A hybrid ensemble architecture integrates convolutional neural networks applied to image data with tree-based models trained on photometric features. Performance is assessed using supervised and unsupervised methods, including dimensionality reduction and clustering analyses of the learned feature space. 
These analyses evaluate the ability of models to infer galaxy properties from LSST data, using star formation rate as a case study, and demonstrate the utility of provenance-aware machine learning pipelines for large-scale astronomical surveys.
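
The lineage tracking described in the abstract can be illustrated with a minimal sketch. It is not the author's framework and does not use the Rubin Observatory software stack. The `LineageNode` class, the `cross_match` and `normalize` helpers, the 1-arcsecond match radius, and all catalog values below are hypothetical, chosen only to show how each transformation can record its parents and parameters in a directed acyclic graph.

```python
import hashlib
import json
import math
from dataclasses import dataclass, field


@dataclass
class LineageNode:
    """One dataset version in a lineage DAG (hypothetical schema)."""
    name: str
    operation: str
    params: dict
    parents: list = field(default_factory=list)
    rows: list = field(default_factory=list)

    @property
    def node_id(self) -> str:
        # Hash the name, operation, parameters, and parent IDs so that the
        # same upstream history always yields the same identifier.
        payload = json.dumps(
            {"name": self.name, "op": self.operation, "params": self.params,
             "parents": [p.node_id for p in self.parents]},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

    def lineage(self) -> list:
        """Return node names in dependency order, ancestors first."""
        seen, order = set(), []

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node.name)

        visit(self)
        return order


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600


def cross_match(survey, labels, radius_arcsec=1.0):
    """Brute-force nearest-neighbour match that returns a new lineage node."""
    def sep(src, lab):
        return angular_sep_arcsec(src["ra"], src["dec"], lab["ra"], lab["dec"])

    matched = []
    for src in survey.rows:
        best = min(labels.rows, key=lambda lab: sep(src, lab), default=None)
        if best is not None and sep(src, best) <= radius_arcsec:
            matched.append({**src, "log_sfr": best["log_sfr"]})
    return LineageNode("matched", "cross_match",
                       {"radius_arcsec": radius_arcsec},
                       parents=[survey, labels], rows=matched)


def normalize(node, column):
    """Standardize one column and record the statistics used as parameters."""
    values = [row[column] for row in node.rows]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    rows = [{**row, column: (row[column] - mean) / std} for row in node.rows]
    return LineageNode("features", "normalize",
                       {"column": column, "mean": mean, "std": std},
                       parents=[node], rows=rows)


# Toy catalogs with made-up values near the CDFS field (not real data).
lsst = LineageNode("lsst_sources", "load", {"source": "toy survey table"}, rows=[
    {"ra": 53.10000, "dec": -27.80000, "r_mag": 22.0},
    {"ra": 53.20000, "dec": -27.70000, "r_mag": 24.0},
    {"ra": 53.30000, "dec": -27.90000, "r_mag": 23.0},  # no counterpart
])
hst = LineageNode("hst_labels", "load", {"source": "toy label table"}, rows=[
    {"ra": 53.10001, "dec": -27.80001, "log_sfr": 0.5},
    {"ra": 53.20002, "dec": -27.70001, "log_sfr": 1.2},
])

matched = cross_match(lsst, hst)
features = normalize(matched, "r_mag")
print(features.node_id, features.lineage())
```

Because each identifier is derived from the operation, its parameters, and the identifiers of its parents, changing any upstream setting (for example the match radius) changes the identifier of every downstream node. That is one simple way to make a derived training set traceable to the exact steps that produced it.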

Recommended Citation

Sergio, David, "Provenance-Aware Machine Learning Pipelines for Large Astronomical Surveys" (2026). 2026 Symposium. 30.
https://dc.ewu.edu/srcw_2026/ps_2026/p2_2026/30

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

This document is currently not available here.
