AstroSpectro
Interpretable ML pipeline for stellar spectral classification on LAMOST DR5 × Gaia DR3
Overview
AstroSpectro is an open-source research project at the intersection of stellar spectroscopy, machine learning, and astronomical survey analysis. The project focuses on building physically grounded and interpretable pipelines for analysing large collections of stellar spectra, using data from LAMOST DR5 cross-matched with Gaia DR3.
The current pipeline works on a cleaned dataset of more than 43,000 spectra and combines raw FITS ingestion, preprocessing, continuum normalisation, physics-based feature extraction, supervised classification, dimensionality reduction, clustering, and SHAP-based interpretability. Rather than treating classification accuracy as the final objective, AstroSpectro is designed to ask a broader scientific question:
What physical information is actually being learned from stellar spectra by modern machine-learning models?
Scientific Motivation
Classical stellar classification has historically been built around the visually assessed strength of spectral lines, with the Balmer series serving as a key temperature indicator. In large survey datasets, however, machine-learning models may learn a more complex combination of temperature, metallicity, surface gravity, evolutionary state, and survey selection effects.
AstroSpectro investigates this by extracting physically motivated spectroscopic descriptors and testing which features actually drive classification decisions. A central result of the current pipeline is that metallicity-sensitive features, including Ca II H&K and Mg b, often rank above Balmer-line temperature proxies in SHAP-based feature importance analyses.
This does not replace the classical MK framework. Instead, it suggests that data-driven classifiers trained on large heterogeneous surveys may encode stellar population structure in ways that are not reducible to temperature alone.
Pipeline
| Stage | Description |
|---|---|
| Data acquisition | Robust download and management of LAMOST DR5 FITS spectra |
| Catalogue construction | FITS-header metadata extraction and Gaia DR3 cross-matching |
| Preprocessing | Wavelength reconstruction, flux normalisation, inverse-variance cleaning |
| Feature engineering | Physics-based spectroscopic descriptors: Balmer lines, metallic lines, molecular bands, continuum indices, line profiles |
| Supervised learning | XGBoost and related classifiers for stellar spectral classification |
| Dimensionality reduction | PCA, UMAP, t-SNE, and autoencoder-based latent-space exploration |
| Interpretability | SHAP analysis to connect model decisions back to physical spectral features |
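
To make the feature-engineering stage more concrete, here is a minimal sketch of one common kind of physics-based descriptor: the equivalent width of an absorption line measured against a local pseudo-continuum drawn through two side bands (a Lick-index-style measurement). It is an illustration under stated assumptions, not the project's actual implementation. The function names, the band limits, and the synthetic spectrum are all hypothetical, and the code uses only the standard library so that it runs anywhere, whereas the pipeline itself is built on NumPy and Astropy.

```python
import math
from statistics import fmean


def band_mean(wave, flux, lo, hi):
    """Mean wavelength and mean flux of the pixels with lo <= wavelength <= hi."""
    pairs = [(w, f) for w, f in zip(wave, flux) if lo <= w <= hi]
    if not pairs:
        raise ValueError(f"no pixels in band [{lo}, {hi}]")
    return fmean(w for w, _ in pairs), fmean(f for _, f in pairs)


def equivalent_width(wave, flux, line, blue, red):
    """Equivalent width of a line, in the same units as `wave`.

    The pseudo-continuum is a straight line through the mean flux of the
    blue and red side bands. The line band is then integrated with the
    trapezoidal rule over (1 - flux / continuum).
    """
    wb, fb = band_mean(wave, flux, *blue)
    wr, fr = band_mean(wave, flux, *red)
    slope = (fr - fb) / (wr - wb)

    # Fractional depth below the pseudo-continuum at each pixel in the line band.
    pts = [
        (w, 1.0 - f / (fb + slope * (w - wb)))
        for w, f in zip(wave, flux)
        if line[0] <= w <= line[1]
    ]
    return sum(
        0.5 * (d0 + d1) * (w1 - w0)
        for (w0, d0), (w1, d1) in zip(pts, pts[1:])
    )


# Synthetic spectrum (hypothetical): flat continuum with a Gaussian
# absorption line centred at 6563 Å, sampled every 0.5 Å.
wave = [6500.0 + 0.5 * i for i in range(261)]
depth, sigma, centre = 0.5, 2.0, 6563.0
flux = [
    2.0 * (1.0 - depth * math.exp(-0.5 * ((w - centre) / sigma) ** 2))
    for w in wave
]

# Band limits below are illustrative assumptions, not the project's definitions.
ew = equivalent_width(
    wave,
    flux,
    line=(6543.0, 6583.0),
    blue=(6510.0, 6530.0),
    red=(6596.0, 6616.0),
)
print(f"EW = {ew:.3f} Å")  # analytic value: depth * sigma * sqrt(2*pi) ≈ 2.507
```

Because the measurement is a ratio to the local continuum, it does not depend on the absolute flux scale, which is one reason descriptors of this kind are convenient for survey spectra with uncertain flux calibration.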
Current Results
Current experiments show that:
- physically motivated spectral features can achieve strong classification performance without relying on instrumental or positional metadata;
- dimensionality-reduction methods reveal a structured spectral manifold consistent with known stellar-population gradients;
- SHAP analysis identifies both expected features, such as Balmer lines, and less obvious discriminants, such as metallicity-sensitive Ca and Mg features;
- removing potentially leaky or non-physical metadata improves the scientific interpretability of the model;
- F/G/K transitions appear as continuous structures in the learned representation rather than sharply separated categories.
These results support the main goal of the project: developing a workflow where machine learning is not only predictive, but also interpretable in terms of stellar astrophysics.
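
As a rough illustration of how a SHAP-based feature ranking like the one described above can be produced, the sketch below ranks features by their mean absolute SHAP value across samples, which is a widely used global-importance summary. The feature names and the numbers are toy values typed in by hand purely for demonstration. They are not project output, and `rank_features` is a hypothetical helper rather than part of the AstroSpectro code base. In practice the matrix would come from the SHAP library applied to a trained classifier.

```python
def rank_features(shap_values, feature_names):
    """Rank features by mean absolute SHAP value, largest first.

    `shap_values` is a list of rows (one per sample), each holding one
    SHAP value per feature, in the same order as `feature_names`.
    """
    n_samples = len(shap_values)
    scores = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy inputs (hypothetical names and values, for illustration only).
feature_names = ["H_alpha", "CaII_K", "Mg_b"]
shap_values = [
    [0.2, 0.4, -0.1],
    [0.1, -0.6, 0.3],
    [-0.2, 0.5, -0.2],
]

ranking = rank_features(shap_values, feature_names)
for name, score in ranking:
    print(f"{name:8s} {score:.3f}")
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise average out to near zero despite being influential.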
Technical Stack
Python · Astropy · NumPy · pandas · scikit-learn · XGBoost · PyTorch · UMAP · HDBSCAN · SHAP · Weights & Biases · Docusaurus
Links
- 📁 GitHub Repository
- 📖 Full Documentation
- 📊 Experiment tracking with Weights & Biases
Status: active research and development project. The current focus is on scientific validation, interpretability, dimensionality reduction, and preparation of reproducible research outputs.