Issue 5 (Oct 2024)

Astrostatistics News

Issue 5, October 2024

Issue Editors:  Jessi Cisewski-Kehe, David W. Hogg, Vinay L. Kashyap, Aneta Siemiginowska

Astrostatistics News (AN) is a newsletter designed to inform, promote, cultivate, and inspire the astrostatistics community. 


Highlights


National Science Foundation and Simons Foundation Astro-AI Institutes

The National Science Foundation and Simons Foundation recently announced two new AI Institutes that will focus on astronomy.  The two institutes are the NSF-Simons AI Institute for Cosmic Origins (NSF-Simons CosmicAI), led by the University of Texas at Austin, and the NSF-Simons AI Institute for the Sky (NSF-Simons SkAI), led by Northwestern University.


More details are presented in the following press release:
https://new.nsf.gov/news/nsf-simons-foundation-launch-2-ai-institutes-help?utm_medium=email&utm_source=govdelivery



The Cosmostatistics Initiative: a decade into interdisciplinarity

Alberto Krone-Martins (U. California Irvine, USA), Emille E. O. Ishida (CNRS, France), and Rafael S. de Souza (U. Hertfordshire, UK)


How can we bridge the gap between different scientific communities, while developing a common vocabulary and addressing challenging astronomical and cosmological questions?


The history: The Cosmostatistics Initiative (COIN) [1] is an international collaborative effort that aims to answer this question. Historically, building bridges between different scholarly communities has been challenging, in part due to the lack of a common vocabulary across fields. In this context, cosmology and statistics have been evolving in tandem for centuries, and experts on both fronts have acknowledged the need for synergy for decades. During a brainstorming session at the IAU Symposium 306: Statistical Challenges in 21st Century Cosmology [2], held in Lisbon in 2014, the seed concept for the Cosmostatistics Initiative was planted. COIN was created later that year, under the umbrella of the International Astrostatistics Association (IAA).


The model: Based on a previous successful experience [3], and partially inspired by the so-called multi-disciplinary Tiger Teams [4], COIN builds communities through its COIN Residence Programs (CRPs). These small-scale, high-intensity events use astronomy as a catalyst for producing interdisciplinary scientific results.  For one week, we bring together a selected team of individuals to live in a house and work collaboratively on solving a scientific problem proposed and chosen by the participants through voting. In the process, they start developing common vocabularies, thus bridging the gaps between their fields. In the short term, these language bridges -- and often friendships -- help participants to address scientific challenges together, in an accelerated and effective manner.


Outcomes and Community: So far, this effort has resulted in multiple publications in major journals and conferences. In each of these reports, a different team took statistical methodologies that had never been used in astronomy, or had been used only in incipient form before our event, and adapted and applied them to astronomical problems. Historically, CRPs have resulted in two to three publications per event, with some projects generating longer-term collaborations [5]. So far, researchers at all career stages, representing more than 25 nationalities from North and South America, Europe, Africa, the Middle East, South, East and North Asia, and Oceania, have participated in CRPs.


The philosophy: COIN events are essentially sociological experiments. The team-dynamics strategies that initially inspired this initiative [4] aimed to solve complex interdisciplinary problems requiring a robust, and sometimes fast, solution. However, in the context of COIN, there is no problem to start with. At the time of application, participants (and organizers) do not know which project they will work on. The group is then formed by individuals who, by construction, are interested in the experience of scientific collaboration as a goal in itself. In the context of COIN, science is approached as a form of human expression, and in this sense, it is not different from other art forms. The specific scientific results are as remarkable as they are unpredictable, as, in a certain way, they are merely the final materialization of the potential enclosed in that group at that moment in time.


The future: A decade after its creation, there are still many challenges to be overcome. Each new group brings an entirely new set of possibilities, but also a different set of cultural and sociological backgrounds that must be addressed. Moreover, it is still unclear how some version of this experience can be officially incorporated into our current academic model in the longer term. We are determined to continue working so that COIN, as a community, keeps pushing the boundaries of the art of making interdisciplinary science.


If you want to join this endeavor, contact one of the chairs! An upcoming CRP is being baked at this moment for summer 2025!


[1] https://cosmostatistics-initiative.org/

[2] http://sccc21.sim.ul.pt/

[3] Krone-Martins, Ishida & de Souza, 2014, The first analytical expression to estimate photometric redshifts suggested by a machine, MNRASL, Volume 443, Issue 1, 1 September 2014, Pages L34–L38

[4] https://en.wikipedia.org/wiki/Tiger_team

[5] https://cosmostatistics-initiative.org/projects/



2024 PHYSTAT-SBI Meeting Summary

Mikael Kuusela (Carnegie Mellon University)


The PHYSTAT-SBI meeting took place at the Max Planck Institute for Physics in Garching, Germany on May 15-17, 2024. The meeting brought together leading experts in simulation-based inference (SBI) from statistics, machine learning, particle physics, and astrophysics to identify the next research frontiers in SBI. 


The meeting started with overview talks by Gilles Louppe and Kyle Cranmer on the current state of the art and open challenges in SBI. This was followed by three days of technical talks, poster presentations and many discussions about recent methodological developments in SBI as well as SBI applications across particle physics and astrophysics. Common themes that were discussed extensively included the handling of model misspecification, rigorous validation of the trained models, and efficient ways to sample the parameter space. Important open research questions were identified regarding the robustness of SBI inferences and our ability to scale the methods to handle high-dimensional parameters in addition to high-dimensional observations.


The slides of most of the presentations are available at https://indico.cern.ch/event/1355601/timetable/ and are highly recommended reading for anybody working on or interested in simulation-based inference in fundamental physics and beyond.



Historical Astrostatistics

Astrostatistics innovations of the past are highlighted in this section.  


Gauss, Least Squares, and the Missing Planet

By Milton Lim (Columbia University, Milton.lim@columbia.edu)


The method of least squares is a ubiquitous workhorse of modern statistics, yet few statisticians are aware of the historical context behind its invention. On 1 January 1801, a 'new' planet named Ceres was discovered between Mars and Jupiter, which was tracked for 40 days before being lost behind the Sun. An open challenge was issued to all astronomers and mathematicians to predict its unknown orbit; all failed except for a young Carl Friedrich Gauss.


The basic problem is that observations of a moving object are carried out from a moving platform, though the motions are predictable since all the objects follow elliptical paths with the Sun at one focus.  In order to figure out the location of Ceres, the elements of its orbit had to be estimated, and the orbit extrapolated.  Gauss figured out how to find the orbital elements from just three observations through direct geometric calculations without trial and error, and described how to refine the calculations by comparing them with additional observations to minimize the errors.  He was the first to recognize that measurement uncertainties follow the normal distribution, with certain fundamental properties: large errors are less frequent than small ones, deviations are symmetric, and the average represents the most likely value.  This leads directly to the concept of minimizing the sum of squared residuals about a fitted curve.  It allowed him to compute the Keplerian ellipse that predicted the position of Ceres to within ½ degree, in a different area of the sky from where previous searches had been carried out.
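
As a small illustration of the least-squares idea itself (not of Gauss's orbit computation), the sketch below fits a straight line to noisy data by choosing the parameters that minimize the sum of squared residuals. The data, the random seed, and all variable names are made up for this example.

```python
import numpy as np

# Synthetic, made-up data: a straight line plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 50)
true_slope, true_intercept = 2.0, -1.0
y = true_slope * t + true_intercept + rng.normal(0.0, 0.5, size=t.size)

# Design matrix for the model y = slope * t + intercept.
A = np.column_stack([t, np.ones_like(t)])

# Least squares: choose the parameters that minimize sum((y - A @ p)**2).
params, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = params

residuals = y - A @ params
print(f"slope={slope:.3f}, intercept={intercept:.3f}, "
      f"sum of squared residuals={np.sum(residuals**2):.3f}")
```

Any other choice of slope and intercept gives a larger sum of squared residuals for the same data, which is what makes the fitted values the least-squares solution.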


Read Milton Lim’s article on this topic at  https://www.actuaries.digital/2021/03/31/gauss-least-squares-and-the-missing-planet/ 



Spotlight

Astrostatistics innovations of the present are highlighted in this section. 


Optimally weighted PCA for high-dimensional heteroscedastic data

By David Hong (University of Delaware)

Paper:  Hong, D., Yang, F., Fessler, J.A. and Balzano, L., 2023. Optimally weighted PCA for high-dimensional heteroscedastic data. SIAM Journal on Mathematics of Data Science, 5(1), pp.222-250.
Link:  https://doi.org/10.1137/22M1470244

Software package:  https://github.com/dahong67/WeightedPCA.jl

 

Principal component analysis (PCA) is a workhorse method for discovering latent low-dimensional phenomena in noisy high-dimensional data. But what happens when the noise is heterogeneous? This is a common feature of modern astronomical data, and it turns out that (conventional) PCA is not robust to it. Improved PCA methods are needed. This paper studies data with heterogeneous noise levels. Specifically, we consider high-dimensional data with noise that is heteroscedastic across samples, i.e., some samples are noisier than others. PCA is not robust to this form of heterogeneity; the noisiest samples can disproportionately dictate how well (or rather, how poorly) PCA recovers underlying signal components. A simple fix is to instead perform a weighted PCA that gives noisier samples less weight. The question is: what should the weights be? A natural choice is to weight the samples by the inverse of their noise variances, but we discovered a surprising fact: inverse noise variance weights are sub-optimal! In this paper, we derived the actual optimal weights under some natural statistical assumptions, demonstrated their improved performance through numerical simulations, and studied their potential benefits for astronomical data such as quasar spectra from SDSS. The optimal weights are a simple function of the signal and noise variances, making them easy to use. We have also developed a user-friendly package (https://github.com/dahong67/WeightedPCA.jl) that implements Optimally Weighted PCA. Much future work remains as modern astronomical data also exhibit other forms of heterogeneity. Indeed, we have lots of ongoing work building on this paper (handling more general noise heterogeneity, determining how many components to keep, etc.) - contact David to learn more!
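
To make the mechanics of sample-weighted PCA concrete, here is a minimal NumPy sketch that compares conventional PCA with the "natural" inverse noise variance weighting on simulated data. It does not reproduce the optimal weights derived in the paper, and it does not use the WeightedPCA.jl package; the simulation setup, function names, and noise levels are all assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up setup: samples in d dimensions, one latent component u,
# and two groups of samples with very different noise levels.
d, n_clean, n_noisy = 50, 200, 800
u = rng.normal(size=d)
u /= np.linalg.norm(u)
sigma = np.concatenate([np.full(n_clean, 0.3), np.full(n_noisy, 3.0)])
n = sigma.size

scores = rng.normal(size=n)                                   # latent coefficients
Y = np.outer(u, scores) + rng.normal(size=(d, n)) * sigma     # columns are samples

def leading_component(Y, weights):
    """Top eigenvector of the weighted sample covariance sum_i w_i y_i y_i^T."""
    C = (Y * weights) @ Y.T / weights.sum()
    eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    return eigvecs[:, -1]

def recovery(u_hat, u):
    """Squared inner product with the true component (1 means perfect)."""
    return float(np.dot(u_hat, u) ** 2)

u_pca = leading_component(Y, np.ones(n))         # conventional PCA
u_wpca = leading_component(Y, 1.0 / sigma**2)    # inverse noise variance weights

print("unweighted recovery:", recovery(u_pca, u))
print("inverse-variance recovery:", recovery(u_wpca, u))
```

In this toy setup the many noisy samples swamp the unweighted estimate, while down-weighting them recovers the component well. The paper's point is that a different choice of weights does better still.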



A "Rosetta Stone" for Studies of Spatial Variation in Astrophysical Data: Power Spectra, Semivariograms, and Structure Functions

By Sabrina Berger (University of Melbourne, Australian National University) and Benjamin Metha (University of Melbourne)

Paper: Metha, B. and Berger, S., 2024. A "Rosetta Stone" for Studies of Spatial Variation in Astrophysical Data: Power Spectra, Semivariograms, and Structure Functions. arXiv preprint arXiv:2407.14068.  

Link:   https://arxiv.org/abs/2407.14068


How do you quantify spatial correlation in your astrophysics subfield? From the turbulent interstellar medium to the cosmic web, astronomers in many different fields have needed to make sense of spatial data describing our Universe, spanning centimeter to Gigaparsec scales. Because of this history, terminology from a myriad of different fields is used, often to describe two data products that are mathematically identical. In this Note, we clarify the differences and similarities between the power spectrum, two-point correlation function, covariance function, semivariogram, and structure function to unify the language used in spatial correlation analysis. We also highlight under which conditions these data products are useful and describe how the results found using one method can be translated to those found using another, allowing for easier comparison between different subfields' native methods. We hope this document will serve as a "Rosetta Stone" for bridging statistical approaches and promoting cross-disciplinary data analysis.
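
As one concrete example of the kind of translation involved, for a second-order stationary field with covariance function C(h) the semivariogram is γ(h) = C(0) − C(h), and the second-order structure function (the mean squared difference at separation h) is 2γ(h). The power spectrum is in turn the Fourier transform of the covariance function. The sketch below checks the first relation numerically on a simulated one-dimensional sequence. The simulated field and the function names are assumptions made up for this illustration, and conventions and normalizations differ between subfields, so the Note itself is the place to check them.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up 1D stationary field: an AR(1) sequence z[k] = phi * z[k-1] + noise.
n, phi = 20000, 0.8
z = np.empty(n)
z[0] = rng.normal() / np.sqrt(1.0 - phi**2)   # start in the stationary state
for k in range(1, n):
    z[k] = phi * z[k - 1] + rng.normal()
z -= z.mean()

def covariance(z, h):
    """Empirical covariance C(h) between points separated by lag h."""
    return float(np.mean(z[: z.size - h] * z[h:]))

def semivariogram(z, h):
    """Empirical semivariogram: half the mean squared difference at lag h."""
    return float(0.5 * np.mean((z[h:] - z[: z.size - h]) ** 2))

def structure_function(z, h):
    """Second-order structure function: mean squared difference at lag h."""
    return 2.0 * semivariogram(z, h)

# Compare gamma(h) with C(0) - C(h) at a few lags.
for h in (1, 5, 20):
    print(h, semivariogram(z, h), covariance(z, 0) - covariance(z, h))
```

The two printed columns agree up to small edge effects, so a result quoted as a semivariogram can be read as a statement about the covariance function, and the reverse.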



Student-led award-winning astrostatistics papers at JSM 2024

Every year, several contests are run to identify innovative student-led papers at the Joint Statistical Meetings (JSM).  This year was no exception, and the following were finalists or winners of contests run by the American Statistical Association’s Astrostatistics Interest Group (AIG) and Section on Statistics in Imaging.


A Bayesian hierarchical model for the galaxy mass - globular cluster system mass scaling relation for low-mass galaxies

https://ui.adsabs.harvard.edu/abs/2023ApJ...955...22B/abstract 

Samantha Berek (University of Toronto)

Finalist, AIG Student Paper Award competition

We introduce the HERBAL model, a hierarchical errors-in-variables Bayesian lognormal hurdle model, to characterize the scaling relation between galaxy mass and globular cluster (GC) system mass. Our model resolves the ambiguity in the low-mass end of this scaling relation by modeling galaxies without GCs in tandem with the rest of the population. The HERBAL model is able to describe the Local Group galaxy population, measure intrinsic scatter in GC system mass, and estimate mass-to-light ratios for galaxies within its hierarchical structure. With our comprehensive methodology, we are able to (1) show that low-mass galaxies with small or no GC systems follow the same linear scaling relation as higher-mass galaxies, indicating that GC formation and evolutionary processes may be universal, and (2) quantify the spread in GC system masses, which may point to dependencies on other galaxy properties (e.g., environment, star formation history).


GausSN: Bayesian Time-Delay Estimation for Strongly Lensed Supernovae

https://ui.adsabs.harvard.edu/abs/2024MNRAS.530.3942H/abstract 

Erin Hayes (University of Cambridge)

Finalist, AIG Student Paper Award competition

Gravitationally lensed supernovae (glSNe) are rare astrophysical phenomena in which a SN in a background galaxy appears multiple times due to the presence of a lens galaxy in the foreground. These multiple “images” of the SN appear with some delay in time relative to one another which allows us to probe the distance to the SN and, therefore, the expansion rate of the Universe. In this work, we present GausSN – a robust Bayesian model for extracting the time delays of glSNe from light curve data.

GausSN models the true underlying light curve of the SN as a draw from a Gaussian Process, leveraging information from all images and in all wavelength regimes simultaneously. Marginalizing over the shape of the light curve, we sample the time-delays as hyperparameters of the system to give fully Bayesian time-delay estimates. In addition, we include a novel treatment of systematic uncertainties alongside time-delay estimation. GausSN maintains the level of precision and accuracy achieved by existing time-delay extraction methods with fewer assumptions about the underlying light curve shape, and while incorporating a treatment of additional systematic effects into the error budget.

Robust methods for time delay estimation, such as GausSN, will enable precise and accurate estimates of the expansion rate of the Universe, a fundamental parameter in our model of the Universe, with glSNe.


Understanding the formation history of the Milky Way disk using Copulas and Elicitable Maps

https://ui.adsabs.harvard.edu/abs/2023MNRAS.526.1997P/abstract

Aarya Patil (University of Toronto)

Finalist, AIG Student Paper Award competition

In the Milky Way, the distribution of stars in age and chemistry holds essential information about the evolution of the galaxy. We apply copulas and elicitable maps to interpret the observed age-chemical distributions from the APOGEE survey. Copulas are a way to characterize the dependence structure between variables by removing the marginal distributions of the variables. They are useful in our application because much information about Galactic evolution is contained in the relations between chemical compositions and ages of stars, while the overall distribution of age and chemistry is largely set by the star-formation history. Using copulas, we were able to show that the famous [α/Fe]-[Fe/H] plane with its two sequences of stars has a much clearer structure in copula space that clarifies the relation between the two sequences. Elicitable maps provide a way to estimate distribution expectations and quantiles in a nonparametric way given limited data. We used these maps to determine the age-chemical relation from close to the Galactic center to the outer disk, where available data is sparse. Structures such as the bar are clearly visible in our results, helping us tackle major open problems in Milky Way evolution.
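
The idea of "removing the marginal distributions" can be illustrated with a rank transform: each variable is replaced by its normalized rank, which is uniform on (0, 1) whatever the original marginal looked like, so that only the dependence between the variables remains. The sketch below is a generic illustration of this empirical-copula step, not the analysis in the paper; the simulated variables and their names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data: two dependent variables with very different marginals.
n = 2000
x = rng.normal(size=n)
age = np.exp(x)                                   # skewed, positive marginal
abundance = -0.5 * x + 0.3 * rng.normal(size=n)   # roughly Gaussian marginal

def to_pseudo_observations(v):
    """Map values to (0, 1) by their ranks, which removes the marginal shape."""
    ranks = np.argsort(np.argsort(v)) + 1         # ranks 1..n (no ties assumed)
    return ranks / (v.size + 1)

u = to_pseudo_observations(age)
w = to_pseudo_observations(abundance)

# Each transformed variable is uniform on (0, 1); only the dependence remains.
# The correlation of the ranks is Spearman's rho.
rho = np.corrcoef(u, w)[0, 1]
print("rank correlation in copula space:", rho)
```

Because ranks are unchanged by any strictly increasing transformation, the points (u, w) would be identical if the hypothetical age variable were replaced by its logarithm, which is the sense in which the copula is independent of the marginals.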


Improved Weak Lensing Photometric Redshift Calibration via StratLearn and Hierarchical Modeling

https://ui.adsabs.harvard.edu/abs/2024arXiv240104687A/abstract 

Maximilian Autenrieth (Imperial College)

Winner, AIG Student Paper Award competition

Recent comparisons between cosmological parameter estimates from cosmic shear surveys and from Planck cosmic microwave background measurements reveal discrepancies between the amount and clustering strength of dark matter, challenging the ability of the highly successful ΛCDM model to describe the nature of the Universe. To rule out systematic biases in cosmic shear survey analyses, accurate redshift calibration within tomographic bins is key. We propose a new method to improve photo-z calibration via Bayesian hierarchical modeling of full galaxy photo-z conditional densities, by employing StratLearn, a recently developed statistical methodology, which accounts for systematic differences in the distribution of the spectroscopic training/source set and the photometric target set using propensity score stratification, a pivotal methodology in causal inference. We show that StratLearn-estimated conditional densities improve the galaxy tomographic bin assignment, and that our StratLearn-Bayesian framework leads to nearly unbiased estimates of the target population means, evaluated at realistic simulations that were designed to resemble the KiDS+VIKING-450 dataset.

We anticipate that our improved photo-z calibration method might alleviate some discrepancies between the cosmological parameter estimates obtained via weak lensing cosmic shear analysis and Planck CMB measurements.


Uncertainty Quantification of Object Boundaries Extracted from Spatial Point Pattern Images

Under review at AAS Journals

Jue Wang (University of California, Davis)

Winner, Section on Statistics in Imaging Student Paper Award

A common challenge in observational astronomy is the identification of the boundary of an irregular region of extended emission.  This is especially difficult in photon-starved high-energy datasets.  We have developed a new method to estimate the boundary of extended sources in photon lists and to quantify the uncertainty in the boundary.  We use graphed seeded growing to identify segments, then model the segment boundaries using Fourier Descriptors.  This allows us to isolate cases where a well-defined region is identifiable in the field of view, and to construct bootstrap samples and derive uncertainties on them.  We apply them to Chandra and XMM data of galaxies which have complex structure, and show, first, that the boundary estimates in NGC 2300 are robust to details of resolution and background, and second, that the X-ray size of the interacting galaxy Arp299 in XMM is substantially smaller than expected from the optical.



Jargon Dictionary

General definitions of astronomy or statistical terms are included in this section.


The International Astronomical Union is constructing a glossary of common astronomy terms (see https://astro4edu.org/resources/glossary/search/). Here we plan to build up a similar dictionary, focusing on both statistics and astronomy jargon.  


If you have comments, questions, concerns, edits, or terms you would like included, please let us know at astrostatisticsnews@gmail.com.


Metallicity

Elemental composition is a critical parameter required to understand most astronomical objects.  There are various ways to summarize it, ranging from crude (e.g., abundance relative to solar) to highly detailed (e.g., absolute abundances of specific elements and even their ionic species).  A common and highly useful summary is the so-called metallicity, which is simply the abundance of iron (Fe) relative to hydrogen (H), and which stands as a proxy for all metals; astronomers refer to all elements other than the two most abundant in the Universe, H and helium (He), as metals.
VLK


Bolometric correction, Distance modulus, Color excess, Absolute magnitude, etc

Because astronomical techniques have spanned many centuries and many technologies, there is a lot of old observational and quasi-observational jargon. Some of it is subtle: for example, a magnitude is not a (negative) logarithmic measurement of brightness; it is a logarithmic measurement of a brightness ratio. A pedagogical introduction to all these magnitude-connected quantities is available in “Magnitudes, distance moduli, bolometric corrections, and so much more” found at https://arxiv.org/abs/2206.00989

The note is aimed at physicists, but it can be understood by most people with a quantitative background.
DWH
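
As a small worked example of the "brightness ratio" point, the standard relations are m_1 − m_2 = −2.5 log10(F_1/F_2) for two fluxes in the same band, and m − M = 5 log10(d / 10 pc) for the distance modulus in the absence of extinction. The function names below are made up for this sketch.

```python
import math

def magnitude_difference(flux_1, flux_2):
    """m_1 - m_2 for two fluxes measured in the same band and units."""
    return -2.5 * math.log10(flux_1 / flux_2)

def distance_modulus(distance_pc):
    """m - M for a source at the given distance in parsecs (no extinction)."""
    return 5.0 * math.log10(distance_pc / 10.0)

# A source 100 times brighter than another has a magnitude smaller by 5.
print(magnitude_difference(100.0, 1.0))

# At 10 pc apparent and absolute magnitudes coincide; at 100 pc they differ by 5.
print(distance_modulus(10.0), distance_modulus(100.0))
```

Note that only the ratio of fluxes enters: a single flux has no magnitude until a reference flux (a zero point) is chosen.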


Nuisance parameter

The structure of most astrophysics projects (science projects?) is that there is a fundamental model you care about (e.g., an astronomer may care about a model of an exoplanet orbiting a distant star) and then an auxiliary model to handle the things you do not care about (e.g., this might be something that describes the stochastic variability of the star that the planet is orbiting). The parameters of the fundamental model are the parameters you care about, and the parameters of the auxiliary model are what we often call “nuisance parameters.”  Though you generally need to represent the nuisance parameters, and often have to infer them, you do not care about them. Trouble arises when the nuisance parameters are covariant with the parameters you do care about; a solution is to marginalize or profile (see definitions described below).
DWH


Marginalization

In Bayesian contexts, we have priors over parameters, which permit integration (in the calculus sense). Priors are probability distributions, and are therefore also measures; they make integration for a statistical analysis possible. If you have a likelihood function p(y|a,b) (in statistics we often write this as “L(a,b|y)”), where y represents the data, a represents the parameters you care about, and b represents the nuisance parameters, you can integrate over b, but only if you have a prior on b. That is, to integrate, you need a prior p(b), and when you integrate p(y|a,b)p(b) over all b, you obtain the marginalized likelihood or marginal likelihood p(y|a). The rules for constructing and integrating probability distributions are set down in (among other places) “Data analysis recipes: Probability calculus for inference” found at https://arxiv.org/abs/1205.4446.
DWH
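
A minimal numerical sketch of marginalization, under made-up assumptions: the data follow a straight line y = a x + b plus Gaussian noise of known width, the slope a is the parameter of interest, the offset b is the nuisance parameter, and b is given a zero-mean Gaussian prior chosen purely for illustration. The integral over b is done with a simple sum on a grid.

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up data: y = a * x + b + noise, with a of interest and b a nuisance offset.
x = np.linspace(-1.0, 2.0, 30)
a_true, b_true, noise_sd = 1.5, 0.4, 0.3
y = a_true * x + b_true + rng.normal(0.0, noise_sd, size=x.size)

def log_likelihood(a, b):
    """ln p(y | a, b); b may be a scalar or a 1D array of values."""
    b = np.atleast_1d(b)
    resid = y - (a * x + b[:, None])          # shape (len(b), len(y))
    return (-0.5 * np.sum((resid / noise_sd) ** 2, axis=1)
            - x.size * np.log(noise_sd * np.sqrt(2.0 * np.pi)))

def log_prior_b(b, prior_sd=1.0):
    """ln p(b): a zero-mean Gaussian prior, chosen only for illustration."""
    return -0.5 * (b / prior_sd) ** 2 - np.log(prior_sd * np.sqrt(2.0 * np.pi))

b_grid = np.linspace(-5.0, 5.0, 2001)
db = b_grid[1] - b_grid[0]

def log_marginal_likelihood(a):
    """ln p(y | a): ln of the integral of p(y | a, b) p(b) over b (Riemann sum)."""
    log_integrand = log_likelihood(a, b_grid) + log_prior_b(b_grid)
    peak = log_integrand.max()                # factor out the peak for stability
    return peak + np.log(np.sum(np.exp(log_integrand - peak)) * db)

a_grid = np.linspace(0.5, 2.5, 201)
log_marg = np.array([log_marginal_likelihood(a) for a in a_grid])
a_best = a_grid[np.argmax(log_marg)]
print("a maximizing the marginal likelihood:", a_best)
```

In realistic problems the nuisance parameters are usually too numerous for a grid, and the integral is done analytically where possible or by sampling.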


Profiling (statistics)

In frequentist statistics contexts, or when you feel uncomfortable putting a prior p(b) on your nuisance parameters b, you can nonetheless account for covariances between the parameters of interest and the nuisance parameters in your inferences by profiling. The profile likelihood p_b(y|a) is the value of the full likelihood p(y|a,b) evaluated, at each value of a, at the maximum-likelihood value of b given that setting of a. That is, the profile likelihood is the likelihood function optimized over the nuisance parameters. Profiling can be useful when you do not have a principled way to put priors on your nuisance parameters. A recent paper on profiling, aimed at cosmologists, is “Profile Likelihoods in Cosmology: When, Why and How illustrated with ΛCDM, Massive Neutrinos and Dark Energy” found at https://arxiv.org/abs/2408.07700.
DWH
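
A minimal numerical sketch of profiling, under made-up assumptions: the data follow a straight line y = a x + b plus Gaussian noise of known width, the slope a is the parameter of interest, and the offset b is the nuisance parameter. At each value of a the likelihood is maximized over b, here by brute force on a grid; no prior on b is needed.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data: y = a * x + b + noise, with a of interest and b a nuisance offset.
x = np.linspace(-1.0, 2.0, 30)
a_true, b_true, noise_sd = 1.5, 0.4, 0.3
y = a_true * x + b_true + rng.normal(0.0, noise_sd, size=x.size)

def log_likelihood(a, b):
    """ln p(y | a, b), up to an additive constant; b is a 1D array."""
    resid = y - (a * x + np.asarray(b)[:, None])
    return -0.5 * np.sum((resid / noise_sd) ** 2, axis=1)

b_grid = np.linspace(-5.0, 5.0, 2001)

def log_profile_likelihood(a):
    """Maximize ln p(y | a, b) over b on a grid, at fixed a."""
    return float(np.max(log_likelihood(a, b_grid)))

a_grid = np.linspace(0.5, 2.5, 201)
log_prof = np.array([log_profile_likelihood(a) for a in a_grid])
a_hat = a_grid[np.argmax(log_prof)]

# Approximate 68% interval: where ln L drops by 0.5 from its maximum
# (the usual asymptotic rule for one parameter of interest).
inside = a_grid[log_prof >= log_prof.max() - 0.5]
print("profile-likelihood estimate of a:", a_hat)
print("approximate 68% interval:", inside.min(), inside.max())
```

For this linear Gaussian toy problem the maximum of the profile likelihood coincides, up to the grid resolution, with the ordinary least-squares slope.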



Other News


Job Opportunities in Astrostatistics

A list of job opportunities will be maintained at our website, astrostatisticsnews.com/job-opportunities.


Open-rank tenure-track position in astrostatistics at Penn State

Center for Astrostatistics, Department of Statistics, Penn State

Details: https://www.mathjobs.org/jobs/list/20684 

Deadlines: Application deadline is October 15, 2024, continuing until selection


Assistant Professor of Physics and Statistical and Data Sciences 

Department of Physics and the Program in Statistical and Data Sciences, Smith College 

Details:  The Department of Physics and the Program in Statistical and Data Sciences at Smith College invite applications for a joint tenure-track position at the rank of Assistant Professor, to begin July 1, 2025.  Teaching responsibilities for this position will rotate through introductory and advanced courses in the standard undergraduate physics curriculum, and introductory and intermediate courses in data science and/or statistics, using Python and/or R.  A Ph.D. in Physics or a closely related field is expected by the time of appointment;  the candidate’s research should make extensive use of advanced data science tools and methods to further our physical understanding. Additional degrees in data science and/or statistics are welcome but not required. Candidates from groups underrepresented in Physics are encouraged to apply. Details about the Department of Physics and the Program in Statistical and Data Sciences may be found at https://www.smith.edu/academics/physics and https://www.smith.edu/academics/statistical-data-sciences.  For more information and to apply, visit https://apply.interfolio.com/151219. Applications are due on or before 10/4/24. EO/AA/Vet/Disability Employer.

Deadlines: Application deadline is October 4, 2024



Astrostatistics Events

A list of events is maintained at our website, astrostatisticsnews.com/upcoming-events.


STAMPS Seminar Series


The STAtistical Methods for the Physical Sciences Research Center (STAMPS@CMU, https://www.cmu.edu/dietrich/statistics-datascience/stamps/index.html) launched its seminar series on September 20, 2024.


Talks are open to everyone who registers on the website:

https://www.cmu.edu/dietrich/statistics-datascience/stamps/events/webinars/index.html



Content suggestions


If you have ideas for AN content, please send a message to astrostatisticsnews@gmail.com. We may include your idea in a future issue if we think it is a good fit.

Ideas may include relevant astrostatistics papers/data/code, visualizations, upcoming events, job postings, format or commentary suggestions, etc.  


Astrostatistics News website


See astrostatisticsnews.com for more information, such as past issues and lists of astrostatistics references and societies.


Subscribe to Astrostatistics News

To subscribe to Astrostatistics News, go to https://groups.google.com/g/astrostatistics-news and select the “Join group” button.  You will need to be logged into your Google account to join the group.