# Issue 2 (April 2023)

Astrostatistics News


Issue Editors: Jessi Cisewski-Kehe, David W. Hogg, Vinay L. Kashyap, Aneta Siemiginowska

Astrostatistics News (AN) is a newsletter designed to inform, promote, cultivate, and inspire the astrostatistics community.

In this issue, you will find a discussion of the data challenges for the Rubin Observatory/LSST, historical background on the method of least squares, a remembrance of astrostatistics pioneer Alanna Connors, a preview of the upcoming Statistical Challenges in Modern Astronomy conference, an approach for pinning down uncertainty in periodicities, an overdispersed χ² distribution for Poisson count spectral modeling, and more.

# Highlights

## Rubin Observatory/LSST data challenges

by Tom Loredo & François Lanusse

On the high-and-dry Atacama plateau in Chile, workers are completing the construction of the Vera C. Rubin Observatory, whose 8.4-meter Simonyi Survey Telescope, equipped with a 3.2-gigapixel camera, is expected to achieve "first light" in August of next year. Named for one of the discoverers of dark matter, the observatory was originally conceived over 20 years ago as a tool for studying dark matter in unprecedented detail by repeatedly surveying the southern sky, building up a deep galaxy catalog. Such a repeated, sensitive survey could address topics across many areas of astronomy, so the project quickly evolved to have a much broader scope. Philanthropists kickstarted the project in 2008 by funding construction of the main mirror. The NSF and DOE began funding the project in earnest in 2013, with Chile and other international partners eventually contributing. Full operation should finally commence by early 2025.

In its first 10 years of operation, the Rubin Observatory will undertake the Legacy Survey of Space and Time (LSST), the largest and most sensitive optical sky survey ever planned. Its automated telescope will systematically scan the southern night sky, making a complete pass roughly twice a week, alternating between six spectral bands ("filters"). About 20 TB of image data will be produced each night, building up a complex stop-motion multi-color movie of the southern sky, enabling both static-sky and time-domain astronomy with unprecedented reach. Near-real-time processing will produce about 10 million alerts of transient, variable, and moving sources each night. Annual public data releases will include calibrated images and catalogs of dozens of measured properties for tens of billions of stars and galaxies, and for about six million solar system objects. These data will enable transformative discoveries in nearly every area of astronomy and cosmology. With a final database size anticipated to comprise fifteen petabytes, LSST is certainly a big data project. But mining that dataset will produce a wealth of statistical problems spanning all scales.

LSST data will come in three main types. The fundamental data will be calibrated images (of stars, galaxies, minor planets, and spatially extended emission). Image processing will detect and characterize sources in each image, with measurements compiled in irregularly-sampled multivariate time series ("light curves"), with roughly a thousand measurements per object over 10 years. Stacking and more sophisticated accumulation of data over time for each source will produce large multivariate survey catalogs. LSST data analysis challenges thus include image processing (source detection, deblending, deconvolution of optical and atmospheric effects, image classification), time series analysis (including characterization, classification, and period detection with sparse, multivariate, asynchronous time series), multivariate demographic studies (accounting for both measurement error and selection biases akin to censoring and truncation), and modeling complex marked spatial point processes (the large-scale distributions of stars and galaxies, with rich multivariate "marks").

These problem types are not always disjoint; in particular, image and time series characterization often will arise in a demographic context: e.g., how do the images of galaxies (2D functions), or the light curves of variable stars (1D functions in each filter), behave across a population? In statistical lingo, the Rubin Observatory will be a functional data factory. To date, functional data analysis (FDA, roughly, statistical modeling of populations of functions) has had little direct impact in astronomy; astronomers have been inventing clever FDA methods on their own (without the FDA label). LSST is an invitation to FDA-savvy statisticians to join the fray.

Finally, the full panoply of ground- and space-based telescopes will be used to follow up on LSST observations, often in near real-time. LSST will thus be a major driver for multiwavelength and multimessenger astronomy, requiring joint analysis of LSST image and time series data with spectroscopic and other data from follow-up instruments.

The large scale of LSST data is driving increasing interest in, and work on, machine learning (ML) approaches, both for traditional ML tasks like image and time series classification and as surrogates or emulators for statistical analysis in settings where conventional statistical approaches are computationally infeasible. In a number of important LSST science areas, particularly in cosmology, physical modeling is typically done via simulation, driving a surge of interest in simulation-based inference (SBI, or likelihood-free inference) for fitting, comparing, and checking parametric and semiparametric models.

The Rubin Observatory Informatics and Statistics Science Collaboration (ISSC) aims to be the home for statistically-savvy astronomers, statisticians, and other information scientists who want to bring their expertise to LSST data science challenges. More information about the ISSC is available at the ISSC web site, which includes a broad overview of LSST data science challenges (a dedicated webpage is forthcoming).

# Historical Astrostatistics

Astrostatistics innovations and innovators of the past are highlighted in this section.

## Legendre, Gauss, and Least squares

By Jessi Cisewski-Kehe

As we look forward in anticipation to the bright future of astrostatistics, we plan to look backwards to highlight important developments bridging the fields that today comprise astrostatistics. We begin with the invention of the method of least squares.

Statisticians who have dabbled in the historical foundations of statistics are familiar with the work of Professor Stephen M. Stigler, the Ernest DeWitt Burton Distinguished Service Professor Emeritus at the University of Chicago. Professor Stigler has written several books touching on various aspects of the history of statistics. What does this have to do with astrostatistics, you may be asking? The first part of his 1986 book, “The history of statistics: The measurement of uncertainty before 1900” (Harvard University Press), is devoted to “The Development of Mathematical Statistics in Astronomy and Geodesy before 1827”, and its first chapter focuses on the development of least squares and the “combination of observations.” Perhaps embarrassingly (as a statistician), I had never pondered the ingenuity or insight required to first combine different observations in a principled manner.

The method of least squares is ubiquitous in both statistics and astronomy. It is a standard topic in introductory statistics courses as students begin their journeys into regression, where the objective is to minimize the sum of the squared differences between the observed values and the model's predictions. Who invented this fundamental method is, at least somewhat, disputed. The two finalists are Adrien-Marie Legendre (1752-1833) and Johann Carl Friedrich Gauss (1777-1855). In 1805, Legendre published Nouvelles Méthodes pour la Détermination des Orbites des Comètes, which included a clear exposition of the method of least squares in the appendix. As the title suggests, his focus was on new methods for determining the orbits of comets. This is, apparently, the first comprehensive presentation of the method of least squares, which was rather quickly adopted as a standard tool in astronomy (Stigler, 1986). However, Gauss claimed that he had invented the method and communicated his ideas before 1805. Whom did he claim to have told? Several astronomers (Stigler, 1981). It is possible that Gauss did come up with the method prior to 1805 (see details in Stigler, 1981), but (note to self) a well-communicated presentation can be an effective marketing tool.
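For readers who want to see the objective in action, here is a minimal sketch of a straight-line least-squares fit in Python with NumPy. The data values and the function name `fit_line` are made up for illustration; they do not come from Legendre, Gauss, or Stigler.

```python
import numpy as np

def fit_line(x, y):
    """Fit y ~ intercept + slope * x by minimizing the sum of squared residuals."""
    # Design matrix: a column of ones (intercept) and the predictor x (slope).
    design = np.column_stack([np.ones_like(x), x])
    # lstsq returns the coefficients minimizing ||y - design @ coeffs||^2.
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coeffs
    return coeffs, residuals

# Hypothetical observations of a roughly linear trend (illustrative only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

(intercept, slope), residuals = fit_line(x, y)
sum_sq = float(np.sum(residuals**2))
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}, sum of squares = {sum_sq:.3f}")
```

At the minimum, the residuals are orthogonal to every column of the design matrix. That condition is the "normal equations" form of the method.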

References:

Stigler, S.M., 1981. Gauss and the invention of least squares. The Annals of Statistics, pp.465-474.

Stigler, S.M., 1986. The history of statistics: The measurement of uncertainty before 1900. Harvard University Press.

## Remembering Alanna Connors (1956-2013)

by Jeff Scargle

Alanna Connors was a pioneering astrophysicist who played a pivotal role in developing the nascent field of astrostatistics over two decades spanning the turn of the millennium. A gamma-ray astronomer and a principled Bayesian, she worked out use cases and pushed hard for greater integration between astronomy and statistics. She passed away too early a decade ago after a long battle with cancer. Jeff Scargle remembers her:

Anyone who had the good fortune to interact with Alanna Connors in any capacity well knows what a kind and generous person she was, dedicated to family, friends and the community. She was also a bright light in modern astrophysics, dedicated to a principled way of probing the Universe. One can only speculate as to what benefits we have missed by her life ending far too soon. But I will describe my view of her scientific viewpoint. (Even listing the specifics of her contributions would take much more space.)

Neologisms like astrostatistics, machine and "deep" learning, and data science may be useful, but I think they also tend to emphasize academic fracture lines. Alanna was a major pioneer of an approach that transcends these ideas and crosses these lines: principled analysis. To me this term means extracting knowledge about the Universe, from hard-won observational data, using an inductive process based on rigorous mathematics and statistical science. In spite of the historical origins of much of statistics in astronomical contexts, I believe that 20th century astronomy largely reneged on the promise of even 18th century mathematics, through lack of imagination, placing undue reliance on statistical fables, and blind use of "standard" analysis methods beyond their realm of applicability. Alanna Connors contributed mightily to the ideas that continue to empower modern astronomical research.

# Spotlight

## Statistical Challenges in Modern Astronomy VIII

by Emille Ishida, Chad Schafer, Hyungsuk Tak, Ashley Villar, the co-chairs of the SCMA VIII SOC

The Statistical Challenges in Modern Astronomy (SCMA) Conference Series was started by Professors Jogesh Babu and Eric Feigelson in the early 1990s. The SCMA conferences bring together researchers in astronomy, statistics, and related fields to tackle pressing challenges in astronomy. The upcoming SCMA VIII conference (June 12-16, 2023) is truly cross-disciplinary: presentations by invited scholars from the astronomical and data sciences are mixed in a program that pursues both scientific and methodological goals. Themes include statistical modeling of astronomical phenomena, discovering hidden astronomical signals, and enhancing the role of machine learning in gaining astrophysical insights.

Preceding the SCMA conference is the annual Summer School in Statistics for Astronomers XVIII (June 5-9, 2023), which provides an intensive program in statistical inference covering topics such as principles of probability and inference, regression and model selection, bootstrap resampling, multivariate clustering and classification, Bayesian data analysis, Markov chain Monte Carlo (MCMC), time series analysis, spatial statistics, deep learning neural networks, and machine learning with random forests.

Details about both events are available below under Astrostatistics Events.

## Pinning down uncertainty in periodicities

Randomization Inference of Periodicity in Unequally Spaced Time Series with Application to Exoplanet Detection

by Panos Toulis and Jacob Bean

https://arxiv.org/abs/2105.14222

Statistical summary: Existing methods to infer periodicity in time series data rely on theoretical assumptions of consistency and normality. In practice, these assumptions are unrealistic because the underlying model (e.g., periodogram) is usually highly irregular. We develop a randomization-based method to infer periodicity that does not rely on a "well-behaved" model or asymptotics. The method is valid in finite samples when fully implemented, while it can be asymptotically valid under an approximate implementation designed to ease computation. We validate our method in exoplanet detection using radial velocity data, showing benefits over existing statistical techniques.

Astronomical summary: In applications such as exoplanet detection, a key statistical task is to infer a hidden periodicity in the data. This is not an easy task, however, because the underlying statistical models tend to be irregular, with severe nonsmoothness and multiple modes (also known as 'aliasing'). Here, we propose a robust method to handle such irregularities for inference of periodicity. Our method leverages the theory of randomization tests, which have been gaining popularity in statistics because they are more robust than traditional statistical techniques. Last but not least, we show how our method can lead to improved observation designs by slightly randomizing the scheduling of radial velocity measurements.
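To give a flavor of randomization-based reasoning on unequally spaced data, here is a minimal sketch of a generic permutation test for a periodogram peak. It is not the authors' procedure, which is described in the linked paper. It only tests the simpler null hypothesis that the measurements are unrelated to their observation times. The simulated data, the frequency grid, and all function names are assumptions made for illustration.

```python
import numpy as np

def sinusoid_bases(t, freqs):
    """Orthonormal bases for span{1, cos, sin} at each trial frequency."""
    bases = []
    for f in freqs:
        phase = 2.0 * np.pi * f * t
        design = np.column_stack([np.ones_like(t), np.cos(phase), np.sin(phase)])
        q, _ = np.linalg.qr(design)  # reduced QR: q has shape (n_obs, 3)
        bases.append(q)
    return np.stack(bases)  # shape (n_freq, n_obs, 3)

def periodogram(bases, y):
    """Fraction of the variance of y explained by a sinusoid at each frequency."""
    yc = y - y.mean()
    coeffs = np.einsum("fnk,n->fk", bases, yc)
    return np.sum(coeffs**2, axis=1) / np.sum(yc**2)

def permutation_pvalue(t, y, freqs, n_perm=199, seed=0):
    """Permutation p-value for the highest periodogram peak."""
    rng = np.random.default_rng(seed)
    bases = sinusoid_bases(t, freqs)
    observed = periodogram(bases, y)
    # Shuffle the measurements over the fixed observation times to build
    # the null distribution of the peak height.
    null_peaks = np.array(
        [periodogram(bases, rng.permutation(y)).max() for _ in range(n_perm)]
    )
    p_value = (1 + np.sum(null_peaks >= observed.max())) / (n_perm + 1)
    return freqs[np.argmax(observed)], p_value

# Simulated, irregularly sampled series with a 7.3-day period (illustrative only).
rng = np.random.default_rng(42)
t = np.sort(rng.uniform(0.0, 50.0, size=40))
y = 2.0 * np.sin(2.0 * np.pi * t / 7.3) + rng.normal(0.0, 0.5, size=40)
freqs = np.linspace(0.02, 0.5, 200)  # cycles per day

best_freq, p_value = permutation_pvalue(t, y, freqs)
print(f"best period = {1.0 / best_freq:.2f} days, permutation p-value = {p_value:.3f}")
```

Taking the maximum over the whole frequency grid inside each permutation accounts for the search over many trial periods, so no separate multiple-testing correction is needed for the peak.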

## An Overdispersed χ² distribution for Poisson count spectral modeling

Systematic errors in the maximum likelihood regression of Poisson count data: introducing the overdispersed chi-square distribution

By Max Bonamente

https://arxiv.org/abs/2302.04011

Typically, X-ray energy spectra are analyzed by fitting an astronomical source model to a counts histogram, assuming a normal approximation with the variance of counts in a bin set to $\sigma^2_{\mathrm{STATISTICAL}} = N$. When the astronomical model is not exact, or when there are uncorrected instrumental effects, then an additional variance term $\sigma^2_{\mathrm{SYSTEMATIC}}$ is included. This paper addresses the consequences of correcting for this excess variance when the normal approximation to Poisson is not suitable.

This paper presents a new method to estimate systematic errors (e.g., model misspecification, calibration) for count data. The method is applicable in particular to X-ray data binned in energy (spectra) in situations where the Poisson log-likelihood, or the Cash goodness-of-fit statistic, indicates a poor fit that is attributable to overdispersion of the data, even in the absence of systematic trends. The proposed method treats overdispersion in Poisson data as an intrinsic model variance that can be estimated from the best-fit model, using the maximum-likelihood $C_{\min}$ statistic ($= -2 \ln L$, with $L$ the Poisson likelihood). This is in contrast to the traditional approach of adding the $\sigma^2_{\mathrm{SYSTEMATIC}}$ term to the approximate normal model assumed for binned counts.

The paper also studies the effects of such systematic errors on the likelihood-ratio statistic, ΔC, which can be used to test for the presence of a nested model component in the regression of Poisson count data. The new distribution, which is referred to as the overdispersed chi-square distribution, is proposed as the distribution of choice for the $C_{\min}$ and ΔC statistics in the presence of systematic errors in the count data. Given its simple analytical form, critical values and p-values can be immediately calculated numerically, similar to the case of the χ² distribution. The paper discusses the application of the proposed approach to testing for the presence of absorption lines in X-ray spectra, whose significance tends to be overestimated when this overdispersion is ignored.
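For orientation, here is a minimal sketch of the two fit statistics contrasted above: the Cash statistic in its common form, which equals −2·ln(Poisson likelihood) up to a term that depends only on the data, and the normal-approximation χ² with an added systematic variance. It does not implement the paper's overdispersed chi-square distribution. The counts, the constant model, and the value of `sigma_sys` are made-up illustrative inputs.

```python
import numpy as np

def cash_statistic(counts, model):
    """Cash statistic: 2 * sum(m - n + n * ln(n / m)) over bins."""
    counts = np.asarray(counts, dtype=float)
    model = np.asarray(model, dtype=float)
    # The n * ln(n / m) term is taken to be zero in bins with zero counts.
    log_term = np.zeros_like(counts)
    positive = counts > 0
    log_term[positive] = counts[positive] * np.log(counts[positive] / model[positive])
    return 2.0 * np.sum(model - counts + log_term)

def gaussian_chi2(counts, model, sigma_sys=0.0):
    """Normal-approximation chi-square with per-bin variance N + sigma_sys**2."""
    counts = np.asarray(counts, dtype=float)
    model = np.asarray(model, dtype=float)
    # Statistical variance N plus systematic variance; bins with N = 0 would
    # need special handling when sigma_sys is zero.
    variance = counts + sigma_sys**2
    return np.sum((counts - model) ** 2 / variance)

# Hypothetical binned counts and a constant best-fit model (illustrative only).
counts = np.array([12, 9, 15, 7, 11])
model = np.full(5, 10.0)

c_stat = cash_statistic(counts, model)
chi2_stat_only = gaussian_chi2(counts, model)
chi2_with_sys = gaussian_chi2(counts, model, sigma_sys=2.0)
print(f"Cash = {c_stat:.3f}, chi2 = {chi2_stat_only:.3f}, chi2 with systematic = {chi2_with_sys:.3f}")
```

Adding a systematic variance always lowers the χ² value for the same residuals, which is why the choice of reference distribution matters when judging goodness of fit.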

# Other News

## Job Opportunities in Astrostatistics

TITAN – Frugal AI and Application in Astrophysics Postdocs (2 positions)

Crete, Greece

Details: An ERA Chair HORIZON EUROPE grant funded by the EU; 2 years with possibility of extension; Salary 45.000-50.000 €/year (gross income depending on family status)

The two positions are to work in cosmology and/or machine learning in the context of cosmology. Subjects of interest are optical and radio weak-lensing, high-order statistics, EoR, inverse problems, deep learning, component separation, data on manifolds, high-dimensional and big data processing, etc.

Interested candidates are invited to contact J.-L. Starck (jstarck@cea.fr) and P. Tsakalides (tsakalid@ics.forth.gr), sending a cover letter and their CV by May 31, 2023. Applications can be sent well before the deadline, as the positions could be filled earlier if an ideal candidate is found.

Deadlines: Applications are due May 31, 2023; selection continues until the positions are filled.

A list of job opportunities will be maintained at our website, astrostatisticsnews.com/job-opportunities.

## Astrostatistics Events

Summer School in Statistics for Astronomers XVIII

June 5-9, 2023

Center for Astrostatistics, Pennsylvania State University

Details: https://sites.psu.edu/astrostatistics/su23/

Deadline: May 5, 2023

Statistical Challenges in Modern Astronomy VIII

June 12-16, 2023

Center for Astrostatistics, Pennsylvania State University

Details: http://scma8.org

Deadlines: Abstract submission deadline was Feb 1, 2023; final program will be announced Mar 15, 2023

Summer School in AstroStatistics in Crete

June 19-23, 2023

Department of Physics, University of Crete, Heraklion

Details: http://astro.physics.uoc.gr/Conferences/Astrostatistics_School_Crete_2023/

Deadline: Mar 24, 2023

Note: the deadline has passed, but this event may be of interest in future years.

Statistics for Astronomy

Session 312 at the 64th ISI World Statistics Congress in Ottawa 2023

July 19, 2023

Ottawa, Canada

Details: https://www.isi2023.org/

Session Details: https://www.isi2023.org/conferences/session/312/details/

Early registration deadline: April 17, 2023

Six Astrostatistics Sessions at the ASA JSM2023

True North Strong and…Amazing at Astrostatistics! (D. Stenning)

Astronomers Speak Statistics (J.G. Babu)

Modeling techniques for astrostatistical datasets (J. Williams)

Pulling Signal out of Noise for Data-Driven Discoveries in Astronomy (H. Tak)

Uncertainty Quantification in Astronomy (A. Siemiginowska)

Finalists of the Best Student Papers in Astrostatistics Competition (H. Tak)

August 5-10, 2023

Toronto, Canada

Conference Details: https://ww2.amstat.org/meetings/jsm/2023/

The program will be announced in May 2023.

Astromatic 2023

August 6-12, 2023

University of Montreal

Details: https://www.astro.umontreal.ca/astromatic/2023/

Summary from website: Astromatic 2023 is a week-long program to bring together a group of 15 outstanding undergraduate students interested in artificial intelligence, machine learning, and astrophysics from around the world in the vibrant city of Montreal. The program consists of lectures on these topics given by experts, followed by a hackathon and an exciting competition that fosters teamwork and creativity to develop a powerful project at the intersection of astrophysics and machine learning.

Deadline: April 15, 2023

A list of events will be maintained at our website, astrostatisticsnews.com/upcoming-events.

## Content suggestions

If you have ideas for AN content, please send a message to astrostatisticsnews@gmail.com. We may include your idea in a future issue if we think it is a good fit.

Ideas may include relevant astrostatistics papers/data/code, visualizations, upcoming events, job postings, format or commentary suggestions, etc.

## Astrostatistics News website

See astrostatisticsnews.com for more information, such as past issues and lists of astrostatistics references and societies.

## Subscribe to Astrostatistics News

To subscribe to Astrostatistics News, go to https://groups.google.com/g/astrostatistics-news and select the “Join group” button. You will need to be logged into your Google account to join the group.

Please forward this information to anyone who may be interested!