News and Views

News and Views on “Machine learning classification of Gaia Data Release 2”


Author: Stephen Justham

  1. College of Astronomy and Space Sciences, University of Chinese Academy of Sciences, Beijing 100049, China

  2. National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100101, China

  3. Astronomical Institute Anton Pannekoek, University of Amsterdam, 1098 XH Amsterdam, The Netherlands

The present and near future of astronomy are heavily invested in large surveys. “Big data” is a phrase that has rapidly become so widely used as to be tiresome, but the concept behind it is very important for making the most of the information we receive from the sky. A huge amount of information about evolutionary changes in astrophysics is encoded in populations, since we can only very rarely watch things change on evolutionary timescales. (A classic example is the conclusion that massive stars have shorter lives than less massive stars, which was reached by examining the populations in which they are found.) Modern surveys can give us an amazing wealth of population information, but the data volume is too large for humans to process each individual object. Hence astronomers are turning to machine-learning techniques to process their data.

This is the context in which to see the paper by Bai, Liu & Wang (2018). They apply their previously developed machine-learning algorithm to photometric data matched with roughly 85 million entries from Gaia’s second data release (DR2). This figure of 85 million is mainly limited by the available infrared photometry. Gaia’s DR2 contains over a billion entries, from which Bai et al. could match roughly 800 million objects with optical photometry. In turn, only approximately a tenth of those also had matches with AllWISE infrared colours that the authors considered to have adequate signal-to-noise.

I am not an expert in Galactic structure, but I think it is fair to say that this work is more interesting for the techniques it develops than for the results themselves. It is an interesting step towards the type of tools that will be invaluable for future large surveys. In particular, I find it satisfying that the authors study the systematic patterns in the places where their method performs less well, and the potential reasons for those problems. This openness is impressive and very important. Even an algorithm with a 99.9% success rate, if applied to a dataset of a billion objects, would return a million errors. That is potentially very problematic for the use of future large survey data, especially if those errors introduce systematic biases (rather than random noise). Here the authors clearly admit that their algorithm only found ≈83% of the stars in their test sample, and speculate on a potential systematic reason why. Further investigation of that suggestion would be worthwhile. The authors also discuss how to combine their machine-learning classifier with other information, specifically Gaia parallax uncertainty, to produce a large but very clean sample of stars.
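The arithmetic behind the "million errors" remark is simple enough to sketch in a few lines. The snippet below is only an illustration, not anything taken from Bai et al.: the function name `expected_errors` is hypothetical, and it assumes that every object is classified independently with the same success rate, which real survey data need not satisfy.

```python
def expected_errors(n_objects: int, success_rate: float) -> float:
    """Expected number of misclassified objects.

    Assumes (for illustration only) that each object is classified
    independently and with the same probability of success.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must lie between 0 and 1")
    return n_objects * (1.0 - success_rate)


# The case quoted in the text: a 99.9% success rate on a billion objects.
errors_in_billion = expected_errors(1_000_000_000, 0.999)
print(f"Expected errors: about {errors_in_billion:,.0f}")

# A completeness of roughly 83% means roughly 17 in every 100 stars are missed.
# The sample size of 100 is a hypothetical round number, not the paper's test sample.
missed_per_hundred = expected_errors(100, 0.83)
print(f"Missed per 100 stars: about {missed_per_hundred:.0f}")
```

A count like this says nothing about whether the errors are random or systematic, which is the more important question raised above.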


  • Bai, Y., Liu, J.-F., & Wang, S. 2018, RAA, 18, 118