SKA: The biggest big data project in the universe
Chris Middleton explains why the SKA radio telescope will redefine the concept of big data.
Fifty-six years ago, cosmonaut Yuri Gagarin became the first man in space. Since then, we’ve gathered more data about the universe than in the rest of human history combined, via technology in space and on the ground.
Cosmological data is big data; the biggest there is. And one organisation knows more about planning for big data, and about processing it when it arrives, than any other enterprise on the planet. Indeed, we may need to coin a more appropriate phrase for what it gathers: supermassive data.
What is the SKA Project?
The Square Kilometre Array (SKA) Project is the biggest science project on or off Earth: the construction, over the next two decades, of a series of giant radio telescope arrays in remote parts of Australia and southern Africa, which together will act as a single dish with a collecting area more than 200 times that of the Lovell Telescope at Jodrell Bank. This UK-centred international programme – headquartered at Jodrell Bank – is designed to understand gravity and magnetism on a universal scale, along with more traditional astronomy topics, such as black holes, the origins and evolution of the universe, and the nature of dark matter and dark energy.
Professor Philip Diamond is Director General of the SKA Organisation. He says: “In many ways, you can think of the SKA as a time machine, as we’ll be able to look back in time and make movies of the evolving universe. We’ve recently published our science case. It comes in two volumes, totalling 2,000 pages, and when dropped on a minister’s desk the nine kilograms make a resounding thump – which is the main aim of the printed copy.
“I haven’t mentioned the Search for Extraterrestrial Intelligence, but we will be the ultimate SETI machine, too. It’s not one of our main aims, but if we do detect that little signal then I think that would address some of the funding issues we might have.”
Big data, big investment
The UK has committed £200 million to the SKA to date, and the Australian government A$300 million, but over the next few years the project will need billions of dollars of investment, the case for which the organisation is building today. Currently, it is a not-for-profit UK company, but it will eventually become a treaty organisation and intergovernmental project, similar to CERN.
But what’s the idea behind the array itself?
The cosmos is vast: a notional beam of light travelling from Earth at 671 million miles an hour would take 46.5 billion years to reach the edge of the currently observable universe. Using the most common element, neutral hydrogen, as a tracer, the SKA will be able to follow the evolution of the universe all the way back to the cosmic dawn. But over billions of years the wavelength of those ancient hydrogen signatures is stretched by the expansion of the universe, an effect akin to the Doppler shift, until it falls into the same range as the radiation emitted by mobile phones, FM radio, and digital TV. This is why the SKA arrays are being built in remote, sparsely populated regions, says Diamond.
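The stretching described above follows a simple rule: a signal emitted at a given frequency arrives at that frequency divided by (1 + z), where z is the redshift. The short sketch below applies that rule to the neutral-hydrogen line, whose rest frequency is about 1,420.4 MHz (the "21 cm line"). The function names and the sample redshifts are illustrative assumptions, not SKA design figures.

```python
# Rest-frame frequency of the neutral-hydrogen (21 cm) line, in MHz.
HI_REST_FREQ_MHZ = 1420.40575


def observed_frequency_mhz(redshift: float) -> float:
    """Frequency at which the hydrogen line arrives after being stretched by redshift z."""
    if redshift < 0:
        raise ValueError("redshift must be non-negative")
    return HI_REST_FREQ_MHZ / (1.0 + redshift)


def redshift_for_frequency(observed_mhz: float) -> float:
    """Inverse relation: the redshift that moves the line to a given observed frequency."""
    if observed_mhz <= 0:
        raise ValueError("frequency must be positive")
    return HI_REST_FREQ_MHZ / observed_mhz - 1.0


# Sample redshifts chosen purely for illustration (assumed values).
for z in (0, 1, 6, 13):
    print(f"z = {z:>2}: {observed_frequency_mhz(z):7.1f} MHz")
```

At a redshift of 13, for example, the line arrives at roughly 101 MHz, inside the 88 to 108 MHz band used by FM radio, which is why terrestrial broadcasts are such a problem for this kind of observation.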
“The aim is to get away from people. It’s not because we’re antisocial – although some of my colleagues are a little! – but because we need to get away from interference, which is like shining a torch in the business end of an optical telescope.”
Eventually there will be two SKA radio telescopes. The first, consisting of 130,000 two-metre dipole low-frequency antennae, is being built in the Shire of Murchison, a remote region about 800km north of Perth, Australia – an area the size of the Netherlands, but with a population of fewer than 100 people. Construction kicks off in 2018. In Phase 2, says Diamond, the SKA will consist of half a million low- and mid-frequency antennae, with arrays spread right across southern Africa, from Kenya to South Africa – a multibillion-euro project on an engineering scale similar to the Large Hadron Collider.
Supermassive data processing
Which brings us to that supermassive data challenge in what will be an ICT-driven facility. Diamond says, “The antennae will generate enormous volumes of data. Even by the mid-2020s [Phase 1], we will be looking at 5,000 petabytes a day – five exabytes – of raw data. This will go into huge banks of digital signal processors, which we’re in the process of designing now, and then into high-performance computers and an archive for scientists worldwide to access.
“Our archive growth rate will be somewhere between 300 and 500 petabytes a year: that’s science-quality data coming out of the supercomputer. For the full SKA, the figures will go up by a factor of 100. But that’s in the 2030s. We’re designing now for the 2020s, but in the following decade, the data problem will be much worse.”
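Diamond's Phase 1 figures imply two things worth working out: the sustained rate at which raw data arrives, and how little of it ends up in the archive. The sketch below does that arithmetic. It assumes decimal (SI) units and round-the-clock operation for 365 days a year, neither of which is stated in the quotes, and the variable names are purely illustrative.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365          # assumption: the telescope runs every day of the year
BYTES_PER_PB = 10**15        # assumption: decimal petabytes

raw_pb_per_day = 5_000                 # "5,000 petabytes a day" of raw data
archive_pb_per_year = (300, 500)       # "between 300 and 500 petabytes a year"

# Sustained raw data rate in bytes per second.
raw_bytes_per_second = raw_pb_per_day * BYTES_PER_PB / SECONDS_PER_DAY

# Raw data produced over a full year, in petabytes.
raw_pb_per_year = raw_pb_per_day * DAYS_PER_YEAR


def retained_fraction(archive_pb: float) -> float:
    """Share of a year's raw data that survives as archived, science-quality data."""
    return archive_pb / raw_pb_per_year


print(f"Raw rate: {raw_bytes_per_second / 1e12:.1f} TB/s")
print(f"Raw data per year: {raw_pb_per_year:,} PB")
for archive_pb in archive_pb_per_year:
    print(f"{archive_pb} PB archived = {retained_fraction(archive_pb):.4%} of raw")
```

Under those assumptions the raw stream runs at close to 58 terabytes every second, and the archive keeps only a few hundredths of one per cent of it.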
“To put it in perspective, worldwide annual Google searches generate about 100 petabytes of data. Facebook is about twice that. Global business emails generate about 3,000 petabytes of data. But the raw data from SKA Mid, we estimate, will be 62 exabytes (62,000 petabytes). So at that point we will have to design equipment to handle something that’s 20 times larger than global email traffic. Total global internet traffic is one zettabyte. Ultimately, we’ll have five zettabytes within our internal systems alone.”
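Those comparisons are easier to follow once everything is in the same unit. The sketch below converts each quoted figure to petabytes and reproduces the ratios. It assumes decimal prefixes (1 exabyte = 1,000 petabytes, 1 zettabyte = 1,000,000 petabytes) and takes the figures as like-for-like, as the quote does; the dictionary keys and the helper name are illustrative only.

```python
PB_PER_EB = 1_000
PB_PER_ZB = 1_000_000

# Figures as quoted in the article, all expressed in petabytes.
volumes_pb = {
    "Google searches": 100,
    "Facebook": 2 * 100,                      # "about twice that"
    "Global business email": 3_000,
    "SKA Mid raw data": 62 * PB_PER_EB,
    "Global internet traffic": 1 * PB_PER_ZB,
    "SKA internal systems": 5 * PB_PER_ZB,
}


def ratio(larger: str, smaller: str) -> float:
    """How many times bigger one quoted volume is than another."""
    return volumes_pb[larger] / volumes_pb[smaller]


print(f"SKA Mid vs email: {ratio('SKA Mid raw data', 'Global business email'):.1f}x")
print(f"SKA internal vs internet: {ratio('SKA internal systems', 'Global internet traffic'):.0f}x")
```

The first ratio comes out at a little under 21, which matches Diamond's "20 times larger" in round numbers.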
All of this means that each stage of the SKA will need supercomputers that don’t exist yet, with a speed of approximately 300 petaflops, according to Diamond. The fastest supercomputer in the world is currently China’s Tianhe-2, which runs at 33.86 petaflops, so the SKA will need access to a computer that processes data roughly nine times quicker than the fastest machine on Earth today.
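The size of that gap can be checked with a single division, using only the two figures given above. The variable names are illustrative.

```python
required_petaflops = 300.0    # Diamond's approximate requirement
tianhe2_petaflops = 33.86     # Tianhe-2's speed as quoted in the article

# How many times faster the SKA's machine must be than Tianhe-2.
speedup = required_petaflops / tianhe2_petaflops

print(f"Required speed-up: {speedup:.1f}x")
```

The answer is just under nine: close to an order of magnitude beyond the benchmark machine.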
But none of this bothers Diamond: “The IBMs and Intels of this world tell us that this is entirely within their forecast capability. In fact, I’m pretty sure that the NSA already has something a little faster, but they won’t tell us.”
And as with all big data, the supermassive SKA data will not only be defined by its volume, variety, and velocity, but also by a fourth ‘V’: its value. “What we then have to do to these enormous volumes of raw data is detect and amplify the signals amongst the noise, digitise and line them up, correlate them and integrate them, process them, and then create sky images, which the scientists will use. The SKA will be providing science-ready data products, calibrated and quality controlled.
“Traditional radio astronomy goes through this process many times, but we will only be able to do it once. We won’t be able to store all the raw data; it’s a one-pass system. So we have to understand our systematics much better than any other facility on Earth. For us, the main principles are scalability, affordability, and maintainability, but we also have to maintain innovation. We have bright people throughout the world developing the algorithms to process this data, but we’ve got to be able to replace them too, as new thinking emerges.”
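The "correlate and integrate" step Diamond mentions is the heart of how an array pulls a faint signal out of noise: the outputs of two antennae are multiplied together and averaged over time, so that noise unique to each antenna cancels while the shared sky signal survives. What follows is a toy illustration of that principle only, not the SKA's signal chain. Real correlators work on far more complex, delay-compensated data, and every name and number here is an assumption made for the example.

```python
import math
import random


def correlate_and_integrate(signal_a, signal_b):
    """Multiply two equal-length sample streams point by point and average the products."""
    if len(signal_a) != len(signal_b):
        raise ValueError("signals must have the same length")
    if not signal_a:
        raise ValueError("signals must not be empty")
    return sum(a * b for a, b in zip(signal_a, signal_b)) / len(signal_a)


random.seed(42)           # fixed seed so the toy example is repeatable
n_samples = 20_000        # assumed integration length

# A weak sine wave stands in for the sky signal seen by both antennae.
sky = [0.5 * math.sin(2 * math.pi * k / 50) for k in range(n_samples)]

# Each antenna adds its own, much stronger, independent noise.
antenna_1 = [s + random.gauss(0, 1) for s in sky]
antenna_2 = [s + random.gauss(0, 1) for s in sky]
noise_only = [random.gauss(0, 1) for _ in range(n_samples)]

shared = correlate_and_integrate(antenna_1, antenna_2)
unrelated = correlate_and_integrate(antenna_1, noise_only)

print(f"Two antennae on the same source: {shared:.3f}")
print(f"Antenna against pure noise:      {unrelated:.3f}")
```

The first result settles near the power of the shared signal, while the second hovers close to zero. Because only these averaged products need to be kept, the raw samples can be discarded once they have been processed, which is what makes a one-pass system workable.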
A new golden age
Over the next few years, three more great observatories will help make the 21st Century the new golden age of space exploration. The Atacama Large Millimetre Array in the Chilean Andes is already in operation; nearby, the 39-metre European Extremely Large Telescope is under construction, as is the James Webb Space Telescope, due for launch in 2018 as the long-term replacement for Hubble.
That’s four supermassive data projects, while the European Space Agency’s Euclid programme will be another. Euclid aims to map the entire dark (unseen) universe, as opposed to the fingernail-sized patch of sky that’s visible to Hubble, in terrestrial terms. Euclid is scheduled to go live in 2020 and will be able to look back in time by 10 billion years.
On its own, the SKA is an awe-inspiring project, and Diamond hopes that the intergovernmental organisation to drive and develop it to its full potential will soon be in place. So let’s hope that local politics don’t derail this extraordinary international programme. Might the UK’s exit from Europe imperil this and other ‘big science’ programmes? It’s possible. In the end, much will come down to how important people consider these science programmes to be, and what their long-term, terrestrial applications might be.
The biggest amount of data ever gathered and processed, passing through the UK and managed by a UK team for the benefit of all mankind, unlocking the secrets of matter – and, potentially, antimatter? You shouldn’t need a telescope to see the economic potential.
• This article was first published by diginomica.
© Chris Middleton 2017.