Driving innovation in radio astronomy — Thoughtworks Technology Podcast
[MUSIC PLAYING] Hello, everybody. My name is Rebecca Parsons. I'm one of the co-hosts for the ThoughtWorks technology podcast. And I'm joined by my colleague Prem. Hello, everyone. I'm very glad to be here. Looking forward to a great conversation today. And we are joined by two guests.
First, I have my other colleague at ThoughtWorks. Justin, would you please introduce yourself? Hello, everyone. I'm Justin, and I work as a lead developer with E4R. And joining us from the project itself, Neeraj, would you like to introduce yourself, please? Hi, everyone. I am Neeraj Gupta. I am an astronomer,
and I work at the Inter-University Centre for Astronomy and Astrophysics (IUCAA), a research institute based in India. So let's start with introducing the problem-- what is it that you're trying to solve, and how have you approached it? And let's start with the science problem we're trying to solve, because the principle of E4R really is to remind people that we-- we as ThoughtWorks-- are trying to bring our software development capabilities and our technology capabilities to assist scientists like Neeraj in solving science problems. So, Neeraj, can you tell us a little bit about the problem we're trying to solve? Yeah, sure. So the central problem we are trying to address is to understand how galaxies form and evolve. Now what are
galaxies? When we think about them, galaxies are ensembles of stars, but they also contain a lot of gas. And these stars form from this gas. So one thing we try to understand about galaxies is how they can acquire this gas from outside and then convert it into stars. But this is not so simple, because at the center of these galaxies there is also usually a massive black hole sitting, and this black hole at times can emit huge amounts of energy. And this energy can be so large, so huge, that it can actually outshine all the stars in the galaxy put together. And it can also eject a lot of material into the galaxy.
And all this feedback coming from the black hole can actually disrupt the process through which galaxies convert this gas, which they have acquired painstakingly, into stars. So understanding galaxy evolution-- that is, their formation and how they evolve over a period of time-- is essentially to understand this interplay between gas, stars, and the feedback coming from the central black hole. And this is a very complex problem, as you can imagine.
And there are several long-standing questions related to this. So what we are doing now is using the most sensitive radio telescope of its kind, the MeerKAT telescope in South Africa, and obtaining a lot of data on millions of galaxies in the sky. Essentially, we are taking 1,600 hours of observations. This will lead to 1.6 petabytes of data. And all this is being gathered over a period of about three years, and that's what we have to process to try to understand how these galaxies form and evolve.
Excellent. Justin, can you talk us through the problem from the technology perspective? What have we been doing to try to support this work? OK. So from a technology perspective, the problem is twofold. One is the data-- the amount of data we would be looking at, as Neeraj mentioned earlier. MeerKAT is a state-of-the-art telescope, and it is going to generate a lot of data. And at the point when the collaboration started, we had no benchmark for how to process this data.
This would be a large volume of data. So from a data perspective, this was a very big problem to handle. And that was the first unknown: what happens when you get a large volume of data, on the scale of a petabyte-- how do you process it? The second one was mostly the domain aspect itself. How do you build a robust pipeline to support science when, speaking from my perspective as a technologist, astrophysics is a domain I'm not so comfortable with? So how do we design software where we understand the domain, while also keeping in mind the unknowns brought in by both the data and the domain itself-- the volume of the data, and the domain.
So that was the challenge from the technology perspective: delivering a robust system which would enable the science to progress further, given the unknown that we were handling such a scenario for the first time. And how did we get involved? How did this collaboration start? What was the impetus for this collaboration? So I can give my perspective on this, and then Justin can add to it. When we started thinking about executing this project in 2015 and '16-- that's quite a while ago-- just from simple back-of-the-envelope calculations, it was clear to us that if we wanted to process this 1.6 petabytes of data using the traditional methods of those times-- where an astronomer takes data from the telescope, loads it onto their high-end workstation, and then looks at it in their own time and processes it-- it would take close to 20 to 30 years. So it was obvious that we could not use just those traditional methods, which had worked well for decades, on a project of this scale. It was obvious to us that we needed to bring the best software engineering practices on board to solve this problem.
And then the second aspect was also clear to us-- and that's coming from the complexity of the data itself-- that we cannot use off-the-shelf tools, even in the software domain. So we need to work with the best in the field in every domain. And that's where we started discussing with ThoughtWorks, and we agreed that, OK, we will start working on this.
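As a rough illustration of the kind of back-of-the-envelope estimate Neeraj describes, here is one way to turn the figures mentioned in the conversation into a throughput requirement. The numbers are only indicative:

```python
# Rough back-of-the-envelope sketch of the processing-rate requirement,
# using only the figures mentioned in the conversation (illustrative).
total_bytes = 1.6e15                 # 1.6 petabytes (decimal units)
window_s = 3.5 * 365 * 24 * 3600     # ~3.5 years to keep pace with observing
rate = total_bytes / window_s        # required sustained throughput
print(f"{rate / 1e6:.0f} MB/s sustained")  # ≈ 14 MB/s, around the clock, for years
```

Sustaining even this modest-looking average rate end-to-end-- ingest, calibration, imaging, archiving-- for years is what rules out the workstation-at-a-time approach.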
So can you elaborate a little bit on what kind of system we built? So the entire collaboration was done in a phase-wise manner. The first phase was practically prototyping it-- understanding whether this could even be done. So prototype it, get a proof of concept out, and that then becomes your base benchmark to start with. That was the initial phase.
And with that prototype being a success-- knowing that yes, something like this can be built, the only unknown being how the data volume is going to affect it, but from a science perspective the proof of concept is done-- we entered phase II, where we actually delved into building the concrete system which would help us progress through this journey: where we can start identifying what to observe, when to observe, and what to do once the data comes in, while keeping checkpoints so that if something fails, we know how to fall back and can make sure the processing goes smoothly. That was the second phase. And then came the third phase: now that we have the data, we have processed it, and we have some sort of images generated out of it, how do we make it publicly accessible? Move ahead with the science. The science is there to be consumed by people at the end, so how do we make it available to people at large? That's the overall picture of how the collaboration was structured.
So can you tell us a little bit about what the flow of information is, a bit more specifically? So far we know there's lots of data, and that's a very general problem. But can you tell us a little bit more? We've got the observations-- what kinds of things are you looking for in the data, and what are the properties you're trying to maintain when processing it, other than the fact that yes, there's a lot of it and we need to process it efficiently? So to understand this, we need to take a step back and look at the complexity of the data. As I said, we have got this large amount of data, 1.6 petabytes, and what we are trying to do is look at the sky. And it's a radio telescope that we are using, which means we want to look at the sky at radio wavelengths. Now, the act of seeing, which we are so used to with our eyes when we look up and try to look at the stars or the moon, all happens through this lens sitting in our eye, and we take it for granted. Mathematically, if we look at it,
it's actually doing a process which we call a Fourier transform. It's actually a Fourier transform that is sitting in our eye. But this lens, which is so readily available to us at optical wavelengths through nature, does not function the same way at radio wavelengths. So what we do at radio wavelengths is build a telescope which consists of a large number of dishes, or antennas. In the case of MeerKAT, it is 64 antennas spread over an area of 8 kilometers. And what we do is take voltages from each of these antennas and combine them in pairs. For example,
if there are three antennas, we will combine the signals from 1 and 2, 2 and 3, and 3 and 1, and we do the same for all 64. And then a data stream is flowing to us every few seconds. And it is coming as a function of frequency-- that is, 32 sampled frequencies coming to us.
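The pairing Neeraj describes scales quadratically with the number of antennas, which is one reason the data volume grows so quickly. A quick sketch (nothing here is MeerKAT-specific beyond the antenna count):

```python
from itertools import combinations

# Every unordered pair of antennas forms one "baseline" whose combined
# signal is recorded every few seconds, at each sampled frequency.
n_antennas = 64
baselines = list(combinations(range(1, n_antennas + 1), 2))

print(len(baselines))  # n*(n-1)/2 = 2016 pairs for 64 antennas
```

So three antennas give 3 pairs, as in the example, but 64 antennas give 2,016-- each one a separate data stream in time and frequency.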
So that's the complex data that we have. And what we do with this data is take it and do a Fourier transform of it, which is equivalent to the process of actually making a lens at radio wavelengths. Once we make this lens through our computers and electronic data processing, it is equivalent to producing an image, very similar to what we would see.
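The lens analogy can be sketched with a toy two-dimensional Fourier inversion. This is a deliberately simplified stand-in for real interferometric imaging-- no calibration, gridding, or deconvolution-- just to show that an incompletely sampled Fourier plane still yields a recognizable image:

```python
import numpy as np

np.random.seed(0)

# A toy sky with a single point source.
n = 256
sky = np.zeros((n, n))
sky[100, 130] = 1.0

# An interferometer samples the sky's 2D Fourier transform ("visibilities"),
# but only at points covered by its antenna pairs -- modeled here as a
# random mask over ~30% of the Fourier plane.
vis = np.fft.fft2(sky)
mask = np.random.rand(n, n) < 0.3

# Inverting the sampled visibilities gives the "dirty image": the source
# plus artifacts from the incomplete sampling.
dirty = np.fft.ifft2(vis * mask).real
peak = np.unravel_index(np.argmax(dirty), dirty.shape)
print(peak)  # the source is recovered at (100, 130)
```

The computed Fourier inversion plays the role of the missing lens: the source survives the incomplete sampling, at the cost of artifacts that the real pipeline must clean up.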
Now, at the processing level, the complexity comes from two levels. One is the large volume of the data itself, which needs to pass through our systems to be processed reliably and efficiently. The data is organized in three dimensions-- frequency, time, and antenna separations-- such that the different data processing steps cannot all be partitioned in the same way. So we cannot say, OK, I am going to take this one data partitioning strategy and apply it to all these different steps of data processing, and this will give me the image out of it. We need a system which can work through these different stages with different partitioning or solving strategies. That is one level of complexity that this pipeline has to tackle.
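One hypothetical way to picture the partitioning problem: the same (time, frequency, baseline) cube has to be split along different axes for different steps, so no single chunking scheme serves the whole pipeline. The shapes and axis choices below are illustrative, not the real data layout:

```python
import numpy as np

# Illustrative cube: (time samples, frequency channels, baselines).
rng = np.random.default_rng(1)
cube = rng.normal(size=(120, 64, 2016))

# A calibration-like step might solve per time interval: split along time.
time_chunks = np.array_split(cube, 4, axis=0)

# A per-channel imaging step needs all times together: split along frequency.
freq_chunks = np.array_split(cube, 8, axis=1)

print(time_chunks[0].shape, freq_chunks[0].shape)  # (30, 64, 2016) (120, 8, 2016)
```

At petabyte scale, moving between these two layouts is itself expensive, which is why the pipeline has to be designed around stage-specific partitioning rather than one global scheme.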
And the second is related to the nature of the project: building this pipeline is going to take time, because it's complex. And the requirements of this pipeline should come from the telescope's performance, because it's supposed to process the data from this telescope. But the telescope is not yet built. And we cannot wait till the telescope is built and we know its properties completely. So we have to start building the system well before the telescope is in place and we understand it completely. Justin was saying that we spent a lot of time in prototyping. So we built a system which can cater to a large number of data processing or solving scenarios, and we tested it against a variety of telescopes that were available at that time. And we left those options open so that
as soon as the telescope comes online and we process the real data through it, we can quickly make those choices, optimize our pipeline, and start processing this data, so that we are ready for it when it comes. Because we have to remember that the 1.6 petabytes of data is actually coming over a period of three years, and our system has to be efficient enough that we can process that 1.6 petabytes in about three to four years.
We cannot take five years or 10 years for that. So to be able to meet that, the system has to be prototyped and ready well before we actually start our processing. So, again, trying to understand here:
even before the telescope actually started transmitting real data, the prototype created some synthetic data to simulate what the actual telescope would send you, and you saw the results of that, so that when the telescope did become ready, you were ready to go as far as being able to process it correctly. Is that a fair representation of how we solved the problem? So we used simulated data. That's correct. But we also used real data from the best telescopes of that time, because we also wanted to ensure that our pipeline was responsive to real-world scenarios. Because, with simulations, there is a limitation: you can get out only the result that you have put into the simulation.
You cannot be 100% sure that you are actually testing against a real scenario. And I understand that even with this initial prototype, we actually made a real scientific discovery. So can you briefly tell us about that discovery, and then, Justin, I'll ask you a little bit about how that came about. Yeah. So we were describing the prototyping stage, and I mentioned that we also tested against real data from the best telescopes of that time. So we were looking at these galaxies, which are fairly distant from us, and we ended up detecting the traces of hydroxyl molecules in a galaxy. And
when I was describing this central problem of understanding how galaxies form and evolve, in that problem space the cold phase of gas-- cold atomic and molecular gas, as cold as 10 or 20 kelvin-- actually occupies a very special place. It's central to understanding how gas can be converted into stars. And this hydroxyl molecule is actually a very nice tracer of it. And very few of these have been detected in the past. So through the prototyping phase itself, we were able to detect one such case, which was very exciting. We got it published in a very prestigious journal-- that's one aspect of it-- but at the same time, it also gave us confidence that when we do this large-scale survey, we will surely be able to detect many more of these in the sky.
So, Justin, that must have been pretty exciting, to have the work that you did result in a scientific discovery. How did you approach this from the perspective of a technologist, in terms of setting up this pipeline in a way to assist Neeraj in his research? So there is a funny story behind that discovery. The team back then, while they were working on this pipeline, were like, yes, let's test it out.
We were in the testing phase of the pipeline-- let's do it, we have the data. And while testing, they were like, we have this plot generated from the pipeline, and it might be an anomaly. It might be because the pipeline is not configured properly. And I think that is the point when they got in touch with Neeraj: maybe you can just validate whether the pipeline is running properly. And voila, we had a discovery during the testing phase of the pipeline itself. So that is the story of how the discovery came into being. And
I remember a comment from back then-- this is the phase when I was just joining the project. The developer back then told me that if it had been done via a basic script, this might have been missed, because the script would not have been robust enough to capture it the way the pipeline was designed and able to capture it. So from a confidence point of view, it told us that even though the pipeline was still in a phase where we did not know exactly whether it would work properly or not, we were at a state where we could confidently say, yes, we can proceed. We had the confidence that it would perform properly.
That was in the prototyping-- the output of the prototype. Now, keeping that as the baseline of development, we developed four parallel pipelines, building on top of it. So we have the same base systems, and we change the subsystems so that they do different operations in between.
So in total, we built four data processing pipelines-- one for each of the different stages in the data preparation itself. And what the researcher at the end-- what Dr. Neeraj-- could do was chain these pipelines together to achieve the science which he intended to do with it.
So yeah, that was the way in which the overall development was followed through. Very exciting. It makes me curious-- what kind of technologies did you use to build this kind of system? So to understand that, we need to look at the underlying astronomical tool we were using. That tool was CASA, and the APIs provided by CASA were in Python. So we used Python to write modules which would then chain together the individual aspects of the data processing into a pipeline.
And the overall design was configurable in a way that you could switch off certain aspects of the pipeline if you did not want to run them. So the entire system was robust enough that I could choose which phase of the data processing I wanted to run and which not. Say, for example, I'm rerunning a certain data cleaning activity: if I know certain cleaning activities have been done before, I would not want to rerun them.
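A minimal sketch of the kind of config-driven stage selection Justin describes. The stage names and functions here are made up for illustration; the real pipeline stages wrap CASA tasks:

```python
# Hypothetical stage functions standing in for real CASA-backed processing steps.
def flag_data(data):  return data + ["flagged"]
def calibrate(data):  return data + ["calibrated"]
def make_image(data): return data + ["imaged"]

STAGES = [("flag", flag_data), ("calibrate", calibrate), ("image", make_image)]

def run_pipeline(data, config):
    """Run only the stages the config enables, in order."""
    for name, stage in STAGES:
        if config.get(name, False):
            data = stage(data)
    return data

# A rerun that skips flagging because it was already done in an earlier pass.
result = run_pipeline(["raw"], {"flag": False, "calibrate": True, "image": True})
print(result)  # ['raw', 'calibrated', 'imaged']
```

The point of the design is that a rerun never repeats completed work: the config, not the code, decides which stages execute.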
So my configuration would allow me to choose very specific parts of that process. And this was one pipeline; similarly, we had other such Python applications-- I would call them applications in themselves. And so, Neeraj, would you say these modules and this configuration are something that made sense to you from the perspective of the analysis that's familiar to you as a scientist? Oh yes, because we worked very closely and collaboratively over all these years.
So all the configurations that go into the fine-tuning of this, and all the outputs that come out of it-- we designed and tested them together. That's one crucial aspect of our project. And in addition to what Justin has described, we also had a few additional requirements. One is that, because we are dealing with these large volumes of data, one data set-- one hour of data that we get from the telescope-- can take a week of processing on our cluster to make certain types of images.
So the pipeline has to couple very nicely with the set of processes that run on the high-performance computing system we set up at IUCAA. Then, our research team is geographically distributed, so we also needed a system which our team can access seamlessly, without any hindrance from where they are physically located. That was another major requirement. And then, since the data volume is so large at all stages: when the data comes from South Africa to India, to our institute, we load it using tapes onto the storage associated with our cluster. From that stage, we have to process it, and then we have to archive it.
And then, when we get these different data products-- which are images of galaxies, or their spectra, which is their brightness as a function of frequency-- we are talking about millions of objects and millions of spectra that we have to process and make available to our team. So since everything is so complex, with large volumes and large numbers, we have to have a system which can deal with all the stages seamlessly. It's not enough to just process the data and stop, and someone else will take care of the products. We have to have a system where, the moment the data is there, it understands that, OK, this is the data I need to process; then it processes it, and it should be able to tell us whether it processed successfully or not.
If not, then at what stage it may have failed, so that someone can address the issue by looking at the various diagnostics produced by the pipeline. It has to be very efficient, because otherwise a scientist has to look at the data while it is processing-- and the total processing time for this data is three years, while we are observing. This means that during those three years, we can either process the data or do the science. So we have to have a system where the scientist engages with the processing, prototypes it, configures the pipeline, and once they fire off the process, they can actually think about the science.
So I think this is what we achieved through the design that Justin described: a scientist comes in and spends time configuring it, but once they have done that, they can trigger the processing, forget about it, and start thinking about the science. And that's why we have been able to make these discoveries while the data is still coming in. And I understand that, in addition to the discovery from the prototype stage, there have been additional discoveries as you've been processing the data. So can you tell me a little bit about that? Yeah. So this again highlights the point that whenever we look at the sky with a new telescope, which has capabilities that did not exist before, every time we point it at the sky we can be surprised-- provided the data has been processed properly. Through MeerKAT, we are getting data which covers a very large range in frequency. And when we were observing this particular galaxy in the sky-- in this particular situation, what happens is that this galaxy contains a lot of the cold atomic and molecular gas which I was talking about.
So naturally, it has a lot of stars in it as well. But when stars form, they also emit a lot of radiation, and this radiation can then ionize the gas, or destroy the cold gas from which they actually formed. It looks counterintuitive, but this is what happens in nature. And when this gas gets ionized, it emits a different kind of radiation, which we call recombination lines. We call them recombination lines because the electrons which have been ejected from the atoms or molecules are combining back-- that's why it is called recombination.
And from physics, we expect that this should produce certain signatures at radio wavelengths, but these signatures are so weak that they have not been reliably detected in the past. So in this particular case, what happened is that we got this nice, beautiful spectrum of this object. We knew that it should contain these signatures, because that's what we expect from our understanding of basic physics-- and that can never be wrong, because physics is robust. But those signatures do not occur at just one frequency. In this spectrum, which has more than 64,000 pixels, they occur at maybe 30 to 50 different locations.
So what we did was identify those locations based on our expectations from basic physics, and then average and combine them in frequency space. And when we did this, the signal really came out very significantly. And this discovery is very important from two aspects.
One is that it validates the basic physics that we understand, which we expect to work even in these distant galaxies. And also, this detection implies that we should be able to detect many more such systems with MeerKAT, with our survey, and also with future telescopes such as the Square Kilometre Array, which will be even more sensitive. So this has actually opened up a new field-- one which we knew should exist, but which was not really becoming accessible. It's one significant step in making this field accessible and open to the community.
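The averaging-in-frequency-space step Neeraj describes is often called spectral stacking. A toy sketch with made-up amplitudes and line positions, showing how lines individually buried in the noise emerge once the predicted locations are averaged:

```python
import numpy as np

rng = np.random.default_rng(7)

# A noisy spectrum (sigma = 1) with 50 weak lines, each well below the
# per-channel noise, at known (predicted) positions.
n_pix, n_lines = 64000, 50
spectrum = rng.normal(0.0, 1.0, n_pix)
centres = rng.choice(np.arange(100, n_pix - 100), n_lines, replace=False)
spectrum[centres] += 0.8          # individually undetectable: S/N < 1

# Stack: cut a window around each predicted centre and average in frequency space.
windows = np.stack([spectrum[c - 50 : c + 51] for c in centres])
stacked = windows.mean(axis=0)    # noise drops by ~sqrt(n_lines)

snr = stacked[50] / stacked[:30].std()
print(f"stacked signal-to-noise ~ {snr:.1f}")
```

Averaging N windows suppresses the noise by roughly the square root of N while the aligned signal adds coherently, which is why the physics-predicted line positions are essential: without them, there is nothing to align.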
This discovery is really, really exciting. Can you tell us about the MeerKAT Absorption Line Survey, please? So the MeerKAT Absorption Line Survey is the name of this large survey project that we are carrying out with the MeerKAT telescope, which I mentioned in the beginning. We are observing, for approximately 1,600 hours, tons of galaxies in the sky, to understand how galaxies in general form and evolve. The idea is that we are observing approximately 400 different patches in the sky, selected with a certain emphasis: these are the best locations to understand the formation of galaxies and, especially, to detect the cold atomic and molecular gas phases, because that's what we are actually trying to look for in these galaxies.
And up to this point, we have acquired close to 1 petabyte of data, and we have processed more than 700 terabytes of it through the pipeline. From this first batch of data processing, we have now identified half a million objects in the sky. Most of these are actually supermassive black holes, and many of them have been detected for the first time. That essentially forms the first data release of our project. By data release, we mean that we have organized this data in a form that not only our team but also the astronomy community at large can use for various science objectives in a very efficient manner. And this is a very significant milestone, for several reasons.
The foremost, which is relevant for our project, is that with this release we have actually ticked all the boxes-- all the things that our project should do, starting from carrying out observations all the way to getting these images of the sky from which science can be done. And it includes processing, archiving, and everything-- all the stages of the pipeline that Justin described. So that's one very important aspect. This gives us a lot of confidence now that we can process the rest of the data.
It's just a matter of time. And on the basis of it, we will be able to tell whatever is there in the sky, because now we know that we can do it, and we can do it properly. So that's one very important aspect. And the second important aspect is that we have also been able to make this publicly available to the astronomy community at large.
And that's important because this data is very rich and very complex, and it contains a lot of information. Our survey team has certain objectives. We will do science based on those objectives, and we will put that out in the public domain as well. But there's a lot of other science that can be done with this data which we cannot do, for various reasons-- one being that we know it can be done, it must be done, but it's just too much. We are a small team with finite resources, so we cannot do everything.
So the community will do it. The second thing is that there are types of things that can be done with this data which our team cannot do-- we don't have the expertise. So putting this out in the public domain enables all those possibilities. And of course, the third one is that none of us knows at this point what this data can really do. After two or three years, maybe someone else with a new perspective or better abilities will come along, look at this data, and do those new things.
So this is how projects of this scale need to be executed: we not only enable what we want to do, but we also share everything that has been done publicly, so that it can be improved upon and much more can be done with it. So that tells us a little bit about where this is going from the science perspective. Justin, where else can this pipeline go from a technology perspective? Are there things that you're looking to do, or different applications that you might want to take into account? What's the future for the tech? We know the future for this is the science, which is already very exciting, but what's the future for the tech? So to answer that, I'll sidestep a bit, along the lines of cognitive burden. For a researcher, the primary goal is the science; data processing is a process through which they reach the science. Now, if a researcher requires multiple tools, switching between these tools causes a cognitive overload for them.
It's very easy to lose context of where they were, or lose track of what they were thinking. A very simple case would be: I start with a tool thinking about something, and by the time I'm done configuring it, the thought chain itself is broken. So that cognitive burden is something we can reduce by following the approach we took for the entire MALS survey with ARTIP-- the ARTIP pipeline along with the environment we provide around it. The idea is very akin to a science platform itself. We take the learnings from ARTIP-- we know how powerful the idea is-- of having a unified platform where all the tools required by the researcher are available, and the data flow is transparent to them.
So in the case of ARTIP, the researcher did not have to worry about what happens after one phase of data processing: where does the data go from there? What does the second phase do with it? They just had to prepare a config. Yes, they had to put thought into what the config would look like. But once the config was done, they did not have to worry about the management of data in depth. They knew the next pipeline, or the next toolchain, would take care of that data-- it knows where the data resides and where it needs to go.
So, taking that idea one step further: can we take inspiration from this to build science platforms at large scales-- with the emphasis very much on the large scale, where the data volume is large and the processing timelines are, let's say, weeks or months? Such a system would enable the scientist or researcher to focus very specifically on the outcomes of the research, rather than get into the nitty-gritty of how individual systems interact, or what happens if, let's say, an API call fails-- just as an example. The system is built in such a way that it knows what to do, or how to manage such a failure. What happens if processing fails? What kind of message should the researcher get? Is the researcher getting overly verbose errors, or just what they need to know? So building such massive science platforms is one major learning we take from this. And I think that idea--
I mean, there are other domains-- the pharma domain, for example-- where such pipelines can now help. We start with a hypothesis, put that hypothesis to the test in a robust pipeline which takes in data, goes through multiple processes, and produces an output, and the researcher is involved just in the initial phase of designing what the pipeline looks like, and then it is repeatable. Secondly, such a science platform allows for reproducibility of results.
Let's just take the first example, of the initial discovery. We thought it was an anomaly, but because we could rerun it on the pipeline efficiently, within a limited span of time, we could reproduce those results over and over again. And we knew that it was not an anomaly-- it was a concrete result which could be used. So reproducibility also comes along.
And it also generalizes: in science, basically, you want reproducibility along with how to reproduce it. So if I have a well-established system, like a science platform, which says these are the toolchains that were chained together with these configurations, and if you run it as-is you get the result, that emphasizes reproducibility of the result. If someone takes this as a baseline, they can reach that baseline pretty easily, because they have everything required to reach it in the first place. And from there they can develop further, on both the science and the tech aspects. So if there are other tools which can be interlinked into this science platform idea, they can interlink them, because the base system is already available. That is where this idea can grow from this point. And ARTIP has been a great inspiration for thinking in that direction. It allows you to think about
what happens when you get a large volume of data, as an example. What happens if the domain is unknown? How do you interact with the researcher to understand and incorporate the domain into the technology? How do you integrate a tool? And it also opens up an arena where the collaboration between a researcher and a technologist works in a symbiotic fashion. So we make progress from the technology perspective-- where, let's say, in other projects in E4R, we are developing hardware which can process a large volume of data--
so we can now take inspiration from this. We can grow in the technology aspect while the researcher, along with us, grows in the science aspect. So it's a symbiotic relationship from this point on. Excellent. Well,
I'd like to thank you all once again. Another fascinating set of discoveries coming out of our Engineering for Research group, E4R. So I would like to thank you, Justin, and thank you, Neeraj, for joining us and explaining to us the joys of star formation and cold gases, and how you're making all of this information available so other scientists can build on the data set we've made available to them. So thank you very much.
Pleasure. See you, folks. [MUSIC PLAYING]