Keynote Leonelli – Rethinking HPS through digital studies – DS² 2021

[Video begins]

Okay, we ought to be streaming now. Wait a second as usual. Stream catching up? Yes. Okay. Good. All right. Yes, we're live. Welcome back, everybody, to day four. Time flies.

Well, at least I assume it's been flying for you. It's actually, even in spite of the effort of chairing, been flying for me too, just because I've been having so much fun getting to see all these talks. This has really been a kid-in-a-candy-store few days for me. So thank you all for being back for day four. As per usual,

we'll be starting with a keynote, our last keynote address. I'm really excited to be able to introduce to you Sabina Leonelli, who I'm sure many of you know, from the University of Exeter, where she's Professor of Philosophy and History of Science and co-director of the fantastic Egenis, the Exeter Centre for the Study of the Life Sciences. She's someone who's been doing, as I think I said in a promo thread on Twitter, just absolutely unmissable work; it's required reading on the role of big data in science and the way that data has been changing the production of scientific knowledge. So I'm really excited to think now a little bit about how we can rethink HPS through the lens of digital studies of science. So please, take it away.

Thanks so much. Thank you very much, Charles. It's really a great pleasure to be here; this is a fantastically exciting event. And it's really

wonderful to see these different communities coming together and talking to each other at last. So thank you very much to you and all your team for putting this together so effectively. One of the things I've particularly loved about the conversations over the last few days is the fact that they really pushed to the fore some of the social aspects of science, and particularly the study of communication. And many

of the talks I've heard (unfortunately, I didn't manage to hear all of them, because I was teaching this week), many of the talks I've heard really focused also on some meta-level aspects of how scientists communicate, how we as researchers communicate. And that's partly what I'm going to be focusing on in my talk as well. One of the things I should premise my talk with is the fact that we're going to change gears slightly here, because I'm not going to be talking as much about wonderfully quantitative work. I'm going to be talking about qualitative work I've been conducting, looking at what people are doing in different scientific domains when they're working with large amounts of data. So it will be a meta-level account, and I guess it's going to bring back, also at a meta level, the importance of bringing together qualitative and quantitative approaches when we are doing digital studies of science.

There have been many talks, of course, that pointed to this, but I think mine is going to be another step in that direction. So what I'm going to be doing is talking a little bit about some past work I've done looking at dimensions of data-intensive research; then I'm going to look at some of my current research topics; and then look a little bit into the future, and see which of the things we've been learning through these studies may be applicable, as a reflection, for all of us here who are involved in this kind of work. Many of these thoughts will hopefully start to unravel as we go along in this big tour of different projects, different ideas and different initiatives. But hopefully, by the end, we will have enough thoughts that we can have a good discussion. So first of all, I

have been doing quite a lot of work in the past on the role of big and open data within the sciences. The interest that prompted that kind of work was, for instance, the consideration that data have acquired a new status within research; I think the very existence of this meeting proves that data have become something publishable in their own right. There is a renewed attention to trying to reuse available data and to decrease the waste or loss of data; a lot of attention, of course, to data-driven research, planning and discovery; to the role of semantics and standards in enabling the movement of data and therefore their reuse; and of course to the use of data in machine-readable formats to fuel AI and artificial intelligence tools. And this has been

accompanied by the emergence of new institutions and new modes of dialogue around the use of data. To some extent, this is really allowing us at the moment to reinvent, or at least to rethink, how we exchange scientific results, how we make inferences from results, and how we collaborate across different disciplines and countries. And of course, data

and data sharing questions have always had that power within the sciences. This is certainly not a new phenomenon. But the ways in which the discourse of open science is now being mobilized, and the structures, the interests and the ways of financing data sharing at the moment, are very particular. One of the things that has also become very topical to think about in this era is: how do we incentivize and reward these kinds of shifts towards a more data-driven or data-centric way of working? And how do we go beyond the hypercompetitive publishing climate that we've all grown up with, where, whether you're in the humanities or in the sciences, there is a premium put on people who spend less time tinkering with data and thinking about ways to productively collaborate, and more time producing sole-authored or first-author publications in high-level journals? And more broadly, there has been, and continues to be, attention to how this new emphasis on the role of data can become an opportunity to improve the relationship between science and society, and in fact can become a platform to debate what counts as science in the first place: for instance, whether technical work is actually part of scientific work, what the role of scientific infrastructures in knowledge production is, the role of scientific governance, of course, and how data should be credited and disseminated in the first place. So my focus for many years now has

been on trying to think about how big and open data become mobilized, and whether looking at the ways in which these data are mobilized tells us something about how comprehensive and how reliable these data are, and how they could be reused. It is of course a very widely shared presupposition, for anybody working in contemporary AI, that AI needs to be grounded on data that are shared across different contexts and can be reused for a variety of purposes. And this is a major challenge, of course, particularly when we're looking at large, complex and heterogeneous datasets, which have been put together by linking many different locations, many different sources, and research done on different phenomena, without even talking about the diverging interests that can underpin such research. So in trying to understand these

issues, I've been doing a lot of work on what I call data journeys: the ways in which data get mobilized across contexts. One of the ways we have been doing this, and I'm sorry if some of you have seen this slide before, but just to summarize, is to look at how data move, first of all, from sites of data creation to sites of data mobilization, which can typically be any kind of data infrastructure, from databases to repositories, data banks, et cetera, and then to sites of data interpretation. And I've been particularly interested in data journeys where the sites of data interpretation actually differ from the original sites of data creation. So I've been focusing my work, philosophically and empirically, on how data need to be decontextualized to be able to be mobilized. In

particular, data cannot travel with absolutely all of the information that accompanies their production. Typically, there is a selection of which bits of information about data provenance may actually be relevant to their potential reuse. And this process of decontextualizing data is really crucial to putting together data resources. And of course, many of you provided wonderful examples of this already in the last few days.
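To make that selection process concrete, here is a minimal Python sketch, with all field names and values invented for illustration, of how a curator might slim a dataset's full local provenance down to the packet of metadata that actually travels with it:

```python
# Hypothetical sketch: a dataset rarely travels with its full provenance.
# Curators select which contextual fields accompany it to a repository.

FULL_PROVENANCE = {
    "instrument": "Affymetrix ATH1 microarray",
    "lab_protocol_notes": "... pages of local detail ...",
    "growth_temperature_c": 22,
    "collector": "J. Doe",
    "funding_source": "institutional grant",
    "sample_id": "At-2019-0042",
}

# Fields judged relevant to potential reuse in this invented example;
# everything else stays behind at the site of data creation.
TRAVEL_FIELDS = ["instrument", "growth_temperature_c", "sample_id"]

def decontextualise(provenance: dict, keep: list[str]) -> dict:
    """Return the slimmed-down metadata packet that travels with the data."""
    return {k: v for k, v in provenance.items() if k in keep}

travelling_metadata = decontextualise(FULL_PROVENANCE, TRAVEL_FIELDS)
print(travelling_metadata)
# {'instrument': 'Affymetrix ATH1 microarray',
#  'growth_temperature_c': 22, 'sample_id': 'At-2019-0042'}
```

The point of the sketch is simply that which fields end up in TRAVEL_FIELDS is a judgment call, made in anticipation of particular reuses.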

But importantly, this process of decontextualization typically happens, and certainly needs to happen, in my view, with an idea already in mind, a vision for what potential recontextualization could happen with the data: what would be needed so that people who are based in different locations, and have a different background from the people who originally produced the data, can fruitfully reuse the data, and do so in a way that still makes the dataset reliable and does justice to the efforts that went into generating the data in the first place? Some of this work started by looking at how data get mobilized in model organism communities, which are large communities of researchers, typically very highly distributed around the globe, who however focus on the study of particular organisms. I devoted a lot of work to the distribution and reuse of data on Arabidopsis thaliana. And this is a little graph that shows you the different elements that are involved in the journey of such data, and their reuse through a particular database and platform. I've been paying attention specifically to the relationship between data platforms, the types of data they use, and the ways in which the original samples and specimens from which data were garnered were stored: so, stock centers for

organisms, and, in the case of plants, seed centers and seed banks. I've also been looking in a lot of detail, and I'm continuing to study this, at systems of semantics. How do we frame the keywords, terminologies, concepts and assumptions that underpin not only the ways in which we are sharing data, but the ways in which we are recontextualizing them? What assumptions are linked to the dissemination of data? What keywords, what computational ontologies, are used to disseminate these data?
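As a hedged illustration of why these semantic choices matter so much, consider this small Python sketch, with invented dataset labels: two labs file the same phenomenon under different keywords, and whether a distant reuser finds both depends entirely on whether a curated synonym mapping exists:

```python
# Minimal sketch (invented labels): the keywords under which data are filed
# determine what a distant reuser can retrieve.

datasets = [
    {"id": "lab-A-001", "keyword": "drought tolerance"},
    {"id": "lab-B-007", "keyword": "water deficit resistance"},
]

# A curated mapping of local terms onto a shared vocabulary.
SYNONYMS = {"water deficit resistance": "drought tolerance"}

def search(term: str, normalise: bool) -> list[str]:
    """Find dataset IDs matching a term, with or without synonym mapping."""
    hits = []
    for d in datasets:
        kw = SYNONYMS.get(d["keyword"], d["keyword"]) if normalise else d["keyword"]
        if kw == term:
            hits.append(d["id"])
    return hits

print(search("drought tolerance", normalise=False))  # ['lab-A-001']
print(search("drought tolerance", normalise=True))   # ['lab-A-001', 'lab-B-007']
```

Without the curation step, half the available evidence is simply invisible to the query.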

These semantic choices have huge implications for the systems that we're using and the ways in which data will be reused. And in doing that, I paid a lot of attention to the fact that what we're looking at in these kinds of systems is fundamentally a type of distributed reasoning. We're looking at systems where there are potentially thousands of people who are indirectly collaborating: different groups of people working on different infrastructures, different data resources, different types of repositories. They relate to each other because the data they're caring for link to many other different types of data and services which are available in the wider data ecosystem. But very often these people are not working directly together; they may not know each other; typically, they don't know each other. And yet each of these groups of people is, in essence, custodian of one essential element within this data infrastructure, an element that is important in trying to understand the ecosystem and what kind of meaning this whole system is assigning to the data, right? So we're looking at a situation where there's no single individual who can really understand the whole system; it is fundamentally a distributed system. And this has, of course, fascinating implications in epistemic terms, for how we're thinking about knowledge, but also in ethical terms. And so

I've been very interested in the last few years in also thinking about what this distributed quality of data infrastructures means in terms of accountability for people's work on these systems, which of course is very relevant for us in HPS. Because the question becomes: when we're starting to put together data infrastructures, and to think about specific kinds of data reuses and visualizations, what does this contribution to the ecosystem of knowledge and data in this field actually contribute to future work, and in which way? Are we accountable for the quality and durability of the kinds of data infrastructures and interpretations that we're putting out? And of course, this relates, to an extent, to the broader controversy around reproducibility, which is raging in scientific fields. I really wonder how many of you think about this question in relation to your own work in digital studies; we can come back to this question, hopefully, in discussion. Now, one of the more specific case studies I've been looking at is how one thinks about data linkage. So the

idea of making different data repositories interoperable with each other, so that they are still independent from each other and can still be used autonomously, but so that, if somebody is interested in bringing together datasets that sit across these repositories, that is actually possible to do. And I've been looking specifically at situations where you're trying to implement data linkage between databases that are targeting data taken from nonhumans and databases that are targeting data taken from humans. And in fact, one of the

interesting cases that I was looking at is the use of data taken from yeast, which you would argue is a rather humble organism compared to the complexity of the human body; and these data were in fact being reused for cancer research in humans. One of the things that I noticed at that point as being extremely important, in cases such as this where you're doing a rather complex passage of data linkage, is the extent to which the responsibility for data curation, and the expertise used for data curation and annotation, are distributed, and what that actually means in terms of the trustworthiness of the infrastructures that are thereby produced. And again, one of the interesting things I kept noticing is that these issues were much more easily handled for databases that may still comprise a lot of data, but are managed by relatively few people and address a relatively self-contained community. In the case of fisheries, for instance, this is a scientific community which is relatively well contained: it still contains a few hundred researchers, but they tend to more or less know each other, at least by name, and many of them are related through genealogy; they've been trained in the same labs.
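To give a rough sense of what such a passage of data linkage involves, here is a minimal Python sketch; the phenotype labels are invented, and the yeast-to-human gene mapping stands in for the curated ortholog tables that real repositories maintain:

```python
# Hedged sketch: linking records across repositories built for different
# organisms. All entries are illustrative stand-ins for curated resources.

yeast_phenotypes = {
    "YGR098C": "cell-cycle arrest",   # yeast systematic name (illustrative)
}

orthologs = {
    "YGR098C": "ESPL1",               # hypothetical yeast -> human mapping
}

human_cancer_genes = {"ESPL1": "chromosome segregation, studied in tumours"}

def link(yeast_gene: str):
    """Follow the curated ortholog mapping from a yeast record to a human one."""
    human_gene = orthologs.get(yeast_gene)
    if human_gene is None:
        return None  # no curated bridge: the datasets stay unlinked
    return {
        "yeast_gene": yeast_gene,
        "yeast_phenotype": yeast_phenotypes[yeast_gene],
        "human_gene": human_gene,
        "human_annotation": human_cancer_genes.get(human_gene),
    }

print(link("YGR098C"))
```

Every entry in the mapping table embodies a curatorial judgment made by somebody, somewhere; the trustworthiness of the linked result depends on how that distributed work was done.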

That actually provided a cohesion to the data collection method. It also meant that this was pretty much the only community I've worked with where there was a 50/50 split between curators of the database taking responsibility for annotating the data, and the people actually producing the data taking some responsibility for how they would appear in the database and contributing some crowdsourcing to it. Another case that we looked at, and this was with Niccolò Tempini, my collaborator here in Exeter, was the case of how one uses databases in a complex, translational field as a site of trusted expertise, one that allows you to immediately put data into action in a very concrete manner. This was the case of cancer genomics, and the role played by a database like COSMIC, which is a database devoted to somatic mutations, in curating data from a very wide variety of sources, from publications to experimental work, and in making them available to people working at the interface with the clinic and personalized medicine, so that these data could in fact be actioned into the production of clinical diagnostics, and eventually also the diagnosis of individual patients. Another example we looked at is the very, very important role that thinking about information security can have when producing trustworthy data infrastructures. And to look at

this, we worked with the Secure Anonymised Information Linkage (SAIL) Databank, which is based in Wales. This is one of the most prominent data banks, certainly in the UK and arguably in the world, dedicated essentially to the anonymization of very sensitive health-related data. What they've been doing, now for almost 20 years, is garner data coming from medical practitioners, data coming from cancer registries, from clinical trials and from individual research projects, bring them all into their data system, and anonymize them at different levels of organization, as required by the level of sensitivity of the research. And this in fact made it possible for researchers who were interested in reusing that data to actually go to this data bank and make an agreement for how they could reuse it, even though it was very sensitive data and therefore very, very difficult to handle. One of the things we found here is that the information security system set up by an infrastructure like this ended up having a very, very strong epistemic role, in addition, of course, to being very important to actually allowing people access to the data physically. What these systems managed to achieve was to provide and maintain a reliable chain of evidence when it came to the use of this data.
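As an illustration only, and emphatically not SAIL's actual protocol, the following Python sketch shows one generic pseudonymization technique, salted hashing, that lets records about the same (invented) person be linked across sources without exposing the identifier itself:

```python
import hashlib

# Minimal sketch of generic pseudonymization, not SAIL's real method:
# direct identifiers are replaced with salted pseudonyms before data enter
# the shared environment, so records from GPs, registries and trials can
# be linked without exposing who the patients are.

SALT = "project-specific-secret"  # in practice held by a trusted third party

def pseudonymise(patient_id: str) -> str:
    """Derive a stable pseudonym: same person -> same code, irreversible."""
    return hashlib.sha256((SALT + patient_id).encode()).hexdigest()[:12]

# Two sources hold records about the same invented person.
gp_record = {"id": pseudonymise("943 476 5919"), "diagnosis": "asthma"}
registry_record = {"id": pseudonymise("943 476 5919"), "event": "hospital admission"}

# The two sources can now be joined on the pseudonym alone.
assert gp_record["id"] == registry_record["id"]
print(gp_record["id"])
```

The epistemic point is that the security machinery itself, salts, access agreements, audit logs, is what sustains the chain of evidence from source to reuse.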

And in fact, very often the people involved in this databank ended up providing assistance to researchers who were interested in reusing their data: helping them reframe their own research questions and think about other data types that might be relevant for those questions, and therefore really becoming an integral part of the research effort, rather than just being people who take care of storing the data somehow. Another example we looked at, which was particularly fascinating to me, was an attempt at what is commonly referred to as a data mashup: bringing together data from very disparate sources, in full awareness of the fact that this generates all sorts of epistemic problems, because the communities that produce these data operate on very different assumptions, but with the idea that if one can find some parameters that are common to all these datasets, and assume to some extent that these parameters are invariant, then it actually makes sense to try and just mash these datasets together and see what kinds of inferences one can make out of them. The case we looked at was the so-called MEDMI project, a platform that was put together to bring environmental data, health data, socio-economic data and climate data together, in the attempt to allow, for instance, the mapping of the spread of seasonal diseases in England, and of how this would affect health services and the provision of hospital beds. What we saw here was that, on one end, this was predicated on the idea that things like the locations of patients, and also of the spread of potential pathogens, could be pinpointed rather accurately. And so by

juxtaposing datasets that apply to the same location, you could actually start to see interesting correlations, for instance between the number of patients who had been hospitalized because of any kind of pulmonary or respiratory disease, the type of weather conditions, and the type of pathogens present in the area at that point in time. What we found here is that these invariants, which can sometimes look so obvious, like location (and of course time, as we all know, is another big one), are in fact highly varied: there are almost infinite ways of measuring location, and many of these ways are instantiated in these kinds of datasets. So the researchers busy in this project actually took an inordinate amount of time to reanalyze and curate their data, to be able to actually provide reliable inferences.
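A hedged Python sketch of the "invariant that isn't", with invented values: the same place arrives encoded three different ways, and it is only a painstakingly curated lookup table that makes the juxtaposition possible at all:

```python
# Illustrative sketch: the same place encoded three different ways across
# mashed-up datasets. All values are invented.

health_record   = {"location": "SW1A 1AA", "patients_admitted": 12}   # postcode
weather_record  = {"location": (51.501, -0.142), "rainfall_mm": 8.3}  # lat/lon
pathogen_record = {"location": "Westminster", "pollen_count": 410}    # district

# A hypothetical gazetteer, painstakingly curated, is what actually makes
# the join possible; without it the 'common parameter' is common in name only.
GAZETTEER = {
    "SW1A 1AA": "westminster",
    (51.501, -0.142): "westminster",
    "Westminster": "westminster",
}

def harmonise(record: dict) -> dict:
    """Rewrite a record's location onto the shared gazetteer vocabulary."""
    record = dict(record)
    record["location"] = GAZETTEER[record["location"]]
    return record

merged = {}
for rec in map(harmonise, [health_record, weather_record, pathogen_record]):
    merged.setdefault(rec.pop("location"), {}).update(rec)

print(merged)
# {'westminster': {'patients_admitted': 12, 'rainfall_mm': 8.3, 'pollen_count': 410}}
```

The gazetteer here is a stand-in for the inordinate curatorial effort the project actually required.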

Again, this speaks to the incredible amount of hidden, and very often manual, work, and the kinds of informed judgments, that are necessary when putting together these kinds of very large data resources. We also looked at the question of how one automates some of these processes, and specifically at the ways in which imaging data are now being automated in research on plants. And that brought us to think a little bit more about what it means to relate data to models, and to what extent data are actually representing a part of the world, and to what extent the modeling of data is responsible for doing this. I'm not going to get into much detail about this now, but you know, my position is that data actually do not represent at all; it is data models that are doing the representational job. But this was certainly another very interesting window into the huge complexity, and the enormous amount of constant calibration, needed to produce imaging, even in a situation where the conditions under which the imaging is produced, like in this case one of these automated experimental stations, are highly standardized. So what kind of

lessons do we learn from some of this work? Well, certainly we learned that the technical know-how necessary to manage data at this kind of scale is to some extent elusive, and certainly represents a whole new set of skills and expertise, and I'm sure many of you who have done quite a lot of this work will agree with me that these are very, very difficult to bring together at the same level with the other types of expertise that are needed for this kind of work. And indeed, again, we come back to this problem of this being highly distributed work, where you need people who actually have specific expertise on data management, and on what best practice might actually mean in a particular domain. Here, for instance, we've made a very broad typology of the kinds of data portals that people working in plant science really should at least be aware of when they're thinking about managing their data. And as you can see, it's a very long list, and the specific examples of each of these technologies vary very quickly in time, as some infrastructures come out of use and others are invented. So it's actually very difficult

to keep up with these kinds of developments. But at the same time, of course, you also want to have people who retain the domain knowledge that is absolutely necessary to be able to contextualize these data. We also, of course, ended up thinking a lot about what kinds of incentives are underpinning these kinds of data ecosystems. And we found that the incentives at this point are wrong, and in fact very worrying when it comes to the robustness of the ecosystem. There is, of course, a lack of recognition for data creation and donation, and that limits the amount of data and metadata that are actually available online. And that, again, means that the kinds of data collections we can find online represent very highly selected data, from a

very small proportion of available sources. And again, we've seen that time and again in previous presentations this week. Thankfully, in the case of HPS, I think everybody is trying to be very careful in really qualifying the extent to which the original sources can speak to broader issues. But of course, there's always this creeping problem of representation when doing this kind of big analysis: the wish that the data we're analyzing in the first place could actually provide a bigger sample of the world than what we actually have in our hands. There's also generally a lack of business models to develop and, especially, to update online databases, and this in turn limits the comprehensiveness, the usability and the durability of the contents. And one of the things that we kept finding, that we keep finding all the time, and in fact I think things have been

getting much worse over the last 10 years, is that the selection of data used for particular research domains and questions is based on convenience: on the tractability of the data themselves, and on the socio-economic conditions of data sharing, as for instance when people insist on using Twitter data, because it's much easier and can be done to some extent for free, rather than using, say, Facebook data. These are not really epistemic choices, and they're not necessarily methodologically justified choices. And of course, there's the fact that we don't really have yet the reward structures and criteria to value this work within academia in general. And I think, again, this also impacts this community to a large extent. It means that, again, we have

worries around the quality of the data that we're using, and around how trustworthy the infrastructures being set up are, given that people are very often not really acknowledged for doing good work with them. And generally, we also saw a lot of misalignment between the IT solutions that engineering domains are offering and the research needs of people who are trying to put data to work in the field. What this means is really thinking about the digital landscape as a highly fragile landscape, where there is an exponential growth of data quality concerns, and where the sustainability of the landscape we're working with is really unclear and certainly limited. And this, of course, also connects to the fact that data travels and data journeys are constantly reshaped by institutional, national, disciplinary and cultural boundaries, and at the very same time they challenge those

boundaries all the time. So we're looking at a landscape that is actually very highly dynamic in all sorts of ways, and this again makes it very difficult to keep it well maintained without proper resources. And of course, this also goes for sustainability in a more ethical sense, because protecting the rights of individuals and communities which may be affected by data reuse requires both local investments and a long-term shared vision for what it means to actually care for the data subjects that we are dealing with in our work. There's also, of course, a risk of conservatism when we keep reusing the same data sources. And what looms large, certainly in the

uses of data in the sciences, is the fact that the vast majority of data, certainly in domains like agriculture but also in the health domain, are private, privatized or commodified in various ways. And that means that they are either inaccessible to people working for publicly funded institutions, or just very difficult and very expensive to handle. So let me say something about the kind of work that I'm doing at the moment, that I've been doing over the last year or so, which builds on some of these insights and tries to apply them in a variety of different ways. So one of the things I've come

to realize, at least in my own little corner of the field, is something that science and technology studies people have been saying for decades, and that I think in this audience we all pretty much take for granted: the idea that what we're looking at here is a sociotechnical problem. It's an issue where technical considerations and conceptual considerations are completely intertwined with the social conditions under which they are taken and achieved. And I realized that this really was not acknowledged in any particularly deep way when it came to the setup of many data infrastructures and approaches to data linkage and data reuse. And so I decided to try and focus a little bit more

on one of these areas, and to think about how specifically plant data are now being linked between different infrastructures around the world, and to see whether all the discourse, which I had followed very closely, around the standards, the semantics and the technical features of the software and systems used to enable this kind of data linkage, was matched by an attention to the social implications of linking these data: the ways in which the data actually cross national boundaries, the ways in which these data in fact belong to particular heritage traditions, and how this set of considerations was in fact intersecting with the technical realm. So we started this project, From Field Data to Global Indicators, as part of the Alan Turing Institute projects, to look at how plant data can be reliably linked across data infrastructures around the world, and how this can be done responsibly. So here's an example of one of the cases that I've been looking at for a few years now; I did my field study of this case in 2017, and I'm now working with people based in different parts of this case to try and take it further. So this is a field in Nigeria,

at the International Institute of Tropical Agriculture in Ibadan, close to Lagos in the south of Nigeria. This is one of the main world institutes for sustainable agriculture, and one where a lot of data are collected on crop trials, including, in this case, a trial on cassava, which is a root, as you can see here in this picture, and which is a very important food staple for much of the population in the Global South. What we're looking at here are the ways in which data collected from these kinds of field trials on different varieties of cassava end up informing research on cassava varieties, and on improving cassava to be more resilient to the environment and to plant pathogens, but also the ways in which the commercialization of cassava is actually implemented. One of the things that I looked at in a lot of detail, and collaborated on, is the development of the Crop Ontology, which is a semantic system that captures information about plant traits. And you see an example of this

here: this is a particularly big cassava root, and these are the types of terms associated with some of the plant traits in this case. I've also been looking at the ways in which this kind of information, and these kinds of criteria for what people should be looking at when they're collecting data, are implemented through the use of field books, which researchers and technicians on the ground can use to collect data directly as they're going across the fields and looking at newly dug cassava roots, and then to export those data directly, in a kind of open data manner, to servers on the web that will disseminate them all around the world, as I'm indicating here.
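To picture what such a field book record might look like on its way to the server, here is an illustrative Python sketch; the trait codes are invented stand-ins for Crop Ontology terms, and the record structure is hypothetical:

```python
import json

# Hedged sketch of a field-book entry as it might be captured on a tablet
# in the field and exported to a server. The trait codes below are invented
# stand-ins for Crop Ontology terms, and the record layout is hypothetical.

observation = {
    "trial": "cassava-ibadan-2017",
    "plot": "B-14",
    "observations": [
        {"trait_id": "CO_334:0000099", "name": "root size", "value": "large"},
        {"trait_id": "CO_334:0000121", "name": "mosaic disease severity", "value": 2},
    ],
    "collected_by": "field-technician-07",
    "timestamp": "2017-06-12T09:30:00Z",
}

# Serialized once, the record can be uploaded and re-served to any
# institution worldwide that understands the same trait vocabulary.
payload = json.dumps(observation, indent=2)
print(payload)
```

The shared trait vocabulary is what lets a measurement taken at the edge of one field become immediately legible to institutions on the other side of the world.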

So from Nigeria, and of course from the many, many other trials of this type going on around the world, these data are immediately made available to all sorts of institutions that are working on the same issues, and to both commercial and public organizations that are interested in them. So what have we learned here? Well, in keeping with the questions around how this is really a sociotechnical issue, what we learned is that the idea that many people in this field have, that one should try to produce all sorts of environmental intelligence (so, use AI to monitor environmental conditions and to produce better results out of the interaction between humans and the environment, including agriculture), really requires a large extent of social intelligence as well. For instance, even in very technical questions around how one puts together data sources that speak to different levels of organization of the same plant, and even environmental factors, which is very important when you're trying to examine gene-environment interactions in this kind of research, it was important to try and mix quantitative, observational and imaging data. This meant that one needed common trait descriptors. And this meant that

there needed to be some sort of agreement among stakeholders about what appropriate trait descriptors would be. It also meant that there needed to be metadata to inform the further contextualization of these data, and again, there needed to be agreement on what kinds of metadata to use. But of course, this is a very difficult kind of consensus to achieve when the cultures of data exchange in this field are so wildly diverse: across scales, when one goes from a little field somewhere in a smallholder farming community in Nigeria all the way to multinational corporations with interests like control over seed production in the majority of countries; across borders, with every nation having different perceptions of how they want to think about agricultural development, and with many international organizations highly involved in trying to regulate all of this, each with their own stakes in it; and of course across the relationship between public and private agencies in this space. And so actually finding sharing, access and reuse

agreements among stakeholders becomes very fraught, and so does defining appropriate data governance to establish what constitutes lawful and adequate data use in this case. This, of course, is complicated by the fact that there is a strong interest right now in heritage crops, particularly those that come from the Global South, which have been less studied and are probably going to become much, much more relevant for sustenance and food security in the Global North due to climate change. And in these cases, it's very important to try and acknowledge and reward the provenance of the data and the work done by indigenous communities in producing particular specimens and particular breeds, and also to locate

responsibility for what kinds of uses are made of these data, and to be able to pinpoint mistakes and concerns. All of this is crucial to try and identify who is being excluded by these systems, who is not being rewarded appropriately, and who is not being credited appropriately. But of course, as you can imagine, given the scales and the different interests involved in the system, it becomes very complicated very quickly. So there is a strong recognition that trying to compare and integrate data from across the globe is absolutely crucial to producing good results in areas such as agriculture and precision medicine, and there is a lot of emphasis on trying to

develop global data infrastructures and related semantics. But of course, that raises these questions around how we achieve the consensus to do this. So what kinds of infrastructures are we looking at, very concretely? We're working with infrastructures which are almost by definition transdisciplinary: they need to involve experts from the technical side, as well as from the domains, as well as on the territories and the types of crops that are involved here, as well as from the humanities and social sciences, as in my case, since we're looking at this from the perspective of what actually constitutes a sustainable infrastructure. And these are also deliberately transnational initiatives, of course, and I've cited many of them here if you want to have a look at them; these are very much at the level of the FAO and the United Nations and similar initiatives, trying to really bring together different types of data stakeholders and have them communicate with each other. Now, of course, there are huge governance challenges in trying to do that. There is this

underlying idea, common to many of these initiatives we're looking at, which is an idealization of global plant data resources as a common good that should be harnessed for the survival of humans and of the planet. But of course, there's no such thing as a global data resource: these are all highly local data resources, and whether or not it makes sense to conceptualize them as a common good is an incredibly fraught question. And open data seem to be very important in this space, the fact that you can freely share the data. And yet, this is very

tough to reconcile with the fact that you've got to recognize the rights of indigenous groups and local breeders, particularly in situations where these groups may not be very happy to share their data, especially not with multinational corporations; and also to reconcile this idea of openness with the fact that the vast majority of these data are still produced by plant trials which are sponsored by agrotech companies, and are privately managed and completely inaccessible. So data governance is very often pointed to as the key to addressing these issues. But the question becomes: governance among whom, established by whom? And in this sense, again, we come back to this being a huge sociotechnical problem. Even the idea of what counts as data production in relation to crops is a very fraught issue, because you can think, for instance, that data production is the result of growing plant specimens: that's when you're producing data, right? Or you could say no, this is about when you select the specific strains that are going to be grown, and everything else doesn't matter.

Or you could say no, actually, it's about when you design the field trials, because this is when you're actually setting up what will count as data for you; ultimately, you're setting up your instruments and your methods. Or you could say no, it's about the measurement tools you're using: this is really what's producing the data, so that's what you need to focus on. Or you could talk about who is designing the data storage and data infrastructures for these data. All of these ways of thinking about data production could potentially be equally valid. But depending on how you answer this question, you have different answers to the question of who is the legitimate owner of the data, and who actually has control over their use. And the

lack of clarity on key questions like this is what in fact leaves the door open to bioprospecting, and to what some people are calling digital feudalism: having countries in the Global North, whether through publicly sponsored researchers or private companies, go to the Global South and basically appropriate all of their resources relating to plants. And this, of course, is certainly nothing new; it builds on the long exploitation and discrimination that is built into the very food production system that we use every day, everywhere in the world. So to try and think about this

question in a somewhat more data-based way, and this is going to be the only time I show you an attempt, in our own group, to do a very specific data-intensive exercise, we started to think about how to map these data infrastructures. And we started with the idea of trying to map them geographically. This is all very preliminary; it's work I'm doing with colleagues at the University of Exeter, and it's very much in flux, also because we're still trying to find more funding for it. But the idea was, of course, to try and locate them geographically, but most importantly, to locate

these initiatives diachronically: thinking about the times at which the different initiatives, which have a lot of responsibility for how data are being mapped and put together, were developed; when they started; who was involved in them; which were the key points of change within them; when they were discontinued; and so on and so forth. You can see here an example of one potential visualization of the timing of some of these initiatives. And of course, we are compiling profiles of each institution and platform that we're looking at, so that we can start to put these elements together. This is far from being something that we can already do a lot of data-intensive analysis on, but the intention is eventually to be able to do this. And partly this is because there's been a lot of work done within the plant sciences themselves on mapping data infrastructure initiatives, much, much better than anything we may be doing here, which is anyhow very limited to the ones that we have intersected with.
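In the spirit of that exercise, a very preliminary Python sketch of the kind of profile we compile; all entries here are illustrative placeholders rather than our actual data:

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative sketch: profiling data infrastructure initiatives
# diachronically. Every entry below is an invented placeholder.

@dataclass
class Initiative:
    name: str
    started: int
    discontinued: Optional[int]  # None = still running
    region: str

initiatives = [
    Initiative("ExamplePlantDB", 2003, 2014, "North America"),
    Initiative("GlobalCropPortal", 2010, None, "International"),
    Initiative("RegionalSeedNet", 2016, None, "West Africa"),
]

def active_in(year: int) -> List[str]:
    """Which infrastructures were alive in a given year?"""
    return [
        i.name for i in initiatives
        if i.started <= year and (i.discontinued is None or year < i.discontinued)
    ]

for year in (2005, 2012, 2020):
    print(year, active_in(year))
```

Even this toy version makes visible the churn the talk describes: infrastructures appearing, overlapping for a while, and quietly disappearing.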

But what is really lacking is a more historical perspective on these kinds of issues. Helen Curry at the University of Cambridge is doing wonderful work trying to supply some of this, and many other people are participating, but I think this is a space where data-intensive approaches could really be helpful in furthering these kinds of studies. So when it comes to these studies of research initiatives and data-centric initiatives around the world in different domains, what kinds of solutions did we come up with? And what kinds of things have I been discussing with these different stakeholders when thinking about what could be done to improve the current situation? Well, first of all, of course, the idea that data stewardship should be valued, and should try and foster critical data reuse. So rather than just black-boxing whatever has been happening to the data and producing nice visualizations that cannot actually be unpacked and disaggregated, it is really important to provide tools that make it possible to go back to the steps of data processing and data visualization, to analyze them and potentially question them, partly because, as I've been saying, the context of data use is going to be really important in determining what matters about the data and what we really want to look at. And, of course, also trying to build

responsible practices into the technical specifications of data infrastructures. There are principles being put forward around this, certainly in the natural sciences: the so-called CARE and TRUST principles; we can come back to these in discussion if you want. But this seems to be a very important thing to try and do. We are right now actually in the middle of carrying out a big workshop, where we're trying to put stakeholders together to try and push this along a little. And the idea that data providers and users should really be involved in the development of data infrastructures is absolutely fundamental; it came out of pretty much every study I've done in a qualitative sense in this area. And there are initiatives in relation to crops that are trying to stimulate such so-called communities of practice. And there is, in fact,

a big exercise at the FAO right now to try and see what this kind of long-term involvement could be and how it can be incentivized. And of course, more generally, and this really brings us back to what we are doing in HPS, there is trying to encourage as explicit a debate as possible over the overarching goals, and in fact the concepts of human development, that underpin data sharing practices. And certainly, in the case of crop data, this is absolutely essential, because, as some of you will be aware, there are very particular ways of thinking about what constitutes agricultural development that tend to take all of the attention and not even be questioned by people who are working in this field, and that has potentially really dramatic and bad results. So now, let's think briefly about looking into the future. I think it's pretty clear from what I've been saying so far that one of the things that has become more and more relevant for me in thinking about these problems is the question of inequity, injustice and exclusion in digital systems and from digital systems; and also the fact that the ways in which these systems try to encompass and capture environmental variability are extremely diverse, and very,

very often very problematic, because variability happens on many, many different scales, and this needs to be recognized in these systems. And these two things, I think, are related: the fact that we have limited ways of recognizing environmental variability is, in my view, deeply linked to the kinds of exclusions and digital divides that are happening in the contemporary digital realm. So in the coming year, I will be starting a five-year research project looking specifically at what it means to think about open science in a situation where you have very highly diverse research environments, where this notion would actually signify different things for different people, especially when it comes to the sharing of data systems. And I've done some

work over the last year, as you would expect, and as probably many of you have, out of a preoccupation with how one thinks about these ideas in the case of the current COVID crisis and the pandemic. And it's certainly important to mention this at the moment, because the pandemic has brought an incredible acceleration of this transformation. Those of us who were already in digital realms have been pushed forward enormously. There is a lot of evidence of deep integration of digital services and decentralization of infrastructures around the world, and particularly in the Global North; and of course, this is further amplified by the launch of 5G networks. At the same time, there is also a very strong recognition of the fact that this is in fact amplifying the digital divide to an incredible extent, because people who were excluded from these transformations at the beginning of the pandemic are now finding themselves excluded from many kinds of social or medical assistance, because many of the services, like the digital passports showing whether you've been immunized, now pass through these digital systems. And so the

World Economic Forum has put out this wonderful idea of the Great Reset: we need to rethink digital platforms so that we have a new social contract that honors the dignity of every human being. And this was specifically meant to address the marginalization that exists in this sphere. But in fact, even here, we've seen an enormous emphasis on the technical as the great solution, as an alternative to tackling the much more difficult questions around social conditions. So across different countries, at the beginning of the pandemic, we saw a rush towards data science solutions, like tracing apps on smartphones and attempts at data aggregation across countries. And of course, these increased capacity in already very powerful big technology corporations like Apple and Google, who provided the technology for some of this. And

again, it decreased even further the capacity in the low-resource environments that arguably would need it most. Which brought me to a preliminary assessment of this great idea of the Great Reset as a combination of surveillance capitalism and some sort of lip service to social responsibility: certainly not something that has really helped much to improve the situation. And again, I think this points to the fragility of the system. There are still huge limits to data access, even when it comes to COVID-related data; we've seen it happening for the medical front line and social services; and tracing, data interoperability and linkage have not been working very well. And it also highlights the

problematic relationship between different governments, between governments and corporations, and the role of international agencies like the WHO; and of course the dire consequences of using digital platforms for surveillance, and the lack of trust of people who are exposed to these kinds of systems. One thing that I particularly want to highlight here, and again, hopefully it's absolutely trivial for this audience, but I think it's really worth repeating, is the fact that when we're doing digital studies, there is this layer of neutrality that comes from using some of these technologies. And hopefully we're all very aware of the fact that data science and digital studies are not neutral; they're everything but. Unfortunately, data science as a field especially keeps, very often, selling itself as a neutral field that can be put to the service of different masters as needs must. And I think this is really problematic.

We've had a big discussion very recently in the Harvard Data Science Review on this point, where I published a paper that was then commented on by lots of different experts coming from different domains. It's quite an interesting exchange, if you want to have a look at it, trying to consider what it means to move away from the lure of neutral and value-free data science, and instead to work collaboratively with domain experts and communities towards forging socially beneficial solutions in a very explicit way. So let's spend just a few minutes on implications for HPS. I think I've been pointing to them, hopefully, as I was going along and taking you on this tour of data studies in research. One of the sets of arguments I've been making over and over again, apologies to those of you who have heard it before, is that situating data and analyzing data is a practice of valuing, and it's unavoidably so: the procedures through which data are processed and ordered crucially affect their interpretation. I hope that at

least the quick examples I gave are enough to give you a sense of this, and it will also be part of your own experience. Databases, of course, do not store some sort of ready-made facts. It is the ways in which evidential value is attributed to data that determine the epistemic significance of data towards knowledge claims. And of course, this evidential value is not just determined by scientific or intellectual considerations. It also depends on other forms of valuing data, which can range from affective forms (you like a certain data type more than another), to economic ones (for instance, access to data), to personal and cultural considerations. One argument I have been making in giving a

philosophical reading of this work is that the triangulation of data, which is typically seen as one of the great solutions to try and make datasets better, does not actually reliably counter the kinds of bias that we see in the data landscape, because it doesn't necessarily counter the bias introduced by the diverse methods of data collection, storage, dissemination and visualization, since these are already sitting within a highly layered and unequal landscape. One of the conclusions for me, of course, is that pluralism in methods and standards enormously contributes robustness to data analysis, and reduces the loss of system-specific knowledge. Also, a big lesson for me in doing this work has been the role of interdisciplinarity, and of wide engagement on data sources and analysis. Of course, multidisciplinary teams are

indispensable; many of you are already working in this way. It's certainly been my experience that I understand very little of many aspects of the things I'm trying to analyze, and my research has hugely benefited from a wide network of collaborators, friends and peers that provides me with input and feedback on what actually is important here. This is particularly important when trying to understand the context and social significance of the data. And to do that, of course, it is also very important to try and engage beyond professional researchers per se. In many cases, this can add data to what one is doing. It also can add

robustness to existing data, and help to contextualize and validate interpretations of data. So I want to come back very briefly, in the context of HPS, to the kinds of solutions I've hinted at before when thinking about scientific efforts to manage and interpret big data, and see how we fare when we look at those same criteria. So the first solution I'd put forward, and of course these are in no way meant to be comprehensive or exhaustive or anything like that, was the idea that data stewardship is needed to foster critical data reuse. What does that mean for us? Well, first of all, I think for many people in HPS, though probably all of you excluded, there is a strong need to recognize what

counts as data and why. And this, I think, should happen in a relational framing: really recognizing what it is that you're using as evidence, and acknowledging that whatever you're using as evidence in fact counts as a form of data for your research, and,

therefore, that some of these considerations actually apply, rather than being something that one can just discard. It means having, of course, an explicit debate on which data get to travel and why. This is particularly important when you're looking at historical archives, where of course we have a huge path dependence on what has been recorded, on which figures are being given prominence, on which kinds of journals, for instance, are being tracked and which not; the relevance of languages here is very important too. The work that Christophe Malaterre was presenting two days ago, for instance, is absolutely wonderful in that way. But basically: try and keep questioning the sources, and also the ways in which data have been processed over time. And this also means, when we are processing data ourselves, trying to make our own data processing as trackable as possible: providing people who are looking at our data visualizations or our papers with some indication of how they can deconstruct our arguments and our visualizations, go back to thinking without us, and potentially think differently from us about how we're using our data sources.
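One minimal way to cash out "trackable" in code, sketched in Python with invented step names: log every transformation applied to a dataset, together with a fingerprint of its output, and publish the log alongside the results so that readers can walk the steps back:

```python
import hashlib
import json

# Minimal sketch of trackable processing: every transformation applied to
# a dataset is recorded with a hash of its output. Steps and data invented.

log = []

def tracked(step_name):
    """Decorator that records each processing step and a hash of its output."""
    def wrap(fn):
        def inner(data):
            result = fn(data)
            digest = hashlib.sha256(
                json.dumps(result, sort_keys=True).encode()
            ).hexdigest()[:10]
            log.append({"step": step_name, "output_sha256": digest})
            return result
        return inner
    return wrap

@tracked("drop incomplete interview transcripts")
def drop_incomplete(records):
    return [r for r in records if r.get("transcript")]

@tracked("redact interviewee names")
def redact(records):
    return [{**r, "name": "REDACTED"} for r in records]

data = [
    {"name": "A", "transcript": "full text here"},
    {"name": "B", "transcript": None},
]
published = redact(drop_incomplete(data))
print(json.dumps(log, indent=2))  # the audit trail published with the data
```

The audit trail is what allows a reader to question any individual step rather than taking the final visualization on faith.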

And this is something I've tried to do for qualitative data. So I've been publishing, with of course a lot of work on the ethics of this publishing, some of the transcripts that came out of the interviews I've been carrying out, on Zenodo, one of the big repositories held by CERN. And I will certainly continue to do this on a larger scale as I continue empirical research in these kinds of fields. But it's just a small step. Of course, there's a lot of work to be done on how we set up our own infrastructures to make data journeys trackable, and on thinking about how much standardization we really need, and how this sits compared to the local solutions adopted by specific communities and domains in dealing with the same data. One of the speakers here, for instance, talked about this when discussing the ambiguity of standards, and in fact how fruitful it can be to really build on that when we're thinking about data infrastructures. And of course, think about values as much as

possible: think about the labor and the systems that we're relying on for responsible sharing, and the implications that sharing certain data can have in a much broader sense than just for our research. And this brings me to the second point, which is building responsible practice into the technical specifications of data infrastructures. I think that there continues to be relatively little work in HPS on the ethics of data sharing. Thankfully, many of you touched upon this in presentations during this conference. But I

think even in HPS there actually still is a strong attention to the technical side, which is not necessarily matched by attention to the side of responsible innovation: what it means to actually share these data ethically. And one of the big components of this, of course, is to think very carefully about which services we're actually using. What are we relying upon when we're doing this work? Are we using Amazon Web Services? Are we using Google Cloud? What kinds of software, what kinds of platforms, and what implications does this have? I think Sara Daly started to point in that direction in a talk about Twitter as a platform, but I think there's a lot of work to be done, especially in HPS, to really start to get a better awareness of what this means. And of course, that extends to publications, and to the whole question around open access and the dissemination of our own research results in our own journals and our own book series. And it's very important

to recognize this early and place it at the heart of technical work. And there are questions around how principles that have been trialed in the space of the natural sciences could actually apply to work in HPS; I think that's a long discussion, and I'm not going to spend much time on it now, because I'm almost at the end of my talk. And of course, there is a question around being careful in discussions of open science, and in particular being careful to problematize the idea that open science is always democratic and a wonderful thing. In fact, I think this is not always

the case, and especially when it comes to data sharing, we need to pay a lot of attention to what we are sharing and how. The third solution I'd look at is the long-term involvement of data providers. This is, of course, particularly relevant for those of us who work with living actors, people who are still alive and want to document their work, or where we're investigating research practices which are now ongoing. So what could direct engagement with these objects of our work look like? What benefits would it bring? I think these are very important questions at least to ask, even if one has no resources to really put this in motion in one's own work. Certainly, what I've seen in my own work over the years is that while I started many of my investigations by just looking in, right, just being the participant observer of some scientific initiatives, very often I become part of them, and now I would consider many of them to be collaborations rather than just case studies. And that's an interesting shift, which comes with all sorts of interesting accountability problems, but also advantages. And of course, a big question

And of course, a big question for all of you who are historians is: how does that apply to historical sources, if at all? A bigger question here is how we choose the topics that we're going to be investigating. Who do we talk to? Which audiences do we have in mind? Which implications do these choices have for audiences and publics? The fourth point, and this hopefully should be the strongest point for our field, is to try and encourage debate over the overarching goals and concepts of human development underpinning data sharing practices. I think in digital studies we're all pretty good at trying to think about this in relation to some of the sources we're using, and maybe the practices that we are analyzing. The question is: how does this apply to our own work?

I think this is really an important question for all of us to think about. And with this, I'll finally stop talking. Thank you very much for your attention, and I look forward to our discussion. Fantastic, thank you so much. While I wait for people to post their questions as the tape delay catches up with us, I wanted to ask about something related to what you were just touching on at the very end. I've been interested in these kinds of questions for a while now, especially these questions about involving various kinds of stakeholders. And I wanted to ask you, because I think you're better positioned than probably anybody I know to answer this question.

How have you, and how do you think we ought to, engage in these processes of building trust? Because I think that's so often at the heart of it: a population sees someone coming out of an academic research unit looking to talk about big data, and the shields go up, and they probably should, right? So what has that trust-building process been like? And I'm wondering what thoughts you might have to share about it. Yeah, thank you very much. That's a complex question, of course, and one has to be careful, because it is true that, partly because of the way in which research is organized and valued at the moment, the work that we do in looking at the meta level of what's going on in research is actually not valued as it should be, I would argue, and I think many of you will agree with me. So we're always starting as the underdog in these kinds of collaborations; it is often what anthropologists would call "studying up", particularly when we are involved in collaborations with very prominent scientists or particular local contexts. And first of all, of course, it completely depends on the very specifics of the situation. That's partly why I wasn't trying to give very detailed precepts here: they will change enormously given the situation one is in.

I can give a couple of examples from my own experience. Certainly the fact that I've built up even just a track record of working with some communities, spending in fact many, many years sitting on committees, helping out, and being part of the service structure of some communities, ended up giving me more credibility when it came to having discussions with people around the topics I was interested in, what they would think about them, and how we could think about working on them together. So indeed, there is a lot of personal work, I think, involved in setting up these networks of trust. But once a few of them are on their way, it seems to become easier, just because you're slightly better recognized as somebody who can be useful in that way. And it turns out, at least for pretty much all of the domains I've been working with, that people can indeed be doubtful or a little bit worried, as you said, when they see people who do digital studies start to muck around in their field.

But my absolutely overwhelming experience has been a lot of interest, and almost gratitude, about the fact that there was this interest and an opening to try and discuss things. And of course, I've also had my share of big critiques and big clashes with people. Some of them were very useful to me, because I think I really could have done things better to prevent them, and could have had preliminary discussions that would have avoided the need for that kind of big reaction.

In other cases, it's just in the nature of the game, and certainly now that I'm doing research in fields that are very politicized, there's absolutely no way you can avoid it. But again, there is also the question of building up an awareness and reflexivity about who the audiences are that interest you, because I think for me that has really expanded in the course of the last few years.

So when I started, my big aim, maybe 10 to 15 years ago, was to be able to talk to the people who were doing the semantics, really: the infrastructures are amazing, they're changing the way science is done, this is really where the future is, blah, blah, blah. So, very coherently with my own philosophical framework, I spent a lot of time trying to understand what they were doing, talking to them, understanding how they were thinking, and trying to think about what this meant philosophically. But as I got more and more exposed to some of the implications of adopting this framework, in terms of, you know, the inequity that data resources can cause, my attention also shifted to publics.

And to be honest, it's also a question of very often being invited to sit on committees. It's bizarre: at one point you start to realize that there are in fact more publics than you thought. So it's all part of the exploratory work of research, and it's wonderful that you keep discovering new people that you didn't know existed, who have jobs that you didn't know existed and who, in fact, turn out to be really, really important for the kinds of things you're interested in. So, you know, I'm sorry it's not a very systematic answer, but at least it gives a flavor. No, that's great, though. That's super helpful. Thanks. Yeah, I think, I think
