Telling the big stories with a bit of help from AI - Pierre Romera Zhang

[Applause] Hello, hello. Thanks for this very nice introduction. I'm very glad to be here today, and I'm going to talk to you about ICIJ and how we use technology and different data sources to do our work.

First, a small introduction. You probably don't know me, and I think it's better that way. I'm the chief technology officer at ICIJ, and I've been working with leaks for 15 years. I started many years ago with WikiLeaks and other similar organizations, and since then I've been working on global investigations that involve a lot of journalists and a lot of technology, because this is probably what all those investigations have in common. I will give you my contact details at the end, but if you want to reach out after this conference, don't hesitate to send me an email. I'm on every social network; you can even contact me on TikTok if you want. I'm happy to hear your stories if you have any, or if you want to share any documents with us.

So let's start with ICIJ. I don't know how many people in this room have heard about the Panama Papers; can you raise your hand? Not bad. The Uber Files? More confidential, but still important. And ICIJ itself? Just a few of you. This is because we work in the shadows: most of the time you hear about Le Monde, The New York Times, The Washington Post, our media partners, but you don't really know about the organization that runs this kind of investigation.

So, we are ICIJ. It's a pretty old organization that started to shift to global investigations in 2014. Basically, we are an organization made of three groups: first the staff, like me; second the members; and third the partners. The staff is about 40 people, a bit more now, across four continents. We mix knowledge between journalism, data analysis, coding, design and IT support, and basically our role is to coordinate those big investigations.

ICIJ started in, I don't remember the exact date, 1997 I think. At first it was a group of members that basically exchanged tips and knowledge. It's very funny: at the very first workshop, 25 years ago, they all met to learn how to use PGP. The journalists in this group are all very famous investigative journalists in their own countries; in France we have Fabrice Arfi, for instance, and there are many famous investigative journalists from many countries. At the beginning they just exchanged tips, but they were not really working together. So, because ICIJ wanted to make use of that member group, they created partnerships. Those partners are journalists already working with us during an investigation; some of them are members of ICIJ, some are not. For instance, in France we work with Le Monde, with Radio France, with Cash Investigation, and most of them are not members of ICIJ; they just work with us during an investigation, and we consider them partners.

As you can imagine, it's a pretty big job to coordinate the work between that many people and that many organizations, and this is what we do at ICIJ. It is really our speciality, and I think we kind of gave everyone the playbook for coordinating this kind of investigation.
Before ICIJ there were some great efforts by different news organizations to work together, but really ICIJ made it professional. So today I'm going to talk to you about four, no, in fact five ICIJ investigations using AI. I will describe these investigations very briefly; if you want to know more about them, you can go to our website and you will find all our stories. I will explain how we used AI over the years, along with other technology, to enable investigative work.

The first one is the Implant Files, in 2018. It was a global investigation into medical devices. Medical devices can be many things, a pacemaker for instance. We realized that it was a very unregulated sector and that there were a lot of problems with those devices around the world. But when you don't have any numbers, when you don't have data, you cannot really give a diagnosis of a problem; it's very hard to know whether there is truly a problem with medical devices or with the regulation. So we worked on several data sources, and one of the most important came from the FDA in the US. Basically we got what they call adverse events, which are reports of events where a device is suspected of having caused serious injuries or even the death of the patient.

The problem with those adverse events was that they were all redacted; there was just text, and it's very hard to extract numbers and statistics when you only have text. There was also what we called under-reporting: instead of saying someone died, the reports would say something like "the patient expired", phrasing that was obvious for a human but much harder for a machine to analyze. On top of that, all the data in these reports was plain text with absolutely zero structured data, zero spreadsheets, zero columns. It was very hard to analyze.

That was the first time ICIJ used machine learning, to try to identify which reports were talking about someone who died and which were not. We also wanted to know whether there was some sort of discrimination, a population that was more often the victim of these problems, so we also tried to extract genders from the reports. We didn't do ethnicity, because most of the time it was not in the report, but that would have been an interesting aspect of the analysis. We listed a lot of terms that we knew were related to death, and we trained our machine learning algorithm to identify the reports describing a death and, in effect, read the descriptions for us. Thanks to this process we managed to extract more than 2,000 cases where the patient died.

It was quite a success for a first attempt, but it was also very expensive. This was 2018; at the time machine learning was good, but not as good as today, and it made a lot of mistakes. So in fact our reporters read all the cases that were flagged as positive by this analysis. Basically we did machine learning, and then we had reporters read the reports to confirm that the machine learning was right. You might think: what the [ __ ], why do you need machine learning if you read everything anyway? Well, it was an experiment, so we had to try, and we had to verify the work of the machine learning.
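To make this concrete, here is a minimal sketch of that kind of classifier, not ICIJ's actual pipeline: a text model trained on a handful of hand-labeled reports, whose positive hits are flagged for reporters to read. The example reports and labels below are invented.

```python
# A minimal sketch (not ICIJ's actual pipeline): train a text classifier on
# hand-labeled adverse event reports, then flag likely deaths for human review.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented labeled examples (1 = the report describes a death).
reports = [
    "the patient expired two days after implantation",
    "device explanted, patient recovered fully",
    "pt deceased, cause of death under investigation",
    "minor skin irritation reported at the implant site",
]
labels = [1, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # bigrams help catch "patient expired"
    LogisticRegression(),
)
model.fit(reports, labels)

# The score is a hint, not a fact: every flagged report still goes to a reporter.
for text in ["the patient passed away shortly after surgery"]:
    proba = model.predict_proba([text])[0, 1]
    verdict = "flag for reporter review" if proba > 0.5 else "skip"
    print(f"{proba:.2f}  {verdict}: {text}")
```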
Because ICIJ has one secret recipe: fact-checking. Everything we publish is fact-checked three or four times, which ensures that when we publish something, when we publish a number like this one, we are certain that the number is true. And the result is quite significant: ICIJ has been doing this kind of investigation for many years, and we have never been sued by anyone. We get threats, of course, we get threats every day, but nobody has been able to sue ICIJ, because every analysis we did was fact-checked and backed by a team of reporters.

A bit later, almost the same year I think, we did the Mauritius Leaks. We got tons of documents from an offshore law firm (we really like offshore law firms), and we had to coordinate the effort with our African partners to explore those documents. The problem at the time was the size of the team: we didn't have many partners in this project, something like 30, which is not a lot. Usually when we do an investigation it's more like 400 partners. So we really needed to explore those documents in a smart manner; we needed to create a way for those journalists to quickly identify interesting documents.

We worked with Quartz, a news media outlet. I think it has closed; it used to be great, but they drastically reduced the staff over the years and I think they don't exist anymore. At the time, though, they worked with us to identify similar documents. So what are similar documents? It's a very interesting aspect of the research: when you have many, many documents, you really need to be able to classify them. So we classified them to narrow down the research. We created several categories, like tax returns or business plans, and we trained our models to identify those documents. If a journalist wanted to get all the business plans from the leak, they were able to do it, and then do more research.

It was a very interesting approach, because for the first time, instead of using existing models and asking our own team to do the labeling, we involved the partners and asked them to help us flag documents by category. We did what is called supervised learning: many, many reporters helped flag documents, we taught the model that a document belonged to a certain category, and then we were able to scale and analyze all the documents automatically. As I said, this significantly accelerated the process of exploring the documents, and it was also the very first time we used machine learning not to produce any sort of analysis, but mostly to speed up the research.

You might wonder what this looks like. This is Datashare, our search engine for leaked documents. It's a technology we created many years ago, it's open source, and you can use it on your own computer. You can install it in a few clicks, especially if you are on Linux; it's super easy. But you can also run it on a server: ICIJ uses it on its servers to explore millions of leaked documents, terabytes of data. We started developing it in 2015, basically to distribute the work of reading the documents and putting them into an index. Here we created a very simple feature called tags, which you can see on the left, where we created clusters of documents. We classified some documents by cluster so that, as I said, journalists were able to quickly filter documents falling into a certain category.
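As an illustration of the "similar documents" idea, and not the actual model built with Quartz, here is a tiny sketch: vectorize every document, then rank the rest of the leak by cosine similarity to one a reporter has already tagged, say a business plan. The document texts below are placeholders.

```python
# A tiny illustration of "find documents similar to this one" with TF-IDF and
# cosine similarity; the real project trained models on documents flagged by
# partner reporters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [  # placeholders; in practice, the extracted text of every file
    "Business plan: projected revenue, market analysis and growth strategy",
    "Tax return for fiscal year 2016, taxable income and deductions",
    "Five year business plan with financial projections and strategy",
    "Minutes of the annual shareholders meeting",
]
seed = 0  # index of a document a reporter tagged as a business plan

matrix = TfidfVectorizer(stop_words="english").fit_transform(documents)
scores = cosine_similarity(matrix[seed], matrix).ravel()

# Rank the other documents by similarity to the seed.
for i in scores.argsort()[::-1]:
    if i != seed:
        print(f"{scores[i]:.2f}  {documents[i][:60]}")
```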
Then it got bigger. The year after that we got another leak; this investigation was called the Luanda Leaks. It was basically a lot of records, a lot of files related to Isabel dos Santos, the daughter of the former president of Angola, and very quickly we realized that she was using her power and her network to get money out of Angola, even public funds. But, as I said, again there were a lot of documents, and the problem (it's not a problem for Angolan people, but it is a problem for American or French journalists) was that the documents were in Portuguese. So how do you coordinate an investigation that involves hundreds of journalists if they don't speak Portuguese?

You might say: easy, let's use Google Translate, right? In fact we can't do that. First, we don't want Google to have our documents, and we are legally not allowed to do it, because it would be like sharing our documents with Google. And while we are protected because we are journalists, Google is not, so Google could be sued by Isabel dos Santos because at some point they processed our documents on their servers. Preserving the secrecy of the investigation while allowing many journalists to explore it was very important.

So we decided to use offline translation. At the time we used an open source technology called Apertium. It's very famous and very fast, but it's not very good at translating from Portuguese to English; in fact it has no model to translate from Portuguese to English. So we did something very dirty: we translated from Portuguese to Spanish, and then from Spanish to English. As you can imagine, the translation was very, very bad. Honestly, I'm not proud of it. But it was still useful, because with this translation and with Datashare our reporters were able to search through hundreds of documents: whether they typed words in English or in Portuguese, they were able to find documents. And even if the translation was not good, they could either ask a Portuguese speaker to help them or just try to make sense of it themselves.

Since 2020 we have really improved the technology we created to translate large numbers of documents. We are now using Argos Translate, which is much better at translating from Portuguese to English. So now the translations are great, but the models are also much slower, so it takes forever to translate that many documents. That means we still use Apertium when we need to be fast: if we have millions of documents we will probably need Apertium, but if we have fewer documents, and I will talk about that later, we can use Argos, which provides a great translation, as you can see.

You will see a lot of links at the bottom of the screen: we published the technology we use to translate all those documents. As you can imagine, we cannot run that on one single server; we had to distribute it. So we created a technology that distributes the computation between different servers, using a bus to share the work, and that extracts the text from an Elasticsearch index. We called it elasticsearch-translator. It's open source, like everything we do, and you can find it on our GitHub, like many other tools.
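To give a feel for what such a pipeline does, here is a rough single-machine sketch, not the actual elasticsearch-translator: scan an Elasticsearch index, translate each document's text offline with Argos Translate, and write the translation back. The index and field names are invented, and the Argos and Elasticsearch client calls follow their current Python APIs.

```python
# A rough single-machine sketch of "translate what's in the index"; the real
# elasticsearch-translator distributes this work across servers via a bus.
import argostranslate.package
import argostranslate.translate
from elasticsearch import Elasticsearch, helpers

# One-time, online step: install the Portuguese -> English model.
# Translation itself then runs fully offline.
argostranslate.package.update_package_index()
pkg = next(p for p in argostranslate.package.get_available_packages()
           if p.from_code == "pt" and p.to_code == "en")
argostranslate.package.install_from_path(pkg.download())

es = Elasticsearch("http://localhost:9200")

# Walk every document that has extracted text but no translation yet
# (index and field names are invented for the example).
query = {"query": {"bool": {
    "must": [{"exists": {"field": "content"}}],
    "must_not": [{"exists": {"field": "content_en"}}],
}}}
for hit in helpers.scan(es, index="leak-documents", query=query):
    translated = argostranslate.translate.translate(
        hit["_source"]["content"], "pt", "en")
    es.update(index="leak-documents", id=hit["_id"],
              doc={"content_en": translated})
```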
Then, the year after that, started the biggest investigation in journalism history, not because of the impact (although the impact was huge) but in terms of the number of people involved: more than 400 journalists worked with us during that investigation. We started to receive the leak in 2020, during the lockdown, but the investigation itself lasted almost two years. We had a lot of documents, almost 12 million, coming from 14 different offshore providers. An offshore service provider is basically a company that helps you set up an offshore company or offshore entity. There are many of them, and their customers are usually very rich people who don't want to pay taxes. So, 14 of them: great, lots of stories to tell.

The problem: all those files were pretty complex to read. Most of them were in English, that's easy, but a lot of them were just ink on paper: scans, handwritten text, or very long reports with a lot of information. So we needed, once again, to identify some very specific types of documents, and also to extract structured data from them.

For the first step, we again used Datashare to analyze the documents and search through them, and we used machine learning to identify documents by category, just as we did for the Mauritius Leaks, but more than ten times bigger. The problem with this machine learning operation was the cost. When you use machine learning algorithms you have to store vectors to speed up the process, and when we tried to store the vectors for all those documents, it was a massive amount of data. In the end, combining the storage and the computation, it cost ICIJ about $50,000 just to do this document classification. Not ideal, and not something I would do every day, but it was worth a try, because it allowed us to search through the documents faster.

As I said, we also wanted to extract structured data. It's important to understand that in such a big leak most files are PDFs, emails and Word documents; some are spreadsheets or databases, but that's a very tiny portion. And the problem with this kind of leak is that there is so much personal data in it that you cannot just release it to everyone; if you do, people will be in danger. In this leak we had the ID card of Shakira; we had personal emails, poems, love letters, that kind of thing. Very personal. You don't want to publish that on the internet, because people would be in real danger and there would be a huge confidentiality issue. Yet there are a lot of stories we cannot tell, because we are still a small organization, and even if we involve a lot of journalists, once the publication is out the work is kind of done; you don't go back to the leak so often. So we really wanted to extract a list of all the companies present in this leak and publish some sort of offshore registry. Offshore jurisdictions don't have a public registry of companies; that's why they are used for tax evasion. We had a lot of companies in this leak, so we wanted to extract them and publish them in what we call the Offshore Leaks Database; you have the link again here.

Anyone, you or any researcher out there, can use our data to do their own research, and it works. Very often we get requests from journalists because they found an interesting name, one that was not interesting last year but is interesting now. Just imagine all the new deputies who might be in our database, who were not very famous last year but might be famous next week. Those journalists contact us and say: I found an interesting result about this person, can you give me the documents, or give me access to the Pandora Papers? This process works very well. ICIJ is committed to giving as many journalists as possible access to these kinds of documents, and because we created this corporate registry, we are really able to start new collaborations on the basis of this database.

To do so (I skipped that part a little), we extracted structured data using machine learning again. Basically, we trained our models to recognize the name of the company, the name of the officer (the person who holds a role in the company) and details like the address, and in the end we were able to publish all that automatically extracted data on this website.
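ICIJ trained its own models for this on its own data; purely as a generic illustration of the technique, here is what named entity extraction looks like with an off-the-shelf spaCy model. The sample text is invented.

```python
# Entity extraction with a stock spaCy model, to illustrate the technique;
# ICIJ trained custom models on its own labeled documents.
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("Acme Holdings Ltd was incorporated in the British Virgin Islands "
        "in 2012. John Doe is registered as its director at 1 Harbour Road.")

for ent in nlp(text).ents:
    if ent.label_ in {"ORG", "PERSON", "GPE", "DATE"}:
        print(f"{ent.label_:7} {ent.text}")
```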
As I said before, our secret sauce is fact-checking. We produce a lot of analyses from a lot of documents, but before we publish, we need to check that it's true. Because this process is very hard and very long, we decided to create a platform to make it a bit more playful; we kind of gamified the process of fact-checking data. We created an open source platform called Prophecies. It's very simple: it's based on Django, you can run it on your own server, and it basically offers you a way to upload a spreadsheet together with the list of values to verify. You say, for instance: I want to identify countries in this spreadsheet. The reporters are then able to say whether each value extracted using machine learning is correct or not. It's that simple: just verifying the data. And because we built this platform, the reporters in our team managed to verify all the records we extracted with machine learning pretty quickly. It was fun for us (it's funny to build some sort of Tinder, but for fact-checking), and it was also fun for them, because they had this nice interface they could use on their phones. I remember one of our data journalists working by the pool, just swiping on her phone to verify the machine learning results. We are still using Prophecies for many other projects, but we really created this nice and friendly interface for the Pandora Papers.
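Prophecies' real code is on ICIJ's GitHub; purely as a sketch of the underlying idea, and not its actual schema, each machine-extracted value becomes one small yes/no task a reporter can confirm or reject, something like this Django model.

```python
# Not Prophecies' actual schema, just a sketch of the idea: every extracted
# value becomes one small task a reporter can confirm or reject from a phone.
from django.db import models

class VerificationTask(models.Model):
    class Status(models.TextChoices):
        PENDING = "pending"
        CORRECT = "correct"
        INCORRECT = "incorrect"

    source_document = models.CharField(max_length=255)  # where the value was found
    field_name = models.CharField(max_length=100)       # e.g. "country", "officer_name"
    extracted_value = models.CharField(max_length=255)  # what the model extracted
    status = models.CharField(max_length=10, choices=Status.choices,
                              default=Status.PENDING)
    checked_by = models.CharField(max_length=100, blank=True)
```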
Last year we released another investigation into offshore activities, but this time in Europe, in Cyprus. It was called Cyprus Confidential, and we got our hands on a lot of documents that basically showed that Russia was using Cyprus as a gateway to Europe: buying European passports, setting up companies, whatever you need in Cyprus to access funds directly in Europe. The problem, again: a lot of documents were in Greek and Russian, so we needed to translate them. The difficulty this time was that, because of the way the documents were formatted, it was very hard to detect their language. We had a chicken-and-egg problem: we OCR a document to turn an image into text, but to do so we need to know the language of the document; and to know the language of the document, we need to extract the text. It's almost impossible. You can have a very efficient OCR technology, like the one we use, but in that case it just wasn't working.

So we created another open source technology, which we call PLD, for PDF Language Detector. It OCRs the documents in several languages (in our case Russian, Greek and English) and then uses the confidence level to decide what the language of each document is. Thanks to this process we were able to identify the language of every document correctly, so we were able to OCR them correctly (search results are much better when you recognize Cyrillic as Cyrillic), and we were able to translate them, obviously, using the technology I mentioned before. So again, machine learning helped us speed up a process that could have been done by hand; because we had so many documents, it let us do it in a much more realistic time frame.
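I won't vouch for PLD's exact internals here, but the confidence trick can be sketched with Tesseract like this: OCR the same page once per candidate language and keep the run with the highest mean word confidence.

```python
# A sketch of the confidence trick (not PLD's actual code): OCR the page once
# per candidate language, keep the run with the highest mean word confidence.
from statistics import mean
from PIL import Image
import pytesseract
from pytesseract import Output

def detect_language(image_path, candidates=("eng", "ell", "rus")):
    image = Image.open(image_path)
    best_lang, best_score = None, -1.0
    for lang in candidates:  # needs the matching Tesseract language packs installed
        data = pytesseract.image_to_data(image, lang=lang, output_type=Output.DICT)
        confidences = [float(c) for c in data["conf"] if float(c) >= 0]  # -1 = no text
        score = mean(confidences) if confidences else 0.0
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang, best_score

lang, confidence = detect_language("scanned_page.png")
print(f"best guess: {lang} (mean word confidence {confidence:.1f})")
```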
As I said, we have a lot of open source technology. I'm a strong believer in open source; it started many years ago, and I still think it's the best thing that has happened to the world since computers exist. ICIJ really wants to follow that philosophy and publish as much as possible of the source code of what we do. So, as I said, we created the Datashare search engine, which you can find at this address and download to your computer. Sometimes it works, sometimes it doesn't; please be patient. It's not that easy to set up this kind of search engine on a computer; keep in mind that when we use it, it usually runs on dozens of servers, so it's not easy to fit that kind of computational power onto a laptop. We follow a philosophy we call extreme scalability, meaning we designed this software so it can run on a small laptop with little memory, but it can also run on hundreds of servers. So when we get a leak like the Pandora Papers, we already have the tools to analyze all the documents. And if you ever have a leak (there are so many now; just spend 30 minutes on the dark web and you will probably find something to download), you can use Datashare to help you navigate the documents.

Our primary focus is unstructured documents: PDFs, Word files, images, things that are not so easy to read for most computers, but that Datashare is able to read pretty quickly and pretty efficiently. We support something like 2,000 different file formats, so when you have a leak with a jumble of different file formats, Datashare is able to read them. It's at the very center of what we do. It performs OCR using open source technology like Tesseract; it does named entity extraction using open source models, so you are able to identify names of people, organizations and places; and it basically lets you search through your documents with a bunch of filters. As I said before, we use our open source translator to translate the documents; this is also a technology you can run on your own computer. You have to be a bit familiar with the command line to use it, but in just a few clicks you will be able to run translation on your Elasticsearch index, and it will just work. I guess it's a pretty big plus for us.

So, what's next? There are so many leaks out there, so many different kinds of data that we still struggle to analyze. What we really want to do in the future is make Datashare a platform that you use not only to read your documents but also to perform your own analyses. You won't have to care about whether you can read this PDF, this image, this Word document, this huge mailbox you have access to: Datashare will extract everything correctly for you, as it already does, but then it will offer you an API and a sandbox so you can run your own analyses over the documents. Let's say you want to identify different clusters in your leak: Datashare will offer you an environment where you can run your own analysis. If you want to use Java (I don't think many people will use Java, but anyway), you will be able to do it; if you want to use Python, JavaScript, whatever, Datashare will offer a space for that. We also want to create some sort of services around Datashare, but that's not our priority right now.

Thank you very much, and if you have any questions... [Applause] And I forgot to say: we are an NGO, we are donor-based, so if you want to support our work, I encourage you to go to that link.

[Host] So, do we have some questions?

[Audience] I was wondering how you get the data you use for your investigations. Do people send it to you, and if so, do they use specific tools? Is Datashare usable to share data with you? Thank you.

[Pierre] It's a big topic. I spend a lot of my time meeting sources and traveling to pick up a hard drive, then just going back home and putting it into Datashare; that's mostly how we start an investigation. But as I said, ICIJ is also a network of members, so there are journalists around the world who have their own documents. Very often they come to ICIJ and say: hey, I have two million documents, I don't have the servers or the technical skills to analyze them. They just hand them over to ICIJ, and that's how an investigation starts. Of course, not every leak leads to an investigation; this is why we have so many reporters at ICIJ. But I think the best way to start an investigation is probably to share documents with us. We do some investigations that are not based on a leak; the Implant Files I mentioned before were based not on a leak but on public data and data we scraped from public websites. Most of the time, though, if you have a leak, that's probably the best way to work with us. We work only with news organizations (that's one of the reasons we are protected), but our sources can be anyone: insiders, hackers, whoever. And I think you asked about how we review the data. Basically it's very manual work. When you have so many reporters you can distribute the effort of exploring the documents, but that's it; there is no magic formula for reviewing the documents. You just have to go one by one and read them all.
[Audience] Hello, thank you for your presentation and for the work of ICIJ, which is very important for our society, so please continue. What I want to ask is about the findings the journalists make: I wonder whether some of the people targeted by these leaks want to target you in turn and steal the information you have found, and how you deal with that. I imagine you protect yourselves as much as any company does.

[Pierre] Every investigation starts with threat modeling. We have to know who our enemies are and what resources they have, and in most cases, for all the investigations I mentioned, they don't have that many resources, because our enemy is not the NSA. Yet. At the beginning of every investigation we try to assess what the risks are going to be and what security measures we are going to take. I didn't really talk about it, but in fact we are pretty strong on that point; I mean, we do the best we can. And every time we publish something concerning one of those actors we get DDoS attacks, we get intrusion attempts, so we are very careful. It's almost never possible to know where an attack comes from, so most of the time we just guess, but very often it's the same kind of actor. Like, you know, Russia.

[Audience] Hello, do you have a common language structure to exchange between journalists, like STIX in cyber security for example, or is it just the Datashare model that everyone uses and that has become the de facto standard?

[Pierre] I'm not sure I fully get it, but I'm going to try to answer anyway; tell me if I'm wrong. All the models we use are open source and published with Datashare, and the ones we use to analyze the documents actually come from other organizations; we don't have our own models. But as you can imagine, when you have so many documents you can train models on very unique datasets, and that's something we are also trying to do. Currently we are working with the OsloMet university to build a model that can detect passports and basically extract the name of the person, the country, the date of birth, the photo, so that when we have a leak, thanks to this model, we can quickly recognize passports. The problem is that we cannot really publish a model that is trained on confidential data, because there are ways to find out which documents were used to train a model; you can retrieve some of that information by analyzing the model. It's not trivial, but it's possible, and it's not a risk we are willing to take yet. That's why this kind of model is still not open source. The technology to train the model, however, is open source, so if someone has their own dataset, they can train the model themselves using our technology.

[Audience] Thanks. You mentioned you have a few dozen people as staff and a few hundred as members. Do you have a ratio of technical people in both of those groups? Because it seems like a huge amount of work.

[Pierre] In the member network it's close to zero technical people; it's just journalists, and some of them are not from the same generation as me, so they don't really use technology to do their work and don't have much interest in it. But at ICIJ, the organization, I think it's a third, no, even half of the organization that is technical staff, meaning developers, data analysts, data journalists, IT people. I think it's pretty unique to have a news organization where such a big part of the staff is technical.

[Audience] Countries can see you as a threat...
[Pierre] A bit louder, please?

[Audience] Countries can see you as a threat. Is a secret agency a partner or an enemy for you?

[Pierre] So the question is: countries can be our enemies, and intelligence agencies are either a threat or a partner for us, am I correct? I would say an enemy, without any doubt. We never share anything with governments; we don't share anything with intelligence agencies. We know that some intelligence agencies use our open source technologies, because they told us, but we don't want to work with them; that would create a very dangerous precedent for the journalists who work with us, and we will continue that way. One of the reasons we publish the Offshore Leaks Database is precisely that we don't want to work directly with government entities. But we know that when we published the database, the tax authorities, for instance, took the data and ran their own research. In France alone, it's estimated that they managed to recover 400 million from taxpayers just using our data, just using the data we published. Worldwide, we estimate it's between 1.5 and 2 billion, but we don't really know, because they don't always communicate; the US, for example, doesn't say how much it recovered using our data. But we will continue to publish data as much as we can, because we know it can have an impact even after the work of the journalists is done.

[Audience] I was wondering which kind of infrastructure you are using: is it on premise, or do you do both, with some cloud provider? And another question, related to something you said earlier: you said you have multiple sources, and I wanted to know how you assess your sources and whether you have a way to ensure that nobody is trying to poison your data.

[Pierre] Right. So we have cloud infrastructure, and other infrastructure in secret locations. Our cloud infrastructure is Amazon, mostly because it's the easiest option we have (not the cheapest, but the easiest), and also because it allows us to scale: if I wake up in the morning needing to analyze 10 million documents, I can do it in a day. But we also have other infrastructure for more sensitive documents. Everything is encrypted, obviously, but this is probably one of our biggest weaknesses: the fact that we have to use a server provider, whether it's AWS or Google Cloud or whatever, means we have to trust someone else. We know there are some friendly organizations, like FlokiNET in Iceland for instance, that offer great protection for journalists, and we try to use that kind of organization as much as possible; but when we do, we don't make it public, because we don't want to attract people to their servers.

To answer your second question: yes, some of the sources we get might have an agenda. An agenda is always an issue, and every time we have to assess it. But the question is: what's the public interest? When we receive data, for instance from hackers or ransomware organizations, they benefit from us working on the documents they share, because that helps their business; it creates some sort of pressure on their victims. But if the data is important for the public interest, we will still investigate it, and we will still try to find interesting stories for our readers. Sometimes we get leaks from different sources and we decide not to use them.
Of course we do some research first, but we decide not to use them because we realize that the public interest is too small, or not worth the risk. And in situations where we know there is potential interference, that's where the fact-checking part is very important. Very often people think that because we got an email from someone saying they want to create an offshore company, we are going to write it in the paper. That's not how it works. We get an email from someone who wants to do that, then we investigate, then we verify the information, and only then are we able to publish it. For instance, if you take Shakira, or any famous person in the leak, of course we find them much faster, because their names are known; but once we have this kind of information about this kind of person, we still need to run the same verification we would run for any other story. We have to verify that the company exists; we have to ask the provider that set it up whether they did it or not. There are many laborious steps before publishing the story.

[Host] Do we have other questions? Nope. Well, feel free to send all your data leaks to ICIJ. Thank you very much, Pierre.
