Okay, I think we can get started. So, hello everyone, and welcome to the third and final webinar - in the MeMAD webinar series, which is called - Industrialising Media Production: A Producer's Perspective. In today's webinar, we will present the results of the MeMAD project - in the context of facilitating audiovisual content production, - particularly the use of AI technologies, and we will also focus on real-life use cases. So, some practical notes before we start. First of all, this event is closed-captioned.
You can turn on the closed captioning by clicking the "CC" button. We are recording the event, - and we will make it available for later viewing on the MeMAD web pages. We'd also like interactivity with the audience. In the last segment of the webinar, - you'll have the possibility to comment and raise questions, - and you can do so either in text through the chat - or with voice, using the raise-hand feature - and waiting for me to call on you to speak. So...
A few words about today's programme. My name is Michael Stormbom. I'm a business director at Lingsoft, one of the companies in this project. In the project, I work on dissemination and business development activities. After my short introduction, - you'll hear from our keynote speaker, Matthieu Parmentier. After that, we will present more of the MeMAD project and its outcomes, - with two very interesting case studies. And finally, we leave room for audience participation and discussion.
So. I would like to hand over to our keynote speaker. (Matthieu:) I will start with a short four-minute video. I'd like to share it with you.
That will perhaps better support the rest of the discussion. Just raise your hand if you can't hear the sound - it's only music. Did you catch the video? (Michael:) That worked fine, thank you. (Matthieu:) That worked fine, great, because sometimes it can go wrong. So, let me now show you a bit more information about this.
And I'll just share a - little behind-the-scenes aspect of it. So, the goal of this work was to - advertise the benefits of AI for helping production and journalists. The situation in our organisation, France TV, - was that there was no AI anywhere, - or only a little bit of AI to - automatically catch the name of the company when they send us invoices. That was the only AI we had at France Télévisions until a few months ago. We started with this project I just showed you, - the advertisement campaign, - in order to grab some more interest from the journalists, - who are one of the biggest populations of workers - continuously creating content at France TV. So, for sure, we were not in a position - to push the use of AI - and massive data extraction by replacing anyone's job.
That's why we focused on this use case, - where we have 200 debates all around the country in the same weeks, - so that nobody is able to watch all these debates at the same time - and draw insights from them. That was a very good use case for AI to help journalists. And because it was also important to show - all the nice aspects of data and AI projects, - we started this project to underline the microservice approach, - which makes it very easy to jump from one use case to another - while constantly reusing what we develop - on each project. This is where we push, behind the curtain, - the use of microservice platforms. This is an open-source platform we co-developed at France TV. And with this, we have created - some simple microservices for face recognition, - OCR and collection of data in order to identify people.
This is where we combined our microservices with very good external tools, - such as the French speech-to-text from Vocapia - and the transcript editor proposed by Limecraft, - which we used intensively for this use case. And after that, because it was a specific use case - with journalists involved in a regional election, a local election, - we had to find a proper way to classify the different information. And this is where we developed a completely agnostic NLP pipeline, - but specialised to look at these local political debates. So, NLP, that is natural language processing.
We ran all the microservices concerning - the classification of terms, like the vectorisation of terms, - to group all the data that belong to the same topics. And then we used automatic AI classification to produce - an exhaustive classification of the terms, beyond what a human could do by hand. This is how we can then jump from big amounts of data - to big amounts of classified data for each debate. This is where we could call in the data journalists of France TV - to work with this information, to be able to draw heat maps - and bring some more insights from these debates around the country. I can show you a few examples of differences in public interest - across the different regions of the country.
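As a rough illustration of the term vectorisation and topic grouping described here, the following minimal Python sketch turns transcript fragments into TF-IDF vectors and clusters them; the fragments, the number of topics and the choice of scikit-learn are illustrative assumptions, not the actual France TV pipeline.

```python
# Minimal sketch: group debate transcript fragments that belong to the same topic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

fragments = [
    "Le maire propose de nouvelles pistes cyclables",    # transport
    "Budget municipal et baisse des impôts locaux",      # finances
    "Développement des transports en commun la nuit",    # transport
    "Hausse de la taxe foncière dans la commune",        # finances
]

# Turn each fragment into a sparse term vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(fragments)

# Cluster the vectors so fragments about the same topic end up together.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for fragment, label in zip(fragments, labels):
    print(label, fragment)
```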
We also had strange topics that were raised in some parts of the country - while being completely absent from others, and so on. What we could see as a general conclusion for this use case - is that, today, it's important to classify the information - we can grab from speech-to-text, facial analysis and video analysis, - like object and monument detection, shot classification, and so on. It's very important to give some meaning to all this. For this, we still have to create a user interface which is able to - adapt the treatment of our collection of metadata - in relation to the topic we are analysing.
In this case, it was an election, - but we have to change the tool for every use case we have. This is why we are currently creating - a specific interface to shape the data organisation for each topic. And this is also the occasion for us - to refine our natural language processing - so we can classify topics into multiple categories. And now, because we are dealing with live feeds and not only files, - which is the new aspect of this platform, - we can also provide insights continuously. We have added time-scaled functions - to this data analysis. As I just showed you, - we are also working on processing all these data in real time.
That was the good outcome of this very first project, which started one year ago. It was a big advertisement for the benefits of data and AI - at France TV, for journalists and data journalists.
Now, one year later, - what I can tell you is that this kind of data visualisation - can be requested for any kind of journalistic need. In the meantime, we have industrialised - this automatic metadata extraction - into the daily workflow of journalists at France TV.
This is the Dalet interface. And this is where we are today. I hope this was clear; I prepared it in a hurry.
But I'm really hoping for questions, if we have time for that. (Michael:) Yes, if anyone has questions for Matthieu, please ask at this juncture. I see that there... Does my voice sound okay now? There seems to be an echo. Here's a question: "What has been the reception of the data journalism?"
At the beginning, it was difficult to engage them properly - because of COVID, - even though COVID was what made it important for them to become important in the newsroom. Because before this, at France TV, - the data journalists weren't always well treated. It was seen as a sort of sub-journalism in terms of - the hierarchy of importance in the newsroom. But with COVID, it became important for them - to be involved in generating new ways - to show and explain the news.
That's why these results for the election - were very welcome at the top management level, - because when we came with this outcome, - the first phase of COVID was just behind us. The only feedback we had from the data journalists - is that they wanted to get the raw data out of the system, - which is easy with the solution I showed you, Kibana.
So, it's plugged directly into the raw data. What was also important for them was to - have a sort of interface between this and their own tools, - which are more design tools, used to create - beautiful graphics and histograms from the raw data. So, we think today it's very important - to make these data better available to data journalists, - maybe to create the right interface, the right tool for them, - to assist them in the creation of these graphics.
Because I don't know how it is in other organisations, but mostly - they work with an Excel sheet on one side and a graphics system on the other, - and then it takes a lot of time to make even simple graphics. So, here we give them a lot of data, very well classified, - but it still takes a long time to process the data and turn it into a beautiful graphic. I think there are new tools rising on the market, - something like a Google Docs approach for data journalists, - which help them create beautiful graphics - from massive data, easily and quickly. Okay, here's another question: "Are use cases contemplated for end-user application of the results of the analysis?" Not as such - this was the demo aspect of it.
So, when we use these data outputs to build - an actually published web article, - it is taken from the same data. But yes, - it is reprocessed with - more beautiful charts and graphics. This slide I showed you comes directly from the Kibana tool, - a free tool, which is very interesting - for quickly designing the kind of results that can be shown. But after that, if you want something in line with - the graphic charter on screen, - then you have to take the data - and rebuild everything in the publishing tool. Okay, any other questions for Matthieu at this juncture? If not, then I'd like to thank you, Matthieu, - for stepping in as keynote speaker on short notice.
It's great that you could make it here today. Thank you very much. (Michael:) I hand over to Maarten Verwaest, CEO of Limecraft, - who will present the MeMAD project and its findings. Over to you, Maarten. (Maarten:) Thank you, Michael, and thanks to Matthieu - for stepping into the really big shoes of Sandy MacIntyre, - the VP of Innovation of the Associated Press, - who had committed to delivering a speech as well, also in the news context.
These guys are also very keen on exploring what can be done with AI. I think we're running slightly short of time, - so we'll take this at speed. My colleague Dieter lent me his slides, - presented at the Production Technology Seminar of the EBU - just a couple of weeks ago, - where he presented the key findings of the MeMAD project - when it comes to using it in production processes. The title of this webinar is <i>Industrialising Media Production.</i> As we will discuss in a couple of minutes, - we're at the edge of solutions applicable in real life.
MeMAD, just to wrap up really quickly, is an R&D project - with these partners in the consortium. It's all about creating tools for content production, - media and entertainment, production archives, - and helping to improve productivity and efficiency. It's intended to bring AI to life, - because there are truckloads of microservices, as Matthieu mentioned.
They are available on the market: specialised players like Vocapia and Speechmatics, - wholesale providers like Google, Amazon and Microsoft, - more than you can test. There have been amazing steps taken in improving the accuracy of point solutions - like automatic speech recognition, machine translation and facial detection. But there's still a huge gap between the point solutions available on the market - and something which is usable for an editor or a producer. So, in MeMAD, we've asked ourselves the question: - how can we turn this into something usable and combine tools? What should a user interface look like? We specifically - put our bets on mixing state-of-the-art technology, - as provided by universities and fundamental researchers, - with usability experts answering the question: - what can we do with all this artificial intelligence? So, on the left-hand side, the shortlist of AI microservices - which has been on the drawing board.
And on the right-hand side, - an indication of where we've experimented with multi-modal fusion, - or smartly combining the tools on the left-hand side, - face recognition and voice recognition to detect people, - and, more down to earth, combining human and AI. Because it's obvious, almost an open door, - that when you combine the best of both worlds, - you have something which is more efficient - than a purely manual process - and more accurate than when done by machine alone. So, we've extended the prior art - on subtitling and localisation of content. This is, I would say, at the stage of making a business case. There are lots of technologies available, and there are alternatives, - so we can negotiate - what should be in the pipeline and what value it will bring. In MeMAD, we've quantified these benefits.
Later on, we will look at reusing archive material, - and we'll show that with real cases by KRO-NCRV and The Associated Press. When it comes to subtitling and localisation - there are a couple of subtitling experts in the audience out there - the key question is: what strategy works best? Should you first translate the transcript - and cut the translation into subtitles for localisation purposes? Or should you first cut subtitles and then translate the fragments? A more interesting and philosophical question with all this automation is: - how does it affect and impact the work of a professional subtitler? Which areas are left for improvement, - whether through academic or applied research? So, in MeMAD, we've done - objective testing of improvements, measured in time. We've not looked at counting word error rates - or evaluating diarisation and punctuation accuracy. We've pulled out our stopwatch - and measured efficiency improvements - when subtitling from scratch - compared to having machine translation give a first shot - and then doing manual post-editing afterwards.
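To make the two orderings just posed concrete, here is a toy sketch with trivial stand-in functions for ASR, machine translation and spotting; the 42-character line length and the naive segmentation are assumptions for illustration only, not the MeMAD pipeline.

```python
# A toy sketch of the two subtitle localisation strategies discussed above.
# The helper functions are trivial stand-ins, not real ASR or MT services.

def transcribe(audio):
    """Stand-in for automatic speech recognition."""
    return audio  # pretend the 'audio' is already its transcript

def translate(text, target_lang):
    """Stand-in for machine translation."""
    return f"[{target_lang}] {text}"

def segment(text, max_chars=42):
    """Naive spotting: cut text into subtitle-sized chunks."""
    lines, current = [], ""
    for word in text.split():
        if current and len(current) + len(word) + 1 > max_chars:
            lines.append(current)
            current = word
        else:
            current = f"{current} {word}".strip()
    if current:
        lines.append(current)
    return lines

def translate_then_segment(audio, lang):
    # Strategy 1: document-level MT first, then cut the translation into subtitles.
    return segment(translate(transcribe(audio), lang))

def segment_then_translate(audio, lang):
    # Strategy 2: spot the subtitles first, then translate each fragment.
    return [translate(s, lang) for s in segment(transcribe(audio))]

demo = "the key question is what strategy works best for localisation"
print(translate_then_segment(demo, "fi"))
print(segment_then_translate(demo, "fi"))
```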
Then, depending on the pipeline and the specific setup, - whether you translate entire documents or use sentence-level machine translation, - there are 30 to 35 percent improvements - in time spent on the overall process. However, there is always a however. We also did a subjective analysis of - how professional subtitlers felt, what they experienced as pleasant, - and what they found really annoying. And obviously, on balance, - alongside the gain in time, - what is left for the subtitler is the boring job of counting frames - and improving the spotting.
Notably, the correction of the timing, - or the correction of poorly spotted in and out points, - in the first release of the prototype - was perceived as the biggest hurdle to adoption. So, in the intermediate release, - we did a lot of work on the spotting and, by concentrating on it, - we've made it more pleasant than before. And in the second release, we did another experiment, - comparing subtitling - based on the raw output of automatic speech recognition - with translating subtitles based on existing subtitle templates. And this is very convincing. This is interesting.
The biggest source of errors comes from the automatic speech recognition, - and those errors propagate, they tend to escalate. If you start from poorly recognised audio and put machine translation on top, - it becomes worse and really difficult, annoying to post-edit. But when the subtitler can start from an existing subtitle template, - which is usually very succinct and compact, - then machine translation gives a result - which is subjectively perceived as very helpful.
So, in summary, when it comes to evaluating - artificial intelligence for subtitling and localisation purposes, - the key conclusion is, which is good, - that computer-assisted translation - renders your process more efficient, it's faster. But the remainder of the work is considered utterly boring. There is a lot of opportunity for improvement. It's better when the subtitler can start from an existing subtitle template; - then the subjective improvement is much greater. In general, when ASR, automatic speech recognition, is part of the pipeline, - despite improvements over the last 12 months - in terms of accuracy, punctuation and speaker segmentation, - it is still the main showstopper.
I'm not sure whether the technologies developed by expert companies - like Vocapia and Speechmatics still have much room for improvement. What we've experienced lately is that using specialised vocabularies, - depending on the context, - when we're in Formula One, including lists of the venues and the drivers, - and when we're in soccer, including lists of football players, - improves the accuracy of the recognition a lot and - has a disproportionately positive effect on the post-editing time. But these are recent developments. When it comes to using AI more, - can I say, upstream, - in the shot-listing and editing environments, - it becomes fuzzier to pin down where the benefits are. First of all, - MeMAD delivered amazing results - when using search indices - in combination with machine translation technology.
We've cracked the challenge of multilinguality in your archive: - using words in one language to retrieve content - which is indexed using another language. And here, too, we took a quantitative approach, - with a specific set of use cases and searches. We deliberately took a list - of use cases and processes the users were familiar with, - so they could comment in a way that managed expectations - and compare it with the tools they were currently using.
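As a toy illustration of the cross-lingual retrieval mentioned at the start of this point, the sketch below translates query terms into the index language before searching; the tiny in-memory index and the lookup-table "translation" are placeholders for a real search index and MT service.

```python
# Minimal sketch of cross-lingual retrieval: translate the query, then search
# the index, which is written in another language.

INDEX = {
    "clip_001": "president shakes hands with prime minister",
    "clip_002": "flood damage in the city centre",
}

QUERY_TRANSLATIONS = {  # stand-in for machine translation (fi -> en)
    "tulva": "flood",
    "presidentti": "president",
}

def translate_term(term):
    return QUERY_TRANSLATIONS.get(term, term)

def search(query):
    terms = [translate_term(t) for t in query.lower().split()]
    return [clip for clip, text in INDEX.items()
            if any(t in text for t in terms)]

print(search("tulva"))  # -> ['clip_002'], found via the English-language index
```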
The main feedback, - roughly, or to cut to the chase, - is that, in the initial release, - where metadata was shown in lists and taken into account as such, - AI effectively gives back a lot more metadata - than you could produce by hand as an archivist. But the conventional media asset management way - of showing metadata made it very hard to find the benefits. And we concluded, somewhere in the middle of the project, - that we needed a different user interface to show metadata properly.
And when we did a second iteration of the measurement of the perceived benefits, - first of all, in the meantime, we had worked a lot on improving - the accuracy of the speech recognition and the face recognition. The main observation was that object tracking and face recognition - are particularly useful when there is no dialogue. In other words, a lot of the searches or retrievals come from the audio domain; - a lot of meaning and a lot of results are triggered by words in the dialogue. But when there's no dialogue available, the visual aspect is helpful. Something we anticipated is that - AI returns a lot, but volume is not everything.
There is sometimes a questionable signal-to-noise ratio, - specifically when models are not properly trained. Sometimes the amount of garbage is larger than the meaningful results, - which hampers the usability of the assets we're looking for. So, the key recommendation was that, in order to make - all these AI services, these point solutions, usable, - we need a way of reconciling, of normalising all the data - into a single set of numbers, - and also a way to combine - human and AI to curate the results, making them more usable.
One of the key deliverables of MeMAD is this. It addresses that last observation and recommendation. It's a user interface in which the different dimensions, - faces, voices, labels, places, are normalised, - put on a timecode and made available for post-editing, - exactly as was the case with the subtitles.
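A minimal sketch of what such normalisation might look like, assuming invented detection records: scores from different services are mapped onto one 0-1 confidence scale and start times onto one frame grid. None of this is the actual MeMAD data model.

```python
# Reconciliation sketch: detections from different AI services arrive with
# different score ranges and time units; normalise them onto one scale.

raw_detections = [
    {"service": "face",   "label": "E. Macron", "score": 87,   "start_ms": 12000},
    {"service": "asr",    "label": "election",  "score": 0.93, "start_ms": 12500},
    {"service": "object", "label": "podium",    "score": 0.41, "start_ms": 11800},
]

def normalise(d, fps=25.0):
    score = d["score"] / 100.0 if d["score"] > 1 else d["score"]  # unify to 0-1
    frame = round(d["start_ms"] / 1000.0 * fps)                   # unify to frames
    return {"label": d["label"], "confidence": score,
            "frame": frame, "source": d["service"]}

timeline = sorted((normalise(d) for d in raw_detections), key=lambda x: x["frame"])
for entry in timeline:
    print(entry)
```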
If there is one conclusion, it is to not leave AI unattended: - AI-produced results need to be post-edited. So let's make it comfortable for the person who has to do the post-editing, - in a similar user interface. And then, as the last point, - we've also done experiments with the user base - whereby the results of artificial intelligence - were handed off to the edit suite, - hoping that editors would also find benefits - in the available metadata. But, to put it simply, - this is, in general, not the case, partially because - Avid Media Composer and Adobe Premiere can't cope with the overload of available data, - especially not if there is a poor signal-to-noise ratio. So, we need to rethink the process - prior to transferring metadata and injecting it into the edit suite. - In order to have real improvements, - we need to clean those data and make sure they are usable for the editor.
So, before we go to the practical use cases of these technologies, - just to wrap up: - in the area of subtitling, - we're at the point where it is possible to make a business case, - given that you have quite clean audio. So, no computer-animated graphics with synthetic audio, - but clean audio, not too much crosstalk, - preferably without the soundtrack. If you combine that with the latest generation of speech recognition, - preferably with custom dictionaries, it does a reasonably good job. And when you can start from an existing subtitle template, - you can benefit from machine translation, - if properly implemented in the user interface.
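To illustrate the custom-dictionary idea, here is a hypothetical sketch of how a domain word list could be attached to a transcription job; submit_transcription_job() and the additional_vocab field are invented placeholders, since each ASR vendor has its own API for this.

```python
# Hypothetical sketch: bias ASR with a context-dependent vocabulary.

FORMULA_ONE_TERMS = ["Verstappen", "Leclerc", "Zandvoort", "Imola", "DRS"]
FOOTBALL_TERMS = ["Mbappé", "Haaland", "offside trap", "VAR"]

def build_vocab(context):
    terms = FORMULA_ONE_TERMS if context == "formula1" else FOOTBALL_TERMS
    # Many ASR APIs accept a list of terms, optionally with phonetic hints.
    return [{"content": t} for t in terms]

def submit_transcription_job(audio_url, context):
    """Invented job payload showing where a custom vocabulary could go."""
    return {
        "audio": audio_url,
        "language": "en",
        "additional_vocab": build_vocab(context),
    }

print(submit_transcription_job("https://example.com/race.wav", "formula1"))
```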
When it comes to using artificial intelligence - to help with shot-listing, as we will see in a minute, - this is starting to become feasible. When looking at automated editing, - AI-based editing, - there are still some steps to take. Now.
Let's first have a look at... - ...The Associated Press case, which used - the technologies developed and explored in MeMAD - and was then presented as a proof of concept for an IBC Showcase. It's the ambition of The Associated Press, - which processes on average - 30 000 hours of original footage per year, - to go as far as possible in automatically describing - what's in the shots. When looking for Mister Macron, - it's easy enough to use facial recognition. It starts getting...
Speech-to-text works fine. Facial recognition - is a simple point algorithm that can be easily trained as well. It starts getting more interesting when you look for gestures, like - walking, coming down steps or shaking hands. And if we take these subclips - from the shaking-hands detector - and play them out back-to-back, - note that the in and out points - are not just random. So, not only has it detected the gesture, - it has also looked for the most convenient in and out points around it, - preparing for automatic shot-listing. We're not yet in the edit suite, - but we're taking steps towards - a better selection of content fragments - suitable for the edit.
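One possible way to pick those in and out points is sketched below: each detection interval is widened to the nearest shot boundaries. The boundary and detection values are invented, and this is only one plausible approach, not necessarily the one used in the showcase.

```python
# Sketch: snap gesture detections to shot boundaries to get usable subclips.
import bisect

shot_boundaries = [0.0, 4.2, 9.8, 15.0, 21.3, 27.6]   # seconds, from shot detection
handshake_detections = [(10.5, 12.0), (22.0, 24.5)]   # seconds, from gesture detection

def snap_to_shots(start, end):
    # In point: the shot boundary at or before the detection start.
    i = bisect.bisect_right(shot_boundaries, start) - 1
    # Out point: the first shot boundary at or after the detection end.
    j = bisect.bisect_left(shot_boundaries, end)
    out = shot_boundaries[j] if j < len(shot_boundaries) else end
    return shot_boundaries[max(i, 0)], out

subclips = [snap_to_shots(s, e) for s, e in handshake_detections]
print(subclips)  # -> [(9.8, 15.0), (21.3, 27.6)]
```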
Let's go back to the complete overview. There we go. And then... ...assuming such a cut list is transferred to - Adobe Premiere or Avid Media Composer.
Jumping a bit further. There we go. It's probably hard to read, but... It effectively results in a description of the shot on the right-hand side, - where the AI combines these sources, saying: - this is Trump as United States President, - or former President, I should say, - and this is Shinzo Abe, prime minister, in discussion.
All those data are shown as markers on a data track in Adobe Premiere. So, this is just to give you an overview of - where we got to with the state of the art. And as we explained earlier, - this needs a refinery to filter and better reconcile these data, - making sure that only the relevant data, or as close to only the relevant data as possible, - are displayed in the edit suite. Then, KRO-NCRV took it one step further.
They effectively raised the question of to - what extent it is currently possible to automate the editing process, - based on what we've seen to date. So, they labelled it "+Eddie", their new editor. It's completely artificially intelligent, - and they've asked to what extent it can be trained - to cut a given list of clips into an edited clip - that looks as if it were edited by a professional editor. Quite controversial. And it also began with shot-listing, - using speech recognition and optical character recognition, - interestingly, not only to detect concepts, - like poverty, in Dutch, on the right-hand side, - but also to get an understanding of non-domestic content.
Here is a German lady speaking on the left. It's subtitled with open captions. Optical character recognition indexes that content as well.
We experimented with a deep captioning system from Mobius Labs. They claimed they could detect higher-level concepts like - poverty and climate change. But the main result in that area was that - it was really hard to get consistent results.
It was easy to retrieve the frames - that had been used for training the engine. But with randomly picked frames, the recognition rate was too low - to be used in production. The more interesting question is: - what's the strategy for having this automatically edited? Valossa, this time, was used to aggregate - all the available data coming from the different sources.
Valossa then detects - concepts like "demonstration" and "interview", - similar to the gesture detection we discussed earlier. They add emotional tonality, - using specific words, pitch, et cetera. A really important step is to create - a blacklist of material fragments that should not be used in the edit.
KRO-NCRV wanted to get rid of studio content, - which is usually an introduction. And by looking at visual elements, - clothing, et cetera, they could identify and then remove the studio fragments. They then ended up with a long list - that was iteratively cut down, - taking into account the desired duration of the clip. And let's have a quick look - at the outcome of Eddie Plus, the automatic editor. It's...
...usually taking logical in and out points. The majority of the content is in Dutch. But let's say that in 20 to 30 percent of the cases, - the discovered or proposed in and out points are - valid for a machine but completely unusable - as real edits. The main conclusion here is that this is, in the best-case scenario, - good enough as a rough cut, as a proposed edit.
But this can't be published on social media - without being post-edited by a human editor - who can check the overall consistency. With that being said... I think the final judgement would be: - good enough for, - I would say, straightforward processes like - speech transcription cut down into subtitles; - that becomes feasible and useful enough to make a business case.
Making a shotlist and preparing rough cut edits - that can be done. But for the time being, - the editor plays a critical role in fine-tuning the results. So, we will need integrations with Premiere and Media Composer, - and we expect that specialised companies like Valossa, - and we know that IBM Watson and other companies, - are probably also working on automating trailers, - though that's not to say automating entire editing processes. So, mixed conclusions.
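A minimal sketch of the KRO-NCRV-style selection logic described a moment ago, blacklisting studio fragments and iteratively trimming the list to a target duration, using invented fragment data and a hypothetical relevance score.

```python
# Sketch: drop blacklisted (studio) fragments, then trim to the target duration.

fragments = [
    {"id": "f1", "duration": 25.0, "location": "studio", "relevance": 0.9},
    {"id": "f2", "duration": 40.0, "location": "field",  "relevance": 0.8},
    {"id": "f3", "duration": 35.0, "location": "field",  "relevance": 0.6},
    {"id": "f4", "duration": 50.0, "location": "field",  "relevance": 0.4},
]

TARGET_SECONDS = 90.0

# Step 1: blacklist studio material.
candidates = [f for f in fragments if f["location"] != "studio"]

# Step 2: iteratively drop the least relevant fragment until we fit the target.
candidates.sort(key=lambda f: f["relevance"], reverse=True)
while sum(f["duration"] for f in candidates) > TARGET_SECONDS and len(candidates) > 1:
    candidates.pop()  # the least relevant fragment is last after the sort

print([f["id"] for f in candidates], sum(f["duration"] for f in candidates))
```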
With that, Michael, - I'm handing back to you. (Michael:) Thank you, Maarten. So, I think we have a couple of minutes for - discussion and audience participation.
If you have questions or comments that you want to raise, - please use the chat functionality, or the raise-hand function - if you want to comment using your voice. "You did a great presentation, Maarten" - that's a greeting from an audience member here. A question for the audience - I'm not sure if we can use a poll, - we're not hundreds of attendants here. How does the audience... Does the audience believe -
automatic editing is on your radar - in the coming 12 to 24 months? Would you consider it - if it were made available by technology suppliers? Or wouldn't you take it into consideration? Feel free to comment in the chat. "We're working on auto-editing for - summaries, key moments and content-adaptive streams." Yeah. One case we cannot talk about explicitly, - but for another public service broadcaster in Europe, - we are working on a mixed use case, - something in between shotlists and automatic edits. It's about populating a live blog on the news websites. So, imagine there is live footage coming in - from soccer or from Formula One.
It's being reported from a radio studio with live radio. Then, AI will prepare automatic cuts, - judging that only clips between one and two minutes - are suitable fragments to be published, - looking at the concepts and for proper in and out points, - and automatically publishing them on the live blog as fast as possible. It's all about speed there. So, my expectation would be that rather than going for the full monty - and trying to achieve fully automated edits, - which is going to be disappointing, I think we'll see mixed scenarios - and hybrid cases in between.
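The live-blog gate described above could look roughly like this; the clip records, the concept list and the publish() function are placeholders for illustration only.

```python
# Sketch: push out only clips between one and two minutes that carry a
# relevant concept.

clips = [
    {"id": "c1", "duration_s": 95,  "concepts": ["goal", "celebration"]},
    {"id": "c2", "duration_s": 20,  "concepts": ["crowd"]},
    {"id": "c3", "duration_s": 140, "concepts": ["pit stop"]},
]

RELEVANT = {"goal", "pit stop", "overtake"}

def publish(clip):
    print(f"publishing {clip['id']} to the live blog")

for clip in clips:
    if 60 <= clip["duration_s"] <= 120 and RELEVANT & set(clip["concepts"]):
        publish(clip)  # only c1 passes both checks in this example
```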
I see a few comments coming in: - "key moments", "summaries", "teasers"... I think that's likely to happen. Yes, another question here: "I'm a subtitler. - When will I be able to input a terminology list for improved ASR? - I'm quite convinced this could be helpful with collaborative customers." That's a question from Mark.
I can confirm this is already possible. It's available in beta, - but it is possible, and we are testing it - even with a couple of the attendants on this call. So, Mark, we will reach out to you and check - how we can make it available for you. Another comment: "AI editing in sports, that's going to happen in five years."
Okay. We will take note of this comment, - and we will discreetly publish it, but it will be available. I don't think it will take five more years, - at least not for the in-between use cases. Any other comments or questions? A comment on football there: "No within five years." Another comment: "Definitely in Paris 2024, Olympic sports AI."
Okay, I think, Michael, - we should try to capture some of these judgments - in the MeMAD final conclusions, for our memoirs, - and make sure we can point back to them in due course. It will be interesting to see by 2024 whether these predictions came true. Another comment: "Great presentation, thanks." Another: "I think AI is best as a force multiplier." I'm not sure what he means by force multiplier. "Power tools for humans." I think that's quite in line with -
the suggestion in Dieter's presentation: - combine the best of AI with the best of human intelligence - to make more content faster and have a better result in a shorter time frame. "Interesting field of research." "Thank you for sharing your experiences." We have one more question coming in. Maybe that will be the final question...
...before we bring the webinar to a close. "I sometimes have bilingual input." (Maarten:) Ha-haa. Yeah, that is a very interesting question. "Is there bilingual speech recognition?" Most attendants live in Europe, so we know the drill. Some countries, like Finland, Switzerland and Belgium, - have several official languages.
As producers, we have to deal with multilingual audio. Mikko here in the audience will confirm that - cracking the challenge of detecting the language based on the audio alone - is a very interesting but incredibly hard topic. We are working on it. MeMAD has delivered some prototype results. Please stay tuned.
Have a look at the "memad.eu" website, - where the deliverables will be published. This is one of the areas we've been working on lately.
Yes, indeed. Another MeMAD member here confirms that - language identification for speech is in the works. Stay tuned. "Great presentations, interesting research, relevant for broadcasters." I think that would be a good note on which to start closing the webinar.
First, thank you to Matthieu Parmentier - for giving a very interesting keynote on very short notice. Thank you, Maarten, - for presenting the results and the case studies - that was very interesting. And I think, on that note, we can start ending the webinar. This webinar has been recorded, and the previous two webinars - in the series, on accessibility and on the research perspective, - have also been recorded and will be available on the website, "memad.eu". So, keep an eye on that if you want to see the webinars - and all of our other results.
So... (Maarten:) Thank you, Michael. (Michael:) Yes, so, thank you everybody. Have a good day or evening, depending on your time zone. Thank you, bye.
2021-03-16