ACDH-CH Tool Gallery 8.1 – Demonstration: Applying VOICE technologies to other data
So if you remember what I told you about the tech stack of the previous VOICE releases, you can imagine we asked ourselves: Should we use eXist-db again? And also: Can we reuse the UI? We had our doubts. There is the well-known problem of resource consumption in Java applications, and the most recent eXist-db is no exception. On the plus side, eXist-db's search speed in TEI-based corpora is generally very good, if a bit tricky to configure correctly. As for the UI, we have experience with Vue.js and Vuetify at our institution, but no experience with QOOXDOO,
so we chose the first two to implement the UI. We saw that NoSketch Engine is very capable: it can search corpora that are hundreds of times as large as VOICE, and it is very fast at that. It does not suffer from a tight binding to the JVM, it usually needs far less RAM to do its job in the first place, and it hands the RAM it used back to the operating system. This is exactly what is needed in today's shared hosting environments. Shared hosting environments
tend to bite you if you are on the same server with other resource-intensive services. The most recent version of NoSketch Engine is essentially an API that ships with a quite elaborate React-based UI. We take this approach and use it to our advantage: we get search results from NoSketch Engine, and then, in the browser, we use TEI/XML snippets to generate various renderings like the familiar VOICE style or the POS style. The code running in the browser can also provide
you with downloadable files in text format or table format, even files in Microsoft's native XML format. We developed this translation from TEI/XML to the other formats with the help of a lot of great libraries, so we could concentrate on the VOICE-specific logic. On the server side we use the API and the very powerful corpus query language of NoSketch Engine, a query language which, as you saw, is actually overwhelming for many people. So instead we accept a query language that is easier to use and looks more like what you are familiar with from VOICE, and then we translate this to the real CQL. As a side
effect, it is easy for us to add the parts that are needed to ignore the pseudo-tokens. We can exploit the design of the corpus to provide the user with easy means to express different kinds of annotations, most prominently the lowercase/uppercase distinction, for example for the POS tags. The downside of this translation is that we needed to write a search manual, and I would
suggest you take a look at it, because it is a good guide to what you need in order to use our search, and I also hope it is an inspiration for what you can search for. One thing that is rather limited in NoSketch Engine is its text rendering capabilities. That is one of the main reasons why we use TEI/XML and then render from that. We use Node.js to keep the XML on the server and deliver it. Now, Node.js is by no means an XML database, but the only kind of XML query we need to answer is which snippet of TEI/XML contains the xml:ids returned by NoSketch Engine. With a little pre-processing,
this is an easy task in JavaScript. On startup the service reads the VOICE XML files, pre-processes them and keeps them in memory, so it can access parts of them very quickly and efficiently. On request the service can also give you an utterance by its ID. For performance reasons the browser does not keep the XML or the rendering of things that you cannot currently see. This is true for big search results as well as when you browse through whole files.
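Just to give you an idea of what I mean - this is not our actual code, and all the names in it are made up - the pre-processing essentially builds a lookup table from xml:id to the TEI snippet that contains it:

```js
// Minimal sketch (not the real VOICE service): index utterances by xml:id
// so snippets can be looked up without a real XML database.
const { readFileSync, readdirSync } = require("fs");

const utteranceById = new Map();

function indexCorpus(dir) {
  for (const file of readdirSync(dir).filter((f) => f.endsWith(".xml"))) {
    const xml = readFileSync(`${dir}/${file}`, "utf8");
    // Grab every <u xml:id="...">...</u> snippet with a (simplified) regex;
    // the real service does proper pre-processing, this only shows the idea.
    for (const match of xml.matchAll(/<u\s[^>]*xml:id="([^"]+)"[\s\S]*?<\/u>/g)) {
      utteranceById.set(match[1], match[0]);
    }
  }
}

// A request handler then only has to answer: which snippet contains this xml:id?
function getUtterance(xmlId) {
  return utteranceById.get(xmlId) ?? null;
}
```

With a map like that in memory, answering a request for an utterance by its ID is a single lookup.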
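And to come back to the query translation for a moment: conceptually it works roughly like the following sketch. It is not our real parser; the attribute names and the underscore marking of pseudo-tokens are just assumptions for the example.

```js
// Sketch only: translate a simple, VOICE-like query ("you know") into CQL.
// Assumptions for this example: tokens are matched on the "word" attribute,
// uppercase terms are treated as POS tags, and pseudo-tokens are marked with
// a leading underscore (the real marker may differ).
function toCql(simpleQuery) {
  const skipPseudo = '[word="_.*"]*'; // allow any number of pseudo-tokens in between
  return simpleQuery
    .trim()
    .split(/\s+/)
    .map((term) =>
      term === term.toUpperCase() && /[A-Z]/.test(term)
        ? `[pos="${term}"]`   // uppercase: search by POS tag
        : `[word="${term}"]`  // lowercase: search by word form
    )
    .join(skipPseudo);
}

// toCql("you know") -> '[word="you"][word="_.*"]*[word="know"]'
```

The real translator of course does quite a bit more than this; the search manual describes what it accepts.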
Some of our tech stack decisions have both advantages and drawbacks. VOICE has always been TEI/XML data, but users see a markup made of parentheses, punctuation characters, uppercasing, and so on. Previously, with the full XML tech stack, we had XSL readily available to provide a transformation from XML to HTML with the VOICE markers. Our current tooling is not XML-based at all. This leads to more predictable resource consumption and is better known to many software developers today. For this toolset, XML is just text data that has a special meaning only to the human eye. We don't know a good, small, and fast XSL processor for all browsers, so in the end we ended up with XML processing in JavaScript in your browser. This is fast, and no
server resources are needed for that. We think it will be much easier to find maintainers for this in the future - we hope so - than for complex XSL transformations. In this project our browser UI specialist was not savvy in XSL, so he actually came up with what you see and use right now in JavaScript only. I think what he did is pretty amazing, but it has one drawback: the JavaScript code he wrote, under time pressure, I admit, is far from easily adaptable to new projects or easy to maintain. If we build on what we created for VOICE
for other projects, we need to give this some more thought. Vue.js and Vuetify enabled us to build the UI you see now, with all the functionality we provide, in a very short period of time - that is my impression at least. We could experience firsthand how much a modern framework helps with getting the UI you want and need: you just supply the logic needed for your project and let the framework take care of the details of the building blocks. So to sum it up, VOICE Online 3.0 consists of three services. First, the search engine NoSketch Engine, which runs in the background and is generally not accessible from the internet directly. We package
the NoSketch Engine distribution together with the VERTICALS generated from the VOICE 3.0 XML data into a container image. Second, an XML file and snippet server plus the query translator service, written in JavaScript using Node.js. This was developed using CI, Continuous Integration, with everything automatically built and set up just by pushing changes into a git repository. This approach uses container images. The third service is the browser-based UI, which was also developed using the CI approach. This git repository is also built into a container image
after every push of a change. The image then serves the optimized code for the browser using a very small and fast web server. We provide all parts of VOICE as source code and also as container images. We use those images ourselves, but it also means that if you want or need to run, say, a heavily used instance of the VOICE 3.0 search interface, you can do so on every x86 platform capable of running containers. This can be your workstation with Docker Desktop - which is a commercial tool, so please be aware of the license restrictions - or any current distribution of Linux, or you can use cloud providers that run containers for you.
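By the way, to make the earlier point about rendering in the browser a bit more concrete: the idea is roughly the following sketch. It is not the actual VOICE UI code, and the rendering rule is reduced to plain text here; the real code applies the VOICE or POS conventions instead.

```js
// Toy sketch of browser-side rendering: parse a TEI snippet and turn the <w>
// tokens into text. A real renderer would emit HTML with the VOICE markers.
function renderSnippet(teiSnippet) {
  const doc = new DOMParser().parseFromString(teiSnippet, "application/xml");
  const words = Array.from(doc.getElementsByTagName("w"));
  return words.map((w) => w.textContent).join(" ");
}

// renderSnippet('<u xml:id="u_1"><w xml:id="w_1">er</w><w xml:id="w_2">yes</w></u>')
// -> "er yes"
```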
But can I use this for my own data? That is what I asked myself not long after the release. There are a few prerequisites your data needs to fulfill. It doesn't hurt if the annotations are not as complex as in VOICE, but you certainly could use our software with data that is annotated with the same level of detail. So what are the criteria your data should fulfill? It needs to be XML, and it should be TEI/XML encoded according to the Guidelines, chapter 8, "Transcriptions of Speech"; especially a segmentation into utterances, u-tags with xml:ids, is necessary. The individual TEI documents also need an xml:id on the TEI root tag. For other data,
an encoding using a sentence-like division of text, s-tags with xml:ids, would probably work, but I didn't try that. The text needs to be tokenized on whitespace boundaries and all tokens need to be in w-tags with xml:ids. Xml:ids should be unique within the whole corpus and they should be stable. Your data should be lowercase only, or case should not matter to you. This is a
prerequisite for the simple query language you saw to work, as anything uppercase is interpreted as searching for a word with that POS tag. You could probably get around this requirement by adapting the parser of the simple VOICE query language. The UI also uses some metadata to display information about speakers or the tree of speech events or documents in your corpus. How to create this JSON-encoded metadata is beyond our scope today. If you have any questions, please ask them in the chat, we will talk about them later.
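Just to make these prerequisites concrete, an utterance that would fulfill them looks roughly like this - the xml:id values are of course invented:

```xml
<u xml:id="d01_u_17" who="#S1">
  <w xml:id="d01_u_17_w_1">so</w>
  <w xml:id="d01_u_17_w_2">you</w>
  <w xml:id="d01_u_17_w_3">know</w>
</u>
```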
I would like to show you a prototype example using the VOICE architecture for the TuniCo corpus. Coincidentally, I started my work here at the Austrian Academy of Sciences with a corpus of spoken language. In this project, together with the University of Vienna, we collected samples of the language spoken in the greater Tunis area in Tunisia in the early 2010s. It is transcribed according to the standards of Arabic studies here in Europe. In this
transcription, case usually does not matter. Of course, at the time, the corpus data was encoded according to chapter eight of the TEI Guidelines. The overall encoding structure is not as deep as that of the VOICE 3.0 XML corpus, so this makes it easier to process.
What makes a few changes necessary is the use of many more Unicode characters. The audio is available, but it was segmented on an utterance basis, not on a whole-text or speech-event basis. This means the UI needs to be adapted to accommodate this. Extracting the metadata was a task very specific to that corpus, because the metadata is stored very differently. And with this short introduction I would like to quickly demonstrate live how I went about adapting the VOICE tool set to the TuniCo corpus data. For dealing with XML data and transforming it I really recommend Oxygen XML; it is just very good at giving you all the tools to run transformations and debug them easily.
The transformation here is not very hard to understand, and I will come to it in a moment. We previously heard about the need for VERTICALS in NoSketch Engine, and that they can carry any kind of data attached to a particular token - but we have to define that, and for this there is a text configuration file that looks like this. We will also provide this configuration file for the VOICE corpus; we want to release a zip package of all the data and some transformations of the data, and it will be in there. So what do you see here? A few configuration settings that the service needs for finding the data and/or creating the initial binary representation of the VERTICAL files, and then the definitions of how the data in each token should be addressable, if you want. In what I want to show you, you still have the word, you have the ID, which is very important in our workflow, and then in the TuniCo corpus I can easily get a lemma and a part-of-speech. We don't have something like functional and
other part of speech. This configuration is something that, once you have written one, you can often just copy around and adjust a bit. So now we have a configuration file; next we need VERTICALS. How do we get them? In VOICE there is a Perl script for that, because Hannes is very savvy at Perl, but of course you can create these VERTICALS with Perl or with Python, if you prefer that.
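To give you a rough idea of what such a configuration file contains - this is only a sketch with invented paths, using the attribute names I just mentioned, not the file we will actually ship - it is something along these lines:

```
NAME "TuniCo"
PATH "/corpora/indexed/tunico"
VERTICAL "/corpora/vert/tunico.vert"
ENCODING "utf-8"

ATTRIBUTE word
ATTRIBUTE id
ATTRIBUTE lemma
ATTRIBUTE pos

STRUCTURE u {
    ATTRIBUTE id
}
```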
Here I think I can easily show you how such a VERTICAL and the XML files are related. For example this is one of the transcribed texts - sorry, no, this is the metadata which is, as I said, totally different, sorry. This is actually the transcribed text as it is in the TuniCo corpus. So you see you have utterance
tags with xml:ids, you have w-tags with xml:ids, you have the lemma, and you have the part-of-speech here - this is in the type attribute and not in the other attribute; I don't really remember why we did that, but that's the way the data is right now. And yeah, we actually have the ID that I would really recommend you have on the TEI root tag, and we have it here as this more special attribute. Now if I want to generate my VERTICAL, I can start the debugger, which will probably help us see the relation more easily. So I run it -
no, that's the wrong one. - I run it, and in the output I now get a VERTICAL. So which part is made up of which? The interesting thing is I can just click here and it will show me how things were created. For example, I take an utterance tag, look for some particular attributes and put them here as a one-to-one relation; but what I always have to do is translate the w-tags into these tab-separated value lines, and in XSL this looks like this here. I take just the text within the w-tag and put it in the first column, then I add a tab, then I take the xml:id, like in here, and so on. So I need to separate by tab, and in XSL this is obviously this encoding. I just put this on GitHub, so if you want to know even more about this you can have a look at the GitHub repository; I will check the link once more and put it in the chat.
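To give you a feel for that relation without seeing my screen: an utterance like the one on top ends up as the lines below, one tab-separated line per w-tag. The values are invented, and the exact column order is whatever your configuration file defines.

```
<u xml:id="u_5">
  <w xml:id="u_5_w_1" lemma="qāl" type="VB">qult</w>
  <w xml:id="u_5_w_2" lemma="lā" type="PART">lā</w>
</u>

becomes, in the VERTICAL (columns separated by tabs):

<u id="u_5">
qult	u_5_w_1	qāl	VB
lā	u_5_w_2	lā	PART
</u>
```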
Okay, so for this demonstration I just used my Oxygen program for generating the VERTICALS: you configure which XML you want to transform using which XSL, and then you get the results here as a lot of VERTICAL files. I did not change the file extensions, and I think - at least in TuniCo - they are not only VERTICALS but also valid XML files. I heard that the VOICE ones are not, because NoSketch Engine does not really care about all the opening and closing tags being in exactly the right order; that doesn't change how the search works in NoSketch Engine. So I have my VERTICALS, I have a configuration for NoSketch Engine - how do I run this? NoSketch Engine is released as open source, but right now it is a bit stuck on an outdated version of Python. They are working on that problem, but right now everything that is open source needs Python 2.6, and therefore you need an old, stable, well-maintained Linux distribution that still has that.
I would not recommend you try to run this outside of a container. We packaged everything up into a container image and published that on Docker Hub, so if you want to quickly try something out with NoSketch Engine, I recommend you use this. I put a link to our container image in the readme file, and I hope this readme here on Docker Hub tells you a bit more about how to go about it. If I just want to launch a NoSketch Engine instance with my TuniCo data instead of the VOICE data, I can always do so with a very simple command, that is this here. If you launch it like that - this is for Windows and my computer, so the weird QW is a hint that you should change this if you use it on your computer - then it will use the files you have processed on your machine and on startup build the binary representations that NoSketch Engine needs to actually do the search. In VOICE this is noticeable: it's not that you have to wait for hours, but you notice that it needs a moment to start up. There is a second option: do this once and create a container derived from the NoSketch Engine container together with the data; this will be documented for VOICE, and then startup is next to instantaneous. So, I just ran this here on my computer - I use Windows, but the commands are nearly the same on Linux and macOS.
I think this actually worked. So it looks like this; as I said, this is for the even smaller TuniCo corpus, on the VOICE corpus it will take a bit longer. And if I now go to localhost and the port I chose, then I get NoSketch Engine with the TuniCo data and nothing... okay, so something else. Sorry, I misinterpreted the first not-a-number.
So if I like, I can now search here, for example for "say", and I get these results; as you see, there is a bit of that underscore-encoded stuff here that we talked about. I used a few pseudo-tokens to give NoSketch Engine an idea of what a sentence is for me and how I can and want to search this. So I have one service running, and now I will show you the other two services you would need to start. For one, you need the API: the TuniCo API server, which is also on GitHub, so you can have a look at that code. What I need to do here is, first, add the XML files - ideally the same XML files that I generated the VERTICALS from; it doesn't have to be, but things tend to get out of sync and that wouldn't be very good. Then I may have to change a little bit here if my corpus is structured differently or has attributes that are different from what you have in the VOICE corpus. In this instance it's something about searching that I actually had to change, aside from having the corpus name here more as a variable; I think it's easier to show you this here. What I had to implement is the ability to use the Unicode characters that are in the corpus
all over the place - I can now actually search for them because I changed the code a little bit. It boils down to something like this: a small change. Here I listed some umlauts, just for fun actually - VOICE doesn't really need them - but I put them in there, and in hindsight it was a good idea, because I could then just add more characters here, and this actually works. So the translation now accepts at least the Unicode characters I need for my corpus. This, as I said, is also something that uses a lot of Node.js tooling. So what we actually need to do to get this running is the usual steps of starting up a Node.js application: install the dependencies - I already did that, so npm install will probably just tell me that nothing is to be done - and then, as described in the readme, there is this command that is a little bit custom but just launches the server. As you can see,
and in VOICE it would be just the same: it reads the XML files and pre-processes them, exactly as I told you before, and now I have a service running on, if I remember correctly, port 3000, which I can ask about a few things. I also built in some API docs, so you can actually try something out here, for example if you would like to implement a completely different user interface. But I will not get into this much more; instead I'll also show you the front end. This works pretty much the same; it is a Vue.js-based setup, so you need to build it before you can run it, but that's actually the biggest difference here. So again, as usual, this takes a little while; it should be finished soon. This is actually the part that needs the most changes, if you like. There are a lot of things that your corpus probably does not have available, for example the filter part. I just
removed the UI for that - of course you could also try to find all the references and remove them too - and there are some other things: in the TuniCo corpus, for example, you cannot actually create a meaningful VOICE-style rendering, the annotation is just not there. But still, you can have the text rendering and you can have a POS rendering, so I just removed the UI elements that would launch the VOICE style, and that is much easier than creating them in the first place. Aside from that I only had to do some little tweaking to get the front end really showing something with the TuniCo data. *typing* I will just cheat now. There is a way of defining a port; I did not remember that
the API - the XML server service and the API service use the same port as the front end on startup. And here you can see a cleaned-up UI that just presents the UI parts that are actually meaningful for the TuniCo corpus, with the addition, as I mentioned, that audio is on a per-utterance level, so if I want to play audio I can select it here and it will actually play this small snippet of audio. I can also search here again and I will have the highlighting and everything; I can have the POS view and have a look at the XML where things were found. So it is possible, and I wanted to show you that. I would say it took me about a week to get this into this state,
and I have to stress that I did not develop this tool set; I was only coordinating a bit, especially for the UI. I was involved in the XML server development, but not in the UI development. I think even someone who is new to this code will have results in a pretty short period of time.