ACDH-CH Tool Gallery 8.1 – Demonstration: Applying VOICE technologies to other data

So if you remember what I told you about the tech stack of the previous VOICE releases, you can imagine we asked ourselves: Should we use eXist-db again? And also: Can we reuse the UI? And we had our doubts. There are the problems with the resource consumption of Java applications, and the most recent eXist-db is no exception. On the plus side, eXist-db's search speed in TEI-based corpora is generally very good, if a bit tricky to configure correctly. As for the UI, we have experience with Vue.js and Vuetify at our institution, but we have no experience with QOOXDOO,

so we chose the first two to implement the UI.  We saw that NoSketch Engine is very capable and   it can search corpora that are hundreds of times  as large as VOICE and it is very fast at that.   It does not suffer from a tight binding to the  JVM. It usually needs way less RAM to do its job   in the first place and then hands back all  the RAM used to the operating system. This is   exactly what is needed in today's shared hosting  environments. And also shared hosting environments  

tend to bite you if you are on the same server  with other resource intensive services. The most   recent version of NoSketch Engine already is an  API which is just shipped with a quite elaborate   React-based UI. We take this approach and use  it to our advantage. We get search results from   NoSketch Engine and then in the browser we use  TEI/XML snippets to generate various renderings   like the familiar VOICE style or the POS style.  Also the code running in the browser can provide  

you with the downloadable files in text format or table format, even with files in Microsoft's native XML format. We developed this translation from TEI/XML to the other formats with the help of a lot of great libraries so we could concentrate on the VOICE-specific logic. On the server side we use the API and the very powerful corpus query language of NoSketch Engine, a query language which is actually, as you saw, overwhelming for many people. So we accept a query language that is easier to use and looks more like what you are familiar with from VOICE, and then we translate this to the real CQL. As a side

effect it's easy for us to add the parts that are needed to ignore the pseudo-tokens. We can exploit the design of the corpus to provide the user with easy means to express different kinds of annotations. That is the lowercase-uppercase distinction, most prominently, for example for the POS tags. The downside of this translation is that we needed to write the search manual, and I would

suggest you take a look at it because it's a good guide to what you need to use our search, and also I hope it is an inspiration for what you can search for. One thing that is rather limited in the NoSketch Engine is its text rendering capabilities. That's one of the main reasons why we use TEI/XML and then render from that. We use Node.js to hold the XML on the server and deliver it. Now Node.js is by no means an XML database, but the only kind of XML query we need to answer is which snippet of TEI/XML contains the xml:ids returned by NoSketch Engine. With a little pre-processing,

this is an easy task in JavaScript. On startup the service reads the VOICE XML files, pre-processes them and keeps them in memory, so it can access parts of them very fast and efficiently. On request the service can also give you an utterance by its ID. For performance reasons the browser does not keep the XML or the rendering of things that you cannot see in your browser. This is true for big search results as well as when you browse through whole files.
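To make the lookup described above a bit more concrete, here is a minimal sketch in JavaScript. It is not the actual VOICE server code: the function names buildIndex and snippetsForTokens are hypothetical, and the regular expressions only stand in for a real XML parser, assuming u-tags that are not nested and xml:id attributes written with double quotes.

```javascript
// Hypothetical sketch: pre-process TEI/XML once, then answer
// "which utterance snippets contain these token xml:ids?" from memory.
function buildIndex(teiXml) {
  const utteranceById = new Map(); // u xml:id -> XML snippet of the utterance
  const utteranceByToken = new Map(); // w xml:id -> u xml:id
  // Assumption: <u> elements are not nested and carry an xml:id.
  const uPattern = /<u\b[^>]*\bxml:id="([^"]+)"[^>]*>[\s\S]*?<\/u>/g;
  const wPattern = /<w\b[^>]*\bxml:id="([^"]+)"/g;
  for (const [snippet, uId] of teiXml.matchAll(uPattern)) {
    utteranceById.set(uId, snippet);
    for (const w of snippet.matchAll(wPattern)) {
      utteranceByToken.set(w[1], uId);
    }
  }
  return { utteranceById, utteranceByToken };
}

// Return each matching utterance snippet once, in order of first hit.
function snippetsForTokens(index, tokenIds) {
  const seen = new Set();
  const snippets = [];
  for (const tokenId of tokenIds) {
    const uId = index.utteranceByToken.get(tokenId);
    if (uId !== undefined && !seen.has(uId)) {
      seen.add(uId);
      snippets.push(index.utteranceById.get(uId));
    }
  }
  return snippets;
}

// Usage with a tiny made-up document (not real corpus data).
const sampleTei =
  '<TEI xml:id="doc1"><u xml:id="u1" who="#S1"><w xml:id="w1">hello</w> ' +
  '<w xml:id="w2">there</w></u><u xml:id="u2" who="#S2"><w xml:id="w3">hi</w></u></TEI>';
const sampleIndex = buildIndex(sampleTei);
console.log(snippetsForTokens(sampleIndex, ["w2", "w3"]).length);
```

The point of the two maps is that the expensive work happens once at startup; each request afterwards is only a few hash lookups.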

Some of our tech stack decisions have advantages and drawbacks. VOICE always was TEI/XML data, but users see a markup made of parentheses, punctuation characters, uppercasing, etc. Previously, with the full XML tech stack, we had XSL easily available to provide a transformation from XML to HTML with the VOICE markup. Our current tooling is not XML-based at all. This leads to more predictable resource consumption and is better known to many software developers today. For this toolset, XML is just text data that has a special meaning only to the human eye. We don't know a good, small, and fast XSL processor for all browsers, so in the end we ended up with XML processing in JavaScript in your browser. This is fast and no

server resources are needed for that. We think it is much easier to find maintainers for that in the future - we hope so - than for complex XSL transformations. In this project our browser UI specialist was not savvy in XSL, so he actually came up with what you see and use right now in JavaScript only. I think it's pretty amazing what he did, but it has one drawback: The JavaScript code he wrote under time pressure, I admit, is far from easily adaptable to new projects or easy to maintain. If we build on what we created for VOICE

for other projects, we need to give this some more  thought. Vue.js and Vuetify enable us to build the   UI you see now with all the functionality we  can provide in a very short period of time,   that is my impression at least. We could  experience firsthand how much a modern framework   helps with getting the UI you want and need and  just supply the logic needed for your project.   You just let the framework care about  the details of the building blocks. So to sum it up VOICE online 3.0 consists of three  services: First the search engine NoSketch Engine   which runs in the background and is generally not  accessible from the internet directly. We package  

the NoSketch Engine distribution together with the  VERTICALS generated from VOICE 3.0 XML data into   a container image. Second an XML file and  snippet server and the query translator service,   written in JavaScript using Node.js. This was  developed using CI, Continuous Integration,   with everything automatically built and set up  by just pushing changes into a git repository.   This approach uses container images. The third  service is the browser-based UI which was   also developed using the CI approach. This git  repository is also built into a container image  

after every push of a change. The image then serves the optimized code for the browser using a very small and fast web server. We provide all parts of VOICE as source code and also as container images. We use those images ourselves, but this also means that if you want or need to run, say, a heavily used instance of VOICE 3 or its search interface, you can do so on every x86 platform capable of running containers. This can be your workstation with Docker Desktop, which is a commercial tool, so please be aware of the license restrictions, or any current distribution of Linux, or you can also use cloud providers that run containers for you.

But can I use this for my own data? That is what I asked myself not long after the release. There are a few prerequisites your data needs to fulfill, and it doesn't hurt if the annotations are not as complex as in VOICE, but you certainly could use our software with data that is annotated with the same level of detail. So what are the criteria your data should fulfill? It needs to be XML, it should be TEI/XML, encoded according to the guidelines, chapter 8, "Transcriptions of Speech", for spoken data; especially a segmentation into utterances with u-tags with xml:ids is necessary. The individual TEI documents need an xml:id at the TEI tag level. For other data,

an encoding using a sentence-like division of text, with s-tags with xml:ids, would probably work, but I didn't try that. The text needs to be tokenized on whitespace boundaries and all tokens need to be in w-tags with xml:ids. The xml:ids should be unique within the whole corpus and they should be stable. Your data is lowercase only or case does not matter to you. This is a

prerequisite for the simple query language you saw to work, as anything uppercase is interpreted as searching for a word with this POS tag. You could probably get around this requirement by adapting the parser of the simple VOICE query language. Also the UI uses some metadata to display information about speakers or the tree of speech events or documents in your corpus. How to create this JSON-encoded metadata is beyond our scope today. If you have any questions, please ask them in the chat, we will talk about them later.
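The uppercase rule mentioned above can be sketched in a few lines of JavaScript. This is only an illustration under assumptions, not the parser used in VOICE: the function name translateSimpleQuery is hypothetical, the CQL attribute names word and pos are assumed, and the real translator also deals with pseudo-tokens and further annotation types that are left out here.

```javascript
// Hypothetical sketch of a "simple query" to CQL translation.
// Assumption: the corpus exposes the token attributes "word" and "pos".

// CQL attribute values are regular expressions, so escape metacharacters.
function escapeCqlValue(value) {
  return value.replace(/[\\"().*+?\[\]{}|^$]/g, "\\$&");
}

function translateSimpleQuery(query) {
  return query
    .trim()
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .map((token) => {
      // A token written entirely in uppercase letters is read as a POS tag.
      const isPosTag =
        token === token.toUpperCase() && token !== token.toLowerCase();
      const attribute = isPosTag ? "pos" : "word";
      return `[${attribute}="${escapeCqlValue(token)}"]`;
    })
    .join(" ");
}

// Usage: a lowercase word followed by a POS tag.
console.log(translateSimpleQuery("the NN"));
```

This also shows why the data has to be lowercase only, or case-insensitive: under such a rule, an uppercase word in the query can never be searched as a word.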

I would like to show you a prototype example using the VOICE architecture for the TuniCo corpus. Coincidentally, I started my work here at the Austrian Academy of Sciences with a corpus of spoken language. In this project, together with the University of Vienna, we collected samples of the language spoken in the greater Tunis area in Tunisia in the early 2010s. It is transcribed according to the standards of Arabic studies here in Europe. In this

transcription, case does not matter usually.  Of course at the time the corpus data was   encoded according to chapter eight of the TEI  guidelines. The overall encoding structure is   not as deep as the one of the VOICE 3.0 XML  corpus so this makes it easier to process.  

What makes a few changes necessary is the usage of way more Unicode characters. The audio is available but it was segmented on an utterance basis, not on a whole text or speech event basis. This means the UI needs to be adapted to accommodate this. Extracting the metadata was a task very specific to that corpus. The metadata is stored very differently. And with this short introduction I would like to quickly demonstrate to you live how I went about adapting the VOICE tool set to the TuniCo corpus data. I really recommend Oxygen XML for dealing with XML data and transforming it; it is just very good at giving you all the tools to run transformations and easily debug them.

The transformation here is not very hard to understand and I will come to it in a moment. We previously heard about the need for VERTICALS in the NoSketch Engine and that we can use them for any kind of data attached to a particular token, but we have to define that, and for this there is a text configuration file that looks like this. We will also provide this configuration file for the VOICE corpus: we want to release a zip package of all the data and some transformations of the data, and it will be in there. So what do you see here? A few configuration settings that the service needs for finding the data and/or creating the initial binary representation of the VERTICAL files, and then you see the definitions of how the data in the token should be addressable, if you want. In what I want to show you, you still have the word, you have the ID, which is very important in our workflow, and then in the TuniCo corpus I can easily get a lemma and a part-of-speech. We don't have multiple - we don't have something like functional and

other part of speech. This is something that, once you have written one, you can often just copy around and adjust a bit. So now we have a configuration file, then we need VERTICALS. So how do we get them? In VOICE there is a Perl script for that because Hannes is very savvy at Perl, and of course you can create these VERTICALS with Perl or with Python, if you prefer that.
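As a rough idea of what such a configuration file contains, here is a small JavaScript sketch that assembles one as text. Everything concrete in it is an assumption for illustration: the function name buildRegistry, the paths, and the attribute names are made up and are not the files shown in the demonstration. The directive names (NAME, PATH, VERTICAL, ENCODING, ATTRIBUTE, STRUCTURE) are those of the Manatee corpus configuration format used by NoSketch Engine, as far as I know it.

```javascript
// Hypothetical sketch: build the text of a corpus configuration file.
// All paths and names below are placeholders, not the real TuniCo setup.
function buildRegistry({ name, path, vertical, attributes, structures }) {
  const lines = [
    `NAME "${name}"`,
    `PATH "${path}"`, // where the binary (compiled) corpus is stored
    `VERTICAL "${vertical}"`, // where the vertical source text is found
    'ENCODING "UTF-8"',
    "",
    // One ATTRIBUTE per tab-separated column of the vertical, in column order.
    ...attributes.map((attribute) => `ATTRIBUTE ${attribute}`),
  ];
  for (const [structure, structureAttributes] of Object.entries(structures)) {
    lines.push("", `STRUCTURE ${structure} {`);
    for (const attribute of structureAttributes) {
      lines.push(`    ATTRIBUTE ${attribute}`);
    }
    lines.push("}");
  }
  return lines.join("\n") + "\n";
}

// Usage: word, ID, lemma and part-of-speech, as in the token columns described above.
const exampleRegistry = buildRegistry({
  name: "TuniCo",
  path: "/corpora/tunico/indexed/",
  vertical: "/corpora/tunico/vertical/tunico.vert",
  attributes: ["word", "id", "lemma", "pos"],
  structures: { u: ["id"] },
});
console.log(exampleRegistry);
```

The order of the ATTRIBUTE lines matters, because it has to match the order of the columns in the VERTICAL files.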

Here I think I can easily show you how such  a VERTICAL and the XML files are related.   For example this is one of the transcribed texts - sorry, no, this is the metadata which  is, as I said, totally different, sorry.   This is actually the transcribed text as it is in  the TuniCo corpus. So you see you have utterance  

tags with xml:ids, you have w-tags with xml:ids, you have the lemma, you have part-of-speech here, this is in the type-attribute and not in the other attribute, I don't really remember why we did that but that's the way the data is right now. And yeah, we actually have the ID that I would really recommend you have in the TEI root tag, and we have it here as this more special attribute. Now if I want to generate my VERTICAL, I can start the debugger, which will probably help us see the relation more easily. So I run it -

no that's the wrong one. - I run it and in the output I now get a VERTICAL. So which part is made up of which? The interesting thing is I can just click here and it will show me how things were created. So I for example take an utterance tag and look for some particular attributes and put them here as a one-to-one relation, but what I always have to do is translate the w-tags into these tab-separated value lines, and in XSL this looks like this here. I take just the text within the w-tag and it's in the first column, then I add a tab, then I take the xml:id, like in here, and so on. So I need to separate by tab in XSL, this is obviously this encoding. I just put this on GitHub, so I'd say if you want to know even more about this you can have a look at the GitHub repository; I will check the link once more and put it in the chat.
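The same mapping from w-tags to tab-separated lines can be sketched in JavaScript instead of XSL. This is a stand-in for illustration, not the stylesheet from the demonstration: the function names are hypothetical, the regular expressions replace a proper XML parser, and the column order after the xml:id (lemma, then the POS taken from the type attribute) is an assumption.

```javascript
// Hypothetical sketch: turn one TEI utterance into VERTICAL lines.
// Assumption: each <w> holds plain text and has xml:id, lemma and type.

// Read one attribute value from the attribute part of a start tag.
function attributeValue(attributes, name) {
  const match = attributes.match(new RegExp(`\\s${name}="([^"]*)"`));
  return match ? match[1] : "";
}

function utteranceToVertical(utteranceXml) {
  const lines = [];
  // The structure tag is copied through one-to-one.
  const openTag = utteranceXml.match(/<u\b[^>]*>/);
  if (openTag) lines.push(openTag[0]);
  // Each token becomes: word TAB xml:id TAB lemma TAB part-of-speech.
  for (const [, attributes, text] of utteranceXml.matchAll(
    /<w\b([^>]*)>([^<]*)<\/w>/g
  )) {
    lines.push(
      [
        text.trim(),
        attributeValue(attributes, "xml:id"),
        attributeValue(attributes, "lemma"),
        attributeValue(attributes, "type"),
      ].join("\t")
    );
  }
  if (openTag) lines.push("</u>");
  return lines.join("\n");
}

// Usage with a made-up utterance (not real corpus data).
console.log(
  utteranceToVertical(
    '<u xml:id="u1"><w xml:id="w1" lemma="be" type="V">is</w></u>'
  )
);
```

Because the structure tags are copied through and only the tokens are flattened, the result stays close to XML, which matches the observation that the TuniCo VERTICALS are still valid XML files.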

Okay, so one more thing about this demonstration: I just used my Oxygen program for generating the VERTICALS. You just configure which XML you want to transform using which XSL, and then you get the results here as a lot of VERTICAL files. I did not change the file extensions, so, at least in TuniCo, they are not only VERTICALS but also valid XML files, and I heard that the VOICE files are not. NoSketch Engine does not really care about all the opening and closing tags having exactly the right order; that doesn't change how the search works in NoSketch Engine. So I have my VERTICALS, I have a configuration for NoSketch Engine, so how do I run this? NoSketch Engine is released as open source but right now it is a bit stuck with an outdated version of Python. They are working on that problem, but right now everything that's open source needs Python 2.6, and therefore you need an old, stable, well-maintained Linux distribution that still has that.

I would not recommend you try to run this outside of a container. We packaged everything up into a container image and published that on Docker Hub, so if you want to quickly try something out with NoSketch Engine, I would just recommend this. I put a link to our container image in the readme file and I hope this readme file here on Docker Hub tells you a bit more about how you can solve this. But if I just want to launch a NoSketch Engine instance with my TuniCo data or the VOICE data, in the future I can always do so with a very simple, actually very simple command, that is this here. If you launch it like that - this is for Windows and my computer, so the weird QW is a hint that you should change this if you use it on your computer - then it will use the files you have processed on your computer and on startup build the binary representations that NoSketch Engine needs to actually do the search. In VOICE, this is noticeable, so it's not that you have to wait hours but you notice that it needs a bit to start up. There is a second option: to do this once and create a derived container of the NoSketch Engine container together with the data. This will be documented for VOICE, and then startup is next to instantaneous. So, I just ran this here on my computer; I use Windows, the commands are nearly the same on Linux and macOS.

I think this actually worked. So it looks  like this, as I said this is for the even   smaller TuniCo corpus but on the VOICE corpus  it will take a bit longer and if I now go to   my local host and the port I chose then I get  the NoSketch Engine TuniCo data and nothing... okay, so something else. Sorry,  misinterpreted the first not a number.

So if I would like, I can now search here, and I get these results; as you see, there is a little bit of that underscore-encoded stuff here that we talked about. So I used a few pseudo-tokens to give NoSketch Engine an idea of what a sentence is for me and how I can and want to search this. So I have one service running and now I will show you the other two services you would need to start. For one, you need the API, the TuniCo API server; it's also on GitHub, so you can have a look at that code. What I need to do here is, first, add the XML files. They should be the same XML files that I generated the VERTICALS from; they don't have to be, but things tend to get out of sync and that wouldn't be very good. And then I may have to change a little bit here if my corpus is structured differently or has attributes that are different from what you have in the VOICE corpus. In this instance it's something about searching that I actually had to change, aside from having the corpus name here more as a variable. I think it's easier to show you this here. What I had to implement is the ability to use the Unicode characters that are in the corpus

all over the place; I can now actually search for them because I changed the code a little bit. It boils down to something like this, so a change just here: I had listed some umlauts, just for fun, actually, VOICE doesn't really need them, but I put them in there, and it was a good idea in hindsight because I could then just add more here, and this actually works, so the translation now accepts at least the Unicode characters I need for my corpus. This, as I said, also is something that uses a lot of Node.js tooling. So what we actually need to do to get this running here is the usual three steps of creating or starting up a Node.js application, that is, install the dependencies, which I already did, so with npm install I think it will just tell me that nothing is to be done, or it will just do something, and then, there's a readme, there is this command that is a little bit custom but just launches the server. As you can see

and in VOICE it would be just the same, it reads the XML files and pre-processes them, exactly as I told you before, and now I have a service running on, if I remember correctly, port 3000, which I can ask about a few things. I also built in some API docs, so you can actually try out something here, in case you would like to implement a completely different user interface. But I will not get into this much more and instead I'll also show you the front end. This works pretty much the same; it is a Vue.js-based setup, so you need to build it before you can run it, but that's actually the biggest difference here. So again, as usual this takes a little while, it should be finished soon. This is actually the part that needs the most changes, if you like. There are a lot of things that you probably will not have available in your corpus, for example the filter part. I just

removed the UI for that, of course you could also  try to find all the references and remove them   too, and also some other things like in the TuniCo  corpus you cannot actually create a meaningful   VOICE style rendering, the annotation is just not  there, but still. You can have the text rendering   and you can have a POS rendering and so for  example I just removed the UI elements that   would launch the VOICE style but that is much  easier than creating them in the first place. Aside from that I had to do  some, only little, tweaking to get the front end really showing something with the TuniCo data. *typing* I will just cheat now. There is a way of  defining a port, I did not remember that  

the API - the XML server service and the API service - used the same port as the front end on startup. And here you can see a cleaned-up UI that just presents all the UI parts that are actually meaningful for the TuniCo corpus, with the addition, as I mentioned, that audio is on a per-utterance level, so if I want to play audio I can select it here. It will actually play this small snippet of audio. I can also search here again and I will have the highlighting and everything, I can have the POS and have a look at the XML, where things were found. It is possible, and that is what I wanted to show you. I would say it took me about a week to get this into this state

and I have to stress that I did not develop that, I was only coordinating it a bit, especially the UI. I was involved in the XML server development, but not in the UI development. I think even someone who is new to this code will have results in a pretty short period of time.

