LCL2021: How-to: Data Science, Human Language Technology and Natural Language Processing
Hi, my name is Christopher Michael Stewart, and I'm going to be presenting today on data science, human language technology, and natural language processing.

Okay, so just a little bit about me. I did a PhD in French linguistics at the University of Illinois, then worked for a few years as an assistant professor of French linguistics at the University of Texas at Arlington, in the modern languages department. I then left that job and worked as a voice engineer in the text-to-speech research and development division at Nuance, which is now called Cerence, in Belgium. While I was there I developed the Siri voices, took on a major customer-facing role, and had a lot of good beer in Belgium. Then I left that job and got two jobs as a senior data scientist. First I worked at Tata Consultancy Services in Arkansas, developing a machine learning pipeline for Walmart, and then I worked at a tiny startup called NarrativeWave, doing time series modeling of industrial internet of things data. I now work at Google as a computational linguist. My team works on detecting policy violations at scale and best practices for crowdsourcing tasks, and I build automated data quality reports. Before I go much further, I'll mention that I am not speaking on behalf of Google here. I'm only talking about my experience and background.

Okay, so I've set myself the impossible task of talking, in 45 minutes, about data science, human language technology, and natural language processing. Each one of these is an entire career's worth of information, and there's no way I could possibly do justice to all three of these fields in 45 minutes. So what I've aimed to do is hit the sweet spot between these three fields. What do these have in common? I'm going to tell you that what I think they have in common is something that I'm here calling predicted probability. These are all fields that are all about modeling, statistical
modeling in particular, machine learning, and trying to make some sort of inference about the future and use data to predict that. So what does that look like? Here I have predicted probability, and then ML, some machine learning or deep learning deal, and really more deep learning than machine learning. Under that I put the short names for these, and I'm going to be using DS, HLT, and NLP, and I put some examples of what are called conditional probabilities. A probability is denoted by P, open parentheses: the probability of some event, right? So the probability of a fair coin coming up heads is 0.5. But you can also condition probabilities on things.

In data science, if you're working as a data scientist at, for instance, Tesla or somewhere like that, you might be interested in predicting: what is the probability that an object that appears in front of a car will be a human? So it's predicted, "will be a human", and it's probabilistic: what is the probability this object is a human, given data coming from thousands of sensors, historical data, metadata, all sorts of stuff, right? If you're working in human language technology, and you're working for a text-to-speech synthesis team, you might have a whole bunch of segments, a whole bunch of cut-up vowels and consonants and things like that, and you'll be interested in predicting: what is the probability that a segment is good for a text-to-speech synthesis context, given a language model's spectral specifications, duration, focus (narrow or broad focus), part of speech, et cetera? If you work in NLP, you might be interested in predicting: what is the probability that a tweet will have hate speech, given metadata about the tweet, the actual content, the language in the tweet, the time it was tweeted, the place it was tweeted, who retweeted it, how many retweets it got, all this sort of information?
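As a minimal sketch of the difference between a plain probability and a conditional probability, the snippet below estimates both from counts. The eight "tweets" and the single `has_flagged_word` feature are invented purely for illustration; they are not data from the talk, and a real system would condition on far more information than one feature.

```python
from fractions import Fraction

# Hypothetical toy observations: (has_flagged_word, is_hate_speech).
TWEETS = [
    (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, False),
    (False, True), (False, False),
]


def probability(event, data):
    """P(event): share of observations for which `event` is true."""
    if not data:
        raise ValueError("cannot estimate a probability from no data")
    return Fraction(sum(1 for row in data if event(row)), len(data))


def conditional_probability(event, given, data):
    """P(event | given): the same estimate, restricted to rows where `given` holds."""
    return probability(event, [row for row in data if given(row)])


def is_hate(row):
    return row[1]


def is_flagged(row):
    return row[0]


p_hate = probability(is_hate, TWEETS)
p_hate_given_flagged = conditional_probability(is_hate, is_flagged, TWEETS)
print(p_hate, p_hate_given_flagged)
```

On this toy data the unconditional estimate is 3/8, while conditioning on the flagged word raises it to 2/3, which is the same kind of shift as the "given all this information" in the examples above.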
This is the sort of world that you will live in. So I'm going to start this talk talking about probability and predicted probability in machine learning, and this is probably very different from the other talks that you've been to, because I'm really going to start with the part that you might not know very much about and work back, in the second half of the talk, to an actual language context. But we're going to start with talking about probability, machine learning, and things like that. So if you don't like that, I'm sorry, but it is the common thread here amongst these three subject areas.

Before I go too in-depth here, I just want to emphasize that if you have never taken statistics, or have no idea what machine learning is, or are afraid of equations, or whatever the case may be, don't worry. Just try and sort of absorb the ideas here. For instance, with this conditional probability, you can imagine: what is the probability that it's going to rain on Friday, this coming Friday? You might have some sort of naive probability, but if I tell you that it's Monday, you might have a very different prediction than if I tell you it's Thursday. Closer to the time, we have a better idea of what's going to happen, right? So just try and engage with the ideas, even if it's sort of foreign and seems odd or hard to get. Just try and get some of the ideas; don't worry if you don't get all of the particulars.

Okay, think to yourself: have you ever taken a class in statistics? I can only see a few people. Raise your hand if you have had a statistics class. Okay. If you haven't had a statistics class, no worries, not a problem. But the interesting part for our purposes about statistical modeling is this. Statistical modeling is when you take a set of data and you try and build a model to do something, and, you know, the famous quote: all models are wrong, but some are useful. So the point
is to try and figure out something with this model, right? Typically in the social sciences we collect data, language data in this case for instance, and we build a statistical model, a mathematical model, of that data, and we assume that this model tells us something about what is going on in the data. For instance, if you have a regression model, you'll have parameters in your regression model, and it will tell you something about how predictive someone's language background is of the language that they use, or something like that. So we assume that these models reflect some sort of underlying process. Because of that, we're very interested in the parameters. The parameters are the important parts; they're the interesting parts of the model. Like, does someone's age tell you more than their L2, or whatever, about some sort of linguistic behavior? Because we're interested in parameters, we prefer simpler models in this statistical modeling culture, and this is prevalent in causal research but not so much in engineering.

I forgot to mention that I'm taking this from a paper by a statistician named Leo Breiman called "Statistical Modeling: The Two Cultures", and it's helpful for seeing how these two cultures differ. So that's data modeling, and if you've taken introduction to statistics, you've talked about things like this: about t-tests and ANOVAs and regressions, logistic regressions, and all that sort of stuff, right?

There is a different culture of statistical modeling that Leo Breiman in this paper calls algorithmic modeling. In algorithmic modeling, we do not assume that the model that we build reflects some sort of underlying process. There's no assumption that what comes out of the model tells you about reality. We're not really interested in that. What we're interested in instead is prediction. We're interested in predicting what's going to happen
with this data that we have from the past. Because of this, we don't really care about the model being simple, because we're not really interested in what's going on in the model all the time. We're more interested in prediction. So we're not afraid of black-box models, models where we don't necessarily know what's going on in every tiny little part of the model. This needs lots of training data and lots of computational power, typically. Now, if you've never heard of this algorithmic modeling sort of world, and deep learning and machine learning and all that sort of stuff, but you have taken a statistics class, there is an intermediate between these two. It's a field called statistical learning, and there is a book written by Hastie and Tibshirani called Introduction to Statistical Learning that I really recommend here if you've never read it or don't have any familiarity with it.

Okay, so let's talk a little bit about machine learning. Machine learning: great buzzword. What does it mean? In traditional programming, you take input, you make a program, and you get results. For instance, a popular coding problem is to say: write a function where you give this function a number as input, and if the number is divisible by five you output "fizz", and if the number is divisible by seven you output "buzz", or something like that. So you have input, an integer; a program, this thing that says, is this number divisible by five, is this number divisible by seven; and output. It does something. Machine learning is a little bit different, because we take input and results and we put them into a machine learning model, and what comes out of it is something like a program. And it's important to note that this has nothing to do with language. This is basic binary classification. In this I have this funny sort of toy example of: is this a chihuahua or a muffin?
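The contrast between the two setups can be sketched in a few lines. The first function is the hand-written rule from the coding problem as described (five gives "fizz", seven gives "buzz"). The second is a deliberately tiny stand-in for "machine learning": it takes inputs plus known results and returns a program. The single numeric feature and the threshold learner are assumptions made for illustration, not the kind of model the talk has in mind for images.

```python
from typing import Callable, List, Tuple


def fizz_or_buzz(n: int) -> str:
    """Traditional programming: a hand-written rule maps input to output."""
    if n % 5 == 0:  # checked first, so a multiple of both 5 and 7 gives "fizz"
        return "fizz"
    if n % 7 == 0:
        return "buzz"
    return str(n)


def learn_threshold(examples: List[Tuple[float, int]]) -> Callable[[float], int]:
    """Machine learning in miniature: inputs and results go in, a 'program' comes out.

    Each example is (feature_value, label) with label 0 or 1. Every observed
    feature value is tried as a cutoff, and the cutoff that classifies the most
    training examples correctly is kept.
    """
    best_cutoff, best_correct = 0.0, -1
    for cutoff, _ in examples:
        correct = sum((x >= cutoff) == bool(label) for x, label in examples)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return lambda x: int(x >= best_cutoff)


# Hypothetical labeled data: one made-up score per image, 1 = dog, 0 = muffin.
training_data = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
dog_or_muffin = learn_threshold(training_data)
print(fizz_or_buzz(10), fizz_or_buzz(14), dog_or_muffin(0.85), dog_or_muffin(0.15))
```

Nobody wrote the cutoff inside `dog_or_muffin` by hand; it was chosen from the labeled examples, which is the sense in which the output of learning is "something like a program".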
But you could be looking to predict: is it going to rain or is it not going to rain? Is this person likely to click on this link or not click on this link? Any sort of binary classification. So for the purposes of this little demonstration, we're going to talk about this funny toy problem in visual binary classification, which is: is this picture a chihuahua or a muffin? I took this picture from an article, where you can see that actually chihuahuas and muffins are not always that easy to disambiguate, which is kind of scary.

So if we want to build a classifier with this, what would we do? We need labeled data. So we need an image one, an image two, an image three, and an image four, where we indicate: is this thing actually a dog or a muffin? And we put that into a model, and the program that comes out of it we'll just call, for now, "dog or muffin". So we build a classifier and then we use it. We start with what's called training data, these four instances here, image one, two, three, and four, and we put a new, unseen image into this and use the classifier. So I ask the classifier, I mean not literally, but we have the classifier predict: is this a dog or a muffin? For instance, in one case it could say, "I'm 95 percent sure that this is a dog," right? We look at the expected results. If we put another image, image six, in, it says, "Look, I'm 55 percent sure, so I'm not really all that sure, but 55 percent sure this is a dog." And we look at the expected results: oh, it's a muffin. So the model wasn't real confident that it was a dog, and that's good, because it wasn't a dog, it was a muffin. We can do this over and over again. Image seven, image eight. Image seven: 64 percent chance it's a dog. Image eight: 92 percent chance that it's a muffin. We're wrong with image seven and we're right with image eight. And you can go through this over and over and over
again. This is called reinforcement. So this is kind of the basic intuition behind how this works, and you can see that what we're interested in here is predicting. I don't really care what's going on in "dog or muffin". I don't really care that the model is looking for ears in an image, or how bright or dark the image is, or whatever. It doesn't really matter. I'm really interested in how accurate the model is at predicting whether the image is a dog or a muffin.

So, just to reiterate. First of all, trigger warning: the next slide does have equations. So if you're scared of betas and x's and y's and Greek letters of all sorts, avert your eyes on the next slide. In the data modeling culture, we're most interested in understanding what the model tells us about the data-generating process. If you collect data about language usage, you will put into this model a whole bunch of information that you think helps you understand what's going on, right? So you're interested in the data-generating process. All data goes in the model, and you're not really concerned with prediction. If you go to publish an article, the editors are never going to come to you and say, "Hey look, I found this other speaker that you didn't talk to who has these characteristics. Please put those characteristics in the model and tell me how predictive they are of this person, and I'll tell you if you got it right or wrong." That will never happen, because that's really not the point, right? You're really interested in what's going on in the data-generating process.

In the algorithmic modeling culture, we're primarily concerned with prediction, so your priorities change. In this sort of world, some data could help you make better predictions, and some data can in fact help you make worse predictions. You can end up in instances where you have not very many observations but millions and millions and millions of variables, like genetic
arrays. Genetic testing is a good instance of this. Sometimes you only have a few samples of a gene, but you have as many... I don't really know that much about genetics, but as many of the little pairs or whatever of gene doodads. Sorry, it's late in the day, I'm not very eloquent here, but you get my point. So sometimes you have more variables than you do observations, and that might be a problem for your model.

So what might you want to do? Well, if you were one of the people who earlier said that they had had an introduction to statistical modeling class, you will be familiar with this data modeling equation: y equals beta-naught plus beta-one x-one, beta-two x-two, all the way to beta-p x-p. If you've never seen this and this looks confusing and crazy, don't worry about it. Imagine that you want to predict height given weight, right? You'll have an x and a y axis. On the x-axis might be height and on the y-axis might be weight, or vice versa, it doesn't really matter, and you would have a whole bunch of points there, right? And I tell you, "Hey look, I want you to predict a new person that you haven't seen, given height and weight, and I'm going to put this one point in here on the y-axis, and I want you to tell me what the x-axis will be." Well, that's what you have here, right? The y is the thing that you want to predict, the x's are the things that you already know, and the betas are the weights on those things. And what you want to do, obviously, is draw a line that minimizes the distance between all those points and that line, right? It's called the line of best fit. So that's what this "minimizing RSS" is: you want to minimize the residual sum of squares.

Now let's say that you have so much data that you actually have a problem, because some of those x's are not very useful. So you want to be able to adjust these
betas, right? This whole thing is the exact same. This is machine learning. This is called ridge regression. The equation is the exact same, except you add this little penalty term here at the end that says: look, don't let my squared betas get too big. And this lambda parameter, this upside-down y, allows you to sort of turn a volume knob on all these betas, right? So it's a very simple thing, and that has taken you from statistical modeling, the data modeling world, to machine learning.

Okay, now obviously that's not really the state of the art. The state of the art is something more like much more complicated algorithmic modeling. Here you have a basic neural network, and this is what the state of the art is. This is what the people who actually build these production models at big companies do. It's just one iteration, a slight bit more complicated. But what I'm trying to impart here is that if you understand ordinary least squares regression, this top equation, you can easily go to statistical learning, which is this next one, just by understanding what's going on in the model, and then once you understand that, with a little bit more work, you can go to this algorithmic modeling world.

So the intuitions here, again, are very simple. You can have a model that doesn't do a good job of predicting; it just kind of makes random predictions, right? Or you can have a model that's very in tune with this training data, that really understands all of those chihuahua and muffin pictures that we showed to it, but if I show it a new picture, it has no idea if it's a chihuahua or a muffin. So you want some sort of sweet spot where you're not making completely random predictions, but your predictions aren't so closely tuned to that
training data that it doesn't generalize. So you want to find this minimum error here, and this is referred to as the bias-variance tradeoff.

So, wrapping up our discussion of machine learning here: how effective is modern machine learning? Well, deep learning... I put a few links here; you're welcome to copy them down and find the articles. It's not always great, as you have probably experienced in life, but it is quite good. There's a massive need for labeled data. And who labels this data, and how do we ensure that their labels are good? Well, now we're getting to our world. Linguists are actually pretty good at getting data from humans, right? And so linguists often work on this in industry.

Okay, so this is maybe more than you have taken on in your explorations of statistics. So how can you start learning about machine learning? Well, the important thing is to remember that all approaches, even the most complex DeepMind model that you can possibly imagine, have x's and y's. They have things you want to predict and things you already know. They have weights that go on the x's that predict the y's. And they have error: the model is wrong, and you want to quantify how wrong the model is. So remember that all these approaches have these ingredients. If you want to start learning about machine learning, you can develop some statistical intuitions. Remember that basic inferential statistics will serve you well going forward. Do your own analyses as much as possible. Make sure that you understand ideas like statistical assumptions, normality, what a normality test is, model assessment. Don't just go to Stack Overflow and enter in some R code that someone tells you builds a linear mixed-effects model. That will not serve you well. Read An Introduction to Statistical Learning if you want to go beyond inferential statistics. If you're already there and you're looking for something more complex, find Bayesian Data Analysis. And if you want more complex than that, then you're in the wrong place, because this talk is not for you.

Okay, so that was the part about machine learning. Thank you for sticking with me through statistics. Let's talk now about natural language processing. So now we're going to turn to language. How are these machine learning models used in natural language processing? This section is just a brief peek into the kinds of considerations that go into an NLP pipeline. This is not state-of-the-art NLP. This is like, if you want to learn Python and you start, and they say, "Okay, define a variable that's a string, and now capitalize the string," or whatever. That's kind of the equivalent. I'm not proposing to you that this is state-of-the-art, incredible NLP. Knowing these things is not going to get you a job, but it is a peek into this world.

Okay, so what is involved in natural language processing? This is just a sample sort of pipeline, things you might want to do if you were building a natural language pipeline, with two sentences: "Moscow also denounced what it described as the rise of 'nationalist and neo-fascist' sentiment in Ukraine's western areas, where it said Russian speakers were being deprived of rights. It has repeatedly expressed concern for the safety of Russian citizens in Ukraine."

So the first thing that we might be interested in here is defining sentence boundaries, right? A very naive approach would be to say that any time you see a period, that's the end of a sentence. That's going to be problematic, right? Because if we have something like A.D., I don't know, some sort of an abbreviation, or U.S.A., or something like that, you could have periods that are not sentence boundaries. So again, you're going to need to train a probabilistic model that tries to predict when sentence boundaries are,
and sometimes you'll be right and sometimes you'll be wrong. The next thing you might want to do is to divide this sentence up into tokens. Now, what are tokens? Tokens are something like words, but importantly, they may not always align with what you think of as a word boundary. For instance, one interesting thing that I see right away here is this "Ukraine's". So "Ukraine" apostrophe "s": most people would say that's a word, right? But for the purpose of tokenization, we might be interested in saying that this apostrophe "s", which indicates possession, is a separate token. Okay, so again, this is something that you're going to have to model probabilistically.

After this, part-of-speech tagging. I have a speaker note here, but I can't see it, but I think part-of-speech tagging, the state of the art in English, is something like 98 percent accurate. So again, this is an instance where you would build a model and probabilistically predict what these parts of speech are. Syntactic parsing: I'm sorry if you're a syntactician, but I'm just going to skip over this in the interest of time and move on to entity detection.

So I've started at sort of a very low level, tokenization, then part-of-speech tagging, going up further to syntactic parsing, and then even further up than that to entity detection. You can see that you want to be able to model the fact that some of these are related to each other, right? So "Moscow" and "it" have some sort of relationship, et cetera. This is called entity detection, or named entity recognition. You might want to cluster them, right, so that "Moscow" and "it" are clustered, "what" and "rise", "Ukraine", et cetera, et cetera. So these are just the kinds of considerations that you would want to take into account if you're going to have a computational representation of what's going on in these two sentences. So again, all of these are
predicted using deep learning models in state-of-the-art natural language processing. And if you're interested in deep learning models, remember this guy at the bottom here; that's a visual representation of a deep learning model. This and subsequent steps are typically implemented by research scientists and computer engineers. It's not often the case that linguists, because we are not really trained in math and computer science and things like that, are writing these models from scratch. It used to be the case that hyperparameter tuning and things like that you had to do by hand, but increasingly even that is solved by the computer, by just grid search and things like that.

Linguists therefore contribute domain knowledge to identify rules and identify areas for improvement. Linguists obviously have a lot of experience with what language should look like, and so when the model gets something wrong, they are pretty good at saying, "Hey, I think that this is what the problem is." And once you have enough of those, you can identify rules, right? You can say, "Hey look, it looks like the model is not great at dealing with possessives," or whatever the case may be. And linguists also work to provide annotated data that train these automated processes.

So you might think, "Oh wow, this is all pretty impressive. So NLP is pretty much what people call a solved problem, right?" Well, not necessarily. You probably recognize that a lot of machine learning can kind of go wrong and do wonky things. The other day I was working on something and tried to get some sort of digital personal assistant to play a song, and I said the name of the song in Spanish, so like, "Please play the song" or something like that, to the thing, and it says, "Here's what that is in Spanish," or whatever, and then just translated whatever I said in Spanish. So why didn't it get it right? Well,
there are some things that are pretty challenging for natural language processing. So let's consider this passage really quickly: "The police arrested Mayor John Smith yesterday. He was suspected of stealing garden gnomes, the latest breach in a crime wave rocking the sleepy town of Springfield." If we think about abstract representations of these sentences, you can start to see that there are things that are a little bit difficult, right? Those things look like this.

There's the fact that we have co-reference to an entity, like "Mayor John Smith" and "he". The police are tied to the town of Springfield. If we look at this sentence ontologically, how do we want to represent something like "Mayor John Smith"? Is this an instance of a mayor whose name is John Smith, or should it be John Smith, and his occupation is mayor? What is the preferable representation here in terms of salience?

We have these event relations: the fact that we know, as humans and language users, that "suspected" comes before "arrested". Normally someone is suspected of something and then they're arrested, right? But a computer doesn't know that. You have to sort of engineer that. These are inferences that require real-world knowledge of criminal processes. There's also the fact that there is something like subjectivity, right? That someone can be suspected of something is a subjective call, a subjective consideration, and we understand that, but computers don't really understand that. Ditto, or idem, for things like a crime wave that "rocks". How do we represent the fact that this is metaphorical usage, and that the town of Springfield is not actually sleepy? It's not that it wants to take a nap; it's a way of saying that a town is calm and sort of bucolic. Ditto for metonymy. So yeah, these are challenges for modern natural language processing.
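Going back to the first two pipeline steps described earlier, the naive "every period ends a sentence" rule and the idea that a possessive can be its own token can both be sketched with the standard library alone. This is a rule-based illustration under stated assumptions, not the probabilistic approach the talk recommends: the function names, the regular expression, and the example sentence containing "U.S.A." are all made up for this sketch.

```python
import re
from typing import List


def naive_sentences(text: str) -> List[str]:
    """Treat every period as a sentence boundary (the naive rule)."""
    return [piece.strip() for piece in text.split(".") if piece.strip()]


def tokenize(sentence: str) -> List[str]:
    """Split into word-like tokens, keeping possessive 's as its own token.

    The alternation tries "'s" first, then a run of word characters, then
    any single character that is neither a word character nor whitespace.
    """
    return re.findall(r"'s|\w+|[^\w\s]", sentence)


# Hypothetical sentence with an abbreviation, to show where the naive rule breaks.
abbreviated = "It was made in the U.S.A. and shipped. It arrived."
pieces = naive_sentences(abbreviated)
tokens = tokenize("Ukraine's western areas")
print(pieces)
print(tokens)
```

The two-sentence input comes back as five pieces, because the periods inside the abbreviation are treated as boundaries, while the tokenizer separates "Ukraine" from the possessive marker. Hand-written rules like these fail on enough cases that the talk's point stands: these decisions end up being modeled probabilistically.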
Okay, so starting to wrap up here. This has all been a lot of information, and I'm sorry, I've gone through it relatively quickly. I don't know if we do questions. Is there a question period? Does anyone know? I've not done one of these yet. But I guess we'll see at the end of this. I think there is, right? Because you guys can put questions in the chat, hopefully. In any case, I'm happy to go back through parts that might have been a little tough. But let's talk just briefly before that about what you can start doing now. So let's say you're in a master's program and you think, "Oh, this NLP stuff seems kind of cool. I've taken an introduction to Python, and I took Statistics 101 or whatever. What should I do next?" This should be "yourself", sorry. So the first thing is learn. You've got to do a lot of independent learning. This is not the sort of thing where you turn up and someone gives you a job, and immediately you have a job for 50.
You have to invest a lot of your own time into learning things here. So learn how you learn best; that's the first step. I was in the office hours the other day, and several people said, "Hey, I'm taking introduction to Python and I'm sort of interested, but it's boring and it's hard to stick to, and I just don't really like it that much." And my advice was, "Hey, well, then find a project." Figure out some sort of project that you're actually interested in. Like, I don't know, "I studied Georgian or something, and it has this very interesting syntactic property, and I want to build something that can predict that," or "I want to suggest a better way for my favorite software package that handles Georgian to model whatever it is." Find a project and contribute to that project. A lot of natural language processing software is open source, so you can find the people that develop it and take the code. It's called forking the code: you just make a copy of the code, look at it, change the part that you want, and send the change back to the person who develops the code. That's called a pull request. And you say, "Look, I really admire your work and think it's very cool, but I've found something that I think can be improved. Please take a look and see what you think."

How do you measure progress? You're going to have to present what you've done in the past and what you want to do in the future, and give some sort of indication of the progress that you've made on your various projects, and that's going to be extremely important when you go into industry. This works really well for technical skills, right? You can say, "I've taken intermediate Python, and I have this change, this change, and this change that I made to the Natural Language Toolkit package that are now in production," or whatever. So it works really
well for technical skills.

Next, do side projects. So find problems, like I mentioned, and fix them. It's not always the most glamorous and interesting work, but guess what: it's what you're going to be doing if you get a job in one of these fields. So yeah, find a problem and fix it. There's paid work; big tech companies, for instance, hire paid interns every summer. And there's volunteer work. I mean, what I said about working on something like NLTK, that's free. You sacrifice your time, and you get something out of it: you get experience.

Maybe the most important of all of these things is to keep records of what you did. It's nice to tell people, "I know about statistics, and I know about programming, and I know about this, I know about that," but it's much more convincing if you can point to things you actually did. And if you want to be able to point to things you actually did, the way that you would do that is you make a web page, right, a personal web page. You make what's called a repository, a code repository, and you point from your web page to the repository. The repository has project one, project two, project three, project four, project five, and the web page does a beautiful job of explaining: here's what project one does, here's what project two does, here's what project three does. And for each one of those projects, you can point to all the code that you wrote to do the project. So you really want to be able to prove that you have whatever kinds of experience you have. It's really, really, really important to keep records.

Next, find chances to use new tools. Be sophisticated about data. This goes back to the slide where I said, look, make sure you understand the models that you're building. A lot of people build statistical models and have no idea what they mean. Don't be one of those people. Be sophisticated about data. Apply for an internship. Like I said, there
are a lot of companies in tech that hire interns to do um you know things like text-to-speech synthesis and automatic speech recognition and natural language processing and all sorts of stuff like that so apply for an internship automate your drudge work so you you know is it the case that you are working on an r script that is now uh 3000 lines long that has all of your dissertation research in it if so stop stop doing that figure out ways to write functions and then call the functions and write unit tests to test your functions so that you know that they're working as intended automate your drudge work and finally prove it make artifacts and understand impact so if i can just impart one thing to you through this session it's that if you're interested in um in in this sort of line of work make sure that you can show people your work make sure that's very very important it's nice to say i learned python it's a lot better to say look at all these uh contributions that i have to nltk it's very easy to sit there and sit through an introduction to python class it's much more difficult to go and contribute to a software package right but one will show people that you know what you're talking about and the other is kind of like well maybe they know something about python maybe they don't who knows um so i think that that's where i've stopped here um but i thank you for your time and we have about 10 minutes left for questions hi chris um thanks again so i was just wondering for those of us who have taken like a compositional semantics course um what kind of keywords how can we parlay that into relevant language that um you know people in these companies would understand what we actually worked on from a computational semantics course compositional semantics compositional semantics it's like it's very familiar oh okay okay that's okay i think it's very related it's like you know and then identification and everything so okay um well well um so one thing i
could i think that compositional semantics is about is um ontologies presumably you worked on ontologies and embeddings and things like that maybe a little bit okay so that that's kind of the currency of the realm in natural language understanding so natural language understanding is kind of the um province of like the digital personal assistants and things like that right so if i tell my um uh whatever google home or or you know uh alexa or whatever um uh i love my mom why don't we call her now you know you need some sort of representation an ontology that says mom is referred to as her you know whatever however you want to envision it so natural language understanding is a good one and has the sort of um added benefit that it's a very hot field right now um they're hiring lots and lots and lots and lots of linguists to make lots and lots and lots and lots of ontologies um so that's one key word you can you can look for is ontology yeah i wish i knew more about compositional uh semantics yeah i think it's like very helpful for it um we're doing translation into uh working with computational because there's um you know we've slammed into lambda functions in order to like figure out how interrelated and everything so i think it's pretty close great you have a couple of questions in the chat first would you suggest to put academic output together with side projects on our website uh yeah for sure definitely um yeah there so there's nothing wrong with um with talking about what you've done in academia so my first job uh outside of academia was in text-to-speech um research and development and the um the director of the the org or whatever that i was joining sort of said yes send me all your you know academic articles i'd love to read them and then i turned up for the interview and he said oh thank you for coming by the way i didn't read any of your articles so you know um don't expect that people are going to read you know seven 50-page articles because you're
interviewing with them they're not um but there's nothing wrong with putting um the fact that you wrote an article that's in language or you know journal of sociolinguistics or whatever um on your website it certainly is impressive i mean it points that you can you know you can make deliverables um but it's probably not realistic to expect everyone to read all of your articles that's just not how industry works um next question is it necessary to learn python for nlp work i've done some java at beginner level can i use that yeah it's certainly not necessary to learn python um python is nice because it's very simple it's a scripting language a high level scripting language so it has the benefit of being relatively easy to read compared to java which i'll be honest i look at java and i think public class this public class i have no idea what all that stuff means and it's hard to follow so if you've already done java python should be a piece of cake um so yeah it's certainly not necessary to learn python for nlp work right next question in your experience are there opportunities to work on these kinds of projects with the team strictly as a linguist without having coding knowledge yeah i get this question all the time do i need to know how to code and you know the the the truth is you don't need to know how to code but if you are not willing to interact with so you if you want to work with engineers so if you're going to work in tech you're going to work on a team you're going to work with engineers there are engineers everywhere the engineers are the ones who actually do the stuff they write the code that you know makes things work that people use so if you're not willing to engage with the ideas and code and you sort of throw your hands up you're gonna make a rod for your own back so you don't have to strictly speaking know how to code but it is always going to help you it will never ever ever hurt you to know how to write code and generally speaking you don't have to
know it but it will always help you the next question asks where would i go to look for side projects for linguistic annotation that might be a few hours per week yeah so this is a great question i mean i don't i don't really know i've not done it but um at the companies that i've worked at we contracted with um with vendors who sourced uh or who hired um people who did the annotations so when i was at nuance i think that we worked with appen which either was appen butler hill or is now appen butler hill so that's one um lionbridge i know does this too um and there are a variety of companies that do this sort of thing and the next person would like to know the name of the statistical learning book that you mentioned earlier in your presentation yeah so the the authors of that book are hastie and tibshirani and the book is called introduction to statistical learning and that's sort of the the um the nice reader friendly part there's also one called elements of statistical learning that is much more in depth um could you or someone else add that to the chat so yeah introduction to statistical learning okay and next question will linguists become more or less relevant in the future of nlp this is a great question i mean i don't i don't have a crystal ball and obviously don't really know um but i think that the i think as as nlp ventures into more and more uncharted territory um linguists will become more and more relevant because linguists understand how humans use language right that's kind of our um our superpower and you know the more we get into difficult questions uh of um of language and wanting to model that the more the linguist's uh linguistic skill set will be valuable so my prediction is more but it's a probabilistic thing i don't really know right uh the last question that's in the chat at the moment is oop i missed it i'm a linguist and i'm learning data science how can i make sure that my new
knowledge of data science makes me eligible for jobs in computational linguistics at companies like google um how can i make sure that my knowledge my new knowledge of data science makes me eligible well um so i mean to some extent like this this distinction that i've made here between um natural language processing data science and human language technology is artificial so you have people whose job title is data scientist that work um on language things and people that do human language technology who are engineers who don't really know anything about linguistics so i guess what i'm trying to say is if you are learning about building models and being uh statistical models and being sophisticated about using data then there's no way that your your work here can not make you um uh competitive for jobs or not make you more competitive more eligible for jobs in fields like computational linguistics at companies like google and next question any exciting projects you're working on right now any suggestions for youtube channels or twitch channels for watching coding live streams um so i can't really talk that much about what i'm working on now it's highly proprietary um but uh yeah i mean i can tell you that i work in the ads org which is probably the most profitable machine on the face of the earth right now um so that's kind of interesting um [Music] suggestions for youtube channels um [Music] i don't know i'm old i read books uh and twitch i i almost don't even know what that is so i i can't kind of be really okaran and we have a hand from wei weilai hi chris um i enjoy your talk very much so thank you for the thought i'm wondering because you mentioned that you come from you are doing um jobs that are related to voice in the first place and then you switch to google and other places that might deal with text as well so i am a phonetician and i am wondering what are the uh how is the job opportunities look like how do they look like for speech scientists versus
natural language processing side of the linguistic opportunities linguistic work job opportunities well so speech sciences is great because it gives you a lot of experience with um kind of the more technical aspects of of language and yeah if you want to work in a field like text-to-speech synthesis you're working on that all the time or asr automatic speech recognition you're going to be working on that all the time so it's just it's a little bit different kind you know sort of work but um in terms of are there more or less jobs that's kind of hard to say it's really really difficult to say which one has more jobs so we have about two more questions to get through we are at time uh the next question any introductory material you can recommend oops for anyone interested in learning about these topics so yeah that the book that i mentioned um introduction to statistical learning is is a great first stop um there is a so the the software package that i mentioned the natural language toolkit nltk is all open source and it has a book that's all online for free it only demands your time and attention there is in addition there are a number of good textbooks one uh written by um a guy named jacob eisenstein who used to be at georgia tech and is now at google um who just came out with a new book on natural language processing and intro to natural language processing i i have to confess i don't know the title but i'm sure if you look up jacob eisenstein uh book natural language processing it'll pop right up um so there's yeah there's a lot of nice introductory material but keep in mind that it is it's not always written for you know i um for people who are not technical not sometimes it is but you know don't be scared if you see you know numbers and and equations and things like that just try to wade through it you know do your best to understand what you can and our last question uh at least our last question right now how relevant is a graduate degree for nlp slash
computational linguistics work in tech yeah so i get this question a lot too um and uh so sort of the the market reality is that um is that it you know if if a company can afford to pay someone who has an undergrad degree and a grad degree the same salary which one will they hire obviously they're going to hire the person with a graduate degree right you know if you can have like a crappy car and a really nice car for the same price you'll get the nice car almost invariably right so you know don't forget that these these fields obey market forces and so the the applications within natural language processing human language technology and all that that are hiring the most people will be the ones that are the most profitable and if you want to um you know it's not the case that you need a graduate degree to work in nlp or computational linguistics but it is the case that you need a skill set that will allow you to work in it and it is the case that there is unfortunately a glut of linguistics knowledge in the world and you know not that many jobs and so um you know just keep that in mind you know it's important to remember to be realistic about you know sort of the the job market all right i think those those are all the questions for you so thank you for your time have a great day [Music] [Music]
2022-03-07 13:47