# Convolutional neural networks with Swift - Pittsburgh ML Summit ’19

Thank you all for coming, and thank you to Google for having me. Today we're going to talk about convolutional neural networks with Swift and a little bit of Python. Very broadly, we're going to explore the problem of image recognition. The purpose of my presentation is to take you from, we'll say, zero, or about as basic as you can get in this field, all the way up to the current state of the art.

Towards that end, we're going to do a quick review of neural networks and how they work. We'll look at a one-dimensional version of the MNIST problem, which is a well-understood problem in computer vision. From there we'll introduce convolutions, and then we'll tackle MNIST together using a 2D approach. From there we'll look at how we can introduce color and start to stack our convolutions in order to tackle larger problems. Then we'll look at how we can take the same basic approach and even more layers to build up to VGG, which is our first state-of-the-art approach, from 2014 or so. Then we can modify the basic VGG network to produce residual networks, which are a very powerful modern approach in this field. At the end we'll look at EfficientNet, which is a very recent paper in this field, and do a quick demo of running it on an Edge TPU device. So here we go.

Very broadly, these are the four big categories of computer vision I think you should be aware of. I'll convert them into the international standard of cat and dog units. We have image recognition: is this a cat or a dog picture? Object detection: where is the cat in this picture? Image segmentation: which pixels are cat pixels? And finally instance segmentation: how many sets of cat and dog pixels do we have? Today we're just going to be focused on the upper left quadrant here, so just cats and dogs.

Neural networks. This field as a whole, machine learning, has historically been focused on reducing problems down to the simplest dimension, we'll say, trying to figure out if there's just one variable that changes things. Neural networks are kind of an outgrowth of computer science; we'll say they were a curiosity for the longest time. The basic trick a neural network does is that it can learn how to separate high-dimensional data. We might think of images as being simple, but to the machine they're actually kind of complicated. You have a red channel, a green channel, a blue channel, a height and width component, and then you're trying to map all of that to some category at the end. If you imagine that each input picture gets mapped to a specific category, that's literally what a neural network learns.

In order to do this, we often end up doing a lot of math where we have A applied to B, B applied to C, C applied to D, and so on and so forth. To train that, we use backpropagation and the chain rule from calculus. Everybody hates to talk about the chain rule, and so somebody said, hey, why don't we have a computer keep track of all this stuff? Automatic differentiation is not really a new subject in this field; it's actually from the 1970s or so. What is new is combining automatic differentiation with the compiler in order to model these neural networks at the language level.

Swift isn't really magically special in and of itself. This slide in the upper right is one I stole from Chris Lattner's keynote presentation at an LLVM conference earlier, but it's basically demonstrating how all these worlds are moving together. Swift's real secret power is that it was the first language of LLVM, which is a modern compiler that's used almost everywhere nowadays. So all these worlds are coming together: you write your high-level neural network code in your particular programming language, it's converted to an intermediate language, and then finally LLVM spits out code for whatever device you actually need to run it on. Right now people are coding for CPUs, GPUs, and TPUs, but a new area that's coming out is running things on devices, say on your mobile phone or even on Edge TPU devices, as we'll see later.

So the whole theory of this project, then, is that by getting everything to follow this path, you'll be able to target all these different runtimes, so the same code you're writing can run up in the cloud or on the device in your hands. The second level of this, and this is the really new area, is this whole MLIR effort. Rather than having each of these languages implement its own abstract syntax for doing this neural network stuff, they're trying to model it at a cleaner level, so all these languages will generate MLIR code, and from there we can go through LLVM to your device.

Here at the bottom we have some of the different forms of basic neural networks. Over here we have the perceptron. If you can imagine a line dividing all your cat and dog pictures, that's literally what a perceptron is, and it's from 1958; this is not as new as you may think. The basic problem is, like I said, that you can't actually reduce the data down to one dimension that easily. So there are these hidden-layer approaches, where you run things through a set of neurons, and that's how you get to your actual result. Pay attention to the deep feedforward neural network, and then to how we can add some convolutions on top, because that's what we're going to do for our next two steps.

The MNIST data set is a well-understood data set in computer vision. It's a collection of hand-drawn digits. They're all black and white, so the values are from 0 to 255, and each image is 28 pixels by 28 pixels, so this 8 right here is literally what one of the digits in the data set would look like. We're not even going to treat this as actual image data. What we're going to do is unroll it: we're literally going to take the top row, and then pull off one row at a time, until we have a really long vector. The second picture right here demonstrates the unrolling on just a four-by-four image, an imaginary one, but we can imagine the same concept across the 28 by 28 pixels to produce an input vector that's 784 pixels long.

Next we're going to take our input vector of 784 pixels, run it through two fully connected layers of 512 neurons, and map it to an output layer of ten categories, the numbers zero through nine.

I originally set out to write this demo myself, but a gentleman named Quan, who is out in Mountain View, wrote this code, so I simply took his code and modified it slightly to produce these results. This is what our very simple neural network is going to look like. It's nothing more than our input layer, 784 to 512, then 512 to 512 again, and then 512 to 10 out at the end. The reason we're using these Swift native data types is that we can now define our differentiation function in this simple line right here, and the compiler will take care of all the magic of actually making that happen.

So let's see what this looks like. Here's all the code, his actual code. He got it all the way down to about 40 lines, which is quite elegant, and all I did was modify this bit. Now we'll run his basic MNIST demo across the MNIST data set. I'm running this on my computer back in Missouri, but I'm SSH'd in from here. This simple neural network is able to get about 94% accuracy on the MNIST data set. We're kind of cheating, because we're using large fully connected layers, but bear with me and I hope this approach will make sense.

Convolutions. I would love to throw one slide up here and explain convolutions to you all in one slide, but I don't think that's possible. I think this slide right here, which I stole from an Nvidia deck about a year ago, is the best way I can try to tackle the subject. What we have in the back is our input image, and what we're going to output is a blurred version of our input image. We have this three-by-three convolutional kernel in the middle, and all it contains is the number one. What that means is that for each three-by-three set of input pixels, our output is simply going to be the sum of those pixels. I don't know if you can see the numbers very well, but it's literally two plus one plus two plus one plus one to get seven out. We then take this whole little window, move it over one set of pixels, and repeat the process again.
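To make the sliding-window idea concrete, here is a minimal sketch in plain Python (not the Swift for TensorFlow code shown in the talk). The function name `convolve2d`, the 4x4 sample image, and the choice of no padding are assumptions made for illustration; only the all-ones 3x3 kernel and the one-pixel step come from the description above.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products."""
    k = len(kernel)
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            top, left = r * stride, c * stride
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A kernel of all ones simply sums each 3x3 window (an unnormalized blur).
ones_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

# Hypothetical 4x4 input, not the values on the slide.
image = [
    [2, 1, 0, 1],
    [0, 1, 2, 0],
    [1, 0, 1, 1],
    [0, 2, 1, 0],
]

blurred = convolve2d(image, ones_kernel)
print(blurred)  # [[8, 7], [8, 8]]
```

With no padding, a 3x3 window fits in only two positions along each side of a 4x4 image, so the output is 2x2. The same rule turns a 28x28 MNIST digit into a 26x26 output.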

We keep going until we reach the end of the row, and then we repeat, moving everything down one row. This process of going over the image is called striding, and it is a very important concept for you to understand.

The other concept you need to understand is max pooling. All we're going to do is take this group of 16 pixels and convert it to a set of four: for each colored region, we literally just find the largest pixel and make that our output.

If we take these two concepts together and revisit the MNIST problem, we can actually significantly improve our quality just by changing how we're modeling our data. We're going to take our same 784 pixels, but we'll treat them as an actual image, so it'll be 28 by 28 pixels. We'll run this through two layers of 3x3 convolutions and a max pool operation, and we'll keep our same densely connected layers and an output of ten categories.

So here's what the actual Swift code for this looks like. I've literally taken the example from before and added a stack of convolutions on top. We take our input, run it through our convolutional layers, and then send it to the same densely connected output layers as before. This goes a little bit slow, since I didn't quite install everything in the optimized manner, but eventually it will run and we'll get up to about 97% accuracy on the MNIST data set. So by simply changing how we've modeled the data, using convolutions, we've been able to cut our error in half on this toy problem, we'll say.

Where do we go from here? Let's take on a slightly larger, more complicated problem. This is a data set called CIFAR. It's a collection of color pictures, so we have pictures of cats, dogs, and other animals, as well as vehicles like cars and trucks. We have ten categories, and now we're going to be working with color data, so we have an RGB component. But we can scale up the same basic approach that we used before to tackle this problem.

We'll simply take our input data, 32 by 32 by 3 channels, and run it through two sets of convolutions, a max pool, two more sets of convolutions, a max pool, our same two densely connected layers, and then we'll have ten categories for our outputs.

Here's what this model looks like, and we've done nothing more really than add another stack of convolutions. If you look at the very first line, 3 by 3 by 3 by 32, that's where we introduced color, and it didn't really make our network that much more complicated.

For this one I took the CIFAR demo from the Swift for TensorFlow swift-models repository, put my model in there in place of theirs, and then ran it. That will look like this. I'll let it run; it'll take a bit. Eventually we'll end up with a network somewhere around 70% accuracy, which isn't going to allow you to write a paper anytime soon, but technically this approach does work.

So you might look at this and say, well, heck, let's just keep on doing this: let's stack up more and more convolutions.
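As a small illustration of the max pooling step and of the 3 by 3 by 3 by 32 shape mentioned above, here is a sketch in plain Python. The names `max_pool_2x2` and `conv_weight_count` and the sample values are hypothetical. The idea that each convolution keeps the 32x32 spatial size ("same" padding) is also an assumption, since the talk does not state the padding used.

```python
def max_pool_2x2(image):
    """Keep the largest value in each non-overlapping 2x2 region."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def conv_weight_count(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases not counted)."""
    return kernel_h * kernel_w * in_channels * out_channels


# 16 pixels in, 4 pixels out.
patch = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [2, 6, 3, 1],
]
pooled = max_pool_2x2(patch)
print(pooled)  # [[4, 5], [6, 7]]

# First CIFAR layer: a 3x3 kernel over 3 color channels producing 32 feature maps.
first_layer_weights = conv_weight_count(3, 3, 3, 32)
print(first_layer_weights)  # 864

# Assuming padded convolutions that keep the spatial size, only the two
# max pools shrink the 32x32 input: 32 -> 16 -> 8.
side_after_pools = 32 // 2 // 2
print(side_after_pools)  # 8
```

Going from one grayscale channel to three color channels only changes the third number of the first kernel shape, which is why adding color barely complicates the model.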

I think if you could jump in a time machine and go back in time five years, you could be the world's foremost expert in computer vision. This is the VGG network from 2014 or so, and it's nothing more complicated than the things I've shown you so far. We're dealing with the ImageNet data set, so we have a slightly larger input of 224 by 224 pixels. We take our input through two layers of 3x3 convolutions, then a max pool, then two more layers of 3x3 convolutions and a max pool. We're looking at VGG19, so after that we have four layers of 3x3 convolutions and a max pool, four layers of 3x3 convolutions and a max pool, and four layers of 3x3 convolutions and a max pool. I'm using a slightly larger dense layer: we were using 512 for the two demos before, and this one is simply 4K, 4096. And then ImageNet has a thousand categories, so we have a thousand output nodes at the end.

So we take this and we'll apply one more mental leap on top. Rather than thinking of this as being 2, 2, 4, 4, 4, let's think of it as one set of two layers, one set of two layers, two sets of two layers, two sets of two layers, and two sets of two layers. If you can do that step, then we can jump from VGG over here to ResNet, which is our first solid modern approach in this field.

On the left side here we have the same VGG network that we were looking at before, so 2, 2, 4, 4, 4. In the middle we have the backbone of what's called ResNet-34, but it's conceptually no more complicated than anything we've looked at thus far. We have three sets of these pairs of 3x3 layers, four sets of these pairs of 3x3 layers, six sets of these pairs of 3x3 layers, three sets of these pairs of 3x3 layers, and then we have our output layer. The magic of residual networks is this dotted line that's being drawn down over here on the side.

The problem with the VGG approach is that these convolutional approaches are not very resistant to noise, so it's actually about as big a network as you can make. If each layer introduces only, say, 0.1 percent noise, by the time the signal goes through 19 layers that's going to significantly affect your results. ResNet introduces this concept of skip connections, and neural networks are kind of extremely lazy, so if they can find an answer, they'll shortcut everything else. The power of these residual networks, then, is that you can stack layers and layers of convolutions until you find something that overfits your problem, and then you can dial it back to produce a simplified, in theory best-case, answer.

So that, then, is ResNet-34. We need to do one more trick: we need to move away from our pairs of 3x3 convolutions.
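The skip connection described above can be sketched in a few lines of plain Python. This is only an illustration of the idea that a block's output is its layers' output plus its own input. The helper names (`residual_block`, `zero_layer`, `halve_layer`) and the toy "layers" are hypothetical stand-ins, not real convolutions.

```python
def residual_block(x, layers):
    """Return F(x) + x, where F is the stack of `layers` and `+ x` is the skip connection."""
    out = x
    for layer in layers:
        out = layer(out)
    return [o + xi for o, xi in zip(out, x)]


def zero_layer(values):
    """Stand-in for a layer that has learned to contribute nothing."""
    return [0.0 for _ in values]


def halve_layer(values):
    """Stand-in for a layer that has learned some transformation."""
    return [0.5 * v for v in values]


x = [1.0, -2.0, 3.0]

# If the stacked layers contribute nothing, the block passes its input
# through unchanged, so an unneeded block does not have to degrade the signal.
identity_out = residual_block(x, [zero_layer, zero_layer])
print(identity_out)  # [1.0, -2.0, 3.0]

# Otherwise the layers only need to learn a correction on top of the input.
learned_out = residual_block(x, [halve_layer, halve_layer])
print(learned_out)  # [1.25, -2.5, 3.75]
```

This is one way to read the "lazy network" remark: the shortcut is always available, so extra blocks can be stacked without forcing every one of them to do useful work.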

If we look here in this other quadrant, what we're going to do is replace our pairs of 3x3 layers with a 1x1, 3x3, 1x1 style approach. So 3 + 4 + 6 + 3 is 16, times 2 is 32, plus a head and an output layer, so that's ResNet-34. The same 16 times 3 is 48, plus head and output is 48 + 2, so this is ResNet-50.

So let's do a quick demo of training ResNet-50 on the ImageNet data set using a Cloud TPU. What we'll need to do first is create a Cloud TPU, which is simply running this command. I did this ten minutes ago so we won't have to watch it get started. So here we have a Cloud TPU running up in the cloud. We'll start this whole process. It'll spit out a whole bunch of line noise, we'll say, a lot of warnings about TensorFlow 2, but if we wait about ten to twelve hours, this will output a ResNet-50 trained on the ImageNet data set.

So where do we go from here? Don't let the 2015 up there fool you. ResNet-50 is more or less your best first bet for most computer vision problems, still today. Many people have come up with different networks, and some of them technically produce slightly better results, but more often than not you should come back to this basic model for your first approach.

Let's look at these bottleneck blocks a little bit more. I would argue that this 1-3-1 approach is not as powerful as the 3x3 approach we've looked at so far. The reason this bottleneck layer has better results is hidden in the 256 that's shown on the last layer: in this 1-3-1 block, the last layer is technically four times as large as the other layers. So the bottleneck layer is not as powerful as the 3x3 approach; however, it's cheaper. Because it's cheaper we can run more of it, and because we can run more of it, that's ultimately why this approach produces better results.
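The layer-count arithmetic and the "cheaper" claim above can be checked with a short plain-Python sketch. The function names are hypothetical. The 64-channel width is an assumption inferred from the remark that the 256-channel layer is four times as large as the others; biases and normalization parameters are ignored.

```python
# Blocks per stage, as read off the slide: 3, 4, 6, 3.
STAGE_BLOCKS = [3, 4, 6, 3]


def resnet_depth(convs_per_block):
    """Total layers: convolutions inside the blocks, plus the head and the output layer."""
    return sum(STAGE_BLOCKS) * convs_per_block + 2


def conv_weights(kernel, in_channels, out_channels):
    """Weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


resnet34_depth = resnet_depth(2)  # pairs of 3x3 layers
resnet50_depth = resnet_depth(3)  # 1x1, 3x3, 1x1 bottlenecks
print(resnet34_depth, resnet50_depth)  # 34 50

# Basic block: two 3x3 convolutions on a 64-channel feature map (assumed width).
basic_block = 2 * conv_weights(3, 64, 64)

# Bottleneck block: squeeze 256 -> 64, a 3x3 at 64 channels, expand 64 -> 256.
bottleneck_block = (
    conv_weights(1, 256, 64) + conv_weights(3, 64, 64) + conv_weights(1, 64, 256)
)
print(basic_block, bottleneck_block)  # 73728 69632
```

Under these assumptions the bottleneck block has slightly fewer weights than the basic block while reading and writing a feature map four times as wide, which is consistent with the argument that it wins by being cheap rather than by being more powerful per layer.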
So in order to replace ResNets, we need something that's not necessarily better; we need something that's actually cheaper, or, to use a slightly different word, more efficient.

This is a paper that came out in May of this year, and it's a culmination of several years of research by the Google team. People have tried to build wider networks, people have tried to build deeper networks, and people have tried to build larger networks in terms of the size of the inputs, but nobody has really found the perfect combination. In this paper, what they did is take the MNAS approach from MnasNet from last year, add in some different layer types from other cutting-edge networks, and then effectively leave the computer alone and let it search across all this parameter space in order to find the most optimal set of networks.

They've done similar things before, notably with NASNet and then the AmoebaNet papers from last year, but what's interesting to me about this paper is that they're applying human intuition and logic on top. They've literally come up with a formula whereby, if you come up with one network, they can multiply the parameters in your network in order to produce larger versions of it. This is really cool. With a lot of the reinforcement learning stuff, you end up with networks that only computers understand, whereas this is humans adding another layer of intuition on top, so the two are working together, we'll say.

Which brings us to EfficientNet-EdgeTPU. We can think of our search space as being, say, the accuracy or quality of our models.
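The scaling formula mentioned above is not spelled out in the talk. The sketch below assumes it refers to the compound scaling rule from the EfficientNet paper, with the constants 1.2, 1.1, and 1.15 as reported there for the baseline network. Treat both the constants and the helper names as assumptions rather than something shown on the slides.

```python
# Assumed base multipliers for depth, width, and input resolution.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Multipliers applied to a baseline network for scaling coefficient `phi`."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def relative_cost(phi):
    """Approximate compute relative to the baseline.

    Cost grows linearly with depth and quadratically with width and
    resolution, so each step of phi multiplies it by ALPHA * BETA^2 * GAMMA^2.
    """
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    scale = compound_scale(phi)
    print(
        phi,
        round(scale["depth"], 2),
        round(scale["width"], 2),
        round(scale["resolution"], 2),
        round(relative_cost(phi), 2),
    )
```

With these constants each increment of `phi` roughly doubles the compute budget, so a single knob produces a family of progressively larger networks from one searched baseline.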

But we can also model our search space differently. We can ask what our latency is, meaning how long this network takes to run, how large our network is, how many different operations we're using, and how many individual parameters we have. Google has been shipping these Edge TPU devices, which you can buy for about 75 bucks or so. They gave this EfficientNet search the Edge TPU hardware type and said, produce the best type of network for this particular device. What happened is that we had a 1x1 convolution combined with a 3x3 convolution, and what the search found is that by combining these two together into a larger 3x3 convolution, you could actually produce better results in a faster amount of time.

What we have up here, then, is our ResNet-50 model, and as you can see up here, we have what we would call the holy grail of image recognition search: a network that's smaller, faster, and more accurate, which is all you can really ask for.

What we're going to do now is demo running EfficientNet-EdgeTPU-S, the S variant, the one this arrow is pointing to, on an actual Edge TPU device. The first trick is that we're going to use a TPU v3 instead of a TPU v2, so that's in this command here. The second trick is that this is all a little bit bleeding edge, so we have to use a nightly build of TensorFlow, and we tell the computer to do that right here. Next we have a bunch of parameters and whatnot, but it's basically very similar to our ResNet command.

So here's my Cloud TPU v3 running, and we'll just literally copy and paste the command in here. We'll give it a few seconds to get going. Okay, so now we're training EfficientNet-EdgeTPU-S in the cloud on a TPU v3. This will take about 30 hours to run, but at the end we'll have produced the checkpoint.

Next we'll just literally copy this checkpoint from our remote server down to my local machine; we'll skip that step here. The Edge TPU device uses int8 math, whereas the cloud is using floating point, so we need to quantize our model, that is, convert it from floating point into int8. For this we'll use another script that the Edge TPU people have provided. The only fun part of getting this working is that it relies on the TensorFlow XLA ops, which are not installed by default in the TensorFlow builds, so you have to compile it from source.

This takes about a minute or so to run, and then we'll have a quantized checkpoint of our EfficientNet-EdgeTPU model. Then we just need to run our model locally using an actual Edge TPU device. I got a picture of a panda off Wikipedia, and we're using that for input. As you can see, it thinks we have a panda with, we'll say, 60% probability, but it might also be something else, with 11% or 12% probability or so.

So, fairly broadly, our goal was to explore the concept of convolutional neural networks to perform image recognition. Towards that end we built a one-dimensional neural network, we added convolutions, and then we approached the MNIST problem again using a 2D approach. From there we looked at how we could stack blocks up to tackle larger and more complicated problems in this field. Then we looked at how we can introduce residual layers, and then finally began to actually modify our different block types in order to produce the state-of-the-art approach in this field.

I've talked a lot about images up here, we'll say, but maybe the more interesting applications of CNNs are in completely different fields. We can add another layer on top of our 2D CNN in order to get a 3D CNN, and we can use this to start to tackle depth data, so lidar and things like that. People have taken language models and converted them into CNN-style approaches; QANet was an interesting paper from last year where they did that. For planet detection, they can take a 1D CNN approach and then do some other tricks on top in order to begin to detect exoplanets, so AstroNet was a really interesting paper in this field. The AlphaFold paper came out earlier this year; they use a combination of 1D, 2D, and 3D neural networks together in order to significantly advance the state of the art in protein modeling. And then finally, the ever-popular AlphaGo and AlphaZero engines.

Originally I tried to put a little bit of each of these papers up here, but the slide got a little bit busy, so I reduced it back down to this one picture. What you're looking at is the inner layer of the AlphaZero engine, which is composed of 40 of these residual blocks.

These are the residual blocks you're looking at right here. What I thought was interesting is that this AlphaZero block is composed literally of a residual layer, the same approach we looked at before, with two 3x3 convolutions. So the same approaches that we've used to do our image recognition can, in a completely different domain, plus a whole bunch of reinforcement learning, be used to solve the game of Go.

So that's all I've got, and thanks for coming.

*2019-11-29 10:47*