This Insane AI Video Search Technology Selected by NVIDIA and Snowflake | Twelve Labs, Jae Lee

If you try to accommodate all of the advice that you get from your mentors, your company will most likely become like an underwear company, totally different from what you wanted to build. Having some fundamental foundation for yourself and for the company, and being able to say thank you for your advice, but no thank you. So, having the guts to say no to someone that you respect: as a founder, I think I grew a lot.

Hi, my name is Jae. I'm one of the co-founders and CEO of Twelve Labs.

Twelve Labs is an AI research and product company based here in San Francisco and in Seoul. We are building video foundation models for developers and enterprises building video-centric products. We basically build humongous AI models that can understand videos like humans, and we serve them via APIs to developers who are looking to build really powerful semantic search, classification, or summarization into their products. We currently have a little over 20,000 developers actively using our search API. We have some of the world's largest creators adopting Twelve Labs, as well as media, entertainment, and large sports organizations, and law enforcement.

So I was born in Seoul. I spent about ten years in Seoul. I had a chance to move to the States. I moved when I was 11, so I went through elementary and middle school in Knoxville, Tennessee. I was able to pick up a lot of the culture and things like that. I was very interested in expanding my perspective and exploring the new world. My first experience with software engineering, or at least coding, was Matlab. My uncle was actually getting his PhD at the University of Tennessee, so I would see him plotting these distribution graphs and things like that, which made me curious about what he was doing.

That's how I got into playing around with small data sets and doing the same thing that he was doing, because I wanted to be relevant and I wanted to talk to him about a bunch of things. He probably thought what I was doing was pretty cute, which sparked my interest in learning more about how we capture all this data and create a system that can really understand the distributions of the things of the world. And it just kind of felt like if you had an understanding of that, it gives you the power to predict anything. I went to Berkeley for college and studied computer science, so I really geeked out on AI and software engineering. So I spent about 15 years in the United States.

So half of my life was in Seoul and half in the States. I was drafted into this organization called Korean Cyber Command, where like-minded people were already there, armed with incredible knowledge in software engineering and AI. I joke that I served the country with keyboards rather than a rifle. I was really fortunate to have met my chief architect, SJ, and then Aiden also joined in. I still clearly remember Aiden with his buzzcut coming in from boot camp. But from day one, we knew we had this common interest in AI, and in what we could do as young scientists to really push the frontier of AI development.

And we spent a lot of time reading papers, discussing, arguing. And it turned out there were two clear paths. One was, after the military, we go pursue a career in academia and become a professor; the other was to start something of our own. And looking back, what we realized is that we were spending so much time together, and the military is this really special setting where, you know, you're basically jamming in 50 to 100 twenty-year-olds, right? With testosterone. What we thought was, you know, if we're having this much fun in the military, imagine what we can do when we get out. So it was pretty clear to us that we were going to start. I think we spent about a year and a half really thinking about what the next frontier for AI is, and how we can contribute to pushing that boundary. There's this seminal paper called Attention Is All You Need, which some people call the transformer paper, which is making a lot of impact now.

But when it was published in 2017, at least for text- and image-based foundation models, capital was probably going to be the moat: whoever raises the most money. What we realized was that there were still a lot of unexplored research areas in multimodal video understanding. There, rather than capital, the moat is probably going to be really passionate, smart, but also slightly dumb people, dumb enough to start. They will have a really good chance of succeeding. We also realized that, with its explosive growth, video and other complex multimedia data are going to be the infrastructural data for the internet. And we realized that developers and enterprises probably need something better than object detection or transcription to make sense of all this video data that humanity is creating.

So it was a no-brainer to start building for video understanding. The model that Twelve Labs is building is basically trying to map human language onto whatever is happening in video content. If you can map precise human language to whatever is happening within video content, that gives you these emergent capabilities, like being able to search for things really well, or being able to classify things or summarize.

You know, we didn't all join the Korean Cyber Command at the same time. SJ was already six months ahead of me, and then Aiden was six months behind.

So, okay, we decided we're going to start this company. But then SJ is leaving next year, and then I'm leaving the year after, and then Aiden six months after that. So how do we do it? It was genuinely very scary. So we had our ideas. I remember SJ was discharged on a Thursday, and he came back to the military base on the Saturday of that week with our laptops and took us out. In front of our military base there was a bagel shop called Last Bagel, and that was our office, where SJ would bring all of our laptops and we would do our research and a little bit of prototyping.

And we did that for six months. And then I got discharged, and then I did the whole laptop carrying and taking Aiden out. So we did that for a good year, until everyone was out. We had a bunch of friends who were working in AI and crypto; blockchain and Web3 were just booming. And the founders had a mutual friend who had a really nice office in Seoul, and he was like, oh, you guys can come in and use our office space. So that's what we did.

And then after about three weeks, that company went bankrupt. And then some really scary people started coming in, and we had our desktops and our GPUs all set up there. And we got really scared.

So we brought everything back out and we found a really tiny office, probably the size of a dressing room. That's where all five of us spent the next six months before we raised our proper seed round. Looking back, if we were to do it again, I don't know if we'd be able to do it, but some people say ignorance is bliss. And I think we were just really naive and really excited about building this company. And I guess not knowing what was ahead allowed us to do what we did. When we first started the company and hired our first employees, they had a hard time explaining what Twelve Labs does to their parents.

It was a really new thing. What is a foundation model? What is video understanding? It's a new concept that's hard to understand for people who are not in this space. Nowadays people talk about the foundation layer, the tooling layer, and the application layer. Everyone's very familiar with it all.

When we started Twelve Labs, I think technologically it made total sense. We knew it was going to happen, but what was uncertain was: will the market accept it? But we knew that we were on the verge of a breakthrough in building an AI that can at least get to a certain level of human understanding of videos. So we were betting on the market's acceptance of foundation models. The founders were pretty much broke, right? Because we had spent two years in the military, and I think we had $2,000 to start with, so barely enough to do anything. But we had to figure out what impactful thing we could do, given our resources, that would put us on the map, or at least let the world know that what we're doing is relevant. So our tactic here was, okay, we're going to talk to a bunch of customers.

And there were early believers in Twelve Labs who took our APIs and built awesome things with us, but we basically needed more exposure. So as a team, we decided to participate in ICCV, the International Conference on Computer Vision. They were putting on this awesome competition for video understanding. So we talked to Aiden: we have nothing to lose and only to gain.

The team was extremely supportive of Aiden spearheading that effort. All I could do to support was, you know, there were some ideas and directional feedback that I gave to Aiden, but we needed compute, and we needed the determination to put some serious cash behind it. And back then, for Twelve Labs, $200,000 in compute was a lot of money. So just thinking that we were going to blow through $200,000 in ten days of compute was really scary. But the team was able to use that precious capital and build something incredible that helped us win the competition.

So I think the important thing is, if you're building something really impactful and you think that it's going to significantly change the industry that you're in, there will always be someone who has a very similar thesis. It's just a matter of how you get yourself out there. How do you let people know that you exist? For us, that was the competition. After we won the competition, we found that companies like Index Ventures and Radical Ventures had a very strong thesis around multimodal AI. The next idea would be: what's next?

And video happens to be the most relatable multimodal data. These amazing companies actually came in inbound, so they reached out to us and we started jamming. One conversation turned into the next, and then we talked about technology, and it just happened very serendipitously. For the first pitch deck, I was in Seoul, and my first call with Index Ventures was at like 3:30 a.m. Seoul time. We didn't have a pitch deck, this was our first meeting, and we knew nothing about fundraising then. We didn't even know this was going to be a friendly introduction, but I just kind of felt the need to, like, oh, we don't have a pitch deck, we need to build one.

So I remember staying up till like 3 a.m. building it. I think the storyline was quite simple. We didn't have too much to show for it. The idea was, hey, the problem that we're solving is massive. 80% of the world's data is in video, and there's no adequate solution or technology out there for developers and enterprises to make sense of it all. That is the market that we're tackling.

We want to index all of that, like 80% of the world's data. And this is the technology and the underlying research work that we've done. And that was it. VCs asked a lot of hard questions.

The most memorable one was: if I bring you TikTok as a customer and they want to index a billion hours of content, how long does it take? And I think we were thinking about maybe a million hours. It's going to take like ten years, right? That's when we realized we should never be comfortable with what we've built. There's this whole new, incredibly large world out there. You know, maybe some people are impressed with our system being able to index a million hours. But there are others who are thinking about a billion, 10 billion, 100 billion hours.

And that was a really challenging question, because what we said during our pitch was that we want to index all of the world's videos. That question kind of stunned me. Thinking about all of the technical issues that we had at the time, he probably found it funny, right? I was trying to give my best answer as a founder. So the company has raised about $30 million in seed funding.

My name is Soyoung Lee. I'm one of the co-founders of Twelve Labs, and I currently lead our business development and go-to-market. We had a customer who was paying for our product, but they weren't actually using it. But we had gone through a lot of work to actually get them as a customer, through a lot of sales and relationship building and so on.

But I think they were extremely early, and we had almost pushed the sale to happen. And I think what we learned from that experience was that, even early on, it might have been better for us not to make the sale, because the customer probably wasn't ready. They didn't have the passion or the innovative drive that our other customers had. But we were optimizing for hearing the yes. We tried so hard to make that no into a yes, and we succeeded. But in the end, I think it turned out that we probably should have kept it as a no and focused on all the other customers where the yes was clearer and who had a very clear vision of how they could build their new experiences with the technology. Especially for earlier products and earlier technologies, where resources are limited and you want to build for your best customers and the innovators in every field, you should probably start to optimize for hearing the no rather than the yes, because that will help you find the right direction faster.

I think we quickly learned through trial and error that the types of early customers we need to find and work with are true innovators in their field, whether they come from the content creation space, the law enforcement space, e-learning, and so on. You know, we've had instances of trying to oversell.

We were going through sales 101 books and sales methodologies. And we were trying to pitch to the customer: hey, here's the use case you could build out with our technology; we'll improve your ROI by x percent with our technology. And we did make some sales from those, but I don't think it was something that we should have spent so much time on.

Because if you find the right customer who's innovative, you don't need to explain anything to them. You show them a use-case-based demo. For us, that would be indexing some videos that resemble the customer's, and then showing them how you can search, or generate text very easily, just like a person would had they been watching the video. They can draw out the full map of, okay, this is what I want to provide to my customers or my users.

This is the technology that you have, and they can fill that gap in pretty easily. And they already know what the return on that, or the opportunity of that experience, would be for them. For the really large, high-profile customers that we have right now, it was all the same process. We actually learned all of this from the customers because, having seen our demo and the early hints of the technology, they were able to teach us about how they could utilize the technology.

And even today, it's not just a single use case that they want to power. They come to us with four or five different ideas of how the technology can impact different business units, optimize different workflows, or build new experiences.

I think we make stupid mistakes probably every day. The one that I regret the most is that we had this conviction around building a foundation model, but we didn't have any data point as to how you build that company. And I think I blindly believed the startup mantra: identify a narrow problem and build a very narrow solution for it. I think we spent a lot of the early days thinking about, okay, if we have this really powerful AI that can understand videos, what do we do with it? And we knew from the get-go that we wanted to serve it to developers and enterprises.

But I think, you know, we had mentors and other founders saying, oh, you should build TikTok 2.0, or you should build YouTube 2.0, or you should build that. And I think we spent a lot of time thinking, okay, maybe TikTok 2.0 makes sense, or maybe Gong 2.0, like sales call analysis. But that didn't really excite us, because we knew that we're good at building the infrastructure and helping developers build the next thing. And that's probably the stupidest thing that we've done: spending time thinking about things that we're not excited about.

As a founder and CEO, it's really easy to get distracted. If you are a first-time founder or a young founder, your mentor's advice means a lot to you. But having your own grounding, and also relying a little bit on your own gut feeling, is very important, because what we've realized is, if you try to accommodate all of the advice that you get from your mentors, your company will most likely become like an underwear company, totally different from what you wanted to build. So my key takeaway is having some fundamental foundation for yourself and for the company, and being able to say thank you for your advice, but no thank you. So, having the guts to say no to someone that you respect: as a founder, I think I grew a lot.

Twelve Labs has a multi-year compute partnership with Oracle Cloud Infrastructure, where we get all of the state-of-the-art Nvidia chips. And Oracle put together this small event for their key partners that are building foundation models. Aiden and I had a chance to meet with Jensen, because Jensen was at that event, and we had, I think, 5 to 10 minutes to talk about Twelve Labs. It seems like he has a special place in his heart for computer vision and video understanding; you know, that was one of the first use cases that Nvidia chips powered. So we got to meet with Nvidia folks from that event.

And then Twelve Labs was featured in Nvidia's 2023 GTC. And then I think that kind of sparked other people from Nvidia to be interested in Twelve Labs. And Nvidia's venture team reached out to us. It was quite casual.

We were talking about Twelve Labs and the future that we're drawing, and the future of multimodal video understanding. And I think the venture team also had an idea of how Nvidia and Twelve Labs can partner up on more than just financial investment, and also think about a really robust product partnership. From then on, what Twelve Labs is doing and what Nvidia wants in vision and video understanding were just a perfect match. It happened quite naturally from conversing about our technology, our roadmap, and Nvidia's future in terms of producing really powerful chips for edge devices for smart cities, right? So there's that natural fit of the two companies' products really creating synergy.

Nowadays I am focusing mostly on hiring. I think Twelve Labs is a group of great people. I spend a lot of time meeting great people.

I want to be able to recognize greatness when he or she comes in. Good engineers, or even just good people in general, have these core values. I would go near their places and get together at a cafe, and we would speak for three or four hours. My way of deciding whether this person is a good fit for Twelve Labs is whether I'm able to learn from their core values.

Everyone's really good at coding nowadays, but great engineers can apply their core values and their skill sets, and are able to talk about the company that they're excited about and how they want to impact it. How do you see the product evolving? How do you see our interfaces evolving? Some of the best engineers are not the best coders, but having that really strong perspective and groundedness is very important, and I try to look for that.

Twelve Labs' vision for the next two years is really becoming the horizontal video understanding infrastructure for all of the businesses and developers that are working with video data. We want to enter into streaming as well, so real-time video data, and really become a visual cortex for modern video applications.
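The idea described in the interview, mapping human language onto what happens in a video so that search falls out of it, can be illustrated with a small sketch. This is not Twelve Labs' model or API. It assumes a hypothetical setup in which some model has already turned each video segment and the text query into vectors in a shared space; the `Segment` type, the `search` function, and the three-dimensional toy vectors are all made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """A clip of a video plus its (hypothetical) embedding vector."""
    video_id: str
    start_s: float
    end_s: float
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_embedding: list[float], segments: list[Segment], top_k: int = 2):
    """Rank segments by similarity to the query and return the top_k (score, segment) pairs."""
    scored = [(cosine(query_embedding, s.embedding), s) for s in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data: in a real system these vectors would come from a video model
# and a text encoder trained into the same space (an assumption here).
SEGMENTS = [
    Segment("game.mp4", 0.0, 10.0, [0.9, 0.1, 0.0]),
    Segment("game.mp4", 10.0, 20.0, [0.1, 0.9, 0.1]),
    Segment("lecture.mp4", 0.0, 30.0, [0.0, 0.2, 0.9]),
]
QUERY = [0.8, 0.2, 0.1]  # stand-in for the embedding of a text query

results = search(QUERY, SEGMENTS)
for score, seg in results:
    print(f"{seg.video_id} [{seg.start_s:.0f}s-{seg.end_s:.0f}s] score={score:.3f}")
```

Once language and video live in the same vector space, search is a nearest-neighbor lookup, and classification can be framed the same way by comparing segments against label descriptions.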

2024-12-23 06:08
