TIGER-Lab Talk: Diffusion-based Controllable Video Editing


Yeah, I think people are joining. Okay, so this week we have Hyeonho Jeong presenting his work on video editing. Hyeonho is a graduate student at KAIST AI, and his research focuses on controllable video synthesis and video editing using diffusion models. Hyeonho, I will pass it to you.

Good morning everyone, I'm Hyeonho Jeong, and thank you for having me to give a talk to this group at the University of Waterloo. In this talk I will introduce several works on diffusion-based video editing and generation, and instead of diving into minor details of each work, I will try to share my perspective on the problem settings and the key insights behind the proposed solutions.

Before we start, a bit about myself. I majored in computer science for my bachelor's, and since last year I have been pursuing a master's at KAIST in South Korea. This summer I was lucky to do an internship at Adobe's London lab, and recently I received a full-time offer from Adobe, which I will join as a research engineer after my master's finishes. This is my group at KAIST; it is a fairly large group, and it has always been working on diffusion-based generative modeling.

Now let me start with my project called Ground-A-Video, which I presented at ICLR this year. Very simply, this is the first work that brought grounding conditions to video editing, and it is still a very powerful framework for complex multi-attribute editing of videos, even though it is built on an image diffusion model. At the time of this research there were no publicly available video diffusion models, unlike nowadays. As a result, prior and concurrent works had to rely on Stable Diffusion to perform any type of video editing, and naturally there were several limitations. Temporal consistency of the generated videos was pretty disappointing, and by temporal consistency I mean noticeable color and appearance inconsistencies across frames. Another limitation I found was that almost all existing methods become very inaccurate in the scenario of complex multi-attribute editing.

The simple core idea to address these issues is to get help from additional spatial conditions to complement the reverse diffusion process. Regarding spatial conditions, I prefer to categorize them into two types, spatially continuous representations and spatially discrete representations, where most people usually focus on the former only. Good examples of spatially continuous representations are depth maps and edge maps: they are strictly aligned with the input frames, and they explicitly provide information about the structure and shape of objects through every part of the frame. Spatially discrete conditions, on the other hand, refer to modalities that do not rigidly dictate the precise structure and shape of objects; instead they give implicit hints about the overall layout of the scene and the positions of each entity. Specifically, in this work we focus on groundings, and by groundings we mean pairs of bounding-box coordinates and text descriptions of the identity inside each box. The intuition of Ground-A-Video is to use the best of both worlds to guide the denoising process, specifically these three modalities: depth maps, flow maps, and video groundings. Then how can we inject the spatially continuous conditions into the denoiser UNet backbone?
Since these conditions are spatially aligned with the input tensor, things are easy and intuitive, so I will go over this briefly. First, we warp the initial latent using the correspondences from the optical flow map, which guarantees that repeated regions across frames share the same pixel values, and doing this greatly enhances temporal consistency. Second, we use depth maps with ControlNet: we encode the depth-map features into the latent UNet space using ControlNet, and the encoded features are simply added to the original UNet backbone features, which gives pretty good structure guidance, as demonstrated in various image editing works.

Making use of the spatially discrete conditions, the groundings, is trickier because they are not spatially aligned; in other words, these tensors have totally different shapes from the input video tensor. A gated attention mechanism solves this issue. Gated attention accepts grounding tokens, so we first need to encode the groundings, the layout information, into grounding tokens, as shown on the right, and of course we do this for every video frame independently. For each Transformer block within the UNet backbone, gated attention layers are inserted between the self-attention and cross-attention layers, and the gated attentions are responsible for projecting the grounding information from the grounding tokens onto the original latent features. This is how the gated attention mechanism operates: in essence it is basically a self-attention. The purple frame tokens, the original tokens that are the outputs of the previous self-attention, are first merged with the grounding tokens, and then all the tokens are processed together in the gated self-attention module. After this attention, the resulting grounding tokens are discarded and only the updated purple frame tokens proceed to the next cross-attention layer. I'll show a small code sketch of this gated-attention layer shortly.

Lastly, since we build our method on a text-to-image diffusion model, the model inherently does not allow any interaction across frames. To handle this, we use a strategy that is pretty common in video editing projects: we inflate all existing spatial attentions into spatio-temporal attentions, and by spatial attentions I mean both the original self-attention and the newly added gated self-attention.

To sum up, this is the overall framework of Ground-A-Video. The left side illustrates the input preparation stage and the right side shows the denoising process. As you can notice, Ground-A-Video requires quite a few more conditions than other works, for example depth, flow, and groundings, but obtaining all of these conditions is automated, and thanks to them Ground-A-Video significantly outperforms existing baselines in complex editing scenarios. For example, given this input video of a man walking a dog on a road, Ground-A-Video can easily change the dog into a Lego dog or the person into a Simpsons character with great temporal consistency and accuracy, but usually other works can do this too, and changing the background is also easy. The main contribution of this project, however, is that you can edit the dog, the man, and the road respectively in one go, including the case of multiple edits, and also multiple edits combined with style transfer. This is another example, a rabbit eating a watermelon on a table: single-object editing is easy, but the framework can also perform much more complex edits, even four different edits at the same time.
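Before moving on to more results and comparisons, here is a minimal PyTorch sketch of the kind of gated self-attention described above (the mechanism matches the GLIGEN-style grounded-generation layers this line of work builds on). The dimensions, the Fourier/MLP grounding-token encoder, and all module names are illustrative assumptions, not the actual Ground-A-Video code.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Sketch of a GLIGEN-style gated self-attention block, applied per frame:
    frame tokens and grounding tokens attend together, the grounding tokens are
    then discarded, and the visual update is added back through a learnable gate."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0, so the layer starts as identity

    def forward(self, frame_tokens, grounding_tokens):
        # frame_tokens:     (B, N_visual, dim)  output of the preceding self-attention
        # grounding_tokens: (B, N_boxes, dim)   encoded (bounding box, phrase) pairs
        x = self.norm(torch.cat([frame_tokens, grounding_tokens], dim=1))
        out, _ = self.attn(x, x, x)
        visual = out[:, : frame_tokens.shape[1]]            # keep only the frame tokens
        return frame_tokens + torch.tanh(self.gate) * visual

def encode_grounding(box_xyxy, phrase_emb, fourier_dim=64, out_dim=320):
    """Hypothetical grounding-token encoder: Fourier-embed the 4 box coordinates,
    concatenate with a text (e.g. CLIP) embedding of the phrase, and project.
    The projection here is untrained and only illustrates the shapes involved."""
    freqs = 2.0 ** torch.arange(fourier_dim // 8)            # frequencies per coordinate
    angles = box_xyxy[..., None] * freqs                     # (..., 4, fourier_dim // 8)
    fourier = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)
    feat = torch.cat([fourier, phrase_emb], dim=-1)
    return nn.Linear(feat.shape[-1], out_dim)(feat)
```

The zero-initialized gate means the pre-trained backbone behaves exactly as before at the start of training, which is why such layers can be dropped between the existing self-attention and cross-attention of a UNet block.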
This is another result, going from a simple edit to a complex edit to even more complex edits. The framework can also generate stylized videos: given these real-world videos, it can generate videos that have high fidelity to both the input videos and the target style text. For the baseline comparisons we use 20 videos from the DAVIS dataset, and as diffusion-based video editing baselines we use Tune-A-Video, ControlVideo, Control-A-Video, and the commercial product Gen-1. Qualitative results are shown on the top and the numbers are shown on the bottom. This is another comparison example, where we intentionally made the editing scenarios very complex and challenging.

One necessary and emerging research direction is to assess and compare existing video editing methods systematically, because every paper claims to be the best using different metrics and different benchmarks. Here is a recent paper called EditBoard, which provides a comprehensive evaluation benchmark specifically for text-based video editing. In particular, they compare five video editing models, including Ground-A-Video, all based on text-to-image diffusion for a fair comparison, and they split video editing into a few subtasks, such as single-object editing, multi-object editing, and a few more categories. Ground-A-Video surpasses the other baselines in almost all metrics and in most subtasks; they even set Ground-A-Video aside in their discussion section and continue their analysis there.

So we have explored what we can do with image diffusion models. Then what creative video tasks can we do in the presence of video diffusion models? I will introduce my work VMC, which was presented at CVPR this year. Before we talk about the specific task called motion customization, which is the main topic of this work, let's discuss text-to-video diffusion models a bit more. Finally, text-to-video diffusion models have been open-sourced, not all of them, and the closed-source ones are usually far superior to the open-sourced ones. Many categorizations exist, but I like to classify video diffusion models into non-cascaded and cascaded ones. Non-cascaded video diffusion models are the typical models we normally refer to, like AnimateDiff, Stable Video Diffusion, Open-Sora, or Open-Sora Plan: starting from Gaussian noise and a text prompt as condition, they iteratively denoise and reach clean videos, and these videos are usually the final outputs that users get. Cascaded video diffusion models, on the other hand, consist of several denoising processes from several models, where the term cascaded can simply be understood as a sequential pipeline of experts, each expert itself being a video diffusion model. They were first proposed by Make-A-Video and Imagen Video, and the recently announced Movie Gen also belongs to this family, because it employs a spatial upsampler that is a video diffusion model. Specific details vary across works, but roughly speaking the cascaded video diffusion pipeline consists of three stages: keyframe generation, temporal upsampling, and spatial upsampling, where all of these models are again video diffusion models. The initial keyframe generator has a very strong text-to-video prior and generates spatially and temporally low-resolution videos from noise. The temporal upsampler then performs frame interpolation, in other words increasing the FPS of the original keyframes. Finally, the spatial upsampler performs video super-resolution, producing the final output video.
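As a rough schematic of the cascade just described, the three stages run strictly in sequence, each consuming the previous stage's output. The callables and arguments below are placeholders standing in for whatever keyframe generator and upsamplers a given cascade (Imagen Video, Movie Gen, and so on) actually uses, not a real API.

```python
def cascaded_text_to_video(prompt, keyframe_model, temporal_upsampler, spatial_upsampler,
                           n_keyframes=16):
    """Schematic cascaded text-to-video pipeline. Each model is assumed to be a callable
    returning a video tensor; real systems add more stages, conditioning, and details."""
    # Stage 1: keyframe generation from noise. Spatially and temporally low resolution,
    # but carries the strong text-to-video prior. Runs only once per video.
    keyframes = keyframe_model(prompt, num_frames=n_keyframes)

    # Stage 2: temporal upsampling (frame interpolation) raises the FPS. Often applied
    # several times, and each pass needs the previous output, so the passes cannot be
    # parallelized across GPUs.
    video = temporal_upsampler(keyframes)

    # Stage 3: spatial upsampling (video super-resolution) produces the final frames.
    return spatial_upsampler(video)
```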
The videos below are example outputs of the Imagen Video pipeline. I am personally a big fan of cascaded diffusion frameworks, and I still believe they are quite underexplored despite the fact that they can easily generate very high-quality, high-resolution, long videos. One reason they are underexplored, I believe, is a fatal limitation: the whole process inherently takes a long time. Surprisingly, acceleration research such as consistency models or progressive distillation is not being applied to the real computational bottlenecks, in my experience. For example, some few-step diffusion methods have been proposed for video models, but they can only be applied to the initial keyframe generation stage, the general text-to-video diffusion model, which is run only once throughout the whole process. Also, the latent tensors at this stage are temporally and spatially very small, so even without acceleration this inference takes a very short time. However, the later temporal and spatial upsampling stages are applied multiple times to generate just one video, and usually a single inference there takes much longer than the keyframe generation stage. Even worse, even when equipped with multiple GPUs, this upsampling cannot be parallelized but has to be applied sequentially, because each stage takes the previous output as its input. So I personally believe accelerating the video upsampler models is a very practical direction for future work.

Now back to VMC. In this work we propose a video motion customization framework on top of a cascaded pipeline. But what do we mean by customization, and by motion customization? Given a source image as reference, appearance customization in generative models, like DreamBooth, aims to fine-tune image models to generate the subject in diverse contexts; another example is Textual Inversion. Given a reference video, on the other hand, motion customization aims to adapt pre-trained video models to create videos that feature the reference motion concept, instead of an appearance concept, across various visual contexts and scenarios, without learning the appearance of the input video.

We first make the problem easier with a few insights. First, keyframes are almost always enough to determine the motion of a whole long video. Thus, if the reference video is temporally short, we use it as-is, and if it is temporally long, we sample only keyframes and use the resulting sparse video as the reference; otherwise the computation and memory won't fit easily unless you have a super-huge GPU. Moreover, in the cascaded pipeline we can adapt, or fine-tune, only the initial keyframe generation model on the extracted keyframes while freezing the upsampler models. It is also worth noting that the keyframe space is a spatially low-resolution space, which makes the model adaptation even faster, with very low memory usage. The second insight is that the temporal attentions are where the interactions between frames occur and where the robust motion prior of the model is stored. To this end, we propose to adapt only the keyframe generation video diffusion model.
Within this model, we fine-tune only the temporal attention layers. This training design leads to two advantages: first, preservation of the model's inherent capacity for generic synthesis, since most weights are frozen, including all existing spatial weights such as the self-attentions and cross-attentions; and second, significantly lightweight and fast adaptation. More specifically, the fine-tuning requires only 15 GB of GPU memory and finishes in less than four minutes.

However, optimizing the model with the standard diffusion training objective leads to entangled learning of both appearance and motion, as observed in the Tune-A-Video baseline and its follow-up works. As an example, in the video shown here the airplanes in the output look very similar to the sharks in the input video, caused by this entangled learning of appearance and motion, whatever parameter-efficient fine-tuning strategy is employed. To this end, we propose a new objective that distills the motion information of a video. Our intuition is that the residual vectors between consecutive frames carry a lot of information about the motion trajectories. Given an input video v_0, we can sample v_t at any timestep t in closed form using the diffusion forward process. We then define the motion vector as the frame residual at time t, as in Equation 2, and the epsilon residual vector Δε_t is defined similarly, where we empirically set the frame interval c to simply one. The residual vector Δv_t can then be written as in Equation 3, where Δε_t is normally distributed with zero mean and 2I variance, and in this sense Δv_t can be obtained through the diffusion forward kernel, as in Equation 4. In light of this, our goal is to transfer motion information to the temporal attention layers by leveraging the motion vectors. We first estimate the motion vectors using the video denoiser ε_θ: the denoised video estimate v̂_0 can be derived by applying Tweedie's formula, as in Equation 5. Then the denoised motion vector estimate, the red-colored Δv̂_0, can be defined in terms of the noisy motion vector Δv_t and the predicted noise residual Δε_θ, as in Equation 6. Similarly, in Equation 7, the blue-colored ground-truth motion vector Δv_0 can be obtained in terms of Δv_t and Δε_t. Finally, this is our objective: we fine-tune the model by aligning the ground-truth motion vector and its denoised estimate, as in Equation 8, and by using an ℓ2 distance as the aligning function, this is exactly equivalent to matching the ground-truth epsilon residuals and the predicted epsilon residuals. To sum up, we fine-tune the keyframe generation video UNet by aligning the ground-truth noise residuals and the predicted noise residuals (a rough code sketch of this objective appears right after this part).

Once trained, in the inference phase our process begins by computing the inverted latents from the input video through DDIM inversion, and subsequently the inverted latents are fed into the temporally fine-tuned keyframe generation model. The generated keyframes then undergo temporal and spatial upsampling using the unchanged, unmodified upsampler models.

Let me share some results of VMC. We can customize the motion of sharks swimming, as presented in the input video, to various visual contexts. We can customize the motion of pills falling, and we can also customize the motion of a car driving.
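Before more results, here is a rough PyTorch-style sketch of the objective in Equation 8 as described above: matching residuals between consecutive frames of the ground-truth noise and of the predicted noise. The denoiser signature and the diffusers-style scheduler interface are assumptions for illustration; in practice only the temporal-attention parameters would be left trainable.

```python
import torch
import torch.nn.functional as F

def motion_distillation_loss(unet, scheduler, video, text_emb):
    """Sketch of a VMC-style motion distillation loss. `video` is (B, F, C, H, W) keyframes,
    `unet` is assumed to be a video denoiser taking (noisy_video, timestep, text_emb) and
    predicting noise, and `scheduler` is assumed to expose a diffusers-style add_noise."""
    b = video.shape[0]
    t = torch.randint(0, scheduler.config.num_train_timesteps, (b,), device=video.device)

    noise = torch.randn_like(video)                   # epsilon, sampled per frame
    noisy = scheduler.add_noise(video, noise, t)      # forward diffusion q(v_t | v_0)
    pred_noise = unet(noisy, t, text_emb)             # epsilon_theta(v_t, t)

    # Residuals between consecutive frames, with frame interval c = 1 as in the talk.
    gt_residual = noise[:, 1:] - noise[:, :-1]             # ground-truth Delta epsilon_t
    pred_residual = pred_noise[:, 1:] - pred_noise[:, :-1]  # predicted Delta epsilon_theta

    # Aligning the residuals with an l2 distance, equivalent to Equation 8.
    return F.mse_loss(pred_residual, gt_residual)
```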
The driving motion can even be transferred to a horse running. With the rabbit-eating video, again, we can change to various animals, foods, and backgrounds, and we can also customize flying motions such as an airplane. These are some other results. For comparison, we compared VMC against the baselines VideoComposer, Gen-1, Tune-A-Video, and ControlVideo. At the time of this research, this was one of the very first works in video motion customization, so these works were not devised for motion customization and there were no motion customization baselines we could use; they are simply video editing baselines. The others have difficulty deviating from the original shape of the subject in the input video, leading to issues like a shark-shaped airplane.

Before we end the talk on VMC, I would like to say that motion customization has high potential in commercial use cases, because it allows users to explore their creativity with a lot of freedom. For example, Pika Labs recently announced their new tool, Pikaffects, which basically applies customized motion effects to an input image or a user-provided video. If I were to design something like this on top of a foundation video model, I don't think one reference video would be enough; in VMC we use a single video, but for this kind of complex motion customization I think many reference videos are necessary. The training objective does not have to be the VMC objective either, because there are other great concurrent works in motion customization, such as MotionDirector and Diffusion Motion Transfer; I really like both of these works. To give a brief summary of how these other works achieve motion customization: MotionDirector, to learn appearance properties (on the left), fine-tunes spatial layers with an image diffusion loss on a reference image, and then, to learn motion properties, fine-tunes temporal layers with a video diffusion loss on the reference video; for the fine-tuning they optimize LoRA layers instead of the original attention layers, to preserve the original capacity. Diffusion Motion Transfer (on the right) proposes a certain objective and optimizes the noisy latents during the reverse diffusion process, specifically only during the early denoising steps; the objective they use is quite similar to our VMC objective, but they compute it using diffusion features instead of the predicted noise.

Next I would like to talk about DreamMotion, which was presented at ECCV. DreamMotion is the first framework that utilizes score distillation of a text-to-video model for video editing. In this work we first ask why diffusion-based video editing is challenging. Text-driven, diffusion-based video editing presents a unique challenge that is not present in the image editing scenario: reconstructing temporally consistent and complex real-world motion throughout the reverse diffusion process, which starts from Gaussian noise. The denoised video should not only have high fidelity to the target editing text, but also high fidelity to the complex motion of the input video. For example, this video on the top depicts a flamingo walking on the grass but lacks fidelity to the motion of the original video. Specifically, the problem is the ancestral diffusion sampling process that starts from Gaussian noise, or at most from DDIM-inverted noise.
As a result, most other works supplement the reverse diffusion process by injecting attention features, integrating explicit visual hints like depth maps or optical flow, or overfitting the video model's UNet to the input video. We believe all of these expensive efforts can be eliminated if we don't actually have to start from Gaussian noise. In other words, we propose to diverge from the standard ancestral denoising process and instead directly edit the non-noisy, clean input video, and the way we achieve this is score distillation sampling. This figure illustrates the existing usage of score distillation sampling: it usually leverages a pre-trained text-to-image diffusion model to optimize a generator or rendering model so that it meets a given text-based constraint, denoted as y in the figure, and the generator is usually a non-pixel representation such as a NeRF. In DreamMotion, instead, we do not need an external generator or renderer: we parameterize the input video itself and optimize it directly using the gradients of the SDS loss. That being said, the optimization starts from a video that already exhibits real-world motion, without any noise, so we do not need to worry about establishing or generating realistic motion from Gaussian noise. We also use a text-to-video diffusion model instead of a text-to-image model, so that we can exploit the rich motion prior of the video model to obtain temporally consistent gradients. Ideally, the optimization progress will look like this: as you can see, the optimization keeps happening in the non-noisy, clean video space, thanks to score distillation.

To inject the appearance indicated by the target editing text, we extend the delta denoising score mechanism to distill video scores instead of image scores. The upper branch is the reference branch, where we have the input video and a text that corresponds to it, and the bottom branch is the target branch. Initially, we copy and paste the input video as the optimization target, the red-colored box in the figure. For both videos we add exactly the same noise and predict the noise using a frozen text-to-video model, but the difference is that in the reference branch we have a matching text input on the top, while in the target branch on the bottom we have a mismatching text input, the editing text. Statistically, the predicted noise on the top right, ε̂, carries gradients toward the denoising direction only, whereas the predicted noise on the bottom carries two directions, the denoising direction and the editing direction, because the text input does not match the video fed into the UNet. The final gradient used for the optimization is computed simply by subtracting the top prediction from the bottom one, which gives a gradient directed toward changing the lion into a tiger in this figure, and gradually the optimization target is edited to follow the target text. However, this optimization alone raises a problem: score distillation gradients are inherently noisy, and the inaccurate gradients accumulate structural errors throughout the optimization process; for example, you can check the ring areas of the edited videos here.
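Here is a minimal sketch of the two-branch update just described, before the self-similarity correction that comes next. The clean video is the optimization variable, the video model stays frozen, and the gradient is the difference between the two noise predictions; the function names and the scheduler interface are assumptions, not the exact DreamMotion implementation.

```python
import torch

def delta_score_update(video_param, ref_video, src_emb, tgt_emb,
                       unet, scheduler, optimizer, t_min=50, t_max=950):
    """One optimization step in the spirit of delta denoising scores for video:
    `video_param` is an nn.Parameter holding the clean video being edited, `ref_video`
    is the original input video, and `unet` is a frozen text-to-video denoiser."""
    t = torch.randint(t_min, t_max, (1,), device=video_param.device)
    noise = torch.randn_like(video_param)

    with torch.no_grad():                                    # no backprop through the UNet
        noisy_tgt = scheduler.add_noise(video_param, noise, t)
        noisy_ref = scheduler.add_noise(ref_video, noise, t)  # exactly the same noise
        eps_tgt = unet(noisy_tgt, t, tgt_emb)                # denoising + editing direction
        eps_ref = unet(noisy_ref, t, src_emb)                # denoising direction only

    video_param.grad = eps_tgt - eps_ref                     # only the editing direction remains
    optimizer.step()                                         # optimizer built over [video_param]
    optimizer.zero_grad()
```

Because the gradient comes from a subtraction rather than from backpropagation through the denoiser, the whole step runs under no_grad, which is also why score distillation is computationally cheap, a point that comes up again in the Q&A later.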
Yeah, I believe this kind of issue happens because the available text-to-video model did not have a super-strong generative prior or motion prior. To preserve structural integrity, we focus on a self-similarity descriptor, a representation that is robust to local texture patterns yet preserves the overall spatial layout, shape, and perceived semantics of the objects and their surroundings. The effectiveness of self-similarity-based descriptors in capturing structure while ignoring local appearance information has previously been demonstrated in style transfer tasks using deep CNN features and in image editing tasks using Vision Transformer features, and DreamMotion proposes to use self-similarity of deep diffusion features. In our framework, we first extract attention key features from both branches during the denoising step of score distillation, then compute self-similarity maps from the extracted diffusion features of both branches and match the self-similarity maps across both the spatial and temporal dimensions, which we call space-time self-similarity matching in this work (a small code sketch of this matching appears at the end of this part). Thanks to the correction by self-similarity map matching, we can now generate videos without structure and motion errors. These are editing results of DreamMotion, and for the red-boxed edit case in particular we compare with the other baselines on the next slide: our method shows the highest preservation of the motion and the highest fidelity to the target text. This is another example, and another one, and this is another comparison with the baselines.

In addition to DreamMotion, I would like to share one other work, Emu Video Edit by Meta, which was also presented at ECCV this year. If we want a fully supervised model-training solution for video editing, source and target video pairs are necessary, and they are extremely hard to obtain because they simply do not exist in the real world. One can of course use off-the-shelf video editing methods to build such an expensive dataset and do some data curation, or use other approaches like InstructPix2Pix. However, Emu Video Edit argues that distillation from multiple teacher networks could be an alternative solution that does not require any fully supervised video pair dataset. For example, say we have a foundation image-to-image model, an image editing model, on the top, which is relatively easier to get, and also a text-to-video model that shares the same spatial weights with the image editing model but has temporal weights carrying the motion prior. Then distilling the scores of the image-to-image teacher on the top teaches accurate editing capability, while distilling the scores of the text-to-video model on the bottom ensures that the student model keeps generating temporally consistent videos. This is how Emu Video Edit trains a foundational video editing model without any supervised video editing dataset.

Very lastly, I would like to briefly share my latest project, called Track4Gen, which I worked on during my internship at Adobe. This is not about controllable video editing or synthesis; the main focus of this project is introducing dense point correspondence into the video diffusion feature space. Since last year, generative priors in image diffusion models have been shown to have strong 3D implications; notably, pioneering works demonstrated that correspondences can emerge in image diffusion models without any explicit supervision.
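Before continuing with Track4Gen, here is a small sketch of what the space-time self-similarity matching from DreamMotion could look like on extracted attention-key features. Only the idea of comparing self-similarity maps between the two branches comes from the talk; the exact feature shapes and the form of the temporal term are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def spatial_self_similarity(feats):
    """feats: (frames, tokens, dim) attention-key features from the video UNet.
    Cosine self-similarity within each frame captures layout and structure while
    being relatively invariant to local appearance."""
    feats = F.normalize(feats, dim=-1)
    return torch.einsum("fnd,fmd->fnm", feats, feats)       # (frames, tokens, tokens)

def temporal_self_similarity(feats):
    """Per-token similarity across frames, one plausible temporal counterpart."""
    feats = F.normalize(feats, dim=-1)
    return torch.einsum("fnd,gnd->nfg", feats, feats)       # (tokens, frames, frames)

def spacetime_self_sim_loss(ref_keys, tgt_keys):
    """Penalize differences between the self-similarity maps of the reference branch
    and of the optimization-target branch, over space and time."""
    spatial = F.mse_loss(spatial_self_similarity(tgt_keys),
                         spatial_self_similarity(ref_keys))
    temporal = F.mse_loss(temporal_self_similarity(tgt_keys),
                          temporal_self_similarity(ref_keys))
    return spatial + temporal
```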
Those image-diffusion correspondence works typically extract diffusion features at specific timesteps and layers of the UNet, then perform a nearest-neighbor search to establish correspondences. Following this trend, our natural question was: can we use video diffusion models instead for this correspondence task? Since they are trained on videos, they should outperform image priors, which have never been exposed to videos during training. Specifically, unlike existing video correspondence methods that typically extract features from videos frame by frame using image models, we believed that video diffusion models should provide a stronger foundation for finding visual correspondences across frames. However, there isn't a single go-to choice of video diffusion model, unlike the image case where Stable Diffusion dominates, so we analyzed the features of four different video diffusion models, highlighted in blue: Zeroscope (text-to-video), Stable Video Diffusion (image-to-video), I2VGen-XL, and also the recent DiT-based Open-Sora.

Let me jump directly to the conclusion: we discovered that all of these models commonly perform poorly on the dense video point correspondence task, also called video point tracking. For example, here is a direct comparison between Open-Sora features and DINO features for video point tracking of the foreground object of this video. As you can see, DINOv2 performs significantly better than the Open-Sora features. It is also worth noting that Open-Sora was the video model with the most temporally rich correspondence in its features among all the models we investigated, yet it was still poor compared to DINOv2 image features. The next question naturally becomes: can we improve the temporal correspondence of video diffusion features? This question forms the core idea of our Track4Gen paper.

To achieve this, we first needed to understand how foundation video trackers are typically trained. The specifics can get quite complex and are not uniform, but I will give a simplified overview of the very basic loss. Given a query frame t_q, a query point location denoted p_q, and a target frame t, the goal of video tracking is to find the corresponding point in the target frame. First, tracking is done using rich semantic features; for example, trackers use a foundational image model such as a ResNet to encode the frames. Then, using these encoded features F, you extract the feature vector of the query point, q, and you can obtain a cost volume, that is, a similarity map, by taking the dot product between the query point feature vector q and the whole target-frame feature map F_t. Given this cost volume, a simple way to determine the target point in the target frame is an argmax operation over the cost volume, but argmax is not differentiable, meaning you cannot backpropagate a loss through it. So instead you use a differentiable soft argmax: first get the argmax location, which is treated as a constant, then compute a weighted average over the points within a certain radius. Doing so yields the predicted location of the target point, (x̂, ŷ), and the loss is defined as a simple regression between the predicted point coordinates and the ground-truth point coordinates, where the coordinates are usually normalized to the range [-1, 1].
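Here is a minimal sketch of the soft-argmax supervision just described. The radius and temperature values are illustrative, and real trackers add iterative refinement, multi-scale features, and occlusion handling.

```python
import torch
import torch.nn.functional as F

def soft_argmax_track(query_feat, target_feats, coords, radius=8.0, temperature=0.05):
    """query_feat: (dim,) feature at the query point; target_feats: (H*W, dim) target-frame
    features; coords: (H*W, 2) pixel coordinates of those features. Returns a
    differentiable estimate of the corresponding point in the target frame."""
    cost = target_feats @ query_feat                   # cost volume / similarity map, (H*W,)
    center = coords[cost.argmax()]                     # argmax location, treated as constant

    # Soft argmax: softmax-weighted average of coordinates within a radius of the argmax.
    mask = (coords - center).norm(dim=-1) <= radius
    weights = F.softmax(cost[mask] / temperature, dim=0)
    return (weights[:, None] * coords[mask]).sum(dim=0)  # differentiable (x_hat, y_hat)

def tracking_loss(pred_xy, gt_xy, height, width):
    """Regression between predicted and ground-truth coordinates, normalized to [-1, 1]."""
    scale = torch.tensor([width - 1.0, height - 1.0], device=pred_xy.device)
    norm = lambda p: 2.0 * p / scale - 1.0
    return F.l1_loss(norm(pred_xy), norm(gt_xy))
```

Gradients flow through the softmax weights into the features, which is exactly what lets this supervision be back-propagated into a feature extractor, or, as in Track4Gen, into diffusion features.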
Using this tracking back-propagation machinery, we first propose a hypothesis based on findings from the recent literature on image diffusion models. Studies have recently shown that image diffusion models learn discriminative features in their hidden states, and that better diffusion models produce better representations in their hidden states. More specifically, a recent promising work called REPA demonstrated that improving feature representations leads to better generation outcomes in images. We adopt the same philosophy but view it from a temporal perspective: we argue that feature-level temporal consistency directly correlates with pixel-level temporal consistency. More specifically, tracking failures in diffusion features often coincide with appearance drift in the generated video pixels, where by appearance drift I refer to a phenomenon in which visual elements gradually change unnaturally, mutate, or degrade over time, leading to inconsistencies in the objects within the generated videos. In contrast, when videos are generated using features that are temporally consistent enough to enable smooth tracking, the resulting videos are visually consistent. Here is another example: take a closer look at the center part, where the appearance drift happens; where the tracking fails, there is inconsistency in the pixel space, and where there is no inconsistency, the tracking is completely smooth.

This is the Track4Gen training overview. In addition to the standard video diffusion loss, we provide track-level supervision, the video tracking loss used in the video tracking literature, but we compute it on the diffusion features; in our paper we call this the correspondence loss. Regarding where we extract features and perform tracking, we use the third upsampling layer of the UNet backbone; we investigated which layer works best, and this choice is also supported by the literature on correspondence with image diffusion models, because it is the same UNet, and these video models usually start from Stable Diffusion, the image diffusion model. In the figure, the red-colored blocks represent layers optimized by the standard diffusion loss, the green blocks are optimized by the correspondence loss, and blocks colored both red and green are influenced by the joint diffusion and correspondence losses. Since we decided to fine-tune from a pre-trained model instead of training from scratch, we make a small architectural change: instead of directly using the raw diffusion features for correspondence estimation, we propose a trainable refiner module, which is designed to refine the raw features by projecting them into a correspondence-rich feature space; we call these the refined features (a rough sketch of this joint training appears at the end of this part).

The conclusion of this joint training framework is that it leads to improved video generation in terms of appearance consistency, in addition to temporally consistent video diffusion features. We first qualitatively compare against the pre-trained Stable Video Diffusion model, the original model, and this is another example. One could claim that the improvement maybe comes from the dataset we use for the fine-tuning, so we also compare with a fine-tuned Stable Video Diffusion, meaning the same Stable Video Diffusion fine-tuned on the same videos from the dataset as Track4Gen, but without any correspondence supervision, that is, without the correspondence loss on its features.
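As a rough sketch of the joint training described above: the standard diffusion loss is combined with a correspondence loss computed on refined features from a chosen UNet layer. The refiner architecture and the loss weighting below are illustrative assumptions, not the exact Track4Gen design.

```python
import torch.nn as nn

class FeatureRefiner(nn.Module):
    """Lightweight stand-in for the trainable refiner: projects raw diffusion features
    into a space used only for the correspondence loss, leaving the backbone's own
    features untouched for generation."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
                                  nn.Conv2d(dim, dim, 1))

    def forward(self, feats):          # feats: (frames, dim, h, w) from the chosen UNet layer
        return self.proj(feats)

def joint_training_loss(diffusion_loss, raw_features, refiner, correspondence_loss_fn,
                        gt_tracks, lam=0.5):
    """Combine the standard video diffusion loss with a tracking/correspondence loss
    computed on the refined features. `correspondence_loss_fn` stands in for a soft-argmax
    tracking loss over ground-truth tracks; `lam` is an assumed weighting."""
    refined = refiner(raw_features)
    return diffusion_loss + lam * correspondence_loss_fn(refined, gt_tracks)
```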
This is another example, and lastly, this is a comparison with Track4Gen trained without the refiner module, where we show the role of the additional network. For all the comparisons, we report the quantitative evaluations in the paper. We also compared the video point tracking performance against the original Stable Video Diffusion and one additional video diffusion model, Zeroscope text-to-video, as a reference: existing video diffusion models significantly lack video point tracking capability in their features, but Track4Gen enables accurate video point tracking thanks to the additional tracking supervision. This is another qualitative video tracking example, and we also conducted quantitative tracking experiments on several tracking benchmarks. The video point tracking capability is also demonstrated when we investigate the features of the videos generated by Track4Gen: when we analyze their features during the denoising process, specifically extracting features from the last denoising step and performing video tracking with nearest-neighbor search, we can see that the tracks are correctly estimated, showing the temporal consistency of the feature space. This is another example. So this is the end of Track4Gen, and this is everything I prepared for this talk. Thank you for listening, and I would be happy to take any questions or comments.

Thanks. I have a question regarding the first work, Ground-A-Video. It seems there was a place in the gated self-attention where you use a Fourier preprocessing. I'm curious, is that actually a common practice, or did you find it particularly useful to use the Fourier encoding there?

It is just a practice that I didn't validate separately. In this kind of grounding literature for images, you use the Fourier encoding to turn those numbers, the coordinates, into tokens, so I simply followed that practice.

I think it comes from NeRF, how they encode the locations.

Yeah, it actually comes from NeRF. And also, this part is not trained by us; we use the pre-trained weights from that literature.

I see. So in VMC, if I understood correctly, it requires training for each single video in order to edit, right? And does the same apply to DreamMotion?

VMC is a one-shot framework, so for each video we have to fine-tune the backbone model. But for DreamMotion, we basically keep the video model frozen and optimize the input video directly, so in DreamMotion we do not require any model tuning.

Okay, no model tuning in DreamMotion, I see. So among all these kinds of methods, score distillation, motion-vector optimization, or even inversion-based editing methods, which one do you find the most efficient and the least computationally expensive?

Actually, score distillation is quite cheap computationally, because when we compute this loss, for example the SDS loss, the trick, and I think this is the best advantage of score distillation, is that we do not have to, how do I say it, turn on gradients for the UNet.
Just like when we do inference, we do not have to turn on the gradients: you can use no_grad, get these epsilons, and if you simply subtract the epsilons, that directly gives you the gradient you can use to optimize whatever you parameterize. So I think score distillation is a very good choice if you are concerned about efficiency, and it is not really time-consuming either.

I see. Then among those methods, which one do you think would be the best?

For efficiency, I think my first work, Ground-A-Video, is the most efficient in terms of both time and memory, because it is simply zero-shot: you do not optimize or fine-tune any model. But you do need the pre-trained layers in the middle, the gated attentions, and those could be understood as a model-training solution using the right dataset. It is quite hard to obtain those weights, but once you have them, I think that is the most practical approach. And to my knowledge, companies like Adobe or other video companies do prefer model-training approaches over one-shot or latent-optimization approaches.

I see. So, in the current presentation it seems the latest editing works focus on structural changes or background changes. What if we want to work on motion changes, or, say, adding a small object into the scene and assigning a motion to it?

That is a very good question, and I think that is actually the trend, because there are many works, not only mine, doing appearance editing, background editing, and those kinds of appearance edits. Recently people are focusing on a condition called tracks: given a video and a pre-trained video tracker, one can obtain video trajectories, either dense or sparse, and people are using these trajectory conditions with an image-to-video model. So you take an input image and some tracks, maybe obtained from another reference video, and then use the image-to-video backbone together with these track conditions to animate a static image with your intended motion.

Yeah, like the Tora work, right, conditioning on the trajectories?

Yes.

I also have one final question, about personalization. I think Pika Labs recently released Pika 2.0, and they have a new feature where they are able to add "ingredients" into the video. I was actually very impressed by that. They said they are not doing this with LoRA, and it seems they are able to inject any kind of object, in any number, into the video generation.

Actually, I think I have seen the demo of that, but I have not used it. So by injecting ingredients, it is like there is an input video, for example a man sitting, and then you can input something like a yellow hat, and the man will do the same thing but with the hat, right?
Something like that, right?

I think their current feature is only for video generation, so no video editing yet, but I think it would be a good idea to explore in this direction, editing where you get control over which subject to add. In the image domain we already have some work like this, like DreamEdit from our lab, where we focused on images and added certain subjects into them. I think it would be very cool if we could also look into this kind of method.

I see. For DreamEdit, do you use some curated dataset, or something like that?

For DreamEdit, because that was an early exploration work, we were relying on DreamBooth models to do this kind of thing.

Yeah, that is a cool work. But for Pika, I think they probably made their own dataset, with the ingredients as input, because they had to turn it into a product with reasonable latency, so they probably built that kind of paired video dataset.

Okay, I see. I am going to pass it to the others. Do you guys have questions? Okay, maybe not. Let me stop the recording first.
