Where is WinFS Now - 2008 Microsoft Podcast


JU: You made a fascinating remark last time we spoke, which was that most of WinFS either already has shipped, or will ship. I think that would surprise a lot of people, and I'd like to hear more about what you meant by that.

QC: WinFS was about a lot of things. In part it was about trying to create something for the Windows platform and ecosystem around shared data between applications. Let's set that aside, because that part's not shipping.

JU: So you mean schemas that would define contacts, and other kinds of shared entities?

QC: Yeah. That's a mechanism, a technology required for that shared data platform. Now the notion of having that shared data platform as part of Windows isn't something we're delivering on this turn of the crank. We may choose to do that sometime in the future, based on the technology we're finishing up here, in SQL, but it's not on the immediate roadmap.

JU: OK.

QC: Now let's look under the covers, and ask what was required to deliver on that goal. It's about schemas, it's about integrated storage, it's about object/relational, a bunch of things. And that's the layer you can look at and say, OK, the WinFS project, which went from ... well, it depends who you ask, but I think it went from 2002 until we shut it down in 2006 ... what was the technology that was being built for that effort, in order to meet those goals? And what happened to all that stuff?

You can catalog that stuff, and look at work that we're doing now for SQL Server 2008, or ADO.NET, or VS 2008 SP1, and trace its lineage back to WinFS.

JU: Let's do that.

QC: OK. I guess we can start at the top, with schemas. We're not doing anything with schemas. At the end of the WinFS project we had settled on a set of schemas. It was a very typical computer science problem, where the schemas started out as a super-small set of things, and then became the inclusion of all possible angles, properties, and interests of anybody interested in that topic whatsoever. We wound up with a contact schema with 200 or 300 properties. Then by the time we shipped the WinFS beta we were back down to that super-small subset. Here's the 10 things about people that you need to know in common across applications.

But all that stuff is gone. The schemas, and a layer that we internally referred to as base, which was about the enforcement of the schemas, all that stuff we've put on the shelf. Because we didn't need it. It was for that particular application of all this other technology. So that's the one piece that didn't go anywhere.

Next layer down is the APIs. The WinFS APIs were a precursor to a more generalized set of object/relational APIs, which is now shipping as what we call entity framework in ADO.NET.

What's getting delivered as part of VS 2008 SP1 is an expression of that, which allows you to describe your business objects in an abstract way, using a fairly generalized entity/relationship model. In fact we got best paper at SIGMOD last year on the model, it's a very good piece of work. So you describe your business entities in that way, with a particular formal language...
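
As a rough illustration of describing entities abstractly, the Python sketch below declares two entities and one relationship, then mechanically derives a relational schema from that description. It is not the Entity Framework's actual modeling language or API: `Entity`, `Relationship`, `derive_schema`, and the `Customer`/`PurchaseOrder` model are hypothetical names invented for this example, and SQLite stands in for the database.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    properties: dict  # property name -> SQL column type


@dataclass
class Relationship:
    name: str
    many: str  # entity on the "many" side
    one: str   # entity on the "one" side


def derive_schema(entities, relationships):
    """Derive one table per entity, plus a foreign key per relationship."""
    statements = []
    for entity in entities:
        columns = ["id INTEGER PRIMARY KEY"]
        columns += [f"{prop} {sql_type}" for prop, sql_type in entity.properties.items()]
        for rel in relationships:
            if rel.many == entity.name:
                columns.append(f"{rel.name}_id INTEGER REFERENCES {rel.one}(id)")
        statements.append(f"CREATE TABLE {entity.name} ({', '.join(columns)})")
    return statements


# Hypothetical model: the entities as a developer might think about them.
MODEL_ENTITIES = [
    Entity("Customer", {"name": "TEXT", "email": "TEXT"}),
    Entity("PurchaseOrder", {"total": "REAL"}),
]
MODEL_RELATIONSHIPS = [Relationship("customer", many="PurchaseOrder", one="Customer")]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    for statement in derive_schema(MODEL_ENTITIES, MODEL_RELATIONSHIPS):
        print(statement)
        conn.execute(statement)  # the derived DDL is valid SQLite
```

The point of the sketch is only the direction of derivation: the description comes first, and the storage layout is generated from it rather than written by hand.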

JU: For people who haven't seen this, how would you characterize that language?

QC: It's pretty standard entity-relational. It's really a matter of describing to the system a set of properties and collections and relationships among entities. The important thing we tell people is to describe their entities as they think about them. Not as they think they should be expressed in a fully normalized database schema, and not as they need to program to them as objects, but in terms of how they think about them, and want to be able to report on them, or interact with them. From there we can derive objects you can program against, we can derive schemas to build a store of them.

The traceback to WinFS is that we had a very fixed way of doing this for a particular set of entities. We built the schema around items, and items were entities that had relationships to other items. We built this whole model on a more generic substrate that we never expressed.

So we said OK, we didn't ship the WinFS APIs, but we have this asset, a more generalized expression framework for entities, let's figure out how to finish that work up, and get that delivered as part of the next ADO release.

This stuff is now very well integrated with LINQ. You can do LINQ to relational, where LINQ will look down into the database, look at the schemas that are there, and express that directly up into LINQ. Or you can do LINQ to entities, which allows you to have a layer of abstraction between what you're programming to and your underlying physical database schema.

That work is ongoing, we're getting good feedback, we'll see how far it takes us.

JU: How much continuity is there in terms of the team?

QC: A lot. When I did the reorg, I had an Excel sheet of everyone in the organization and where we were moving them to. Last I looked at it, 80-plus percent of the team was still in SQL somewhere.

One of the interesting things about WinFS was that we started hiring a different kind of person. The database team is full of traditional hardcore systems database guys. When we did WinFS we were looking for a different thing.

JU: In fact you don't consider yourself to be a hardcore database guy, right?

QC: Right. I'm a good example. I started at Microsoft in the Word group, and went from there to IIS to something called Application Center, worked on the manageability technologies for a while, and then was asked to come over and do WinFS. So my background was much more about how to use databases, how do you build apps around them, and not so much what are the internal algorithms you should use for bitmap indexing.

Of course we had a lot of folks from the core database team, but we hired a lot of folks that had experience with compilers, with user interfaces, with building apps on the database.

A lot of those folks who were leading the API effort for WinFS are now leading the API effort for all of SQL. So that's the story for the API team.

As for the rest of it, well, there's obviously a big chunk around file systems. If you want to do this shared data model, you want it to be applicable to all data, not just things you can express relationally. So we had to figure out how to merge database constructs with file systems. A lot of people thought this was impossible, and would harken back to Cairo and various other projects announced and unannounced to the public world around integrated storage, that didn't necessarily produce fruit.

We had one key advantage. We found an architectural approach that allowed us to control the semantics, and provide transactional database consistency over the files that were involved, while still allowing the file system to be in control when it came to file-handle-level operations.

We did it with a kernel driver that allowed us to control the namespace, and keep the database involved. The database lives up in user mode. As far as the operating system is concerned, there's no difference between SQL Server and Microsoft Word. They're high-level user-mode apps that occasionally drop down and make requests of the kernel.

So there was a fundamental disconnect. How do we maintain control over this low-level system concept, the file system, by a user-mode app? We built a kernel-level driver to communicate back to the user-mode SQL process. It had a cache of what things should look like, and what things are in what state, but it was there along the API path for the file system, to allow it to control the namespace operations over files that were "in" WinFS.

People would often ask me if WinFS was a file system, and I'd struggle with the answer to that, because, well, you know, from a certain standpoint the answer is yes. The stuff I saw in the shell, was it in the WinFS filesystem? Well, OK. But there are no streams inside the database. So from a user perspective, those files were "in" the filesystem. But from an API perspective it was more nuanced than that. I could still use the Win32 APIs, get some file, open it, and from that point forward the semantics were exactly like NTFS. Because it was NTFS at that point. There was a certain place along the API chain where the database was completely out of the way.

This allowed us to get the perfect compatibility that had tripped up other integrated storage efforts in the past. Other efforts tried to get this compatibility by emulating all the Win32 APIs, which is tough. And the performance bar is very high.

JU: So how does this carry forward, if it does?

QC: It does. That approach was so good that we decided to generalize it for SQL Server 2008, as a feature called filestream. It's basically a new kind of blob support for the database. You configure a column for filestream, you can take a file and insert it as a record, you get back a file handle, you can stream things into that file handle. You can do queries and get back file handles, and get streaming API-level NTFS performance on the files you put in there.

What we have not done is the namespace support. So you don't get to walk through a directory of files. You examine a row, you ask that row to give you back the right token, you start doing the Win32 operations on it.

But the rest is integrated. You back up the database, you back up the filestream. From most perspectives -- except mirroring, which we didn't get to fully integrating -- it looks like any other blob.
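
The row-then-token-then-file-handle flow described above can be mimicked in a few lines of Python. This is a toy emulation under loud assumptions, not SQL Server's filestream API: `ToyFileStore`, the `documents` table, and the token scheme are hypothetical, SQLite stands in for the database, and the sketch provides none of the transactional or backup integration that the real feature is described as having. It only shows the shape of the interaction: metadata lives in a row, the payload is read and written through an ordinary file handle.

```python
import os
import sqlite3
import tempfile
import uuid


class ToyFileStore:
    """Rows live in SQLite; payloads live as ordinary files on disk."""

    def __init__(self, directory):
        self.directory = directory
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, token TEXT)"
        )

    def insert(self, title):
        """Insert a record and return (row id, writable file handle)."""
        token = uuid.uuid4().hex  # opaque name for the backing file
        cursor = self.db.execute(
            "INSERT INTO documents (title, token) VALUES (?, ?)", (title, token)
        )
        self.db.commit()
        handle = open(os.path.join(self.directory, token), "wb")
        return cursor.lastrowid, handle

    def open(self, row_id):
        """Look up a row's token and return a readable file handle."""
        row = self.db.execute(
            "SELECT token FROM documents WHERE id = ?", (row_id,)
        ).fetchone()
        if row is None:
            raise KeyError(row_id)
        return open(os.path.join(self.directory, row[0]), "rb")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        store = ToyFileStore(directory)
        row_id, writer = store.insert("expense report")
        with writer:
            writer.write(b"payload streamed through a plain file handle")
        with store.open(row_id) as reader:
            print(reader.read())
```

Note what the toy leaves out: if the process dies between the `INSERT` and the file write, the row and the file disagree. Closing that gap is exactly the kind of consistency the interview attributes to the real feature.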

JU: Where do you see that being used to good effect?

QC: Right now there's a choice people have to make. There's a size limit on blobs in the database, because we put them inside database pages, and that leads to a performance problem as well. If you want to pull a 2-gigabyte stream out of the database with traditional blobs, it's not as performant as walking up to NTFS and using a file handle.

We have to recreate the file by putting together a series of database pages that are themselves a level of indirection on file system pages.

So people today have to make a choice. Do I want the integration with the database, so backup works, my transactional semantics work, all this stuff works, and live with the performance and size limitations? Or do I want the best possible performance, and basically no limitations on size, by putting things in the file system, and then having my application logic figure out how to glue together the database world and these files that are now strewn about the file system? And when I do a backup, then I also have to teach my operations guys that when you back up the database you're not backing up all the data, you also have to worry about these files the database knows nothing about.

With filestream, people don't have to make the choice. They get the performance they want, with the database integration they expect.

Now the next place to take that, after 2008, is to add Win32 support. So we did this other feature as part of WinFS, which we're calling hierarchical ID. It's a column type, a new column type, which creates hierarchy support in the database.

We did this for WinFS because obviously if you're storing your data in a filesystem-like hierarchy, you need to be able to do things like show me all the stuff in this folder, and answer that query lickety-split. You can't be walking through record by record looking for matches.

JU: Or dealing with the SQL way of expressing hierarchy, which is doable but beyond my comprehension.

QC: Yeah, it's hard. The fundamental problem is that the query processor doesn't understand the concept of path. It understands matches on columns. It can find substrings within records, but it's kind of brute force. You can use fulltext indexing, but...

JU: ... but you don't get containment for free.

QC: That's right. So hierarchical ID is a column type that teaches the optimizer about hierarchy, about path, so you can do queries that find all the things contained within this part of the path.

So we have that feature also shipping in 2008, and there are all sorts of different uses for it. For example, people use it for compliance. They'll create a hierarchy of different confidentialities and compliance levels. This thing is confidential, which is a superset of things that are executive-eyes-only. Hierarchies like that are just out there in the world.

JU: How do you build and visualize them?

QC: You tell us about them. You express the form of your hierarchy, and you populate the records accordingly. But I don't think there's a tool yet.

JU: So there's the filestream piece, and the hierarchical ID piece, and then the Win32 namespace piece is the shoe that hasn't yet dropped?

QC: That's right. In the next release we anticipate putting those two things together, the filesystem piece and the hierarchical ID piece, into a supported namespace. So you'll be able to type //machinename/sharename, up pops an Explorer window, drag and drop a file into it, go back to the database, type SELECT *, and suddenly a record appears.

Potential uses for that? It's all over the place. Take our own expense reports. We used to have these Excel form templates, and you'd fill it out and submit it to some system. Then we hit a phase where it was all online, so you're on the plane home and too bad for you. But imagine they could reintroduce that template again, and you could save that Excel file directly into the database.

Or more importantly, if you go to edit the thing, you don't have this process where you've taken a copy of the thing, you're editing it, you're sending it back through a mid-tier system that then has to reconcile the database records with the filesystem records. I can just say, oh, I need to add three more things. I double-click, and yes I'm still interacting with some web-based app, but the links I get are real Win32 links. I open the thing, I edit it, I stick it back, everything knows that it was changed within the right transactional semantics.

People are constantly having to bridge between the file world, and the world of data around the files. Providing Win32 support gives developers the opportunity to allow the desktop clients to directly interact with a file that's part of some application, without having to go through all the semantics of the mid-tier.

Are there always going to be some applications that will want to have mid-tier control over every aspect of every part of every workflow? Of course.

But from a productivity standpoint, to be able to allow people to build applications more quickly, to be able to customize applications and not have to manage all those semantics themselves, that's huge.

Sync is another topic, but imagine we build the right things around synchronization, so people can take the files offline. It's a major productivity gain. As a developer, you know the consistency of the world you're dealing with. You're not having to create and manage and upload and deal with copying all on your own.

JU: You've alluded to the downside already, which is that it now becomes a new data management discipline that is neither familiar to the people from the filesystem world nor from the database world, it's a hybrid, and that's an obstacle.

QC: Sure, there's a learning curve, as with any other new technology.

So, that's the filesystem piece, and I'm really proud of the work we've done there. We're introducing the kernel driver in 2008, we're giving people this nice marriage between the two worlds, and then we get to take that next step in the next release and give people the complete picture.

I can live with the argument that we don't have integrated storage yet. Yes, we have filestream blobs in the database, which is a big step. We have the performance and the database consistency all in one package, and that's a huge step forward. But when we have Win32, at that point, unarguably, we have integrated storage.

JU: How do you think that plays out as the center of gravity shifts toward the cloud?

QC: There is no app in the world that doesn't need a database. Every cloud app has one under the covers somewhere. One thing we've learned in the last few years is that the fuzziness between structured data and unstructured data is just increasing. The major online apps that I interact with have both. You know, Hotmail has attachments. And they have limitations on attachments because they have trouble managing sizes and whatever else.

We have things now where people can create some space, put some files up there, but man, if you want any metadata around those files, too bad, it's just a dumb blob store.

JU: What I'm getting to here is that, well, part of the challenge for WinFS as originally conceived, with a heavy client component, was: How do you get the network effects? Five years later the center of gravity has shifted, there are shared spaces in the cloud where those effects can happen.

QC: Yes. And I think the technology we're building is underlying technology for the cloud apps. All of our major properties are built on SQL, and they want to use this stuff, we have work going on there, pre-release work to take advantage of these features, because they want them.

From a business standpoint, my first concern is how to provide value to our customers. And those are our customers. The people building the cloud apps are our customers.

Now, beyond that, one of the things we used to say about WinFS was that it was the world's best mashup playground, because you had all the data in one place. In the mashup world you're talking to one service at a time.

Do I think that the opportunity to build applications that solve real end-user problems building on technology like this continues to thrive? Sure.

When I think about the enterprise space, which is primarily where we sell SQL, they want this. They want a repository, and they want it not to be restricted on the types of data it has.

You'd be surprised, SQL's behind some of the biggest cloud services on the planet. And our customers who are building them have been struggling with this structured-versus-unstructured data problem. Filestream alone gives them the answer. They don't so much need the Win32 aspect, because they have enough app development expertise in the mid-tier to bridge this stuff reasonably well. But they do want the transactional and backup consistencies that filestream gives them.

JU: Is that ultimate mashup playground also a good environment in which to iteratively work out what some key schemas need to be?

QC: Yeah, that leads to another interesting point.

Going through the litany of technologies that have come from WinFS, one of them is the notion of what I refer to as semi-structured records. The schema is not necessarily all that well defined at the outset of the application. How does the database handle that?

We had built WinFS around a feature called UDTs, which is a column type -- a CLR type system type. We finished that up, and we built a whole spatial datatype on it in SQL Server 2008, it's all good stuff.

But when we stepped back and looked at the semi-structured data problem in a larger context, beyond the WinFS requirements, we saw the need to extend the top-level SQL type system in that way. Not just UDTs, but to have arbitrary extensibility.

So we did this feature in SQL Server 2008 that we internally refer to as sparse columns. It's a combination of various things. First, a large number of columns. Right now there's a 1024 limit on the number of columns in a single SQL table. We're way widening that out.

That comes of course with the ability to store data that's very sparsely populated across a large number of columns. In SQL Server 2005 we actually allocate space for every column in every row, whether it's filled or not.

JU: This is what the semantic web folks are interested in, right? Having attributes scattered through a sparse matrix?

QC: That's right. And that leads to another thing which we call column groups, which allow you to clump a few of them together and say, that's a thing, I'm going to put a moniker on that and treat it as an equivalence class in some dimension.

Then we have something called filter indices, where instead of creating an index that spans all the records in a table, you can specify what records it applies to.

JU: When it's really cheap to make lots of those equivalences, you get the ability to let people call things however they want to call them. There can be lots of aliases and labels floating around, and people can have their own vocabularies. You don't have to be so rigid about names. As you discover equivalences, you map them, and that's very efficient. Versus trying to get people in committees to agree how to call things, that's the hardest problem in the world. But if you can let people operate in their own semantic namespaces, and then bridge things together...

QC: And that gets back to why the entity data model is so important. It lets people have their own way of describing, programming to, and interacting with the data they want to deal with.

JU: Now what about relationships? In WinFS, a relationship among entities was a first-class object. How does that carry forward?

QC: The notion of a relationship is a first-class object in the entity data model. Now what we haven't done there is bridged an understanding of that into the database itself. Can the query processor understand a relationship, and be optimal for navigating through those semantics? We haven't bridged that part of the world yet. It's certainly possible to create database schemas that allow you to have good query efficiency through your entity model, but it's still intellectual work. We'd like it to be so that the database can look at an EDM schema and create at least the appropriate indices so when you are examining things through that lens, we can make sure your experience is optimal.

QC: Finally there's synchronization. It went through a classic computer-science learning curve as well. At first we said, we need to sync with the cloud, with other WinFS instances, with server systems, how hard can this be?

Then we quickly realized how hard this was. What should be more infamous than people breaking their pick on integrated storage is people breaking their pick on multimaster replication. It's an incredibly difficult problem to get right.

Apps that have gotten this right for a particular domain have become wildly popular. Lotus Notes got it right for a particular domain, so did Exchange and Outlook, but a generalized solution has been very elusive.

Anyway, we did a partnership with Microsoft Research, and at some point along the arc we solved it fairly well. It's not trivial. This is not something that ends up being a simple solution to this very complex problem. It's actually reasonably sophisticated, but it works, and we built it in as part of the last WinFS beta.

As they realized they were onto something, they started to fork out a componentized version of it that's now finding its way into a bunch of Microsoft products. The official branding is Microsoft Sync Framework. I think they're on target for shipping it in six different products, and for embedding it all over the place.

Building an app like Outlook, from scratch, is hard. You can always interact with your data, when you're connected the thing will always synchronize and reconcile, when it's offline it still provides a consistent experience. To build that from scratch, it's really hard. Taking the sync framework allows people to go and build that experience without having to solve the hard multimaster synchronization problems.
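
To give a feel for why multimaster synchronization is hard, here is a minimal Python sketch of one well-known building block, version vectors, used to tell "one copy is simply newer" apart from "both copies changed independently". This is a textbook technique chosen for illustration, not a description of how the WinFS sync work or the Microsoft Sync Framework is implemented; `compare`, `Replica`, and the replica names are hypothetical.

```python
def compare(a, b):
    """Compare two version vectors (dicts of replica name -> update counter)."""
    replicas = set(a) | set(b)
    a_ahead = any(a.get(r, 0) > b.get(r, 0) for r in replicas)
    b_ahead = any(b.get(r, 0) > a.get(r, 0) for r in replicas)
    if a_ahead and b_ahead:
        return "conflict"  # each side has updates the other has not seen
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


class Replica:
    """One master copy of a single value, tracked with a version vector."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.version = {}

    def write(self, value):
        # A local write bumps only this replica's own counter.
        self.value = value
        self.version[self.name] = self.version.get(self.name, 0) + 1

    def pull_from(self, other):
        # Adopt the other copy only when it strictly supersedes ours.
        outcome = compare(self.version, other.version)
        if outcome == "b_newer":
            self.value = other.value
            self.version = dict(other.version)
        return outcome


if __name__ == "__main__":
    laptop, server = Replica("laptop"), Replica("server")
    laptop.write("draft 1")
    print(server.pull_from(laptop))  # server has seen nothing yet, so it adopts the draft
    laptop.write("draft 2")          # edited offline
    server.write("server edit")      # edited concurrently on the server
    print(laptop.pull_from(server))  # neither side supersedes the other
```

Detecting the conflict is the easy part. Deciding how to reconcile "draft 2" with "server edit" for arbitrary data types is where a generalized solution gets difficult, and this sketch deliberately stops short of it.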

QC: Finally, we'd done a bunch of work to keep the SQL engine tamed and behaving properly on the desktop. Some of that has found its way into SQL Server 2008 and some has not, because there's a less pressing need for it. But for departments, and for SQL Server Express on the desktop, we still want to finish that.

JU: So to wrap up, I'd like you to reflect on how the original environment for WinFS was the end-user desktop, but now the environment in which many of these technologies have come to fruition is the enterprise datacenter and backoffice. How do these worlds yet come together?

QC: I was very happy to be able to take the technology forward, because I saw the broad applicability, not just in the problem space we were working on, but in terms of the general usefulness of the database.

My job is to grow the usefulness of the database. The work we did with WinFS was in line with that, and I'm happy with that, but there's a part of me which is still unfulfilled. Boy, what would it mean if every application could have some shared notions about, for example, the people in my life, that other applications could plug into and use.

Can we express that fully in a cloud way? Maybe. It harkens back to the old Hailstorm ideas. And we have things like Astoria [ADO.NET Data Services] that is a projection of entities over the web. That's awfully familiar, both in terms of WinFS and in terms of Hailstorm.

Where it goes, I don't know. We've made a choice right now to incubate some underlying platform technologies for the web, and allow the operating system team to cycle on the stuff that's on their plates right now.

But I think not too long from now we'll come out of those cycles and say, OK, we have all this fundamental technology, what's the next big innovation we can do?

That's kind of where we got tripped up in the Longhorn cycle. We were building too much of the house at once. We had guys working on the roof while we were still pouring concrete for the foundation.

At one point we realized we needed to decouple things. And that really did give this team the freedom to go off and take these underlying technologies, which we believe were fundamental to the database, and get them done correctly.

But I do at some point want to see that place in my heart fulfilled around the shared data ecosystem for users, because I believe the power of that is enormous.

I think we'll get there. But for now we'll let the concrete dry, and get the framing in place, and then we'll see how the rest of the house shapes up.

2021-01-06
