
Data Launched

1 Jul

In case anyone missed it, I launched my new venture on Tuesday of last week. I intend to continue writing posts on PaaS, IaaS, and Big Data topics and theory here while pursuing Data Gravity and Data Physics oriented efforts there.

More interesting updates soon!


Artificial Data Gravity

20 Feb

Having covered Data Gravity several times on this blog, I thought it was time to cover a derivative topic: Artificial Data Gravity.

Recall that Data Gravity is the attractive force created as Data amasses and as Apps and Services seek low-latency, high-bandwidth access to it.

Artificial Data Gravity is the creation of attractive forces through indirect or outside influence.  This could take forms such as Costs, Throttling, Specialization, Legislation, or Usage.  Below I will walk through examples of Public Clouds creating, exerting, and leveraging Artificial Data Gravity.

Costs : AWS S3’s free unlimited inbound transfer (Windows Azure offers the same) is a great example of artificially encouraging Data to amass internally.  Making it free to put more Data inside of S3 or Azure encourages Data Gravity patterns through artificial means.
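To make the asymmetry concrete, here is a rough sketch (the rates are made-up placeholders, not actual AWS or Azure pricing):

    # Illustrative only: hypothetical per-GB rates, not real cloud pricing.
    ingress_per_gb = 0.00    # inbound transfer is free, so data flows in easily
    egress_per_gb = 0.09     # outbound transfer is charged (placeholder rate)

    dataset_gb = 10_000      # 10 TB amassed inside the cloud

    print(f"Cost to move 10 TB in:  ${dataset_gb * ingress_per_gb:,.2f}")
    print(f"Cost to move 10 TB out: ${dataset_gb * egress_per_gb:,.2f}")

The in-bound trip is free and the out-bound trip is not, and that one-way pricing is exactly what makes the Data amass.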

Throttling : The Twitter API is a great example, with its well-known limit of 350 calls per hour.  This makes it nearly impossible to replicate the traffic on Twitter without special (and very expensive) agreements in place.
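A quick back-of-envelope shows why (only the 350 calls per hour figure comes from the post; the page size and corpus size are hypothetical):

    calls_per_hour = 350          # the documented rate limit
    items_per_call = 200          # hypothetical page size per call
    total_items = 500_000_000     # hypothetical size of the corpus

    hours = total_items / items_per_call / calls_per_hour
    print(f"{hours:,.0f} hours, or about {hours / 24 / 365:.1f} years")

At that rate, the copy would take the better part of a year, and the source keeps growing the whole time.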

Specialization : Specialized services such as DynamoDB not only encourage Data Gravity through transfer pricing, but also encourage low writes and high reads through pricing built around a 1:5 write-to-read cost ratio.  Not only are you unlikely to ever leave DynamoDB, you are also encouraged to make your code as write-efficient as possible due to costs.
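To see why the ratio pushes you toward write-efficiency, a hedged bit of arithmetic (all figures are hypothetical, not actual DynamoDB pricing):

    # Illustrative only: suppose writes cost five times what reads do
    # (the 1:5 ratio mentioned above); real pricing is by capacity units.
    read_cost_per_million = 0.25              # hypothetical dollars
    write_cost_per_million = read_cost_per_million * 5

    reads_m, writes_m = 500, 100              # hypothetical monthly traffic, in millions
    bill = reads_m * read_cost_per_million + writes_m * write_cost_per_million
    print(f"${bill:,.2f}/month, writes are {writes_m * write_cost_per_million / bill:.0%} of it")

Writes are a fraction of the traffic here but half the bill, so every write you avoid pays for five reads’ worth of capacity.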

Legislative : There are many laws that restrict the location of Data and govern its security and use.  These are not technical or physics-related, but artificial means of influencing Data Gravity, as mentioned in this GigaOM piece covering law dictating Data Gravity.

Usage : Dropbox charges each individual user for use of shared Data (artificial Usage).  Each person pays for the Data consuming their storage quota; however, Dropbox is only storing a single copy and pointing all authorized users to it.
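A minimal sketch of that single-copy pattern (purely illustrative, not Dropbox’s actual design): files are stored once, keyed by content hash, while every authorized user is billed as if they held their own copy.

    import hashlib

    blobs = {}        # content hash -> bytes, stored exactly once
    user_files = {}   # user -> {filename: content hash}

    def save(user, filename, data):
        digest = hashlib.sha256(data).hexdigest()
        blobs.setdefault(digest, data)                 # stored only if new
        user_files.setdefault(user, {})[filename] = digest

    def billed_bytes(user):
        # each user pays as if the copy were theirs alone
        return sum(len(blobs[d]) for d in user_files.get(user, {}).values())

    save("alice", "deck.pdf", b"shared contents")
    save("bob", "deck.pdf", b"shared contents")        # no new storage consumed
    print(len(blobs), billed_bytes("alice"), billed_bytes("bob"))  # 1 15 15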

There are certainly other forms of Artificial Data Gravity not listed in the examples above.  If you can think of a concrete example, please comment.

One last note : I’m not saying there is anything particularly wrong with Artificial Data Gravity; however, it is something to be aware of, as it is one of the behaviors/motivations exhibited by Data Gravity as a whole.

Defying Data Gravity

2 Apr

How to Defy Data Gravity

Since changing companies, I have been incredibly busy as of late, and my blog has had the appearance of neglect.  At a minimum, I had been trying to do a post or two per week, and the tempo will soon change to move back closer to this.

As a first taste of what will be coming in a couple of weeks I thought I would talk a bit about something I have been thinking a great deal about.

Is it possible to defy Data Gravity?

First a quick review of Data Gravity:

Data Gravity is a theory in which data is treated as if it has mass.  As data (mass) accumulates, it begins to exert gravity.  This Data Gravity pulls services and applications closer to the data.  The attraction (gravitational force) is caused by the need for services and applications to have higher-bandwidth and/or lower-latency access to the data.
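One loose way to picture the pull (my illustrative shorthand only, not an established formula) is to borrow the shape of Newton’s law of gravitation, with effective latency standing in for distance:

    F \propto \frac{m_{\text{data}} \cdot m_{\text{app}}}{L^{2}}

where m_data is the accumulated Data Mass, m_app is how heavily the application leans on that data, and L is the effective latency between them. In this analogy, halving the latency quadruples the pull.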

Defying Data Gravity, how?

After considering how this might be possible, I believe that the following strategies/approaches could make it feasible to come close to Defying Data Gravity.

All of the bullets below could be leveraged to assist in defying Data Gravity, but each has both pros and cons.  The strengths of some of the patterns and technologies can be the weaknesses of others, which is why they are often combined in highly available and scalable solutions.

All of the patterns below provide an abstraction or transformation of some type to either the data or the network:

  • Load Balancing : Abstracts Clients from Services, Systems, and Networks from each other
  • CDNs : Abstract Data from its root source to Network Edges
  • Queueing (Messaging or otherwise) : Abstracts System and Network Latency
  • Proxying : Abstracts Systems from Services (and vice versa)
  • Caching : Abstracts Data Latency (see the sketch after this list)
  • Replication : Abstracts the Single Source of Data (Multiplies the Data, i.e. Geo-Replication or Clustering)
  • Statelessness : Abstracts Logic from Data Dependencies
  • Sessionless : Abstracts the Client
  • Compression (Data/Indexing/MapReduce) : Abstracts (Reduces) the Data Size
  • Eventual Consistency : Abstracts Transactional Consistency (Reduces chances of running into Speed of Light problems, i.e. Locking)
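To make one of these concrete, here is a minimal read-through cache (a sketch; fetch_remote stands in for whatever slow, distant data source the app depends on):

    import time

    cache = {}

    def fetch_remote(key):
        time.sleep(0.08)          # stand-in for ~80 ms of WAN latency
        return f"value-for-{key}"

    def read(key):
        if key not in cache:      # only cold reads pay the latency cost
            cache[key] = fetch_remote(key)
        return cache[key]

    read("user:42")   # slow: travels all the way to the Data Mass
    read("user:42")   # fast: served from the local copy

The cache fakes the data’s location: hot reads never feel the distance to the real Data Mass, which is precisely the kind of abstraction that loosens gravity’s grip.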

So to make this work, we have to fake the location and presence of the data so that our services and applications appear to have all of the data locally beneath them.  While this isn’t a perfect answer, it does give us the ability to move less of the data around and still get reasonable performance.  Using the above patterns allows for the movement of an Application, and potentially the services and data it relies on, from one place to another, potentially having the effect of Defying Data Gravity.  It is important to realize that the stronger the gravitational pull and the Service Energy around the data, the less effective any of these methods will be.

Why is Defying Data Gravity so hard?

The speed of light is the answer.  You can only shuffle data around so quickly; even on the fastest networks, you are still bound by distance, bandwidth, and latency.  All of these are bound by time, which brings us back to the speed of light.  You can only transfer so much data across the distance of your network so quickly (in a perfect world, the speed of light becomes the limitation).
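A back-of-envelope calculation makes the floor visible (the distance and fiber factor below are rough illustrations, not measurements):

    SPEED_OF_LIGHT_KM_S = 299_792
    FIBER_KM_S = SPEED_OF_LIGHT_KM_S * 2 / 3   # light is ~1/3 slower in fiber

    distance_km = 4_000                        # roughly coast-to-coast
    round_trip_ms = 2 * distance_km / FIBER_KM_S * 1_000
    print(f"Physical floor on round trip: {round_trip_ms:.0f} ms")

That is about 40 ms of round trip before a single switch, queue, or server has added anything, and no amount of money removes it.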

The many methods explained here are simply a pathway to portability.  Without standard services, platforms, and the like, even with these patterns it becomes impossible to move an Application, Service, or Workload outside of the boundaries of its present location.

A Final Note…

There are two ways to truly Defy Data Gravity (neither of which is very practical):

  • Store all of your Data locally with each user, making each user responsible for their own Data.
  • If you want to move, be willing to accept downtime (this could be minutes to months): simply store off all of your data and ship it somewhere else.  This method works no matter how large the data set, as long as you don’t care about being down.

Service Energy – Complementing Data Gravity

2 Feb

It has been a while since I posted anything referring to Data Gravity.  While Data Gravity is interesting and can explain many motivations of Cloud Companies and their Data Services, there are other influential forces at work.
Service Energy

What am I referring to as a Service in this case?  Any code or logic that has been deployed by a provider to expose a resource.

Examples include:

  • APIs
  • Message Queues and Buses
  • Automation, Scripting, and Provisioning Interfaces
  • Web Services
  • Many more…

When resources are externalized this way, the value of the Data is enhanced, which helps increase Data Mass and Data Gravity. As a Service is used more frequently, the amount of energy it emits increases in our analogy.  The emitted energy has effects just as it would in Physics.  Service Energy can assist in reaching Escape Velocity as well as increase Data Gravity, depending on what the Service Energy is doing.

Service Energy explains motivations in Clouds for specific behaviors, such as:

Why Salesforce acquired Heroku – Heroku is indeed a Ruby PaaS, but it was beginning to bring in SERVICES from outside, which increased its Service Energy.  Salesforce needs this in its Ecosystem, just like it needed to create its own data service to help increase its Data Mass and therefore its Data Gravity.

Why Amazon created SQS and SES – these are services that encourage additional consumption of Compute, but more so amplify the amount of data (Data Mass).

It should be noted in the picture above that the Data is made accessible through a service, which is why it has Service Energy around it; this should be distinguished from Data Gravity. Remember, Service Energy does NOT attract, but it can amplify.

Service Energy can also be used to reach Escape Velocity.  By properly architecting applications, and even Service Oriented Platforms, the Data Mass can be spread across many providers (and even across sources inside those providers).  This provides looser coupling between the App and a specific Cloud, which gives more flexibility.  The trade-off is that this design is more prone to service interruptions, latency, and bandwidth constraints.
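A hedged sketch of that looser coupling (the provider classes are hypothetical stand-ins, not real SDKs): the App depends on a narrow interface, so the Data Mass behind it can live with, or move between, different providers.

    from abc import ABC, abstractmethod

    class BlobStore(ABC):
        @abstractmethod
        def get(self, key): ...
        @abstractmethod
        def put(self, key, data): ...

    class ProviderA(BlobStore):                  # imagine one cloud's SDK here
        def __init__(self): self._data = {}
        def get(self, key): return self._data[key]
        def put(self, key, data): self._data[key] = data

    class ProviderB(BlobStore):                  # ...and another's here
        def __init__(self): self._data = {}
        def get(self, key): return self._data[key]
        def put(self, key, data): self._data[key] = data

    def archive(store: BlobStore, key, data):
        store.put(key, data)                     # the App never names a provider

    archive(ProviderA(), "log-1", b"...")        # switching providers is one line
    archive(ProviderB(), "log-1", b"...")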

There is much more to be said about Service Energy in the future including exploring other effects it has with more IaaS centric solutions.

Data Gravity – in the Clouds

7 Dec

Today Salesforce made an announcement at Dreamforce.  I realized that many could be wondering why they decided to do this and, more so, why now?

The answer is Data Gravity.

Consider Data as if it were a Planet or other object with sufficient mass.  As Data accumulates (builds mass), there is a greater likelihood that additional Services and Applications will be attracted to it. This is the same effect Gravity has on objects around a planet: as the mass or density increases, so does the strength of the gravitational pull.  As things get closer to the mass, they accelerate toward it at an increasingly faster velocity.  Relating this analogy to Data is what is pictured below.

Data Gravity

Services and Applications can have their own Gravity, but Data is the most massive and dense, and therefore has the most gravity.  Data, if large enough, can be virtually impossible to move.
What accelerates Services and Applications to each other and to Data (the Gravity)?
Latency and Throughput act as the accelerators, creating a stronger and stronger reliance or pull between them.  This is the very reason that VMforce is so important to Salesforce’s long-term strategy.  The diagram below shows the accelerant effect of Latency and Throughput; the assumption is that the closer you are (i.e., in the same facility), the higher the Throughput and the lower the Latency to the Data, and the more reliant those Applications and Services will become on Low Latency and High Throughput.
Note:  Latency and Throughput apply equally to both Applications and Services
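To see the accelerant effect in numbers (the latencies below are illustrative): a chatty request that makes many sequential calls to the Data multiplies every millisecond of distance.

    calls_per_request = 50    # sequential round trips per user request

    for placement, latency_ms in [("same facility", 0.5), ("cross-country", 70)]:
        print(f"{placement}: {calls_per_request * latency_ms:,.0f} ms per request")

The same workload costs 25 ms next to the Data and 3,500 ms across the country, which is why Applications and Services keep sliding closer to the Data Mass.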
How does this all relate back to the announcement?  If Salesforce can build a new Data Mass that is general purpose, but still close in locality to its other Data Masses and App/Service properties, it will be able to grow its business and customer base that much more quickly.  It also enables VMforce to store data outside of the construct of ForceDB (Salesforce’s core database), enabling new Adjacent Services with persistence.
The analogy holds: just as your weight differs from one planet to another, services and applications (compute) have different weights depending on Data Gravity and which Data Mass(es) they are associated with.
Here is a 3D video depicting what I diagrammed at the beginning of the post in 2D.


More on Data Gravity soon (There is a formula in this somewhere)

The Periodic Point

22 Aug

Over 20 years ago, I started my professional career by joining GE, working as a Systems Engineer for a large bank and doing several public-sector assignments. After a few years, I worked for Sprint. Several years later, I founded my first startup, which was sold to Quest Software (which in turn was bought by Dell). After doing 6 months of consulting for a Grid computing company (remember those?), I started another company. This company would ultimately be bought by SolarWinds. I then moved to Dell’s DCS (hyperscale compute) group for about 9 months. It was then that I wrote about Data Gravity for the first time. I also discovered a project VMware was working on called Project Maple.

Project Maple was later renamed Cloud Foundry. Blogging about this discovery led to my recruitment by Jerry Chen to join the Cloud Foundry team. Working with the Cloud Foundry team in the early days was surreal, to say the least. Eventually I was recruited away to Warner Music Group, where I became SVP of Engineering, working for Jonathan Murray. At WMG, we built a first-of-its-kind software factory by leveraging Cloud Foundry OSS, which increased application delivery speed by an order of magnitude.

Just after the first version of the factory shipped, I was contacted by Adam Wray, asking if I was interested in joining him at Basho as part of a new funding round with a new investor. This sounded like a great opportunity to experience joining a NoSQL startup. After leaving Basho at the end of June, I found myself at a periodic point. Much like Nick Weaver announcing recently that he had returned to EMC, I have returned to GE.

I have joined the GE Digital business as VP of Software Engineering, working as part of the team on Machine Learning for IIoT.

Why GE Digital and the team?

I see something here that goes beyond the other opportunities I considered. The most compelling reason is being able to have a profound effect on an incredibly large and diverse number of businesses and, therefore, affecting a disproportionately large number of people’s lives in very positive ways. GE Digital’s Predix Platform directly supports all of the different GE Business Units’ IoT efforts, including the Oil & Gas, Energy, Aviation, Health, Power, and Transportation divisions, to name a few. The team is amplifying the benefits and discoveries made by looking at all of this IoT data and applying machine learning to it. The work that GE Digital is doing with Predix and the Industrial Internet of Things is truly game- and life-changing.

The team itself is made up of some of the smartest people I have ever worked with, Josh Bloom being a prime example. All of them are humble and kind, yet wickedly smart. They have created a unique culture of diversity, happiness, positivity, humility, respect, and openness, all within a highly professional, productive environment that avoids unnecessary meetings. It is truly an incredible team, and I look forward to learning and growing with them.

Why me?

The unique challenges the team faces are familiar to me: How do you grow/build a “startup” inside a large company, and how do you grow that team to scale? What processes are needed to achieve this? What does the reporting structure look like? Where do you find talent? How do you bridge the large company with the startup inside?

Some of the technical challenges they face also line up with my experience. How do you run a PaaS at scale? How do you run a PaaS on a PaaS? How do you go about building and operating a software factory? These are some examples of why this unique challenge and team were so attractive and such a great fit.

If you are looking for a great engineering or data science position, either live within commuting distance of San Francisco or would consider relocating, and believe that Machine Learning and IoT are the future, please send me a DM on Twitter (@mccrory).

Cloud Escape Velocity – Switching Cloud Providers

18 Dec

The term Escape Velocity refers to the speed needed to “break free” from a gravitational field without further propulsion.  Data Gravity, as explained in THIS previous post, is what attracts and builds more Data, Applications, and Services on Clouds.  Data Gravity is also what creates a high Escape Velocity when moving to another Cloud.

Some background on why this post is timely:

A few days ago, Amazon announced a new AWS service for importing VMware disk images (VMDKs) into EC2.  VMware already offered a method for converting EC2 instances into Workstation VMs through its Converter tool, and with a second-pass conversion, into ESX VMs.  While all of this sounds wonderful, and it does have value, it brings to light an entirely different issue: only Stateful / Fully encapsulated applications can be moved around in this way.

Below are examples of sources of Cloud Gravity (App, Service, and Data Gravity combined) acting on your specific Application.

If someone selects a Cloud provider and writes an application leveraging anything more than a handful of VMs, Data Gravity will make it virtually impossible to move to a new/different Cloud provider.  Don’t believe it?

See the Diagrams Below:

Here is a diagram of an app that has a Low Escape Velocity because of Lower Cloud Gravity:

Cloud Escape Velocity with Low Gravity

Below is a diagram of an app that has a High Escape Velocity because of High Cloud Gravity:

Cloud Escape Velocity with High Gravity

Some potential dependencies include:

  • A Database with a specific API
  • A Web Worker which serves as a web interface and uses internal Authentication (your user logins are here!)
  • Application code that uses the Database’s and Web Workers’ specific APIs and/or depends on Low Latency and High Throughput access to them

Here are a few additional things to think about:

– The longer (more time) an Application stays in a specific Cloud, the more difficult it is to move.  Why?  Data Gravity increases due to more Mass (data being stored).  Imagine accumulating hundreds of GBs of Data; how easy will it be to shuffle/transfer that much data around? (A rough calculation follows this list.)

– The more provider APIs and Services you depend on, the harder it is to move.  Why?  Because there are only two paths that can be taken in a move.  The first is to find another provider that has the exact same set of APIs and Services (this will limit your choices).  The second is to change or rewrite your application to take advantage of the new Cloud provider’s APIs and/or Services.

– Different providers charge differently for the consumption of the same resources.  Suppose your current provider gives applications free usage of queues, while the provider you are looking to move to charges after the first X messages on a queue.  Now what do you do?  You will either pay more when you move, rewrite your application to fit the new provider’s model, or pick another provider that offers free queue usage.

– QoS guarantees differ from Cloud provider to Cloud provider.  Some Cloud providers offer SLAs with reimbursements for outages; others offer only best effort.  Some providers offer tiered Services; others offer only a single tier.  What happens if you want to move and you can’t get the minimum level of QoS that you need?
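On the first point above, a rough calculation of what moving an accumulated Data Mass actually costs in time (the bandwidth figures are illustrative, and real transfers rarely sustain line rate):

    dataset_gb = 500

    for label, mbit_s in [("100 Mbit/s", 100), ("1 Gbit/s", 1_000)]:
        seconds = dataset_gb * 8_000 / mbit_s    # GB -> megabits, then divide
        print(f"{label}: {seconds / 3600:.1f} hours")

Roughly 11 hours on a 100 Mbit/s link, and the dataset keeps growing while you copy it.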

This is NOT an attempt to dissuade anyone from using Public Clouds (they are incredibly valuable and powerful), but I would like more people to go in with eyes wide open.