Cloud storage

AWS S3 behind Netflix success

“Big Data and Cloud Storage” series Vol. 5:   Event and Company #3

Netflix as the big data tycoon

Netflix is known as one of the most sophisticated players in the big data community.  They appear regularly at big data conferences such as Strata, where they discuss how they use data analytics in their business and what their infrastructure looks like.

My theory of why Netflix succeeds where many others do not is that its sophisticated big data capability lets it deliver better service at a wider margin.  Media industry people often see online video delivery as just another distribution channel and pay little attention to this “brain” part of the cloud, but it is the secret sauce of Netflix’s success.

From the user data to recommendations

I have tried all the major movie services over the years, including Netflix, Hulu, Apple, Amazon, cable’s TVEverywhere, as well as Joost, CinemaNow and MovieLink (remember them?).  Among them, Netflix stands out for the power of its recommendations. Other services push the titles they want to show, such as new releases, while Netflix’s top page is filled with personalized recommendations.

In talks at big data conferences, Netflix shows off how it uses amazingly detailed usage data to come up with such recommendations.

With streaming, Netflix knows what you watch, on which date and at what time, on what device, whether you quit partway, where you stopped, and whether you came back to finish.  It goes well beyond a simple “people who watched this movie also watched these” factor.

In my household, I have the Netflix account and everyone else in the family shares it.  Each of us has very different tastes, so I felt sorry for confusing Netflix, but they are actually one step ahead.  They already have a rough profile of my family members from analyzing this usage data.  And they show it in a subtle way, with categories such as “SF Action” or “Foreign Art Films”, not in a creepy way such as “one for your teenage son” or “for mom”.

Scale out on Amazon S3

Netflix is the best-known user of Amazon Web Services (AWS), which serves as the infrastructure supporting this massive data analytics operation.  They give “data center management is not our main business” as their reason for using AWS.

They used to have their own data center and ran an Oracle database early in their history, but the amount of data exploded as their online streaming service caught on, to the point where they could no longer keep up by building new data centers.  So they moved to being almost 100% cloud-based in 2009-10, in both processing and storage, to be able to scale rapidly.

Currently, AWS’s S3 is used to store both video and user behavior data.  User orders are processed in the NoSQL database Cassandra, and the data is then dumped into S3 once a day.  According to an engineer’s confession in a Strata talk, they had so much trouble with this transfer process that they developed their own software for it and named it Aegisthus.  In Greek tragedy, Aegisthus is the lover of Clytemnestra and her accomplice in the murders of Agamemnon and the Trojan princess Cassandra.

User data stored in S3 is analyzed with Hadoop tools, and the results are stored back in S3.  S3 is generally known as a "pay as you go" service, but big customers like Netflix are usually assigned a fixed capacity, so they use the slack capacity for user data analytics after midnight on the West Coast, when video streaming volume drops sharply.

The speaker emphasized the concept of "the right tools for the right job" in his talk.  Depending on what your business model is, you have to choose where to put your own resources and what to buy from outside.  Big data strategy is not defined solely by the amount of data or the size of the company.  Strategic priorities are often more important in a “build or buy” decision.  Cloud storage provides advantages for enterprises of all sizes.

Cloud Expo Europe and Citrix

“Big Data and Cloud Storage” Vol. 4

Event and Company #2

In London, “Cloud Expo Europe” took place on January 29th and 30th, 2013. Cloudian exhibited at the Expo, so I asked Giorgio Propersi, General Manager, Americas and EMEA at Cloudian, what the Expo and the cloud industry look like in London.

◆ “Cloud” as an international phenomenon

I attended another “Cloud Expo” in Santa Clara last fall. At that conference, I felt that the main focus was on OpenStack, an open source software platform for IaaS (Infrastructure as a Service), and I wondered if it was the same in London, but Giorgio had a different impression.

Giorgio: Both in Santa Clara and in London, I think the shows were neutral, rather than focused on OpenStack. Certainly, the OpenStack Foundation and other big OpenStack supporters such as Rackspace were big sponsors, but there were also many companies supporting other types of cloud such as CloudStack.

What amazed me was how much bigger the (London) show was than last year. The 2012 show was much smaller and more “hosting” focused. This year, the show was finally really “cloud” centered, and companies were showing cloud computing or cloud storage technology, and all the technologies around the cloud (such as how to manage the cloud, keep track of what is happening in the cloud, debug the cloud and so on). The floor was very full. The show organizer was expecting 5000 attendees, but I thought there were many more people.

Inside the venue, many U.S. companies were exhibiting (NTT was also there), and it was really hard to see the difference from a U.S.-based show. It was indeed a very international show. There were some European companies, but these companies were at the Santa Clara show too. I think the balance was the same as in the U.S. The customer profile was also not much different, just geographically different, with more representation from companies centered in Europe and Asia (such as BT and Tata Communications). There was also a nice representation of small European service providers from many countries in Europe.

◆ CloudStack and Citrix

CloudStack is another open-source IaaS software platform. It was developed by Cloud.com, which was acquired by Citrix in 2011. Both Cloud.com and Citrix were OpenStack members, but after the acquisition, Citrix released CloudStack in 2012, donated it to the Apache Software Foundation, and then decided to leave the OpenStack group. It is a bit complicated, but in any case CloudStack is now a separate project from OpenStack. Joe Onisick writes in his article in Network Computing that CloudStack is better packaged for enterprise adoption, while OpenStack is more like a framework and has strong supporters.

Giorgio: We need to keep in mind that Cloudian integrates with both OpenStack and CloudStack. It is hard to simplify the differences between them, and customers may choose one or the other for completely different reasons. There are plenty of technical papers describing the differences between the two open source approaches, and the merits/demerits of each one.

Quite often, if I am an enterprise or a service provider and am moving my system to the cloud, I would look for a solution that is proven, and fully supported by my technology supplier. If I get the open source code directly, I will need to commit a lot of my internal engineering resources; and later on I will be responsible for supporting my cloud. This may not be good for many companies. Suppose, for example, that I am a bank. I would rather spend my time and money on doing what I do best, which is banking, so I would rather go talk to my trusted technology supplier, who will take care of my cloud. I buy the cloud from my technology provider not because of which open source software they use inside, but because the cloud solution they propose works for me, is optimized for my existing system, and the price is right. In some cases, customers don’t even know what technology their cloud is based on.

And open source is not the only cloud technology. Microsoft, VMware and others have their own cloud solutions.

Citrix provides Cloud Platform, their commercial version based on CloudStack, and this is the cloud platform people buy from Citrix today.

◆ STaaS and Secondary Storage

Citrix’s Cloud Platform is such a solution for enterprise customers, but it doesn’t have the object storage piece, so Citrix integrates with Cloudian.

Giorgio: By integrating Cloudian (the object storage infrastructure provided by Cloudian, Inc.) into their Cloud Platform, Citrix can provide two additional functionalities which they don’t have at this moment. One is STaaS (Storage as a Service) based on S3. STaaS means the capability of storing objects in the cloud, and using the cloud as storage. And the S3 compatibility allows the concept of the hybrid cloud. Many companies have adopted the concept of hybrid cloud. For example, I want to store specific files in the public cloud (such as Amazon), and specific files in my private cloud; and I keep changing my mind with regard to the destination of my data. The only way to handle this situation (that I want “some” data in the private cloud, and “some” data in the public cloud) is to have the same interface to the public and to the private cloud, so I can easily switch between the two. This interface is S3, which is fully supported by Cloudian. So the STaaS functionality with S3 compatibility is very beneficial; and this is what Cloudian adds to Cloud Platform.

The other functionality has to do with the way Secondary Storage is stored. While primary storage is the immediate disk for items that need to be accessed very quickly and used directly by the application (such as an Excel spreadsheet), Secondary Storage is used for snapshots, templates, ISOs, VMs, etc. If you use Cloudian to store Secondary Storage, then Secondary Storage becomes available to every zone within a CloudStack cloud. In a non-Cloudian environment, secondary storage is typically stored on the local NAS, and because of that it can only be accessed locally; and if that zone is down, these templates, VMs, snapshots, etc. are not available, which is bad. This functionality is very important. Visitors to our booth at Expo really liked this, since maintaining visibility to all critical Secondary Storage from every zone is of paramount importance.
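Giorgio's hybrid cloud point, that one common interface lets you move data between public and private clouds, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ObjectStore` interface, the `InMemoryStore` stand-in for an S3-compatible endpoint, and the `choose_store` placement rule are hypothetical and are not Cloudian's or Amazon's actual API. Only the method names loosely echo the S3 operations of putting and getting an object.

```python
from typing import Dict, Protocol, Tuple


class ObjectStore(Protocol):
    """Shared interface (hypothetical): both clouds expose the same calls."""

    def put_object(self, bucket: str, key: str, body: bytes) -> None: ...

    def get_object(self, bucket: str, key: str) -> bytes: ...


class InMemoryStore:
    """Hypothetical stand-in for an S3-compatible endpoint, public or private."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def put_object(self, bucket: str, key: str, body: bytes) -> None:
        self._objects[(bucket, key)] = body

    def get_object(self, bucket: str, key: str) -> bytes:
        return self._objects[(bucket, key)]


def choose_store(key: str, private: ObjectStore, public: ObjectStore) -> ObjectStore:
    """Hypothetical placement rule: keys under 'confidential/' stay private."""
    return private if key.startswith("confidential/") else public


private_cloud = InMemoryStore("private")
public_cloud = InMemoryStore("public")

files = [
    ("confidential/payroll.csv", b"payroll"),
    ("media/logo.png", b"logo"),
]
for key, body in files:
    # The calling code is identical whichever cloud is chosen.
    choose_store(key, private_cloud, public_cloud).put_object("demo-bucket", key, body)
```

Because the application only ever calls `put_object` and `get_object`, changing your mind about where data lives means changing the placement rule, not the application code.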

◆ “Object storage”, rather than big data or cloud

I have been writing this column on the theme of “big data and cloud storage,” but Giorgio prefers to describe what we are dealing with here as “object storage”, rather than “big data” or “cloud storage.”

Giorgio: We prefer to refer to our product as the latest and greatest object storage technology (rather than big data).

The term “big data” can be misleading, because size is not always the motivation for object storage. Many companies start using object storage in a small way, such as with 5 or 10 terabytes, but they store data in object storage in the cloud (instead of using more traditional storage technology) because of the a. cost, b. efficiency and c. scalability, so they can scale to big data later on. People like object storage because of its simplicity, its affordability and its scalability.

And object storage fits the cloud so well. Cloud is important because I don’t have to buy my storage anymore, and I don’t have to hire people to manage it either. I can outsource my storage to the cloud.

◆ Cloudian for Citrix

So what is important for customers in choosing object storage?

Giorgio: Compared to other object storage partners of Citrix, Cloudian’s strengths are the full S3 compatibility, and the ability to support multiple datacenters. Multi-datacenter support is not easy. When our first European customer Lunacloud wanted to add a second datacenter in France (on top of the existing datacenter in Portugal), it was a big factor. We support several configurations with regard to how many replicas can be kept, and where these replicas are kept. Other companies cannot do that. And keep in mind that Cloudian was Citrix’s only storage partner at the Citrix booth in London.

At London Expo, we announced Cloud Portal Business Manager (CPBM) integration with Citrix as well. CPBM is a dashboard for managing cloud services on the web, so a customer using CPBM can now add (or modify) a cloud storage service provided by Cloudian through this portal.

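Article published: “Google Takes On Arms Dealer Amazon with a Bamboo Spear”

As announced at the end of the year, we have integrated the Japanese blog into the official ENOTECH site and opened a new blog. So, partly as a test, here is a quick announcement.

Part 8 of the ZDNet series “Big Data and Cloud Storage” has been posted.

The same article is also available on the Cloudian blog.

Thank you for your support!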

From box to cloud – Random thoughts at Hosting and Cloud Conference

“Big Data and Cloud Storage” Vol. 2:   Event and Company #2

Die-hard Local Businesses as “Cloud” infrastructure

It was my first time attending the “Hosting and Cloud Transformation Summit 2012”, held in Las Vegas on September 19th and 20th.

I enjoy feeling the distinct beat of each industry when I attend conferences. This time, the impression was quite different from that of equivalent Silicon Valley events.

In Silicon Valley, I am accustomed to seeing Asians, Europeans, men and women all mixed together in jeans-based business casual, while the exhibit booths push a fashionable, cutting-edge image. But here at HCTS, most attendees were white males in suits, and the booths were more practical and industrial. I often encounter this atmosphere in traditional telecom segments, which are supported by many locally oriented small to medium service providers.

In the 1990’s, thousands of discount long distance telephone carriers and CLECs flourished across the US. WorldCom was the most famous one, but most of them were not “money game” type people at all. They carried on as honest local businesses, changing their line of business as the industry changed, and one of the forms they evolved into is the hosting business.

This not-so-flashy but vast and steady infrastructure business is the building block of cutting-edge cloud services such as social media and online games.

◆ Hosting business is doing well

But their profitability is quite flashy indeed, said DH Capital, an investment bank specializing in this area. They claim that after the over-supply of the bubble period, demand has caught up and the market is now well balanced. EBITDA multiples of publicly traded hosting companies are at 15-20x, a nice level for a steady industry.

Traditional hosting as a “box” business can earn a steady income when the box is full. But many predict a huge increase in demand in the near future, and with it the threat of overflow. Conference organizer The 451 Group pointed out in the keynotes that this change in demand is driven by the “Internet of Things (IoT)” and “Big Data.”

The audience seemed a bit slow to react to these issues, another difference from Silicon Valley, where people are actually feeling the pinch of data overflow. I guess it is because the data overflow has not yet spread to many other places.

◆ From Box to Cloud

But this data wave is spreading for sure.

According to The 451 Group, the Internet infrastructure market is estimated to grow from $39 billion in 2010 to $68 billion in 2013, at a 20% CAGR. Among its sub-sectors, the largest are traditional managed hosting and multi-tenant datacenters, but the fastest-growing is cloud computing, with 62% annual growth. “Cloud” is a more flexible and scalable form of service than “box”-type traditional hosting.
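As a quick sanity check on the growth figures quoted above, here is a small sketch of the standard compound annual growth rate (CAGR) calculation. The function name `cagr` is my own, and the three-year span is an assumption taken from the 2010 and 2013 endpoints.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.20 means 20%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted above: $39 billion (2010) to $68 billion (2013), three years apart.
market_cagr = cagr(39.0, 68.0, 3)
print(f"Implied CAGR: {market_cagr:.1%}")
```

The result comes out at roughly 20%, consistent with the CAGR The 451 Group quoted.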

The characteristics of big data are often expressed as the “3Vs”: “Volume, Velocity and Variety.” The 451 Group argues that the datacenter has to be elastic as well to handle this type of data traffic.

Datacenter management has to adjust as well. At the conference, for example, Schneider Electric explained that power management has to be aware of the upper layers, because if the power to part of a virtual machine environment goes down, the management system has to know where to back up.

◆ Amazon Dominance

Looking from a different point of view, cloud services can be divided into three layers: “SaaS (Software as a Service)”, “PaaS (Platform as a Service)” and “IaaS (Infrastructure as a Service).” Hosting service providers are in the IaaS area.

The 451 Group showed that approximately half of the IaaS market is taken by Amazon, followed at a wide margin by Rackspace and Verizon Business. It is safe to say that Amazon S3 (Simple Storage Service) has become the de facto standard in cloud storage.

Although Amazon was not present at the conference, many speakers mentioned it, in the context of managed hosting players feeling threatened by Amazon S3. The 451 Group, however, claimed that the two have different roles and will continue to co-exist.

◆ Cloudian Community Edition

At the conference, this column’s sponsor Cloudian announced the free “Cloudian Community Edition”.

Cloudian software enables hosting providers and enterprise users to build an “Amazon-style” cloud storage system, compatible with S3.

Community Edition includes the same functionality as the standard edition and is free up to 100 terabytes.

Please refer to the Cloudian website for more details.

Era of Data Explosion and Big Data

Big Data and Cloud Storage Vol. 1:  Trend #1

I am starting a series called "Big Data and Cloud Storage" on this blog, sponsored by Cloudian, a provider of cloud storage software.  The first installment covers the historical trend of data explosion and the need for cloud storage.

  • Analog Data and Digital Data

Humankind has accumulated a dazzling pile of analog data over thousands of years of history. From Buddhist scriptures and printed Gutenberg Bibles to the enormous number of modern-day books, photos, music and videos, it continues to accumulate day by day.

Digital data is not far behind. Now that major media and personal communications have all turned digital, digital data is bursting at the seams.

So here is a question. Which do you think is bigger, analog data or digital data?

The answer depends on “when”. Consulting firm McKinsey published a report on “big data” in May 2011, which includes an estimate of the digital share of all accumulated data. In 2000, analog data, thousands of years in the making, accounted for 75% of the total. By 2007, however, digital had overwhelmed analog with a 94% share, overtaking it in a mere 7 years.

Digital technology took off in the 80’s with the advent of the personal computer, and by the time of the Net Bubble in the 90’s, most media, such as mail, photos, music and video, were available in digital format. Yet in 2000 we still had far more analog data, and within the 10 years after that, digital exploded. How, then, did it happen?

  • Bubble Burst and User Generated Contents

In the 1990’s, e-mail emerged as an alternative to snail mail. Then came e-commerce as an alternative to catalogs, and news portals as an alternative to newspapers and magazines. Back then, transmission speed and technology were still limited, so a relatively small number of providers produced catalogs and articles and delivered this content to users over the Net in a one-way manner.

Throughout the bubble period, Internet infrastructure was built on a huge scale, but after the bubble burst in 2000, demand suddenly shrank and the prices of over-supplied fiber optics and datacenters plummeted.

Sometime later, a new type of Internet company, such as Google and Facebook, rose from the ashes of the bubble. This new species of Web industry was later named “Web 2.0”. These companies provided an “interactive” flow of information on the Net, created platforms for “user generated content” and revolutionized the net business. They were not alternatives to something else, but were totally unique to Web technology and had a totally different cost structure. People started to share their thoughts and photos on blogs, and videos on YouTube. And all this user generated content has been published and accumulated on the Internet.

  • Data gathers on “cloud” and becomes “brain”

Google’s then-CEO Eric Schmidt used the term “cloud computing” in a speech in 2006, popularizing the word “cloud”. Cloud computing means keeping data and applications on the Internet rather than on a desktop computer. The term comes from the cloud shape used to represent the Internet in network diagrams. The idea had been advocated before, but around this time it finally became reality, as the network environment caught up thanks to broadband penetration.

As data has moved from analog to digital and been published on the cloud, we can now easily gather many different kinds of data in the cloud, sort it and extract meaning from it. Starting off as monads of individual computers in the 80’s, they were connected by the nerves of the Internet in the 90’s to form an earthworm, and in the 2000’s evolved into a human brain.

And this highly intelligent brain activity on the Internet is called “big data”. The more information is stored, the better the brain works, and the better it works, the more interesting it becomes to learn new things, so the brain autonomously sucks in new data at an ever-increasing rate.

In summary, the digital data explosion and the subsequent trend towards big data were triggered by the Web industry’s move into the “cloud”.