Big Data

[Event announcement] JSNC talk "CES and Strata," April 10

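I will be giving a talk, "Trade Show Lecture Series, Part 2: CES and Strata," at a Japanese-language event hosted by the Japan Society of Northern California.

It takes place on April 10, 4:00-7:00 pm, at RakuNest, inside the Rakuten office in San Mateo.

Building on SVOI's "Let's find startups at trade shows" project, I will give an overview of CES (consumer tech) in January and Strata (data and AI) in March, and introduce the startups that caught my attention.

In addition, under the banner "Let's try mini big data," I will introduce a "mini version" of a business intelligence (BI) tool, BI being a typical application of big data. Anyone, data specialist or not, can get a small taste of BI for free.

To sign up, please visit the Japan Society website here.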

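What is "on-demand broadcasting" of the Olympics?

The Olympics is the ultimate big-data event: a huge number of sports are contested in parallel, a huge number of countries take part, and a huge number of viewers are watching around the world.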

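Terrestrial TV, bound by the hard constraints of spectrum and a 24-hour day, can pick out only a tiny fraction of all that. Terrestrial TV also mostly earns its money from commercials (in Japan, everyone except NHK), so it has to make sure as many people as possible are watching at the moment a commercial airs. As a result, it inevitably goes for the lowest common denominator: major, TV-friendly sports in which that country's athletes do well.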

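In the US, cable TV began to spread around the 1970s and, after several rounds of policy support, about 85% of all households now subscribe to cable or a competing pay-TV service. Cable is not bound by spectrum constraints, so it can offer a very large number of channels. Sports is killer content for both terrestrial and cable, and it holds a special position in the US TV industry. In the US, NBC, one of the four major networks, holds the Olympic broadcast rights. NBC also has cable channels such as NBCSN, MSNBC, Bravo, USA and Telemundo under its umbrella, and it airs the Games on these as well. Even so, what gets aired is still selected and scheduled by the broadcaster.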

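On top of that, there is on-demand delivery over the Internet. I don't remember exactly when this format began, but it has already been used for several Olympics. At first you had to sign up and pay for something like an "Olympics on-demand package," so it was not popular at all. From around 2010, however, the TV industry aggressively adopted an approach called "TV Everywhere" as a countermeasure against YouTube: cable subscribers can authenticate with a password and watch programs on other devices (PCs, smartphones and so on). Olympic on-demand viewing now comes as a bonus with a cable subscription.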

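NBC thus broadcasts the Olympics through a mix of three formats: (1) terrestrial, (2) cable and (3) on-demand. Each has its strengths and weaknesses, and each has content and a business model to match. All three are funded by a combination of commercials and the carriage fees received from cable companies: in (1) commercials weigh most heavily, in (3) carriage fees do, and (2) falls in between.

The key here is the "carriage fee from cable." Cable subscribers (I say cable for simplicity, but the same applies to satellite and other pay TV) pay a steep subscription fee of more than $100 a month. Both terrestrial channels such as NBC and cable-only channels such as MSNBC receive part of that subscription fee as a content fee. The major terrestrial networks and the major cable-only channels such as ESPN, Disney and Discovery are included in the basic service, called the "basic package," while premium channels such as HBO are subscribed to individually.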

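In the US as in Japan, terrestrial TV's ad revenue is declining (though it is still large). To make up for it, the networks negotiate with cable companies to raise carriage fees (sometimes the talks break down and the channel gets blacked out) and acquire cable-only channels to increase their channel count. Rolling out on-demand is part of the same effort. Viewers who watch on demand are cable subscribers and can be tracked by username, so in addition to increasing the carriage fees collected for that viewing, it is also possible to deliver ads matched to the user profile (commercials embedded in the program, just as on TV). (I don't know whether they actually do this.)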

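With on-demand, you can run a site search on the NBC site by sport or athlete name. For example, a search for "Kei Nishikori" brings up a long list of on-demand videos of past matches in which Nishikori played. Click the one you want and you are asked for your cable company account information (username and password). Without logging in you can watch a 30-minute trial the first time, but beyond that you need to log in. The video has none of the announcers or commentators of a familiar broadcast; match footage and venue sound simply play. (The picture itself, though, is exactly the same as a regular sports broadcast: it cuts between shots to make things easy to follow, closing in on the player who scored, or adding underwater shots in swimming.)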

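The NBC site's interface is not exactly easy to use, but it is still a great thing for anyone who wants to watch Japanese athletes or minor sports. Authenticating and delivering this much video in a short period, with many viewers concentrated at once, takes considerable technology, and as an occupational habit I can't help worrying about that side. Thanks to advances in big data technology, this kind of delivery is now possible. In the US, too, it was much harder to watch in the early days, and you can see the technology improving steadily.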

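One important point: it is generally said that TV viewership has actually increased since on-demand delivery began. The current Rio Games are also expected to reach a record audience (helped, for example, by Russia being hit over doping and the US all but monopolizing the medals as a result), and similar results have been seen for American football, so the networks are investing actively in on-demand technology.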

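This works as a business in the US because of a special circumstance noted above: cable subscriptions are expensive, so there is enough money to pass plenty on to content companies as carriage fees. Also, since the 2007 Writers Guild strike, a system has been put in place for how the content fees received by content companies are distributed to actors, directors and writers, down to the various staff. People who make TV therefore have an incentive too: when carriage fees from on-demand increase, they benefit as well.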

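I don't know the recent state of on-demand broadcasting in Japan in much detail, but judging from comments on Newspicks, it does not seem to have progressed very far yet. Leaving the background aside for now, I think Japan will need to work out two points in its own way, since the background differs from the US: who collects a sufficient subscription fee, as opposed to ad money, at the point of entry (diversifying where the money comes in), and how content fees are distributed. Fully expecting an endless list of reasons why it could never work, I sometimes think the best way might be to use NHK's fee collection system: NHK would set up a subsidiary to serve as a platform for "delivery infrastructure" and "fee collection," and handle on-demand delivery on behalf of the commercial broadcasters.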

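In the US, high cable bills are a constant target of criticism, but they are also what funds the trial and error described above and the upfront investment in production and delivery methods. And because there are large users like these, an ecosystem has formed in which big data startups keep emerging in the US.

Japan's broadband and video delivery services lag far behind those of the US, and I worry that the lag in video services, which should sit at the very top of the "trickle-down of business models," will further weaken Japan's IT competitiveness. With the Tokyo Olympics coming up, and with a certain talent agency, long said to be the root cause of the TV stations' timidity, showing signs of weakening, I believe this is the moment for people on the TV side to make a real effort to expand TV on-demand in Japan in earnest.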

AWS S3 behind Netflix success

“Big Data and Cloud Storage” series Vol. 5:   Event and Company #3

Netflix as the big data tycoon

Netflix is known as one of the most sophisticated users in the big data community.  They appear regularly at big data conferences such as Strata and discuss how they use data analytics in their business and what their infrastructure looks like.

My theory of why Netflix succeeds where many others do not is that its sophisticated big data capability lets it deliver better service at a wider margin.  Media industry people often see online video delivery as just another distribution channel and pay little attention to this “brain” part of the cloud, but it is the secret sauce of Netflix's success.

From the user data to recommendations

I have tried all the major movie services over the years, including Netflix, Hulu, Apple, Amazon and cable’s TV Everywhere, as well as Joost, CinemaNow and MovieLink (remember them?).  Among them, Netflix stands out for the power of its recommendations. Other services push what they want to show, such as new releases, while the Netflix top page is filled with personalized recommendations.

In discussions at big data conferences, Netflix shows off how it uses remarkably detailed usage data to come up with such recommendations.

With streaming, Netflix knows what you watch, on what date and at what time, on what device, whether you quit partway, where you stopped, and whether you came back to it.  It is not a simple “people who watched this movie also watched these” factor.
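
To make the kind of usage record described above concrete, here is a minimal Python sketch. The field names, the 90% completion threshold and the sample values are all assumptions for illustration; nothing here reflects Netflix's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ViewingEvent:
    """One playback session as a hypothetical record (field names are illustrative)."""
    account_id: str
    title_id: str
    started_at: datetime    # date and time the viewing started
    device: str             # e.g. "tv", "tablet", "phone"
    stopped_at_sec: int     # playback position where the viewer stopped
    title_length_sec: int   # total running time of the title
    resumed: bool           # whether the viewer later came back to it

    @property
    def completion_ratio(self) -> float:
        # Fraction of the title that was played before stopping
        return self.stopped_at_sec / self.title_length_sec


def abandoned(event: ViewingEvent, threshold: float = 0.9) -> bool:
    """Treat a title as abandoned if the viewer stopped early and never resumed.

    The 0.9 threshold is an arbitrary assumption.
    """
    return event.completion_ratio < threshold and not event.resumed


# Made-up sample: a tablet session that stopped 10 minutes into a 100-minute title
event = ViewingEvent(
    account_id="acct-1",
    title_id="title-42",
    started_at=datetime(2012, 10, 1, 21, 30),
    device="tablet",
    stopped_at_sec=600,
    title_length_sec=6000,
    resumed=False,
)
print(abandoned(event))
```

Signals such as "stopped early and never came back" carry more information than a plain list of watched titles, which is the point of the paragraph above.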

In my household, I have the Netflix account and everyone else in the family shares it.  Each of us has very different tastes, so I felt sorry for confusing Netflix, but they are actually one step ahead.  They already roughly know the profiles of my family members from analyzing such usage data.  And they show it in a subtle way, with labels such as “SF Action” or “Foreign Art Films”, not in a creepy way such as “one for your teenage son” or “for mom”.

Scale out on Amazon S3

Netflix is the best-known user of Amazon Web Services (AWS) as the infrastructure supporting this massive data analytics operation.  They give “data center management is not our main business” as the reason for using AWS.

Early in their history they had their own data center and ran an Oracle database, but as online streaming caught on, the data volume exploded to the point where building new data centers could no longer keep up.  So in 2009-10 they moved almost 100% to the cloud, for both processing and storage, to be able to scale rapidly.

Currently, AWS’s S3 is used to store both video and user behavior data.  User orders are processed in the NoSQL database Cassandra, and the data is then dumped into S3 once a day.  According to an engineer’s confession in a Strata talk, they had so much trouble with this transfer process that they developed their own software for it and named it Aegisthus.  In Greek tragedy, Aegisthus is a figure associated with the killing of Cassandra, the princess of Troy.

User data stored in S3 is analyzed with Hadoop tools, and the results are stored back in S3.  S3 is generally known as a "pay as you go" service, but big customers like Netflix are usually assigned a fixed capacity, so they use the slack capacity for user data analytics after midnight on the West Coast, when video streaming volume drops sharply.
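
The off-peak idea can be pictured with a small Python sketch. The fixed capacity, the hourly load figures and the function names are made up for illustration; they are not Netflix or AWS numbers, and real scheduling would be far more involved.

```python
from typing import Dict, List


def slack_capacity(fixed_capacity: float,
                   streaming_load_by_hour: Dict[int, float]) -> Dict[int, float]:
    """Capacity left over for analytics in each hour, given a fixed allocation."""
    return {
        hour: max(fixed_capacity - load, 0.0)
        for hour, load in streaming_load_by_hour.items()
    }


def analytics_hours(slack: Dict[int, float], needed: float) -> List[int]:
    """Hours in which the leftover capacity is enough for the analytics job."""
    return sorted(hour for hour, free in slack.items() if free >= needed)


# Hypothetical load profile in arbitrary units, keyed by West Coast hour of day
load = {20: 95.0, 22: 90.0, 0: 60.0, 2: 25.0, 4: 20.0}

slack = slack_capacity(fixed_capacity=100.0, streaming_load_by_hour=load)
hours = analytics_hours(slack, needed=50.0)
print(hours)
```

With this made-up profile, the analytics job only fits in the early-morning hours, when streaming load has dropped.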

The speaker emphasized the concept of "the right tools for the right job" in his speech.  Depending on what your business model is, you have to choose where to put your own resources and what to buy from outside.  Big data strategy is not defined solely by the amount of data or the size of the company.  Strategic priorities are often more important in a “build or buy” decision.  Cloud storage provides advantages for enterprises of all sizes.

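Article published: "Google takes on arms dealer Amazon with bamboo spears"

As announced at the end of the year, I have merged my Japanese blog into the official ENOTECH site and opened the new blog. So, partly as a test, here is a quick announcement.

Part 8 of the ZDNet series "Big Data and Cloud Storage" has been posted.

The same article is also available on the Cloudian blog.

Thank you!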

Memory of the cloud brain – what is cloud storage?

“Big Data and Cloud Storage” Trend 2:  “Big Data and Cloud” Vol. 3

What is Cloud Storage?

Memory of the cloud brain

In my previous article, I wrote that the “cloud” is becoming the "brain" of the Internet world and that its “thinking” activity corresponds to “big data”. This time, I will talk about another brain function, “memory”, which is “cloud storage”. The term “STaaS (Storage as a Service)” is used interchangeably.

Dropbox is an easy-to-understand example. To be precise, Dropbox is an end-user application and cloud storage is an infrastructure for applications, but consider it a metaphor for understanding the role of cloud storage.

Documents are stored on Dropbox servers in the cloud. It gained popularity as a tool for sharing documents between desktop and mobile devices, as part of the web world's transition to the "mobile and cloud" era that I mentioned in the first article. It is also used as groupware to share files among team members, and the similar service Box is widely used by enterprise users.

These are particularly storage-centered services, but virtually all web services need storage, such as mail storage in Gmail and photo storage in Facebook.

“Kanban system” cloud storage

Cloudian distinguishes Dropbox-like upper-layer file sharing, which it calls “online storage”, from the lower-layer infrastructure for application providers, which it calls “cloud storage”. The following discussion is about the latter.

Major players such as Facebook and Google own and operate in-house storage infrastructure. However, many other online service providers strategically choose to outsource it. The major online movie streaming provider Netflix, which holds a huge amount of video and customer data, is a good example of a user of such “cloud storage”.

The specialized consulting firm 451 Group forecasts that the global cloud storage market will grow from $1.3 billion in 2011 to $6.0 billion in 2015. The majority is storage-centric services ($750M → $4.7B), with backup and archiving ($550M → $1.3B) making up the rest.
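
As a quick check on these figures, the sketch below computes the implied compound annual growth rate and confirms that the two segments add up to the totals. The function name `cagr` is hypothetical; the dollar figures are the ones quoted above, in billions.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1 / years) - 1


# Totals quoted above, in USD billions
total_2011, total_2015 = 1.3, 6.0

# Segment figures quoted above, in USD billions
segments_2011 = {"storage-centric": 0.75, "backup and archiving": 0.55}
segments_2015 = {"storage-centric": 4.7, "backup and archiving": 1.3}

growth = cagr(total_2011, total_2015, years=2015 - 2011)
print(f"Implied growth per year: {growth:.1%}")
print(f"2011 segments sum: {sum(segments_2011.values()):.2f}")
print(f"2015 segments sum: {sum(segments_2015.values()):.2f}")
```

The forecast implies growth of roughly 46 to 47 percent a year over the four years, and the segment figures are consistent with the totals.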

451 Group defines cloud storage by two factors, as follows:

(1) Storage capacity can be obtained on an on-demand basis.

(2) Data is in a hosted environment and can be accessed via the Internet.

If the amount of data fluctuates drastically over time, owning enough storage capacity for the peak is too expensive, like an empty highway in the countryside. Instead, cloud storage (STaaS) can work like the Kanban system, supplying capacity only when it is needed. Of the two items above, (1) is the defining characteristic of cloud storage, whereas (2) also applies to traditional hosting services. This Kanban-like scalability is called "scale-out” in the cloud industry.
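
The cost argument can be made concrete with a small Python sketch. The monthly usage figures and the per-terabyte prices are invented for illustration (including the assumption that on-demand capacity costs more per terabyte than owned capacity); they are not vendor prices.

```python
from typing import List


def owned_cost(monthly_usage_tb: List[float], price_per_tb_month: float) -> float:
    """Owning capacity means paying for the peak in every month."""
    peak = max(monthly_usage_tb)
    return peak * price_per_tb_month * len(monthly_usage_tb)


def on_demand_cost(monthly_usage_tb: List[float], price_per_tb_month: float) -> float:
    """On-demand capacity means paying only for what is used each month."""
    return sum(monthly_usage_tb) * price_per_tb_month


# Hypothetical usage in TB per month, with one sharp spike
usage = [10, 12, 15, 80, 20, 12]

own = owned_cost(usage, price_per_tb_month=100.0)
cloud = on_demand_cost(usage, price_per_tb_month=150.0)
print(own, cloud)
```

Even with a higher unit price, the on-demand total comes out lower here because the spike lasts only one month. With flat usage the comparison could go the other way, which is why the fluctuation matters.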

As mentioned in my last article, Amazon is the giant in this world. There are practically no start-ups in Silicon Valley that don’t use the Amazon cloud service. Amazon’s cloud storage is ideal for them, as their capacity requirements are hard to predict over time and their budgets are tight.

Amazon's customers include some large enterprises like Netflix as well as those start-ups, and it is the only cloud storage vendor whose annual revenue exceeds $100M. According to the 451 report, Amazon holds almost 50% market share, although I have no exact data at hand. Salesforce.com, Rackspace, Microsoft and HP follow.

Storage system of Amazon

Amazon’s cloud storage S3 (Simple Storage Service) is a part of Amazon Web Services (AWS). “S3” has become the de facto standard for cloud storage.

S3 uses a technology called object storage, one of three storage methods:

(1) Block Storage:

Data is cut into blocks of a fixed size and stored mechanically as 1s and 0s. It is used in SANs (Storage Area Networks), which require fast access over a very short distance.

(2) File Storage:

A collection of data is stored as a file that carries metadata such as file name and file format, in a hierarchical structure of directories or folders, much like on a PC desktop. It is used in NAS (Network Attached Storage).

(3) Object Storage:

A big chunk of data is packaged like a box, including metadata, which is called an object. Each box is given an OID (Object ID), and all objects are saved in a flat manner.

File storage is easy to understand by analogy with paper folders, but it is inefficient for several reasons. Accessing data requires following the folder structure from top to bottom, and going back to the top to move to a different folder. Metadata is located outside the folder, and concurrent operation is problematic because the name of the upper folder is shared by multiple files.

In contrast, with object storage the OID is the only key needed to access an object, much like pulling out a whole box by looking at the tag attached to it. There is no need to go up and down a hierarchy, and all the metadata is stored in the box as well.

Only one object is tied to each OID, so parallel data access is easy. This higher efficiency results in lower cost and high scalability, as long as the contents of the box are not changed.
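
The difference between the two access patterns can be sketched in a few lines of Python. This is a toy in-memory model, not the S3 API: the `ObjectStore` class, its `put` and `get` methods and the use of a random UUID as the OID are all assumptions made for illustration.

```python
import uuid
from typing import Any, Dict, List, Tuple


class ObjectStore:
    """Flat namespace: each OID maps to one object (data plus its metadata)."""

    def __init__(self) -> None:
        self._objects: Dict[str, Tuple[bytes, Dict[str, Any]]] = {}

    def put(self, data: bytes, metadata: Dict[str, Any]) -> str:
        # The metadata travels inside the "box" together with the data
        oid = uuid.uuid4().hex
        self._objects[oid] = (data, dict(metadata))
        return oid

    def get(self, oid: str) -> Tuple[bytes, Dict[str, Any]]:
        # One key lookup, no hierarchy to walk
        return self._objects[oid]


def read_from_tree(tree: Dict[str, Any], path: List[str]) -> Any:
    """Hierarchical lookup: walk the folder structure from the top, level by level."""
    node: Any = tree
    for name in path:
        node = node[name]
    return node


# Object storage: store once, fetch by OID
store = ObjectStore()
oid = store.put(b"frame data", {"content_type": "video/mp4", "title": "clip"})
data, meta = store.get(oid)

# File storage: the same bytes reached through nested folders
tree = {"home": {"videos": {"clip.mp4": b"frame data"}}}
same = read_from_tree(tree, ["home", "videos", "clip.mp4"])
print(data == same, meta["content_type"])
```

In the flat model, every read is a single lookup regardless of how many objects exist, while the tree lookup has to pass through each folder on the path.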

With these characteristics, object storage is the preferred method for cloud storage, which must store massive amounts of static data such as images, videos and e-mails, and where cost efficiency and the ability to scale out are very important.

Challengers

Not many players challenge Amazon's dominance at the moment. In the US, some companies such as Microsoft and HP serve their existing enterprise customers, a slightly different customer base. Google is sometimes mentioned as a direct competitor to Amazon, but its target is small and medium-sized customers and its market share is still small. In Europe, LunaCloud has emerged as an Amazon-style competitor.

In Japan, Nifty Cloud and Yahoo! Cloud have been providing similar services, and recently NTT Communications entered this field. Please see below for more details.

Shouldn't Apple forget about map and TV and worry about music?

I just wrote a Japanese column on Nikkei Business Online about Apple's map app problem.  In a nutshell, I wrote that the trouble was caused by their lack of expertise in "cloud" and "big data".  It is not due to the absence of Steve Jobs - they already have a bad track record with MobileMe and Ping - the latter was shut down yesterday.

There have been a lot of rumors about Apple's entry into TV, and I imagine they are capable of producing a beautiful TV set - or rather, in my world, the iPad already is one.  But if they try to eliminate Netflix from the equation and do it on their own, I suspect the same "map"-type problem would occur in streaming.  They pioneered video distribution on the iTunes Store, but since that breakthrough the iTunes service has not improved much.  Netflix, on the other hand, is working SO HARD behind the scenes to polish its big-data-based recommendation skills, and I believe that is the heart of its success.

Apple is working very hard to catch up, aggressively hiring cloud/big-data engineers.  But it will take years to accumulate the data and the expertise to turn it into viable products.

So if they have to work this hard in this area, I wonder why they don't start from their roots and strength, which is music.  They already have so much data on the music purchases of their huge number of registered users.  Reports say that they are delaying a "Pandora"-type streaming music service because of a rights negotiation problem with Sony, but even without that, I wonder whether they can provide a good enough interface, given that they are still on the learning curve in cloud and big data.

Era of Data Explosion and Big Data

Big Data and Cloud Storage Vol. 1:  Trend #1

I am starting a series on "Big Data and Cloud Storage" on this blog, sponsored by Cloudian, which provides cloud storage software.  The first installment covers the historical trend of data explosion and the need for cloud storage.

  • Analog Data and Digital Data

Humankind has accumulated a dazzling pile of analog data over thousands of years of history. From Buddhist scriptures and printed Gutenberg Bibles to the enormous volume of modern-day books, photos, music and videos, it continues to accumulate day by day.

Digital data is not far behind. Now that major media and personal communications have all turned digital, digital data is bursting at the seams.

So here is a question. Which do you think is the bigger data, analog or digital?

The answer depends on “when”. The consulting firm McKinsey published a report on “big data” in May 2011, which includes an estimate of the digital share of all accumulated data. In 2000, analog data, with thousands of years behind it, accounted for 75% of the total. By 2007, however, digital overwhelmed analog with a 94% share, having overtaken it in a mere 7 years.

Digital technology took off in the 80’s with the invention of the personal computer, and by the time of the Net Bubble in the 90’s most media, such as mail, photos, music and video, were available in digital format. Yet in 2000 we still had far more analog data, and then in less than 10 years digital exploded. How did that happen?

  • Bubble Burst and User Generated Contents

In the 1990’s, e-mail emerged as an alternative to snail mail. Then came e-commerce as an alternative to catalogs, and news portals as an alternative to newspapers and magazines. Back then, transmission speed and technology were still limited, so a relatively small number of providers produced catalogs and articles and delivered this content to users over the Net in a one-way manner.

Throughout the bubble period, Internet infrastructure was built on a huge scale, but after the bubble burst in 2000, demand suddenly shrank and the prices of over-supplied fiber optics and data centers plummeted.

Sometime later, a new type of Internet company, such as Google and Facebook, rose from the ashes of the bubble. This new species of Web industry was later named “Web 2.0”. They provided an “interactive” flow of information on the Net, created platforms for “user generated content” and revolutionized the Net business. They are not alternatives to something else, but are unique to Web technology and have a totally different cost structure. People started to share their thoughts and photos on blogs, and videos on YouTube. And all this user generated content has been published and accumulated on the Internet.

  • Data gathers on “cloud” and becomes “brain”

Google’s then-CEO Eric Schmidt used the phrase “cloud computing” in a speech in 2006, popularizing the term “cloud”. Cloud computing means a system that keeps data and applications on the Internet rather than on a desktop computer. The term comes from the cloud shape used to represent the Internet on network diagrams. The idea had been advocated before, but around this time it finally became reality, as the network environment caught up thanks to broadband penetration.

As data has turned from analog to digital and been published on the cloud, we can now easily gather many different kinds of data in the cloud, sort it and extract meaning from it. Starting off as monads of individual computers in the 80’s, they were connected by the nerves of the Internet in the 90’s to form an earthworm, and in the 2000’s evolved into a human brain.

And this highly intelligent brain activity on the Internet is called “big data”. The more information is stored, the better the brain works, and the better it works, the more interesting it becomes to learn new things, so the brain autonomously sucks in more and more new data.

In summary, the digital data explosion and the subsequent trend towards big data were triggered by the Web industry’s move into the “cloud”.

"Big Data" series started for Cloudian

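I have started writing a "Big Data and Cloud Storage" series for Cloudian, a new series of assorted talk on the topic, in Japanese. It can be read on both ZDNet Japan and the Cloudian website. An English version is coming soon.

"Big Data and Cloud Storage" series, Part 1 - Topics - ZDNet Japan
What is BIGDATA?: "Big data" is the advanced intellectual activity of the brain on the Net

This series is not a regular article. It is something like a "white paper," written at Cloudian's request as part of its PR. Cloudian is a company that provides software for cloud storage. Please see below for details.

Cloudian

The aim is not direct product promotion but to get more people interested in the trend of "enterprise use of big data, and cloud storage for that purpose," and so to broaden the market that Cloudian's products target, so I want to make it as enjoyable to read as possible. Although it is called a series, it is planned as a short one of about six months, updated once every two weeks. Thank you for your support.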