
Hadoop

IBMHadoop

IBM

Hadoop


Paul C. ZikopoulosMBA IBM Software Group
World Wide Database Competitive Big Data
SWAT Paul 18
Paul 320 14
DB2 pureScaleRisk Free Agile ScalingMcGraw-Hill
2010 Break Free with DB2 9.7 A Tour of Cost-Slashing New
Features McGraw-Hill 2010 Information on Demand
Introduction to DB2 9.5 New FeaturesMcGraw- Hill2007 DB2
Fundamentals Certification for Dummies For Dummies 2001
DB2 for Windows for DummiesFor Dummies2001 Paul
DB2 DRDA ClustersBI
DBA Chachi

Chloë paulz_ibm@msn.com
@BigData_paul

Chris EatonIBM
Chris LinuxUNIX Windows
DB2 19 ,
Chris DB2
The High
Availability Guide to DB2IBM Press2004 IBM DB2 9 New
FeaturesMcGraw-Hill2007 Break Free with DB2 9.7A Tour
of Cost-Slashing New FeaturesMcGraw-Hill2010 Chris
IT
Toolbox DB2 http://it.toolbox.com/blogs/db2luw
Dirk deRoosIBM IBM
Dirk 11 IBM Toronto DB2 Development
Dirk New Brunswick

Thomas DeutschMBAIBM Tom


Apache Hadoop
IBM Research IBM Software Group
Hadoop IBM Research
Big Data Tom CTO
Tom
IBM Enterprise Mashups Tom
FileNet IBM FileNet FileNet Content

Management FileNet IBM Lotus


InfoSphere 20
Tom
Tom Fordham Maryland MBA

George LapisMS CSIBM


30 IBM Almaden
R* Starburst
DB2
10 George
DB2 SQL/XML XQuery George
Optim Database IBM
George IBMs InfoSphere BigInsights

DB2 DBA Hadoop


Steven SitMSIBM
IBM Steven
IBM
17StevenIBM
Steven Western
Ontario Syracuse


Hadoop

Paul C. Zikopoulos
Chris Eaton
Dirk deRoos
Thomas Deutsch
George Lapis

McGraw-Hill
bulksales@mcgraw-hill.com
Hadoop
2012 by The McGraw-Hill Companies
1976

McGraw-Hill

IBM InfoSphere Streams InfoSphere BigInsights

IBM
IBM
1234567890
ISBN
MHID

DOC DOC

10987654321

978-0-07-179053-6
0-07-179053-5

Paul Carlstroem

Patty Mon

Sheena Uprety,
Cenveo Publisher
Services

Lisa Theobald

Paul Tyler

Cenveo Publisher
Services

Jeff Weeks

George Anderson

Cenveo Publisher
Services

Stephanie Evans
McGraw-Hill McGraw-Hill
McGraw-Hill

IBM 18

Chloe

2011 8 12 100 IBM


38
10 14 1/4 1
1/2 5
IBM 18 IBM 18

IBM

IBM
Martin WildbergerBob
PicianoDale Rebhorn Alyse Passarelli IBM

Grace Madeleine Zikopoulos Chloë Alyse


Zikopoulos
Paul Zikopoulos

Teresa

Riley Sophia


10
Chris Eaton

SandraErik Anna
Paul
Dirk deRoos

Lauren William

Anant Jhingran
Thomas Deutsch
IBM

George Lapis

IBM
Paul
Amy Tiffany
Ronald

Steven Sit

15

3 IBM

35
II

4 Hadoop

51

5 InfoSphere BigInsights

81

6 IBM InfoSphere Streams

123

xv

xxi

xxiii

Hadoop

12

15

15

17

IT IT

18

20

24

26

29

31

IBM

35

37

39

IBM 1

40

40

49

xii

II

Hadoop

53

Hadoop

54

Hadoop

55

Hadoop56
MapReduce

60

Hadoop63
Hadoop64
Pig and PigLatin
Hive
Jaql
Hadoop73
73
74
Hadoop76
ZooKeeper
HBase
Oozie
Lucene
Avro
80

InfoSphere BigInsights

81

82
BigInsights1.2Hadoop84
HadoopGPFSSNC85
HadoopGPFSGPFS86
GPFSSNC88
GPFSSNC91
GPFSSNCPOSIX92
GPFSSNC94
GPFSSNCHadoop95

Contents

95

96

xiii

97

99

102

103

Netezza
DB2 for Linux, UNIX, and Windows
JDBC Module
InfoSphere Streams
InfoSphere DataStage
R Statistical Analysis Applications

MapReduce

106

107

BigSheets

BigInsights

112
118

109

118

121

IBMInfoSphereStreams
InfoSphereStreams

124

InfoSphereStreams
InfoSphereStreams

123
125

129

130

Streams Processing Language

131

133

134

137
.

138

139

140

141


Rob Thomas

TomTom
Chris 20
Chris Tom
10
Tom 20 Chris
1.25 /

*****

80%
15

5

xv

xvi

10

1.25 /

Rob Thomas
IBM

Foreword

xvii

Anjul Bhambhri
70 System R
System R
SQL DB2 Oracle SQL/DS
ALLBASE Non-Stop SQL

90 IT

ERP
SCM

xviii

90
IBM
Garlic 2001 XML DB2 pureXML
XML XML

IBM
2011 50 IBM
IBM 30
IBM DB2
InformixSolid DB

Netezza Smart Analytics System


SPSS Cognos

xix

PaulGeorgeTom Dirk

Anjul Bhambhri
IBM


Shivakumar
(Shiv) Vaithyanathan Roger Rea Robert Uleman James R. Giles
Kevin Foster Ari Valtanen Asha Marsh Nagui Halim Tina Chen
Cindy SaraccoVijay R. BommireddipalliStewart TateGary Robinson
Rafael Coss Anshul Dawra Andrey Balmin Manny Corniel Richard
HaleBruce BrownMike BruleJing Wei LiuAtsushi TsuchiyaMark
Samson Douglas McGarrie Wolfgang Nimfuehr Richard Hennessy
Daniel Dubriwny
IBM
Rob Thomas Anjul Bhambhri

(DE) Steve Brodsky BigInsights


(STSM)Shankar Venkataraman Bert Van der
LindenIBM

Steven Sit

Susan Visser Linda Currie

Sheena UpretyPatty
MonPaul Tyler Lisa Theobald
McGraw-Hill Paul Carlstroem

xxii

Linda Snow Wendy


Lucas

xxiv

IBM

IBM Hadoop
IBM Hadoop
Apache Hadoop BigInsights Hadoop

IBM

IBM Hadoop

Hadoop IBM
IBM

(ROI) IBM
IBM Hadoop

IBM

IBM

xxv

IBM

100 300
Airbus
10
10
40
300 (RFID)
[]

20

xxvi

IBM
IBM

Pyotr Smirnov
Smirnov

xxvii


I 3
1 3
Twitter Facebook
IBM
3
3 VV3

IBM 30

XM

Facebook V3
V3
ID

xxviii

IT

3 IBM
IBM

Hadoop

Claude MonetIBM
IBM
IBM IBM
Hadoop
IBM
BigInsights Hadoop
Hadoop Java
BigInsights
Hadoop

IBM
IBM

xxix

IBM
Think Watson Jeopardy!
IBM
247
IBM SPSSCognosSmart Analytics
SystemsNetezza 5 IBM
140

IBM IBM
Eclipse (UIMA)
Apache DerbyLuceneXQuerySQL Xerces XML
(IDE)
IBM Hadoop Jaql 4
IBM Hadoop IBM
Hadoop Hadoop
FacebookLinkedIn Hadoop
Hadoop IBM Hadoop

II 4
Hadoop
Apache
Hadoop Pig Hive HDFS
MapReduce ZooKeeper

xxx

5
IBM
IBM
InfoSphere BigInsights (BigInsights) IBM Hadoop
3 IBM
IBM IBM General Parallel
File System (GPFS) GPFS (SNC)
Hadoop IBM BigInsights
Java
Hadoop
GLP
Hadoop

xxxi

6
6 IBM InfoSphere Streams (Streams)
Streams

Streams Streams
Streams
BigInsights Hadoop
Streams

IBM

WebSphere

Blackberry AppWorld
Apple AppStore

5
IBM

100

IBM

IBM

(machine-to-machine, M2M)
(YoY)

GPS

IBM

IBM

3 1-1
IBM

IBM

2000 800,000 PB

BigInsights 2020 35
ZB Twitter 7 TB Facebook 10 TB
TB

Big
Data

TB

ZB

1-1 IBM V 3

Variety
Velocity
Volume

PB

iTunes

2007
I35W 200

TB 10
1 TB

1-2

1-2

Data Available
Percent of data an

TB PB ZB

Web

20%
80%
Twitter
JSON

PB TB RFID

IBM

GPS

IBM
Hadoop
IBM

IBM

Hadoop

Hadoop
Hadoop
2

10

2002 Sarbanes-Oxley (SOX)


CEO CFO 302

Hadoop

Hadoop

IT

Hadoop
TweetFacebook
Hadoop
IT

CIO

11

CIO (CAPEX) (OPEX) 4


4

(cost per compute)

Hadoop

Hadoop IBM

12

/
IBM
InfoSphere BigInsights Hadoop

Hadoop

30 mg/kg (30 ppm)

13

IBM

IBM

IBM IBM InfoSphere BigInsights


(BigInsights) IBM InfoSphere Streams (Streams)

1
(V3)

16

17

IBM

IBM BigInsights Hadoop

18

IT IT
IT
(data exhaust)

IT

DB2
BigInsights

GB

IT

IT

19

IT IT

IBM
(FSS)
IT IT

IT IT
IT
IT IT
(SOA)

20

SOA
20
IT

1TB 5

21

20%
2-1
80%
CIO CAPEX OPEX

BigInsights
80%

- 2-2

22

Mashup

SOA Web

ODS

+++

ERPCRM
2-1 20%
Data Quality/Governance/

2-2

InfoSphere StreamsDB2
IBM

2-2

3
2

23

Mashup

ODS

SOA Web

InfoSphere
BigInsights

+++

ERPCRM

2-2
Data Quality/Governance/

50%
80%
BigInsights


InfoSphere Streams 2-2
-
Streams

24

(FBI)
600

IBM
Cognos Consumer Insight (CCI)
BigInsights CCI

25

SAPDB2TeradataOracle

Facebook
Facebook

Twitter tweet (Ttps)


Super Bowl 2011 2011 2 4064 Ttps
Twitter Ttps 5106 Ttps
6939 Ttps Twitter

26

7166 Ttps

7196 Ttps Beyonce


Twitter 8868 Ttps

Lady Gaga (@ladygaga) Tweeter

(CSR)

CSR

27

(Streams)
(BigInsights)
Streams
BigInsights
//
Streams
CSR

CSR
CSR

70%

2%

CSR

28

BigInsights

CSR
CSR

Watson
(BigInsights)

Streams

29

Streams
BigInsights BigInsights
Watson
Streams

BigInsights

2008

15-20%

30

80%
CAPEX OPEX

31

20,000 40,000

10% 5%

90%

Streams

BigInsights Vestas
Vestas

32

5 65
43,000 Vestas
IBM BigInsights

20 30

Vestas 2.6 PB (2600 TB)

6 PB (6000 TB)
Vestas

Vestas
Hadoop

Vestas IBM
Hadoop
2 IBM
Hadoop

33

IBM System x InfoSphere BigInsightsVestas

Wind and Site


Competency CenterVestas
1

300
Vestas
PB

IBM IBM

3
IBM

Hadoop

Hadoop MapReduce

- -
IBM
35

36

Understanding Big Data

Google 2007 10 MapReduce


IBM

IT
IT

IT
Hadoop

IBM

IBM
Hadoop

IBM
Hadoop
IBM

BigInsights
IBM
IBM Hadoop

IBM IBM

Why IBM for Big Data?

37

247

IBM BigInsights Cognos Consumer Insight

IBM Hadoop
BigInsights IBM
5

CIO

IT

Hadoop

Hadoop TwitterFacebook Yahoo
500


IT
(RYO)
Hadoop
(SLA) IT

(MTTR)
(RPO)
Hadoop

Hadoop

IBM IBM
IBM Hadoop

Hadoop
Hadoop
(OLTP)

Hadoop SLA

Hadoop
SLA

Hadoop


IBM

IBM Data Server


Hadoop
IBM

IBM

IBM
BigInsights 200 IBM
5 IBM
General Parallel File System Shared Nothing Cluster (GPFS-SNC)
SC10 Storage ChallengeSC10


IBM 1
IBM Hadoop 2011 5
1
Hadoop

IT
IBM SPSS IBM
Cognos Unica CoreMetrics
Netezza IBM Smart Analytics System IBM
(BAO) IBM

InfoSphere Streams
Streams (BigInsights)
IBM
IBM
IBM
5 IBM 24
140 IBM Research 8000 IBM
200
Hadoop
Apache Hadoop

IBM


IBM

IBM
IBM

IBM
Fortran, DRAM, ATM, UPC, RISC,
PC, SQL, XQuery
www.ibm.com/ibm100/
IBM IBM

1956
IBM Random Access
Method of Accounting and Control
RAMAC 50 2

IBM 2000
10,000 1997 10
IBM
PB

IBM

1970
IBM Ted Codd


IBM
DB2, Informix, Netezza, Oracle,
Sybase, SQL Server

IBM
1971
IBM
5000
IBM ViaVoice 64,000
260,000 1997 ViaVoice

VoiceType 2
IBM

1980: RISC
IBM RISC
IBM John Cocke 20 70 RISC

RISC

(HPC)
Watson
Jeopardy
1988: NSFNET
IBM National Science Foundation (NSF) MCI Merit


NSFNET 200 6
NSFNET Internet
Internet NSFNET Internet
56kb/s 1.5Mb/s 45Mb/s
Internet Internet 1995 4
Internet 93
5 Internet
Internet

NSFNET
1993
IBM

DB2 Database Partitioning Facility (DB2 DPF)


(MPP) IBM
Smart Analytics System
Hadoop II
MapReduce
DB2 DPF

1996
1996 IBM

IBM Research Lloyd Treinish


10
IBM
(NOAA) 1995

2 Vestas IBM

1997
32 IBM RS/6000 SP Deep Blue
Garry Kasparov
Watson
Watson
Deep Blue
Hadoop

2000: Linux
2000 IBM Linux Linux
IBM 10 Linux
IBM
IBM CEO CIO
Linux Linux


Hadoop Hadoop

2004
Blue Gene IBM
PFLOPS 2004 9
IBM Blue Gene PFLOPS
IBM Blue Gene
Blue Gene
2009 Blue Gene
Barack Obama IBM National
Medal of Technology and Innovation
2009
IBM

IBM
IBM

2009
IBM Streams

IBM Software Group InfoSphere Streams IBM


6


IBM

Streams
Streams 500,000 CDR
60 CDR 4 PB
2009
IBM Enterprise CloudIBM
2010
2000 IBM
IBM
1500

IT
IBM
IT

2010 GPFS
SNC
1998 IBM General Parallel File System (GPFS)
POSIX (SAN)
DB2 pureScale Oracle RAC
GPFSGPFS

GPFS


GPFS GPFS-SNC SC10 Storage


Challenge 2010SC10 2010 PB
EB
Hadoop Hadoop

2011: Watson
IBM Watson -Question-AnsweringQA
Watson

Watson

2011 2 Watson
Jeopardy!
Ken Jennings Brad Rutter
Decision Augmentation
Watson Hadoop
BigInsights
Hadoop

IBM ResearchInfoSphere BigInsights

IBM Research IBM


Streams IBM IBM Research
BigInsights IBM Software Group
(SWG) IBM Research Advanced Text Analytics Toolkit
SystemT Intelligent Scheduler


Hadoop Hadoop
FLEX BigInsights
GPFS-SNC 12 Adaptive
MapReduce Machine Learning Toolkit
System ML BigInsights

II

IBM Research Hadoop


BigInsights IBM Research
IBM Research

IBM
Hadoop Apache IBM IBM
Apache

IBM DB2 pureScale Tivoli System Automation


GPFS HACMP DB2 pureScale

IBM
IBM Mashup
Information Management Lotus IBM Enterprise
Content ManagementCognosWebSphere Tivoli IBM
Hadoop IBM Cognos Consumer
Insight (CCI) IBM
CCI BigInsights IBM


Internet
CCI
BigInsights
BigInsights
CCI
CCI

BigInsights


IBM

2035
50%
50%

20,000 TB
10 PB

IBM Smart
Grid IBM

IBM

IBM

II

51

Hadoop

Hadoop IBM
InfoSphere BigInsights (BigInsights)
BigInsights

Hadoop
Hadoop
BigInsights Hadoop


Apache Hadoop Hadoop


BigInsights
Hadoop Hadoop

Hadoop

Hadoop
Hadoop (http://hadoop.apache.org/) Apache Software Foundation
Apache Java Hadoop

Hadoop Google Google File System (GFS)


MapReduce mapper
reducer
MapReduce IBM 2007 10 Google
MapReduce GFS Internet
Hadoop
Hadoop
Hadoop

Hadoop

Hadoop Hadoop
Hadoop
Hadoop Doug
Cutting Cutting

All About Hadoop: The Big Data Lingo Chapter

55

Cutting

Cutting
Pinky Squiggles
Hadoop (Hadoop Distributed File
System) (MapReduce)Hadoop

Hadoop Hadoop

Hadoop
Hadoop
Apache Avro, Cassandra, HBase
Chukwa
Hive SQL Mahout
Pig Hadoop
ZooKeeper

Hadoop
Hadoop Hadoop Distributed File System (HDFS)
Hadoop MapReduce Hadoop Common Hadoop
MapReduce
Hadoop


Hadoop Distributed File System

Hadoop
HDFS Hadoop

map reduce
Hadoop
MapReduce

Hadoop
(SAN) (NAS) SAN NAS Hadoop

1000 3
3000 + 1000

Hadoop
(MTTF)
Hadoop
MTTF
Hadoop
HDFS HDFS
Hadoop

A
1 B 2 Hadoop

HDFS 4-1

[Figure 4-1: HDFS block replication. Block_1, Block_2, and Block_3 each have 3 copies, spread across Rack 1 and Rack 2.]


Hadoop
Hadoop

HDFS Apache Hadoop


64 MB
NameNode

BigInsights 128 MB IBM


Hadoop


512
4 KB 32 KB Hadoop


3
Hadoop HDFS

Hadoop NameNode
NameNode HDFS
NameNode

NameNode Hadoop
(SPOF) NameNode
Hadoop
NameNode

Hadoop 0.21
BackupNode BackupNode NameNode

4-1 3
block_n block_n' block_n''

HDFS Hadoop
Hadoop MapReduce
NameNode Hadoop
Hadoop


MapReduce Hadoop
NameNode

MapReduceHDFS
NameNode
MapReduce
NameNode NameNode Hadoop
NameNode
NameNode
MapReduce
NameNode
IBM Hadoop
IBM IBM
General Parallel File System (GPFS) GPFS
SAN 2009 GPFS
GPFS-SNC Hadoop GPFS-SNC HDFS
NameNode GPFS-SNC Hadoop SPOF
GPFS-SNC Hadoop

NameNode HDFS
Portable Operating System Interface for UNIX (POSIX)

HDFS

Java IT
HDFS


BigInsights Hadoop
GPFS-SNC IEEE POSIX
APIshell UNIX AIX
Apple OSX HP-UX

MapReduce
MapReduce Hadoop
Hadoop MapReduce

Hadoop MapReduce

MapReduce Hadoop
map
/reduce map
MapReduce
reduce map
5
Hadoop


Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18

MapReduce 5 map mapper


5 mapper
mapper
(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
4 mapper 4

(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)

5 reduce reduce

(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
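The maximum-temperature example above can be traced in a few lines of ordinary Python. This is only a single-process sketch of the map and reduce logic, not Hadoop code: the function names `map_max` and `reduce_max` are illustrative assumptions, and the grouping step merely stands in for the shuffle that the framework performs between the two phases.

```python
from collections import defaultdict

# Readings seen by the first mapper (taken from the example above).
BLOCK_1 = [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 32),
           ("Toronto", 4), ("Rome", 33), ("New York", 18)]

# Intermediate results from the other four mappers (taken from the example above).
OTHER_MAPPER_OUTPUTS = [
    [("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("Whitby", 22), ("New York", 19), ("Rome", 30)],
]


def map_max(records):
    """Map step: emit the maximum temperature per city for one block of input."""
    best = {}
    for city, temp in records:
        if city not in best or temp > best[city]:
            best[city] = temp
    return list(best.items())


def reduce_max(mapper_outputs):
    """Reduce step: combine the per-mapper maxima into one maximum per city."""
    grouped = defaultdict(list)  # stands in for the shuffle: group values by key
    for output in mapper_outputs:
        for city, temp in output:
            grouped[city].append(temp)
    return {city: max(temps) for city, temps in grouped.items()}


first = map_max(BLOCK_1)
final = reduce_max([first] + OTHER_MAPPER_OUTPUTS)
print(first)
print(final)
```

Because taking a maximum is associative, the same function could be applied again at any intermediate stage without changing the final answer.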


map reduce

mapping
(reducing)

Hadoop MapReduce


Hadoop
JobTracker JobTracker NameNode

map reduce

JobTracker

Hadoop TaskTracker

JobTracker JobTracker

4-2 MapReduce reduce


4-2 map
reduce
Toronto
reduce

[Figure 4-2: MapReduce flow. Map tasks feed the Shuffle step, and Reduce tasks write the output files Part_1 and Part_2.]


reducer Toronto
reduce Shuffle
map reduce Hadoop
map
Combiner reduce 4-2
reduce
reducer
Hadoop MapReduce Java
Java Archive (jar) JobTracker Hadoop
map reduce MapReduce
Apache Hadoop Hadoop
Hello World WordCountWordCount
Java
Hadoop
BigDataUniversity.com InfoSphere BigInsights Basic
Edition (www.ibm.com/software/data/infosphere/ biginsights/basic.html)
IBM
Hadoop
BigInsights
Basic Edition IBM
InfoSphere BigInsights Enterprise Edition Hadoop

Hadoop
Hadoop Hadoop

shell
HDFS POSIX
Linux UNIX HDFS


HDFS /bin/hdfs dfs <args>


shell args

HDFS shell
cat

(stdout)

chmod

chown

copyFromLocal HDFS
copyToLocal

HDFS

cp

HDFS

expunge

HDFS
MAC Windows
HDFS

expunge

ls
mkdir
mv
rm

HDFS

HDFS rm
-skipTrash

Hadoop
Hadoop
Hadoop MapReduce API Java


MapReduce

XML IMS
3GL 4GL
Hadoop Hadoop
Hadoop

Pig Hive Jaql


ZooKeeper

Pig PigLatin
Pig Yahoo! Hadoop
mapper reducer
Pig Pig
PigLatin Hadoop
PigLatin
Java (JVM) Java
Pig
mapper
reducer Pig HDFS LOAD
mapper
reducer DUMP
STORE

LOAD
Hadoop Hadoop HDFS
Pig Pig


LOAD 'data_file' 'data_file' HDFS

Pig
USING LOAD

TRANSFORM
FILTER
JOIN, GROUP
ORDER Pig
Twitter (English) iso_language
tweet tweet

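-- Load the tweet data from HDFS.
L = LOAD 'hdfs://node/tweet_data';
-- Keep only English-language tweets.
FL = FILTER L BY iso_language_code == 'en';
-- Group the remaining tweets by author.
G = GROUP FL BY from_user;
-- Sum the retweets for each author.
RT = FOREACH G GENERATE group, SUM(FL.retweets);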

DUMP STORE
DUMP STORE Pig
Pig DUMP
DUMP STORE

DUMP

Pig Hadoop
Pig Pig
Java Grunt
Hadoop Pig
Pig
map reduce


map reduce

Hive
Pig
Facebook Hadoop
SQL
Hadoop Hive SQL
SQL Hive Query Language (HQL)
HQL HQL
Hive MapReduce Hadoop
SQL
(DBMS) Hive
Hive shell Hive
JDBC/ODBC Java Database Connectivity (JDBC) Open
Database Connectivity (ODBC) Hive
Thrift Client Hive Thrift Client

Hive C++, Java, PHP,
Python, Ruby Hive Thrift Client
SQL DB2 Informix
Hive

CREATE TABLE Tweets(from_user STRING, userid BIGINT, tweettext STRING, retweets INT)
  COMMENT 'This is the Twitter feed table'
  STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'hdfs://node/tweetdata' INTO TABLE Tweets;
SELECT from_user, SUM(retweets)
FROM Tweets
GROUP BY from_user;


Hive SQL
Hive Hadoop MapReduce
Hadoop Hive Hadoop
Hive
DB2
Hive

Jaql
Jaql JavaScript Object Notation (JSON)
JSON
IBM IBM
Jaql HDFS
Pig Hive Jaql
Lisp, SQL, XQuery, Pig; Jaql
Jaql
MapReduce
Jaql JSON
Jaql
JSON

JSON /
MapReduce Hadoop
//

JSON


JSON { string : value } [ value, value, ] value JSON
JSON Twitter JSON
tweet
results: [
  {
    created_at: "Thurs, 14 Jul 2011 09:47:45 +0000",
    from_user: "eatonchris",
    geo: {
      coordinates: [
        43.866667,
        78.933333
      ],
      type: "Point"
    },
    iso_language_code: "en",
    text: "Reliance Life Insurance migrates from #Oracle to #DB2 and cuts costs in half. Read what they say about their migration http://bit.ly/pP7vaT",
    retweet: 3,
    to_user_id: null,
    to_user_id_str: null
  }
]
Jaql JSON
JSON Jaql Jaql
XML, CSV
Jaql JSON
Jaql Pig SQL

Jaql
Jaql Jaql

Twitter
FILTER FILTER
SQL FILTER
WHERE


Twitter eatonchris

filter $.from_user == "eatonchris"


tweet
filter $.retweet > 2
TRANSFORM TRANSFORM
SQL
SELECT N1
N2TRANSFORM
transform { sum: $.N1 + $.N2 }
GROUP GROUP SQL GROUP BY
tweet

group into count($)


tweet
Jaql
group by u = $.from_user into { total: sum($.retweet) };
JOIN JOIN WHERE
SQL
tweet JSON tweet Tweeter

following = [ { from_user: "eatonchris" },
              { from_user: "paulzikopoulos" } ];


JOIN Twitter feed Twitter


following tweet
join feed, following
where feed.from_user == following.from_user into {feed.*};
EXPAND EXPAND

geolocations = [[93.456, 123.222],[21.324, 90.456]]


geolocations -> expand;

[93.456, 123.222, 21.324, 90.456]
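As a rough Python analogy for the EXPAND step above (not Jaql itself), flattening one level of nesting gives the same result. The helper name `expand` is an illustrative assumption.

```python
from itertools import chain


def expand(nested):
    """Flatten exactly one level of nesting, mirroring the expand step above."""
    return list(chain.from_iterable(nested))


geolocations = [[93.456, 123.222], [21.324, 90.456]]
print(expand(geolocations))
```

Only one level is removed, so an array nested two levels deep would still contain arrays after a single pass.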


SORT SORT
Jaql
sort by desc Jaql
TOP TOP n n TOP
<integer>

Jaql
Jaql
HDFS
100
Jaql

Jaql
MapReduce Jaql


->Jaql SQL SQL


SELECT list Jaql

Jaql Jaql
tweet
$tweets = read(hdfs("tweet_log"));
$tweets
-> filter $.iso_language_code == "en"
-> group by u = $.from_user
   into { user: u, total: sum($.retweet) };
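For readers more at home in a general-purpose language, the filter-then-group pipeline above can be approximated in plain Python. This is a sketch of the logic only: the function name and the sample records are hypothetical, and nothing here runs on Hadoop or reads from HDFS.

```python
from collections import defaultdict


def total_retweets_by_user(tweets, language="en"):
    """Keep tweets in one language, then sum the retweet counts per author."""
    totals = defaultdict(int)
    for tweet in tweets:
        if tweet.get("iso_language_code") == language:      # the filter step
            totals[tweet["from_user"]] += tweet.get("retweet", 0)  # group + sum
    return [{"user": user, "total": total} for user, total in sorted(totals.items())]


# Hypothetical sample records using the field names from the tweet shown earlier.
sample = [
    {"from_user": "eatonchris", "iso_language_code": "en", "retweet": 3},
    {"from_user": "eatonchris", "iso_language_code": "en", "retweet": 2},
    {"from_user": "paulzikopoulos", "iso_language_code": "en", "retweet": 4},
    {"from_user": "someone", "iso_language_code": "fr", "retweet": 9},
]
print(total_retweets_by_user(sample))
```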

HDFS $tweetsJaql
$tweets FILTER
iso_language_code = en tweet GROUP BY
tweet
Jaql map reduce
Hadoop
Jaql JSON Jaql
Jaql
XML, CSV
Jaql
Java, JavaScript,
Python, Perl, Ruby

Hadoop Streaming
Java map reduce
Hadoop Streaming StreamingAPI Streaming
UNIX


Hadoop

Streaming Python Ruby

Streaming map reduce Python

hadoop jar contrib/streaming/hadoop-streaming.jar \
  -input input/dataset.txt \
  -output output \
  -mapper text_processor_map.py \
  -reducer text_processor_reduce.py
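The contents of text_processor_map.py and text_processor_reduce.py are not shown in the source, so the following is only a guess at what such scripts might look like, using word counting as a stand-in task. The sketch relies on the general Streaming contract: each script reads lines from standard input and writes tab-separated key/value lines to standard output, and the reducer receives its input sorted by key. Both functions are placed in one file here so the flow can be simulated locally.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit 'word<TAB>1' for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer logic: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Local simulation of map -> sort (standing in for the shuffle) -> reduce.
    sample = ["big data big insights", "big data"]
    for out in reduce_lines(sorted(map_lines(sample))):
        print(out)
```

In a real job each function would live in its own script, iterate over `sys.stdin`, and print its results, with the framework doing the sorting between the two.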

Hadoop
HDFS POSIX

HDFS HDFS
HDFS API GPFS-SNC
GPFS-SNC
GPFS-SNC Hadoop
HDFS
FlumeHadoop

Hadoop API
shell HDFS
HDFS copyFromLocal


HDFS copyToLocal

hdfs dfs -copyFromLocal /user/dir/file hdfs://s1.n1.com/dir/hdfsfile

hdfs dfs -copyToLocal hdfs://s1.n1.com/dir/hdfsfile /user/dir/file

HDFS shell Java


shell Java API HDFS Java
API

hadoop fs Hadoop shell HDFS


Java
HDFS C++ API
Thrift Java API
Java HDFS org.apache.hadoop.fs
MapReduce HDFS HDFS
HDFS
HDFS
APPEND GPFS-SNC Hadoop

Flume
(flume)
Flume Apache
Hadoop
Flume source, decorator, sink
source Flume
sink Flume


decorator

Flume TCP
(stdin)

Web
TAIL
exit

Flume sink
Collector Tier Event
HDFS
Flume Agent Tier Event

Flume
Basic
HDFS bucket
Hadoop
Flume

IBM Information Server


IBM Information Server Hadoop


Hadoop
Hadoop Hadoop
Apache
ZooKeeper, HBase, Oozie, Lucene
4 Hadoop
InfoSphere BigInsights

ZooKeeper
ZooKeeper Apache
ZooKeeper

500 Hadoop
10
Hadoop
ZooKeeper
ZooKeeper
Java C
ZooKeeper ZooKeeper

ZooKeeper
ZooKeeper
ZooKeeper

Hadoop ZooKeeper
ZooKeeper


ZooKeeper znode
ZooKeeper znode
znode ZooKeeper
znode znode

ZooKeeper znode
znode

HBase
HBase HDFS

HBase SQL HBase
HBase Java MapReduce
HBase Avro, REST, Thrift
Avro
Google
HBase
HBase
HBase

timestamp
servernameHBase
(column family)
HBase


HDFS NameNode MapReduce JobTracker


TaskTracker HBase HBase master
node region server
HDFS NameNode
BigInsights
HBase

Oozie
MapReduce
Oozie

Oozie

Oozie Directed Acyclic Graph

DAG (Acyclic)

DAG (action nodes) (dependency nodes)


MapReduce Pig
Java

4-3 Oozie
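To make the idea of a workflow as a directed acyclic graph concrete, here is a small Python sketch that orders actions so that each runs only after the actions it depends on. It does not use Oozie or its workflow definition format. The action names are borrowed from the labels in Figure 4-3, but the dependencies between them are invented for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependencies: each action maps to the actions that must finish first.
workflow = {
    "MR1": set(),
    "Pig": {"MR1"},
    "Java": {"MR1"},
    "MR2": {"Pig"},
    "MR3": {"MR2", "Java"},
}


def execution_order(deps):
    """Return one valid order in which the actions could run."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """A workflow is only valid if its dependencies contain no loop."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False


order = execution_order(workflow)
print(order)
print(is_acyclic({"A": {"B"}, "B": {"A"}}))  # a loop cannot be scheduled
```

The "acyclic" requirement is what guarantees such an order exists: if two actions each waited on the other, neither could ever start.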

Lucene
Lucene Apache
Lucene Hadoop

[Figure 4-3: Oozie workflow (nodes shown: MR1, Pig, Java, MR2, MR3, HDFS).]

2005 Apache Lucene Java


Lucene C++
PythonPerl Internet Lucene

Lucene
Lucene
Lucene
Lucene
BigInsights Jaql
Jaql Lucene
BigInsights
BigInsights
MapReduce

BigInsights Hadoop
HDFS Lucene
Hadoop GPFS-SNC


Avro
Avro Apache Avro

Avro

JSON Avro
Jaql JSON
Avro API


STRING, INT[eger], LONG, FLOAT, DOUBLE, BYTE, NULL,
BOOLEAN; record, array,
enum, map, union,
fixed
C, C++, C#, Java, Python, Ruby, PHP Avro
API Hadoop

Hadoop
IBM InfoSphere
BigInsights
IBM Hadoop
IBM

5
InfoSphere BigInsights

Hadoop
Hadoop
Hadoop Apache
Hadoop 2006
Hadoop
1.0
Hadoop
Hadoop
4 Hadoop
(HDFS) NameNode
(SPOF) 0.21
NameNode Hadoop
NameNode
Hadoop
Hadoop


MapReduce
Hadoop
Java MapReduce
Pig Jaql MapReduce
Hadoop
Hadoop

IBM InfoSphere BigInsights (BigInsights)


IBM

Hadoop

IBM Hadoop
IBM Hadoop
IBM

Hadoop

BigInsights IBM
IBM Hadoop
BigInsights

InfoSphere BigInsights: Analytics for Big Data at Rest

83

BigInsights
Apache Hadoop
Hadoop BigInsights
BigInsights Hadoop

Hadoop

(RYO)
Hadoop

Hadoop
BigInsights
Hadoop Apache Hadoop

1. Hadoop

2. Hadoop
3. Hadoop

4. Hadoop Shell (SSH)

5. Hadoop I/O
JobTracker TaskTracker
6. HDFS NameNode
NameNode
7. HADOOP_CLASSPATH
HADOOP_PID_DIRHADOOP_HEAPSIZE JAVA_HOME


Hadoop

Hadoop RYO
Hadoop Hadoop
MapReduce

Hadoop
BigInsights

Pig, Hive, Flume Hadoop

BigInsights BigInsights
BigInsights
Hadoop

IBM
BigInsights

BigInsights 1.2 Hadoop


BigInsights Apache Hadoop
IBM
BigInsights 1.2

Component                   Version
Hadoop (HDFS, MapReduce)    0.20.2
Jaql                        0.5.2
Pig                         0.7
Flume                       0.9.1
Hive                        0.5
Lucene                      3.1.0
ZooKeeper                   3.2.2
Avro                        1.5.1
HBase                       0.20.6
Oozie                       2.2.2

BigInsights IBM

BigInsights Hadoop
IBM

Hadoop IBM

BigInsights
Hadoop BigInsights

Hadoop
GPFS-SNC
General Parallel File System (GPFS) IBM Research 90
(HPC) 1998 GPFS
Blue Gene
WatsonJeopardy! ASC Purple ASC Purple
GPFS 120 GB/
HPC GPFS
GPFS DB2 pureScale Oracle RAC


GPFS Web

GPFS

Hadoop HDFS
HDFS
Hadoop

GPFS

Hadoop GPFS
GPFS
GPFS (SAN) Hadoop
SAN Hadoop
MapReduce
SAN
I/O

2009 IBM GPFS GPFS-SNC


Hadoop IBM GPFS Hadoop
Hadoop

Hadoop

GPFS-SNC
Hadoop JobTracker

GPFS 256 KB Hadoop


BigInsights 128
MB GPFS-SNC GPFS


map Hadoop

HDFS
Hadoop Hadoop
HDFS Lucene GPFS-SNC
Lucene
Lucene 256 KB GPFS
Hadoop

GPFS-SNC

HDFS

HDFS

HDFS

GPFS-SNC
GPFS-SNC


GPFS-SNC

HDFS NameNode

GPFS IT
GPFS-SNC GPFS Hadoop
GPFS-SNC
GPFS-SNC (HSM)

HDFS
GPFS-SNC 2010 Supercom

GPFS-SNC
GPFS-SNC

5-1 GPFS-SNC
HDFS 5-1 NameNode
NameNode GPFS-SNC
HDFS
5-1
CPURAM [Quorum]

[Figure 5-1: GPFS-SNC cluster (node labels shown: NSD, Q, P, S, CM, MN, FSM).]

GPFS-SNC

GPFS-SNC (NSD)
GPFS-SNC
NSD NSD

(Q) GPFS-SNC

7
GPFS-SNC

GPFS-SNC Cluster Manager (CM)


Cluster Manager


Cluster
Manager
GPFS-SNC (P)

(S)

GPFS-SNC SPOF

GPFS-SNC (FSM)
Cluster Manager

CPU
GPFS-SNC
5-1 Metanode (MN)GPFS-SNC
Metanode
Metanode
Metanode
Metanode Metanode

GPFS-SNC


GPFS-SNC
GPFS-SNC BigInsights
GPFS-SNC
Cluster Manager
BigInsights GPFS-SNC

GPFS-SNC
GPFS-SNC HDFSHadoop MapReduce
Hadoop GPFS-SNC HDFS
TaskTracker JobTracker MapReduce
Hadoop
SPOF JobTracker
HDFS
NameNode

NameNode
GPFS-SNC
NameNode HDFS
GPFS-SNC

Cluster Manager

Cluster Manager
Cluster Manager



Cluster Manager


Cluster Manager
Cluster Manager
Cluster Manager

GPFS-SNC POSIX
GPFS-SNC HDFS GPFS-SNC
HDFS HDFS
HDFS POSIX
GPFS-SNC POSIX Hadoop

GPFS-SNC

GPFS-SNC HDFS
Hadoop
HDFS Hadoop shell
Hadoop


IT
HDFS
Hadoop shell
BigInsights GPFS-SNC POSIX Hadoop
IT
Hadoop
GPFS-SNC
(PiT)

GPFS-SNC Hadoop
HDFS Hadoop
HDFS MapReduce

Hadoop GPFS-SNC


Hadoop

/
GPFS-SNC POSIX
MapReduce
GPFS-SNC
Hadoop HDFS
HDFS BigIndex
Lucene
GPFS-SNC
HDFSLucene HDFS
Lucene


GPFS-SNC HDFS GPFS-SNC


POSIX
(ACL) HDFS
Hadoop 0.20 HDFS

Hadoop 0.21 0.22 HDFS

GPFS-SNC

GPFS-SNC
GPFS
GPFS-SNC
HDFS
GPFS-SNC
stripe and mirror everything (SAME)

HDFS
HDFS

GPFS-SNC GPFS-SNC

HDFS NameNode

GPFS-SNC
HDFS Hadoop
Pig Jaql

[Figure 5-2: GPFS versus HDFS across the Postmark, Teragen, Grep, CacheTest, and Terasort benchmarks; vertical axis 0 to 2000.]

I/O
IBM Research Hadoop GPFS-SNC
HDFS GPFS-SNC HDFS

5-2
HDFS GPFS-SNC Hadoop

GPFS-SNC 10 Hadoop 16
HDFS Hadoop

GPFS-SNC Hadoop
GPFS-SNC Hadoop
IBM
Hadoop
Hadoop GPFS-SNC

Hadoop


Hadoop

Hadoop

mapper 5-3

Hadoop
Hadoop 0.20.2
Avro

MapReduce
mapper
5-3 1 GB Hadoop
BigInsights 128 MB
8 Hadoop
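The block arithmetic above is simple enough to check directly. The sketch below assumes a file size expressed in MB and rounds up, since a final partial block still occupies a block; the function name is an illustrative assumption.

```python
import math


def num_blocks(file_size_mb, block_size_mb):
    """Number of blocks needed to hold a file, counting a partial last block."""
    return math.ceil(file_size_mb / block_size_mb)


print(num_blocks(1024, 128))  # 1 GB file with the 128 MB BigInsights block size
print(num_blocks(1024, 64))   # same file with the 64 MB Apache Hadoop default
```

With 128 MB blocks the 1 GB file occupies 8 blocks, matching the figure quoted above; the 64 MB default mentioned earlier would double that to 16.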

5-4

[Figure 5-3: A text file in Hadoop, split across blocks: "Big data represents a new era in data exploration and utilization, and IBM is uniquely positioned to help clients design, develop and execute a Big Data strategy"]


0001 1010 0001 1101 1100 0100 1010 1110 0101 1100 1101 0011 0001 1010 0001 1101 1100 0101
1100

5-4

Hadoop 0.20.2
mapper
Jaql
mapper
TextInputFormat Hadoop Pig
MapReduce

CPU
Hadoop
CPU
CPU CPU
MapReduce

CPU CPU I/O


I/O CPU
I/O


BigInsights IBM LZO


BigInsights IBM LZO
MapReduce
Hadoop GNU LZO
IBM GNU LZO
IBM LZO
GNU LZO
mapper
GNU LZO mapper

IBM GNU (GPL)


Hadoop
GPL
IBM LZO BigInsights BigInsights

GPL
Hadoop 0.21 bzip2
bzip2 IBM LZO bzip2

5-5

0001 1010

5-5

0001 1101

1100 0100 1010 1110

0101 1100 1101 0011 0001 1010

1101 1100


mapper

Codec      File extension
IBM LZO    .cmx
bzip2      .bz2      (Hadoop 0.21)
gzip       .gz
DEFLATE    .deflate

BigInsights 4 IBM
LZO, bzip2, gzip, DEFLATE
Hadoop

http://stephane.lesimple.fr/wiki/blog/lzop_vs_compress_
vs_gzip_vs_bzip2_vs_lzma_vs_lzma2-xz_benchmark_reloaded
96 MB IBM LZO
LZO

(MB)

LZO

36

bzip2

19

22

0.6
5

gzip

23

10

1.3

BigInsights Web
BigInsights
BigInsights
HDFS GPFS-SNC BigInsights


8080

Apache Hadoop
Hadoop
BigInsights

MapReduce
ZooKeeper
BigInsights
5-6
Hadoop

5-6 HDFS HDFS

HDFS Hadoop
Flume

5-6

BigInsights


GPFS GPFS-SNC HDFS BigInsights


GPFS-SNC

Web

5-7 Job Status

Job Summary

XML
mapper /

5-7

BigInsights Job Status Jobs in Progress


BigInsights Hadoop

Hadoop
Hadoop BigInsights
Hadoop
BigInsights LDAP
LDAP
REST HTTP
Apache Hadoop
Hadoop

BigInsights (LDAP)
LDAP LDAP
LDAP LDAPS (LDAP over SSL/TLS) BigInsights
LDAP LDAP BigInsights
BigInsights
LDAP

Hadoop Kerberos
Active Directory
BigInsights LDAP
LDAP Kerberos LDAP
BigInsights
Kerberos


BigInsights GPFS-SNC
HDFS GPFS-SNC

Apache Hadoop HDFS HDFS


IBM
BigInsights

IBM
Hadoop
2
IBM
BigInsights Netezza, DB2 for Linux, UNIX, and
Windows, Java Database
Connectivity (JDBC), InfoSphere Streams, InfoSphere Information
Server DataStage, R

Netezza
BigInsights BigInsights Netezza
Netezza Adapter Jaql
Jaql
Netezza Adapter
mapper SQL

Netezza Adapter Netezza


UNIX JDBC mapper


mapper Netezza
Netezza UNIX

DB2 for Linux, UNIX and Windows


BigInsights DB2 for Linux, UNIX, and
Windows BigInsights (UDF)
DB2 JDBC BigInsights

BigInsights DB2 DB2 UDF


Jaql DB2 BigInsights Jaql
DB2 9.5 Jaql
Jaql DB2 Jaql

BigInsights
BigInsights Jaql
BigInsights Jaql DB2

DB2 BigInsights
Hadoop
Hadoop DB2 SQL
BigInsights BigInsights
Hadoop

IT

JDBC
Jaql JDBC JDBC

SQL


Jaql MapReduce map


SQL

InfoSphere Streams
6 Streams IBM
Streams BigInsights
BigInsights Streams BigInsights
Streams BigInsights Streams

Streams BigInsights
Streams
BigInsights
Streams BigInsights
BigInsights BigInsights
Streams
BigInsights
Streams BigInsights Advanced Text Analytics Toolkit
IBM Research SystemT
Web

InfoSphere DataStage
DataStage (ETL)

DataStage BigInsights BigInsights

BigInsights DataStage HDFS GPFS-SNC


GPFS-SNC GPFS-SNC
HDFS GPFS-SNC POSIX
DataStage BigInsights DataStage

Information Server BigInsights DataStage


BigInsights ETL
Information Server BigInsights

R
BigInsights Jaql R R Project
www.r-project.org Jaql R
Jaql MapReduce R

Intelligent
Scheduler
Hadoop (FIFO)
Apache Hadoop
Fair Scheduler Capacity Scheduler
Fair Scheduler
BigInsights Capacity Scheduler

FAIR
SLA

IBM Research Hadoop


Intelligent Scheduler FLEX
Fair Scheduler
Intelligent Scheduler


Intelligent Scheduler

average response time


maximum stretch
user priority

MapReduce
IBM Research Hadoop
IBM Research
MapReduce (Adaptive MapReduce) mapper
mapper Hadoop map

MapReduce Hadoop
(split) mapper
mapper mapper
mapper mapper

map BigInsights

MapReduce mapper
mapper Hadoop
mapper mapper

[Figure 5-8: mapper results by split size (MB): AM, 32, 64, 128, 256, 512, 1024, 2048; vertical axis 200 to 1000.]

mapper mapper
mapper
mapper 5-8
mapper
mapper AM 32 MB
mapper mapper

mapper mapper mapper


5-9 TERASORT
map mapper
AM 32 MB
mapper mapper

MapReduce
BigInsights

[Figure 5-9: TERASORT mapper results by split size (MB): AM, 16, 32, 64, 128, 256, 512, 1024; vertical axis 300 to 1500.]

BigSheets
BigInsights
Hadoop

BigInsights BigInsights
Hadoop BigInsights
Apache Hadoop Hadoop BigInsights
BigInsights
MapReduce

Hadoop
MapReduce BigInsights
BigSheets
Hadoop BigSheets
BigSheets


BigSheets
1. Web
HTTPHDFS
Amazon S3 (s3n) Amazon S3
(s3) Web

BigSheets
Twitter
BigSheets
2.
5-10 BigSheets

5-10

BigSheets


Run
GBTB PB

3.
BigSheets

5-11

5-11

BigSheets


BigSheets

Advanced Text Analytics Toolkit


BigSheets BigInsights

Web

BigInsights
Advanced Text Analytics Toolkit IBM Research
2004 SystemT IBM
Advanced Text Analytics Toolkit IBM
Lotus Notes, IBM eDiscovery Analyzer, Cognos Consumer
Insight, InfoSphere Warehouse Advanced Text Analytics
Toolkit
BigInsights Advanced Text Analytics Toolkit

MapReduce Advanced
Text Analytics Toolkit


In the 2010 World Cup of Soccer, the team from the Netherlands
distinguished themselves well, losing to Spain 1-0 in the Final. Early in the
second half, Dutch striker Arjen Robben almost changed the tide of the
game on a breakaway, only to have the ball deflected by Spanish keeper,
Iker Casillas. Near the end of regulation time, winger Andres Iniesta
scored, winning Spain the World Cup.

Name              Position      Country
Arjen Robben      Striker       Netherlands
Iker Casillas     Goalkeeper    Spain
Andres Iniesta    Winger        Spain


BigInsights Advanced Text Analytics Toolkit


Advanced Text Analytics Toolkit Annotator
Query Language (AQL)

AQL SQL
AQL

create view PersonPhone as
select P.name as person, PN.number as phone
from Person P, Phone PN, Sentence S
where Follows(P.name, PN.number, 0, 30)
  and Contains(S.sentence, P.name)
  and Contains(S.sentence, PN.number)
  and ContainsRegex(/\b(phone|at)\b/, SpanBetween(P.name, PN.number));
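To make the logic of the PersonPhone rule easier to follow, here is a minimal Python sketch of the same conditions. It is an illustration only, not how the toolkit evaluates AQL. The `Span` type, the helper names, the sample sentence, and the regular expressions used to find persons and phone numbers are all assumptions made for this example. `follows` is assumed to mean that the phone number starts between 0 and 30 characters after the person name ends.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """Character offsets into the document text (hypothetical helper type)."""
    begin: int
    end: int


def find_spans(pattern: str, text: str) -> List[Span]:
    """Return one span per regex match."""
    return [Span(m.start(), m.end()) for m in re.finditer(pattern, text)]


def follows(first: Span, second: Span, min_gap: int, max_gap: int) -> bool:
    """True if `second` starts min_gap..max_gap characters after `first` ends."""
    return min_gap <= second.begin - first.end <= max_gap


def contains(outer: Span, inner: Span) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer.begin <= inner.begin and inner.end <= outer.end


def person_phone(text: str, persons: List[Span], phones: List[Span],
                 sentences: List[Span]) -> List[Tuple[str, str]]:
    """Pair each person with a phone number that satisfies all rule conditions."""
    keyword = re.compile(r"\b(phone|at)\b")
    pairs = []
    for p in persons:
        for pn in phones:
            if not follows(p, pn, 0, 30):
                continue
            # Text between the name and the number must mention "phone" or "at".
            if not keyword.search(text[p.end:pn.begin]):
                continue
            # Both spans must fall inside the same sentence.
            if any(contains(s, p) and contains(s, pn) for s in sentences):
                pairs.append((text[p.begin:p.end], text[pn.begin:pn.end]))
    return pairs


if __name__ == "__main__":
    sample = "Call Peggy at 555-1234 today. Horton left early. His desk is 555-9999."
    people = find_spans(r"\b(?:Peggy|Horton)\b", sample)    # toy name dictionary
    numbers = find_spans(r"\b\d{3}-\d{4}\b", sample)        # toy phone pattern
    sents = find_spans(r"[^.]+\.", sample)                  # naive sentence split
    print(person_phone(sample, people, numbers, sents))
```

In the sample text, Horton is followed by a number within 30 characters, but the pair is rejected because the text in between contains neither keyword and the two spans sit in different sentences.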
5-12
Advanced Text Analytics Toolkit Eclipse
AQL
5-13
[Figure 5-12 callouts: <>, <>, 0-30, "phone", "at"]

5-12

5-13

AQL

PB

Provenance 5-14

Advanced
Text Analytics Toolkit


Notes

//

URL

AQL
AQL
Advanced Text Analytics Toolkit
5-15
Advanced Text Analytics Toolkit BigInsights
Advanced
Text Analytics Toolkit BigInsights 5-16
AQL Analytics Operator
Graph (AOG) BigInsights Web AOG

5-14 Provenance
[Figure 5-14: provenance viewer. Person results (Peggy, Horton, Stanley, Rick Buy, Mark Metts) are traced back through PersonCand candidates (Peggy, Horton, Stanley, Rick, Rick Buy, Mark Metts, Metts) and the UnionOp0, UnionOp4, FirstCaps, FirstName, and CapsPerson views; for example, "Rick Buy" derives from FirstName "Rick" and CapsPerson "Buy".]

5-15 Advanced Text Analytics Toolkit
[Figure 5-15: chart; y-axis: Throughput (KB/sec), ticks 0 to 700; x-axis: (KB), ticks 20 to 100; legend includes ANNIE]

BigInsights mapper AOG


mapper Jaql Advanced Text
Analytics Toolkit AOG mapper

BigInsights Advanced Text Analytics Toolkit

Hadoop

AQL

Analytics
Operator
Graph

5-16

Advanced Text Analytics Toolkit BigInsights


2012 BigInsights Machine Learning Toolkit


IBM Research SystemML
2012
BigInsights Hadoop

Machine Learning Toolkit


MapReduce
Java
MapReduce
Machine Learning Toolkit IBM Research

Hadoop
MapReduce
MapReduce
MapReduce
Hadoop

BigInsights
BigIndex Hadoop


BigIndex Apache Lucene IBM Lucene Extension


Library (ILEL) IBM Lucene
ILEL Lucene
BigIndex BigInsights
Lotus Connections, IBM Content Analyzer,
Cognos Consumer Insight IBM BigIndex
Intranet Gumshoe Hadoop in Action
Chuck Lam [Manning Publications, 2010]
BigIndex
BigInsights TB
BigIndex

BigIndex

ID

Twitter

5-17 BigIndex BigInsights


...

BigInsights

5-17

BigIndex

5-17

1.
BigInsights
Flume
Streams HDFS GPFS-SNC
Twitter
2.

3.

/
4.
Lucene
MapReduce Hadoop
Lucene
Lucene
Lucene
5.
Runtime Shard
Cluster


6.

7.
BigIndex
Java API Jaql REST
HTTP API

BigInsights
BigInsights

GPFS-SNC
IBM LZO

MapReduce Intelligent Scheduler

BigInsights
BigSheets
BigInsights
IBM Research Machine Learning Analytics Toolkit
IT
BigInsights IBM

6
IBM InfoSphere Streams

IBM Hadoop
IBM
BigInsights
IBM InfoSphere Streams (Streams)

Streams
(MPP)

Streams (streams)
IBM InfoSphere Streams
Streams


InfoSphere Streams
Streams
Streams BigInsights
Streams
Streams

Streams
BigInsights

(CEP) Streams
Streams
Streams

Source

Streams Sink

Streams
Streams

Streams

IBM InfoSphere Streams: Analytics for Big Data in Motion


Streams
BigInsights Text Analytics Toolkit
Streams
(PE)

Streams

Streams 80%

BigInsights
Streams BigInsights IBM
Text
Analytic Toolkit Streams BigInsights

InfoSphere Streams
Streams

Streams

(FSS)


FSS Streams Algo


Trading 1270
130
Streams Financial
Information eXchange (FIX)
Streams
Streams

Streams
(UOIT)
Streams

1000


Streams
24

Streams


http://www.youtube.com/watch?v=QVbnrlqWG5I

(CDR)
CDR
CDR

Streams

Streams (RTAP)

Globe Telecom

Globe Telecom
10 40
CDR Internet (IPDR)
IPDR Internet (IP)


CDR CDR IPDR


Streams
500,000 / 60
4 PB (4000 TB) Streams
CDR 1 GBps X (XRD) 100
MBps Streams

Streams

TerraEchos InfoSphere Streams

Fiber Optic Sensor


System Boarder Application Frost & Sullivan
Streams Streams
Processing Language (SPL) 45%

Streams

Streams


GPS

Streams

Streams
Streams

InfoSphere Streams
Streams


430 Streams

Streams

6-1
[Figure 6-1: operator graph with FileSource, Functor, Split, FileSink, and ODBCAppend operators]

6-1


Streams

N
M

Streams

Streams Processing Language


Streams Processing Language (SPL)
Streams
Streams
SPL Streams 45%

SPL Streams Streams


(bin) Streams


SPL
SPL

Streams

SPL SPL

composite toUpper {
  graph
    // Read the input file one line at a time.
    stream<rstring line> LineStream = FileSource() {
      param
        file   : "input_file";
        format : line;
    }
    // Convert each line to uppercase.
    stream<LineStream> upperedTxt = Functor(LineStream) {
      output
        upperedTxt : line = upper(line);
    }
    // Write the uppercased lines to standard output.
    () as Sink = FileSink(upperedTxt) {
      param
        file   : "/dev/stdout";
        format : line;
    }
}
SPL FileSource
LineStream Functor
upperedTxt
Sink upperedTxt
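For readers more familiar with general-purpose languages, the dataflow of the toUpper application can be approximated with Python generators. This is only a rough analogy under stated assumptions: the function names `file_source`, `functor_upper`, `file_sink`, and `to_upper` are invented for this sketch and are not part of Streams, and a real SPL application is compiled and distributed across a cluster rather than run in a single process.

```python
import io
import sys
from typing import Iterable, Iterator, TextIO


def file_source(handle: TextIO) -> Iterator[str]:
    """Emit one item per input line, similar in spirit to 'format : line'."""
    for raw in handle:
        yield raw.rstrip("\n")


def functor_upper(lines: Iterable[str]) -> Iterator[str]:
    """Transform each incoming line, like the Functor's 'line = upper(line)'."""
    for line in lines:
        yield line.upper()


def file_sink(lines: Iterable[str], out: TextIO) -> None:
    """Consume the stream and write each line to the given output."""
    for line in lines:
        out.write(line + "\n")


def to_upper(source: TextIO, out: TextIO = sys.stdout) -> None:
    """Wire the three stages together: source -> functor -> sink."""
    file_sink(functor_upper(file_source(source)), out)


if __name__ == "__main__":
    to_upper(io.StringIO("hello streams\nbig data in motion\n"))
```

Each stage pulls items lazily from the previous one, so lines flow through one at a time instead of being loaded in bulk, which loosely mirrors how tuples move between operators.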

Streams

Streams


FileSource FileSink
FileSource FileSink

txt
csv
bin
line
block BLOB

TCPSource/UDPSource TCPSink/UDPSink
TCPSource TCPSink Streams
TCP IP IPv4 IPv6

FileSource FileSink txt csv


UDPSource UDPSink UDP
TCP

Export Import
export import export


streamID ID
streamID export
import Streams

MetricsSink
MetricsSink
(named meter)
Streams Studio

MetricsSink

Streams

Streams
Streams

Filter
filter
Streams filter

[ETL] match discard )

Functor
functor


functor

Punctor
punctor

punctor

functor
0

Sort
sort

Streams

count

delta

time

punctuation punctor


Streams

Join
join

inner joins
outer joins
sort

Aggregate
aggregate
Sort
aggregate groupBy partitionBy
aggregate
count, sum, average, max, min, first, last, count distinct
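As a rough illustration of windowed aggregation with grouping, the following Python sketch collects tuples into fixed-size (count-based) windows and computes several of the summary functions listed above for each group. It is not SPL and does not reproduce the Aggregate operator's exact semantics. The function names, the `(key, value)` tuple layout, and the choice to drop a trailing partial window are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, float]  # (group key, numeric value), a hypothetical layout


def summarize(window: List[Record]) -> Dict[str, Dict[str, float]]:
    """Group the window by key and compute summary values per group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in window:
        groups[key].append(value)
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "average": sum(values) / len(values),
            "max": max(values),
            "min": min(values),
            "first": values[0],
            "last": values[-1],
        }
        for key, values in groups.items()
    }


def tumbling_aggregate(records: Iterable[Record],
                       window_size: int) -> Iterator[Dict[str, Dict[str, float]]]:
    """Emit one summary each time window_size records have arrived."""
    window: List[Record] = []
    for record in records:
        window.append(record)
        if len(window) == window_size:
            yield summarize(window)
            window = []  # start a fresh window; a partial final window is dropped


if __name__ == "__main__":
    trades = [("IBM", 10.0), ("IBM", 20.0), ("XYZ", 5.0), ("IBM", 30.0)]
    for summary in tumbling_aggregate(trades, 3):
        print(summary)
```

Because the input is consumed as an iterator, the sketch never holds more than one window of records in memory, which is the property that makes windowing practical on unbounded streams.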

Beacon
beacon
beacon
n/10/ n
beacon Streams


Throttle Delay
throttle delay
throttle
throttle /
delay delay
delay
A
B 10 B C 3 delay

Split Union
split

union

Streams
Streams

Streams
Streams
Database Toolkit Financial Markets
Toolkit

Database ToolkitRelational
Database Toolkit ODBC SolidDB


Streams

ODBCAppend

SQL INSERT

ODBCEnrich

ODBCSource

SolidDBEnrich SolidDB

Financial Markets Toolkit


Financial Information eXchange (FIX)
Streams Financial Markets
Toolkit FIX
FIXMessageToStream FIX
StreamToFIXMessage

FIX

WebSphere MQ
WebSphere Front Office

Streams

IBM
IT


Streams Streams

Streams

IBM

Streams Streams

SPL
SPL

SPL

SPL

(PE)
PE
B

PE Streams
PE Streams


PE
PE
PE
stopped PE
PE
Streams PE

RecoveryMode=ON

Streams Eclipse
InfoSphere Streams Studio (Streams Studio)
Streams SPL
Eclipse Streams Studio
Streams Streams Explorer Streams
Streams

SPL Streams Studio

Streams

Streams Studio Streams


SPL


Streams Streams Studio


IDE Streams

Streams
WebSphere

Streams Eclipse IDE

Eclipse IBM
Rational Streams
BigInsights Hadoop
BigInsights Streams BigInsights

Streams BigInsights
IBM


IBM
BigInsights wiki
BigInsights Twitter Facebook

ibm.com/developerworks/wiki/biginsights

IBM
BigInsights (M97)
InfoSphere Streams (N08)
ibm.com/certify/mastery_tests

IBM

ibm.com/software/data/education

InfoSphere BigInsights Apache Hadoop


BigInsights 1
InfoSphere Streams
InfoSphere Streams 2

ibm.com/software/data/education/bookstore
