IBMHadoop
IBM
Hadoop
Paul C. Zikopoulos, MBA, IBM Software Group
World Wide Database Competitive Big Data
SWAT Paul 18
Paul 320 14
DB2 pureScale: Risk Free Agile Scaling (McGraw-Hill, 2010)
Break Free with DB2 9.7: A Tour of Cost-Slashing New Features (McGraw-Hill, 2010)
Information on Demand: Introduction to DB2 9.5 New Features (McGraw-Hill, 2007)
DB2 Fundamentals Certification for Dummies (For Dummies, 2001)
DB2 for Windows for Dummies (For Dummies, 2001)
Paul
DB2 DRDA ClustersBI
DBA Chachi
Chlo paulz_ibm@msn.com
@BigData_paul
Chris Eaton, IBM
Chris Linux, UNIX, Windows
DB2 19 ,
Chris DB2
The High Availability Guide to DB2 (IBM Press, 2004)
IBM DB2 9 New Features (McGraw-Hill, 2007)
Break Free with DB2 9.7: A Tour of Cost-Slashing New Features (McGraw-Hill, 2010)
Chris
IT
Toolbox DB2 http://it.toolbox.com/blogs/db2luw
Dirk deRoosIBM IBM
Dirk 11 IBM Toronto DB2 Development
Dirk New Brunswick
Steven Sit, MS, IBM
IBM Steven
IBM
17StevenIBM
Steven Western
Ontario Syracuse
Hadoop
Paul C. Zikopoulos
Chris Eaton
Dirk deRoos
Thomas Deutsch
George Lapis
McGraw-Hill
bulksales@mcgraw-hill.com
Hadoop
2012 by The McGraw-Hill Companies
1976
McGraw-Hill
IBM
IBM
1234567890
ISBN
MHID
DOC DOC
10987654321
978-0-07-179053-6
0-07-179053-5
Paul Carlstroem
Patty Mon
Sheena Uprety,
Cenveo Publisher
Services
Lisa Theobald
Paul Tyler
Cenveo Publisher
Services
Jeff Weeks
George Anderson
Cenveo Publisher
Services
Stephanie Evans
McGraw-Hill McGraw-Hill
McGraw-Hill
IBM 18
Chloe
IBM
IBM
Martin Wildberger, Bob Piciano, Dale Rebhorn, Alyse Passarelli IBM
Teresa
Riley Sophia
10
Chris Eaton
SandraErik Anna
Paul
Dirk deRoos
Lauren William
Anant Jhingran
Thomas Deutsch
IBM
George Lapis
IBM
Paul
Amy Tiffany
Ronald
Steven Sit
15
3 IBM
35
II
4 Hadoop
51
5 InfoSphere BigInsights
81
123
xv
xxi
xxiii
Hadoop
12
15
15
17
IT IT
18
20
24
26
29
31
IBM
35
37
39
IBM 1
40
40
49
xii
II
Hadoop
53
Hadoop
54
Hadoop
55
Hadoop 56
MapReduce
60
Hadoop 63
Hadoop 64
Pig and PigLatin 65
Hive 67
Jaql 68
Hadoop 73
73
74
Hadoop 76
ZooKeeper 76
HBase 77
Oozie 78
Lucene 78
Avro 80
80
InfoSphere BigInsights
81
82
BigInsights 1.2 Hadoop 84
Hadoop GPFS-SNC 85
Hadoop GPFS GPFS 86
GPFS-SNC 88
GPFS-SNC 91
GPFS-SNC POSIX 92
GPFS-SNC 94
GPFS-SNC Hadoop 95
Contents
95
96
xiii
97
99
102
103
Netezza
103
DB2 for Linux, UNIX, and Windows
104
JDBC Module
104
InfoSphere Streams
105
InfoSphere DataStage
105
R Statistical Analysis Applications
106
MapReduce
106
107
BigSheets
BigInsights
112
118
109
118
121
IBM InfoSphere Streams
InfoSphere Streams
124
InfoSphere Streams
InfoSphere Streams
123
125
129
130
Streams Processing Language
131
133
134
137
.
138
139
140
141
Rob Thomas
TomTom
Chris 20
Chris Tom
10
Tom 20 Chris
1.25 /
*****
80%
15
5
xv
xvi
10
1.25 /
Rob Thomas
IBM
Foreword
xvii
Anjul Bhambhri
70 System R
System R
SQL DB2 Oracle SQL/DS
ALLBASE Non-Stop SQL
90 IT
ERP
SCM
xviii
90
IBM
Garlic 2001 XML DB2 pureXML
XML XML
IBM
2011 50 IBM
IBM 30
IBM DB2
InformixSolid DB
xix
PaulGeorgeTom Dirk
Anjul Bhambhri
IBM
Shivakumar (Shiv) Vaithyanathan, Roger Rea, Robert Uleman, James R. Giles,
Kevin Foster, Ari Valtanen, Asha Marsh, Nagui Halim, Tina Chen,
Cindy Saracco, Vijay R. Bommireddipalli, Stewart Tate, Gary Robinson,
Rafael Coss, Anshul Dawra, Andrey Balmin, Manny Corniel, Richard Hale,
Bruce Brown, Mike Brule, Jing Wei Liu, Atsushi Tsuchiya, Mark Samson,
Douglas McGarrie, Wolfgang Nimfuehr, Richard Hennessy, Daniel Dubriwny
IBM
Rob Thomas Anjul Bhambhri
Steven Sit
Sheena Uprety, Patty Mon, Paul Tyler, Lisa Theobald
McGraw-Hill Paul Carlstroem
xxii
xxiv
IBM
IBM Hadoop
IBM Hadoop
Apache Hadoop BigInsights Hadoop
IBM
IBM Hadoop
Hadoop IBM
IBM
(ROI) IBM
IBM Hadoop
IBM
IBM
xxv
IBM
100 300
Airbus
10
10
40
300 (RFID)
[]
20
xxvi
IBM
IBM
Pyotr Smirnov
Smirnov
xxvii
I 3
1 3
Twitter Facebook
IBM
3
3 VV3
IBM 30
XM
Facebook V3
V3
ID
xxviii
IT
3 IBM
IBM
Hadoop
Claude MonetIBM
IBM
IBM IBM
Hadoop
IBM
BigInsights Hadoop
Hadoop Java
BigInsights
Hadoop
IBM
IBM
xxix
IBM
Think Watson Jeopardy!
IBM
24x7
IBM SPSS, Cognos, Smart Analytics
Systems, Netezza 5 IBM
140
IBM IBM
Eclipse (UIMA)
Apache Derby, Lucene, XQuery, SQL, Xerces XML
(IDE)
IBM Hadoop Jaql 4
IBM Hadoop IBM
Hadoop Hadoop
FacebookLinkedIn Hadoop
Hadoop IBM Hadoop
II 4
Hadoop
Apache
Hadoop Pig Hive HDFS
MapReduce ZooKeeper
xxx
5
IBM
IBM
InfoSphere BigInsights (BigInsights) IBM Hadoop
3 IBM
IBM IBM General Parallel
File System (GPFS) GPFS (SNC)
Hadoop IBM BigInsights
Java
Hadoop
GLP
Hadoop
xxxi
6
6 IBM InfoSphere Streams (Streams)
Streams
Streams Streams
Streams
BigInsights Hadoop
Streams
IBM
WebSphere
Blackberry AppWorld
Apple AppStore
5
IBM
100
IBM
IBM
(machine-to-machine, M2M)
(YoY)
GPS
IBM
IBM
3 1-1
IBM
IBM
2000 800,000 PB
BigInsights 2020 35
ZB Twitter 7 TB Facebook 10 TB
TB
Big
Data
TB
ZB
1-1 IBM V 3
Variety
Velocity
Volume
PB
iTunes
2007
I35W 200
TB 10
1 TB
1-2
1-2
Data Available
Percent of data an
TB PB ZB
Web
20%
80%
Twitter
JSON
PB TB RFID
IBM
GPS
IBM
Hadoop
IBM
IBM
Hadoop
Hadoop
Hadoop
2
10
Hadoop
Hadoop
IT
Hadoop
TweetFacebook
Hadoop
IT
CIO
11
Hadoop
Hadoop IBM
12
/
IBM
InfoSphere BigInsights Hadoop
Hadoop
13
IBM
IBM
1
(V3)
16
17
IBM
18
IT IT
IT
(data exhaust)
IT
DB2
BigInsights
GB
IT
IT
19
IT IT
IBM
(FSS)
IT IT
IT IT
IT
IT IT
(SOA)
20
SOA
20
IT
1TB 5
21
20%
2-1
80%
CIO CAPEX OPEX
BigInsights
80%
- 2-2
22
Mashup
SOA Web
ODS
+++
ERPCRM
2-1 20%
Data Quality/Governance/
2-2
InfoSphere StreamsDB2
IBM
2-2
3
2
23
Mashup
ODS
SOA Web
InfoSphere
BigInsights
+++
ERPCRM
2-2
Data Quality/Governance/
50%
80%
BigInsights
InfoSphere Streams 2-2
-
Streams
24
(FBI)
600
IBM
Cognos Consumer Insights (CCI)
BigInsights CCI
25
SAP, DB2, Teradata, Oracle
Facebook
Facebook
26
7166 Ttps
(CSR)
CSR
27
(Streams)
(BigInsights)
Streams
BigInsights
//
Streams
CSR
CSR
CSR
70%
2%
CSR
28
BigInsights
CSR
CSR
Watson
(BigInsights)
Streams
29
Streams
BigInsights BigInsights
Watson
Streams
BigInsights
2008
15-20%
30
80%
CAPEX OPEX
31
20,000 40,000
10% 5%
90%
Streams
BigInsights Vestas
Vestas
32
5 65
43,000 Vestas
IBM BigInsights
20 30
6 PB (6000 TB)
Vestas
Vestas
Hadoop
Vestas IBM
Hadoop
2 IBM
Hadoop
33
IBM IBM
3
IBM
Hadoop
Hadoop MapReduce
- -
IBM
35
36
IT
IT
IT
Hadoop
IBM
IBM
Hadoop
IBM
Hadoop
IBM
BigInsights
IBM
IBM Hadoop
IBM IBM
37
24x7
IBM Hadoop
BigInsights IBM
5
CIO
IT
Hadoop
Hadoop TwitterFacebook Yahoo
500
38
IT
(RYO)
Hadoop
(SLA) IT
(MTTR)
(RPO)
Hadoop
Hadoop
IBM IBM
IBM Hadoop
Hadoop
Hadoop
(OLTP)
Hadoop SLA
Hadoop
SLA
Hadoop
39
IBM
IBM
IBM
BigInsights 200 IBM
5 IBM
General Parallel File System Shared Nothing Cluster (GPFS-SNC)
SC10 Storage ChallengeSC10
40
IBM 1
IBM Hadoop 2011 5
1
Hadoop
IT
IBM SPSS IBM
Cognos Unica CoreMetrics
Netezza IBM Smart Analytics System IBM
(BAO) IBM
InfoSphere Streams
Streams (BigInsights)
IBM
IBM
IBM
5 IBM 24
140 IBM Research 8000 IBM
200
Hadoop
Apache Hadoop
IBM
41
IBM
IBM
IBM
IBM
Fortran, DRAM, ATM, UPC, RISC,
PC, SQL, XQuery
www.ibm.com/ibm100/
IBM IBM
1956
IBM Random Access
Method of Accounting and Control
RAMAC 50 2
IBM 2000
10,000 1997 10
IBM
PB
IBM
1970
IBM Ted Codd
42
IBM
DB2, Informix, Netezza, Oracle,
Sybase, SQL Server
IBM
1971
IBM
5000
IBM ViaVoice 64,000
260,000 1997 ViaVoice
VoiceType 2
IBM
1980RISC
IBM RISC
IBM John Cocke 20 70 RISC
RISC
(HPC)
Watson
Jeopardy
1988NSFNET
IBM National Science Foundation (NSF) MCI Merit
43
NSFNET 200 6
NSFNET Internet
Internet NSFNET Internet
56kb/s 1.5Mb/s 45Mb/s
Internet Internet 1995 4
Internet 93
5 Internet
Internet
NSFNET
1993
IBM
1996
1996 IBM
44
10
IBM
(NOAA) 1995
2 Vestas IBM
1997
32 IBM RS/6000 SP Deep Blue
Garry Kasparov
Watson
Watson
Deep Blue
Hadoop
2000Linux
2000 IBM Linux Linux
IBM 10 Linux
IBM
IBM CEO CIO
Linux Linux
45
Hadoop Hadoop
2004
Blue Gene IBM
PFLOPS 2004 9
IBM Blue Gene PFLOPS
IBM Blue Gene
Blue Gene
2009 Blue Gene
Barack Obama IBM National
Medal of Technology and Innovation
2009
IBM
IBM
IBM
2009
IBM Streams
46
IBM
Streams
Streams 500,000 CDR
60 CDR 4 PB
2009
IBM Enterprise CloudIBM
2010
2000 IBM
IBM
1500
IT
IBM
IT
2010 GPFS
SNC
1998 IBM General Parallel File System (GPFS)
POSIX (SAN)
DB2 pureScale Oracle RAC
GPFSGPFS
GPFS
47
2011Watson
IBM Watson -Question-AnsweringQA
Watson
Watson
2011 2 Watson
Jeopardy!
Ken Jennings Brad Rutter
Decision Augmentation
Watson Hadoop
BigInsights
Hadoop
48
Hadoop Hadoop
FLEX BigInsights
GPFS-SNC 12 Adaptive
MapReduce Machine Learning Toolkit
System ML BigInsights
II
IBM
Hadoop Apache IBM IBM
Apache
IBM
IBM Mashup
Information Management Lotus IBM Enterprise
Content ManagementCognosWebSphere Tivoli IBM
Hadoop IBM Cognos Consumer
Insight (CCI) IBM
CCI BigInsights IBM
49
Internet
CCI
BigInsights
BigInsights
CCI
CCI
BigInsights
50
IBM
2035
50%
50%
20,000 TB
10 PB
IBM Smart
Grid IBM
IBM
IBM
II
51
Hadoop
Hadoop IBM
InfoSphere BigInsights (BigInsights)
BigInsights
Hadoop
Hadoop
BigInsights Hadoop
53
54
Hadoop
Hadoop
Hadoop (http://hadoop.apache.org/) Apache Software Foundation
Apache Java Hadoop
Hadoop
Hadoop Hadoop
Hadoop
Hadoop Doug
Cutting Cutting
55
Cutting
Cutting
Pinky Squiggles
Hadoop (Hadoop Distributed File
System) (MapReduce)Hadoop
Hadoop Hadoop
Hadoop
Hadoop
Apache AvroCassandra HBase
Chukwa
Hive SQL Mahout
Pig Hadoop
ZooKeeper
Hadoop
Hadoop Hadoop Distributed File System (HDFS)
Hadoop MapReduce Hadoop Common Hadoop
MapReduce
Hadoop
56
Hadoop
HDFS Hadoop
map reduce
Hadoop
MapReduce
Hadoop
(SAN) (NAS) SAN NAS Hadoop
1000 3
3000 + 1000
Hadoop
(MTTF)
Hadoop
MTTF
Hadoop
HDFS HDFS
Hadoop
A
1 B 2 Hadoop
HDFS 4-1
[Figure residue: Block_1, Block_2, and Block_3 each appear three times, spread across Rack 1 and Rack 2]
4-1 HDFS 3
Rack
Hadoop
Hadoop
58
512
4 KB 32 KB Hadoop
3
Hadoop HDFS
Hadoop NameNode
NameNode HDFS
NameNode
NameNode Hadoop
(SPOF) NameNode
Hadoop
NameNode
Hadoop 0.21
BackupNode BackupNode NameNode
4-1 3
block_n block_n' block_n''
HDFS Hadoop
Hadoop MapReduce
NameNode Hadoop
Hadoop
59
MapReduce Hadoop
NameNode
MapReduceHDFS
NameNode
MapReduce
NameNode NameNode Hadoop
NameNode
NameNode
MapReduce
NameNode
IBM Hadoop
IBM IBM
General Parallel File System (GPFS) GPFS
SAN 2009 GPFS
GPFS-SNC Hadoop GPFS-SNC HDFS
NameNode GPFS-SNC Hadoop SPOF
GPFS-SNC Hadoop
NameNode HDFS
Portable Operating System Interface for UNIX (POSIX)
HDFS
Java IT
HDFS
60
BigInsights Hadoop
GPFS-SNC IEEE POSIX
APIshell UNIX AIX
Apple OSX HP-UX
MapReduce
MapReduce Hadoop
Hadoop MapReduce
Hadoop MapReduce
MapReduce Hadoop
map
/reduce map
MapReduce
reduce map
5
Hadoop
61
Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
5 reduce reduce
mapping
(reducing)
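The city and temperature example above can be sketched in plain Python. This is a minimal single-process illustration of the map, shuffle, and reduce steps, not Hadoop's Java API. The function names are hypothetical, and it assumes the goal is the maximum temperature per city, with each mapper emitting one intermediate maximum per city for its own file.

```python
from collections import defaultdict

def map_max_per_city(records):
    """Map step: emit one (city, max_temp) pair per city seen in one input file."""
    best = {}
    for city, temp in records:
        if city not in best or temp > best[city]:
            best[city] = temp
    return list(best.items())

def shuffle(mapper_outputs):
    """Shuffle step: gather every value emitted for the same key."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for city, temp in output:
            grouped[city].append(temp)
    return grouped

def reduce_max(grouped):
    """Reduce step: collapse each city's candidates to a single maximum."""
    return {city: max(temps) for city, temps in grouped.items()}

# Records of the sample input file shown in the text.
sample_file = [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 32),
               ("Toronto", 4), ("Rome", 33), ("New York", 18)]

# Intermediate pairs listed in the text for the other four mappers.
other_mappers = [
    [("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("Whitby", 22), ("New York", 19), ("Rome", 30)],
]

mapper_outputs = [map_max_per_city(sample_file)] + other_mappers
result = reduce_max(shuffle(mapper_outputs))
print(result)
```

In a real cluster each mapper would run on the node that holds its block, and only the small intermediate pairs would cross the network during the shuffle.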
Hadoop MapReduce
62
Hadoop
JobTracker JobTracker NameNode
map reduce
JobTracker
Hadoop TaskTracker
JobTracker JobTracker
4-2 [Figure residue: five Map tasks feed a Shuffle step, and two Reduce tasks write Part_1 and Part_2]
MapReduce
63
reducer Toronto
reduce Shuffle
map reduce Hadoop
map
Combiner reduce 4-2
reduce
reducer
Hadoop MapReduce Java
Java Archive (jar) JobTracker Hadoop
map reduce MapReduce
Apache Hadoop Hadoop
Hello World WordCount WordCount
Java
Hadoop
BigDataUniversity.com InfoSphere BigInsights Basic
Edition (www.ibm.com/software/data/infosphere/ biginsights/basic.html)
IBM
Hadoop
BigInsights
Basic Edition IBM
InfoSphere BigInsights Enterprise Edition Hadoop
Hadoop
Hadoop Hadoop
shell
HDFS POSIX
Linux UNIX HDFS
64
HDFS shell
cat
(stdout)
chmod
chown
copyFromLocal HDFS
copyToLocal
HDFS
cp
HDFS
expunge
HDFS
MAC Windows
HDFS
expunge
ls
mkdir
mv
rm
HDFS
HDFS rm
-skipTrash
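The shell commands listed above are invoked through `hadoop fs`. The sketch below only builds the command lines, so it runs without a cluster. The helper name `hadoop_fs` and the file paths are hypothetical, chosen for illustration.

```python
import shlex

def hadoop_fs(command, *args):
    """Build the argv list for one `hadoop fs` file system shell call."""
    return ["hadoop", "fs", f"-{command}", *args]

# Hypothetical paths, used only to show the shape of each call.
upload = hadoop_fs("copyFromLocal", "tweets.json", "/data/tweets.json")
listing = hadoop_fs("ls", "/data")
delete_now = hadoop_fs("rm", "-skipTrash", "/data/tweets.json")

for argv in (upload, listing, delete_now):
    print(shlex.join(argv))
```

On a machine with Hadoop installed, each list could be passed to `subprocess.run(argv, check=True)` to execute the command.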
Hadoop
Hadoop
Hadoop MapReduce API Java
65
MapReduce
XML IMS
3GL 4GL
Hadoop Hadoop
Hadoop
Pig PigLatin
Pig Yahoo! Hadoop
mapper reducer
Pig Pig
PigLatin Hadoop
PigLatin
Java (JVM) Java
Pig
mapper
reducer Pig HDFS LOAD
mapper
reducer DUMP
STORE
LOAD
Hadoop Hadoop HDFS
Pig Pig
66
Pig
USING LOAD
TRANSFORM
FILTER
JOIN, GROUP
ORDER Pig
Twitter (English) iso_language
tweet tweet
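-- Field names and types in the AS clause are assumed for illustration.
L = LOAD 'hdfs://node/tweet_data' AS (from_user:chararray, iso_language_code:chararray, retweets:int);
FL = FILTER L BY iso_language_code == 'en';
G = GROUP FL BY from_user;
RT = FOREACH G GENERATE group, SUM(FL.retweets);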
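To make the data flow of the Pig script concrete, here is a rough Python equivalent of its FILTER, GROUP, and SUM steps. The sample rows are invented for illustration; on a cluster, Pig would compile the same logic into map and reduce tasks over data in HDFS.

```python
from collections import defaultdict

# Hypothetical sample rows; the real data would come from HDFS.
tweets = [
    {"from_user": "alice", "iso_language_code": "en", "retweets": 3},
    {"from_user": "bob", "iso_language_code": "fr", "retweets": 5},
    {"from_user": "alice", "iso_language_code": "en", "retweets": 2},
    {"from_user": "carol", "iso_language_code": "en", "retweets": 7},
]

# FILTER: keep only English-language tweets.
english = [t for t in tweets if t["iso_language_code"] == "en"]

# GROUP ... BY from_user, then FOREACH ... GENERATE group, SUM(retweets).
retweets_by_user = defaultdict(int)
for t in english:
    retweets_by_user[t["from_user"]] += t["retweets"]

print(dict(retweets_by_user))
```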
DUMP STORE
DUMP STORE Pig
Pig DUMP
DUMP STORE
DUMP
Pig Hadoop
Pig Pig
Java Grunt
Hadoop Pig
Pig
map reduce
67
map reduce
Hive
Pig
Facebook Hadoop
SQL
Hadoop Hive SQL
SQL Hive Query Language (HQL)
HQL HQL
Hive MapReduce Hadoop
SQL
(DBMS) Hive
Hive shell Hive
JDBC/ODBC Java Database Connectivity (JDBC) Open
Database Connectivity (ODBC) Hive
Thrift Client Hive Thrift Client
Hive C++, Java, PHP,
Python, Ruby Hive Thrift Client
SQL DB2 Informix
Hive
CREATE TABLE Tweets(from_user STRING, userid BIGINT, tweettext STRING, retweets INT)
COMMENT 'This is the Twitter feed table'
STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'hdfs://node/tweetdata' INTO TABLE Tweets;
SELECT from_user, SUM(retweets)
FROM Tweets
GROUP BY from_user;
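The SELECT statement above is ordinary SQL aggregation, so its semantics can be tried locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive. It is not Hive: there is no `STORED AS` clause, no HDFS, and no MapReduce job behind the query. The inserted rows are invented, and an `ORDER BY` is added only to make the printed result deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tweets (from_user TEXT, userid INTEGER, tweettext TEXT, retweets INTEGER)"
)
# Hypothetical rows standing in for the loaded Twitter feed.
conn.executemany(
    "INSERT INTO Tweets VALUES (?, ?, ?, ?)",
    [("alice", 1, "hello", 3), ("bob", 2, "bonjour", 5), ("alice", 1, "again", 2)],
)
rows = conn.execute(
    "SELECT from_user, SUM(retweets) FROM Tweets GROUP BY from_user ORDER BY from_user"
).fetchall()
print(rows)
conn.close()
```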
Hive SQL
Hive Hadoop MapReduce
Hadoop Hive Hadoop
Hive
DB2
Hive
Jaql
Jaql JavaScript Object Notation (JSON)
JSON
IBM IBM
Jaql HDFS
Pig Hive Jaql
Lisp, SQL, XQuery, Pig Jaql
Jaql
MapReduce
Jaql JSON
Jaql
JSON
JSON /
MapReduce Hadoop
//
JSON
69
Jaql
Jaql Jaql
Twitter
FILTER FILTER
SQL FILTER
WHERE
70
Twitter eatonchris
71
Jaql
Jaql
HDFS
100
Jaql
Jaql
MapReduce Jaql
72
Jaql Jaql
tweet
$tweets = read(hdfs("tweet_log"));
$tweets
-> filter $.iso_language_code == "en"
-> group by u = $.from_user
   into { user: u, total: sum($[*].retweet) };
HDFS $tweetsJaql
$tweets FILTER
iso_language_code = en tweet GROUP BY
tweet
Jaql map reduce
Hadoop
Jaql JSON Jaql
Jaql
XML, CSV
Jaql
Java, JavaScript,
Python, Perl, Ruby
Hadoop Streaming
Java map reduce
Hadoop Streaming StreamingAPI Streaming
UNIX
73
Hadoop
Hadoop
HDFS POSIX
HDFS HDFS
HDFS API GPFS-SNC
GPFS-SNC
GPFS-SNC Hadoop
HDFS
FlumeHadoop
Hadoop API
shell HDFS
HDFS copyFromLocal
74
HDFS copyToLocal
Flume
(flume)
Flume Apache
Hadoop
Flume sourcedecorator sink
source Flume
sink Flume
75
decorator
Flume TCP
(stdin)
Web
TAIL
exit
Flume sink
Collector Tier Event
HDFS
Flume Agent Tier Event
Flume
Basic
HDFS bucket
Hadoop
Flume
76
Hadoop
Hadoop Hadoop
Apache
ZooKeeperHBase Oozie Lucene
4 Hadoop
InfoSphere BigInsights
ZooKeeper
ZooKeeper Apache
ZooKeeper
500 Hadoop
10
Hadoop
ZooKeeper
ZooKeeper
Java C
ZooKeeper ZooKeeper
ZooKeeper
ZooKeeper
ZooKeeper
Hadoop ZooKeeper
ZooKeeper
77
ZooKeeper znode
ZooKeeper znode
znode ZooKeeper
znode znode
ZooKeeper znode
znode
HBase
HBase HDFS
HBase SQL HBase
HBase Java MapReduce
HBase AvroREST Thrift
Avro
Google
HBase
HBase
HBase
timestamp
servernameHBase
(column family)
HBase
78
Oozie
MapReduce
Oozie
Oozie
4-3 Oozie
Lucene
Lucene Apache
Lucene Hadoop
79
MR1
Pig
Java
MR2
HDFS
4-3
MR3
Oozie
Lucene
Lucene
Lucene
Lucene
BigInsights Jaql
Jaql Lucene
BigInsights
BigInsights
MapReduce
BigInsights Hadoop
HDFS Lucene
Hadoop GPFS-SNC
80
Avro
Avro Apache Avro
Avro
JSON Avro
Jaql JSON
Avro API
STRING, INT[eger], LONG, FLOAT, DOUBLE, BYTE, NULL,
BOOLEAN; record, array,
enum, map, union,
fixed
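Avro schemas are themselves written in JSON, so the primitive and complex types listed above can be shown in a small schema. The record below is a hypothetical example: its first four field names are borrowed from the Hive table earlier in the chapter and the last two are invented. It is parsed with Python's standard `json` module only; no Avro library is assumed.

```python
import json

# Hypothetical Avro schema combining primitive types with array and union types.
tweet_schema_json = """
{
  "type": "record",
  "name": "Tweet",
  "fields": [
    {"name": "from_user", "type": "string"},
    {"name": "userid", "type": "long"},
    {"name": "tweettext", "type": "string"},
    {"name": "retweets", "type": "int"},
    {"name": "hashtags", "type": {"type": "array", "items": "string"}},
    {"name": "reply_to", "type": ["null", "string"]}
  ]
}
"""

schema = json.loads(tweet_schema_json)
field_types = {field["name"]: field["type"] for field in schema["fields"]}
print(field_types["retweets"])
```

The `["null", "string"]` entry is a union, which is the usual way to mark a field as optional.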
C, C++, C#, Java, Python, Ruby, PHP Avro
API Hadoop
Hadoop
IBM InfoSphere
BigInsights
IBM Hadoop
IBM
5
InfoSphere BigInsights
Hadoop
Hadoop
Hadoop Apache
Hadoop 2006
Hadoop
1.0
Hadoop
Hadoop
4 Hadoop
(HDFS) NameNode
(SPOF) 0.21
NameNode Hadoop
NameNode
Hadoop
Hadoop
81
82
MapReduce
Hadoop
Java MapReduce
Pig Jaql MapReduce
Hadoop
Hadoop
Hadoop
IBM Hadoop
IBM Hadoop
IBM
Hadoop
BigInsights IBM
IBM Hadoop
BigInsights
83
BigInsights
Apache Hadoop
Hadoop BigInsights
BigInsights Hadoop
Hadoop
(RYO)
Hadoop
Hadoop
BigInsights
Hadoop Apache Hadoop
1. Hadoop
2. Hadoop
3. Hadoop
5. Hadoop I/O
JobTracker TaskTracker
6. HDFS NameNode
NameNode
7. HADOOP_CLASSPATH, HADOOP_PID_DIR, HADOOP_HEAPSIZE, JAVA_HOME
84
Hadoop
Hadoop RYO
Hadoop Hadoop
MapReduce
Hadoop
BigInsights
BigInsights BigInsights
BigInsights
Hadoop
IBM
BigInsights
Component                   Version
Hadoop (HDFS, MapReduce)    0.20.2
Jaql                        0.5.2
Pig                         0.7
Flume                       0.9.1
Hive                        0.5
Lucene                      3.1.0
ZooKeeper                   3.2.2
Avro                        1.5.1
HBase                       0.20.6
Oozie                       2.2.2
BigInsights IBM
BigInsights Hadoop
IBM
Hadoop IBM
BigInsights
Hadoop BigInsights
Hadoop
GPFS-SNC
General Parallel File System (GPFS) IBM Research 90
(HPC) 1998 GPFS
Blue Gene
Watson Jeopardy! ASC Purple ASC Purple
GPFS 120 GB/s
HPC GPFS
GPFS DB2 pureScale Oracle RAC
86
GPFS Web
GPFS
Hadoop HDFS
HDFS
Hadoop
GPFS
Hadoop GPFS
GPFS
GPFS (SAN) Hadoop
SAN Hadoop
MapReduce
SAN
I/O
Hadoop
GPFS-SNC
Hadoop JobTracker
87
map Hadoop
HDFS
Hadoop Hadoop
HDFS Lucene GPFS-SNC
Lucene
Lucene 256 KB GPFS
Hadoop
GPFS-SNC
HDFS
HDFS
HDFS
GPFS-SNC
GPFS-SNC
88
GPFS-SNC
HDFS NameNode
GPFS IT
GPFS-SNC GPFS Hadoop
GPFS-SNC
GPFS-SNC (HSM)
HDFS
GPFS-SNC 2010 Supercom
GPFS-SNC
GPFS-SNC
5-1 GPFS-SNC
HDFS 5-1 NameNode
NameNode GPFS-SNC
HDFS
5-1
CPURAM [Quorum]
89
Q
NSD
P
CM
NSD
NSD
MN
NSD
Q
NSD
S
NSD
NSD
Q
NSD
FSM
5-1
GPFS-SNC
GPFS-SNC
GPFS-SNC (NSD)
GPFS-SNC
NSD NSD
7
GPFS-SNC
90
Cluster
Manager
GPFS-SNC (P)
(S)
GPFS-SNC SPOF
GPFS-SNC (FSM)
Cluster Manager
CPU
GPFS-SNC
5-1 Metanode (MN)GPFS-SNC
Metanode
Metanode
Metanode
Metanode Metanode
GPFS-SNC
91
GPFS-SNC
GPFS-SNC BigInsights
GPFS- SNC
Cluster Manager
BigInsights GPFS-SNC
GPFS-SNC
GPFS-SNC HDFSHadoop MapReduce
Hadoop GPFS-SNC HDFS
TaskTracker JobTracker MapReduce
Hadoop
SPOF JobTracker
HDFS
NameNode
NameNode
GPFS-SNC
NameNode HDFS
GPFS-SNC
Cluster Manager
Cluster Manager
Cluster Manager
92
Cluster Manager
Cluster Manager
Cluster Manager
Cluster Manager
GPFS-SNC POSIX
GPFS-SNC HDFS GPFS-SNC
HDFS HDFS
HDFS POSIX
GPFS-SNC POSIX Hadoop
GPFS-SNC
GPFS-SNC HDFS
Hadoop
HDFS Hadoop shell
Hadoop
93
IT
HDFS
Hadoop shell
BigInsights GPFS-SNC POSIX Hadoop
IT
Hadoop
GPFS-SNC
(PiT)
GPFS-SNC Hadoop
HDFS Hadoop
HDFS MapReduce
/
GPFS-SNC POSIX
MapReduce
GPFS-SNC
Hadoop HDFS
HDFS BigIndex
Lucene
GPFS-SNC
HDFSLucene HDFS
Lucene
94
GPFS-SNC
GPFS-SNC
GPFS
GPFS-SNC
HDFS
GPFS-SNC
stripe and mirror everythingSAME
HDFS
HDFS
GPFS-SNC GPFS-SNC
HDFS NameNode
GPFS-SNC
HDFS Hadoop
Pig Jaql
95
5-2 [Chart residue: GPFS versus HDFS on the Postmark, Teragen, Grep, CacheTest, and Terasort benchmarks; axis ticks from 0 to 2000]
GPFS HDFS
I/O
IBM Research Hadoop GPFS-SNC
HDFS GPFS-SNC HDFS
5-2
HDFS GPFS-SNC Hadoop
GPFS-SNC 10 Hadoop 16
HDFS Hadoop
GPFS-SNC Hadoop
GPFS-SNC Hadoop
IBM
Hadoop
Hadoop GPFS-SNC
Hadoop
96
Hadoop
Hadoop
mapper 5-3
Hadoop
Hadoop 0.20.2
Avro
MapReduce
mapper
5-3 1 GB Hadoop
BigInsights 128 MB
8 Hadoop
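The figures quoted above, a 1 GB file and a 128 MB block size giving 8 blocks, follow from a ceiling division. The sketch below computes the block boundaries; the helper name `block_ranges` is hypothetical and the code only illustrates the arithmetic, not how Hadoop assigns splits to mappers.

```python
import math

MB = 1024 * 1024

def block_ranges(file_size, block_size):
    """Return the (start, end) byte offsets of each block of a file."""
    count = math.ceil(file_size / block_size)
    return [(i * block_size, min((i + 1) * block_size, file_size))
            for i in range(count)]

blocks = block_ranges(1024 * MB, 128 * MB)
print(len(blocks))  # 8
```

A file that is not an exact multiple of the block size simply ends with one shorter block.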
5-4
Big data
represents
5-3
a new era
in data
exploration
and
utilization,
and IBM
Hadoop
is uniquely
positioned
a Big Data
strategy
97
0001 1010 0001 1101 1100 0100 1010 1110 0101 1100 1101 0011 0001 1010 0001 1101 1100 0101
1100
5-4
Hadoop 0.20.2
mapper
Jaql
mapper
TextInput- Format Hadoop Pig
MapReduce
CPU
Hadoop
CPU
CPU CPU
MapReduce
98
GPL
Hadoop 0.21 bzip2
bzip2 IBM LZO bzip2
5-5
0001 1010
5-5
0001 1101
1101 1100
99
mapper
Codec      File extension   Notes
IBM LZO    .cmx
bzip2      .bz2             Hadoop 0.21
gzip       .gz
DEFLATE    .deflate
BigInsights 4 IBM
LZObzip2gzip DEFLATE
Hadoop
http://stephane.lesimple.fr/wiki/blog/lzop_vs_compress_
vs_gzip_vs_bzip2_vs_lzma_vs_lzma2-xz_benchmark_reloaded
96 MB IBM LZO
LZO
(MB)
LZO
36
bzip2
19
22
0.6
5
gzip
23
10
1.3
BigInsights Web
BigInsights
BigInsights
HDFS GPFS-SNC BigInsights
100
8080
Apache Hadoop
Hadoop
BigInsights
MapReduce
ZooKeeper
BigInsights
5-6
Hadoop
HDFS Hadoop
Flume
5-6
BigInsights
101
Web
Job Summary
XML
mapper /
5-7
102
BigInsights Hadoop
Hadoop
Hadoop BigInsights
Hadoop
BigInsights LDAP
LDAP
REST HTTP
Apache Hadoop
Hadoop
BigInsights (LDAP)
LDAP LDAP
LDAP LDAPS (LDAP over SSL) BigInsights
LDAP LDAP BigInsights
BigInsights
LDAP
Hadoop Kerberos
Active Directory
BigInsights LDAP
LDAP Kerberos LDAP
BigInsights
Kerberos
103
BigInsights GPFS-SNC
HDFS GPFS-SNC
IBM
Hadoop
2
IBM
BigInsights NetezzaDB2 for Linux, UNIX and
Windows Java Database
Connectivity (JDBC) InfoSphere Streams InfoSphere Information
Server DataStage, R
Netezza
BigInsights BigInsights Netezza
Netezza Adapter Jaql
Jaql
Netezza Adapter
mapper SQL
104
mapper Netezza
Netezza UNIX
BigInsights
BigInsights Jaql
BigInsights Jaql DB2
DB2 BigInsights
Hadoop
Hadoop DB2 SQL
BigInsights BigInsights
Hadoop
IT
JDBC
Jaql JDBC JDBC
SQL
105
InfoSphere Streams
6 Streams IBM
Streams BigInsights
BigInsights Streams BigInsights
Streams BigInsights Streams
Streams BigInsights
Streams
BigInsights
Streams BigInsights
BigInsights BigInsights
Streams
BigInsights
Streams BigInsights Advanced Text Analytics Toolkit
IBM Research SystemT
Web
InfoSphere DataStage
DataStage (ETL)
106
GPFS-SNC GPFS-SNC
HDFS GPFS-SNC POSIX
DataStage BigInsights DataStage
R
BigInsights Jaql R R Project
www.r-project.org Jaql R
Jaql MapReduce R
Intelligent
Scheduler
Hadoop (FIFO)
Apache Hadoop
Fair Scheduler Capacity Scheduler
Fair Scheduler
BigInsights Capacity Scheduler
FAIR
SLA
107
Intelligent Scheduler
MapReduce
IBM Research Hadoop
IBM Research
MapReduce (Adaptive MapReduce) mapper
mapper Hadoop map
MapReduce Hadoop
(split) mapper
mapper mapper
mapper mapper
map BigInsights
MapReduce mapper
mapper Hadoop
mapper mapper
108
[Chart residue: two mapper series and an AM series; x-axis from 32 to 2048 (MB); y-axis ticks from 200 to 1000]
5-8 mapper
mapper mapper
mapper
mapper 5-8
mapper
mapper AM 32 MB
mapper mapper
MapReduce
BigInsights
5-9 [Chart residue: two mapper series and an AM series; x-axis from 16 to 1024 (MB); y-axis ticks from 300 to 1500]
TERASORT
BigSheets
BigInsights
Hadoop
BigInsights BigInsights
Hadoop BigInsights
Apache Hadoop Hadoop BigInsights
BigInsights
MapReduce
Hadoop
MapReduce BigInsights
BigSheets
Hadoop BigSheets
BigSheets
110
BigSheets
1. Web
HTTPHDFS
Amazon S3 (s3n) Amazon S3
(s3) Web
BigSheets
Twitter
BigSheets
2.
5-10 BigSheets
5-10
BigSheets
111
Run
GBTB PB
3.
BigSheets
5-11
5-11
BigSheets
112
BigSheets
Web
BigInsights
Advanced Text Analytics ToolkitIBM Research
2004 SystemTIBM
Advanced Text Analytics Toolkit IBM
Lotus NotesIBM eDiscovery AnalyzerCognos Consumer
InsightInfoSphere Warehouse Advanced Text Analytics
Toolkit
BigInsights Advanced Text Analytics Toolkit
MapReduce Advanced
Text Analytics Toolkit
113
In the 2010 World Cup of Soccer, the team from the Netherlands
distinguished themselves well, losing to Spain 1-0 in the Final. Early in the
second half, Dutch striker Arjen Robben almost changed the tide of the
game on a breakaway, only to have the ball deflected by Spanish keeper,
Iker Casillas. Near the end of regulation time, winger Andres Iniesta
scored, winning Spain the World Cup.
Name             Position     Country
Arjen Robben     Striker      Netherlands
Iker Casillas    Goalkeeper   Spain
Andres Iniesta   Winger       Spain
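The extraction shown above, from free text to structured rows, can be imitated on a very small scale with a regular expression. This is only a toy stand-in for a rule-based extractor, not AQL and not the Advanced Text Analytics Toolkit. The position dictionary and the pattern are assumptions made for this one passage, and the sketch recovers only names and positions, not countries.

```python
import re

text = (
    "In the 2010 World Cup of Soccer, the team from the Netherlands "
    "distinguished themselves well, losing to Spain 1-0 in the Final. Early in the "
    "second half, Dutch striker Arjen Robben almost changed the tide of the "
    "game on a breakaway, only to have the ball deflected by Spanish keeper, "
    "Iker Casillas. Near the end of regulation time, winger Andres Iniesta "
    "scored, winning Spain the World Cup."
)

# Hypothetical dictionary of position words that may precede a player's name.
POSITIONS = ("striker", "keeper", "winger")

# A position word, an optional comma, then two capitalized words (the name).
pattern = re.compile(
    r"\b(" + "|".join(POSITIONS) + r"),?\s+([A-Z][a-z]+ [A-Z][a-z]+)"
)

players = [(m.group(2), m.group(1)) for m in pattern.finditer(text)]
for name, position in players:
    print(name, position)
```

A rule this narrow breaks as soon as the wording changes, which is the gap that a declarative rule language with dictionaries and a tuned runtime is meant to close.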
114
AQL SQL
AQL
<>
030
phone
at
5-12
5-13
115
AQL
PB
Provenance 5-14
Advanced
Text Analytics Toolkit
116
Notes
//
URL
AQL
AQL
Advanced Text Analytics Toolkit
5-15
Advanced Text Analytics Toolkit BigInsights
Advanced
Text Analytics Toolkit BigInsights 5-16
AQL Analytics Operator
Graph (AOG) BigInsights Web AOG
>>
Person
person: Peggy
Person
person: Horton
Person
person: Stanley
Person
person: Rick Buy
Person
person: Mark Metts
5-14
Provenance
PersonCand
person: Peggy
PersonCand
person: Horton
PersonCand
person: Stanley
PersonCand
person: Stanley
PersonCand
person: Rick
PersonCand
person: Rick Buy
PersonCand
person: Mark Metts
PersonCand
person: Metts
UnionOp0
person: Rick
UnionOp4
person: RickBuy
FirstCaps
name: RickBuy
FirstName
first: Rick
CapsPerson
word: Buy
5-15 [Chart residue: Throughput (KB/sec), ticks from 0 to 700, plotted against an axis from 0 to 100 (KB); ANNIE is one of the plotted series]
Hadoop
AQL
Analytics
Operator
Graph
5-16
118
Hadoop
MapReduce
MapReduce
MapReduce
Hadoop
BigInsights
BigIndex Hadoop
119
BigIndex
ID
120
...
BigInsights
5-17
BigIndex
5-17
1.
BigInsights
Flume
Streams HDFS GPFS-SNC
Twitter
2.
3.
/
4.
Lucene
MapReduce Hadoop
Lucene
Lucene
Lucene
5.
Runtime Shard
Cluster
121
6.
7.
BigIndex
Java API Jaql REST
HTTP API
BigInsights
BigInsights
GPFS-SNC
IBM LZO
BigInsights
BigSheets
BigInsights
IBM Research Machine Learning Analytics Toolkit
IT
BigInsights IBM
6
IBM InfoSphere Streams
IBM Hadoop
IBM
BigInsights
IBM InfoSphere Streams (Streams)
Streams
(MPP)
Streams (streams)
IBM InfoSphere Streams
Streams
123
124
InfoSphere Streams
Streams
Streams BigInsights
Streams
Streams
Streams
BigInsights
(CEP) Streams
Streams
Streams
Source
Streams Sink
Streams
Streams
Streams
125
Streams
BigInsights Text Analytics Toolkit
Streams
(PE)
Streams
Streams 80%
BigInsights
Streams BigInsights IBM
Text
Analytic Toolkit Streams BigInsights
InfoSphere Streams
Streams
Streams
(FSS)
126
Streams
(UOIT)
Streams
1000
127
Streams
24
Streams
http://www.youtube.com/watch?v=QVbnrlqWG5I
(CDR)
CDR
CDR
Streams
Streams (RTAP)
Globe Telecom
Globe Telecom
10 40
CDR Internet (IPDR)
IPDR Internet (IP)
128
Streams
Streams
Streams
129
GPS
Streams
Streams
Streams
InfoSphere Streams
Streams
130
430 Streams
Streams
6-1
[Figure residue: an operator graph containing FileSource, Functor, Split, FileSink, and ODBCAppend operators]
6-1
131
Streams
N
M
Streams
132
SPL
SPL
Streams
SPL SPL
composite toUpper {
  graph
    stream<rstring line> LineStream = FileSource() {
      param file   : "input_file";
            format : line;
    }
    stream<LineStream> upperedTxt = Functor(LineStream) {
      output upperedTxt : line = upper(line);
    }
    () as Sink = FileSink(upperedTxt) {
      param file   : "/dev/stdout";
            format : line;
    }
}
SPL FileSource
LineStream Functor
upperedTxt
Sink upperedTxt
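The shape of the SPL application above, a source feeding a Functor that feeds a sink, maps naturally onto chained Python generators. The sketch below is an analogy only: the function names mirror the SPL operators but are ordinary Python, and in-memory `io.StringIO` objects stand in for the input file and `/dev/stdout`.

```python
import io

def file_source(lines):
    """Source: turn each input line into a tuple with a single 'line' attribute."""
    for line in lines:
        yield {"line": line.rstrip("\n")}

def functor_upper(stream):
    """Functor: transform each tuple, here by upper-casing its 'line' attribute."""
    for tup in stream:
        yield {"line": tup["line"].upper()}

def file_sink(stream, out):
    """Sink: write each tuple's 'line' attribute to the output."""
    for tup in stream:
        out.write(tup["line"] + "\n")

# In-memory stand-ins for "input_file" and "/dev/stdout".
source_text = io.StringIO("hello streams\nbig data\n")
out = io.StringIO()

file_sink(functor_upper(file_source(source_text)), out)
print(out.getvalue(), end="")
```

Each tuple flows through the whole chain before the next one is read, which is the same tuple-at-a-time behavior the SPL graph describes.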
Streams
Streams
133
FileSource FileSink
FileSource FileSink
txt
csv
bin
line
block BLOB
TCPSource/UDPSource TCPSink/UDPSink
TCPSource TCPSink Streams
TCP IP IPv4 IPv6
Export Import
export import export
134
streamID ID
streamID export
import Streams
MetricsSink
MetricsSink
(named meter)
Streams Studio
MetricsSink
Streams
Streams
Streams
Filter
filter
Streams filter
Functor
functor
135
functor
Punctor
punctor
punctor
functor
0
Sort
sort
Streams
count
delta
time
punctuation punctor
136
Streams
Join
join
inner joins
outer joins
sort
Aggregate
aggregate
Sort
aggregate groupBy partitionBy
aggregate
count, sum, average, max, min, first, last, count distinct
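The aggregate functions listed above are applied to a window of tuples rather than to the whole, unbounded stream. The sketch below assumes the simplest case, a count-based tumbling window that is emitted and emptied every `size` tuples. The function name and the window policy are illustrative assumptions, not the Streams Aggregate operator itself.

```python
def tumbling_aggregate(stream, size):
    """Collect `size` values, emit their aggregates, then start a fresh window."""
    window = []
    for value in stream:
        window.append(value)
        if len(window) == size:
            yield {
                "count": len(window),
                "sum": sum(window),
                "average": sum(window) / len(window),
                "max": max(window),
                "min": min(window),
                "first": window[0],
                "last": window[-1],
            }
            window = []  # tumbling: the emitted values are discarded

# Hypothetical readings; the trailing value never fills a window and is not emitted.
results = list(tumbling_aggregate([4, 8, 6, 10, 2, 3, 7], size=3))
for row in results:
    print(row)
```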
Beacon
beacon
beacon
n/10/ n
beacon Streams
137
Throttle Delay
throttle delay
throttle
throttle /
delay delay
delay
A
B 10 B C 3 delay
Split Union
split
union
Streams
Streams
Streams
Streams
Database Toolkit Financial Markets
Toolkit
Database ToolkitRelational
Database Toolkit ODBC SolidDB
138
Streams
ODBCAppend
SQL INSERT
ODBCEnrich
ODBCSource
SolidDBEnrich SolidDB
FIX
WebSphere MQ
WebSphere Front Office
Streams
IBM
IT
139
Streams Streams
Streams
IBM
Streams Streams
SPL
SPL
SPL
SPL
(PE)
PE
B
PE Streams
PE Streams
140
PE
PE
PE
stopped PE
PE
Streams PE
RecoveryMode=ON
Streams Eclipse
InfoSphere Streams Studio (Streams Studio)
Streams SPL
Eclipse Streams Studio
Streams Streams Explorer Streams
Streams
Streams
141
Streams
WebSphere
Eclipse IBM
Rational Streams
BigInsights Hadoop
BigInsights Streams BigInsights
Streams BigInsights
IBM
IBM
BigInsights wiki
BigInsights Twitter Facebook
ibm.com/developerworks/wiki/biginsights
IBM
BigInsights (M97)
InfoSphere Streams (N08)
ibm.com/certify/mastery_tests
IBM
ibm.com/software/data/education
ibm.com/software/data/education/bookstore