gluent.com 1
Connecting Hadoop and Oracle

Tanel Poder
a long time computer performance geek
gluent.com 2
Intro: About me
• Tanel Põder
• Oracle Database Performance geek
• Exadata Performance geek
• Linux Performance geek
• Hadoop Performance geek
• CEO & co-founder: Gluent
Expert Oracle Exadata book (2nd edition is out now!)
Instant promotion
gluent.com 3
Clouds are here to stay!
Private data centers aren't going anywhere
plumbing
gluent.com 4
Hadoop is here to stay!
Traditional DBMSs aren't going anywhere
plumbing
gluent.com 5
All Enterprise Data Available in Hadoop!
[Diagram: Gluent syncs data between Hadoop and MSSQL, Teradata, IBM DB2, Oracle, other Big Data sources, and apps X, Y, Z]
gluent.com 6
Gluent Offload Engine
[Diagram: Gluent sits between Hadoop and MSSQL, Teradata, IBM DB2, Oracle, Big Data sources, and apps X, Y, Z: access any data source in any DB & app, push processing to Hadoop]
gluent.com 7
Big Data Plumbers
gluent.com 8
Did you know?
• Cloudera Impala now has analytic functions and can spill workarea buffers to disk
  • …since v2.0 in Oct 2014
• Hive has "storage indexes" (Impala not yet)
  • Hive 0.11 with ORC format, Impala possibly with the new Kudu storage
• Hive supports row-level changes & ACID transactions
  • …since v0.14 in Nov 2014
  • Hive uses a base + delta table approach (not for OLTP!)
• SparkSQL is the new kid on the block
  • Actually, Spark is old news already -> Apache Flink
gluent.com 9
Scalability vs. Features (Hive example)

$ hive (version 1.2.1 HDP 2.3.1)
hive> SELECT SUM(duration)
    > FROM call_detail_records
    > WHERE
    >     type = 'INTERNET'
    >     OR phone_number IN ( SELECT phone_number
    >                          FROM customer_details
    >                          WHERE region = 'R04' );
FAILED: SemanticException [Error 10249]: Line 5:17
Unsupported SubQuery Expression 'phone_number':
Only SubQuery expressions that are top level
conjuncts are allowed

We Oracle users have been spoiled with a very sophisticated SQL engine for years :)
gluent.com 10
Scalability vs. Features (Impala example)

$ impala-shell
SELECT SUM(order_total)
FROM orders
WHERE order_mode='online'
OR customer_id IN (SELECT customer_id FROM customers
                   WHERE customer_class = 'Prime');
Query: select SUM(order_total) FROM orders WHERE
order_mode='online' OR customer_id IN (SELECT
customer_id FROM customers WHERE customer_class =
'Prime')
ERROR: AnalysisException: Subqueries in OR predicates are
not supported: order_mode = 'online' OR customer_id IN
(SELECT customer_id FROM soe.customers WHERE
customer_class = 'Prime')

Cloudera: CDH5.3, Impala 2.1.0
gluent.com 11
Hadoop vs. Oracle Database: One way to look at it
• Hadoop: Cheap, Scalable, Flexible
• Oracle: Sophisticated, Out-of-the-box, Mature
Big Data Plumbing?
gluent.com 12
3 major big data plumbing scenarios
1. Load data from Hadoop to Oracle
2. Offload data from Oracle to Hadoop
3. Query Hadoop data in Oracle
A big difference between data copy and on-demand data query tools
gluent.com 13
Right tool for the right problem!
How to know what's right?
gluent.com 14
Sqoop
• A parallel JDBC <-> HDFS data copy tool
• Generates MapReduce jobs that connect to Oracle with JDBC
[Diagram: SQOOP spawns MapReduce jobs that read from Oracle DB over JDBC and write files to Hadoop/HDFS, updating Hive metadata]
gluent.com 15
Example: Copy Data to Hadoop with Sqoop
• Sqoop is "just" a JDBC DB client capable of writing to HDFS
• It is very flexible

sqoop import --connect jdbc:oracle:thin:@oel6:1521/LIN112 \
  --username system \
  --null-string '' \
  --null-non-string '' \
  --target-dir=/user/impala/offload/tmp/ssh/sales \
  --append -m1 --fetch-size=5000 \
  --fields-terminated-by ',' --lines-terminated-by '\n' \
  --optionally-enclosed-by '"' --escaped-by '"' \
  --split-by TIME_ID \
  --query "SELECT * FROM SSH.SALES WHERE TIME_ID < TO_DATE('1998-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN') AND \$CONDITIONS"

Just an example of sqoop syntax
gluent.com 16
Sqoop use cases
1. Copy data from Oracle to Hadoop
  • Entire schemas, tables, partitions or any SQL query result
2. Copy data from Hadoop to Oracle
  • Read HDFS files, convert to JDBC arrays and insert
3. Copy changed rows from Oracle to Hadoop
  • Sqoop supports adding WHERE syntax to table reads:
    • WHERE last_chg > TIMESTAMP'2015-01-10 12:34:56'
    • WHERE ora_rowscn > 123456789
gluent.com 17
Sqoop with Oracle performance optimizations - 1
• Is called OraOOP
  • Was a separate module by Guy Harrison's team at Quest (Dell)
  • Now included in standard Sqoop v1.4.5 (CDH5.1)
  • sqoop import --direct …
• Parallel Sqoop Oracle reads done by block ranges or partitions
  • Previously required wide index range scans to parallelize the workload
• Reads Oracle data with full table scans and direct path reads
  • Much less IO, CPU and buffer cache impact on production databases
gluent.com 18
Sqoop with Oracle performance optimizations - 2
• Data loading into Oracle via direct path insert or merge
  • Bypasses buffer cache and most redo/undo generation
• HCC compression on-the-fly for loaded tables
  • ALTER TABLE fact COMPRESS FOR QUERY HIGH
  • All following direct path inserts will be compressed
• Sqoop is probably the best tool for the data copy/move use case
  • For batch data movement needs
  • If used with the Oracle optimizations
gluent.com 19
What about real-time change replication?
• Oracle Change Data Capture
  • Supported in 11.2 - but not recommended by Oracle anymore
  • Desupported in 12.1
• Oracle GoldenGate or DBVisit Replicate
  1. Sqoop the source table snapshots to Hadoop
  2. Replicate changes to "delta" tables on HDFS or in HBase
  3. Use a "merge" view in Hive/Impala to merge the base table with the delta
• Other vendors:
  • (Dell) Quest SharePlex - mines redo logs too
  • (VMware) Continuent Tungsten - uses CDC under the hood!
gluent.com 20
Query Hadoop data as if it was in the (Oracle) RDBMS
• The ultimate goal (for me at least :)
  • Flexible
  • Scalable
  • Affordable at scale *
gluent.com 21
What do SQL queries do?
1. Retrieve data
  • Disk reads, decompression, filtering, column extraction & pruning
2. Process data
  • Join, parallel distribute, aggregate, sort
3. Return results
  • Final result column projection, send data back over network …or dump to disk
gluent.com 22
How does the query offloading work? (Ext. tab)

-----------------------------------------------------------
| Id | Operation | Name |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT ORDER BY | |
|* 2 | FILTER | |
| 3 | HASH GROUP BY | |
|* 4 | HASH JOIN | |
|* 5 | HASH JOIN | |
|* 6 | EXTERNAL TABLE ACCESS FULL| CUSTOMERS_EXT |
|* 7 | EXTERNAL TABLE ACCESS FULL| ORDERS_EXT |
| 8 | EXTERNAL TABLE ACCESS FULL | ORDER_ITEMS_EXT |
-----------------------------------------------------------

• The data retrieval nodes (the EXTERNAL TABLE ACCESS FULL steps) do what they need to produce rows/cols
  • Read Oracle datafiles, read external tables, call ODBC, read memory
• In the rest of the execution plan (the data processing nodes), everything works as usual
gluent.com 23
How does the query offloading work?
Data retrieval nodes produce rows/columns; data processing nodes consume them.
Data processing nodes don't care how and where the rows were read from; the data flow format is always the same.
gluent.com 24
How does the query offloading work? (ODBC)

--------------------------------------------------------
| Id | Operation | Name | Inst |IN-OUT|
--------------------------------------------------------
| 0 | SELECT STATEMENT | | | |
| 1 | SORT ORDER BY | | | |
|* 2 | FILTER | | | |
| 3 | HASH GROUP BY | | | |
|* 4 | HASH JOIN | | | |
|* 5 | HASH JOIN | | | |
| 6 | REMOTE | orders | IMPALA | R->S |
| 7 | REMOTE | customers | IMPALA | R->S |
| 8 | REMOTE | order_items | IMPALA | R->S |
--------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------
6 - SELECT `order_id`,`order_mode`,`customer_id`,`order_status` FROM `soe`.`orders`
    WHERE `order_mode`='online' AND `order_status`=5 (accessing 'IMPALA.LOCALDOMAIN' )
7 - SELECT `customer_id`,`cust_first_name`,`cust_last_name`,`nls_territory`,
    `credit_limit` FROM `soe`.`customers` WHERE `nls_territory`
    LIKE 'New%' (accessing 'IMPALA.LOCALDOMAIN' )
8 - SELECT `order_id`,`unit_price`,`quantity` FROM `soe`.`order_items` (accessing
    'IMPALA.LOCALDOMAIN' )

The data retrieval nodes (REMOTE) run remote Hadoop SQL; the data processing above them is business as usual.
gluent.com 25
Evaluate Hadoop -> Oracle SQL processing
• Who reads disk?
• Who and when filters the rows and prunes columns?
• Who decompresses data?
• Who converts the resultset to Oracle internal datatypes?
• Who pays the CPU price?!
gluent.com 26
Sqooping data around
[Diagram: Sqoop MapReduce jobs do the HDFS IO (+ filter) and push rows over JDBC; Oracle does the datatype conversion, writes the loaded table (write IO), then scans, joins and aggregates it (read IO)]
gluent.com 27
Oracle SQL Connector for HDFS
[Diagram: an Oracle external table reads HDFS via libhdfs/libjvm (a Java HDFS client); Oracle does the filtering + datatype conversion, then joins, processing etc. No filter pushdown into Hadoop. Parses text files or reads pre-converted DataPump files]
gluent.com 28
Query using DB links & ODBC Gateway
[Diagram: Impala/Hive does the IO + filtering (decompress, filter, project) inside Hadoop; results flow over the Thrift protocol through an ODBC driver to the ODBC Gateway; Oracle does the datatype conversion, joins, processing etc.]
gluent.com 29
Big Data SQL
[Diagram: Oracle storage cell software on the Oracle BDA does the IO + filtering (decompress, filter, project) against HDFS via a Hive external table; rows flow over iDB to the Oracle Exadata DB, which does the datatype conversion, joins, processing etc.]
gluent.com 30
Oracle <-> Hadoop data plumbing tools - capabilities

Tool                           LOAD DATA  OFFLOAD DATA  ALLOW QUERY DATA  OFFLOAD QUERY  PARALLEL EXECUTION
Sqoop                          Yes        Yes           -                 -              Yes
Oracle Loader for Hadoop       Yes        -             -                 -              Yes
Oracle SQL Connector for HDFS  Yes        -             Yes               -              Yes
ODBC Gateway                   Yes        -             Yes               Yes            -
Big Data SQL                   Yes        -             Yes               Yes            Yes
gluent.com 31
Oracle <-> Hadoop data plumbing tools - overhead

Tool                           Extra disk writes  Decompress CPU  Filtering CPU  Datatype conversion
Sqoop                          Oracle             Hadoop          Oracle         Oracle
Oracle Loader for Hadoop       Oracle             Hadoop          Oracle         Hadoop
Oracle SQL Connector for HDFS  -                  -               Oracle         Oracle / Hadoop *
ODBC Gateway                   -                  Hadoop          Hadoop         Oracle
Big Data SQL                   -                  Hadoop          Hadoop         Hadoop

* Parse text files on HDFS, or read pre-converted DataPump binary (OK for an occasional archive query)
Big Data SQL: Oracle 12c + BDA + Exadata only
gluent.com 32
Example Use Case: Data Warehouse Offload
gluent.com 33
Dimensional DW design
• DW FACT TABLES in TeraBytes, DIMENSION TABLES in GigaBytes
• Fact tables time-partitioned: RANGE (order_date), HASH(customer_id)
• Months to years of history; old fact data rarely updated (hotness of data)
• After multiple joins on dimension tables, a full scan is done on the fact table
• Some filter predicates directly on the fact table (time range, country code etc)
gluent.com 34
DW Data Offload to Hadoop
[Diagram: the cold part of the time-partitioned fact table and copies of the dimension tables move to cheap scalable storage (e.g. HDFS + Hive/Impala) spread across Hadoop nodes; hot data and the dimension tables stay on expensive storage in Oracle]
gluent.com 35
1. Copy Data to Hadoop with Sqoop
• Sqoop is "just" a JDBC DB client capable of writing to HDFS
• It is very flexible

sqoop import --connect jdbc:oracle:thin:@oel6:1521/LIN112 \
  --username system \
  --null-string '' \
  --null-non-string '' \
  --target-dir=/user/impala/offload/tmp/ssh/sales \
  --append -m1 --fetch-size=5000 \
  --fields-terminated-by ',' --lines-terminated-by '\n' \
  --optionally-enclosed-by '"' --escaped-by '"' \
  --split-by TIME_ID \
  --query "SELECT * FROM SSH.SALES WHERE TIME_ID < TO_DATE('1998-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN') AND \$CONDITIONS"

Just an example of sqoop syntax
gluent.com 36
2. Load Data into Hive/Impala in read-optimized format
• The additional external table step gives better flexibility when loading into the target read-optimized table

CREATE TABLE IF NOT EXISTS SSH_tmp.SALES_ext (
  PROD_ID bigint, CUST_ID bigint, TIME_ID timestamp, CHANNEL_ID bigint,
  PROMO_ID bigint, QUANTITY_SOLD bigint, AMOUNT_SOLD bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '"' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/user/impala/offload/tmp/ssh/sales';

INSERT INTO SSH.SALES PARTITION (year)
SELECT t.*, CAST(YEAR(time_id) AS SMALLINT)
FROM SSH_tmp.SALES_ext t;
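For completeness, the read-optimized target table could look something like this; the deck doesn't show its DDL, so this is a sketch with the same columns and the year partition key the INSERT above assumes:

```sql
-- Partitioned, Parquet-backed target table for the INSERT above
CREATE TABLE IF NOT EXISTS SSH.SALES (
  PROD_ID bigint, CUST_ID bigint, TIME_ID timestamp, CHANNEL_ID bigint,
  PROMO_ID bigint, QUANTITY_SOLD bigint, AMOUNT_SOLD bigint
)
PARTITIONED BY (year smallint)
STORED AS PARQUET;
```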
gluent.com 37
3a. Query Data via a DB Link/HS/ODBC
• (I'm not covering the ODBC driver install and config here)

SQL> EXPLAIN PLAN FOR SELECT COUNT(*) FROM ssh.sales@impala;
Explained.

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
--------------------------------------------------------------
| Id | Operation | Name | Cost (%CPU)| Inst |IN-OUT|
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 0 (0)| | |
| 1 | REMOTE | | | IMPALA | R->S |
--------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------
1 - SELECT COUNT(*) FROM `SSH`.`SALES` A1 (accessing 'IMPALA')

The entire query gets sent to Impala thanks to Oracle Heterogeneous Services
gluent.com 38
3b. Query Data via Oracle SQL Connector for HDFS

CREATE TABLE SSH.SALES_DP
( "PROD_ID" NUMBER,
  "CUST_ID" NUMBER,
  "TIME_ID" DATE,
  "CHANNEL_ID" NUMBER,
  "PROMO_ID" NUMBER,
  "QUANTITY_SOLD" NUMBER,
  "AMOUNT_SOLD" NUMBER
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY "OFFLOAD_DIR"
  ACCESS PARAMETERS
  ( external variable data
    PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
  )
  LOCATION
  ( 'osch-tanel-00000', 'osch-tanel-00001',
    'osch-tanel-00002', 'osch-tanel-00003',
    'osch-tanel-00004', 'osch-tanel-00005'
  )
)
REJECT LIMIT UNLIMITED
NOPARALLEL

• The external variable data option says it's a DataPump format input stream
• hdfs_stream is the Java HDFS client app installed into your Oracle server as part of the Oracle SQL Connector for Hadoop
• The location files are small .xml config files created by Oracle Loader for Hadoop (telling where in HDFS our files are)
gluent.com 39
3b. Query Data via Oracle SQL Connector for HDFS
• External table access is parallelizable!
  • Multiple HDFS location files must be created for PX (done by the loader)

SQL> SELECT /*+ PARALLEL(4) */ COUNT(*) FROM ssh.sales_dp;
----------------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | SORT AGGREGATE | | | | |
| 2 | PX COORDINATOR | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | Q1,00 | PCWP | |
| 5 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 6 | EXTERNAL TABLE ACCESS FULL| SALES_DP | Q1,00 | PCWP | |
----------------------------------------------------------------------------------

• Oracle Parallel Slaves can each access a different DataPump file on HDFS
• Any filter predicates are not offloaded; all data/columns are read!
gluent.com 40
3c. Query via Big Data SQL

SQL> CREATE TABLE movielog_plus
2 (click VARCHAR2(40))
3 ORGANIZATION EXTERNAL
4 (TYPE ORACLE_HDFS
5 DEFAULT DIRECTORY DEFAULT_DIR
6 ACCESS PARAMETERS (
7 com.oracle.bigdata.cluster=bigdatalite
8 com.oracle.bigdata.overflow={"action":"truncate"}
9 )
10 LOCATION ('/user/oracle/moviework/applog_json/')
11 )
12 REJECT LIMIT UNLIMITED;

• This is just one simple example: ORACLE_HIVE would use Hive metadata to figure out the data location and structure (but the storage cells do the disk reading from HDFS directly)
gluent.com 41
How to query the full dataset?
• UNION ALL
  • Latest partitions in Oracle: (SALES table)
  • Offloaded data in Hadoop: (SALES@impala or SALES_DP ext tab)

CREATE VIEW app_reporting_user.SALES AS
SELECT PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID, QUANTITY_SOLD,
       AMOUNT_SOLD
FROM SSH.SALES
WHERE TIME_ID >= TO_DATE(' 1998-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS')
UNION ALL
SELECT "prod_id", "cust_id", "time_id", "channel_id", "promo_id",
       "quantity_sold", "amount_sold"
FROM SSH.SALES@impala
WHERE "time_id" < TO_DATE(' 1998-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS')
gluent.com 42
Hybrid table with selective offloading
[Diagram: the SALES union-all view combines EXTERNAL TABLE ACCESS / DBLINK over cold data on cheap scalable storage (e.g. HDFS + Hive/Impala across Hadoop nodes) with TABLE ACCESS FULL over hot data on expensive storage; the plan above it is the usual SELECT STATEMENT -> GROUP BY -> HASH JOIN]

SELECT
  c.cust_gender, SUM(s.amount_sold)
FROM ssh.customers c, sales_v s
WHERE c.cust_id = s.cust_id
GROUP BY c.cust_gender
gluent.com 43
Partially Offloaded Execution Plan
--------------------------------------------------------------------------
| Id | Operation | Name |Pstart| Pstop | Inst |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | HASH GROUP BY | | | | |
|* 2 | HASH JOIN | | | | |
| 3 | PARTITION RANGE ALL | | 1 | 16 | |
| 4 | TABLE ACCESS FULL | CUSTOMERS | 1 | 16 | |
| 5 | VIEW | SALES_V | | | |
| 6 | UNION-ALL | | | | |
| 7 | PARTITION RANGE ITERATOR| | 7 | 68 | |
| 8 | TABLE ACCESS FULL | SALES | 7 | 68 | |
| 9 | REMOTE | SALES | | | IMPALA |
--------------------------------------------------------------------------
2 - access("C"."CUST_ID"="S"."CUST_ID")
Remote SQL Information (identified by operation id):
----------------------------------------------------
9 - SELECT `cust_id`,`time_id`,`amount_sold` FROM `SSH`.`SALES`
WHERE `time_id`<'1998-07-01 00:00:00' (accessing 'IMPALA' )
gluent.com 44
Performance
• Vendor X: "We load N Terabytes per hour"
• Vendor Y: "We load N Billion rows per hour"
• Q: What datatypes?
  • So much cheaper to load fixed CHARs
  • …but that's not what your real data model looks like.
• Q: How wide rows?
  • 1 wide varchar column or 500 numeric columns?
  • Datatype conversion is CPU hungry!
Make sure you know where in the stack you'll be paying the price. And is it once, or every time the data is accessed?
gluent.com 45
Performance: Data Retrieval Latency (time to first row/byte)
• Cloudera Impala
  • Low latency (subsecond)
• Hive
  • Use latest Hive 0.14+ with all the bells'n'whistles (Stinger, TEZ+YARN)
  • Otherwise you'll wait for jobs and JVMs to start up for every query
• Oracle SQL Connector for HDFS
  • Multisecond latency due to hdfs_stream JVM startup
• Oracle Big Data SQL
  • Low latency thanks to "Exadata storage cell software on Hadoop"
gluent.com 46
Big Data SQL Performance Considerations
• Only data retrieval (the TABLE ACCESS FULL etc) is offloaded!
  • Filter pushdown etc
• All data processing still happens in the DB layer
  • GROUP BY, JOIN, ORDER BY, Analytic Functions, PL/SQL etc
• Just like on Exadata…
gluent.com 47
Thanks!
We are hiring [bad ass] developers & data engineers!!!
http://gluent.com
Also, the usual:
http://blog.tanelpoder.com
tanel@tanelpoder.com
Oracle Latch and Mutex Contention Troubleshooting
 
Oracle LOB Internals and Performance Tuning
Oracle LOB Internals and Performance TuningOracle LOB Internals and Performance Tuning
Oracle LOB Internals and Performance Tuning
 

Recently uploaded

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Recently uploaded (20)

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 

Connecting Hadoop and Oracle

  • 1. gluent.com: Connecting Hadoop and Oracle. Tanel Poder, a long-time computer performance geek
  • 2. Intro: About me. Tanel Põder: Oracle Database performance geek, Exadata performance geek, Linux performance geek, Hadoop performance geek; CEO & co-founder. Expert Oracle Exadata book (2nd edition is out now!). Instant promotion
  • 3. Clouds are here to stay! Private data centers aren't going anywhere. (plumbing)
  • 4. Hadoop is here to stay! Traditional DBMSs aren't going anywhere. (plumbing)
  • 5. All Enterprise Data Available in Hadoop! (Diagram: Gluent connects Oracle, MSSQL, Teradata, IBM DB2 and big data sources to Hadoop, serving apps X, Y and Z)
  • 6. Gluent Offload Engine: access any data source from any DB & app; push processing to Hadoop. (Same diagram: Oracle, MSSQL, Teradata, IBM DB2, big data sources; apps X, Y, Z)
  • 8. Did you know? Cloudera Impala now has analytic functions and can spill workarea buffers to disk (since v2.0, Oct 2014). Hive has "storage indexes" (Impala not yet): Hive 0.11 with the ORC format, Impala possibly with the new Kudu storage. Hive supports row-level changes & ACID transactions (since v0.14, Nov 2014); Hive uses a base + delta table approach (not for OLTP!). SparkSQL is the new kid on the block. Actually, Spark is old news already -> Apache Flink
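Hive's ACID support layers ordered delta files over an immutable base and reconciles them at read time. A minimal sketch of that merge-on-read idea, purely conceptual (the data structures are invented here and are not Hive's actual file format):

```python
# Merge-on-read sketch: a base snapshot plus an ordered delta log,
# mirroring Hive's base + delta ACID approach at a conceptual level.
def merge_on_read(base, deltas):
    """base: dict key -> row; deltas: ordered (op, key, row) tuples."""
    merged = dict(base)  # never mutate the base snapshot itself
    for op, key, row in deltas:
        if op in ("insert", "update"):
            merged[key] = row
        elif op == "delete":
            merged.pop(key, None)
    return merged

base = {1: "alice", 2: "bob"}
deltas = [("update", 2, "bobby"), ("delete", 1, None), ("insert", 3, "carol")]
print(merge_on_read(base, deltas))  # {2: 'bobby', 3: 'carol'}
```

Periodic compaction (in Hive, major compaction) would rewrite the base with the deltas applied, which is why this design suits batch changes rather than OLTP.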
  • 9. Scalability vs. Features (Hive example). $ hive (version 1.2.1, HDP 2.3.1) hive> SELECT SUM(duration) FROM call_detail_records WHERE type = 'INTERNET' OR phone_number IN (SELECT phone_number FROM customer_details WHERE region = 'R04'); FAILED: SemanticException [Error 10249]: Line 5:17 Unsupported SubQuery Expression 'phone_number': Only SubQuery expressions that are top level conjuncts are allowed. We Oracle users have been spoiled with a very sophisticated SQL engine for years :)
  • 10. Scalability vs. Features (Impala example). $ impala-shell SELECT SUM(order_total) FROM orders WHERE order_mode='online' OR customer_id IN (SELECT customer_id FROM customers WHERE customer_class = 'Prime'); ERROR: AnalysisException: Subqueries in OR predicates are not supported: order_mode = 'online' OR customer_id IN (SELECT customer_id FROM soe.customers WHERE customer_class = 'Prime') (Cloudera CDH5.3, Impala 2.1.0)
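The classic workaround for engines that reject an `IN` subquery inside an `OR` is to rewrite the disjunction as a `LEFT JOIN` against the (distinct) subquery result and test for a match. A sketch using SQLite as a stand-in for Hive/Impala; the table and column names come from the slide, the rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INT, order_total INT, order_mode TEXT, customer_id INT);
CREATE TABLE customers (customer_id INT, customer_class TEXT);
INSERT INTO orders VALUES (1, 10, 'online', 100), (2, 20, 'direct', 101),
                          (3, 30, 'direct', 102);
INSERT INTO customers VALUES (101, 'Prime'), (102, 'Basic');
""")

# Original form (the one Impala 2.1 / old Hive rejects):
original = conn.execute("""
SELECT SUM(order_total) FROM orders
WHERE order_mode = 'online'
   OR customer_id IN (SELECT customer_id FROM customers
                      WHERE customer_class = 'Prime')
""").fetchone()[0]

# Rewrite: turn the IN subquery into a LEFT JOIN and test for a join match.
rewritten = conn.execute("""
SELECT SUM(o.order_total)
FROM orders o
LEFT JOIN (SELECT DISTINCT customer_id FROM customers
           WHERE customer_class = 'Prime') p
  ON o.customer_id = p.customer_id
WHERE o.order_mode = 'online' OR p.customer_id IS NOT NULL
""").fetchone()[0]

print(original, rewritten)  # 30 30
```

The `DISTINCT` in the derived table matters: without it, a duplicated key on the subquery side would multiply matching order rows and inflate the sum.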
  • 11. Hadoop vs. Oracle Database: one way to look at it. Hadoop: cheap, scalable, flexible. Oracle: sophisticated, out-of-the-box, mature. Big data plumbing?
  • 12. 3 major big data plumbing scenarios: 1. Load data from Hadoop to Oracle; 2. Offload data from Oracle to Hadoop; 3. Query Hadoop data in Oracle. A big difference between data copy and on-demand data query tools
  • 13. Right tool for the right problem! How to know what's right?
  • 14. Sqoop: a parallel JDBC <-> HDFS data copy tool. Generates MapReduce jobs that connect to Oracle with JDBC. (Diagram: Oracle DB feeding parallel MapReduce jobs over JDBC via Sqoop, which write files into Hadoop/HDFS; Hive metadata alongside)
  • 15. Example: Copy Data to Hadoop with Sqoop. Sqoop is "just" a JDBC DB client capable of writing to HDFS, and it is very flexible. Just an example of sqoop syntax: sqoop import --connect jdbc:oracle:thin:@oel6:1521/LIN112 --username system --null-string '' --null-non-string '' --target-dir=/user/impala/offload/tmp/ssh/sales --append -m1 --fetch-size=5000 --fields-terminated-by ',' --lines-terminated-by '\n' --optionally-enclosed-by '"' --escaped-by '"' --split-by TIME_ID --query "SELECT * FROM SSH.SALES WHERE TIME_ID < TO_DATE(' 1998-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN') AND $CONDITIONS"
  • 16. Sqoop use cases: 1. Copy data from Oracle to Hadoop (entire schemas, tables, partitions or any SQL query result); 2. Copy data from Hadoop to Oracle (read HDFS files, convert to JDBC arrays and insert); 3. Copy changed rows from Oracle to Hadoop. Sqoop supports adding WHERE syntax to table reads: WHERE last_chg > TIMESTAMP'2015-01-10 12:34:56' or WHERE ora_rowscn > 123456789
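The changed-rows use case above is simple watermark bookkeeping: each run pulls rows newer than the last watermark and records the highest change timestamp it saw. A sketch of that logic (column and variable names are illustrative, not Sqoop internals):

```python
from datetime import datetime

rows = [
    {"id": 1, "last_chg": datetime(2015, 1, 9, 8, 0, 0)},
    {"id": 2, "last_chg": datetime(2015, 1, 10, 13, 0, 0)},
    {"id": 3, "last_chg": datetime(2015, 1, 11, 9, 30, 0)},
]

def changed_since(rows, watermark):
    # Equivalent of: WHERE last_chg > TIMESTAMP'2015-01-10 12:34:56'
    return [r for r in rows if r["last_chg"] > watermark]

batch = changed_since(rows, datetime(2015, 1, 10, 12, 34, 56))
print([r["id"] for r in batch])  # [2, 3]

# The next run's watermark is the max last_chg seen in this batch.
next_watermark = max(r["last_chg"] for r in batch)
```

Note the usual caveat with timestamp watermarks: in-flight transactions committing "in the past" can be missed, which is why the slide also shows the ora_rowscn variant.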
  • 17. Sqoop with Oracle performance optimizations (1). Called OraOOP; was a separate module by Guy Harrison's team at Quest (Dell); now included in standard Sqoop v1.4.5 (CDH5.1): sqoop import direct=true … Parallel Sqoop Oracle reads are done by block ranges or partitions (previously, wide index range scans were required to parallelize the workload). Oracle data is read with full table scans and direct path reads: much less I/O, CPU and buffer cache impact on production databases
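The point of block-range splitting is that N mappers can each scan a disjoint slice of the table's blocks with no index needed. A sketch of how such near-equal ranges might be computed; this is an illustration of the idea, not OraOOP's actual algorithm:

```python
def block_ranges(total_blocks, mappers):
    """Split [0, total_blocks) into `mappers` near-equal, disjoint half-open ranges."""
    base, extra = divmod(total_blocks, mappers)
    ranges, start = [], 0
    for i in range(mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size))
        start += size
    return ranges

print(block_ranges(100, 8))
# Each range would then drive one mapper's full scan restricted to its
# own rowid/block window, so the mappers never read the same block twice.
```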
  • 18. Sqoop with Oracle performance optimizations (2). Data loading into Oracle via direct path insert or merge bypasses the buffer cache and most redo/undo generation. HCC compression on-the-fly for loaded tables: ALTER TABLE fact COMPRESS FOR QUERY HIGH; all following direct path inserts will be compressed. Sqoop is probably the best tool for the data copy/move use case, for batch data movement needs, if used with the Oracle optimizations
  • 19. What about real-time change replication? Oracle Change Data Capture: supported in 11.2 but not recommended by Oracle anymore; desupported in 12.1. Oracle GoldenGate or DBVisit Replicate: 1. Sqoop the source table snapshots to Hadoop; 2. Replicate changes to "delta" tables on HDFS or in HBase; 3. Use a "merge" view in Hive/Impala to merge the base table with the delta. Other vendors: (Dell) Quest SharePlex mines redo logs too; (VMware) Continuent Tungsten uses CDC under the hood!
  • 20. Query Hadoop data as if it was in the (Oracle) RDBMS: the ultimate goal (for me at least :). Flexible, scalable, affordable at scale *
  • 21. What do SQL queries do? 1. Retrieve data: disk reads, decompression, filtering, column extraction & pruning. 2. Process data: join, parallel distribute, aggregate, sort. 3. Return results: final result column projection, send data back over the network … or dump to disk
  • 22. How does the query offloading work? (External table) Execution plan: SELECT STATEMENT > SORT ORDER BY > FILTER > HASH GROUP BY > HASH JOIN > HASH JOIN over EXTERNAL TABLE ACCESS FULL of CUSTOMERS_EXT, ORDERS_EXT and ORDER_ITEMS_EXT. The data retrieval nodes do what they need to produce rows/cols: read Oracle datafiles, read external tables, call ODBC, read memory. In the rest of the execution plan (the data processing nodes), everything works as usual.
  • 23. How does the query offloading work? Data retrieval nodes produce rows/columns; data processing nodes consume rows/columns. Data processing nodes don't care how and where the rows were read from: the data flow format is always the same.
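That retrieval/processing split can be pictured as a pipeline whose downstream operators are indifferent to the row source, as long as every source produces the same row shape. A small sketch (all names hypothetical):

```python
def local_scan():
    # "TABLE ACCESS FULL": rows produced from local storage
    yield from [("online", 10), ("direct", 20), ("online", 5)]

def remote_scan():
    # "REMOTE (IMPALA)": different producer, identical row shape
    yield from [("online", 7), ("direct", 1)]

def aggregate(rows):
    # Processing node: consumes rows without caring who produced them
    totals = {}
    for mode, amount in rows:
        totals[mode] = totals.get(mode, 0) + amount
    return totals

# The same aggregation operator works over either retrieval node.
print(aggregate(local_scan()))   # {'online': 15, 'direct': 20}
print(aggregate(remote_scan()))  # {'online': 7, 'direct': 1}
```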
  • 24. How does the query offloading work? (ODBC) Execution plan: SELECT STATEMENT > SORT ORDER BY > FILTER > HASH GROUP BY > HASH JOIN > HASH JOIN over three REMOTE row sources (orders, customers, order_items; Inst IMPALA, IN-OUT R->S). Remote SQL sent to Impala: 6: SELECT `order_id`,`order_mode`,`customer_id`,`order_status` FROM `soe`.`orders` WHERE `order_mode`='online' AND `order_status`=5; 7: SELECT `customer_id`,`cust_first_name`,`cust_last_name`,`nls_territory`,`credit_limit` FROM `soe`.`customers` WHERE `nls_territory` LIKE 'New%'; 8: SELECT `order_id`,`unit_price`,`quantity` FROM `soe`.`order_items` (all accessing 'IMPALA.LOCALDOMAIN'). The data retrieval nodes run remote Hadoop SQL; data processing is business as usual.
  • 25. Evaluate Hadoop -> Oracle SQL processing: Who reads disk? Who filters the rows and prunes columns, and when? Who decompresses data? Who converts the result set to Oracle internal datatypes? Who pays the CPU price?!
  • 26. Sqooping data around. (Diagram: on the Hadoop side, HDFS read I/O + filter feeding Sqoop's MapReduce jobs; on the Oracle side, JDBC, datatype conversion, scan/join/aggregate, table load and write I/O)
  • 27. Oracle SQL Connector for HDFS. (Diagram: a Java HDFS client (libhdfs/libjvm) streams data from HDFS behind an external table; Oracle does the filtering + datatype conversion, joins, processing etc. No filter pushdown into Hadoop. Parses text files or reads pre-converted DataPump files)
  • 28. Query using DB links & ODBC Gateway. (Diagram: Impala/Hive does the I/O, decompression, filtering and projection on HDFS; results flow over the Thrift protocol through the ODBC driver into the Oracle ODBC Gateway, which does the datatype conversion; Oracle does joins, processing etc)
  • 29. Big Data SQL. (Diagram: Oracle storage cell software on the BDA's HDFS does the I/O + filter, decompression and projection; rows flow over iDB to the Oracle Exadata DB via a Hive external table; Oracle does datatype conversion, joins, processing etc)
  • 30. Oracle <-> Hadoop data plumbing tools: capabilities. Sqoop: load data, offload data, parallel execution. Oracle Loader for Hadoop: load data, parallel execution. Oracle SQL Connector for HDFS: load data, allows querying data, parallel execution. ODBC Gateway: load data, allows querying data, offloads queries. Big Data SQL: load data, allows querying data, offloads queries, parallel execution.
  • 31. Oracle <-> Hadoop data plumbing tools: overhead (who pays for extra disk writes, decompression CPU, filtering CPU and datatype conversion). Sqoop: disk writes Oracle, decompression Hadoop, filtering Oracle, conversion Oracle. Oracle Loader for Hadoop: disk writes Oracle, decompression Hadoop, filtering Oracle, conversion Hadoop. Oracle SQL Connector for HDFS: filtering Oracle, conversion Oracle/Hadoop * (parses text files on HDFS or reads pre-converted DataPump binary; OK for occasional archive queries). ODBC Gateway: decompression Hadoop, filtering Hadoop, conversion Oracle. Big Data SQL: decompression Hadoop, filtering Hadoop, conversion Hadoop (Oracle 12c + BDA + Exadata only).
  • 32. Example use case: Data Warehouse Offload
  • 33. Dimensional DW design. DW fact tables run into terabytes, partitioned by RANGE (order_date) and HASH(customer_id); dimension tables are gigabytes. Fact tables are time-partitioned with months to years of history, and old fact data is rarely updated. After multiple joins on the dimension tables, a full scan is done on the fact table, with some filter predicates directly on the fact table (time range, country code etc). Hotness of data varies by age.
  • 34. DW data offload to Hadoop. Keep the hot data and dimension tables on expensive storage; move the cold partitions of the time-partitioned fact table to cheap scalable storage (e.g. HDFS + Hive/Impala across Hadoop nodes).
  • 35. 1. Copy data to Hadoop with Sqoop. Sqoop is "just" a JDBC DB client capable of writing to HDFS, and it is very flexible. (Same sqoop import example as slide 15.)
  • 36. 2. Load data into Hive/Impala in a read-optimized format. The additional external table step gives better flexibility when loading into the target read-optimized table: CREATE TABLE IF NOT EXISTS SSH_tmp.SALES_ext (PROD_ID bigint, CUST_ID bigint, TIME_ID timestamp, CHANNEL_ID bigint, PROMO_ID bigint, QUANTITY_SOLD bigint, AMOUNT_SOLD bigint) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '"' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/impala/offload/tmp/ssh/sales'; then INSERT INTO SSH.SALES PARTITION (year) SELECT t.*, CAST(YEAR(time_id) AS SMALLINT) FROM SSH_tmp.SALES_ext t
  • 37. 3a. Query data via a DB link / Heterogeneous Services / ODBC. (Not covering the ODBC driver install and config here.) SQL> EXPLAIN PLAN FOR SELECT COUNT(*) FROM ssh.sales@impala; the plan is a single REMOTE operation (Inst IMPALA, IN-OUT R->S) with remote SQL: SELECT COUNT(*) FROM `SSH`.`SALES` A1 (accessing 'IMPALA'). The entire query gets sent to Impala thanks to Oracle Heterogeneous Services.
  • 38. 3b. Query data via Oracle SQL Connector for HDFS. CREATE TABLE SSH.SALES_DP ("PROD_ID" NUMBER, "CUST_ID" NUMBER, "TIME_ID" DATE, "CHANNEL_ID" NUMBER, "PROMO_ID" NUMBER, "QUANTITY_SOLD" NUMBER, "AMOUNT_SOLD" NUMBER) ORGANIZATION EXTERNAL (TYPE ORACLE_LOADER DEFAULT DIRECTORY "OFFLOAD_DIR" ACCESS PARAMETERS (external variable data PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream') LOCATION ('osch-tanel-00000', 'osch-tanel-00001', 'osch-tanel-00002', 'osch-tanel-00003', 'osch-tanel-00004', 'osch-tanel-00005')) REJECT LIMIT UNLIMITED NOPARALLEL. The "external variable data" option says the input is a DataPump-format stream; hdfs_stream is the Java HDFS client app installed into your Oracle server as part of Oracle SQL Connector for Hadoop; the location files are small .xml config files created by Oracle Loader for Hadoop (telling where in HDFS our files are).
  • 39. 3b. Query data via Oracle SQL Connector for HDFS (continued). External table access is parallelizable! Multiple HDFS location files must be created for PX (done by the loader). SQL> SELECT /*+ PARALLEL(4) */ COUNT(*) FROM ssh.sales_dp; the plan shows PX BLOCK ITERATOR > EXTERNAL TABLE ACCESS FULL SALES_DP, so Oracle parallel slaves can each access different DataPump files on HDFS. Any filter predicates are not offloaded: all data/columns are read!
  • 40. 3c. Query via Big Data SQL. SQL> CREATE TABLE movielog_plus (click VARCHAR2(40)) ORGANIZATION EXTERNAL (TYPE ORACLE_HDFS DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS (com.oracle.bigdata.cluster=bigdatalite com.oracle.bigdata.overflow={"action":"truncate"}) LOCATION ('/user/oracle/moviework/applog_json/')) REJECT LIMIT UNLIMITED; This is just one simple example: ORACLE_HIVE would use Hive metadata to figure out data location and structure (but the storage cells do the disk reading from HDFS directly).
  • 41. How to query the full dataset? UNION ALL: the latest partitions in Oracle (SALES table) plus the offloaded data in Hadoop (SALES@impala or the SALES_DP external table). CREATE VIEW app_reporting_user.SALES AS SELECT PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID, QUANTITY_SOLD, AMOUNT_SOLD FROM SSH.SALES WHERE TIME_ID >= TO_DATE(' 1998-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS') UNION ALL SELECT "prod_id", "cust_id", "time_id", "channel_id", "promo_id", "quantity_sold", "amount_sold" FROM SSH.SALES@impala WHERE "time_id" < TO_DATE(' 1998-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS')
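The union-all view pattern can be exercised end-to-end with SQLite standing in for both engines: one table plays the "hot" Oracle partitions, the other the offloaded Hadoop copy, and each view branch only serves its own date range (all row data here is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_hot  (time_id TEXT, amount_sold INT);  -- recent, "in Oracle"
CREATE TABLE sales_cold (time_id TEXT, amount_sold INT);  -- offloaded, "in Hadoop"
INSERT INTO sales_hot  VALUES ('1998-08-01', 100), ('1998-09-15', 50);
INSERT INTO sales_cold VALUES ('1998-01-10', 30),  ('1998-06-30', 20);

-- The hybrid view: each branch is restricted to its own date range,
-- so a row that exists in both stores is never counted twice.
CREATE VIEW sales AS
  SELECT * FROM sales_hot  WHERE time_id >= '1998-07-01'
  UNION ALL
  SELECT * FROM sales_cold WHERE time_id <  '1998-07-01';
""")

total = conn.execute("SELECT SUM(amount_sold) FROM sales").fetchone()[0]
print(total)  # 200
```

The date predicates inside the view are what make the later partial offloading possible: a query against the view with its own time filter lets the optimizer prune one branch entirely.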
  • 42. Hybrid table with selective offloading. SELECT c.cust_gender, SUM(s.amount_sold) FROM ssh.customers c, sales_v s WHERE c.cust_id = s.cust_id GROUP BY c.cust_gender. The SALES union-all view combines TABLE ACCESS FULL over the hot data (expensive storage) with EXTERNAL TABLE ACCESS / DBLINK over the cold data on cheap scalable storage (e.g. HDFS + Hive/Impala), under a HASH JOIN and GROUP BY in the SELECT STATEMENT.
  • 43. Partially offloaded execution plan. HASH GROUP BY > HASH JOIN (access("C"."CUST_ID"="S"."CUST_ID")) joining CUSTOMERS (PARTITION RANGE ALL, Pstart 1, Pstop 16) with the SALES_V view: a UNION-ALL of TABLE ACCESS FULL SALES (PARTITION RANGE ITERATOR, Pstart 7, Pstop 68) and REMOTE SALES (Inst IMPALA). Remote SQL (operation 9): SELECT `cust_id`,`time_id`,`amount_sold` FROM `SSH`.`SALES` WHERE `time_id`<'1998-07-01 00:00:00' (accessing 'IMPALA')
  • 44. Performance. Vendor X: "We load N terabytes per hour". Vendor Y: "We load N billion rows per hour". Q: What datatypes? It is much cheaper to load fixed CHARs, but that's not what your real data model looks like. Q: How wide are the rows: 1 wide varchar column or 500 numeric columns? Datatype conversion is CPU hungry! Make sure you know where in the stack you'll be paying the price, and whether you pay it once or every time the data is accessed.
  • 45. Performance: data retrieval latency (time to first row/byte). Cloudera Impala: low latency (subsecond). Hive: use the latest Hive 0.14+ with all the bells'n'whistles (Stinger, TEZ+YARN); otherwise you'll wait for jobs and JVMs to start up for every query. Oracle SQL Connector for HDFS: multi-second latency due to hdfs_stream JVM startup. Oracle Big Data SQL: low latency thanks to "Exadata storage cell software on Hadoop".
  • 46. Big Data SQL performance considerations. Only data retrieval (the TABLE ACCESS FULL etc) is offloaded: filter pushdown and the like. All data processing still happens in the DB layer: GROUP BY, JOIN, ORDER BY, analytic functions, PL/SQL etc. Just like on Exadata…
  • 47. Thanks! We are hiring [bad ass] developers & data engineers!!! http://gluent.com Also, the usual: http://blog.tanelpoder.com tanel@tanelpoder.com