45. Need server proxy access
• Facilitate remote client access
  o Server process to support concurrent clients
• Standards-compliant connectors
  o JDBC, ODBC
• Security, auditing
48. Protecting Hadoop data and services
• Kerberos-based authentication
• POSIX-style file permissions
• Access control for job submission
• Encryption over the wire
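HDFS mirrors the familiar POSIX rwx permission bits for its files and directories. A minimal local sketch of how those bits work, using the local filesystem as a stand-in (this is not the HDFS API):

```python
import os
import stat
import tempfile

# Create a scratch file and give it POSIX-style permissions:
# owner: rw-, group: r--, others: ---
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o640)

# Read back just the permission bits and test individual flags.
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))                  # 0o640
print(bool(mode & stat.S_IRGRP))  # True  (group can read)
print(bool(mode & stat.S_IWGRP))  # False (group cannot write)

os.remove(path)
```

In HDFS the same rwx model is applied per file and directory via `hdfs dfs -chmod`, with the limitations noted on the next slides: it is file-granular only.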
50. Need for authorization
• Secure authorization
  o Enforce policy to control access to data for authenticated users
• Fine-grained authorization
  o Ability to control access to a subset of data
• Role-based authorization
  o Ability to associate privileges with roles
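The role-based model above can be sketched in a few lines: privileges attach to roles, users hold roles, and an access check walks that indirection. This is an illustrative sketch only (the user, role, and database names are hypothetical, and this is not a Sentry API):

```python
# Privileges are (database, action) pairs attached to roles, never
# directly to users.
ROLE_PRIVS = {
    "analyst": {("sales_db", "SELECT")},
    "etl":     {("sales_db", "SELECT"), ("sales_db", "INSERT")},
}

# Users acquire privileges only through the roles they hold.
USER_ROLES = {"alice": {"analyst"}, "bob": {"etl"}}

def is_authorized(user, db, action):
    """True if any of the user's roles carries the requested privilege."""
    return any((db, action) in ROLE_PRIVS.get(role, set())
               for role in USER_ROLES.get(user, set()))

print(is_authorized("alice", "sales_db", "SELECT"))  # True
print(is_authorized("alice", "sales_db", "INSERT"))  # False
```

The indirection through roles is what makes administration scale: granting a new privilege to a role updates every user holding that role at once.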
51. Current state of authorization
• File-based authorization
  o Control at the file level
  o Insufficient for collaboration
  o No fine-grained access control
• Sub-optimal built-in authorization
  o Intended for preventing accidental changes
  o Not for preventing malicious users from hacking in
52. Apache Sentry
• Policy engine for authorization
• Fine-grained, role-based
• Pluggable modules for Hadoop components
  o Works out of the box with Hive
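Sentry's original file-based policy provider expresses the group-to-role and role-to-privilege mappings in an INI file. A minimal sketch (the group, role, database, and table names here are hypothetical examples, not defaults):

```ini
[groups]
# Users in the 'analysts' group (from the OS or LDAP) receive analyst_role
analysts = analyst_role

[roles]
# analyst_role may only SELECT from one table in one database
analyst_role = server=server1->db=sales->table=orders->action=select
```

This shows the fine-grained, role-based model from the previous slides: privileges scope down from server to database to table, and users get them only through group membership.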
65. Impala Architecture
• Common Hive SQL and interface
• Unified metadata and scheduler
• Fully MPP, distributed
[Diagram: SQL apps connect over ODBC; the Hive Metastore, YARN, HDFS NameNode, and State Store provide unified metadata and scheduling; each worker node runs a Query Planner, Query Coordinator, and Query Exec Engine co-located with an HDFS DataNode and HBase]
Editor's Notes
Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components: HDFS and MapReduce. Based on software originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of the cost of other solutions) of any type of data, and places no constraints on how that data is processed. Allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate it into existing data infrastructure.
Current Architecture (Build): In the beginning, there were enterprise applications backed by relational databases. These databases were optimized for processing transactions, or Online Transaction Processing (OLTP), which required high-speed transactional reading and writing. Given the valuable data in these databases, business users wanted to be able to query them in order to ask questions. They used Business Intelligence tools that provided features like reports, dashboards, scorecards, alerts, and more. But these queries put a tremendous burden on the OLTP systems, which were not optimized to be queried like this. So architects introduced another database, called a data warehouse (you may also hear about data marts or operational data stores, ODS) that was optimized for answering user queries. The data warehouse was loaded with data from the source systems. Specialized tools Extracted the source data, applied some Transformations to it (such as parsing, cleansing, validating, matching, translating, encoding, sorting, or aggregating) and then Loaded it into the data warehouse. For short we call these ETL. As it matured, the data warehouse incorporated additional data sources. Since the data warehouse was typically a very powerful database, some organizations also began performing some transformation workloads right in the database, choosing to load raw data for speed and letting the database do the heavy lifting of transformations. This model is called ELT. Many organizations perform both ETL and ELT for data integration.
Issues (Build): As data volumes and business complexity grow, ETL and ELT processing is unable to keep up. Critical business windows are missed. Databases are designed to load and query data, not transform it. Transforming data in the database consumes valuable CPU, making queries run slower.
Solution (Build): Offload slow or complex ETL/ELT transformation workloads to Cloudera in order to meet SLAs. Cloudera processes raw data to feed the warehouse with high-value cleansed, conformed data. Reclaim valuable EDW capacity for the high-value data and query workloads it was designed for, accelerating query performance and enabling new projects. Gain the ability to query ALL your data through your existing BI tools and practices.
Conventional databases are expensive to scale as data volumes grow. Therefore most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a “time window” for data retention beyond which data is archived. Of course, this data is not in the warehouse so business users cannot benefit from it.
Bank of America: A multinational bank saves millions by optimizing their EDW for analytics and reducing data storage costs by 99%.
Background: A multinational bank has traditionally relied on a Teradata enterprise data warehouse for its data storage, processing and analytics. With the movement from in-person to online banking, the number of transactions and the data each transaction generates has ballooned.
Challenge: The bank wanted to make effective use of all the data being generated, but their Teradata system quickly became maxed out. It could no longer handle current workloads and the bank's business-critical applications were hitting performance issues. The system was spending 44% of its resources on operational functions and 42% on ELT processing, leaving only 11% for analytics and discovery of ROI from new opportunities. The bank was forced to either expand the Teradata system, which would be very expensive; restrict user access to the system in order to lessen the workload; or offload raw data to tape backup and rely on small data samples and aggregations for analytics in order to reduce the data volume on Teradata.
Solution: The bank deployed Cloudera to offload data processing, storage and some analytics from the Teradata system, allowing the EDW to focus on its real purpose: performing operational functions and analytics.
Results: By offloading data processing and storage onto Cloudera, which runs on industry-standard hardware, the bank avoided spending millions to expand their Teradata infrastructure. Expensive CPU is no longer consumed by data processing, and storage costs are a mere 1% of what they were before. Meanwhile, data processing is 42% faster and data center power consumption has been reduced by 25%. The bank can now process 10TB of data every day.
This is a very quick overview and glosses over many of the capabilities and features offered by Flume. This describes Flume 1.3, or "Flume NG".
Client executes a Sqoop job. Sqoop interrogates the DB for column names, types, etc. Based on the extracted metadata, Sqoop creates source code for a table class, and then kicks off a MapReduce job. This table class can be used for processing of extracted records. By default, Sqoop will guess at a column for splitting the data for distribution across the cluster; this can also be specified by the client.
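The splitting step can be sketched simply: given the min and max of the split column, divide the key range into one contiguous sub-range per mapper. This is an illustrative sketch of that idea only, not Sqoop's actual splitter code:

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive integer key range [lo, hi] into contiguous,
    roughly equal sub-ranges, one per mapper."""
    size = (hi - lo + 1) / num_mappers  # ideal rows per mapper
    splits = []
    start = lo
    for i in range(1, num_mappers + 1):
        # Last mapper always ends at hi so no rows are dropped.
        end = hi if i == num_mappers else lo + round(size * i) - 1
        splits.append((start, end))
        start = end + 1
    return splits

# e.g. primary-key values 1..100 spread across 4 mappers
print(split_ranges(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each mapper then issues a query bounded by its own range (a WHERE clause on the split column), which is why a skewed or non-uniform split column leads to unbalanced mappers.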
It should be emphasized that with this system we maintain the raw logs in Hadoop, allowing new transformations as needed.
Most of these tools integrate to existing data stores using the ODBC standard.
MSTR and Tableau are tested and certified now with the Cloudera driver, but other standard ODBC based tools should also work, and more integrations will be supported soon.
JDBC/ODBC support: HiveServer1 Thrift API lacks support for asynchronous query execution, the ability to cancel running queries, and methods for retrieving information about the capabilities of the remote server.
Showing a definite bias here, but Impala is available now in beta, soon to be GA, and supported by major BI and analytics vendors. It is also the system I'm familiar with. Systems like Impala provide important new capabilities for performing data analysis with Hadoop, so it is well worth covering in this context. According to TDWI, lack of real-time query capabilities is an obstacle to Hadoop adoption for many companies.
Each impalad is composed of 3 components: planner, coordinator, and execution engine. The State Store daemon isn't shown here, but it maintains information on the impala daemons running in the system.