Next Revolution
Toward Open Platform

                                      NYC 2011




                       www.nexr.com
NexR Introduction
• Big data analytics firm
   – Working on Hadoop and big data for 5 years
   – Provided a NexR Hadoop solution to all major Korean telcos (KT, SKT, LG U+)
   – Leading a Korean Hadoop community and holding Hadoop conferences

• Products
   – NexR Data Analytics Platform (NDAP)
   – iCube Cloud: cloud computing platform (like OpenStack)
   – Massive email archiving solution (presented at Hadoop World 2009)




Agenda
• Voice of Customer: KT CDR Analysis System
• KT requirements for system migration
• NexR Data Analytics Platform (NDAP) overview
• Oracle-to-Hive migration
• Enterprise Hive
• RHive
• Lessons learned
• Conclusion

                    You can download this presentation file:
                      http://www.nexr.com/hw11/ndap.pdf



Introduction to KT business

History
   1981.12 Establishment of KT Corporation
   2002.08 Privatization from Gov. Owned Company
   2006.04 Commercial Launch of World’s first Mobile WiMAX - WiBro
   2008.11 Commercial Launch of Real-Time IPTV
   2009.06 Merged with KTF
   2010.06 Cloud Service Launched

Business Domain
   – Mobile 2G, 3G
   – WiBro (Mobile WiMax)
   – Internet Access
   – IPTV
   – VOIP
   – Multimedia Contents
   – Local, International Telephone
   – Cloud Service




Key figures: # of employees, sales (2010), telephone subscribers, broadband subscribers, mobile subscribers
Introduction to KT CDR data
• KT CDR (Call Detail Record)                                          Unit: TB

  Category   Data                                  Raw data    Summary     Size
                                                   (1 month)   (1 month)   (raw 1 yr + summary 2 yrs)
  Wireless   Unrated CDR (voice, data, SMS, MMS)   3.7         2.5         104
  Wireless   Rated CDR                             1.5         0.2         22
  Wireless   Wi-Fi                                 0.4         0.3         12
  Wireless   WiBro                                 1.5         1.0         42
  Wireline   Rated CDR                             1.5         1.5         55
  IPDR       IPTV                                  1.5         0.1         19
  Total                                            10          5.6         254
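
The size column can be approximated from the monthly figures. The sketch below assumes the retention policy implied by the column header (12 months of raw data plus 24 months of summary data); the function name and the row labels are hypothetical, and the slide's figures are rounded, so small differences are expected.

```python
# Monthly volumes in TB, copied from the CDR table: (raw per month, summary per month).
MONTHLY_TB = {
    "Wireless unrated CDR": (3.7, 2.5),
    "Wireless rated CDR": (1.5, 0.2),
    "Wi-Fi": (0.4, 0.3),
    "WiBro": (1.5, 1.0),
    "Wireline rated CDR": (1.5, 1.5),
    "IPTV (IPDR)": (1.5, 0.1),
}


def retained_tb(raw_per_month, summary_per_month, raw_months=12, summary_months=24):
    """Estimated retained volume: raw data for 1 year plus summaries for 2 years."""
    return raw_per_month * raw_months + summary_per_month * summary_months


estimates = {name: retained_tb(raw, summ) for name, (raw, summ) in MONTHLY_TB.items()}
total_tb = sum(estimates.values())

for name, tb in estimates.items():
    print(f"{name}: {tb:.1f} TB")
print(f"Total: {total_tb:.1f} TB")
```

Under these assumptions the estimate comes to roughly 255.6 TB, close to the 254 TB total on the slide; the remaining gap is presumably rounding in the per-row figures.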


• KT Subscriber Analysis System (SAS) for wireless CDR
   – Reporting, call detail summary, subscribers' call quality, call log search, etc.
   – Implemented with a relational database on a high-end server
     - Data gathering and converting in a server every tens of seconds
     - Daily batch extract-transform-load (ETL) with SQL queries
     - Near real-time search against an indexed column (call number)
   – Hundreds of DB tables and over 3000 SQL queries accumulated over 10 years
Current KT CDR Analysis System Architecture


[Architecture diagram]
Data sources (LALA2, NIBADA, ARGOS) -> Collector server -> Data converting
   -> Relational database on a high-end server (raw data, dimension table, summary table; batch ETL)
   -> Real-time search and OLAP
The diagram marks four bottlenecks along this path.
New Challenges Faced
• Increasing data volume
   – Rising popularity of smartphones and SNS
   – Need to do more complicated data analysis to beat the competition
   – Customer behavior analysis is needed
• Slow performance
   – Peak-time performance became unacceptable
   – Some CDRs were lost due to slow performance
• KT cloud business launched
   – Cheaper new KT Cloud hardware is available
   – Open source requirements are increasing in the company




           Can traditional DB give us an answer?
KT meets NexR for Big Data
• Scalability
   – Coping with increasing data volume and variety (wired, 2G, 3G, WiMax, LTE, WiFi, SMS, MMS, etc.)
   – Enabling horizontal scalability in every data path (data collection, data storage, ETL process, data search)
• Performance
   – Handling streamed CDR data in near real-time
   – Completing daily ETL tasks in a given time period regardless of data increase
• Cost-efficiency
   – Reducing cost with inexpensive equipment

Replacing traditional RDB and DW with Hadoop and similar OSS
   – Project start (2011.4)
   – Applying NexR solution for CDR analysis (pilot)
Continuous Journey for KT’s Big Data
• Step-by-step approach with NexR

  Step                                     Open      Coverage
  Hadoop CDR Analysis Platform             2012.1Q   Replacing representative data and SQLs; unrated wireless CDR (pilot)
  Wireless CDRs                            2012.     Changing all traditional applications to OSS; adding more views and reports
  Data Integration, Advanced Analytics     2013.     Rated CDRs, Internet access log, TV log; advanced analytics
  External Data Sources                    2014.     SNS, location, etc.; data from KT subsidiaries
Rethinking KT’s Requirements
[Diagram: past, present, future]
Data (volume + variety)
   Past -> Present: data explosion -> Future: data integration (IPTV log, Internet access log, social data)
Data interface + data engineers
   Past and present: KT CDR system with hundreds of DB tables and 3000 SQLs (for 10 years),
   used by SQL developers, DBAs, and business analysts (OLAP, SAS, etc.)
   Future: who?

Big Data Analytics Requirements for Enterprise

• Data volume is only the basic requirement

• Data integration is the fundamental requirement
  (structured data + unstructured data)

• Need to preserve the existing data and apps
• Need to be familiar to enterprise data engineers
  (DBA, SQL developers, business analysts, etc.)

• Smooth transition is also essential


                What’s the solution?
NexR Solution: Hadoop + Hive
Hive is the best solution for a smooth transition from the database world to the Hadoop world

• Hive: ANSI-SQL-based query engine, good for RDB migration
   – Batch data processing (ETL, reporting, ad-hoc query)
• Hadoop: file-based data store, good for data integration
   – Common data storage
• Enterprise data engineers (DBA, SQL developers): how to convert, how to adapt


NexR Data Analytics Platform (NDAP)
• Embracing the database in the Hadoop world
   – Support for migrating data and logic from RDB to Hadoop
   – Support for integrating RDB and Hadoop
   – Offering Hive tools for DBAs and SQL developers

• Full package for big data analytics
   – From data collection to batch data processing, real-time query, and even advanced analytics
   – Leveraging open source technologies
   – Horizontal scalability in every data processing path (collection, batch processing, real-time query, etc.)
   – Injecting real-world practices through collaboration with KT




NDAP Bird’s View
• Open source software used, by component

• NDAP RHive (advanced analytics)
   – Integration of R and Hive
• NDAP Enterprise Hive (batch data processing)
   – Oracle-to-Hive, Hive workflow, Hive performance monitor, query planner
• NDAP Data Store (common data storage)
   – HDFS, Sqoop-based data import/export
• NDAP Search (real-time query)
   – ElasticSearch-based distributed log search
   – Time-ranged index sharding
• NDAP Collector (streamed data collection)
   – Flume-based data collector
   – Checkpointing for low-overhead agents
• NDAP Admin Center (coordination & management)
   – Zookeeper-based distributed coordinator
   – Collectd-based system/app management

NDAP Architecture
[Architecture diagram]
Data sources
   – Oracle databases and ODS (through the Data Importer)
   – Telco equipment, streaming data (through the Collector)
NexR Data Analytics Platform (NDAP)
   – Data Store (Hadoop)
   – Enterprise Hive: Oracle-to-Hive; Hawk (PerfMon, Query Plan); Lama (Hive Workflow)
   – RHive
   – Search (REST/JSON API)
   – Data Exporter
   – Admin Center
Applications
   – Advanced analytics (RHive)
   – DBA, ETL, ad-hoc query, reporting (Enterprise Hive)
   – OLAP through the existing BI/DW (OLAP server, Oracle), fed by the Data Exporter
   – Real-time query (Search)

NDAP Bird’s View – Today’s focus

Today’s talk
• NDAP RHive
   – Integration of R and Hive
• NDAP Enterprise Hive
   – Oracle-to-Hive, Hive workflow, Hive performance monitor, query planner

Refer to appendix
• NDAP Data Store
   – HDFS, Sqoop-based data import/export
• NDAP Search
   – Lucene-based distributed log search engine
   – Time-ranged index sharding
• NDAP Collector
   – Flume-based data collector
   – Checkpointing for low-overhead agents
• NDAP Admin Center
   – Zookeeper-based distributed coordinator
   – Collectd-based system/app management


Enterprise Hive

• Recreating Hive for enterprise data engineers

• Two goals
   – Migration of data and SQL from RDB (Oracle) to Hive
     - Oracle-to-Hive support
   – Rich environment for Hive developers, and also for DW/BI teams and DBAs
     - Performance monitor, query planner, workflow manager




Is Oracle-to-Hive trivial?
Simple Example
   Oracle:  SELECT * FROM Employee e1, Dept d1 WHERE e1.ID = d1.Id
   Hive:    SELECT * FROM Employee e1 JOIN Dept d1 ON (e1.ID = d1.Id)




Typical Example
  SELECT /*+ PARALLEL(K1 16) USE_NL(K1 B) */
          ETL_DATE, CALL_DATE,
          CASE WHEN SUBSCRIBER_TYPE ='PREMIUM'
              THEN 'Y'
              ELSE NVL(TO_CHAR(B.I_NCN),'X')
          END AS I_NCN,
          I_INOUT,VALID_CNT, I_CFC_TYPE, ……
    FROM 3G_CALL_LOG K1
       , SASCOMM.PHONE_MAPPING B
   WHERE K1.i_etl_dt        = TO_DATE('[#SAS_YDAY#]','YYYYMMDD')
     AND K1.i_call_dt ||'' >= TO_DATE('[#SAS_YDAY#]','YYYYMMDD')
     AND K1.i_call_dt ||'' < TO_DATE('[#SAS_YDAY#]','YYYYMMDD') + 1
     and K1.I_INOUT in ('0','1')
     AND DECODE(K1.I_INOUT,'0',NVL(K1.I_OUT_CTN, I_CALLING_NUM),'1',K1.I_IN_CTN) = B.I_CTN(+)
     AND K1.CALL_DATE   >= B.SDATE(+)
     AND K1.CALL_DATE   <   B.EDATE(+);




Enterprise Hive – Oracle-to-Hive
• Enhancing Hive by
   – Fixing Hive code (JIRA issues 2253, 2503, 2329, 2332, etc.)
   – Adding Hive UDFs and UDAFs for Oracle compatibility

• Enterprise Hive provides
   – Conversion rules, a guide, and a process
   – Handling of Oracle data types that are not supported in Hive
   – Handling of Oracle functions that are not supported in Hive

• Three conversion points to consider
   – Data model and data types
   – Basic functions, aggregate and analytic functions
   – SQL syntax


Oracle-to-Hive – Data Model, Types, Functions

Data model
  Oracle        Hive
  Table         Table
  Partition     Partition
  Sampling      Bucket

Data types (Hive data types are designed to be converted into Java data types)
  Oracle        Hive
  NUMBER(n)     TINYINT / INT / BIGINT
  NUMBER(n,m)   FLOAT / DOUBLE
  VARCHAR2      STRING
  DATE          STRING in "yyyy-MM-dd HH:mm:ss" format

Basic functions (Hive follows MySQL function syntax)
  Function type   Oracle                                          Hive
  Math            round, ceil, mod, power, sqrt, sin/cos          round, ceil, pmod, power, sqrt, sin/cos
  Character       substr, trim, lpad/rpad, ltrim/rtrim, replace   substr, trim, lpad/rpad, ltrim/rtrim, regexp_replace
  Null            coalesce, nvl, nvl2                             coalesce (no nvl, nvl2)

Added basic functions (Hive UDF)
  Function type     Hive
  Condition         DECODE, GREATEST
  Null              NVL, NVL2
  Type conversion   TO_NUMBER, TO_CHAR, TO_DATE, INSTR4, DATE_FORMAT, LAST_DAY
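
The conversion rules above can be sketched in code. This is an illustration, not NDAP's implementation: the digit thresholds used to choose among TINYINT, INT, and BIGINT, and the choice of DOUBLE over FLOAT, are assumptions (the slide only lists the candidate target types). The `nvl`, `nvl2`, and `decode` helpers model the Oracle semantics that the added Hive UDFs have to reproduce, with Python's `None` standing in for SQL NULL.

```python
import re


def hive_type(oracle_type):
    """Map an Oracle column type to a Hive type following the data type table.

    Assumed thresholds: up to 2 digits -> TINYINT, up to 9 -> INT, else BIGINT.
    Values wider than 18 digits would not fit BIGINT and need separate handling.
    """
    t = oracle_type.strip().upper()
    m = re.fullmatch(r"NUMBER\((\d+)(?:,\s*(\d+))?\)", t)
    if m:
        precision, scale = int(m.group(1)), m.group(2)
        if scale is not None and int(scale) > 0:
            return "DOUBLE"  # assumption: DOUBLE rather than FLOAT
        if precision <= 2:
            return "TINYINT"
        if precision <= 9:
            return "INT"
        return "BIGINT"
    if t.startswith("VARCHAR2"):
        return "STRING"
    if t == "DATE":
        return "STRING"  # stored in "yyyy-MM-dd HH:mm:ss" format
    raise ValueError(f"no conversion rule for {oracle_type!r}")


def nvl(value, default):
    """NVL(value, default): return default when value is NULL."""
    return default if value is None else value


def nvl2(value, if_not_null, if_null):
    """NVL2(value, a, b): return a when value is not NULL, otherwise b."""
    return if_null if value is None else if_not_null


def decode(expr, *args):
    """DECODE(expr, search1, result1, ..., [default]); NULL matches NULL."""
    pairs, default = args, None
    if len(args) % 2 == 1:
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if expr == search:  # None == None is True, mirroring DECODE's NULL handling
            return result
    return default


column_types = {c: hive_type(t) for c, t in
                {"I_INOUT": "VARCHAR2(1)", "VALID_CNT": "NUMBER(8)", "CALL_DATE": "DATE"}.items()}
direction = decode("0", "0", "outgoing", "1", "incoming")
```

The column names in the usage lines are borrowed from the typical example earlier in the deck; their Oracle types are made up for the illustration.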


Oracle-to-Hive – SQL Syntax & Analytic Functions
Most unsupported Oracle SQL syntax can be converted with join syntax.

IN subquery
   Oracle: SELECT * FROM Employee e WHERE e.DeptNo IN (SELECT d.DeptNo FROM Dept d)
   Hive:   SELECT * FROM Employee e LEFT SEMI JOIN Dept d ON (e.DeptNo = d.DeptNo)

NOT IN subquery
   Oracle: SELECT * FROM Employee e WHERE e.DeptNo NOT IN (SELECT d.DeptNo FROM Dept d)
   Hive:   SELECT e.* FROM Employee e LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo)
           WHERE d.DeptNo IS NULL

JOIN
   Oracle: SELECT * FROM Employee e1, Dept d1 WHERE e1.ID = d1.Id
   Hive:   SELECT * FROM Employee e1 JOIN Dept d1 ON (e1.ID = d1.Id)

RANK (analytic function)
   Oracle: SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) FROM emp
   Hive:   SELECT e.name, e.dept, e.salary, RANK(e.dept, e.salary)
           FROM (SELECT name, dept, salary FROM emp DISTRIBUTE BY dept SORT BY dept, salary DESC) e

MIN (aggregate function)
   Oracle: SELECT dept, MIN(salary) OVER (PARTITION BY dept) FROM emp
   Hive:   SELECT emp.dept, tmp.m FROM emp
           JOIN (SELECT dept, MIN(salary) m FROM emp GROUP BY dept) tmp ON emp.dept = tmp.dept

• Oracle analytic functions are sometimes used for statistical processing (5% in the KT case)
• Implemented some analytic functions (RANK, DENSE_RANK, ROW_NUMBER, LAG, MIN, MAX, SUM)
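
The subquery-to-join rewrites above can be checked on a toy data set. The sketch below uses SQLite through Python's standard library purely as a stand-in engine, so it says nothing about Hive's own behavior; SQLite has no LEFT SEMI JOIN, so a join against the distinct keys is used in its place. The table contents are made up. It also shows one caveat worth validating during migration: the NOT IN rewrite is only equivalent while the join key contains no NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (name TEXT, DeptNo INTEGER);
    CREATE TABLE Dept (DeptNo INTEGER);
    INSERT INTO Employee VALUES ('kim', 10), ('lee', 20), ('park', 30), ('choi', 40);
    INSERT INTO Dept VALUES (10), (20);
""")


def names(sql):
    """Run a query and return the first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


NOT_IN_SQL = ("SELECT e.name FROM Employee e "
              "WHERE e.DeptNo NOT IN (SELECT d.DeptNo FROM Dept d)")
ANTI_JOIN_SQL = ("SELECT e.name FROM Employee e "
                 "LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo) "
                 "WHERE d.DeptNo IS NULL")
IN_SQL = ("SELECT e.name FROM Employee e "
          "WHERE e.DeptNo IN (SELECT d.DeptNo FROM Dept d)")
# Stand-in for LEFT SEMI JOIN: joining against distinct keys avoids duplicate rows.
SEMI_JOIN_SQL = ("SELECT e.name FROM Employee e "
                 "JOIN (SELECT DISTINCT DeptNo FROM Dept) d ON (e.DeptNo = d.DeptNo)")

not_in = names(NOT_IN_SQL)
anti_join = names(ANTI_JOIN_SQL)
in_subquery = names(IN_SQL)
semi_join = names(SEMI_JOIN_SQL)

# Caveat: once the subquery side contains a NULL, NOT IN yields no rows,
# while the outer-join rewrite still returns the unmatched employees.
conn.execute("INSERT INTO Dept VALUES (NULL)")
not_in_with_null = names(NOT_IN_SQL)
anti_join_with_null = names(ANTI_JOIN_SQL)
```

With non-NULL keys both forms agree; after the NULL is inserted they diverge, which is the kind of difference a data compatibility check is meant to catch.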

Oracle-to-Hive – Example


           -- Oracle
           select /*+ use_nl(E emp_idx1) */
                  D.dname, E.empno, E.ename,
                  decode(nvl(JOB, 'SALESMAN'), 'SALESMAN', sal, 0) sal,
                  RANK() over (PARTITION BY D.deptno ORDER BY sal desc) ranking
           from   dept D, emp E
           where D.deptno = E.deptno
           and    E.ename in (select ename
                              from bonus
                              where job in ('SALESMAN', 'CLERK'));




           -- Hive (converted)
           select X.dname, X.empno, X.ename,
                  nexr_rank(hash(X.deptno, X.sal), X.sal) ranking
           from (
               -- rows are distributed and sorted in the subquery, before the rank UDF runs
               select D.dname, D.deptno, E.empno, E.ename,
                      (case coalesce(E.job, 'SALESMAN') when 'SALESMAN' then E.sal else 0 end) sal
               from   dept D
               join   emp   E on (D.deptno = E.deptno)
               join   bonus B on (E.ename = B.ename)
               where  B.job in ('SALESMAN', 'CLERK')
               distribute by hash(D.deptno, E.sal) sort by D.deptno, E.sal desc
           ) X;




NDAP Process for RDB Migration

Data preparation -> Conversion -> Validation -> Optimization

• Data preparation
   – Oracle schema to Hive schema
   – Data loading to Hive using Sqoop
• Conversion
   – Function conversion
   – SQL conversion (syntactic, by 1-on-1 conversion rules)
• Validation
   – Data compatibility check
• Optimization
   – Rewriting Hive queries semantically (when more performance is needed)

• The case of KT CDR migration
   – Chose 100 representative SQLs for ETL and successfully converted them
   – Current step: 200-300 mainly used SQLs
   – Next step (2012): 3000 SQLs
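
The validation step (data compatibility check) can be pictured as comparing the result of an original Oracle query with the result of its converted Hive query. The following is a minimal sketch under stated assumptions, not the NDAP tooling: rows are assumed to be already fetched into Python tuples, row order is ignored, and the normalization rules (numeric rounding, trimming strings) are hypothetical choices meant to absorb the type differences introduced by the conversion.

```python
from collections import Counter
from decimal import Decimal


def normalize(value):
    """Bring values from both systems to a comparable form (rules are assumptions)."""
    if value is None:
        return None
    if isinstance(value, (int, float, Decimal)):
        return round(float(value), 6)  # NUMBER on one side, DOUBLE on the other
    return str(value).strip()          # DATE is carried as a formatted string in Hive


def compare_results(oracle_rows, hive_rows):
    """Compare two result sets as multisets of rows, ignoring row order."""
    left = Counter(tuple(normalize(v) for v in row) for row in oracle_rows)
    right = Counter(tuple(normalize(v) for v in row) for row in hive_rows)
    return {
        "match": left == right,
        "only_in_oracle": sorted((left - right).elements(), key=repr),
        "only_in_hive": sorted((right - left).elements(), key=repr),
    }


# Made-up sample rows standing in for the two query results.
oracle_rows = [("SALES", Decimal("1500.00"), "2011-11-07 00:00:00"),
               ("OPS", Decimal("10"), None)]
hive_rows = [("OPS", 10.0, None),
             ("SALES", 1500.0, "2011-11-07 00:00:00")]

report = compare_results(oracle_rows, hive_rows)
mismatch = compare_results(oracle_rows, hive_rows[:1])
```

Comparing multisets rather than sets keeps duplicate rows significant, which matters when a subquery has been rewritten as a join.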



Enterprise Hive – Rich Environment for Hive

• Building up a Hive ecosystem by
   – Adding assistant programs to help DBAs and SQL developers

• Enterprise Hive provides
   – Hive performance monitor and query planner
   – Hive workflow manager




Hawk – Hive Performance Monitor
• Difficulty of Hive performance diagnostics
   – Metrics and logs from Hive and Hadoop are separated
   – Lack of a historical and statistical view of performance
• Hawk performance monitor for DBAs
   – Integrated view of a Hive query and the corresponding MapReduce jobs
   – Hourly/daily/weekly/monthly performance views of each query

[Figures: Hawk architecture; Hawk screenshot]


Hawk – Hive Query Planner
• Difficulty of the Hive default query planner
   – Too complicated, because it shows the details of MapReduce execution
   – Aimed at Hive internal developers, not DBAs
• Hawk query planner for DBAs
   – Displaying a Hive query at the HQL operator level (familiar to DBAs)
   – Showing the performance result together with the query

[Figures: Hive default query planner; Hawk query planner with performance result]



Lama – Hive Workflow Manager

• Workflow development and management tool for Hive
   – Managing data processing jobs for Hive
   – Choosing Oozie as the core workflow engine
   – Providing a web-based interface: workflow editing & management, user management, job scheduling, project management, etc.

• On-demand workflow change at runtime
   – Need to fix and resume a workflow at runtime on failure
   – Not supported in most workflow engines
   – Patching Oozie for suspend/resume per action (i.e., Hive query)

• Future plan
   – Supporting other data processing jobs like Pig, Sqoop, MapReduce, HDFS, SSH, and Java
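
The suspend/fix/resume behavior described under "On-demand workflow change at runtime" can be sketched with a toy workflow runner. This is not Oozie's API or NexR's patch; the class, method, and status names are hypothetical. It only illustrates the idea that a failed action suspends the workflow, the action can be replaced (for example with a corrected Hive query), and execution continues from that action without re-running the ones that already completed.

```python
class Workflow:
    """Runs named actions in order and suspends at the first failure."""

    def __init__(self, actions):
        self.actions = dict(actions)   # name -> callable, kept in execution order
        self.order = list(self.actions)
        self.next_index = 0
        self.status = "PREP"
        self.log = []

    def run(self):
        self.status = "RUNNING"
        while self.next_index < len(self.order):
            name = self.order[self.next_index]
            try:
                self.actions[name]()
            except Exception as exc:
                # Stop here and remember the position of the failed action.
                self.status = "SUSPENDED"
                self.log.append(f"{name}: failed ({exc})")
                return self.status
            self.log.append(f"{name}: ok")
            self.next_index += 1
        self.status = "SUCCEEDED"
        return self.status

    def fix(self, name, new_action):
        """Replace an action while the workflow is suspended."""
        if self.status != "SUSPENDED":
            raise RuntimeError("workflow is not suspended")
        self.actions[name] = new_action

    def resume(self):
        """Continue from the failed action; completed actions are not re-run."""
        if self.status != "SUSPENDED":
            raise RuntimeError("workflow is not suspended")
        return self.run()


executed = []


def load():
    executed.append("load")


def broken_summary():
    raise ValueError("bad HQL")


def fixed_summary():
    executed.append("summary")


def report():
    executed.append("report")


wf = Workflow([("load", load), ("summary", broken_summary), ("report", report)])
first = wf.run()                  # suspends at "summary"
wf.fix("summary", fixed_summary)  # swap in the corrected action
second = wf.resume()              # continues with "summary", then "report"
```

The design choice illustrated is that the runner keeps its position per action, so a fix costs one re-run of the failed step rather than a restart of the whole daily batch.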

NDAP Process for Batch Data Processing

Analysis -> Development -> Execution -> Management

• Analysis
   – Analyze service request (SR)
   – Hive data and query modeling
• Development (Lama Workflow Manager)
   – Workflow development, testing, and validation
• Execution (Hawk Performance Monitor)
   – Workflow deployment and scheduling
   – Performance monitoring
• Management (Hawk Query Planner)
   – Performance diagnostics and optimization
   – Workflow suspend/fix/resume on failure




Enterprise Hive Demo




R for Advanced Analytics
• R (GNU open source)
   – Programming language and software environment for statistical computing and graphics (Wikipedia)
   – 4,000+ R libraries (more than SAS’s functionality)
   – Becoming a de facto standard among statisticians

• R for big data
   – R runs on a single node
   – Some parallel R packages: snowfall, rpvm, rmpi, etc.

• Recent attempts to combine R and Hadoop
   – RHIPE (Purdue), RHadoop (RA), Ricardo (IBM)


RHive
• Marrying R and Hive for big data analytics
   – Most R programmers are familiar with SQL
   – Hive can hide the details of Hadoop and MapReduce
   – Inspired by IBM Ricardo (R + Jaql)

• R: strong for deep analytics, but lacks massive data manipulation
• Hive: strong for massive data manipulation, but lacks analytical functionality

• Providing Hive interfaces in the R environment
• Allowing R programmers to use familiar SQL for big data manipulation

• Released as open source (Apache license version 2)
   – Source: https://github.com/nexr/RHive
   – CRAN: http://cran.r-project.org/web/packages/RHive

RHive API and Architecture
• RHive API
   – rhive.connect(): connect R to Hive
   – rhive.query(): send a Hive query and return the result
   – rhive.export(): export R functions to R processes running on the MapReduce nodes
   – rhive.exportAll(): export R functions and R objects to R processes running on the MapReduce nodes
   – rhive.close(): close a Hive connection

[Figure: RHive architecture]




RHive Sample – Flight Delay Prediction
• R: building a prediction model of flight delay using linear regression with a training data set (sampled from Hive)
• Hive: running the prediction model (R objects) over the entire data set in Hive

Data set: airline on-time performance (http://stat-computing.org/dataexpo/2009/)
• Flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008

   library(RHive)
   rhive.connect("127.0.0.1")

   # get a training data set from Hive
   trainset <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines",fetchsize=30,limit=100)

   # convert to numeric, and drop rows with missing values
   trainset$arrdelay <- as.numeric(trainset$arrdelay)
   trainset$distance <- as.numeric(trainset$distance)
   trainset <- trainset[!(is.na(trainset$arrdelay) | is.na(trainset$distance)),]

   # create a prediction model using R model objects and internal functions
   model <- lm(arrdelay ~ distance + dayofweek,data=trainset)
   rhpredict <- function(arg1,arg2,arg3) {
     # Hive passes missing values as the string "NULL"
     if(arg1 == "NULL" || arg2 == "NULL" || arg3 == "NULL")
       return(0.0)
     res <- predict.lm(model, data.frame(dayofweek=arg1, arrdelay=arg2, distance=arg3))
     return(as.numeric(res))
   }
   null <- "NULL"

   # set up R objects in Hive
   rhive.assign("null", null)
   rhive.assign("rhpredict", rhpredict)
   rhive.assign("model", model)

   # export the R prediction model and run it in Hive
   rhive.exportAll("rhpredict", c("10.1.3.2","10.1.3.3","10.1.3.4","10.1.3.5","10.1.3.6","10.1.3.7"))
   rhive.query("create table delaypredict as select R('rhpredict', dayofweek, arrdelay, distance, 0.0) from airlines")

RHive Demo




Lessons Learned
 Migrating an RDB to open source software is complicated,
 time-consuming, and labor-intensive. It becomes practical
 with experience and a defined migration process.
   Average time to convert one query (200 to 300 lines on average):
     8 hours -> 2 hours after 4 months (4 times faster)
   Teams with prior database migration experience have an advantage
   (similar to Oracle-to-MySQL migration)

 Current data engineers are not familiar with open source
 software such as Hadoop. They want tools similar to the
 ones they already use.
   Open source software such as Hadoop and MapReduce is not easy for
   current IT managers. It is technology-driven, not
   demand-driven.
   Open source technologies need to be wrapped in familiar
   interfaces that hide the details.

Lessons Learned
 Open source software is not a panacea. Choosing the right
 open source project is the first significant step. Combining
 several OSS projects is common. Modifying OSS source code is
 inevitable if requirements are not negotiable.
   Combining two separate open source projects, Hive and ElasticSearch, for
   batch data processing and real-time query, with Hadoop as a common
   data store.
   Modifying Hive, ElasticSearch, Flume, Oozie, Zookeeper, etc.

 Integrating various types of data is a critical issue for
 an enterprise. In particular, structured data from databases
 and DWs needs to be coupled with unstructured data in order
 to better understand customers’ needs.
   It is necessary to carry over current data and business logic into the new
   environment.
   RDB/DW and Hadoop each have pros and cons, so it is necessary to
   find the right mix.

Conclusion


   Big data analytics for telco and enterprises
 Smooth transition from RDB/DW to
     NexR Data Analytics Platform (NDAP)



NexR NDAP Team
   Jaesun Han                  Wonkuk Yang
   Sangmin Kwak                Sebong Oh
   JeongMin Kwon               SungHan Woo
   Keumju Kim                  Dongmin Yu
   Daegeun Kim                 Choonghyun Ryu
   Minseok Kim                 Bokju Yun
   Minwoo Kim                  Jonghee Lee
   Yeonseop Kim                HyungJoo, Lim
   Youngwoo Kim                HeeWon Jeon
   Hyeon-Cheol Nah             GooBum Jung
   SeungWoo Ryu                Sunghwan Cho
   Seoeun Park                 Junho Cho
   Young-Geun Park             ByungMyon Chae
   Eun-Sook Park               Yungtai Choi
   Chihoon Byun                Choi Jong-wook
   SeongHwa Ahn                Inho Han
    Youngbae An                Seonghak Hong
Thank you
              Presentation file: http://www.nexr.com/hw11/ndap.pdf


                                                         Contact
                                                         jason.han@nexr.com
                                                         twitter: @jaesun_han




KT CDR System (Slide 4)
NDAP Overview (Slide 14)
Enterprise Hive (Slide 17)
RHive (Slide 30)
Appendix (Slide 37)


Appendix



NDAP Collector
 Flume-based scalable data collector
    Flume chosen for its flexible architecture (source, decorator, sink)
    Adding a checkpoint mode and rolling/dedup
 Adding a checkpoint reliability mode
    Chukwa’s checkpoint is grafted onto Flume
    Less resource consumption in agents than Flume E2E mode
    Minimizing the amount of log data retransmitted when an agent fails
 Rolling and deduplication
    Rolling fragmented log data periodically in Hadoop
    Removing duplicated log data in case of failover
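
The checkpoint and deduplication ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not NDAP or Flume code: the `Agent` class, the `dedup` function, and the use of an (agent, offset) pair as the duplicate key are all hypothetical. It shows an agent that resends only the lines after its last acknowledged offset, and a later pass that removes the lines delivered twice.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Toy log agent that remembers the last acknowledged offset (its checkpoint)."""
    name: str
    lines: list
    checkpoint: int = 0  # index of the first line not yet acknowledged

    def pending(self):
        # After a restart, resend only what follows the checkpoint.
        return [(self.name, i, self.lines[i])
                for i in range(self.checkpoint, len(self.lines))]

    def ack(self, offset):
        # Advance the checkpoint once receipt is confirmed up to `offset`.
        self.checkpoint = max(self.checkpoint, offset + 1)


def dedup(events):
    """Drop events whose (agent, offset) key was already seen, keeping arrival order."""
    seen = set()
    unique = []
    for agent_name, offset, line in events:
        key = (agent_name, offset)
        if key not in seen:
            seen.add(key)
            unique.append((agent_name, offset, line))
    return unique


agent = Agent("web01", ["a", "b", "c", "d"])
store = []

# First attempt: all four lines are delivered, but only the first two are
# acknowledged before the agent fails.
store.extend(agent.pending())
agent.ack(1)

# Restart: the agent resends from its checkpoint, so offsets 2 and 3 arrive twice.
resent = agent.pending()
store.extend(resent)

clean = dedup(store)
```

Without the checkpoint the agent would resend all four lines after the failure; with it, only two are retransmitted, and the dedup pass removes those two duplicates from the store.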


 (Diagram: NDAP Collector. Flume Agent with Source, Decorator, Sink, and Checkpoint;
  Rolling/Dedup Manager with Rolling/Dedup MR, Scheduler, and Workflow Manager;
  Zookeeper; MapReduce Execution; Data Store (Hadoop); Search; log data & location)



NDAP Search: Near Real-Time Indexing
 Near real-time indexing using RAM Index
   Adding RAM index for near real-time indexing in ElasticSearch
   Flushing RAM index into Disk index after a given time period or a buffer
   overflow
   When searching, both RAM index and disk index are examined
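
A minimal Python sketch of the RAM-index idea described above follows. It is an assumption-laden toy, not the NDAP or ElasticSearch implementation: the class name `NearRealTimeIndex`, its parameters, and the dictionary standing in for the disk index are all hypothetical. It shows documents becoming searchable as soon as they enter the RAM index, a flush triggered by a time limit or a full buffer, and a search that consults both indexes.

```python
import time


class NearRealTimeIndex:
    def __init__(self, max_buffered_docs=3, flush_interval_s=5.0, clock=time.monotonic):
        self.ram = {}    # term -> set of doc ids, searchable immediately
        self.disk = {}   # term -> set of doc ids, stands in for the on-disk index
        self.buffered = 0
        self.max_buffered_docs = max_buffered_docs
        self.flush_interval_s = flush_interval_s
        self.clock = clock
        self.last_flush = clock()

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.ram.setdefault(term, set()).add(doc_id)
        self.buffered += 1
        too_full = self.buffered >= self.max_buffered_docs
        too_old = self.clock() - self.last_flush >= self.flush_interval_s
        if too_full or too_old:
            self.flush()

    def flush(self):
        # Merge the RAM index into the disk index, then start a fresh buffer.
        for term, ids in self.ram.items():
            self.disk.setdefault(term, set()).update(ids)
        self.ram.clear()
        self.buffered = 0
        self.last_flush = self.clock()

    def search(self, term):
        # Both indexes are examined, so unflushed documents are still found.
        term = term.lower()
        return self.ram.get(term, set()) | self.disk.get(term, set())


index = NearRealTimeIndex(max_buffered_docs=3, flush_interval_s=3600.0)
index.add(1, "login failed")
index.add(2, "login ok")
hits_before_flush = index.search("login")  # served from the RAM index
index.add(3, "logout")                     # third document fills the buffer
hits_after_flush = index.search("login")   # now served from the disk index
```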



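 (Diagram: the Indexer adds documents through an IndexWriter, which writes to a buffer
  and commits to the Disk Index; the Searcher searches through an IndexReader, which
  reads both the buffer and the Disk Index)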


NDAP Search: Index Split Strategies
 Modifying ElasticSearch to add more index split schemes for log search
    Log searches usually have a time constraint, such as daily or monthly
    Combining time-based index split and size-based index split
 Time-based index split
    Splitting an index according to a given time period
    Improving indexing and search performance
    Easy to implement auto-retention
 Size-based index split
    Splitting an index according to a given size
    Resolving a big index performance problem
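
The combination of time-based and size-based splitting can be sketched as a small router that picks an index name per document. This is an illustrative sketch only, not the modified ElasticSearch code: the `log-` prefix, the `SplitIndexRouter` class, and the document-count limit used as the size measure are assumptions.

```python
from datetime import date, timedelta


def index_name(day, sequence):
    """Build a name such as 'log-2011.10.09-0002' (the 'log-' prefix is made up)."""
    return f"log-{day:%Y.%m.%d}-{sequence:04d}"


class SplitIndexRouter:
    def __init__(self, max_docs_per_index):
        self.max_docs_per_index = max_docs_per_index
        self.doc_counts = {}   # index name -> number of docs written so far
        self.current_seq = {}  # day -> sequence currently being written

    def route(self, day):
        """Return the index that should receive the next document for `day`."""
        seq = self.current_seq.get(day, 1)
        name = index_name(day, seq)
        if self.doc_counts.get(name, 0) >= self.max_docs_per_index:
            seq += 1  # size-based split: roll over to the next sequence
            name = index_name(day, seq)
        self.current_seq[day] = seq
        self.doc_counts[name] = self.doc_counts.get(name, 0) + 1
        return name

    def indices_for_range(self, start, end):
        """A time-bounded search only needs indices whose day is in [start, end]."""
        names = []
        day = start
        while day <= end:
            for seq in range(1, self.current_seq.get(day, 0) + 1):
                names.append(index_name(day, seq))
            day += timedelta(days=1)
        return names

    def expire_before(self, cutoff):
        """Auto-retention: drop whole day partitions older than `cutoff`."""
        for day in [d for d in self.current_seq if d < cutoff]:
            for seq in range(1, self.current_seq[day] + 1):
                self.doc_counts.pop(index_name(day, seq), None)
            del self.current_seq[day]


router = SplitIndexRouter(max_docs_per_index=2)
d8, d9 = date(2011, 10, 8), date(2011, 10, 9)
written = [router.route(d8) for _ in range(3)] + [router.route(d9)]
one_day = router.indices_for_range(d9, d9)
router.expire_before(d9)
after_retention = router.indices_for_range(d8, d9)
```

Because each day is its own partition, retention reduces to dropping whole indexes, and a search limited to one day never touches the others.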


 (Diagram: a search fans out over time-based index partitions (2011.10.08, 2011.10.09, ..., 2011.10.30),
  each split into size-based index sequences (0001, 0002, 0003), and each sequence stored as
  ElasticSearch index shards (0, 1, 3) with one primary and two replicas)


NDAP Admin Center: Distributed Coordinator
 Zookeeper-based distributed coordinator
    Zookeeper handles the coordination among NDAP components
    Patching several issues of Zookeeper and ZkClient
    Providing common libraries for NDAP components
      Group membership, master election, distributed lock, distributed queue
      Easier to use and more reliable than other recipes, especially for read-and-write problems
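
To make the group membership and master election recipes concrete, here is a small in-memory Python simulation. It does not talk to Zookeeper and does not reflect NDAP's patched libraries; the `ElectionGroup` class is hypothetical. It follows the commonly documented Zookeeper election pattern in which each participant creates an ephemeral sequential node and the participant with the lowest sequence number acts as master.

```python
import itertools


class ElectionGroup:
    """In-memory stand-in for a parent znode holding sequential member nodes."""

    def __init__(self):
        self._counter = itertools.count()
        self.members = {}  # sequence number -> member name

    def join(self, name):
        # Each new member receives the next sequence number.
        seq = next(self._counter)
        self.members[seq] = name
        return seq

    def leave(self, seq):
        # Models an ephemeral node disappearing when its session ends.
        self.members.pop(seq, None)

    def master(self):
        # The member holding the lowest sequence number is the master.
        return self.members[min(self.members)] if self.members else None

    def membership(self):
        return [self.members[s] for s in sorted(self.members)]


group = ElectionGroup()
a = group.join("collector-1")
b = group.join("collector-2")
c = group.join("search-1")
first_master = group.master()

group.leave(a)  # the current master fails
second_master = group.master()
remaining = group.membership()
```

A real deployment would also need watches so that members learn about the change, which this sketch leaves out.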


 (Diagram: layered stack. Patched Zookeeper Ensemble, then Zookeeper Client Thread and
  Patched ZkClient Thread (complex, unique, fragile), then Zookeeper Recipes: Group Membership,
  Master Election, Distributed Lock, Distributed Queue (easy, reusable, fault tolerant),
  used by NDAP Search and NDAP Collector)

NDAP Admin Center: System/App Management
 Collectd-based system and application monitoring
    Server resource monitoring: CPU, memory, disk, process, vmem, TCP connections, etc.
    Application monitoring: Hadoop, ElasticSearch, Flume, Zookeeper, Memcached, Collectd, etc.
    Plug-in architecture: add more applications such as NoSQL
 Resource-centric view
    Displaying the status of one resource (CPU, memory, etc.) for all nodes on a single screen
    Most system management tools (Ganglia, Nagios, etc.) offer a node-centric view
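
The difference between a node-centric and a resource-centric view can be shown with a short Python sketch. Everything here is illustrative: the function name, the severity labels, the thresholds, and the sample values are made up, and no Collectd API is used. The function takes per-node samples and returns one resource across all nodes, worst first, with a severity derived from thresholds.

```python
def resource_centric(samples, resource, warn, crit):
    """Return (node, value, severity) rows for one resource, highest value first."""
    rows = []
    for node, metrics in samples.items():
        if resource not in metrics:
            continue
        value = metrics[resource]
        if value >= crit:
            severity = "critical"
        elif value >= warn:
            severity = "warning"
        else:
            severity = "ok"
        rows.append((node, value, severity))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up utilisation percentages, keyed by node (the node-centric shape).
samples = {
    "node01": {"cpu": 35.0, "mem": 62.0},
    "node02": {"cpu": 91.5, "mem": 48.0},
    "node03": {"cpu": 78.0, "mem": 95.0},
}

cpu_view = resource_centric(samples, "cpu", warn=75.0, crit=90.0)
mem_view = resource_centric(samples, "mem", warn=75.0, crit=90.0)
```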


 (Diagram: Collectd Server checks threshold/severity and feeds the Management Dashboard in NDAP Admin)



More Related Content

What's hot

Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sqlRam kumar
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
data warehouse , data mart, etl
data warehouse , data mart, etldata warehouse , data mart, etl
data warehouse , data mart, etlAashish Rathod
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseDataWorks Summit
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Simplilearn
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaCloudera, Inc.
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Databasenehabsairam
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databasesJames Serra
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introductionPooyan Mehrparvar
 

What's hot (20)

Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sql
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
data warehouse , data mart, etl
data warehouse , data mart, etldata warehouse , data mart, etl
data warehouse , data mart, etl
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
Sqoop
SqoopSqoop
Sqoop
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Database
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introduction
 

Viewers also liked

Big data analytics -hive
Big data analytics -hiveBig data analytics -hive
Big data analytics -hivekarthika karthi
 
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)Stéphane Fréchette
 
SnapLogic Big Data Integration
SnapLogic Big Data IntegrationSnapLogic Big Data Integration
SnapLogic Big Data IntegrationSnapLogic
 
Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionMurtaza Doctor
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviationranjit banshpal
 
March Marketers: Research Trends Presentation
March Marketers: Research Trends PresentationMarch Marketers: Research Trends Presentation
March Marketers: Research Trends PresentationAlexandra Knoll
 
Hive contributors meetup apache sentry
Hive contributors meetup   apache sentryHive contributors meetup   apache sentry
Hive contributors meetup apache sentryBrock Noland
 
Hive Correlation Optimizer
Hive Correlation OptimizerHive Correlation Optimizer
Hive Correlation OptimizerYin Huai
 
Programming Hive Reading #4
Programming Hive Reading #4Programming Hive Reading #4
Programming Hive Reading #4moai kids
 
Hiveハンズオン
HiveハンズオンHiveハンズオン
HiveハンズオンSatoshi Noto
 
Big Data Analytics Using Hadoop
Big Data Analytics Using HadoopBig Data Analytics Using Hadoop
Big Data Analytics Using HadoopSrikanth VNV
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InSnapLogic
 
Programming Hive Reading #3
Programming Hive Reading #3Programming Hive Reading #3
Programming Hive Reading #3moai kids
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinityShashwat Shriparv
 
Join optimization in hive
Join optimization in hive Join optimization in hive
Join optimization in hive Liyin Tang
 
Hive Object Model
Hive Object ModelHive Object Model
Hive Object ModelZheng Shao
 
Airline and Airport Big Data: Impact and Efficiencies
Airline and Airport Big Data: Impact and EfficienciesAirline and Airport Big Data: Impact and Efficiencies
Airline and Airport Big Data: Impact and EfficienciesJoshua Marks
 

Viewers also liked (20)

Big data analytics -hive
Big data analytics -hiveBig data analytics -hive
Big data analytics -hive
 
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
SnapLogic Big Data Integration
SnapLogic Big Data IntegrationSnapLogic Big Data Integration
SnapLogic Big Data Integration
 
Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to Action
 
6.hive
6.hive6.hive
6.hive
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
 
March Marketers: Research Trends Presentation
March Marketers: Research Trends PresentationMarch Marketers: Research Trends Presentation
March Marketers: Research Trends Presentation
 
Hive contributors meetup apache sentry
Hive contributors meetup   apache sentryHive contributors meetup   apache sentry
Hive contributors meetup apache sentry
 
Build & test Apache Hawq
Build & test Apache Hawq Build & test Apache Hawq
Build & test Apache Hawq
 
Hive Correlation Optimizer
Hive Correlation OptimizerHive Correlation Optimizer
Hive Correlation Optimizer
 
Programming Hive Reading #4
Programming Hive Reading #4Programming Hive Reading #4
Programming Hive Reading #4
 
Hiveハンズオン
HiveハンズオンHiveハンズオン
Hiveハンズオン
 
Big Data Analytics Using Hadoop
Big Data Analytics Using HadoopBig Data Analytics Using Hadoop
Big Data Analytics Using Hadoop
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
 
Programming Hive Reading #3
Programming Hive Reading #3Programming Hive Reading #3
Programming Hive Reading #3
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinity
 
Join optimization in hive
Join optimization in hive Join optimization in hive
Join optimization in hive
 
Hive Object Model
Hive Object ModelHive Object Model
Hive Object Model
 
Airline and Airport Big Data: Impact and Efficiencies
Airline and Airport Big Data: Impact and EfficienciesAirline and Airport Big Data: Impact and Efficiencies
Airline and Airport Big Data: Impact and Efficiencies
 

Similar to Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data - Jason Han, NexR

Cloud Computing Service & Market Trends in Korea
Cloud Computing Service & Market Trends in KoreaCloud Computing Service & Market Trends in Korea
Cloud Computing Service & Market Trends in KoreaFanny Lee
 
Broadband World Forum 2012 Highlights
Broadband World Forum 2012 HighlightsBroadband World Forum 2012 Highlights
Broadband World Forum 2012 HighlightsAlan Quayle
 
WTSA-16_SG13_Presentation.pptx
WTSA-16_SG13_Presentation.pptxWTSA-16_SG13_Presentation.pptx
WTSA-16_SG13_Presentation.pptxlionofsouth
 
ITU-T Study Group 13 Introduction
ITU-T Study Group 13 IntroductionITU-T Study Group 13 Introduction
ITU-T Study Group 13 IntroductionITU
 
Colt SD-WAN experience learnings and future plans
Colt SD-WAN experience learnings and future plansColt SD-WAN experience learnings and future plans
Colt SD-WAN experience learnings and future plansColt Technology Services
 
CDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım Örnekleri
CDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım ÖrnekleriCDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım Örnekleri
CDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım Örneklerididemtopuz
 
Colt’s Carrier SDN & NFV: Experience, Learnings & Future Plans
Colt’s Carrier SDN & NFV: Experience, Learnings & Future PlansColt’s Carrier SDN & NFV: Experience, Learnings & Future Plans
Colt’s Carrier SDN & NFV: Experience, Learnings & Future PlansOpen Networking Summit
 
Gef 2012 InduSoft Presentation
Gef 2012  InduSoft PresentationGef 2012  InduSoft Presentation
Gef 2012 InduSoft PresentationAVEVA
 
GE Smallworld Network Inventory Overview
GE Smallworld Network Inventory OverviewGE Smallworld Network Inventory Overview
GE Smallworld Network Inventory Overviewcwilson5496
 
KT/KTDS Case Study: Open Source Database Adoption in Telecom
KT/KTDS Case Study: Open Source Database Adoption in TelecomKT/KTDS Case Study: Open Source Database Adoption in Telecom
KT/KTDS Case Study: Open Source Database Adoption in TelecomEDB
 
Low-Power Wide Area - Overview
Low-Power Wide Area - OverviewLow-Power Wide Area - Overview
Low-Power Wide Area - OverviewM2M Alliance e.V.
 
Mobile Networks - Evolving to all-IP Backbone
Mobile Networks - Evolving to all-IP BackboneMobile Networks - Evolving to all-IP Backbone
Mobile Networks - Evolving to all-IP BackboneHarry Mylonas
 
GE Smallworld Overview September2010
GE Smallworld Overview September2010GE Smallworld Overview September2010
GE Smallworld Overview September2010cwilson5496
 
Optimizing Cloud Computing with IPv6
Optimizing Cloud Computing with IPv6Optimizing Cloud Computing with IPv6
Optimizing Cloud Computing with IPv6John Rhoton
 
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Purpose-built NoSQL Database for IoT by Basavaraj SoppannavarPurpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Purpose-built NoSQL Database for IoT by Basavaraj SoppannavarData Con LA
 
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDBReal-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDBVoltDB
 

Similar to Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data - Jason Han, NexR (20)

Cloud Computing Service & Market Trends in Korea
Cloud Computing Service & Market Trends in KoreaCloud Computing Service & Market Trends in Korea
Cloud Computing Service & Market Trends in Korea
 
Broadband World Forum 2012 Highlights
Broadband World Forum 2012 HighlightsBroadband World Forum 2012 Highlights
Broadband World Forum 2012 Highlights
 
WTSA-16_SG13_Presentation.pptx
WTSA-16_SG13_Presentation.pptxWTSA-16_SG13_Presentation.pptx
WTSA-16_SG13_Presentation.pptx
 
ITU-T Study Group 13 Introduction
ITU-T Study Group 13 IntroductionITU-T Study Group 13 Introduction
ITU-T Study Group 13 Introduction
 
Javier Lecanda - Colt SDN/NFV Experience inca 201706
Javier Lecanda - Colt SDN/NFV Experience   inca 201706Javier Lecanda - Colt SDN/NFV Experience   inca 201706
Javier Lecanda - Colt SDN/NFV Experience inca 201706
 
Colt SD-WAN experience learnings and future plans
Colt SD-WAN experience learnings and future plansColt SD-WAN experience learnings and future plans
Colt SD-WAN experience learnings and future plans
 
CDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım Örnekleri
CDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım ÖrnekleriCDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım Örnekleri
CDRLive & CDRInsight,CDR Verileri ile Đs Zekası ve Kullanım Örnekleri
 
Colt’s Carrier SDN & NFV: Experience, Learnings & Future Plans
Colt’s Carrier SDN & NFV: Experience, Learnings & Future PlansColt’s Carrier SDN & NFV: Experience, Learnings & Future Plans
Colt’s Carrier SDN & NFV: Experience, Learnings & Future Plans
 
Gef 2012 InduSoft Presentation
Gef 2012  InduSoft PresentationGef 2012  InduSoft Presentation
Gef 2012 InduSoft Presentation
 
Infrastructure Strategies 2007
Infrastructure Strategies 2007Infrastructure Strategies 2007
Infrastructure Strategies 2007
 
GE Smallworld Network Inventory Overview
GE Smallworld Network Inventory OverviewGE Smallworld Network Inventory Overview
GE Smallworld Network Inventory Overview
 
Radisys offloading 10412_final
Radisys offloading 10412_finalRadisys offloading 10412_final
Radisys offloading 10412_final
 
KT/KTDS Case Study: Open Source Database Adoption in Telecom
KT/KTDS Case Study: Open Source Database Adoption in TelecomKT/KTDS Case Study: Open Source Database Adoption in Telecom
KT/KTDS Case Study: Open Source Database Adoption in Telecom
 
Low-Power Wide Area - Overview
Low-Power Wide Area - OverviewLow-Power Wide Area - Overview
Low-Power Wide Area - Overview
 
Mobile Networks - Evolving to all-IP Backbone
Mobile Networks - Evolving to all-IP BackboneMobile Networks - Evolving to all-IP Backbone
Mobile Networks - Evolving to all-IP Backbone
 
GE Smallworld Overview September2010
GE Smallworld Overview September2010GE Smallworld Overview September2010
GE Smallworld Overview September2010
 
Building a Digital Telco
Building a Digital TelcoBuilding a Digital Telco
Building a Digital Telco
 
Optimizing Cloud Computing with IPv6
Optimizing Cloud Computing with IPv6Optimizing Cloud Computing with IPv6
Optimizing Cloud Computing with IPv6
 
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Purpose-built NoSQL Database for IoT by Basavaraj SoppannavarPurpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar
 
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDBReal-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
Real-time Big Data Analytics in the IBM SoftLayer Cloud with VoltDB
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data - Jason Han, NexR

  • 1. Next Revolution Toward Open Platform NYC 2011 www.nexr.com
  • 2. NexR Introduction. Big data analytics firm: working on Hadoop and big data for 5 years; provided a NexR Hadoop solution to all major Korean telcos (KT, SKT, LG U+); leading a Korean Hadoop community and holding Hadoop conferences. Products: NexR Data Analytics Platform (NDAP); iCube Cloud, a cloud computing platform (like OpenStack); massive email archiving solution (presented at Hadoop World 2009).
  • 3. Agenda: Voice of Customer (KT CDR Analysis System); KT requirements for system migration; NexR Data Analytics Platform (NDAP) overview; Oracle-to-Hive migration; Enterprise Hive; RHive; lessons learned; conclusion. You can download this presentation file: http://www.nexr.com/hw11/ndap.pdf
  • 4. Introduction to KT business. Timeline: 1981.12 establishment of KT Corporation; 2002.08 privatization from government-owned company; 2006.04 commercial launch of the world's first Mobile WiMAX (WiBro); 2008.11 commercial launch of real-time IPTV; 2009.06 merged with KTF; 2010.06 cloud service launched. Business domain: mobile 2G, 3G; WiBro (Mobile WiMAX); Internet access; IPTV; VoIP; multimedia contents; local and international telephone; cloud service. Key figures (values not captured in the transcript): number of employees, sales (2010), telephone subscribers, broadband subscribers, mobile subscribers.
  • 5. Introduction to KT CDR data
      KT CDR (Call Detail Record) volumes, unit: TB
        Data                                  Raw data, 1 month   Summary, 1 month   Size (raw 1 yr + summary 2 yrs)
        Unrated CDR (voice, data, SMS, MMS)   3.7                 2.5                104
        Wireless Rated CDR                    1.5                 0.2                22
        Wi-Fi                                 0.4                 0.3                12
        Wibro                                 1.5                 1.0                42
        Wireline Rated CDR                    1.5                 1.5                55
        IPDR IP-TV                            1.5                 0.1                19
        Total                                 10                  5.6                254
      KT Subscriber Analysis System (SAS) for wireless CDR
        Reporting, call detail summary, subscriber's call quality, call log search, etc.
        Implemented with a relational database on a high-end server:
          - Data gathering and converting in a server every tens of seconds
          - Daily batch extract-transform-load (ETL) with SQL queries
          - Near real-time search against an indexed column (call number)
        Hundreds of DB tables and over 3000 SQL queries accumulated over 10 years
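The size column on the KT CDR slide is described as raw data kept for one year plus summaries kept for two years. A minimal Python sketch of that arithmetic follows; the function name and the 12/24-month retention parameters are assumptions made for illustration, and the estimates land within about 1.5 TB of the stated figures rather than matching them exactly.

```python
# Monthly volumes in TB copied from the slide: (raw per month, summary per month, stated size).
CDR_VOLUMES_TB = {
    "Unrated CDR": (3.7, 2.5, 104),
    "Wireless Rated CDR": (1.5, 0.2, 22),
    "Wi-Fi": (0.4, 0.3, 12),
    "Wibro": (1.5, 1.0, 42),
    "Wireline Rated CDR": (1.5, 1.5, 55),
    "IPDR IP-TV": (1.5, 0.1, 19),
}


def retained_size_tb(raw_per_month, summary_per_month, raw_months=12, summary_months=24):
    """Estimate retained volume: raw data for one year plus summaries for two years."""
    return raw_per_month * raw_months + summary_per_month * summary_months


for name, (raw, summary, stated) in CDR_VOLUMES_TB.items():
    estimate = retained_size_tb(raw, summary)
    print(f"{name:20s} estimate {estimate:6.1f} TB, stated {stated} TB")
```

The small gaps between estimate and stated size presumably come from rounding in the monthly figures.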
  • 6. Current KT CDR Analysis System Architecture (diagram). Labels: Data Sources; Collector Server (Data Converting); Relational Database on a High-end Server (Raw Data, Dimension Table, Summary Table); Real-Time Search; Batch ETL; OLAP; LALA2; NIBADA; ARGOS. Four points in the diagram are marked "Bottleneck".
  • 7. New Challenges Faced. Increasing data volume: popular demand for smartphones and SNS; need to do more complicated data analysis to beat the competition; customer behavior analysis is needed. Slow performance: peak-time performance became unacceptable; some CDRs were lost due to slow performance. KT Cloud business launched: cheaper new KT Cloud hardware is available; open source requirements are increasing in the company. Can a traditional DB give us an answer?
  • 8. KT meets NexR for Big Data. Replacing traditional RDB and DW with Hadoop and similar OSS: project start (2011.4); applying the NexR solution for CDR analysis (pilot). Scalability: coping with increasing data volume and variety (wired, 2G, 3G, WiMAX, LTE, WiFi, SMS, MMS, etc); enabling horizontal scalability in every data path (data collection, data storage, ETL process, data search). Performance: handling streamed CDR data in near real-time; completing daily ETL tasks in a given time period regardless of data increase. Cost-efficiency: reducing cost with inexpensive equipment.
  • 9. Continuous Journey for KT's Big Data: a step-by-step approach with NexR
        Step                                    Open      Coverage
        Hadoop CDR Analysis Platform (Pilot)    2012.1Q   Replacing representative data and SQLs; unrated wireless CDR
        Wireless CDRs                           2012.     Change all traditional applications to OSS; add more views and reports
        Data Integration, Advanced Analytics    2013.     Rated CDRs, Internet access log, TV log; advanced analytics
        External Data Sources                   2014.     SNS, location, etc; data from KT subsidiaries
  • 10. Rethinking KT's Requirements (diagram). Data, from past to present to future: data volume explosion plus data integration and data variety (Internet access log, IPTV log, social data). KT CDR system data interface: hundreds of DB tables, 3000 SQLs (for 10 years). Who? Data engineers: DBA, SQL developers, business analysts (OLAP, SAS, etc).
  • 11. Big Data Analytics Requirements for Enterprise. Data volume is only the basic requirement. Data integration is the fundamental requirement (structured data + unstructured data). Need to preserve the existing data and apps. Need to be familiar to enterprise data engineers (DBA, SQL developers, business analysts, etc). Smooth transition is also essential. What's the solution?
  • 12. NexR Solution: Hadoop + Hive. Hive is the best solution for a smooth transition from the database world to the Hadoop world. Query engine: ANSI-SQL-based; good for RDB migration; batch data processing (ETL, reporting, ad-hoc query). Common data storage: file-based data store; good for data integration. Open questions: how to convert, and how to adapt for enterprise data engineers (DBA, SQL developers).
  • 13. NexR Data Analytics Platform (NDAP). Embracing the database in the Hadoop world: support for migrating data and logic from RDB to Hadoop; support for integrating RDB and Hadoop; offering Hive tools for DBAs and SQL developers. Full package for big data analytics: from data collection to batch data processing, real-time query, and even advanced analytics. Leveraging open source technologies: horizontal scalability in every data processing path (collection, batch processing, real-time query, etc); injecting real-world practices through collaboration with KT.
  • 14. NDAP Bird's View (with the open source software used)
        NDAP RHive (advanced analytics): integration of R and Hive
        NDAP Enterprise Hive (batch data processing): Oracle-to-Hive, Hive workflow, Hive performance monitor, query planner
        NDAP Data Store (common data storage): HDFS, Sqoop-based data import/export
        NDAP Search (real-time query): ElasticSearch-based distributed log search; time-ranged index sharding
        NDAP Collector (streamed data collection): Flume-based data collector; checkpointing for low-overhead agents
        NDAP Admin Center (coordination & management): ZooKeeper-based distributed coordinator; collectd-based system/app management
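The slide names time-ranged index sharding for NDAP Search without detailing it. One plausible reading, sketched below in Python, is one index per day so that a time-bounded query touches only the indices overlapping its range. The `cdr-YYYY.MM.DD` naming scheme and both function names are hypothetical and are not taken from NDAP or ElasticSearch.

```python
from datetime import date, timedelta


def index_name(day, prefix="cdr"):
    """Hypothetical daily index name, e.g. cdr-2011.11.08."""
    return f"{prefix}-{day:%Y.%m.%d}"


def indices_for_range(start, end, prefix="cdr"):
    """Return the daily indices a query over [start, end] would need to search."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    days = (end - start).days
    return [index_name(start + timedelta(days=i), prefix) for i in range(days + 1)]


# A three-day search touches three shards instead of the whole archive.
print(indices_for_range(date(2011, 11, 7), date(2011, 11, 9)))
```

With this layout, expiring old data amounts to dropping whole indices rather than deleting individual records.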
  • 15. NDAP Architecture (diagram). Data sources: Oracle, databases, ODS, telco equipment (streaming data). NexR Data Analytics Platform (NDAP): RHive; Enterprise Hive with Hawk (PerfMon, Query Plan), Lama (Hive Workflow), and Oracle-to-Hive; Data Importer; Data Exporter; Data Store (Hadoop); Collector; Search (REST/JSON API); Admin Center. Applications: advanced analytics, DBA, ETL, ad-hoc query, reporting, existing BI/DW, OLAP server, Oracle, real-time query.
  • 16. NDAP Bird's View: Today's focus
        Today's talk
          NDAP RHive: integration of R and Hive
          NDAP Enterprise Hive: Oracle-to-Hive, Hive workflow, Hive performance monitor, query planner
        Refer to appendix
          NDAP Data Store: HDFS, Sqoop-based data import/export
          NDAP Search: Lucene-based distributed log search engine; time-ranged index sharding
          NDAP Collector: Flume-based data collector; checkpointing for low-overhead agents
          NDAP Admin Center: ZooKeeper-based distributed coordinator; collectd-based system/app management
  • 17. Enterprise Hive: Recreating Hive for Enterprise Data Engineers. Two goals: (1) migration of data and SQL from RDB (Oracle) to Hive, through Oracle-to-Hive support; (2) a rich environment for Hive developers, and even DW/BI teams and DBAs, through a performance monitor, query planner, and workflow manager.
  • 18. Is Oracle-to-Hive trivial?
      Simple example
        Oracle: SELECT * FROM Employee e1, Dept d1 WHERE e1.ID = d1.Id
        Hive:   SELECT * FROM Employee e1 JOIN Dept d1 ON (e1.ID = d1.Id)
      Typical example (Oracle)
        SELECT /*+ PARALLEL(K1 16) USE_NL(K1 B) */
               ETL_DATE, CALL_DATE,
               CASE WHEN SUBSCRIBER_TYPE = 'PREMIUM' THEN 'Y'
                    ELSE NVL(TO_CHAR(B.I_NCN), 'X') END AS I_NCN,
               I_INOUT, VALID_CNT, I_CFC_TYPE, ……
        FROM 3G_CALL_LOG K1, SASCOMM.PHONE_MAPPING B
        WHERE K1.i_etl_dt = TO_DATE('[#SAS_YDAY#]','YYYYMMDD')
          AND K1.i_call_dt ||'' >= TO_DATE('[#SAS_YDAY#]','YYYYMMDD')
          AND K1.i_call_dt ||'' < TO_DATE('[#SAS_YDAY#]','YYYYMMDD') + 1
          AND K1.I_INOUT IN ('0','1')
          AND DECODE(K1.I_INOUT, '0', NVL(K1.I_OUT_CTN, I_CALLING_NUM), '1', K1.I_IN_CTN) = B.I_CTN(+)
          AND K1.CALL_DATE >= B.SDATE(+)
          AND K1.CALL_DATE < B.EDATE(+);
  • 19. Enterprise Hive: Oracle-to-Hive. Enhancing Hive by: fixing Hive code (JIRA issues 2253, 2503, 2329, 2332, etc); adding Hive UDFs and UDAFs for Oracle compatibility. Enterprise Hive provides: conversion rules, a guide, and a process; coverage of Oracle data types that are not supported in Hive; coverage of Oracle functions that are not supported in Hive. Three conversion points to consider: data model and data types; basic functions, aggregate and analytic functions; SQL syntax.
  • 20. Oracle-to-Hive: Data Model, Types, Functions
      Data model (Oracle -> Hive)
        Table -> Table
        Partition -> Partition
        Sampling -> Bucket
      Data type (Oracle -> Hive)
        NUMBER(n) -> TINYINT / INT / BIGINT
        NUMBER(n,m) -> FLOAT / DOUBLE
        VARCHAR2 -> STRING
        DATE -> STRING in "yyyy-MM-dd HH:mm:ss" format
        Hive data types are designed to be converted into Java data types.
      Basic functions (Oracle | Hive); Hive function syntax follows MySQL
        Math functions:      round, ceil, mod, power, sqrt, sin/cos | round, ceil, pmod, power, sqrt, sin/cos
        Character functions: substr, trim, lpad/rpad, ltrim/rtrim, replace | substr, trim, lpad/rpad, ltrim/rtrim, regexp_replace
        Null functions:      coalesce, nvl, nvl2 | coalesce (no nvl, nvl2)
      Added basic functions (Hive UDF)
        Condition:       DECODE, GREATEST
        Null:            NVL, NVL2
        Type conversion: TO_NUMBER, TO_CHAR, TO_DATE, INSTR4, DATE_FORMAT, LAST_DAY
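The slide lists NVL, NVL2, and DECODE among the Oracle functions added to Hive as UDFs but does not show their implementation. As a reference for what such UDFs have to reproduce, here is a minimal Python sketch of the standard Oracle semantics; it models behaviour only and is not NexR's code.

```python
def nvl(value, default):
    """NVL(expr1, expr2): return expr2 when expr1 is NULL, otherwise expr1."""
    return default if value is None else value


def nvl2(value, if_not_null, if_null):
    """NVL2(expr1, expr2, expr3): expr2 when expr1 is not NULL, otherwise expr3."""
    return if_null if value is None else if_not_null


def decode(expr, *args):
    """DECODE(expr, search1, result1, ..., [default]).

    Returns the result paired with the first matching search value, the
    default if none match, or NULL (None) when no default is given.
    Oracle's DECODE treats two NULLs as equal, as None == None does here.
    """
    if len(args) % 2:
        pairs, default = args[:-1], args[-1]
    else:
        pairs, default = args, None
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if expr == search:
            return result
    return default


# Shape of the DECODE/NVL expression from the "typical example" query:
# outgoing calls ('0') use I_OUT_CTN, falling back to the calling number.
i_inout, i_out_ctn, i_calling_num, i_in_ctn = "0", None, "0101234", "0105678"
print(decode(i_inout, "0", nvl(i_out_ctn, i_calling_num), "1", i_in_ctn))
```

Unlike SQL, Python evaluates every argument before the call, which does not matter for these side-effect-free values.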
  • 21. Oracle-to-Hive: SQL Syntax & Analytic Functions
      Most unsupported Oracle SQL syntax can be converted with join syntax.
      IN subquery
        Oracle: SELECT * FROM Employee e WHERE e.DeptNo IN (SELECT d.DeptNo FROM Dept d)
        Hive:   SELECT * FROM Employee e LEFT SEMI JOIN Dept d ON (e.DeptNo = d.DeptNo)
      NOT IN subquery
        Oracle: SELECT * FROM Employee e WHERE e.DeptNo NOT IN (SELECT d.DeptNo FROM Dept d)
        Hive:   SELECT e.* FROM Employee e LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo) WHERE d.DeptNo IS NULL
      JOIN
        Oracle: SELECT * FROM Employee e1, Dept d1 WHERE e1.ID = d1.Id
        Hive:   SELECT * FROM Employee e1 JOIN Dept d1 ON (e1.ID = d1.Id)
      RANK (analytic function)
        Oracle: SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) FROM emp
        Hive:   SELECT e.name, e.dept, e.salary, RANK(e.dept, e.salary) FROM (SELECT name, dept, salary FROM emp DISTRIBUTE BY dept SORT BY dept, salary DESC) e
      MIN (aggregate function)
        Oracle: SELECT dept, MIN(salary) OVER (PARTITION BY dept) FROM emp
        Hive:   SELECT emp.dept, tmp.m FROM emp JOIN (SELECT dept, MIN(salary) m FROM emp GROUP BY dept) tmp ON emp.dept = tmp.dept
      Oracle analytic functions are sometimes used for statistical processing (5% in the KT case). Some analytic functions were implemented: RANK, DENSE_RANK, ROW_NUMBER, LAG, MIN, MAX, SUM.
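The subquery-to-join rewrites can be checked on a toy data set. The sketch below uses Python's built-in sqlite3 module as a stand-in for both engines, an assumption made only so the example runs anywhere. SQLite has no LEFT SEMI JOIN, so the IN case is emulated with a join against distinct keys. The table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dept (DeptNo INTEGER, DName TEXT);
CREATE TABLE Employee (Name TEXT, DeptNo INTEGER);
INSERT INTO Dept VALUES (10, 'Sales'), (20, 'Ops');
INSERT INTO Employee VALUES ('Kim', 10), ('Lee', 20), ('Park', 30), ('Choi', 10);
""")


def names(sql):
    """Run a query and return the first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


# IN subquery versus a join against the distinct department keys.
in_subquery = names(
    "SELECT e.Name FROM Employee e WHERE e.DeptNo IN (SELECT d.DeptNo FROM Dept d)")
semi_join = names(
    "SELECT e.Name FROM Employee e "
    "JOIN (SELECT DISTINCT DeptNo FROM Dept) d ON (e.DeptNo = d.DeptNo)")

# NOT IN subquery versus LEFT OUTER JOIN ... IS NULL (anti-join).
not_in_subquery = names(
    "SELECT e.Name FROM Employee e WHERE e.DeptNo NOT IN (SELECT d.DeptNo FROM Dept d)")
anti_join = names(
    "SELECT e.Name FROM Employee e "
    "LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo) WHERE d.DeptNo IS NULL")

print(in_subquery, semi_join)
print(not_in_subquery, anti_join)
```

The two forms agree here because neither DeptNo column contains NULL. With a NULL in Dept.DeptNo, NOT IN returns no rows while the anti-join still returns unmatched employees, so a conversion rule of this kind needs a NULL check on the join key.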
  • 22. Oracle-to-Hive: Example
      Oracle
        select /*+ use_nl(E emp_idx1) */
               D.dname, E.empno, E.ename,
               decode(nvl(JOB, 'SALESMAN'), 'SALESMAN', sal, 0) sal,
               RANK() over (PARTITION BY D.deptno ORDER BY sal desc) ranking
        from dept D, emp E
        where D.deptno = E.deptno
          and E.ename in (select ename from bonus where job in ('SALESMAN', 'CLERK'));
      Hive
        select X.dname, X.empno, X.ename,
               nexr_rank(HASH(X.deptno, X.sal), X.sal) ranking
        from (
          select D.dname, D.deptno, E.empno, E.ename,
                 (case coalesce(JOB, 'SALESMAN') when 'SALESMAN' then sal else 0 end) sal
          from dept D
          join emp E on (D.deptno = E.deptno)
          join bonus B on (E.ename = B.ename)
          where B.job in ('SALESMAN', 'CLERK')
        ) X
        distribute by hash(X.deptno, X.sal)
        sort by X.deptno, X.sal;
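The slides use a rank UDF over rows that have been distributed and sorted, but do not show how the UDF works. A common way to build such a function is to keep a little state between rows; the Python sketch below illustrates that idea under the assumption that rows reach the function already grouped by partition key and sorted by value. The class name and sample rows are hypothetical.

```python
class StreamingRank:
    """RANK() emulation for rows that arrive already sorted by (partition, value)."""

    _UNSET = object()

    def __init__(self):
        self.partition = self._UNSET
        self.last_value = self._UNSET
        self.row_number = 0
        self.rank = 0

    def __call__(self, partition, value):
        if partition != self.partition:
            # New partition: restart numbering.
            self.partition = partition
            self.last_value = self._UNSET
            self.row_number = 0
        self.row_number += 1
        if value != self.last_value:
            # Ties keep the previous rank; a new value jumps to the row number.
            self.rank = self.row_number
        self.last_value = value
        return self.rank


emp = [("kim", "sales", 300), ("lee", "sales", 500),
       ("park", "sales", 500), ("choi", "ops", 400)]

# Stand-in for DISTRIBUTE BY dept SORT BY dept, salary DESC on one reducer.
ordered = sorted(emp, key=lambda row: (row[1], -row[2]))

rank = StreamingRank()
ranked = [(name, dept, salary, rank(dept, salary)) for name, dept, salary in ordered]
for row in ranked:
    print(row)
```

The approach only works if every row of a partition reaches the same function instance in sorted order, which is what the DISTRIBUTE BY and SORT BY clauses are there to guarantee.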
  • 23. NDAP Process for RDB Migration
      Data preparation: Oracle schema conversion to Hive schema; data loading to Hive using Sqoop
      Conversion: function conversion; SQL conversion (syntactically, by 1-on-1 conversion rule)
      Validation: data compatibility check
      Optimization: rewriting Hive queries semantically (when more performance is needed)
      The case of KT CDR migration: chose 100 representative SQLs for ETL and successfully converted them. Current step: 200-300 mainly used SQLs. Next step (2012): 3000 SQLs.
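The schema conversion step can be sketched with the type mapping given on the data types slide. The Python below is a minimal illustration rather than NDAP's converter: the precision thresholds that choose between TINYINT, INT, and BIGINT, the choice of DOUBLE over FLOAT, and both function names are assumptions.

```python
import re


def hive_type(oracle_type):
    """Map an Oracle column type to a Hive type using the slide's mapping."""
    t = oracle_type.strip().upper()
    number = re.fullmatch(r"NUMBER\((\d+)(?:\s*,\s*(\d+))?\)", t)
    if number:
        precision, scale = int(number.group(1)), number.group(2)
        if scale is not None and int(scale) > 0:
            return "DOUBLE"      # NUMBER(n,m) -> FLOAT/DOUBLE; DOUBLE chosen here
        if precision <= 2:
            return "TINYINT"     # assumed threshold: 2 digits fit in -128..127
        if precision <= 9:
            return "INT"         # assumed threshold: 9 digits fit in 32 bits
        return "BIGINT"          # very wide NUMBERs may still overflow BIGINT
    if t.startswith("VARCHAR2"):
        return "STRING"
    if t == "DATE":
        return "STRING"          # stored as "yyyy-MM-dd HH:mm:ss" text
    raise ValueError(f"no mapping for Oracle type: {oracle_type}")


def hive_ddl(table, columns):
    """Build a CREATE TABLE statement from (column name, Oracle type) pairs."""
    cols = ", ".join(f"{name} {hive_type(otype)}" for name, otype in columns)
    return f"CREATE TABLE {table} ({cols})"


print(hive_ddl("emp", [("empno", "NUMBER(4)"), ("ename", "VARCHAR2(10)"),
                       ("hiredate", "DATE"), ("sal", "NUMBER(7,2)")]))
```

Raising an error on unmapped types keeps unsupported Oracle types visible instead of silently defaulting them.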
  • 24. Enterprise Hive: Rich Environment for Hive. Building up the Hive ecosystem by adding assistant programs to help DBAs and SQL developers. Enterprise Hive provides: a Hive performance monitor and query planner; a Hive workflow manager.
  • 25. Hawk: Hive Performance Monitor. Difficulty of Hive performance diagnostics: metrics and logs from Hive and Hadoop are separated; there is no historical or statistical view of performance. Hawk performance monitor for DBAs: integrated view of a Hive query and the corresponding MapReduce jobs; hourly/daily/weekly/monthly performance views of each query. Figures: Hawk screenshot; Hawk architecture.
  • 26. Hawk: Hive Query Planner. Difficulty with the Hive default query planner: too complicated because it shows the details of MapReduce execution; aimed at Hive internal developers, not DBAs. Hawk query planner for DBAs: displays a Hive query at the HQL operator level (familiar to DBAs); shows the performance result together with the query. Figures: Hive default query planner; Hawk query planner; performance result.
  • 27. Lama: Hive Workflow Manager. Workflow development and management tool for Hive: managing data processing jobs for Hive; Oozie chosen as the core workflow engine. Web-based interface: workflow editing and management, user management, job scheduling, project management, etc. On-demand workflow change at runtime: a failed workflow needs to be fixed and resumed at runtime; most workflow engines do not support this; Oozie was patched for suspend/resume per action (i.e., per Hive query). Future plan: supporting other data processing jobs such as Pig, Sqoop, MapReduce, HDFS, SSH, and Java.
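The per-action suspend/resume behaviour can be illustrated with a toy workflow runner. This Python sketch is not Oozie's or Lama's API; the class, its methods, and the sample actions are hypothetical and only show the idea of stopping at the failing action, replacing it, and continuing without rerunning the actions that already succeeded.

```python
class Workflow:
    """Run named actions in order; suspend at the first failure and allow a resume."""

    def __init__(self, actions):
        self.actions = list(actions)   # ordered (name, callable) pairs
        self.position = 0              # index of the next action to run
        self.completed = []

    def run(self):
        while self.position < len(self.actions):
            name, action = self.actions[self.position]
            try:
                action()
            except Exception as error:
                # Keep the position so that a later run() retries this action.
                return f"suspended at {name}: {error}"
            self.completed.append(name)
            self.position += 1
        return "succeeded"

    def fix(self, name, action):
        """Replace a pending action, e.g. with a corrected Hive query."""
        for i, (existing, _) in enumerate(self.actions):
            if existing == name:
                if i < self.position:
                    raise ValueError(f"{name} has already completed")
                self.actions[i] = (name, action)
                return
        raise KeyError(name)


log = []


def broken_summary():
    raise RuntimeError("table not found")


wf = Workflow([("load", lambda: log.append("load")),
               ("summary", broken_summary),
               ("export", lambda: log.append("export"))])

first = wf.run()                                   # stops at the failing action
wf.fix("summary", lambda: log.append("summary"))   # patch the failing step
second = wf.run()                                  # resumes; "load" is not rerun
print(first, second, log)
```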
  • 28. NDAP Process for Batch Data Processing
      Analysis: analyze service request (SR); Hive data and query modeling
      Development: workflow development, testing, and validation
      Execution: workflow deployment and scheduling; workflow suspend/fix/resume on failure
      Management: performance diagnostics and optimization; performance monitoring
      Tools: Lama Workflow Manager; Hawk Performance Monitor; Hawk Query Planner
  • 29. Enterprise Hive Demo
  • 30. R for Advanced Analytics. R (GNU open source): programming language and software environment for statistical computing and graphics (Wikipedia); 4,000+ R libraries (more than SAS's functionality); becoming a de facto standard among statisticians. R for big data: R runs on a single node; some parallel R packages exist (snowfall, rpvm, Rmpi, etc); recent attempts to combine R and Hadoop include RHIPE (Purdue), RHadoop (Revolution Analytics), and Ricardo (IBM).
  • 31. RHive: Marrying R and Hive for Big Data Analytics. Most R programmers are familiar with SQL. Hive can hide the details of Hadoop and MapReduce. Inspired by IBM Ricardo (R + Jaql). R: strong for deep analytics; lacks massive data manipulation. Hive: strong for massive data manipulation; lacks analytical functionality. RHive provides Hive interfaces in the R environment, allowing R programmers to use familiar SQL for big data manipulation. Released as open source (Apache License, Version 2). Source: https://github.com/nexr/RHive CRAN: http://cran.r-project.org/web/packages/RHive
  • 32. RHive API and Architecture
      RHive API
        rhive.connect(): connect R to Hive
        rhive.query(): send a Hive query and return the result
        rhive.export(): export R functions to R processes running on the MapReduce nodes
        rhive.exportAll(): export R functions and R objects to R processes running on the MapReduce nodes
        rhive.close(): close a Hive connection
      Figure: RHive architecture
  • 33. RHive Sample: Flight Delay Prediction
      R: building a prediction model of flight delay using linear regression with a training data set (sampled from Hive)
      Hive: running the prediction model (R objects) on the entire data set in Hive
      Data set: airline on-time performance, http://stat-computing.org/dataexpo/2009/ (flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008)

        library(RHive)
        rhive.connect("127.0.0.1")

        # get a training data set from Hive
        trainset <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines", fetchsize=30, limit=100)

        # convert to numeric, and drop rows with missing values
        trainset$arrdelay <- as.numeric(trainset$arrdelay)
        trainset$distance <- as.numeric(trainset$distance)
        trainset <- trainset[!(is.na(trainset$arrdelay) | is.na(trainset$distance)),]

        # create a prediction model using R model objects and internal functions
        model <- lm(arrdelay ~ distance + dayofweek, data=trainset)
        rhpredict <- function(arg1, arg2, arg3) {
          if(arg1 == "NULL" | arg2 == "NULL" | arg3 == "NULL")
            return(0.0)
          res <- predict.lm(model, data.frame(dayofweek=arg1, arrdelay=arg2, distance=arg3))
          return(as.numeric(res))
        }
        null <- "NULL"

        # set up R objects in Hive
        rhive.assign("null", null)
        rhive.assign("rhpredict", rhpredict)
        rhive.assign("model", model)

        # export the R prediction model and run it in Hive
        rhive.exportAll("rhpredict", c("10.1.3.2","10.1.3.3","10.1.3.4","10.1.3.5","10.1.3.6","10.1.3.7"))
        rhive.query("create table delaypredict as select R('rhpredict', dayofweek, arrdelay, distance, 0.0) from airlines")
  • 34. RHive Demo
  • 35. Lessons Learned
    - Migrating from an RDB to open source software is complicated, time-consuming, and labor-intensive.
      - It becomes practical with experience and a defined migration process.
      - Average time to convert one query (200 to 300 lines on average): 8 hours -> 2 hours after 4 months (4 times faster)
      - Engineers who have done a database migration before (similar to an Oracle-to-MySQL migration) have an advantage.
    - Current data engineers are not familiar with open source software such as Hadoop.
      - They want to use software tools similar to the ones they already use.
    - Open source software such as Hadoop and MapReduce is not easy for current IT managers.
      - Open source software is technology-driven, not demand-driven.
      - Open source software and technologies need to be wrapped in familiar interfaces that hide the details.
  • 36. Lessons Learned
    - Open source software (OSS) is not a panacea.
      - Choosing the right OSS is the first significant step.
    - Combining several OSS projects is common, and modifying OSS source code is inevitable if requirements are not negotiable.
      - Two separate OSS projects, Hive and ElasticSearch, were combined for batch data processing and real-time query, with Hadoop as the common data store.
      - Hive, ElasticSearch, Flume, Oozie, Zookeeper, etc. were modified.
    - Integrating various types of data is a critical issue for an enterprise.
      - In particular, structured data from databases and DWs needs to be coupled with unstructured data to better understand customers' needs.
    - It is necessary to carry existing data and business logic into the new environment.
      - RDB/DW and Hadoop each have pros and cons, so it is necessary to find the right mix.
  • 37. Conclusion
    - Big data analytics for telcos and enterprises
    - Smooth transition from RDB/DW to NexR Data Analytics Platform (NDAP)
  • 38. NexR NDAP Team
    Jaesun Han; Wonkuk Yang; Sangmin Kwak; Sebong Oh; JeongMin Kwon; SungHan Woo; Keumju Kim; Dongmin Yu; Daegeun Kim; Choonghyun Ryu; Minseok Kim; Bokju Yun; Minwoo Kim; Jonghee Lee; Yeonseop Kim; HyungJoo, Lim; Youngwoo Kim; HeeWon Jeon; Hyeon-Cheol Nah; GooBum Jung; SeungWoo Ryu; Sunghwan Cho; Seoeun Park; Junho Cho; Young-Geun Park; ByungMyon Chae; Eun-Sook Park; Yungtai Choi; Chihoon Byun; Choi Jong-wook; SeongHwa Ahn; Inho Han; Youngbae An; Seonghak Hong
  • 39. Thank you
    - Presentation file: http://www.nexr.com/hw11/ndap.pdf
    - Contact: jason.han@nexr.com, twitter: @jaesun_han
    - Section links: KT CDR System (Slide 4), NDAP Overview (Slide 14), Enterprise Hive (Slide 17), RHive (Slide 30), Appendix (Slide 37)
  • 41. NDAP Collector
    - Flume-based scalable data collector
      - Flume was chosen for its flexible architecture (source, decorator, sink)
      - Added a checkpoint mode and rolling/dedup
    - Checkpoint reliability mode
      - Chukwa's checkpoint is grafted onto Flume
      - Consumes fewer resources in agents than Flume's E2E mode
      - Minimizes the amount of log data retransmitted when an agent fails
    - Rolling and deduplication
      - Periodically rolls fragmented log data in Hadoop
      - Removes duplicated log data in case of failover
    [Figure labels: Flume Agent (Source, Decorator, Sink); Checkpoint (log data & location); Search; Data Store (Hadoop); Rolling/Dedup Manager; Zookeeper; Workflow Scheduler; MapReduce Execution Manager; Rolling/Dedup MR]
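The checkpoint and deduplication behavior described on this slide can be illustrated with a small in-memory sketch. This is not NDAP, Flume, or Chukwa code: the `Agent` class, the `("app.log", offset)` record key, and the `dedup` helper are hypothetical names chosen for illustration. The agent resumes from its last saved checkpoint after a failure, so records sent after that checkpoint are retransmitted, and a later pass removes the resulting duplicates.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical log agent that remembers the last offset it checkpointed."""
    lines: list           # log lines of one file, in order
    checkpoint: int = 0   # offset of the first line not covered by a checkpoint
    sent: int = 0         # offset of the next line to send

    def send(self, store, count):
        """Send up to `count` lines, tagging each with a (file, offset) key."""
        end = min(self.sent + count, len(self.lines))
        for offset in range(self.sent, end):
            store.append((("app.log", offset), self.lines[offset]))
        self.sent = end

    def save_checkpoint(self):
        self.checkpoint = self.sent

    def restart(self):
        """After a failure, resume from the checkpoint rather than from `sent`."""
        self.sent = self.checkpoint


def dedup(store):
    """Keep the first copy of each (file, offset) key, preserving order."""
    seen = set()
    unique = []
    for key, line in store:
        if key not in seen:
            seen.add(key)
            unique.append((key, line))
    return unique


store = []
agent = Agent(lines=["a", "b", "c", "d", "e"])
agent.send(store, 2)      # sends offsets 0 and 1
agent.save_checkpoint()   # checkpoint now covers offsets 0 and 1
agent.send(store, 2)      # sends offsets 2 and 3, not yet checkpointed
agent.restart()           # failure: only offsets 2 and 3 are resent
agent.send(store, 3)      # resends 2 and 3, then sends 4
rolled = dedup(store)     # duplicates of offsets 2 and 3 are dropped
```

Only the records sent since the last checkpoint are retransmitted, which is the property the slide contrasts with an end-to-end acknowledgement mode.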
  • 42. NDAP Search: Near Real-Time Indexing
    - Near real-time indexing using a RAM index
      - Added a RAM index to ElasticSearch for near real-time indexing
      - The RAM index is flushed into the disk index after a given time period or on a buffer overflow
      - When searching, both the RAM index and the disk index are examined
    [Figure labels: Indexer (IndexWriter); Searcher (IndexReader); add; search; create; write; read; commit; buffer; Disk Index]
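The flush and search rules on this slide can be sketched with a toy inverted index. This is an illustration only, not ElasticSearch or Lucene code: the `NearRealTimeIndex` class, its parameters, and the whitespace tokenizer are all assumptions, and a plain dictionary stands in for the disk index.

```python
import time


class NearRealTimeIndex:
    """Toy inverted index: a RAM buffer in front of a stand-in 'disk' index."""

    def __init__(self, max_buffered_docs=1000, flush_interval_sec=5.0,
                 clock=time.monotonic):
        self.ram = {}    # term -> set of doc ids, searchable immediately
        self.disk = {}   # term -> set of doc ids, stands in for the disk index
        self.buffered_docs = 0
        self.max_buffered_docs = max_buffered_docs
        self.flush_interval_sec = flush_interval_sec
        self.clock = clock
        self.last_flush = clock()

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.ram.setdefault(term, set()).add(doc_id)
        self.buffered_docs += 1
        # Flush on buffer overflow or once the time period has elapsed.
        overflow = self.buffered_docs >= self.max_buffered_docs
        expired = self.clock() - self.last_flush >= self.flush_interval_sec
        if overflow or expired:
            self.flush()

    def flush(self):
        """Merge the RAM index into the disk index and empty the buffer."""
        for term, ids in self.ram.items():
            self.disk.setdefault(term, set()).update(ids)
        self.ram.clear()
        self.buffered_docs = 0
        self.last_flush = self.clock()

    def search(self, term):
        """Examine both indexes and return the union of matching doc ids."""
        term = term.lower()
        return self.ram.get(term, set()) | self.disk.get(term, set())


index = NearRealTimeIndex(max_buffered_docs=2, flush_interval_sec=3600.0)
index.add(1, "login failed")
hits_before_flush = index.search("failed")   # served from the RAM index
index.add(2, "login ok")                      # second document overflows the buffer
index.add(3, "disk failed")                   # buffered in RAM again
hits_after_flush = index.search("failed")    # union of disk and RAM results
```

A document becomes searchable as soon as it is added, because searches consult the RAM buffer as well as the flushed index.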
  • 43. NDAP Search: Index Split Strategies
    - Modified ElasticSearch to add more index split schemes for log search
      - Log searches usually carry a time constraint, such as daily or monthly
      - Combines time-based and size-based index splits
    - Time-based index split
      - Splits an index according to a given time period
      - Improves indexing and search performance
      - Makes auto-retention easy to implement
    - Size-based index split
      - Splits an index according to a given size
      - Resolves the performance problem of a big index
    [Figure: time-based index partitions (2011.10.08, 2011.10.09, 2011.10.30), each divided into size-based index sequences (0001, 0002, 0003), which are stored as ElasticSearch index shards (0, 1, 3), each with a primary and two replicas]
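One way to picture the combined split is a router that chooses an index name for each incoming document. The sketch below is an assumption-laden illustration, not the NDAP modification itself: `SplitIndexRouter`, the `YYYY.MM.DD-NNNN` naming scheme, and the byte-count threshold are all hypothetical.

```python
from datetime import date


class SplitIndexRouter:
    """Picks an index name per document: one partition per day, plus a new
    sequence within the day once the current sequence would exceed max_bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.current = {}   # day -> (sequence number, bytes written so far)

    def route(self, day, doc_bytes):
        seq, used = self.current.get(day, (1, 0))
        if used > 0 and used + doc_bytes > self.max_bytes:
            seq, used = seq + 1, 0           # size-based split
        self.current[day] = (seq, used + doc_bytes)
        return f"{day:%Y.%m.%d}-{seq:04d}"   # time-based partition + sequence

    def expired(self, today, keep_days):
        """Days whose partitions fall outside the retention window."""
        return sorted(d for d in self.current if (today - d).days >= keep_days)


router = SplitIndexRouter(max_bytes=100)
names = [
    router.route(date(2011, 10, 8), 60),
    router.route(date(2011, 10, 8), 60),   # would exceed 100 bytes: new sequence
    router.route(date(2011, 10, 9), 60),   # new day: new partition
]
old_days = router.expired(date(2011, 10, 30), keep_days=22)
```

Because each day has its own partition, retention reduces to dropping whole partitions, and a time-bounded search only has to touch the partitions inside its range.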
  • 44. NDAP Admin Center: Distributed Coordinator
    - Zookeeper-based distributed coordinator
      - Zookeeper handles the coordination among NDAP components
      - Patched several issues in Zookeeper and ZkClient
    - Common libraries for NDAP components
      - Group membership, master election, distributed lock, distributed queue
      - Easy to use and more reliable than other recipes, especially for read-and-write problems
    [Figure labels: Patched Zookeeper Ensemble; Zookeeper Client Thread (Complex, Unique, Fragile); Patched ZkClient Thread; Zookeeper Recipes (Easy, Reusable, Fault Tolerant): Group Membership, Master Election, Distributed Lock, Distributed Queue; NDAP Search; NDAP Collector]
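The group membership and master election recipes listed on this slide are commonly built on ephemeral sequential znodes, where the member holding the lowest sequence number acts as master. The sketch below simulates that idea in memory. It does not talk to Zookeeper and is not the NDAP library: `FakeEnsemble` and the member names are hypothetical.

```python
import itertools


class FakeEnsemble:
    """In-memory stand-in for ephemeral sequential znodes under one parent."""

    def __init__(self):
        self._counter = itertools.count()
        self.members = {}   # sequence number -> member name

    def join(self, name):
        """Register a member and return its sequence number."""
        seq = next(self._counter)
        self.members[seq] = name
        return seq

    def leave(self, seq):
        # Models an ephemeral node disappearing when its session ends.
        self.members.pop(seq, None)

    def group(self):
        """Current group membership, ordered by sequence number."""
        return [self.members[s] for s in sorted(self.members)]

    def master(self):
        # The member holding the lowest sequence number is the master.
        return self.members[min(self.members)] if self.members else None


ensemble = FakeEnsemble()
a = ensemble.join("collector-1")
b = ensemble.join("collector-2")
c = ensemble.join("search-1")
first_master = ensemble.master()
ensemble.leave(a)                  # the master's session ends
second_master = ensemble.master()  # the next-lowest member takes over
remaining = ensemble.group()
```

In a real deployment each member would also watch for changes so that it learns about a failover; that part is omitted here.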
  • 45. NDAP Admin Center: System/App Management
    - Collectd-based system and application monitoring
      - Server resource monitoring: CPU, memory, disk, process, vmem, TCP connections, etc.
      - Application monitoring: Hadoop, ElasticSearch, Flume, Zookeeper, Memcached, Collectd, etc.
      - Plug-in architecture: more applications, such as NoSQL stores, can be added
    - Resource-centric view
      - Displays the status of one specific resource (CPU, memory, etc.) for all nodes on a single screen
      - Most system management tools (Ganglia, Nagios, etc.) offer a node-centric view
    [Figure labels: Collectd; Collectd Server; Check threshold/severity; Management Dashboard; NDAP Admin]
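The difference between a node-centric and a resource-centric view comes down to how the same samples are pivoted. The sketch below is illustrative only: the sample values, the `resource_view` and `severity` helpers, and the 70/90 thresholds are assumptions, not NDAP Admin or Collectd settings.

```python
def severity(value, warning, critical):
    """Map a metric value to a severity label using two thresholds."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def resource_view(samples, resource, warning=70.0, critical=90.0):
    """Pivot node-centric samples into one row per node for a single resource,
    with the most heavily loaded nodes first."""
    rows = [
        (node, metrics[resource], severity(metrics[resource], warning, critical))
        for node, metrics in samples.items()
        if resource in metrics
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Node-centric input: every node reports all of its own metrics (percent used).
samples = {
    "node1": {"cpu": 35.0, "mem": 92.5},
    "node2": {"cpu": 95.0, "mem": 40.0},
    "node3": {"cpu": 72.0, "mem": 71.0},
}
cpu_view = resource_view(samples, "cpu")
mem_view = resource_view(samples, "mem")
```

Each view answers a question such as "which nodes are short on memory right now" in one list, without opening every node's page in turn.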