Programming Cascading

Taewook Eom
Data Infrastructure Group
SK planet
taewook@sk.com
2014-09-25
Big Data Processing
Low automation, high flexibility
High automation, low flexibility

Cascading
http://www.cascading.org/
Since 2007, by Chris Wensel (CTO, founder of Concurrent, Inc.)

Cascading
Provides abstractions by analogy with basic database concepts and plumbing (pipes and operators)
A pattern language for business process management of enterprise data workflows

http://docs.cascading.org/cascading/2.5/userguide/pdf/userguide.pdf
http://docs.cascading.org/impatient/ https://github.com/Cascading/Impatient
Cascading for the Impatient

Cascading
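•The Flow Planner catches errors in the planning phase (compile time), before anything runs p.23
–Fields required by each operation p.31
–Order of operations
–How pipes and taps are connected
–Builds a dependency graph -> builds a DAG p.37
•DAG (Directed Acyclic Graph)
–A natural shape for data workflows
–Used by many data processing engines: Microsoft Dryad, Apache Tez, Apache Spark
•Suited to enterprise environments
–Predictable, because it works from a physical plan rather than a logical plan p.24
–Deterministic strategy: the physical execution plan does not change from run to run p.36
–One JAR file applies at any scale (Same JAR, any scale) p.33
•Covers business logic, system integration, unit tests, sanity checks, and exception handling
•Lowers operational complexity p.172
–Like Hive, aimed at high throughput rather than ad-hoc queries or fast responses, so it suits ETL p.142
–Tools and procedures familiar to Java developers p.172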
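The plan-time checks listed above (required fields, operation order) can be modelled in a few lines of plain Java. This is only a sketch of the idea, not the Cascading planner: `PlanCheck`, `Step`, and `validate` are hypothetical names invented for illustration, and fields are assumed to be simple strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of plan-time field checking; not the Cascading planner. */
final class PlanCheck {

    /** One operation: the fields it reads and the fields it adds to the stream. */
    static final class Step {
        final String name;
        final List<String> requires;
        final List<String> produces;

        Step(String name, List<String> requires, List<String> produces) {
            this.name = name;
            this.requires = requires;
            this.produces = produces;
        }
    }

    /**
     * Walks the steps in order without touching any data and reports every
     * step whose required fields are not available at that point.
     */
    static List<String> validate(List<String> sourceFields, List<Step> steps) {
        Set<String> available = new LinkedHashSet<>(sourceFields);
        List<String> errors = new ArrayList<>();
        for (Step step : steps) {
            for (String field : step.requires) {
                if (!available.contains(field)) {
                    errors.add(step.name + ": missing field '" + field + "'");
                }
            }
            available.addAll(step.produces);
        }
        return errors;
    }
}
```

An empty result means the chain is consistent. Swapping two steps so that one reads a field before it is produced yields an error without any data being processed, which is the point of checking at plan time.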

Cascading Terminology
http://docs.cascading.org/cascading/2.5/userguide/html/ch03.html#N2013B 3.1 Terminology
•Pipe: Data stream
•Filter: Data operation
•Tuple: Data record
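•Branch: a simple chain of pipes with no splits or merges
•Pipe Assembly: an interconnected set of pipe branches
•Tuple Stream: the series of tuples passing through a pipe branch or assembly
•Tap: Data source/sink
•Flow: one or more pipe assemblies connected to their Taps
•Cascade
–A set of Flows executed as a single process
–A Flow does not run until its data dependencies on other Flows are satisfied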

Pipe Types
http://docs.cascading.org/cascading/2.5/userguide/html/ch03s03.html#N20276 Types of Pipes
•Each
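–Applies a Filter or a Function
–A Filter can only remove Tuples
–A Function can add or change fields and can emit multiple Tuples
–The default Output Selector for a Function is Fields.RESULTS
•Every
–Used only on the output of GroupBy or CoGroup
–Applies an Aggregator or a Buffer
•Always specify the Output Selector fields for a Function, Aggregator, or Buffer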
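The difference between a Filter and a Function on an Each pipe can be shown with a small plain-Java model. This is a sketch of the semantics only, not the Cascading API: `EachModel`, `eachFilter`, and `eachFunction` are hypothetical names, and the function variant returns only its own results, which corresponds to selecting Fields.RESULTS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical plain-Java model of Each; not the Cascading API. */
final class EachModel {

    /** Filter semantics: drops tuples for which isRemove is true and never changes a tuple. */
    static <T> List<T> eachFilter(List<T> stream, Predicate<T> isRemove) {
        List<T> out = new ArrayList<>();
        for (T tuple : stream) {
            if (!isRemove.test(tuple)) {
                out.add(tuple);
            }
        }
        return out;
    }

    /** Function semantics: each input tuple may produce zero, one, or many output tuples. */
    static <T, R> List<R> eachFunction(List<T> stream, Function<T, List<R>> function) {
        List<R> out = new ArrayList<>();
        for (T tuple : stream) {
            out.addAll(function.apply(tuple));
        }
        return out;
    }
}
```

A filter can shrink the stream but every surviving tuple is unchanged, while a function may change both the shape of a tuple and the number of tuples.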

http://docs.cascading.org/cascading/2.5/userguide/html/ch03s03.html#N20438 The Each and Every Pipes

Buffer vs. Aggregator
•In common
–Both operate only on the output of GroupBy or CoGroup
–The default Output Selector for Aggregator and Buffer is Fields.ALL
•Differences
–Aggregators can be chained; Buffers cannot
–A Buffer can emit multiple result tuples for a single group
–A Buffer can implement anything an Aggregator does, so an Aggregator can be seen as a specially optimized form of Buffer
// group by "mdn", with a secondary sort on "log_time" inside each group
pipe = new GroupBy(pipe, new Fields("mdn"), new Fields("log_time"));
// chained Aggregators: each Every adds one result field per group
pipe = new Every(pipe, new Count(new Fields("count")));
pipe = new Every(pipe, new Fields("mdn"), new DistinctCount(new Fields("unique_mdn_cnt")));
pipe = new Every(pipe, new Fields("pay_amt"), new Sum(new Fields("sum"), long.class));
pipe = new Every(pipe, new Fields("log_time"), new Last(new Fields("last_time")));
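The contrast between chained Aggregators and a single Buffer can be modelled in plain Java. This is only a sketch of the semantics, not the Cascading API: `EveryModel` and its methods are hypothetical, groups are assumed to be already built, and values are plain `Long`s.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.ToLongFunction;

/** Hypothetical plain-Java model of Every; not the Cascading API. */
final class EveryModel {

    /** Aggregator-like: aggregators are chained and each group yields exactly one row. */
    static Map<String, List<Long>> withAggregators(
            Map<String, List<Long>> groups,
            List<ToLongFunction<List<Long>>> aggregators) {
        Map<String, List<Long>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Long>> group : groups.entrySet()) {
            List<Long> row = new ArrayList<>();
            for (ToLongFunction<List<Long>> aggregator : aggregators) {
                row.add(aggregator.applyAsLong(group.getValue()));
            }
            out.put(group.getKey(), row);
        }
        return out;
    }

    /** Buffer-like: one operation sees the whole group and may emit many rows. */
    static Map<String, List<Long>> withBuffer(
            Map<String, List<Long>> groups,
            Function<List<Long>, List<Long>> buffer) {
        Map<String, List<Long>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Long>> group : groups.entrySet()) {
            out.put(group.getKey(), buffer.apply(group.getValue()));
        }
        return out;
    }

    /** Example buffer: a running total, one output value per input value. */
    static List<Long> runningTotal(List<Long> values) {
        List<Long> out = new ArrayList<>();
        long total = 0;
        for (long value : values) {
            total += value;
            out.add(total);
        }
        return out;
    }
}
```

A running total is a typical case that needs buffer-style access to the whole group, because it emits one row per input rather than one row per group.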

Pipe Types
•Merge
–Unsorted merge
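–Merges two or more Pipes with the same fields and types into a single stream
–No grouping, so faster than GroupBy (an Aggregator or Buffer cannot follow)
•GroupBy
–Sorted merge on the key fields
–Can merge two or more Pipes only if they have the same fields and types
–Order within a group is arbitrary (secondary sorting is possible at some cost in speed)
–Builds groups in preparation for Every
–Slower than Merge, because it sorts on the grouping fields in order to group
•Grouping fields are sorted in natural order
–Secondary sorting is supported
•Without a secondary sort, order within a group is arbitrary but execution is faster
// secondary sort on "value1" (descending) and "value2" within each group
Fields sortFields = new Fields("value1", "value2");
sortFields.setComparator("value1", Collections.reverseOrder());
Pipe groupBy = new GroupBy(assembly, groupFields, sortFields);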
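What the grouping and secondary sort achieve can be sketched in plain Java: keys come out in natural order, and values inside each group are ordered only if a comparator is supplied. This is a model of the behaviour, not the Cascading API; `GroupByModel` is a hypothetical name and rows are reduced to (key, value) pairs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical plain-Java model of GroupBy with an optional secondary sort. */
final class GroupByModel {

    static SortedMap<String, List<Integer>> groupBy(
            List<Map.Entry<String, Integer>> rows,
            Comparator<Integer> secondarySort) {
        // TreeMap keeps the grouping keys in natural order.
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> row : rows) {
            groups.computeIfAbsent(row.getKey(), k -> new ArrayList<>()).add(row.getValue());
        }
        // Without a secondary sort, values stay in whatever order they arrived.
        if (secondarySort != null) {
            for (List<Integer> values : groups.values()) {
                values.sort(secondarySort);
            }
        }
        return groups;
    }
}
```

Passing `Collections.reverseOrder()` as the comparator mirrors the descending sort on "value1" in the snippet above.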

Pipe Types
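•Joins two or more streams that have different fields, on the values of common fields
•CoGroup
–Similar to SQL joins (InnerJoin, OuterJoin, LeftJoin, RightJoin, MixedJoin)
–For outer joins, fields with no matching value are filled with null
–The result contains every field of every stream, so field names must not be duplicated across the streams
•If names are duplicated, rename them with the declaredFields argument
–Fields are matched by position; field names are only a convenience for the developer
–For a fast join, tries to hold all tuples for each unique key (a bag) of the right-hand stream in memory
•Past a configurable threshold, it spills from memory to disk as it goes (slower)
•Too large a threshold can cause out-of-memory errors
•For best performance, put the stream with the largest groups leftmost and tune the threshold
•HashJoin
–Optimized for joining one large stream with small streams (map-side join)
•Loads the entire right-hand stream into memory for fast comparison (there are no groups, so all of it is held in memory)
–No grouping is needed, so it joins in arbitrary order and is faster than CoGroup
–No groups exist, so an Aggregator or Buffer cannot follow
–Placing a Checkpoint just before the HashJoin, so that the reduced stream is written to disk, is a useful technique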
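The HashJoin behaviour described above (small right-hand side held entirely in memory, large left-hand side streamed past it) is the classic build-and-probe hash join. Here is a plain-Java sketch of a left join done that way. It is not the Cascading implementation; `HashJoinModel` is a hypothetical name and rows are assumed to be two-element `{key, value}` arrays.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical plain-Java model of a map-side (hash) left join. */
final class HashJoinModel {

    /** Rows are {key, value}; output rows are {key, leftValue, rightValue or null}. */
    static List<String[]> leftJoin(List<String[]> left, List<String[]> right) {
        // Build phase: the whole right-hand side is held in memory, keyed by join key.
        Map<String, List<String>> rightByKey = new HashMap<>();
        for (String[] row : right) {
            rightByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Probe phase: stream the left-hand side once, in whatever order it arrives.
        List<String[]> joined = new ArrayList<>();
        for (String[] row : left) {
            List<String> matches = rightByKey.get(row[0]);
            if (matches == null) {
                // Left join: keep the row and fill the missing side with null.
                joined.add(new String[] {row[0], row[1], null});
            } else {
                for (String match : matches) {
                    joined.add(new String[] {row[0], row[1], match});
                }
            }
        }
        return joined;
    }
}
```

No sorting or grouping happens anywhere, which is why this style of join is fast and also why nothing that needs groups can follow it.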

http://docs.cascading.org/cascading/2.5/userguide/html/ch03s03.html#N20630 CoGroup

String inPath = args[ 0 ];
String outPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// source and sink taps: tab-delimited text with a header line
Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );
Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

// a single pipe connects the two taps
Pipe copyPipe = new Pipe( "copy" );

// connect the taps and the pipe into a flow
FlowDef flowDef = FlowDef.flowDef()
  .addSource( copyPipe, inTap )
  .addTailSink( copyPipe, outTap );

// plan and run the flow
flowConnector.connect( flowDef ).complete();
https://github.com/Cascading/Impatient/blob/master/part1/src/main/java/impatient/Main.java
p.29 1.2 The Simplest Cascading Application

…
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// split the "text" field into a stream of "token" tuples
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// Fields.RESULTS: only "token" is passed on
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// count the occurrences of each token
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
…
p.37 1.5 The Ubiquitous Word Count
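For comparison, the same split, group, and count steps can be written against in-memory lines in plain Java. This is a sketch for intuition, not a Cascading program: `WordCountModel` is a hypothetical name, the delimiter pattern is assumed to be the one from the Cascading for the Impatient sample, and unlike the flow above this sketch drops empty tokens right away.

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Hypothetical in-memory word count mirroring the split -> group -> count steps. */
final class WordCountModel {

    // Assumed delimiter set: space, square brackets, parentheses, comma, period.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    static SortedMap<String, Long> count(List<String> lines) {
        SortedMap<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : DELIMITERS.split(line)) {
                // Adjacent delimiters produce empty tokens; this sketch skips them.
                if (!token.isEmpty()) {
                    counts.merge(token, 1L, Long::sum);
                }
            }
        }
        return counts;
    }
}
```

Note that "A" and "a" are counted separately here, which is the motivation for the lower-casing scrub step in the next example of the deck.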

…
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// keep "doc_id" alongside each "token"
Fields fieldSelector = new Fields( "doc_id", "token" );
Pipe docPipe = new Each( "token", text, splitter, fieldSelector );

// scrub the tokens with a custom Function
Fields scrubArguments = new Fields( "doc_id", "token" );
docPipe = new Each( docPipe, scrubArguments, new ScrubFunction( scrubArguments ), Fields.RESULTS );

// count the occurrences of each token
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new Retain( wcPipe, token );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write the planned flow as a DOT graph, then run it
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();

doc_id  text
doc01   A rain shadow is a dry area on the lee back …
doc02   This sinking, dry air produces a rain shadow, …
doc03   A rain shadow is an area of dry land that lies …
…
p.55 2.2 Scrubbing Tokens

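import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

public class ScrubFunction extends BaseOperation implements Function
  {
  public ScrubFunction( Fields fieldDeclaration )
    {
    // expects 2 arguments; fails at plan time otherwise
    super( 2, fieldDeclaration );
    }

  public void operate( FlowProcess flowProcess, FunctionCall functionCall )
    {
    TupleEntry argument = functionCall.getArguments();
    String doc_id = argument.getString( 0 );
    String token = scrubText( argument.getString( 1 ) );

    // emit nothing for tokens that are empty after scrubbing
    if( token.length() > 0 )
      {
      Tuple result = new Tuple();
      result.add( doc_id );
      result.add( token );
      functionCall.getOutputCollector().add( result );
      }
    }

  public String scrubText( String text )
    {
    return text.trim().toLowerCase();
    }
  }
https://github.com/Cascading/Impatient/blob/master/part3/src/main/java/impatient/ScrubFunction.java
p.49 2.1 Custom Operations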

…
String stopPath = args[ 2 ];
…
Fields stop = new Fields( "stop" );
Tap stopTap = new Hfs( new TextDelimited( stop, true, "\t" ), stopPath );
…
Fields scrubArguments = new Fields( "doc_id", "token" );
docPipe = new Each( docPipe, scrubArguments, new ScrubFunction( scrubArguments ), Fields.RESULTS );

// left join against the stop word list, then keep only tokens with no matching stop word
Pipe stopPipe = new Pipe( "stop" );
Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );
tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );

// count the occurrences of each remaining token
Pipe wcPipe = new Pipe( "wc", tokenPipe );
wcPipe = new Retain( wcPipe, token );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

FlowDef flowDef = FlowDef.flowDef()
  .setName( "wc" )
  .addSource( docPipe, docTap )
  .addSource( stopPipe, stopTap )
  .addTailSink( wcPipe, wcTap );
…

stop
a
about
after
all
along
an
and
Any
…
p.57 2.3 Replicated Joins

…
String tfidfPath = args[ 3 ];
…
Fields fieldSelector = new Fields( "doc_id", "token" );
tokenPipe = new Retain( tokenPipe, fieldSelector );
…
// TF: count of each token within each document
Pipe tfPipe = new Pipe( "TF", tokenPipe );
Fields tf_count = new Fields( "tf_count" );
tfPipe = new CountBy( tfPipe, new Fields( "doc_id", "token" ), tf_count );
Fields tf_token = new Fields( "tf_token" );
tfPipe = new Rename( tfPipe, token, tf_token );

// D: total number of documents
Fields doc_id = new Fields( "doc_id" );
Fields tally = new Fields( "tally" );
Fields rhs_join = new Fields( "rhs_join" );
Fields n_docs = new Fields( "n_docs" );
Pipe dPipe = new Unique( "D", tokenPipe, doc_id );
dPipe = new Each( dPipe, new Insert( tally, 1 ), Fields.ALL );
dPipe = new Each( dPipe, new Insert( rhs_join, 1 ), Fields.ALL );
dPipe = new SumBy( dPipe, rhs_join, tally, n_docs, long.class );

// DF: number of documents in which each token appears
Pipe dfPipe = new Unique( "DF", tokenPipe, Fields.ALL );
Fields df_count = new Fields( "df_count" );
dfPipe = new CountBy( dfPipe, token, df_count );
Fields df_token = new Fields( "df_token" );
Fields lhs_join = new Fields( "lhs_join" );
dfPipe = new Rename( dfPipe, token, df_token );
dfPipe = new Each( dfPipe, new Insert( lhs_join, 1 ), Fields.ALL );

// attach n_docs to every DF row via a constant join key
Pipe idfPipe = new HashJoin( dfPipe, lhs_join, dPipe, rhs_join );
p.71 3.1 Implementing TF-IDF
tfPipe: ("doc_id", "token", "tf_count")
tfPipe: ("doc_id", "tf_token", "tf_count")
dPipe: ("doc_id")
dPipe: ("doc_id", "tally")
dPipe: ("doc_id", "tally", "rhs_join")
dPipe: ("rhs_join", "n_docs")
dfPipe: ("doc_id", "token")
dfPipe: ("token", "df_count")
dfPipe: ("df_token", "df_count")
dfPipe: ("df_token", "df_count", "lhs_join")
idfPipe: ("df_token", "df_count", "lhs_join", "rhs_join", "n_docs")

// join TF with IDF on the token
Pipe tfidfPipe = new CoGroup( tfPipe, tf_token, idfPipe, df_token );

// compute the TF-IDF weight with a Java expression
Fields tfidf = new Fields( "tfidf" );
String expression = "(double) tf_count * Math.log( (double) n_docs / ( 1.0 + df_count ) )";
ExpressionFunction tfidfExpression = new ExpressionFunction( tfidf, expression, Double.class );
Fields tfidfArguments = new Fields( "tf_count", "df_count", "n_docs" );
tfidfPipe = new Each( tfidfPipe, tfidfArguments, tfidfExpression, Fields.ALL );

fieldSelector = new Fields( "tf_token", "doc_id", "tfidf" );
tfidfPipe = new Retain( tfidfPipe, fieldSelector );
tfidfPipe = new Rename( tfidfPipe, tf_token, token );

// word counts across all documents, sorted by count
Pipe wcPipe = new Pipe( "wc", tfPipe );
Fields count = new Fields( "count" );
wcPipe = new SumBy( wcPipe, tf_token, tf_count, count, long.class );
wcPipe = new Rename( wcPipe, tf_token, token );
wcPipe = new GroupBy( wcPipe, count, count );

FlowDef flowDef = FlowDef.flowDef()
  .setName( "tfidf" )
  .addSource( docPipe, docTap )
  .addSource( stopPipe, stopTap )
  .addTailSink( tfidfPipe, tfidfTap )
  .addTailSink( wcPipe, wcTap );
…
p.71 3.1 Implementing TF-IDF
tfidfPipe: ("doc_id", "tf_token", "tf_count", "df_token", "df_count", "lhs_join", "rhs_join", "n_docs")
tfidfPipe: ("doc_id", "tf_token", "tf_count", "df_token", "df_count", "lhs_join", "rhs_join", "n_docs", "tfidf")
tfidfPipe: ("tf_token", "doc_id", "tfidf")
tfidfPipe: ("token", "doc_id", "tfidf")
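The expression string passed to ExpressionFunction in this flow is ordinary Java arithmetic, so it can be checked in isolation. A minimal sketch, with the three fields turned into method parameters (`TfIdf` and `tfidf` are hypothetical names):

```java
/** The TF-IDF weight from the flow's expression string, as a plain Java method. */
final class TfIdf {

    /**
     * @param tfCount occurrences of the token in one document
     * @param dfCount number of documents that contain the token
     * @param nDocs   total number of documents
     */
    static double tfidf(long tfCount, long dfCount, long nDocs) {
        return (double) tfCount * Math.log((double) nDocs / (1.0 + dfCount));
    }
}
```

With this form of the formula, a token that appears in all but one of the documents gets a weight of exactly zero, and a token that appears in every document gets a negative weight, because the `1.0 + df_count` term makes the ratio drop to one or below.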

Programming Tips
•Local Mode
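–Develop, test, and explore data with local files before moving to Hadoop
–Does not use the Hadoop API; runs entirely in memory (limited by available memory)
–Local testing is possible, but there are subtle API differences between local mode and Hadoop
•Use cascading-local-2.0.x.jar instead of cascading-hadoop-2.0.x.jar
•Use FileTap and LocalFlowConnector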
•Test p.80
–CascadingTestCase
–Debug http://docs.cascading.org/cascading/2.5/userguide/html/ch09s02.html
–Assert http://docs.cascading.org/cascading/2.5/userguide/html/ch08s02.html
–Trap http://docs.cascading.org/cascading/2.5/userguide/html/ch08s03.html
–Sample http://docs.cascading.org/cascading/2.5/userguide/html/ch09s03.html
–Checkpoint

Programming Tips
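•Build SubAssemblies and Flows as small, meaningful units, and connect them with a Cascade
•Connecting Flows
–Heads, tails, and assemblies are connected by name, so naming them explicitly matters
–Because the flow is a DAG, connectivity is checked backwards from the final sinks
–A Pipe inherits the name of the previous Pipe, so giving distinct explicit names prevents runtime errors
•Use only underscores, lowercase letters, and digits in field names
–Hangul or a hyphen (-) causes errors in Expression functions, which use the Janino compiler
–"first-name" is acceptable as a field name, but inside an Expression it is read like first-name.trim(), that is, as a subtraction, and Janino fails at runtime
–Implementing a Function instead of using an Expression function avoids the Janino problem and is easier to reuse
•For GroupBy sort fields, get the class types right first
–Data written to HDFS and read back always comes back as String
•To make operations reusable, minimize the use of global variables and properties, and pass arguments through the operation's constructor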
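Two of the tips above are easy to check in plain Java: whether a field name is safe to use inside an expression, and why sort fields that have come back as String need their type fixed before sorting. The sketch below is illustrative only; `FieldTips` and its methods are hypothetical, and the rule that a name must not start with a digit is an added assumption, based on the fact that Java identifiers cannot start with one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helpers illustrating the field-name and sort-type tips. */
final class FieldTips {

    // Lowercase letters, digits, and underscore only; must not start with a digit (assumption).
    private static final Pattern SAFE_NAME = Pattern.compile("[a-z_][a-z0-9_]*");

    /** True if the name can appear as an identifier inside a Java expression. */
    static boolean isExpressionSafe(String fieldName) {
        return SAFE_NAME.matcher(fieldName).matches();
    }

    /** Values read back as text must be parsed before they sort numerically. */
    static List<Long> sortNumerically(List<String> values) {
        List<Long> parsed = new ArrayList<>();
        for (String value : values) {
            parsed.add(Long.parseLong(value.trim()));
        }
        Collections.sort(parsed);
        return parsed;
    }
}
```

Sorting the strings "10", "9", "2" directly gives "10", "2", "9", because strings compare character by character; parsing first gives 2, 9, 10.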

Programming Tips
•Setting the number of reducers
–Number of intermediate reducers
Properties properties = new Properties();
properties.put("mapred.reduce.tasks", "10");
properties.put("mapred.map.tasks.speculative.execution", "true");
properties.put("mapred.reduce.tasks.speculative.execution", "false");
properties.put("mapred.job.priority", "HIGH");
AppProps.setApplicationJarClass(properties, Main.class);
FlowConnector flowConnector = new HadoopFlowConnector(properties);
–Number of final reducers
TextDelimited scheme = new TextDelimited(new Fields("key", "value"), true, "\t");
scheme.setNumSinkParts(1);
Tap sinkTap = new Hfs(scheme, outputPath, SinkMode.REPLACE);

•http://docs.cascading.org/cascading/2.5/userguide/html/ch09.html 9. Built-In Operations
–Identity Function
–Text Functions
–Regular Expression Operations
–Java Expression Operations
–Buffers
•http://docs.cascading.org/cascading/2.5/userguide/html/ch10.html 10. Built-in Assemblies
–AggregateBy (AverageBy, CountBy, SumBy, FirstBy)
–Rename
–Retain
–Unique
•http://docs.cascading.org/cascading/2.5/userguide/html/ch13.html 13. Cookbook

Questions?
Questions.foreach( answer(_) )

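import java.util.HashSet;

import cascading.flow.FlowProcess;
import cascading.operation.Aggregator;
import cascading.operation.AggregatorCall;
import cascading.operation.BaseOperation;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

public class DistinctCount extends BaseOperation<HashSet<String>>
    implements Aggregator<HashSet<String>> {

  public DistinctCount(Fields fieldDeclaration) {
    super(fieldDeclaration);
  }

  @Override
  public void start(FlowProcess flowProcess, AggregatorCall<HashSet<String>> aggregatorCall) {
    // reuse the set between groups; create it on first use
    if (aggregatorCall.getContext() == null) {
      aggregatorCall.setContext(new HashSet<String>());
    } else {
      aggregatorCall.getContext().clear();
    }
  }

  @Override
  public void aggregate(FlowProcess flowProcess, AggregatorCall<HashSet<String>> aggregatorCall) {
    // record the string form of each argument tuple; duplicates collapse in the set
    TupleEntry argument = aggregatorCall.getArguments();
    HashSet<String> context = aggregatorCall.getContext();
    context.add(argument.getTuple().toString());
  }

  @Override
  public void complete(FlowProcess flowProcess, AggregatorCall<HashSet<String>> aggregatorCall) {
    // emit the number of distinct values seen in this group
    aggregatorCall.getOutputCollector().add(new Tuple(aggregatorCall.getContext().size()));
  }
}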
