About HiFX
Established in the year 2001, HiFX is an Amazon Web Services
Consulting Partner.
We have been designing and migrating workloads on the AWS cloud
since 2010, and helping organizations become truly data driven
by building big data solutions since 2015.
Case Study with Malayala Manorama
Malayala Manorama is one of the largest media conglomerates in India. They
run manoramaonline.com, the largest news portal for Malayalees around the
world, and several other digital media properties.
In 2016, Manorama embarked on a project to develop an in-house
analytics pipeline that could unify enormous amounts of raw data from
multiple web domains and convert it into meaningful insights. The
company currently has 10 domains such as its matrimonial and real
estate sites, with plans to further expand its digital footprint.
HiFX has been Malayala Manorama’s technology partner for more than
18 years and was approached to design this new data analytics pipeline.
Manorama Online
Manorama News
The Week
Vanitha
Watchtime India
E-paper/E-magazine
Chuttuvattom
OnManorama
M4Marry
HelloAddress
Quickerala
Qkdoc
Entedeal
Manorama Horizon
Android
iOS
Manorama MAX
The Challenges
Lack of agility and accessibility in data analysis, which the product team needs in order to
make smart business decisions and improve strategies
Increasing volume and velocity of data. With new digital properties being added, the
collection and storage layers had to be designed to scale well
Dozens of independently managed collections of data, leading to data silos. With no
single source of truth, it was difficult to identify what type of data was available,
get access to it and integrate it
Poorly recorded data. Often, the meaning and granularity of the data was lost in
processing
About Lens
Vision
Lens is a unified data platform with a
consolidated solution stack to
generate meaningful real-time
insights and drive revenue
Better product
decisions based
on behavioral
insights
Add value
to our
businesses
Increase CLV
Deeply
understand every
user's journey
Immediate actions,
smart targeting and
marketing
automation
Positively
impact KPIs
Components
01 UNIFIED DATA PIPELINE
Connecting dozens of data streams and repositories to a unified data pipeline, enabling
near real-time access to any data source
02 WELL GOVERNED DATA LAKE
A well-governed data lake architected to store raw and enriched data, thereby
eliminating storage silos
03 DATA PROCESSING FRAMEWORK
Data processing framework that supports stream and batch workloads to aid analytics and
machine learning, along with smart workflow management
04 BIG DATA STORES FOR OLAP
Well-designed big data stores for reporting and exploratory analysis
05 RECOMMENDATIONS ENGINE
Recommendations and personalization engine powered by machine learning
06 SMART DASHBOARDS
Dynamic dashboards and smart visualizations that make data tell stories and drive insights
Solution Stack
01 STREAMING ANALYTICS
Watch attention shift in real time. Updates every few seconds so you can quickly
capitalize on attention to every post, campaign and section
02 BATCH ANALYTICS
Historical view of unique attention metrics to understand what happened in the past and
use it to plan for the future
03 FB IA AND GOOGLE AMP INTEGRATIONS
Integrations with Google Accelerated Mobile Pages and Facebook Instant Articles
04 VIDEO ANALYTICS
Track key metrics: visits, plays, dropouts and minutes watched
05 CONTENT PERSONALIZATION
Recommendations and personalization engine powered by machine learning
06 ADVANCED REPORTING
Dynamic dashboards and smart visualizations that make data tell stories and drive insights
07 RAW DATA ACCESS
Clean structured data that the team can analyze directly
Key Infrastructure Components
CloudFront
ECS
Kinesis Stream
S3 Bucket
EMR Spark
Sagemaker
Aurora
Redshift
Elasticsearch Service
DynamoDB
Databricks
AWS ALB
Apache Airflow
Architecture
Trackers
Android SDK, iOS SDK, JS SDK, PHP SDK, Java SDK
Data / Event Trackers
Trackers allow us to collect data from any
type of digital application, service or
device. All trackers adhere to the LENS
Tracker Protocol.
Collectors-Scribe
Data Collectors
• Written in Go/Java
• Designed for low latency
• Engineered for high concurrency
• Horizontally scalable
Scribe collects data from the trackers
and writes it to the Kinesis data
firehose.
This allows near real-time processing of
the data as well as storage in the data lake
for further batch analysis.
ECS Fargate is used for
containerization.
Scribe API endpoints
• Event tracker
• Pixel tracker
• Click tracker
• AMP tracker
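As an illustration of what an endpoint such as the pixel tracker might do, the sketch below parses a tracking-pixel query string into an event record. The parameter names (`e`, `uid`, `url`), the collector host and the `parse_pixel_request` helper are all hypothetical; the LENS Tracker Protocol itself is not described in these slides.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical required parameters; the real LENS Tracker Protocol is not shown here.
REQUIRED = ("e", "uid", "url")

def parse_pixel_request(request_url: str) -> dict:
    """Turn a pixel-tracker GET URL into a flat event dict, or raise ValueError."""
    query = parse_qs(urlsplit(request_url).query)
    missing = [name for name in REQUIRED if name not in query]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    # parse_qs returns a list per key; keep the first value of each.
    return {name: values[0] for name, values in query.items()}

event = parse_pixel_request(
    "https://collector.example.com/pixel?e=pageview&uid=u-42&url=%2Fnews%2F1"
)
```

Rejecting malformed requests at the collector keeps obviously broken events out of the stream, at the cost of a small amount of work on the low-latency path.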
Accumulo / Data Lake
ACCUMULO
The data consumer component,
responsible for:
• Reading data from the event firehose (Kinesis Streams)
• Performing rudimentary data quality checks
• Converting data to Avro format with Snappy compression
• Loading the data into the Data Lake
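A rudimentary data quality check of the kind listed for the consumer could look like the sketch below. The field names (`event_type`, `user_id`, `timestamp`) and the JSON input format are assumptions made for illustration; the slides do not specify the record schema, and the Avro conversion step is left out.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("event_type", "user_id", "timestamp")  # hypothetical schema

def check_record(raw: str):
    """Return (event, None) if the record passes, else (None, reason)."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None, "not valid JSON"
    if not isinstance(event, dict):
        return None, "not a JSON object"
    for field in REQUIRED_FIELDS:
        if not event.get(field):
            return None, f"missing {field}"
    try:
        # Assumes ISO 8601 timestamps without a trailing "Z".
        datetime.fromisoformat(event["timestamp"])
    except (TypeError, ValueError):
        return None, "bad timestamp"
    return event, None
```

Returning a reason alongside each rejected record makes it possible to report instrumentation problems rather than silently dropping data.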
DATA LAKE
The Data Lake supports the following
capabilities:
• Capture and store raw data securely at scale and at low cost
• Store many types of data in the same repository
• Define the structure of the data at the time it is used
It is designed to:
• Retain all data
• Support all data types
• Adapt easily to changes
Prism - Processing Engine
Prism uses Apache Spark as its processing engine and is written in Scala.
It can run on EMR 5.27 and as a Databricks job running
on AWS spot/on-demand instances.
Unified Processing Engine
Prism
Analytics Engine
Data Cleanser
Performs data cleansing
including:
• Normalization
• De-duplication
• Bot-exclusion
• Fixes for client clock issues.
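The four cleansing steps above can be sketched in plain Python as follows. This is an illustration only: Prism itself is a Scala/Spark job, the field names are hypothetical, the bot markers are a toy list, and the clock fix assumes network latency is negligible (skew is estimated as the client's send time minus the server's receive time).

```python
BOT_MARKERS = ("bot", "crawler", "spider")  # illustrative markers, not a complete list

def cleanse(events):
    """Normalize, de-duplicate, drop bots and correct client clock skew.

    Each event is a dict with hypothetical fields: event_id, event_type,
    user_agent, client_event_ts, client_sent_ts, server_received_ts
    (timestamps in epoch seconds).
    """
    seen_ids = set()
    cleaned = []
    for ev in events:
        if ev["event_id"] in seen_ids:  # de-duplication
            continue
        seen_ids.add(ev["event_id"])
        if any(m in ev["user_agent"].lower() for m in BOT_MARKERS):  # bot exclusion
            continue
        # Clock fix: how far the client clock runs ahead of the server clock.
        skew = ev["client_sent_ts"] - ev["server_received_ts"]
        cleaned.append({
            **ev,
            "event_type": ev["event_type"].strip().lower(),  # normalization
            "event_ts": ev["client_event_ts"] - skew,
        })
    return cleaned
```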
Data Enricher
Performs enrichment activities
including:
• User Agent Parsing to
understand OS / Platform
• Referrer Parsing to understand
channels
• IP to location transformation
• Lat+Long to location
transformation
• Widen event data with user
profile information
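As one example of enrichment, referrer parsing can map a referrer URL to a traffic channel. The sketch below is a simplified assumption of how that might work: the channel names and the short domain lists are illustrative, and a production enricher would use a maintained referrer database instead.

```python
from urllib.parse import urlsplit

# Illustrative domain lists only; far from complete.
SEARCH_DOMAINS = ("google.com", "bing.com", "duckduckgo.com")
SOCIAL_DOMAINS = ("facebook.com", "twitter.com", "t.co", "instagram.com")

def _matches(host, domains):
    """True if host equals one of the domains or is a subdomain of it."""
    return any(host == d or host.endswith("." + d) for d in domains)

def channel(referrer, own_domain):
    """Classify a referrer URL as direct, internal, search, social or referral."""
    if not referrer:
        return "direct"
    host = (urlsplit(referrer).hostname or "").lower()
    if _matches(host, (own_domain,)):
        return "internal"
    if _matches(host, SEARCH_DOMAINS):
        return "search"
    if _matches(host, SOCIAL_DOMAINS):
        return "social"
    return "referral"
```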
Data Quality Checks
Performs the data quality checks
needed to detect, report and omit
instrumentation errors
Data Reconciler
Reconciles data that must be exact,
such as transactions, against the
feeds generated by the master
database
Sessionization/User Merging
Sessionize and merge the users
based on domain/anonymous id
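A common way to sessionize is to start a new session whenever a user has been inactive for longer than a timeout. The sketch below shows that approach in plain Python; the 30-minute gap is an assumption (the slides do not state the rule Prism uses), and user merging across domain and anonymous ids is not covered.

```python
SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not taken from the slides

def sessionize(events):
    """Assign a per-user session number to each (user_id, epoch_seconds) event.

    Returns a list of (user_id, timestamp, session_number), ordered by user
    and then by time.
    """
    last_seen = {}
    session_no = {}
    result = []
    for user, ts in sorted(events):
        # A first event, or a gap longer than the timeout, opens a new session.
        if user not in last_seen or ts - last_seen[user] > SESSION_GAP_SECONDS:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        result.append((user, ts, session_no[user]))
    return result
```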
Data Refresher
Loads the data to respective tables
in the data warehouse and other
reporting data stores
Prism - Real-time Analytics
• Uses Structured Streaming to stream live events
into Elasticsearch.
• The stack can run on both EMR and Databricks.
• Runs on 50 -4.x large instances, scaled
to 100 instances during elections.
• Configuration:
spark.executor.cores=4
spark.executor.memory=25g
spark.executor.instances=50
Spark Streaming
Prism - Batch Analytics
Spark on EMR/Databricks
• Scheduled job that kicks off every
day to process all the events for the
day and write the cleansed
raw and aggregated data to Redshift
(the primary data store).
• It also writes the data in Parquet
format so that Presto or Databricks
Delta Lake can be run on top if needed.
• Runs on 20 r4.2xlarge instances
• Configuration:
spark.executor.cores=3
spark.executor.memory=20g
spark.executor.instances=39
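One plausible reading of the batch configuration above is two JVMs per node: 39 executors plus one driver fill 20 nodes exactly. The arithmetic below checks this, using the published r4.2xlarge specification (8 vCPUs, 61 GiB). That the driver occupies a slot on the cluster is an assumption, not something the slides state.

```python
NODES = 20
VCPUS_PER_NODE = 8       # r4.2xlarge
MEM_GIB_PER_NODE = 61    # r4.2xlarge

executor_cores = 3
executor_memory_gib = 20
executor_instances = 39

total_task_slots = executor_cores * executor_instances   # concurrent tasks
jvms = executor_instances + 1                             # executors + assumed driver
jvms_per_node = jvms // NODES
cores_used_per_node = jvms_per_node * executor_cores
heap_per_node_gib = jvms_per_node * executor_memory_gib

print(total_task_slots, jvms_per_node, cores_used_per_node, heap_per_node_gib)
```

Under these assumptions each node uses 6 of its 8 vCPUs and 40 of its 61 GiB for executor heap, which leaves room for memory overhead, the operating system and cluster daemons.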
Data Stores
DATA WAREHOUSE
AMAZON REDSHIFT
Primary Data Store
• Supports batch workloads.
• Supports up to 50
concurrent queries
• pgpool deployed as a cache layer
• WLM and concurrency
scaling enabled
• Elastic Resize
• Redshift spectrum to query
archived data in S3
01
REALTIME REPORTING STORE
Elasticsearch
Content Analytics Real Time
Dashboard.
• Fluidic Dashboard with
granular filters
• Data Visualization using
Kibana
02
RECOMMENDATION RESULTS
DYNAMODB
Features such as
horizontal scalability, low
operational overhead and
predictable performance
make DynamoDB a good
choice for storing
recommendation results
03
Orchestration
Used to programmatically author, schedule
and monitor workflows.
Workflow Management
Rich UI that makes it easy to visualize
pipelines running in production, monitor
progress, and troubleshoot issues when
needed.
Rich UI
Apache Airflow
Data Retention Strategy
• Find a balance between what’s optimal for your clients’ business needs and operational cost effectiveness.
• Ensure the data retention policies align with regulatory restrictions (GDPR).
• Define proper lifecycle policies at the different stages.
• An S3-IA/Glacier lifecycle policy is defined for the data at rest in the data lake, and a scheduled purging policy is defined
for the primary data store (Redshift).
• We keep a quarter's worth of data in the primary data store (Redshift); older data is archived to S3.
• Redshift Spectrum is used for detailed analysis of older data.
• For YoY and QoQ comparisons, we pre-calculate the figures as part of the quarterly process and store the aggregated results
in the data store.
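An S3-IA/Glacier lifecycle policy of this kind can be expressed as a rule set like the one built below. The dict follows the shape S3 accepts for a bucket lifecycle configuration (for example via boto3's `put_bucket_lifecycle_configuration`), but the prefix and the 30/90-day thresholds are assumptions chosen for illustration, not the values used in Lens.

```python
def lifecycle_rules(prefix, ia_after_days, glacier_after_days):
    """Build an S3 lifecycle configuration that tiers objects to IA, then Glacier."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }

# Hypothetical prefix and thresholds.
config = lifecycle_rules("raw-events/", ia_after_days=30, glacier_after_days=90)
```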
Dashboard - KPIs/Different Angles
Domain Specific KPIs
Key metrics in the Content Dashboard:
• Page Views
• New and Returning Visitors
• Engaged Time
• Social Shares and Referrals
• Bounce Rate
• Video Play Rate
Different Angles
Explore the content data from these angles:
• Titles
• Authors
• Sections
• Tags
• Referrers
• Campaigns
• Google AMP
• Facebook IA
Scalability / Performance
The collection, storage and processing layers are designed to autoscale.
Batch analytics takes an average of 30-40 minutes to process and refresh data for the entire day
across all reporting dashboards
Turnaround latency at the data collector: 27 ms at the 75th percentile and 156 ms at the
95th percentile
Currently handles about 150 GB of data per day with an average of 300 million events processed
per day
Horizontally Scalable Data Collectors, Data Consumers, Data Processors and Data Reporting
Stores
The real time streaming stack currently processes 500K events in less than 10 seconds.
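To put the volume figures above in perspective, the short calculation below derives average rates from them. It assumes 1 GB = 10^9 bytes and an even spread of events over the day, which real traffic will not follow; peaks will be well above the average.

```python
EVENTS_PER_DAY = 300_000_000
BYTES_PER_DAY = 150 * 10**9      # 150 GB, taking 1 GB = 10^9 bytes
SECONDS_PER_DAY = 24 * 60 * 60

avg_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY   # about 3,472
avg_bytes_per_event = BYTES_PER_DAY / EVENTS_PER_DAY       # 500 bytes

# "500K events in less than 10 seconds" implies at least this streaming rate.
streaming_events_per_second = 500_000 / 10                  # 50,000
headroom = streaming_events_per_second / avg_events_per_second

print(round(avg_events_per_second), avg_bytes_per_event, round(headroom, 1))
```

On these numbers the streaming stack's stated throughput is roughly 14 times the average ingest rate.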
Best Practices in Spark
• Use Datasets, DataFrames and Spark SQL instead of RDDs to get the benefits of the Catalyst optimizer.
• Choose the best data format and compression.
• Apache Parquet gives the fastest read performance with Spark, thanks to its vectorized Parquet reader. Run
Presto or Databricks Delta Lake on top if needed.
• Avro offers rich schema support and more efficient writes than Parquet.
• Choose either Snappy or LZO compression, as they strike a balance between splittability and block compression.
• Use the Spark Web UI to explore your jobs, tasks, storage and SQL query plans to optimize Spark execution.
• Look at the Spark event timeline to see how much time each stage and task takes.
• Check the shuffles between stages and the amount of data shuffled (tune the spark.sql.shuffle.partitions option
if needed).
• Check the join algorithms being used.
• Broadcast join should be used when one table is small.
• Sort-merge join should be used for large tables. You can use bucketing to pre-sort and group tables; this
avoids the shuffle in the sort-merge join.
• Enable Dynamic Partition Pruning, flattenScalarSubqueriesWithAggregates, Bloom Filter Join and Optimized Join
Reorder.
• Use the s3 scheme rather than s3a/s3n to refer to the data, so that access goes through the optimized path.
• Use EMRFS consistency only if it is required.
• Find an optimal configuration for the number of executors, the memory setting for each executor and the number of cores for
the Spark job.
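To show why a broadcast join suits a small table, the sketch below implements the underlying idea in plain Python: the small side is turned into an in-memory lookup that every worker can hold, so the large side is streamed once and never shuffled. This illustrates the concept and is not Spark's implementation; in Spark the same effect is requested with a broadcast hint such as `pyspark.sql.functions.broadcast`. The table and column names are made up.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts by building a hash table on the small side."""
    lookup = {}
    for row in small_rows:                 # the "broadcast" side, held in memory
        lookup.setdefault(row[key], []).append(row)
    for row in large_rows:                 # streamed once, no sorting or shuffling
        for match in lookup.get(row[key], []):
            yield {**match, **row}

# Hypothetical data: a large event table and a small article dimension.
events = [
    {"article_id": 1, "views": 10},
    {"article_id": 2, "views": 5},
    {"article_id": 3, "views": 1},
]
articles = [
    {"article_id": 1, "section": "news"},
    {"article_id": 2, "section": "sports"},
]
joined = list(broadcast_hash_join(events, articles, "article_id"))
```

When both sides are too large to hold in memory, this approach no longer works, which is where a sort-merge join on bucketed tables comes in.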
Outcomes
Ability to run targeted mobile push and email campaigns
Consistent KPI measurement. The client has a consistent framework across properties to
measure KPIs
Better user experience. Recommendations running off the data in the Data Lake add value to the
digital properties we manage
Better business agility and product decisions based on behavioural insights. The journey from
data to decisions is made swifter
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS

More Related Content

What's hot

Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...Rittman Analytics
 
Solving Performance Problems on Hadoop
Solving Performance Problems on HadoopSolving Performance Problems on Hadoop
Solving Performance Problems on HadoopTyler Mitchell
 
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016StampedeCon
 
How Workato creates robust data pipelines and automations for you?
How Workato creates robust data pipelines and automations for you?How Workato creates robust data pipelines and automations for you?
How Workato creates robust data pipelines and automations for you?Jeraldine Phneah
 
2021 gartner mq dsml
2021 gartner mq dsml2021 gartner mq dsml
2021 gartner mq dsmlSasikanth R
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
 
Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)Jeffrey T. Pollock
 
Optimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystemOptimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystemDataWorks Summit
 
Data lake analytics for the admin
Data lake analytics for the adminData lake analytics for the admin
Data lake analytics for the adminTillmann Eitelberg
 
Oracle Stream Analytics - Developer Introduction
Oracle Stream Analytics - Developer IntroductionOracle Stream Analytics - Developer Introduction
Oracle Stream Analytics - Developer IntroductionJeffrey T. Pollock
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...Mark Rittman
 
Data in Motion vs Data at Rest
Data in Motion vs Data at RestData in Motion vs Data at Rest
Data in Motion vs Data at RestInternap
 
Chug building a data lake in azure with spark and databricks
Chug   building a data lake in azure with spark and databricksChug   building a data lake in azure with spark and databricks
Chug building a data lake in azure with spark and databricksBrandon Berlinrut
 
Intelligent Integration OOW2017 - Jeff Pollock
Intelligent Integration OOW2017 - Jeff PollockIntelligent Integration OOW2017 - Jeff Pollock
Intelligent Integration OOW2017 - Jeff PollockJeffrey T. Pollock
 
How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017Rittman Analytics
 
The Stream is the Database - Revolutionizing Healthcare Data Architecture
The Stream is the Database - Revolutionizing Healthcare Data ArchitectureThe Stream is the Database - Revolutionizing Healthcare Data Architecture
The Stream is the Database - Revolutionizing Healthcare Data ArchitectureDataWorks Summit/Hadoop Summit
 
Inside open metadata—the deep dive
Inside open metadata—the deep diveInside open metadata—the deep dive
Inside open metadata—the deep diveDataWorks Summit
 
Hybrid Data Architecture: Integrating Hadoop with a Data Warehouse
Hybrid Data Architecture: Integrating Hadoop with a Data WarehouseHybrid Data Architecture: Integrating Hadoop with a Data Warehouse
Hybrid Data Architecture: Integrating Hadoop with a Data WarehouseDataWorks Summit
 

What's hot (20)

Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
 
Solving Performance Problems on Hadoop
Solving Performance Problems on HadoopSolving Performance Problems on Hadoop
Solving Performance Problems on Hadoop
 
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
 
How Workato creates robust data pipelines and automations for you?
How Workato creates robust data pipelines and automations for you?How Workato creates robust data pipelines and automations for you?
How Workato creates robust data pipelines and automations for you?
 
2021 gartner mq dsml
2021 gartner mq dsml2021 gartner mq dsml
2021 gartner mq dsml
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)Tapping into the Big Data Reservoir (CON7934)
Tapping into the Big Data Reservoir (CON7934)
 
Optimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystemOptimizing industrial operations using the big data ecosystem
Optimizing industrial operations using the big data ecosystem
 
Data lake analytics for the admin
Data lake analytics for the adminData lake analytics for the admin
Data lake analytics for the admin
 
Oracle Stream Analytics - Developer Introduction
Oracle Stream Analytics - Developer IntroductionOracle Stream Analytics - Developer Introduction
Oracle Stream Analytics - Developer Introduction
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
 
Data in Motion vs Data at Rest
Data in Motion vs Data at RestData in Motion vs Data at Rest
Data in Motion vs Data at Rest
 
Chug building a data lake in azure with spark and databricks
Chug   building a data lake in azure with spark and databricksChug   building a data lake in azure with spark and databricks
Chug building a data lake in azure with spark and databricks
 
Intelligent Integration OOW2017 - Jeff Pollock
Intelligent Integration OOW2017 - Jeff PollockIntelligent Integration OOW2017 - Jeff Pollock
Intelligent Integration OOW2017 - Jeff Pollock
 
How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017
 
Big Data@Scale
 Big Data@Scale Big Data@Scale
Big Data@Scale
 
Using Hadoop for Cognitive Analytics
Using Hadoop for Cognitive AnalyticsUsing Hadoop for Cognitive Analytics
Using Hadoop for Cognitive Analytics
 
The Stream is the Database - Revolutionizing Healthcare Data Architecture
The Stream is the Database - Revolutionizing Healthcare Data ArchitectureThe Stream is the Database - Revolutionizing Healthcare Data Architecture
The Stream is the Database - Revolutionizing Healthcare Data Architecture
 
Inside open metadata—the deep dive
Inside open metadata—the deep diveInside open metadata—the deep dive
Inside open metadata—the deep dive
 
Hybrid Data Architecture: Integrating Hadoop with a Data Warehouse
Hybrid Data Architecture: Integrating Hadoop with a Data WarehouseHybrid Data Architecture: Integrating Hadoop with a Data Warehouse
Hybrid Data Architecture: Integrating Hadoop with a Data Warehouse
 

Similar to ACDKOCHI19 - Next Generation Data Analytics Platform on AWS

Architecting Data Lakes on AWS
Architecting Data Lakes on AWSArchitecting Data Lakes on AWS
Architecting Data Lakes on AWSSajith Appukuttan
 
Data Modernization_Harinath Susairaj.pptx
Data Modernization_Harinath Susairaj.pptxData Modernization_Harinath Susairaj.pptx
Data Modernization_Harinath Susairaj.pptxArunPandiyan890855
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesAmazon Web Services
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Amazon Web Services
 
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...Amazon Web Services
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesAmazon Web Services
 
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsFinding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsAmazon Web Services
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Amazon Web Services LATAM
 
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...DataWorks Summit/Hadoop Summit
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big DataFrank Kienle
 
Big Data Companies and Apache Software
Big Data Companies and Apache SoftwareBig Data Companies and Apache Software
Big Data Companies and Apache SoftwareBob Marcus
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchSheetal Pratik
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data LakeMetroStar
 
Steering Away from Bolted-On Analytics
Steering Away from Bolted-On AnalyticsSteering Away from Bolted-On Analytics
Steering Away from Bolted-On AnalyticsConnexica
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantageAmazon Web Services
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesDATAVERSITY
 
Track 1 Session 6_建立安全高效的資料分析平台加速金融創新_HC+EMQ Cliff(已檢核,上下無黑邊).pptx
Track 1 Session 6_建立安全高效的資料分析平台加速金融創新_HC+EMQ Cliff(已檢核,上下無黑邊).pptxTrack 1 Session 6_建立安全高效的資料分析平台加速金融創新_HC+EMQ Cliff(已檢核,上下無黑邊).pptx
Track 1 Session 6_建立安全高效的資料分析平台加速金融創新_HC+EMQ Cliff(已檢核,上下無黑邊).pptxAmazon Web Services
 

Similar to ACDKOCHI19 - Next Generation Data Analytics Platform on AWS (20)

Architecting Data Lakes on AWS
Architecting Data Lakes on AWSArchitecting Data Lakes on AWS
Architecting Data Lakes on AWS
 
Data Modernization_Harinath Susairaj.pptx
Data Modernization_Harinath Susairaj.pptxData Modernization_Harinath Susairaj.pptx
Data Modernization_Harinath Susairaj.pptx
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
 
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes
 
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsFinding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
 
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
Modernizing Business Processes with Big Data: Real-World Use Cases for Produc...
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
Big Data Companies and Apache Software
Big Data Companies and Apache SoftwareBig Data Companies and Apache Software
Big Data Companies and Apache Software
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake
 
Steering Away from Bolted-On Analytics
Steering Away from Bolted-On AnalyticsSteering Away from Bolted-On Analytics
Steering Away from Bolted-On Analytics
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
 
Track 1 Session 6_建立安全高效的資料分析平台加速金融創新_HC+EMQ Cliff(已檢核,上下無黑邊).pptx
Track 1 Session 6_建立安全高效的資料分析平台加速金融創新_HC+EMQ Cliff(已檢核,上下無黑邊).pptxTrack 1 Session 6_建立安全高效的資料分析平台加速金融創新_HC+EMQ Cliff(已檢核,上下無黑邊).pptx
Track 1 Session 6_建立安全高效的資料分析平台加速金融創新_HC+EMQ Cliff(已檢核,上下無黑邊).pptx
 

More from AWS User Group Kochi

ACDKOCHI19 - Medlife's journey on AWS from ZERO Orders to 6 digits mark
ACDKOCHI19 - Medlife's journey on AWS from ZERO Orders to 6 digits markACDKOCHI19 - Medlife's journey on AWS from ZERO Orders to 6 digits mark
ACDKOCHI19 - Medlife's journey on AWS from ZERO Orders to 6 digits markAWS User Group Kochi
 
ACDKOCHI19 - Become Thanos of the Lambda Land: Wield all the Infinity Stones
ACDKOCHI19 - Become Thanos of the Lambda Land: Wield all the Infinity StonesACDKOCHI19 - Become Thanos of the Lambda Land: Wield all the Infinity Stones
ACDKOCHI19 - Become Thanos of the Lambda Land: Wield all the Infinity StonesAWS User Group Kochi
 
ACDKOCHI19 - Rapid development, CI/CD for Chatbots on AWS
ACDKOCHI19 - Rapid development, CI/CD for Chatbots on AWSACDKOCHI19 - Rapid development, CI/CD for Chatbots on AWS
ACDKOCHI19 - Rapid development, CI/CD for Chatbots on AWSAWS User Group Kochi
 

ACDKOCHI19 - Next Generation Data Analytics Platform on AWS

  • 1. About HiFX: Established in 2001, HiFX is an Amazon Web Services Consulting Partner. We have been designing and migrating workloads on the AWS cloud since 2010, and helping organizations become truly data driven by building big data solutions since 2015.
  • 2. Case Study with Malayala Manorama: Malayala Manorama is one of the largest media conglomerates in India. They run manoramaonline.com, the largest news portal for Malayalees around the world, and several other digital media properties. In 2016, Manorama embarked on a project to develop an in-house analytics pipeline that could unify enormous amounts of raw data from multiple web domains and convert it into meaningful insights. The company currently has 10 domains, such as its matrimonial and real estate sites, with plans to further expand its digital footprint. HiFX has been Malayala Manorama’s technology partner for more than 18 years and was approached to design this new data analytics pipeline.
  • 3. Manorama Online, Manorama News, The Week, Vanitha, Watchtime India, E-paper/E-magazine, Chuttuvattom, OnManorama, M4Marry, HelloAddress, Quickerala, Qkdoc, Entedeal, Manorama Horizon, Android, iOS, Manorama MAX
  • 4. The Challenges: Lack of agility and accessibility for data analysis, which would aid the product team in making smart business decisions and improving strategies. Increasing volume and velocity of data: with new digital properties being added, the collection and storage layers had to be designed to scale well. Dozens of independently managed collections of data, leading to data silos: having no single source of truth made it difficult to identify what type of data was available, get access to it and integrate it. Poorly recorded data: often, the meaning and granularity of the data were lost in processing.
  • 6. Vision: Lens is a unified data platform with a consolidated solution stack to generate meaningful real-time insights and drive revenue. Better product decisions based on behavioral insights. Add value to our businesses. Increase CLV. Deeply understand every user's journey. Immediate actions, smart targeting and marketing automation. Positively impact KPIs.
  • 7. Components: 01 Unified data pipeline: connecting dozens of data streams and repositories to a unified data pipeline, enabling near real-time access to any data source. 02 Well governed data lake: a well governed data lake architected to store raw and enriched data, thereby eliminating storage silos. 03 Data processing framework: a framework that supports stream and batch workloads to aid analytics and machine learning, along with smart workflow management. 04 Big data stores for OLAP: well designed big data stores for reporting and exploratory analysis. 05 Recommendations engine: a recommendations and personalization engine powered by machine learning. 06 Smart dashboards: dynamic dashboards and smart visualizations that make data tell stories and drive insights.
  • 8. Solution Stack: 01 Streaming analytics: watch attention shift in real time, with updates every few seconds to quickly capitalize on attention to every post, campaign and section. 02 Batch analytics: a historical view of unique attention metrics to understand what happened in the past and use it to plan for the future. 03 FB IA and Google AMP integrations: integrations with Google Accelerated Mobile Pages and Facebook Instant Articles. 04 Video analytics: track key metrics such as visits, plays, dropouts and minutes watched. 05 Content personalization: a recommendations and personalization engine powered by machine learning. 06 Advanced reporting: dynamic dashboards and smart visualizations that make data tell stories and drive insights. 07 Raw data access: clean, structured data that the team can analyze directly.
  • 9. Key Infrastructure Components: CloudFront, ECS, Kinesis Stream, S3 Bucket, EMR, Spark, Sagemaker, Aurora, Redshift, Elasticsearch Service, DynamoDB, Databricks, AWS ALB, Apache Airflow
  • 11. Trackers (Data / Event Trackers): Android SDK, iOS SDK, JS SDK, PHP SDK, Java SDK. Trackers allow us to collect data from any type of digital application, service or device. All trackers adhere to the LENS Tracker Protocol.
  • 12. Collectors - Scribe (Data Collectors): 01 Horizontally scalable. 02 Engineered for high concurrency and designed for low latency. 03 Written in Go/Java. Scribe collects data from the trackers and writes it to the Kinesis data firehose. This allows near real-time processing of data as well as storage in the data lake for further batch analysis. Scribe is containerized on ECS Fargate. Scribe API endpoints: • Event tracker • Pixel tracker • Click tracker • AMP tracker
  • 13. Accumulo / Data Lake: ACCUMULO is the data consumer component responsible for: reading data from the event firehose (Kinesis Streams); performing rudimentary data quality checks; converting data to Avro format with Snappy compression; loading it into the Data Lake. DATA LAKE supports the following capabilities: capture and store raw data securely, at scale and at low cost; store many types of data in the same repository; define the structure of the data at the time it is used. It is designed to retain all data, support all data types and adapt easily to changes.
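The slide does not say what Accumulo's "rudimentary data quality checks" consist of. The following is only a minimal sketch of what such a check could look like before events are converted and loaded. The field names (`event_id`, `event_type`, `timestamp`) and both function names are hypothetical and are not taken from the LENS Tracker Protocol.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical required fields; the real LENS Tracker Protocol schema is not shown in the slides.
REQUIRED_FIELDS = ("event_id", "event_type", "timestamp")


def check_event(event: Dict[str, Any]) -> List[str]:
    """Return the problems found in one raw event (an empty list means it passes)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event]
    ts = event.get("timestamp")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if ts is not None and (isinstance(ts, bool) or not isinstance(ts, (int, float)) or ts <= 0):
        problems.append("timestamp must be a positive epoch value")
    return problems


def split_events(
    events: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Tuple[Dict[str, Any], List[str]]]]:
    """Separate events that pass the checks from those that fail, keeping the reasons."""
    valid, rejected = [], []
    for event in events:
        problems = check_event(event)
        if problems:
            rejected.append((event, problems))
        else:
            valid.append(event)
    return valid, rejected
```

Keeping the rejection reasons next to each rejected event makes it possible to report instrumentation problems rather than silently dropping data.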
  • 14. Prism - Processing Engine (Prism Analytics Engine): a unified processing engine that uses Apache Spark and is written in Scala. It can run on EMR-5.27 and as a Databricks job running on AWS spot/on-demand instances.
  • 15. Prism - Processing Engine (Prism Analytics Engine): Data Cleanser: performs data cleansing including • Normalization • De-duplication • Bot exclusion • Fixes for client clock issues. Data Enricher: performs enrichment activities including • User agent parsing to understand OS / platform • Referrer parsing to understand channels • IP to location transformation • Lat+long to location transformation • Widening event data with user profile information. Data Quality Checks: performs the data quality checks needed to detect, report and omit instrumentation errors. Data Reconciler: reconciles data that is sacrosanct, like transactions, from the feeds generated by the master DB. Sessionization/User Merging: sessionizes and merges users based on domain/anonymous id. Data Refresher: loads the data into the respective tables in the data warehouse and other reporting data stores.
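Prism itself is a Spark application written in Scala, and its code is not shown. To make the de-duplication and sessionization steps concrete, here is a small plain-Python sketch of the general idea. The tuple layout, the `sessionize` name and the 30-minute inactivity timeout are assumptions for illustration, not details from the slides.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed inactivity timeout; the slides do not state the value Prism uses.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events: Iterable[Tuple[str, str, int]]) -> Dict[str, List[List[str]]]:
    """Group (user_id, event_id, epoch_seconds) tuples into per-user sessions.

    Duplicate event ids are dropped first. A new session starts whenever the gap
    since the user's previous event exceeds SESSION_GAP_SECONDS.
    """
    # De-duplication: keep only the first occurrence of each event id.
    seen = set()
    unique = []
    for user_id, event_id, ts in events:
        if event_id not in seen:
            seen.add(event_id)
            unique.append((user_id, event_id, ts))

    # Sessionization: walk each user's events in time order and split on long gaps.
    sessions: Dict[str, List[List[str]]] = {}
    last_seen: Dict[str, int] = {}
    for user_id, event_id, ts in sorted(unique, key=lambda e: (e[0], e[2])):
        user_sessions = sessions.setdefault(user_id, [])
        if user_id not in last_seen or ts - last_seen[user_id] > SESSION_GAP_SECONDS:
            user_sessions.append([])
        user_sessions[-1].append(event_id)
        last_seen[user_id] = ts
    return sessions
```

In a Spark job the same logic would typically be expressed with window functions partitioned by user, but that translation is not shown here.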
  • 16. Prism - Real-time Analytics (Spark Streaming): • Uses Structured Streaming to stream live events into Elasticsearch. • The stack can run on both EMR and Databricks. • Runs on 50 -4.x large instances, scaled to 100 instances during election time. • Configuration: spark.executor.cores=4, spark.executor.memory=25g, spark.executor.instances=50
  • 17. Prism - Batch Analytics (Spark on EMR/Databricks): • A scheduled job kicks off every day to process all the events for the day and write the cleansed raw/aggregated data to Redshift (the primary data store). • It also writes the data in Parquet format so that Presto/Databricks Delta Lake can be run on top if needed. • Runs on 20 r4.2xlarge instances. • Configuration: spark.executor.cores=3, spark.executor.memory=20g, spark.executor.instances=39
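As a rough sanity check on the executor settings quoted for the streaming and batch jobs, the arithmetic below multiplies them out and compares the batch figures with the raw capacity of 20 r4.2xlarge instances. The instance figures (8 vCPUs, 61 GiB each) are AWS's published r4.2xlarge specification. The `executor_footprint` helper is hypothetical, and the comparison ignores executor memory overhead, the driver, and OS/YARN reservations.

```python
def executor_footprint(instances: int, cores: int, memory_g: int) -> dict:
    """Total concurrent task slots and executor heap requested by a Spark job."""
    return {"task_slots": instances * cores, "executor_memory_g": instances * memory_g}


# Settings quoted on the slides.
streaming = executor_footprint(instances=50, cores=4, memory_g=25)
batch = executor_footprint(instances=39, cores=3, memory_g=20)

# Raw capacity of the batch cluster: 20 r4.2xlarge nodes (8 vCPUs, 61 GiB each).
BATCH_NODES = 20
cluster_vcpus = BATCH_NODES * 8
cluster_memory_gib = BATCH_NODES * 61

print("streaming:", streaming)
print("batch:", batch)
print("batch cluster capacity:", cluster_vcpus, "vCPUs,", cluster_memory_gib, "GiB")
print("batch executors per node (average):", 39 / BATCH_NODES)
```

The batch job requests just under two executors per node on average, which leaves headroom on each node for memory overhead and system processes.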
  • 18. Data Stores: 01 Data warehouse (Amazon Redshift), the primary data store: • Supports batch workloads • Supports up to 50 concurrent queries • Cache layer pgpool deployed • WLM and concurrency scaling enabled • Elastic resize • Redshift Spectrum to query archived data in S3. 02 Real-time reporting store (Elasticsearch), backing the content analytics real-time dashboard: • Fluidic dashboard with granular filters • Data visualization using Kibana. 03 Recommendation results (DynamoDB): features like horizontal scalability, low operational overhead and predictable performance make DynamoDB a good choice for storing recommendation results.
  • 19. Orchestration (Apache Airflow): Workflow management: used to programmatically author, schedule and monitor workflows. Rich UI: makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
  • 20. Data Retention Strategy: • Find a balance between what is optimal for your clients' business needs and operational cost effectiveness. • Ensure the data retention policies align with regulatory restrictions (GDPR). • Define proper lifecycle policies at different stages. • An S3-IA/Glacier lifecycle policy is defined for the data at rest in the data lake, and a scheduled purging policy is defined for the primary data store (Redshift). • We keep a quarter's worth of data in the primary data store (Redshift); older data is archived to S3. • Redshift Spectrum is used for detailed analysis of older data. • For YoY and QoQ comparisons, we pre-calculate the figures as part of the quarterly process and store the aggregated results in the data store.
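The slides mention an S3-IA/Glacier lifecycle policy for the data lake but do not give its rules. The sketch below shows one way such a policy could be written, as a Python dict in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument. The rule ID, the `events/` prefix and the 90 and 365 day thresholds are all assumed values for illustration.

```python
from typing import Dict

# Hypothetical values: the rule ID, prefix and day thresholds are illustrative,
# not taken from the slides.
LIFECYCLE_CONFIGURATION = {
    "Rules": [
        {
            "ID": "archive-data-lake-events",
            "Filter": {"Prefix": "events/"},
            "Status": "Enabled",
            "Transitions": [
                # Move objects to infrequent access once they are rarely read.
                {"Days": 90, "StorageClass": "STANDARD_IA"},
                # Move objects to Glacier for long-term archival.
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}


def transition_days(config: dict) -> Dict[str, int]:
    """Map each target storage class to the object age (in days) at which it applies."""
    return {
        transition["StorageClass"]: transition["Days"]
        for rule in config["Rules"]
        if rule["Status"] == "Enabled"
        for transition in rule["Transitions"]
    }


print(transition_days(LIFECYCLE_CONFIGURATION))
```

Applying the configuration to a bucket is left out on purpose, since the bucket name and credentials are deployment specific.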
  • 21. Page Views Dashboard - KPIs/Different Angles: Domain specific KPIs (key metrics in the content dashboard): new and returning visitors, engaged time, social shares and referrals, bounce rate, video play rate. Different angles (explore the content data from these angles): titles, authors, sections, tags, referrers, campaigns, Google AMP, Facebook IA.
  • 22. Scalability / Performance: • The collect, storage and process layers are designed to autoscale. • Batch analytics takes an average of 30-40 minutes to process and refresh data for the entire day across all reporting dashboards. • Turnaround latency at the data collector: 75th percentile 27 ms, 95th percentile 156 ms. • The platform currently handles about 150 GB of data per day, with an average of 300 million events processed per day. • Data collectors, data consumers, data processors and data reporting stores are horizontally scalable. • The real-time streaming stack currently processes 500K events in less than 10 seconds.
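The volume figures on this slide can be turned into per-second and per-event numbers with simple arithmetic. This is a back-of-the-envelope sketch that assumes decimal gigabytes (1 GB = 10^9 bytes) and an even spread of events across the day, which real traffic will not follow.

```python
EVENTS_PER_DAY = 300_000_000      # "300 million events processed per day"
BYTES_PER_DAY = 150 * 10**9       # "about 150 GB of data per day", decimal GB assumed
SECONDS_PER_DAY = 24 * 60 * 60

# Average ingest rate if events were spread evenly over the day.
average_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY

# Average size of one event.
average_bytes_per_event = BYTES_PER_DAY / EVENTS_PER_DAY

# "500K events in less than 10 seconds" implies at least this streaming throughput.
streaming_events_per_second = 500_000 / 10

print(f"average ingest rate: {average_events_per_second:.0f} events/s")
print(f"average event size: {average_bytes_per_event:.0f} bytes")
print(f"streaming throughput: at least {streaming_events_per_second:.0f} events/s")
```

On these assumptions the streaming stack's quoted throughput is well above the daily average ingest rate, which is consistent with needing headroom for traffic peaks.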
  • 23. Best Practices in Spark: • Use Dataset, DataFrames and Spark SQL instead of RDDs to get the benefits of the Catalyst optimizer. • Choose the best data format and compression. • Apache Parquet gives the fastest read performance with Spark through its vectorized Parquet reader; run Presto/Databricks Delta Lake on top if needed. • Avro offers rich schema support and more efficient writes than Parquet. • Choose either Snappy or LZO compression, as they offer a balance between splittability and block compression. • Use the Spark Web UI to explore your jobs, tasks, storage and SQL query plans to optimize your Spark execution. • Look at the Spark event timeline to see the amount of time spent in each stage/task. • Check the shuffles between stages and the amount of data shuffled (use the spark.sql.shuffle.partitions option if needed). • Check the join algorithms being used. • Broadcast join should be used when one table is small. • Sort-merge join should be used for large tables. You can use bucketing to pre-sort and group tables; this avoids shuffling in the sort-merge join. • Enable Dynamic Partition Pruning / flattenScalarSubqueriesWithAggregates / Bloom Filter Join / Optimized Join Reorder. • Use the s3 protocol instead of s3a/s3n to refer to the data so that it goes through the optimized path. • Use EMRFS consistency only if it is required. • Find an optimal configuration for the number of executors, the memory setting for each executor and the number of cores for the Spark job.
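To illustrate why a broadcast join suits the case where one table is small, here is a plain-Python sketch of the underlying idea: the small table is turned into an in-memory lookup that every worker can hold, so the large table is streamed once and never shuffled. This is a conceptual illustration with hypothetical names, not Spark's implementation or API.

```python
from typing import Any, Dict, Iterable, Iterator, List


def broadcast_hash_join(
    large_rows: Iterable[Dict[str, Any]],
    small_rows: Iterable[Dict[str, Any]],
    key: str,
) -> Iterator[Dict[str, Any]]:
    """Inner-join two collections of dict rows on `key`.

    The small side is materialized as a hash table (the "broadcast" copy);
    the large side is only iterated, so it never has to be sorted or repartitioned.
    """
    lookup: Dict[Any, List[Dict[str, Any]]] = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    for row in large_rows:
        for match in lookup.get(row[key], []):
            # Columns from the large side win if both sides share a column name.
            yield {**match, **row}


# Hypothetical example data: page view events joined to a small authors table.
page_views = [
    {"author_id": 1, "page": "/a"},
    {"author_id": 2, "page": "/b"},
    {"author_id": 3, "page": "/c"},
]
authors = [{"author_id": 1, "name": "A"}, {"author_id": 2, "name": "B"}]

joined = list(broadcast_hash_join(page_views, authors, key="author_id"))
print(joined)
```

The approach stops being attractive once the lookup no longer fits comfortably in each worker's memory, which is where a sort-merge join on bucketed tables becomes the better fit.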
  • 24. Outcomes: Ability to run targeted mobile push and email campaigns. Consistent KPI measurement: the client has a consistent framework across properties to measure KPIs. Better user experience: recommendations running off the data in the Data Lake add value to the digital properties we manage. Better business agility and product decisions based on behavioural insights: the journey from data to decisions is made swifter.