© Cloudera, Inc. All rights reserved.
Data for Good
Cloudera Cares &
DataKind Meetup
7 May 2015
Cloudera Cares:
An employee-led and employee-driven organization
• Launched in January 2014
• 1,400 employee hours donated in 2014
• $70k+ donated in 2014
• 20+ organizations to date
Doug Cutting participating in the
BORP Revolution Ride to help raise
funds for adaptive sports gear for
the physically challenged.
Pax Data
Doug Cutting | Chief Architect & Co-Founder
Hadoop started a revolution
Now we’re winning the war
How shall we govern the peace?
We must not be tyrants
We should use our power for good
Good: Education
Good: Healthcare
Good: Climate
How can we be trusted?
Trust: Transparency
Trust: Best practices
Trust: Define abuses
Trust: Oversight
Global effort
Our duty as professionals
Thank you!
@cutting
Cloudera Academic Partnership
Amr Awadallah | CTO & VP of Engineering
@awadallah
Cloudera + Higher Education
Cloudera Academic
Partnership: Overview
Impact:
Curriculum Provided
“We were able to jumpstart an Introduction to Big
Data Analytics course thanks to the support of
Cloudera. The materials provided, including the
lab setup, are integral to the class.”
Impact:
Enterprise Grade
Cloudera Manager
Legacy systems were preventing our labs from
mapping their genome sequences in a timely
manner. Our partnership with Cloudera will cut
the time required by scientists to deliver data
from weeks to days and, eventually, to hours.
Thank You
Get involved with the Cloudera Academic Partnership:
academic_partnerships@cloudera.com
DOING GOOD WITH DATA
@duncan3ross @DataKindUK
• DataKind UK is a charity that believes we can make the world better
by using data
• We work by linking data volunteers (you) with charities
COME AND JOIN DATAKIND
DATAKIND UK TODAY
[Infographic: 808, 2, £850K, 6,850, 25 and 6; the labels for these figures did not survive extraction]
WHO HAVE WE WORKED WITH?
Children
Education
Health
Young people
Advice and support
International and community
We are hiring!
London DataDive
17-19 July
Volunteers wanted
Join us: http://www.meetup.com/DataKind-UK/
THANK YOU
CITIZENS ADVICE &
Ian Ansell, Peter Passaro,
Henry Simms & Billy Wong
318 member bureaux in England and Wales (F2F,
phone, web-chat, email/letter)
2,500+ regular community locations
1,000+ ad-hoc locations
Consumer advice service (phone, email/letter)
in England, Wales and Scotland
Our website ‘Adviceguide’ providing extensive
self-help information on a wide range of topics.
2013/14
Our services
Lots of delicious data
1. Bureau Statistics
2. Bureau Evidence Forms (BEFs)
3. Web data on the Adviceguide
BUREAU ISSUE STATS
ADVICEGUIDE STATS
BUREAU ISSUE &
PROFILE STATS
The Problem
Could data science enable Citizens Advice to anticipate or
even predict changes in the issues affecting people
every day, so it can act sooner to prevent problems escalating?
Identifying spikes and new issues - where are the next payday loans?
The Project
1. To design a tool to harness Citizens Advice’s data so
they could better identify and react to emerging social
issues in the UK.
2. To build awareness among Citizens Advice staff of new
methods for mining and using data, and opening up the
data to staff and others.
● Original brief: Develop an Issues Early Warning
System to find the next “payday loans”
● Run two DataDives to explore the data and find
different approaches to the problem
● Run longer-term DataCorps to make sense of the
DataDive findings and develop a solution
The DataDive Experience Day 1:
I can solve all the problems
of the world with my
AWESOME DATA SCIENTIST POWERS!
The DataDive Experience Day 2:
Why are all these null values here?!?!
DataDive 1: What do we do with all
this delicious data?
● Bureau Statistics (Visitors and their Issues)
● Bureau Evidence Forms
● Google Analytics
What is the central theme across the organisation?
Issue Codes!
Bureau Statistics
● Timestamp
● Issue Code
● Bureau ID
● Client ID
~2M visits/yr
~6M issues/yr
Trends & Issues Exploration
Evidence Forms
● Timestamp
● Issue Code
● Bureau ID
● Client ID
● 6 Text Fields
● ~40 Demographic Fields
~50K Forms/yr
Topic Analysis & Issues Exploration
Google Analytics
● Timestamp
● NO ISSUE CODE!
● Sessions
● Users
● New Users
~16M Unique Users
Issue Code Labelling & Data Pipelining
CAB DataCorps Project: How do we take the DataDive
work forward?
● Grand Ambition - build a prediction engine
● Needed trends across all three data types
● Evidence Forms - Better Topic Modelling
● Bureau Statistics - Look for emerging issues
● Google Analytics Data - Issue code labelling and pipeline
completion
● User Interface
DataDive 2
Citizens Advice shares their data with:
● St Mungo’s Broadway
● Northeast Child Poverty Action Committee
Elasticsearch and Kibana Save the
Day
- Struggling to get good predictions because of a
lack of contextual data
- Trend analysis was difficult because of changes
in data collection
- We already had all the evidence forms in
Elasticsearch for topic analysis
- Volunteer Ian Huston (Pivotal) started using
Kibana to explore the data
Focus Becomes the Dashboard
Final data clean up and normalisation
● Put everything into Elasticsearch
● Normalise issue codes across all 3 data types
● Other minor field normalisation
● Enrich geo data for bureau visits and evidence forms
● Evidence forms - full topic modelling
The Dashboard
Demo of the dashboard
https://drive.google.com/file/d/0B0X-Agv6DH0GZGJMbEtQdE5qUTQ/view?usp=sharing
Relationships between Issues
Motivation
● At least 30% of the CAB’s usage is by repeat
clients
● If we can offer preventive advice, we can reduce
cost and provide better service
Modelling the problem...
● Lift(B => A)
o Given B, how much more likely is A?
o = P(A|B)/P(A)
o = P(A and B)/(P(A)*P(B))
● All of the probabilities can be estimated* from case
history for each client
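The lift formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the function name `lift_table`, the shape of the input (a mapping from a client ID to the issue codes in that client's case history) and the example issue names are all hypothetical. Each probability is estimated as the share of clients whose history contains the issue (or both issues), and ordering in time is ignored here.

```python
from collections import Counter
from itertools import permutations


def lift_table(histories):
    """Estimate Lift(B => A) for every ordered pair of issues.

    histories: dict mapping a client ID to an iterable of issue codes
    (hypothetical input format, assumed for this sketch).
    Returns a dict {(b, a): lift}.
    """
    n_clients = len(histories)
    single = Counter()  # number of clients with each issue
    pair = Counter()    # number of clients with both issues of a pair

    for issues in histories.values():
        unique = set(issues)  # count each issue once per client
        single.update(unique)
        pair.update(permutations(sorted(unique), 2))

    lifts = {}
    for (b, a), n_ab in pair.items():
        p_a = single[a] / n_clients
        p_b = single[b] / n_clients
        p_ab = n_ab / n_clients
        # Lift(B => A) = P(A and B) / (P(A) * P(B))
        lifts[(b, a)] = p_ab / (p_a * p_b)
    return lifts


# Made-up example data: four clients, three issue codes.
example_histories = {
    "c1": ["debt", "benefits"],
    "c2": ["debt", "benefits"],
    "c3": ["debt"],
    "c4": ["housing"],
}
example_lifts = lift_table(example_histories)
print(example_lifts[("debt", "benefits")])
```

A lift above 1 suggests the two issues occur together more often than they would if they were independent. Without a time element the measure is symmetric, so Lift(B => A) equals Lift(A => B).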
Time matters
● There is a temporal element to the issue counts (i.e. A must
follow B)
● If two issues happen two years apart, intuitively we would think
that the link between them is not as strong as that between two
issues that are two weeks apart
o Use exponential decay to model the “aging” of the count
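One way to picture the "aging" of the count is the minimal Python sketch below. It rests on assumptions that are not in the slides: the function name `decayed_pair_counts`, the event format (client ID, day number, issue code) and the 90-day half-life are all hypothetical choices for illustration. For each client, every occurrence of issue A after issue B adds a weight that halves with each half-life of elapsed time, instead of adding a full count of 1.

```python
import math
from collections import defaultdict


def decayed_pair_counts(events, half_life_days=90.0):
    """Time-weighted counts of "issue A follows issue B" per client.

    events: iterable of (client_id, day, issue_code), where day is a
    number of days since an arbitrary origin (assumed format).
    half_life_days: assumed decay parameter, not taken from the slides.
    Returns a dict {(b, a): weighted_count}.
    """
    rate = math.log(2) / half_life_days

    by_client = defaultdict(list)
    for client, day, issue in events:
        by_client[client].append((day, issue))

    counts = defaultdict(float)
    for visits in by_client.values():
        visits.sort()  # chronological order within each client
        for i, (day_b, b) in enumerate(visits):
            for day_a, a in visits[i + 1:]:
                # Only count a different issue that strictly follows B.
                if a != b and day_a > day_b:
                    counts[(b, a)] += math.exp(-rate * (day_a - day_b))
    return dict(counts)


# Made-up example: one pair two weeks apart, one pair two years apart.
example_events = [
    ("c1", 0, "debt"),
    ("c1", 14, "benefits"),
    ("c2", 0, "debt"),
    ("c2", 730, "benefits"),
]
example_counts = decayed_pair_counts(example_events)
print(example_counts[("debt", "benefits")])
```

With these settings the pair that is two weeks apart contributes most of a full count, while the pair that is two years apart contributes almost nothing. Weighted counts like these could stand in for the raw counts when the probabilities for lift are estimated.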
Demo
Tools used - all open source
● Programming language - Python
● Statistics - SciPy
● Graph analysis - NetworkX
● Web framework - Spyre
● Graph visualisation - D3.js
The Future
Dashboard and app
● gives us a comprehensive view of all our data
● helps to spot emerging issues and explore our
hunches
Implementation
● being integrated into the Citizens Advice system
New insights already discovered
● Adviceguide Consumer section hiding key details
o just how big an issue fuel and utilities are
● Bipolar keeps cropping up in BEFs around the issues of
debt
So much more than a dashboard
New analysis techniques learnt & new technologies
introduced
Excitement about data
● Kibana dashboard showcased and loved
● Could be replacing core systems, watch this space...
● Democratised our data - staff can access and play with
it
● Now, how about delivering data to the bureaux?
Citizens Advice is in love with data!
display-screen.cab-alpha.org.uk
Project Credits
DataKind:
● Emma Prest - General Manager
● Duncan Ross - Founder UK Branch
Original Data Ambassadors:
● Iago Martinez
● Arturo Sanchez Correa
● Peter Passaro
Volunteers:
● Henry Simms
● Billy Wong
● Sam Leach
● Emmanuel Lazardis
CAB Support:
● Laura Bunt
● Pete Watson
● Ian Ansell
About 30 additional volunteers who contributed at various stages!
Elasticsearch and General Data Hosting:
Google Analytics Pipelining:
Advice and Support:
Funding:
(Alan Hardy & Livia Froelicher)


Cloudera Cares + DataKind | 7 May 2015 | London, UK

  • 1. © Cloudera, Inc. All rights reserved. Data for Good
  • 2. Cloudera Cares & DataKind Meetup 7 May 2015
  • 3. Cloudera Cares: An employee led and driven organization • Launched in January 2014 • 1,400 employee hours donated in 2014 • $70k+ donated in 2014 • 20+ organizations to date Doug Cutting participating in the BORP Revolution Ride to help raise funds for adaptive sports gear for the physically challenged.
  • 4. Pax Data Doug Cutting | Chief Architect & Co-Founder
  • 5. Hadoop started a revolution
  • 6. Now we’re winning the war
  • 7. How shall we govern the peace?
  • 8. We must not be tyrants
  • 9. We should use our power for good
  • 10. Good: Education
  • 11. Good: Healthcare
  • 12. Good: Climate
  • 13. How can we be trusted?
  • 14. Trust: Transparency
  • 15. Trust: Best practices
  • 16. Trust: Define abuses
  • 17. Trust: Oversight
  • 18. Global effort
  • 19. Our duty as professionals
  • 20. Thank you! @cutting
  • 21. Cloudera Academic Partnership Amr Awadallah | CTO & VP of Engineering @awadallah
  • 22. Cloudera + Higher Education
  • 23. Cloudera Academic Partnership: Overview
  • 24. Impact: Curriculum Provided
  • 25. “We were able to jumpstart an Introduction to Big Data Analytics course thanks to the support of Cloudera. The materials provided, including the lab setup, are integral to the class.”
  • 26. Impact: Enterprise Grade Cloudera Manager
  • 27. Legacy systems were preventing our labs from mapping their genome sequences in a timely manner. Our partnership with Cloudera will cut the time required by scientists to deliver data from weeks to days and, eventually, to hours.
  • 28. Thank You Get involved with the Cloudera Academic Partnership: academic_partnerships@cloudera.com
  • 30. @duncan3ross @DataKindUK • DataKind UK is a charity that believes we can make the world better by using data • We work by linking data volunteers (you) with charities COME AND JOIN DATAKIND
  • 31. DATAKIND UK TODAY £ 808 2 £850K 6,850 25 6
  • 32. WHO HAVE WE WORKED WITH? Children; Education; Health; Young people; Advice and support; International and community
  • 33. We are hiring! London DataDive 17-19 July. Volunteers wanted. Join us: http://www.meetup.com/DataKind-UK/ THANK YOU
  • 34. CITIZENS ADVICE & Ian Ansell, Peter Passaro, Henry Simms & Billy Wong
  • 37. Our services (2013/14): 318 member bureaux in England and Wales (F2F, phone, web-chat, email/letter); 2,500+ regular community locations; 1,000+ ad-hoc locations; consumer advice service (phone, email/letter) in England, Wales and Scotland; our website ‘Adviceguide’, providing extensive self-help information on a wide range of topics.
  • 40. 2. Bureau Evidence Forms (BEFs)
  • 41. 3. Web data on the Adviceguide
  • 43. BUREAU ISSUE STATS ADVICEGUIDE STATS BUREAU ISSUE & PROFILE STATS
  • 44. The Problem Could data science enable Citizens Advice to anticipate or even predict changes in the issues affecting people every day, to act sooner to prevent problems escalating?
  • 45. Identifying spikes and new issues - where are the next payday loans?
  • 46. The Project 1. To design a tool to harness Citizens Advice’s data so they could better identify and react to emerging social issues in the UK. 2. To build awareness among Citizens Advice staff of new methods for mining and using data, and opening up the data to staff and others.
  • 47. ● Original brief: Develop an Issues Early Warning System to find the next “payday loans” ● Run two DataDives to explore the data and find different approaches to the problem ● Run longer-term DataCorps to make sense of the DataDive findings and develop a solution
  • 48. The DataDive Experience Day 1: I can solve all the problems of the world with my AWESOME DATA SCIENTIST POWERS!
  • 49. The DataDive Experience Day 2: Why are all these null values here?!?!
  • 50. DataDive 1: What do we do with all this delicious data? ● Bureau Statistics (Visitors and their Issues) ● Bureau Evidence Forms ● Google Analytics What is the central theme across the organisation? Issue Codes!
  • 51. Bureau Statistics: ● Timestamp ● Issue Code ● Bureau ID ● Client ID; ~2M visits/yr, ~6M issues/yr; Trends & Issues Exploration
      Evidence Forms: ● Timestamp ● Issue Code ● Bureau ID ● Client ID ● 6 Text Fields ● ~40 Demographic Fields; ~50K Forms/yr; Topic Analysis & Issues Exploration
      Google Analytics: ● Timestamp ● NO ISSUE CODE! ● Sessions ● Users ● New Users; ~16M Unique Users; Issue Code Labelling & Data Pipelining
  • 52. CAB DataCorps Project: How do we take the DataDive work forward? ● Grand Ambition - build a prediction engine ● Needed trends across all three data types ● Evidence Forms - Better Topic Modelling ● Bureau Statistics - Look for emerging issues ● Google Analytics Data - Issue code labelling and pipeline completion ● User Interface
  • 53. DataDive 2 Citizens Advice shares their data with: ● St Mungo’s Broadway ● Northeast Child Poverty Action Committee
  • 54. Elasticsearch and Kibana Save the Day - Struggling to get good predictions because of a lack of contextual data - Trend analysis was difficult because of changes in data collection - We already had all the evidence forms in Elasticsearch for topic analysis - Volunteer Ian Huston (Pivotal) started using Kibana to explore the data
  • 56. Focus Becomes the Dashboard Final data clean up and normalisation ● Put everything into Elasticsearch ● Normalise issue codes across all 3 data types ● Other minor field normalisation ● Enrich geo data for bureau visits and evidence forms ● Evidence forms - full topic modelling
  • 58. Demo of the dashboard https://drive.google.com/file/d/0B0X-Agv6DH0GZGJMbEtQdE5qUTQ/view?usp=sharing
  • 60. Motivation ● At least 30% of the CAB’s usage is by repeat clients ● If we can offer preventive advice, we can reduce cost and provide better service
  • 61. Modelling the problem... ● Lift(B => A): given B, how much more likely is A? Lift(B => A) = P(A|B)/P(A) = P(A and B)/(P(A)*P(B)) ● All of the probabilities can be estimated* from case history for each client
  • 62. Time matters ● There is a temporal element to the issue counts (i.e. A must follow B) ● If two issues happen two years apart, intuitively we would think that the link between them is not as strong as that between two issues that are two weeks apart o Use exponential decay to model the “aging” of the count
  • 63. Demo
  • 64. Tools used - all open source ● Programming language - Python ● Statistics - Scipy ● Graph analysis - Networkx ● Web framework - Spyre ● Graph visualisation - D3.js
  • 65. The Future Dashboard and app ● gives us a comprehensive view of all our data ● helps to spot emerging issues and explore our hunches Implementation ● being integrated into Citizens Advice system
  • 66. New insights already discovered ● Adviceguide Consumer section hiding key details o just how big an issue fuel and utilities are ● Bipolar keeps cropping up in BEFs around the issues of debt
  • 67. So much more than a dashboard New analysis techniques learnt & new technologies introduced
  • 68. Excitement about data ● Kibana dashboard showcased and loved ● Could be replacing core systems, watch this space... ● Democratised our data - staff can access and play with it ● Now, how about delivering data to the bureaux?
  • 69. Citizens Advice is in love with data! display-screen.cab-alpha.org.uk
  • 70. Project Credits DataKind: ● Emma Prest - General Manager ● Duncan Ross - Founder UK Branch Original Data Ambassadors: ● Iago Martinez ● Arturo Sanchez Correa ● Peter Passaro Volunteers: ● Henry Simms ● Billy Wong ● Sam Leach ● Emmanuel Lazardis CAB Support: ● Laura Bunt ● Pete Watson ● Ian Ansell About 30 additional volunteers who contributed at various stages! Elasticsearch and General Data Hosting: Google Analytics Pipelining: Advice and Support: Funding: (Alan Hardy & Livia Froelicher)
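
The lift measure on slide 61 and the exponential "aging" on slide 62 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's actual code: the names `decay_weight` and `lift_table`, the half-life parameter, the `(day, issue_code)` layout of a client history, and the choice to estimate each probability as a fraction of clients are all hypothetical. Because A must follow B, the "P(A and B)" term here counts only clients for whom A occurs after B.

```python
from collections import defaultdict


def decay_weight(days_apart, half_life_days):
    """Exponential decay: a pair of issues half_life_days apart counts as 0.5."""
    return 0.5 ** (days_apart / half_life_days)


def lift_table(histories, half_life_days=None):
    """Estimate Lift(B => A) for every ordered issue pair seen in the data.

    histories: dict mapping client id -> list of (day, issue_code) tuples
               (assumed layout; day is an integer day number).
    half_life_days: if given, "B then A" counts are aged by exponential decay;
                    if None, every "B then A" pair counts fully.
    Returns a dict mapping (B, A) -> lift.
    """
    n_clients = len(histories)
    issue_clients = defaultdict(int)  # issue -> number of clients who had it
    pair_weight = defaultdict(float)  # (B, A) -> summed weight of "B then A"

    for events in histories.values():
        events = sorted(events)
        for issue in {code for _, code in events}:
            issue_clients[issue] += 1

        # Keep only the strongest (closest in time) B-then-A link per client.
        best = {}
        for i, (day_b, b) in enumerate(events):
            for day_a, a in events[i + 1:]:
                if a == b or day_a <= day_b:
                    continue  # A must be a different issue, strictly later
                if half_life_days is None:
                    w = 1.0
                else:
                    w = decay_weight(day_a - day_b, half_life_days)
                best[(b, a)] = max(best.get((b, a), 0.0), w)
        for pair, w in best.items():
            pair_weight[pair] += w

    lifts = {}
    for (b, a), w in pair_weight.items():
        p_a = issue_clients[a] / n_clients
        p_b = issue_clients[b] / n_clients
        p_b_then_a = w / n_clients
        lifts[(b, a)] = p_b_then_a / (p_a * p_b)
    return lifts


if __name__ == "__main__":
    # Made-up case histories, for illustration only.
    sample = {
        "c1": [(0, "debt"), (14, "housing")],
        "c2": [(0, "debt"), (10, "housing")],
        "c3": [(0, "benefits")],
        "c4": [(0, "housing")],
    }
    print(lift_table(sample))
    print(lift_table(sample, half_life_days=14))
```

A lift above 1 suggests that clients who raise issue B go on to raise issue A more often than the base rate alone would predict. With a half-life set, links between issues that are far apart in time contribute less, matching the intuition on slide 62.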