At an event in London on 7 May 2015, Cloudera Cares and DataKind presented the following topics:
Cloudera Cares: How we've contributed to the community in 2014
Doug Cutting: PAX Data
Amr Awadallah: Cloudera Academic Partnership
Duncan Ross: DataKind UK Overview
Ian Ansell, Peter Passaro, Henry Simms, Billy Wong: Citizens Advice and DataKind
30. @duncan3ross @DataKindUK
• DataKind UK is a charity that believes we can make the world better
by using data
• We work by linking data volunteers (you) with charities
COME AND JOIN DATAKIND
37. Our services, 2013/14
318 member bureaux in England and Wales (F2F,
phone, web-chat, email/letter)
2,500+ regular community locations
1,000+ ad-hoc locations
Consumer advice service (phone, email/letter)
in England, Wales and Scotland
Our website ‘Adviceguide’ providing extensive
self-help information on a wide range of topics.
44. The Problem
Could data science enable Citizens Advice to anticipate or
even predict changes in the issues affecting people
every day, to act sooner to prevent problems escalating?
46. The Project
1. To design a tool to harness Citizens Advice’s data so
they could better identify and react to emerging social
issues in the UK.
2. To build awareness among Citizens Advice staff of new
methods for mining and using data, and opening up the
data to staff and others.
47. ● Original brief: Develop an Issues Early Warning
System to find the next “payday loans”
● Run two DataDives to explore the data and find
different approaches to the problem
● Run longer-term DataCorps to make sense of the
DataDive findings and develop a solution
48. The DataDive Experience Day 1:
I can solve all the problems
of the world with my
AWESOME DATA SCIENTIST POWERS!
50. DataDive 1: What do we do with all
this delicious data?
● Bureau Statistics (Visitors and their Issues)
● Bureau Evidence Forms
● Google Analytics
What is the central theme across the organisation?
Issue Codes!
51. Bureau Statistics
● Timestamp
● Issue Code
● Bureau ID
● Client ID
~2M visits/yr
~6M issues/yr
Trends & Issues Exploration
Evidence Forms
● Timestamp
● Issue Code
● Bureau ID
● Client ID
● 6 Text Fields
● ~40 Demographic Fields
~50K Forms/yr
Topic Analysis & Issues Exploration
Google Analytics
● Timestamp
● NO ISSUE CODE!
● Sessions
● Users
● New Users
~16M Unique Users
Issue Code Labelling & Data Pipelining
52. CAB DataCorps Project: How do we take the DataDive
work forward?
● Grand Ambition - build a prediction engine
● Needed trends across all three data types
● Evidence Forms - Better Topic Modelling
● Bureau Statistics - Look for emerging issues
● Google Analytics Data - Issue code labelling and pipeline
completion
● User Interface
53. DataDive 2
Citizens Advice shares their data with:
● St Mungo’s Broadway
● Northeast Child Poverty Action Committee
54. Elasticsearch and Kibana Save the
Day
- Struggling to get good predictions because of a
lack of contextual data
- Trend analysis was difficult because of changes
in data collection
- We already had all the evidence forms in
Elasticsearch for topic analysis
- Volunteer Ian Huston (Pivotal) started using
Kibana to explore the data
56. Focus Becomes the Dashboard
Final data clean up and normalisation
● Put everything into Elasticsearch
● Normalise issue codes across all 3 data types
● Other minor field normalisation
● Enrich geo data for bureau visits and evidence forms
● Evidence forms - full topic modelling
60. Motivation
● At least 30% of the CAB’s usage is by repeat
clients
● If we can offer preventive advice, we can reduce
cost and provide better service
61. Modelling the problem...
● Lift(B => A)
o Given B, how much more likely is A?
o = P(A|B)/P(A)
o = P(A and B)/(P(A)*P(B))
● All of the probabilities can be estimated* from case
history for each client
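
The lift formula above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the input shape (one set of issue codes per client), the function name `lift`, and the toy case histories are all assumptions made for the example. Each probability is estimated as the fraction of clients whose history contains the issue.

```python
def lift(histories, a, b):
    """Estimate Lift(B => A) = P(A and B) / (P(A) * P(B)).

    `histories` is a list with one set of issue codes per client
    (a hypothetical input shape, not the project's actual schema).
    """
    n = len(histories)
    n_a = sum(1 for h in histories if a in h)
    n_b = sum(1 for h in histories if b in h)
    n_ab = sum(1 for h in histories if a in h and b in h)
    if n_a == 0 or n_b == 0:
        return float("nan")  # lift is undefined if either issue never occurs
    return (n_ab / n) / ((n_a / n) * (n_b / n))


# Toy case histories, invented purely for illustration.
histories = [
    {"debt", "housing"},
    {"debt", "housing"},
    {"debt"},
    {"benefits"},
]
print(lift(histories, a="housing", b="debt"))
```

A value above 1 suggests that clients with issue B are more likely than average to also have issue A. This sketch ignores the order and timing of the issues.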
62. Time matters
● There is a temporal element to the issue counts (i.e. A must
follow B)
● If two issues happen two years apart, intuitively we would think
that the link between them is not as strong as that between two
issues that are two weeks apart
o Use exponential decay to model the “aging” of the count
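
One way to sketch the "aging" idea in Python is shown below. The half-life parameterisation, the 90-day default, the function names and the sample events are all assumptions for illustration; the slides do not state the decay rate or the exact form used. Each (B, then A) pair contributes a weight between 0 and 1 instead of a full count of 1, and pairs where A comes before B contribute nothing.

```python
from datetime import date


def decay_weight(date_b, date_a, half_life_days=90.0):
    """Weight of a (B then A) pair, halving every `half_life_days`.

    0.5 ** (t / h) is the same as exp(-ln(2) * t / h), i.e. exponential decay.
    The 90-day half-life is an arbitrary illustrative default.
    """
    gap_days = (date_a - date_b).days
    if gap_days < 0:
        return 0.0  # A must follow B, so reversed pairs do not count
    return 0.5 ** (gap_days / half_life_days)


def decayed_count(events, a, b, half_life_days=90.0):
    """Sum decayed weights over all (B, A) pairs in one client's history.

    `events` is a list of (date, issue_code) tuples (hypothetical shape).
    Assumes a and b are different issue codes.
    """
    total = 0.0
    for d_b, code_b in events:
        if code_b != b:
            continue
        for d_a, code_a in events:
            if code_a == a:
                total += decay_weight(d_b, d_a, half_life_days)
    return total


# Invented history: housing issues two weeks and two years after a debt issue.
events = [
    (date(2014, 1, 1), "debt"),
    (date(2014, 1, 15), "housing"),
    (date(2016, 1, 1), "housing"),
]
print(decayed_count(events, a="housing", b="debt"))
```

With these weights, the housing issue two weeks after the debt issue counts for far more than the one two years later. The decayed counts could then stand in for the raw counts when estimating the probabilities in the lift calculation.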
64. Tools used - all open source
● Programming language - Python
● Statistics - SciPy
● Graph analysis - NetworkX
● Web framework - Spyre
● Graph visualisation - D3.js
65. The Future
Dashboard and app
● gives us a comprehensive view of all our data
● helps us spot emerging issues and explore our
hunches
Implementation
● being integrated into Citizens Advice system
66. New insights already discovered
● Adviceguide Consumer section hiding key details
o just how big an issue fuel and utilities are
● Bipolar keeps cropping up in BEFs (Bureau Evidence Forms) around the issue of
debt
67. So much more than a dashboard
New analysis techniques learnt & new technologies
introduced
68. Excitement about data
● Kibana dashboard showcased and loved
● Could be replacing core systems, watch this space...
● Democratised our data - staff can access and play with
it
● Now, how about delivering data to the bureaux?
70. Project Credits
DataKind:
● Emma Prest - General Manager
● Duncan Ross - Founder UK Branch
Original Data Ambassadors:
● Iago Martinez
● Arturo Sanchez Correa
● Peter Passaro
Volunteers:
● Henry Simms
● Billy Wong
● Sam Leach
● Emmanuel Lazardis
CAB Support:
● Laura Bunt
● Pete Watson
● Ian Ansell
About 30 additional volunteers who contributed at various stages!
Elasticsearch and General Data Hosting:
Google Analytics Pipelining:
Advice and Support:
Funding:
(Alan Hardy & Livia Froelicher)