This document provides information about Aetna, a health insurance company. About 46 million customers rely on Aetna to help them make health care decisions and manage health care spending. Aetna offers medical, pharmacy, dental, life, and disability plans, as well as Medicaid services, behavioral health programs, and medical management. As of March 31, 2015, Aetna had 23.7 million medical members, approximately 15.5 million dental members, and approximately 15.4 million pharmacy benefit management services members. Its network includes more than 1.1 million health care professionals, more than 674,000 primary care doctors and specialists, and 5,589 hospitals across the US and much of the globe.
3. Helping people live healthier lives
About 46 million people rely on us to help them
make decisions about their health care and their
health care spending. Every day, we work to
make the system easier and more convenient for
our customers.
Our health insurance plans and services include:
• Medical, pharmacy and dental plans
• Life and disability plans
• Medicaid services
• Behavioral health programs
• Medical management
Aetna membership:
We proudly serve*
• 23.7 million medical members
• Approximately 15.5 million dental members
• Approximately 15.4 million pharmacy benefit management services members
Aetna health care network:
Our network stretches across the country and
across much of the globe:
• More than 1.1 million health care professionals
• More than 674,000 primary care doctors and specialists
• 5,589 hospitals
*Information as of March 31, 2015
4. Use Cases
Finding Data: Data scientists spend too much time finding the correct columns for variable selection.
• On average, column investigation takes 80% of a data scientist’s time
• Much of that time is spent meeting with Subject Matter Experts (SMEs)
Shape of Data: Reduce the number of ad hoc profiling queries run.
• ~78% of the queries run on the cluster are profiling queries
Tracking Transformations: Data scientists would like to understand how data sets are derived.
• Transformations are only tracked at a high level in documentation
6. Finding Data: Challenges
• Hive requires manual traversal of the schema to find tables or columns
• HDFS requires traversal of the directory listing to find a file
• External documentation of data locations becomes stale and unreliable as the data changes
• No practical means to add additional metadata
7. Finding Data: Solutions
• Capture Hive & HDFS metadata at runtime and store it in a repository
• Provide an API to interactively search & query the metadata
• Provide an API to enrich the logical metadata with business context
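The three solution points above can be sketched in a few lines. This is only an illustration under stated assumptions, not the system described in this deck: the `MetadataRepository` class, its `capture`, `search`, and `enrich` methods, and the sample table and column names are all hypothetical, and an in-memory dictionary stands in for the real repository.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnRecord:
    """Physical metadata for one Hive column, plus optional business context."""
    database: str
    table: str
    column: str
    data_type: str
    business_context: dict = field(default_factory=dict)


class MetadataRepository:
    """Hypothetical in-memory stand-in for the metadata repository."""

    def __init__(self):
        self._records = {}

    def capture(self, record):
        # Key by fully qualified name so a later capture replaces a stale entry.
        self._records[(record.database, record.table, record.column)] = record

    def enrich(self, database, table, column, **context):
        # Attach business context (definitions, comments) to a captured column.
        self._records[(database, table, column)].business_context.update(context)

    def search(self, term):
        # Case-insensitive substring match over names and business context values.
        needle = term.lower()
        hits = []
        for rec in self._records.values():
            haystack = [rec.database, rec.table, rec.column]
            haystack += [str(v) for v in rec.business_context.values()]
            if any(needle in text.lower() for text in haystack):
                hits.append(rec)
        return hits


# Sample names below are invented for illustration only.
repo = MetadataRepository()
repo.capture(ColumnRecord("claims", "medical_claims", "mbr_id", "string"))
repo.capture(ColumnRecord("claims", "medical_claims", "svc_dt", "date"))
repo.enrich("claims", "medical_claims", "mbr_id", definition="Member identifier")

matches = [rec.column for rec in repo.search("member")]
```

Searching for "member" finds `mbr_id` only through its business definition, which is the kind of lookup that physical column names alone would not support.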
14. Shape of Data: Challenges
• Constantly accessing the Hive metastore for basic stats was affecting running production jobs
• The limited number of stats in the default metastore was not sufficient to make an accurate assessment of the shape of the data
15. Shape of Data: Solutions
• Create a system to store profiling data that can be cross-referenced with the physical and business metadata
• Create an extensible framework for data scientists to create and add new profiling
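One way an extensible profiling framework could look is a registry that data scientists add functions to. The sketch below is an assumption for illustration, not the framework from this deck: the `profiler` decorator, `PROFILERS` registry, `profile_column` helper, and the sample table and column names are hypothetical, and a plain dictionary stands in for the profiling store.

```python
PROFILERS = {}


def profiler(name):
    """Decorator that registers a profiling function under a metric name."""
    def register(func):
        PROFILERS[name] = func
        return func
    return register


@profiler("null_count")
def null_count(values):
    return sum(1 for v in values if v is None)


@profiler("distinct_count")
def distinct_count(values):
    return len({v for v in values if v is not None})


def profile_column(store, table, column, values):
    # Run every registered profiler and file the results under (table, column),
    # a key that physical and business metadata could share for cross-referencing.
    store[(table, column)] = {name: func(values) for name, func in PROFILERS.items()}
    return store[(table, column)]


# Sample table and column names are invented for illustration only.
profile_store = {}
result = profile_column(
    profile_store, "medical_claims", "plan_code", ["A", "B", None, "A"]
)
```

Adding a new metric then means writing one decorated function; results are computed once and stored, rather than re-queried ad hoc.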
17. Tracking Transformations: Challenges
• Documenting transformations is a manual task and cannot be done at scale
• No mechanism for auditing data pipelines
• Identifying data quality and provenance is a manual effort
18. Tracking Transformations: Solutions
• Leverage the metadata captured for search to construct the flow of the transformations
• Provide an API for interrogating transformation executions
• Provide a means for visualizing transformations from source to current state
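Constructing the flow of transformations from captured metadata amounts to building a graph and walking it upstream. The following is a minimal sketch under assumptions: the record layout (target, sources, logic), the function names, and the sample tables are hypothetical and not taken from this deck.

```python
from collections import defaultdict

# Hypothetical captured records: (target table, source tables, transformation logic).
TRANSFORMATIONS = [
    ("staging.claims_clean", ["raw.claims"], "filter out rows with null member IDs"),
    (
        "marts.member_cost",
        ["staging.claims_clean", "raw.members"],
        "join and sum cost per member",
    ),
]


def build_lineage(transformations):
    """Map each derived table to its direct sources."""
    parents = defaultdict(list)
    for target, sources, _logic in transformations:
        parents[target].extend(sources)
    return parents


def upstream(parents, table):
    """Return all ancestors of a table in breadth-first order, without duplicates."""
    seen = []
    queue = list(parents.get(table, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        queue.extend(parents.get(current, []))
    return seen


lineage = build_lineage(TRANSFORMATIONS)
ancestors = upstream(lineage, "marts.member_cost")
```

The same parent map could feed a visualization, since each entry is an edge from a source table to a derived table.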
19. Mosaic
Mosaic simplifies the big data environment by providing a familiar search experience.
Search
If you know how to search Google or Amazon, you can search Mosaic.
Search returns the most relevant results found in Hive or HDFS and displays them in
an easy-to-understand format.
Get the right data right away by refining your results with suggested filters.
See business definitions and comments from other users to bring clarity to the data.
Data Profiling
Profiling stores metrics about the data you are browsing (e.g., the max, min, and
distribution of a column).
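As a rough illustration of the metrics named above, the sketch below computes a column's min, max, and value distribution. It is not Mosaic's implementation: the `profile_values` function and the sample values are hypothetical, and treating nulls as excluded is an assumption.

```python
from collections import Counter


def profile_values(values):
    """Compute min, max, and a value-frequency distribution, ignoring nulls."""
    present = [v for v in values if v is not None]
    if not present:
        return {"min": None, "max": None, "distribution": {}}
    return {
        "min": min(present),
        "max": max(present),
        "distribution": dict(Counter(present)),
    }


# Sample values are invented for illustration only.
ages = [34, 51, None, 34, 67]
stats = profile_values(ages)
```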
Lineage
Sometimes where you’re going depends on where you’ve been. Explore the lineage
tabs to see where your data came from, including if it came from external systems.
Pull back the covers on derived tables and see the transformation logic that built
them.