3. “… any illegal act characterized by deceit, concealment, or
violation of trust. These acts are not dependent upon the
threat of violence or physical force. Frauds are perpetrated
by parties and organizations to obtain money, property, or
services; to avoid payment or loss of services; or to secure
personal or business advantage.”
International Professional
Practices Framework (IPPF)
DEFINITION
5. DOMAIN OF APPLICATION
WHERE CAN FRAUD BE FOUND?
▸ Healthcare systems
▸ Credit card systems
▸ Social networks
▸ Satellite and military control systems
▸ …
7. DIFFERENCES
WHAT ARE THE CHARACTERISTICS OF HEALTHCARE DOMAIN DATA?
▸ The complexity and number of fields in this kind of data are
tremendous.
▸ People and organizations attempt to profit at the expense of others.
▸ Data is really BIG and sometimes arrives as a stream.
▸ Many kinds of data: images, raw text, sound, …
▸ Data are unlabeled and hard to classify.
▸ Concept drift.
9. TIPS
ROLE OF BIG DATA IN HEALTHCARE
▸ DNA: one of the most important public datasets on Amazon.
▸ Stanford's Big Data conference is all about healthcare.
▸ Microsoft has established an academic division to work on
healthcare.
▸ Many countries lose money to FRAUD in healthcare (up to
10% of US annual health care expenditure).
10. EXAMPLES
SOME FRAUDS THAT TRADITIONAL HEALTHCARE SYSTEMS HAVE FACED
▸ Changing a patient's insurance identification document
▸ A doctor prescribing only certain fixed brands of drugs
▸ Prescribing more expensive drugs than is usual for the same
disease
▸ A patient obtaining certain kinds of drugs more often than usual
▸ And many more…
14. STATISTICAL METHODS
STATISTICAL METHODS…
▸ Based on a set of rules
▸ Rules are described by a domain expert
▸ An application computes initial statistical parameters, e.g.:
▸ Average number of drugs per prescription
▸ Total cost per disease
▸ These baselines are then compared with new data; if a large
difference is found, an ALARM GOES OFF
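The rule-and-threshold comparison above can be sketched as follows. This is a minimal illustration, assuming a single parameter (drugs per prescription) and a standard-deviation threshold; the function names and data are invented for the example.

```python
# Rule-based statistical check: build a baseline from historical data,
# then raise an alarm when a new value deviates too far from it.
def build_baseline(drug_counts):
    """Mean and standard deviation of drugs per prescription."""
    n = len(drug_counts)
    mean = sum(drug_counts) / n
    var = sum((x - mean) ** 2 for x in drug_counts) / n
    return mean, var ** 0.5

def alarm(new_count, mean, std, k=3.0):
    """Alarm goes off if the new value is more than k std devs away."""
    return abs(new_count - mean) > k * std

baseline = [2, 3, 2, 4, 3, 2, 3, 3, 2, 4]  # historical drugs per prescription
mean, std = build_baseline(baseline)
print(alarm(3, mean, std))   # typical prescription -> False
print(alarm(15, mean, std))  # suspiciously many drugs -> True
```

The threshold `k` is exactly the kind of parameter the slides say a domain expert must set, which is where the low flexibility comes from.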
15. STATISTICAL METHODS
CONS AND PROS
▸ Very simple and easy to implement
▸ Low computation overhead
▸ Very easy to use for streaming data
▸ Low flexibility
▸ Cannot handle concept drift
▸ Adding rules is hard
▸ Everything is based on domain-expert knowledge
▸ The defined rule set may not be complete
17. MACHINE LEARNING AND DATA MINING ALGORITHMS
MACHINE LEARNING ALGORITHMS
▸ Choose one or more machine learning algorithms based
on the data
▸ Use them to learn from the data and detect frauds
▸ If the data are labeled, classification is a perfect fit
▸ Otherwise, use clustering
▸ Or use clustering to label the data, then apply classification
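The cluster-then-classify idea in the last bullet can be sketched like this. It is a toy illustration with one invented feature (claim amount) and a hand-rolled one-dimensional 2-means; it is not the paper's method.

```python
# Cluster-then-classify: cluster unlabeled claims, treat cluster ids as
# labels, then classify new claims by the nearest cluster centroid.
def two_means(xs, iters=20):
    """Simple 1-D 2-means clustering; returns the two centroids."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0 = sum(g0) / len(g0)
        c1 = sum(g1) / len(g1)
    return c0, c1

def classify(x, c0, c1):
    """Label a new claim with the nearest centroid's cluster id."""
    return 0 if abs(x - c0) <= abs(x - c1) else 1

claims = [100, 120, 110, 105, 900, 950, 980]  # two obvious groups
c0, c1 = two_means(claims)
print(classify(115, c0, c1))  # -> 0 (normal-looking amount)
print(classify(940, c0, c1))  # -> 1 (high-cost cluster)
```

A real pipeline would use many features per claim; the one-number feature here only shows the flow from unlabeled data to a usable classifier.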
19. GRAPH BASED FRAUD DETECTION
GRAPH ANALYSIS
▸ It has been gaining popularity since 2015
▸ It is still just an assistant technique used alongside machine
learning algorithms
▸ It cannot consider all aspects of fraud
▸ But it is handy
26. ENVIRONMENT OF PAPER
ENVIRONMENT OF PAPER
▸ Dataset: CMS Medicare Part-B
▸ Used Apache Hadoop and Apache Pig
▸ 8 nodes
▸ 4 cores for each node
▸ 64 GB of memory for each node
▸ Total time of execution: 3 hours
28. STEP 1
COMPUTE THE SIMILARITY BETWEEN PROVIDERS
▸ Computing similarities between providers based on
shared procedures
▸ If the similarity of two providers is above a threshold, an
edge connects them
▸ Locality-Sensitive Hashing and DIMSUM could help, but were not used
▸ 880K providers => 774 billion similarity computations
▸ My dataset: ~140 providers => ~20K similarity computations
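A minimal sketch of step 1, assuming each provider is represented as a vector of per-procedure counts and cosine similarity is the measure. The slides do not spell out the exact measure or threshold, so both are assumptions, as are the toy provider vectors.

```python
from math import sqrt

# Step 1 sketch: providers as per-procedure count vectors; connect two
# providers with an edge when their cosine similarity exceeds a threshold.
def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_edges(providers, threshold=0.8):
    """All-pairs comparison: O(n^2), which is why 880K providers blow up."""
    ids = list(providers)
    edges = []
    for i, p in enumerate(ids):
        for q in ids[i + 1:]:
            if cosine(providers[p], providers[q]) > threshold:
                edges.append((p, q))
    return edges

providers = {
    "A": [10, 0, 5],  # counts of three procedures per provider
    "B": [9, 1, 4],
    "C": [0, 8, 0],
}
print(build_edges(providers))  # -> [('A', 'B')]
```

The nested loop makes the quadratic cost of the all-pairs step explicit; LSH or DIMSUM would prune most of these comparisons at the price of approximation.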
30. STEP 2
COMPUTING PERSONALIZED PAGE RANK FOR EACH SPECIALITY
▸ Loop over all specialities
▸ For each speciality, apply Personalized PageRank to the
graph
▸ Identify anomalous providers: nodes whose PR_speciality score
is high but whose speciality is not the one used for the
PageRank calculation
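The loop above can be sketched with a plain power-iteration Personalized PageRank restricted to one speciality's seed set. The tiny graph, the speciality labels, and the flagging threshold are all invented for illustration.

```python
# Step 2 sketch: Personalized PageRank seeded on one speciality.
# Nodes outside that speciality which still score high are anomalous.
def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """Power iteration with restarts into the seed set."""
    nodes = list(adj)
    restart = {n: (1 / len(seeds) if n in seeds else 0.0) for n in nodes}
    pr = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            share = alpha * pr[n] / len(adj[n]) if adj[n] else 0.0
            for m in adj[n]:
                nxt[m] += share
        pr = nxt
    return pr

adj = {  # similarity edges from step 1 (undirected)
    "d1": ["d2", "s1"], "d2": ["d1", "s1"],
    "s1": ["d1", "d2"], "s2": [],
}
speciality = {"d1": "derm", "d2": "derm", "s1": "surgeon", "s2": "surgeon"}
pr = personalized_pagerank(adj, seeds={"d1", "d2"})
# A surgeon strongly connected to dermatologists ranks high -> anomaly
flagged = [n for n, s in pr.items() if speciality[n] != "derm" and s > 0.1]
print(flagged)  # -> ['s1']
```

Surgeon `s1` is flagged because it sits inside the dermatologists' similarity neighborhood, while the unconnected surgeon `s2` scores zero; this is exactly the "high PR, wrong speciality" signal the slides describe.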
31. EXAMPLE
PERSONALIZED PAGE RANK ON DERMATOLOGIST SPECIALITY
[Graph figure: Personalized PageRank scores (24.1%, 24.1%, 18.7%, 15.4%, 7.9%, 4.1%, 2.9%, 2.9%) over a provider graph with nodes labeled Dermatologist, Surgeon, and Internist]
33. SPARK IMPLEMENTATION
WHAT WE DID IN SPARK
▸ Implemented from scratch
▸ Changed the PageRank algorithm in Spark GraphX
▸ Every Personalized PageRank runs for 100 iterations
▸ The dataset contains 20,000 raw records
▸ It took 20 minutes to run the algorithm on a 4-core Core i7
MacBook Pro with 4 GB of memory (the main part of which
was occupied by the OS)
39. SOLUTION ANALYSIS
CONS AND PROS
▸ The algorithm needs to compute the similarity for all pairs of providers
▸ It considers only one aspect of fraud; it is not complete
▸ Low speed, and it needs a huge amount of memory (because all
similarities are computed first): 2 GB of data needs 512 GB of RAM
▸ Hard to add new data and update the graph
▸ High cost of step 2
▸ Needs defined rules to use graph analysis (in other papers)
40. SOLUTION ANALYSIS
CONS AND PROS
▸ Step 1 needs a shuffle => reduced performance
▸ Modeling as a graph => easy to understand and analyze
▸ A new, still-progressing approach to fraud detection
▸ Capable of using LSH, but 100% accuracy was wanted
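The LSH alternative mentioned above can be sketched with MinHash, which estimates the Jaccard similarity of two providers' procedure sets from short signatures instead of exact all-pairs computation; the small, tunable error is the accuracy loss the authors did not accept. The procedure codes and hash count here are illustrative assumptions.

```python
import random
from zlib import crc32

# MinHash sketch: per-hash minima over a set form a signature; the
# fraction of agreeing slots estimates Jaccard similarity.
def make_hashers(k, seed=42):
    """k affine hash functions over a stable base hash (crc32)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, 2**31), rng.randrange(2**31))
              for _ in range(k)]
    return [lambda x, a=a, b=b: (a * crc32(x.encode()) + b) % (2**31 - 1)
            for a, b in params]

def minhash_signature(items, hashers):
    """Signature = minimum hash value per function over the set."""
    return [min(h(x) for x in items) for h in hashers]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

hashers = make_hashers(128)
a = {"cpt_99213", "cpt_11100", "cpt_17000"}  # provider A's procedures
b = {"cpt_99213", "cpt_11100", "cpt_17110"}  # provider B's procedures
sig_a = minhash_signature(a, hashers)
sig_b = minhash_signature(b, hashers)
print(estimated_jaccard(sig_a, sig_b))  # close to the true Jaccard of 0.5
```

With 128 hashes the estimate sits near the true value of 0.5; more hashes tighten it, which is the speed-versus-accuracy knob the slides allude to.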
41. SUGGESTIONS
FUTURE
▸ Use other centrality algorithms
▸ Use algorithms like community detection instead of
clustering
▸ If we inject patient data, we can do more (in a bipartite
graph we can detect frauds of more popular providers).