1. APACHE SPARK & ELASTICSEARCH
Holden Karau
Reducing duplicated code and saving on network overhead
2. Who am I?
Holden Karau
● Software Engineer @ Databricks
● I’ve worked with Elasticsearch before
● I prefer she/her for pronouns
● Author of a book on Spark and co-writing another*
● github https://github.com/holdenk
● e-mail holden@databricks.com
● @holdenkarau
*Which is why I might be sleepy today.
3. What are Spark & Elasticsearch?
Spark
● Apache Spark™ is a fast and general engine for large-scale data processing.
● http://spark.apache.org/
Elasticsearch
● Elasticsearch is a real-time distributed search and analytics engine.
● http://www.elasticsearch.org/
4. Talk overview
Goal: understand how to work with ES & Hadoop
● Spark & Spark Streaming let us re-use indexing code
● It's a bit ugly right now….
● Demo* with tweets & top hashtags per region
● We can customize the ES connector to write to the shard based on partition
● This is an early version of the talk (feedback welcome!)
Assumptions:
● Familiar(ish) with Elasticsearch or at least Solr
● Can read Scala
*Demo gods willing.
5. Why should you care?
Small differences between off-line and on-line
Spot the difference picture from http://en.wikipedia.org/wiki/Spot_the_difference#mediaviewer/File:Spot_the_difference.png
7. Cat picture from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
8. Let's start with the on-line pipeline
// Create a streaming context with a 1-second batch interval
val ssc = new StreamingContext(master, "IndexTweetsLive", Seconds(1))
// Set up the system properties for twitter
System.setProperty("twitter4j.oauth.consumerKey", cK)
System.setProperty("twitter4j.oauth.consumerSecret", cS)
System.setProperty("twitter4j.oauth.accessToken", aT)
System.setProperty("twitter4j.oauth.accessTokenSecret", aTS)
val tweets = TwitterUtils.createStream(ssc, None)
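Not shown on the slide: nothing actually runs until the StreamingContext is started. A minimal sketch of the driver's tail end, assuming the rest of the pipeline has been wired up first:

// Start the streaming computation and block until it is stopped.
ssc.start()
ssc.awaitTermination()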
9. Let's get ready to write the data into Elasticsearch
Photo by Cloned Milkmen
10. Let's get ready to write the data into Elasticsearch
def setupEsOnSparkContext(sc: SparkContext) = {
  val jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("mapred.output.format.class",
    "org.elasticsearch.hadoop.mr.EsOutputFormat")
  jobConf.setOutputCommitter(classOf[FileOutputCommitter])
  jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, "twitter/tweet")
  FileOutputFormat.setOutputPath(jobConf, new Path("-"))
  jobConf
}
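Slide 13 below calls a three-argument variant that also takes the ES resource and node list. The deck doesn't show that overload, so this is only a sketch of what it might look like; ConfigurationOptions.ES_NODES is the connector's standard es.nodes setting:

// Hypothetical overload matching the call on slide 13.
def setupEsOnSparkContext(sc: SparkContext, esResource: String,
    esNodes: Option[String] = None) = {
  val jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("mapred.output.format.class",
    "org.elasticsearch.hadoop.mr.EsOutputFormat")
  jobConf.setOutputCommitter(classOf[FileOutputCommitter])
  jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, esResource)
  // Point the connector at specific nodes when provided.
  esNodes.foreach(nodes => jobConf.set(ConfigurationOptions.ES_NODES, nodes))
  FileOutputFormat.setOutputPath(jobConf, new Path("-"))
  jobConf
}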
12. Let's format our tweets
def prepareTweets(tweet: twitter4j.Status) = {
  val loc = tweet.getGeoLocation()
  val lat = loc.getLatitude()
  val lon = loc.getLongitude()
  val hashTags = tweet.getHashtagEntities().map(_.getText())
  val fields = HashMap(
    "docid" -> tweet.getId().toString,
    "message" -> tweet.getText(),
    "hashTags" -> hashTags.mkString(" "),
    "location" -> s"$lat,$lon"
  )
  // Convert to Hadoop Writable types
  mapToOutput(fields)
}
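One caveat: twitter4j's getGeoLocation() returns null for tweets without coordinates, so the version above would throw on most of the stream. A defensive variant (my sketch, not the talk's code) only emits the location field when it exists:

// Sketch: skip the location field for tweets with no geo information.
val locField = Option(tweet.getGeoLocation()).map { loc =>
  "location" -> s"${loc.getLatitude()},${loc.getLongitude()}"
}
HashMap(
  "docid" -> tweet.getId().toString,
  "message" -> tweet.getText(),
  "hashTags" -> hashTags.mkString(" ")
) ++ locField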
13. And save them...
tweets.foreachRDD{(tweetRDD, time) =>
  val sc = tweetRDD.context
  // The jobConf isn't serializable so we create it here
  val jobConf = SharedESConfig.setupEsOnSparkContext(sc, esResource, Some(esNodes))
  // Convert our tweets to something that can be indexed
  val tweetsAsMap = tweetRDD.map(SharedIndex.prepareTweets)
  tweetsAsMap.saveAsHadoopDataset(jobConf)
}
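As an aside, later releases of elasticsearch-hadoop also ship a native Spark binding that hides the JobConf plumbing entirely. A sketch, assuming the elasticsearch-spark artifact is on the classpath:

import org.elasticsearch.spark._ // adds saveToEs to RDDs
tweetRDD.map(SharedIndex.prepareTweets).saveToEs("twitter/tweet")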
17. Now let's find some common words….
// Extract the top words
val words = tweets.flatMap{tweet =>
  tweet.flatMap{elem =>
    elem._2 match {
      case null => Nil
      case _ => elem._2.split(" ")
    }
  }
}
val wordCounts = words.countByValue()
println("------")
wordCounts.foreach{ case (key, value) => println(key + ":" + value) }
println("------")
18. object WordCountOrdering extends Ordering[(String, Int)] {
  def compare(a: (String, Int), b: (String, Int)) = {
    b._2 compare a._2
  }
}
val wc = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val topWords = wc.takeOrdered(40)(WordCountOrdering)
ok, fine, the "top" words
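takeOrdered hands back just those 40 pairs on the driver (descending by count, since compare flips a and b), so printing them is a plain local loop:

// topWords is an Array[(String, Int)] on the driver at this point.
topWords.foreach{ case (word, count) => println(s"$word: $count") }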
21. Indexing Part 2 (electric boogaloo)
Writing directly to a node with the correct shard saves us network overhead
Screen shot of elasticsearch-head http://mobz.github.io/elasticsearch-head/
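The intuition: Elasticsearch routes each document to a shard by hashing its routing key, so a document sent to the wrong node costs an extra network hop. If the RDD is partitioned the same way the index is sharded, each Spark partition can write straight to the node hosting its shard. A conceptual sketch only, since Spark's HashPartitioner matches ES routing only if both sides hash the same key the same way:

import org.apache.spark.HashPartitioner
// Illustrative: key on the doc id so partitions line up with shards.
val numShards = 5 // assumption: must equal the index's shard count
val byShard = tweetsAsMap
  .map(doc => (doc("docid"), doc))
  .partitionBy(new HashPartitioner(numShards))
  .values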
22. Slight hack time
We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop ]
private int detectCurrentInstance(Configuration conf) {
  if (sparkInstance != null) {
    if (log.isDebugEnabled()) {
      log.debug(String.format("Using Spark partition info [%d] for [%s]", sparkInstance, uri));
    }
    return sparkInstance;
  }
  ….
}
23. Slight hack time
We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop ]
public org.apache.hadoop.mapred.RecordWriter getRecordWriter(FileSystem ignored,
    JobConf job, String name, Progressable progress) {
  EsRecordWriter writer = new EsRecordWriter(job, progress);
  // This is a special hack for Spark which sets the name as "part-[partition number]" so if our
  // jobconf asks for it we use this partition number as the shard number.
  if (HadoopCfgUtils.useSparkPartition(job) && name.startsWith("part-")) {
    writer.setSparkInstance(Integer.valueOf(name.substring(5)));
  }
  return writer;
}
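The name check works because Spark's saveAsHadoopDataset follows Hadoop's part-file convention: the writer for partition 3 is requested under the name part-00003, so everything after the first five characters is the partition number:

// Illustrative: recovering the partition number the way the hack does.
val name = "part-00003" // what Spark passes for partition 3
val shard = Integer.valueOf(name.substring(5)) // => 3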
24. So what does that give us?
Spark sets the file name to part-[partition number]
If we have the same partitioner, we write directly
Likely the best place to use this is in re-indexing data
25. Re-index all the things*
// Read in our data set
val currentTweets = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Object, MapWritable]],
  classOf[Object], classOf[MapWritable])
// Convert the writables to Scala maps so getOrElse works below
// (the conversion helper is elided on the slide)
val tweets = currentTweets.map{ case (_, doc) => SharedIndex.mapWritableToInput(doc) }
// Fetch them from twitter
val t4jt = tweets.flatMap{ tweet =>
  val twitter = TwitterFactory.getSingleton()
  val tweetID = tweet.getOrElse("docid", "")
  Option(twitter.showStatus(tweetID.toLong))
}
t4jt.map(SharedIndex.prepareTweets)
  .saveAsHadoopDataset(jobConf)
*Until you hit your twitter rate limit…. oops
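The footnote is real: Twitter's REST API caps status lookups, and the flatMap above starts throwing once the window is exhausted. One way to soften that (my sketch, not the talk's code) is to back off using twitter4j's rate-limit metadata:

// Sketch: retry once after sleeping out the rate-limit window.
def showStatusWithBackoff(twitter: twitter4j.Twitter, id: Long): Option[twitter4j.Status] = {
  try {
    Option(twitter.showStatus(id))
  } catch {
    case e: twitter4j.TwitterException if e.exceededRateLimitation() =>
      Thread.sleep(e.getRateLimitStatus().getSecondsUntilReset() * 1000L)
      Option(twitter.showStatus(id))
  }
}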
26. Cat photo from https://www.flickr.com/photos/deerwooduk/579761138/in/photolist-4GCc4z-4GCbAV-6Ls27-34evHS-5UBnJv-TeqMG-4iNNn5-4w7s61-6GMLYS-6H5QWY-6aJLUT-tqfrf-6mJ1Lr-84kGX-6mJ1GB-vVqN6-dY8aj5-y3jK-7C7P8Z-azEtd/