Scalameta is a library for static analysis and processing of Scala source code, supporting both syntactic and semantic analysis. In this presentation, we explain how Scalameta works and how you can use it for custom code analysis. We demonstrate how we have used Scalameta to automate schema management and privacy protection.
4. www.scling.com
Craft vs industry
Craft:
● Each step steered by a human
○ Or primitive automation
● Improving artifacts
● Craft is the primary competence
● Components made for humans
○ Look nice, "easy to use"
○ Most popular

Industry:
● Autonomous processes
● Improving the process that creates the artifacts
● Multitude of competences
● Some components unusable by humans
○ Hard, greasy
○ Made for integration
○ Less popular
● Processes often include craft steps
5. www.scling.com
Road towards industrialisation
● LAMP stack age - manual analytics
● Data warehouse age - mechanised analytics
● Hadoop age - industrialised analytics, data-fed features, machine learning

A significant change in workflows. Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
6. www.scling.com
Simplifying use of new technology
● Enterprise big data failures
● "Modern data stack" - traditional workflows, new technology
● Low-code, no-code
7. www.scling.com
We have seen this before
Difficult adoption
4GL, UML, low-code, no-code
Software engineering education
8. www.scling.com
Data engineering in the future
● ~10 year capability gap to "data factory engineering"
● Enterprise big data failures
● "Modern data stack" - traditional workflows, new technology
● 4GL / UML phase of data engineering
● Data engineering education
9. www.scling.com
Value of data factories
● Factory value ~ robot movements.
● Data factory value ~ number of datasets.
● Differences are orders of magnitude
○ Enterprises: 100s / day
○ Spotify: 100Ks / day
○ Google: Bs / day
● Cost of DataOps - value of data
○ Low cost - further out the long tail
[Chart: disruptive value of data & ML vs traditional data warehouse reporting, along the long tail.]
10. www.scling.com
Data-factory-as-a-service
Data lake
● Data factory
○ Collected, raw data →
processed, valuable data
● Data pipelines customised for client
○ Analytics (BI, reports, A/B testing)
○ Data-fed features (autocomplete, search)
○ Learning systems (recommendations, fraud)
● Compete with data leaders:
○ Quick idea-to-production
○ Operational efficiency
11. www.scling.com
Success in data factories
● Work driven by use cases
○ Teams aligned along the value chain
● Minimal data innovation friction
○ Data democratised - accessible and usable
○ Quick code / test / debug / deploy feedback cycle
● High value / non-value work ratio
○ Guard rails to maintain speed without risk
■ Dev tooling, tests, quality metrics
○ Minimal operational toil
● Build on software engineering processes
○ Composability
○ DevOps, everything as code
○ Strong CI/CD process
14. www.scling.com
● Lowest common denominator = name, type, required
○ Types: string, long, double, binary, array, map, union, record
● Schema specification may support additional constraints, e.g. integer range, other collections
What is a schema?
Example dataset:

Id | Name   | Age | Phone
---+--------+-----+------------
1  | "Anna" | 34  | null
2  | "Bob"  | 42  | "08-123456"

Each field has a name, a type, and a required flag.

In RDBMS, relations are explicit
In lake/stream datasets, relations are implicit
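The lowest-common-denominator model above can be sketched as a small Scala ADT. This is our own illustrative model, not any particular library's types:

```scala
// Illustrative sketch: a field is a name, a type, and a required flag;
// types are string, long, double, binary, array, map, union, record.
sealed trait FieldType
case object TString extends FieldType
case object TLong   extends FieldType
case object TDouble extends FieldType
case object TBinary extends FieldType
case class TArray(elem: FieldType)       extends FieldType
case class TMap(value: FieldType)        extends FieldType
case class TUnion(alts: List[FieldType]) extends FieldType
case class TRecord(fields: List[Field])  extends FieldType

case class Field(name: String, tpe: FieldType, required: Boolean)

object SchemaExample {
  // The User table above, described in this model
  val userSchema: TRecord = TRecord(List(
    Field("id",    TLong,   required = true),
    Field("name",  TString, required = true),
    Field("age",   TLong,   required = true),
    Field("phone", TString, required = false)  // nullable, cf. Anna's null phone
  ))
}
```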
18. www.scling.com
Schema on write
● Schema defined by writer
● Destination (table / dataset / stream topic) has defined schema
○ Technical definition with metadata (e.g. RDBMS, Kafka + registry)
○ By convention
● Writes not in compliance are not accepted
○ Technically aborted (e.g. RDBMS)
○ In violation of intent (e.g. HDFS datasets)
● Can be technically enforced by producer driver
○ Through ORM / code generation
○ Schema registry lookup
Strict checking philosophy
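The strict philosophy can be sketched in a few lines of Scala. The names and the validation rules here are our own illustration, not a specific system:

```scala
// Sketch: the destination owns the schema; a write is rejected
// unless the record complies with it.
object SchemaOnWrite {
  case class Field(name: String, required: Boolean)
  case class Schema(fields: List[Field])

  def write(schema: Schema, record: Map[String, Any]): Either[String, Map[String, Any]] = {
    val names   = schema.fields.map(_.name).toSet
    val missing = schema.fields.filter(f => f.required && !record.contains(f.name))
    val unknown = record.keySet -- names
    if (missing.nonEmpty)      Left(s"missing required: ${missing.map(_.name).mkString(", ")}")
    else if (unknown.nonEmpty) Left(s"unknown fields: ${unknown.mkString(", ")}")
    else                       Right(record)  // compliant - accepted
  }

  val userSchema = Schema(List(Field("id", required = true), Field("phone", required = false)))
}
```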
19. www.scling.com
Schema on read
● Anything (technically) accepted when writing
● Schema defined by reader, at consumption
○ When joining or filtering, unknown fields go through
● Reader may impose requirements on type & value
● Violations of constraints are detected at read
○ Perhaps long after production?
○ By team not owning producer?
Loose checking philosophy
hdfs dfs -cat part-00000.json | jq -c '. | select(.country == "SE")'
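The same loose philosophy in Scala terms (our illustration, mirroring the jq example): records are untyped maps, the reader imposes the schema at consumption, and unknown fields pass through.

```scala
// Sketch: anything was accepted at write time; the reader decides
// what the data must look like, and only then do violations surface.
object SchemaOnRead {
  type Record = Map[String, Any]

  // The reader's filter, as in the jq example: country == "SE".
  // Unknown fields go through untouched.
  def selectSwedish(records: Seq[Record]): Seq[Record] =
    records.filter(_.get("country").contains("SE"))

  // A type/value constraint checked at read - possibly long after
  // production, by a team not owning the producer.
  def age(record: Record): Either[String, Int] = record.get("age") match {
    case Some(i: Int) => Right(i)
    case Some(other)  => Left(s"age has unexpected type: $other")
    case None         => Left("age missing")
  }
}
```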
20. www.scling.com
Dynamic vs static typing
Schema on write corresponds to static typing (strict); schema on read to dynamic typing (loose).

Schema on write / static typing:
Java:
user.setName("Alice");
user2.getName();
Scala:
user = User(name = "Alice", ...)
user2.name

Schema on read / dynamic typing (possible even in statically typed languages):
Java:
user.set("name", "Alice");
user2.get("name");
Python:
user.name = "Alice"
user2.name
21. www.scling.com
Schema on read or write?
[Diagram: several services, each with its own DB, export data to business intelligence. Change agility is important on the service side; production stability is important on the BI side.]
22. www.scling.com
Compatibility
● Backward (e.g. add optional field): a new consumer can read an old producer's output (= old datasets)
● Forward (e.g. remove optional field): an old consumer can read a new producer's output
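A minimal Scala sketch of backward compatibility, using case-class defaults as a stand-in for schema defaults (illustrative only; Avro resolves this on the serialized form):

```scala
// v1 producer wrote records without `phone`.
case class UserV1(id: String, name: String)
// v2 adds an optional field with a default: backward compatible,
// because a new consumer can fill the default for old records.
case class UserV2(id: String, name: String, phone: Option[String] = None)

object Compat {
  // A new consumer reading old datasets: no old record is rejected.
  def readOld(old: UserV1): UserV2 = UserV2(old.id, old.name)
  // Forward compatibility is the mirror image: an old consumer
  // simply ignores the field a new producer added.
}
```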
23. www.scling.com
Incompatible changes
● Jobs can accept old input formats
● Translate to new format internally
● Or rerun computations from raw data
● Dedicated job to translate old to new format - upcasting
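A dedicated upcasting job can be sketched like this. The record types and the filled-in default value are hypothetical:

```scala
// Old format, and a new format that is an incompatible change
// (a required field was added).
case class SaleV1(item: String, priceCents: Long)
case class SaleV2(item: String, priceCents: Long, currency: String)

object Upcaster {
  // Upcast a single record by filling in what the old format
  // could not express.
  def upcast(old: SaleV1): SaleV2 = SaleV2(old.item, old.priceCents, currency = "SEK")

  // The upcasting job: translate a whole old-format dataset.
  def upcastDataset(olds: Seq[SaleV1]): Seq[SaleV2] = olds.map(upcast)
}
```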
24. www.scling.com
Language choice, people & preferences
● Java most popular in engineering
+ Good engineering ecosystem
- Boilerplate bad for prototyping & enthusiasm
● Python popular among data scientists
+ Great AI libraries, quick prototypes
- Dynamic typing
● SQL popular among analysts
○ Error handling & data quality difficult
○ Does not compose and scale
● Scala connects both worlds
○ Home of data engineering innovation during the big data hype
● Static typing
● Error handling
● Monitor, debug, profile ecosystem
● Rapid prototyping
● No rituals & boilerplate
● Science innovation
30. www.scling.com
Quasiquotes matching
val stat: Stat = "val a = b() + 3".parse[Stat].get
val stat: Stat = q"val a = b() + 3"

stat match {
  case q"val $name: $typeOpt = $expr" =>
    println(s"Got val declaration $name of type " +
      s"${typeOpt.structure} : ${expr.structure}")
}
31. www.scling.com
● Expressive
● Custom types
● IDE support
● Avro for data lake storage
Schema definition choice
● RDBMS: Table metadata
● Avro: JSON/DSL definition
○ Definition is bundled with avro data files
● Parquet
● pyschema / dataclass
● Scala case classes
● JSON-schema
● JSON: Each record
○ One record insufficient to deduce schema
case class User(id: String, name: String, age: Int,
phone: Option[String] = None)
val users = Seq( User("1", "Alice", 32),
User("2", "Bob", 43, Some("08-123456")))
32. www.scling.com
Schema offspring
[Diagram: from case classes we derive: test equality type classes, test record difference render type classes, Avro definitions, Java Avro codec classes, Java <-> Scala converters, Avro type annotations, MySQL schemas, CSV codecs, privacy-by-design machinery, Python, logical types.]
33. www.scling.com
Avro codecs
case classes → Avro definitions → Java Avro codec classes → Java <-> Scala converters
{
  "type": "record",
  "name": "JavaUser",
  "fields": [
    { "name": "age", "type": "int" },
    { "name": "phone", "type": [ "null", "string" ] }
  ]
}
public class JavaUser implements SpecificRecord {
public Integer getAge() { ... }
public String getPhone() { ... }
}
object UserConverter extends AvroConverter[User] {
def fromSpecific(u: JavaUser): User
def toSpecific(u: User): JavaUser
}
case class User(age: Int,
phone: Option[String] = None)
34. www.scling.com
Quasiquotes in practice
q"""
object $converterName extends AvroConverter[${srcClass.clazz.name}] {
  import RecordFieldConverters._
  type S = $jClassName
  def schema: Schema = $javaClassTerm.getClassSchema()
  def tag: ClassTag[S] = implicitly[ClassTag[S]]
  def datumReader: SpecificDatumReader[S] = new SpecificDatumReader[$jClassName](classOf[$jClassName])
  def datumWriter: SpecificDatumWriter[S] = new SpecificDatumWriter[$jClassName](classOf[$jClassName])
  def fromSpecific(record: $jClassName): ${srcClass.clazz.name} =
    ${Term.Name(srcClass.clazz.name.value)}(..$fromInits)
  def toSpecific(record: ${srcClass.clazz.name}): $jClassName =
    new $jClassName(..$specificArgs)
}
"""
35. www.scling.com
CSV codecs
case classes → CSV codecs
import kantan.csv._
object CsvCodecs {
implicit val userDecoder: HeaderDecoder[User] = ...
}
case class User(age: Int,
phone: Option[String] = None)
36. www.scling.com
Test equality
case classes → test equality type classes, test record difference render type classes
trait REquality[T] { def equal(left: T, right: T): Boolean }

object REquality {
  implicit val double: REquality[Double] = new REquality[Double] {
    def equal(left: Double, right: Double): Boolean = {
      // Use a combination of absolute and relative tolerance
      val tolerance = 1e-5.max(left.abs * 1e-5).max(right.abs * 1e-5)
      (left - right).abs <= tolerance
    }
  }

  /** binds the Magnolia macro to the `gen` method */
  implicit def gen[T]: REquality[T] = macro Magnolia.gen[T]
}

object Equalities {
  implicit val equalityUser: REquality[User] = REquality.gen[User]
}
38. www.scling.com
Logical types
case classes → logical types
case t"""Instant""" =>
  JObject(List(JField("type", JString("long")),
               JField("logicalType", JString("timestamp-micros"))))
case t"""LocalDate""" =>
  JObject(List(JField("type", JString("int")),
               JField("logicalType", JString("date"))))
case t"""YearMonth""" => JObject(List(JField("type", JString("int"))))
case t"""JObject""" => JString("string")
● Avro logical types
○ E.g. date → int, timestamp → long
○ Default is timestamp-millis
■ Great for year > 294441 (!)
● Custom logical types
○ Time
○ Collections
○ Physical
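What the Instant → timestamp-micros mapping above encodes at runtime is an Avro long counting microseconds since the epoch. A small sketch with `java.time` (helper names are ours):

```scala
import java.time.Instant

object LogicalTypeCodec {
  // Encode an Instant as Avro timestamp-micros: microseconds since epoch.
  def toTimestampMicros(i: Instant): Long =
    i.getEpochSecond * 1000000L + i.getNano / 1000L

  // Decode back; sub-microsecond precision is not representable.
  def fromTimestampMicros(micros: Long): Instant =
    Instant.ofEpochSecond(micros / 1000000L, (micros % 1000000L) * 1000L)
}
```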
39. www.scling.com
Stretching the type system
● Failure mode: mixing up kW and kWh
● Could be a compile-time error. Should be.
● Physical dimension libraries
○ Boost.Units - C++
○ coulomb - Scala
41. www.scling.com
● PII fields encrypted
● Per-user decryption key table
● Clearing a single user's key => oblivion
- Extra join + decrypt
- Decryption (user) id needed
+ Multi-field oblivion
+ Single dataset leak → no PII leak
+ Handles transformed PII fields
Lost key pattern
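A minimal sketch of the pattern using the JDK's crypto API. This is our illustration only; a production system would use AES-GCM with per-record IVs and a persistent, access-controlled key table (AES/ECB here just keeps the sketch short).

```scala
import java.util.Base64
import javax.crypto.{Cipher, KeyGenerator, SecretKey}

// Lost key pattern (crypto-shredding): PII fields are encrypted with a
// per-user key held in a key table; deleting that single key makes
// every encrypted copy of the user's PII unreadable.
object LostKey {
  private val keyGen = KeyGenerator.getInstance("AES")
  private var keyTable = Map.empty[String, SecretKey] // decryption (user) id -> key

  private def keyFor(userId: String): SecretKey =
    keyTable.getOrElse(userId, {
      val k = keyGen.generateKey()
      keyTable += userId -> k
      k
    })

  def encryptPii(userId: String, value: String): String = {
    val c = Cipher.getInstance("AES/ECB/PKCS5Padding")
    c.init(Cipher.ENCRYPT_MODE, keyFor(userId))
    Base64.getEncoder.encodeToString(c.doFinal(value.getBytes("UTF-8")))
  }

  // Returns None once the user's key has been cleared
  def decryptPii(userId: String, cipherText: String): Option[String] =
    keyTable.get(userId).map { k =>
      val c = Cipher.getInstance("AES/ECB/PKCS5Padding")
      c.init(Cipher.DECRYPT_MODE, k)
      new String(c.doFinal(Base64.getDecoder.decode(cipherText)), "UTF-8")
    }

  // Oblivion: clearing one key shreds all of that user's PII fields
  def forget(userId: String): Unit = keyTable -= userId
}
```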
42. www.scling.com
Shieldformation
@PrivacyShielded
case class Sale(
@PersonalId customerClubId: Option[String],
@PersonalData storeId: Option[String],
item: Option[String],
timestamp: String
)
case class SaleShielded(
shieldId: Option[String],
customerClubIdEncrypted: Option[String],
storeIdEncrypted: Option[String],
item: Option[String],
timestamp: String
)
case class SaleAnonymous(
item: Option[String],
timestamp: String
)
object SaleAnonymize extends SparkJob {
...
}
object SaleExpose extends SparkJob {
...
}
object SaleShield extends SparkJob {
...
}
case class Shield(
shieldId: String,
personId: Option[String],
keyStr: Option[String],
encounterDate: String
)
43. www.scling.com
Shieldformation & lost key
[Diagram: Sale → SaleShield (joins the Shield key table) → SaleShielded. Deletion requests clear keys in the Shield table. SaleShielded → SaleExpose → exposed egress (Customer History, limited retention). SaleShielded → SaleAnonymize → SaleAnonymous → Sale Stats.]
45. www.scling.com
Shieldformation & locked lake
[Diagram: same flow as the lost key slide, with the Shield key table owned by the target system.]
47. www.scling.com
Success in data factories
● Work driven by use cases
○ Teams aligned along the value chain
● Minimal data innovation friction
○ Data democratised - accessible and usable
○ Quick code / test / debug / deploy feedback cycle
● High value / non-value work ratio
○ Guard rails to maintain speed without risk
■ Dev tooling, tests, quality metrics
○ Minimal operational toil
● Build on software engineering processes
○ Composability
○ DevOps, everything as code
○ Strong CI/CD process
48. www.scling.com
● Work driven by use cases
○ Teams aligned along the value chain
● Minimal data innovation friction
○ Data democratised - accessible and usable
○ Quick code / test / debug / deploy feedback cycle
● High value / non-value work ratio
○ Guard rails to maintain speed without risk
■ Dev tooling, tests, quality metrics
○ Minimal operational toil
● Build on software engineering processes
○ Composability
○ DevOps, everything as code
○ Strong CI/CD process
Success in data factories vs data trends
● Data mesh
● Data contracts
● No code / low code
● SQL / data warehouses
49. www.scling.com
Data factory track record
Sector            | Time to first flow | Staff size | 1st flow effort, weeks | 1st flow cost (w * 50K?) | Time to innovation | Flows 1y after first
------------------+--------------------+------------+------------------------+--------------------------+--------------------+---------------------
Media             | 1+ years           | 10-30      | 1500?                  | 100M (0.5-1B)            | 1+ year            | ?
Finance           | 2 years            | 10-50      | 2000?                  | 100M?                    | Years              | 10?
Media             | 3 weeks            | 4.5-8      | 15                     | 750K                     | 3 months           | 30
Retail            | 7 weeks            | 1-3        | 7                      | 500K *                   | 6 months           | 70
Telecom           | 12 weeks           | 2-5        | 30                     | 1500K                    | 6 months           | 50
Consumer products | 20+ weeks          | 1.5        | 30+                    | 1200+K                   | 6+ months          | 20
Construction      | 8 weeks            | 0.5        | 4                      | 150K *                   | 7 months           | 10
Manufacturing     | 8 weeks            | 0.5        | 4                      | 200K *                   | 6 months           | ?
50. www.scling.com
● Is Shieldformation open source?
○ No. It might be when it is older and less volatile, and we have grown enough to maintain it properly.
● Is cryptoshredding really acceptable as deletion?
○ Yes.
● Is lost key pattern legally sufficient for all use cases?
○ No.
○ It does not provide complete anonymisation, but pseudonymisation with a limited time span (1 month).
○ Unless data is very sensitive, it has been deemed legally sufficient.
○ Be careful with health & geo data.
Q & A?