Jason Timmes, an Associate Vice President of Software Development at Nasdaq, led the migration of the primary data warehouse for Nasdaq's Transaction Services U.S. business unit (which operates Nasdaq's U.S. equity and options exchanges) from a traditional on-premises MPP database to Amazon Redshift. The project significantly reduced operational expenses. Jason describes how his team migrated a warehouse that loads approximately 7 billion rows a day into the cloud, satisfied several security and regulatory audits, optimized read and write performance, ensured high availability, and orchestrated other back-office activities that depend on the warehouse's daily loads completing. Along with sharing several technical lessons learned, Jason will discuss Nasdaq's roadmap for integrating Redshift with more AWS services, as well as with more Nasdaq products, to offer even greater benefit to internal and external clients in the months ahead.
2. We make the world’s capital markets move faster, more efficient, more transparent
Public company in S&P 500
Develop and run markets globally in all asset classes
We provide technology, trading, intelligence and listing services
Intense Operational Focus on Efficiency and Competitiveness
We provide the infrastructure, tools and strategic insight to help our customers navigate the complexity of global capital markets and realize their capital ambitions.
Get to know us
We have uniquely transformed our business from predominately a U.S. equities exchange to a global provider of corporate, trading, technology and information solutions.
3. LEADING INDEX PROVIDER WITH 41,000+ INDEXES ACROSS ASSET CLASSES AND GEOGRAPHIES
Over 10,000 Corporate Clients in 60 countries
Our technology powers over 70 MARKETPLACES, regulators, CSDs and clearinghouses in over 50 COUNTRIES
100+ DATA PRODUCT OFFERINGS supporting 2.5+ million investment professionals and users IN 98 COUNTRIES
26 Markets
3 Clearing Houses
5 Central Securities Depositories
Lists more than 3,500 companies in 35 countries, representing more than $8.8 trillion in total market value
4.
5. Our warehouse is used to analyze market share and client activity, support surveillance, power our billing, and more…
6.
7.
8.
9.
10.
11.
12.
13. •Idempotence: a quality of an action such that repetitions of the action have no further effect on the outcome
–In other words, f(x) = f(f(x)) = f(f(f(x))), etc.
•Ingest process is designed as a workflow engine with each step in each workflow being idempotent.
•Failures are easily recovered by repeating the failed step after resolving the root cause of any failure.
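A minimal sketch of what an idempotent step looks like, contrasted with one that is not. The `Warehouse` type and the `load_day` and `append_day` functions are hypothetical names invented for this illustration; the talk does not show the workflow engine's code.

```python
from typing import Dict, List

# Hypothetical stand-in for a warehouse: load date -> rows loaded for that date.
Warehouse = Dict[str, List[int]]


def load_day(warehouse: Warehouse, day: str, rows: List[int]) -> Warehouse:
    """Idempotent step: replaces the day's rows, so repeating it changes nothing."""
    result = dict(warehouse)
    result[day] = list(rows)
    return result


def append_day(warehouse: Warehouse, day: str, rows: List[int]) -> Warehouse:
    """Non-idempotent step: every repetition appends the same rows again."""
    result = dict(warehouse)
    result[day] = result.get(day, []) + list(rows)
    return result


rows = [1, 2, 3]
once = load_day({}, "2014-09-17", rows)
twice = load_day(once, "2014-09-17", rows)
assert once == twice  # f(x) == f(f(x)): safe to repeat after a failure

appended_twice = append_day(append_day({}, "2014-09-17", rows), "2014-09-17", rows)
assert appended_twice["2014-09-17"] == [1, 2, 3, 1, 2, 3]  # a retry would duplicate data
```

Because repeating `load_day` leaves the result unchanged, an operator can rerun a failed step without first working out how far it got.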
14. •Use a manifest file inside a transaction with a table lock, and keep a record of completed ingests
•If the S3 COPY (insert) fails, roll back the transaction
•If the insert succeeds, write a record of the completed ingest, and commit the transaction
•Idempotence: start transaction, lock destination table, check for prior successful ingest, and only start insert if data hasn’t already been loaded today
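The transaction pattern above can be sketched as follows. This is an illustration under stated assumptions, not Nasdaq's implementation: SQLite stands in for Redshift so the example is runnable, `BEGIN IMMEDIATE` stands in for an explicit lock on the destination table, a plain `INSERT` stands in for the S3 COPY, and the `ingest_once` function and `table_ingest_status` schema are hypothetical.

```python
import sqlite3
from typing import Iterable, Tuple


def ingest_once(conn: sqlite3.Connection, load_date: str,
                rows: Iterable[Tuple[str, float]]) -> bool:
    """Load rows for load_date unless a completed ingest is already recorded.

    Returns True if the rows were loaded, False if the load was skipped.
    """
    cur = conn.cursor()
    # Takes SQLite's write lock; stand-in for locking the destination table.
    cur.execute("BEGIN IMMEDIATE")
    try:
        cur.execute(
            "SELECT 1 FROM table_ingest_status "
            "WHERE table_name = 'nbbo' AND load_date = ?",
            (load_date,),
        )
        if cur.fetchone() is not None:
            cur.execute("ROLLBACK")  # already loaded: do nothing
            return False
        # Stand-in for the S3 COPY that reads the manifest.
        cur.executemany("INSERT INTO nbbo (symbol, price) VALUES (?, ?)", rows)
        # Record the completed ingest in the same transaction as the data.
        cur.execute(
            "INSERT INTO table_ingest_status (table_name, load_date) "
            "VALUES ('nbbo', ?)",
            (load_date,),
        )
        cur.execute("COMMIT")
        return True
    except Exception:
        if conn.in_transaction:
            cur.execute("ROLLBACK")  # failed load leaves no partial data
        raise


# isolation_level=None puts the connection in autocommit mode, so the
# function controls transaction boundaries explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE nbbo (symbol TEXT, price REAL)")
conn.execute(
    "CREATE TABLE table_ingest_status ("
    "table_name TEXT, load_date TEXT, PRIMARY KEY (table_name, load_date))"
)

sample = [("AAPL", 101.5), ("MSFT", 46.2)]
first = ingest_once(conn, "2014-09-17", sample)   # loads the rows
second = ingest_once(conn, "2014-09-17", sample)  # skipped: already recorded
```

Writing the status record and the data in one transaction means the two cannot disagree: either both are committed or neither is.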
15. •Pay close attention to the mandatory flag!
•Redshift UNLOAD always sets this to false!!!
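For context, each entry in a COPY manifest carries a `url` and a `mandatory` flag, and when `mandatory` is true COPY fails if that file is missing. A minimal sketch of forcing the flag on before reusing a manifest follows; the `require_all_entries` helper and the file names are hypothetical, chosen only for illustration.

```python
import json
from typing import Any, Dict


def require_all_entries(manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a COPY manifest with every entry marked mandatory."""
    return {
        **manifest,
        "entries": [
            {**entry, "mandatory": True}
            for entry in manifest.get("entries", [])
        ],
    }


# Hypothetical manifest as it might look with the flag set to false.
unloaded = {
    "entries": [
        {"url": "s3://my_ingest/2014-09-17/nbbo_part_00.gz", "mandatory": False},
        {"url": "s3://my_ingest/2014-09-17/nbbo_part_01.gz", "mandatory": False},
    ]
}

strict = require_all_entries(unloaded)
manifest_text = json.dumps(strict, indent=2)  # content to stage in S3 for COPY
```

With every entry mandatory, a missing load file surfaces as a failed COPY instead of a silently incomplete load.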
16.
17. •TableIngestStatus
–We originally put this table in Redshift itself
–Turns out Redshift is not efficient on really small data sets
–Significantly impacted performance, and increased concurrency contention
•Solution: Moved TableIngestStatus to a separate transactional RDBMS (MySQL)
–We were already using a MySQL instance to persist workflow states
18. •Multiple layers of security
–Direct Connect (private lines)
–VPC
–HTTPS/SSL/TLS (Encryption in flight)
–AES-256 (Encryption at rest in S3)
–Redshift encryption (Encryption at rest in Redshift)
–HSM integration (Redshift master key managed on premises)
–CloudTrail/STL_CONNECTION_LOG to monitor for unauthorized DB connections
19. •Direct Connect
–No company data travels over internet circuits
•VPC
–Isolate our Redshift servers from other tenants/internet connectivity
–Security Groups restrict inbound/outbound connectivity
20. •All AWS API calls are made over HTTPS
•All Redshift JDBC connections must use SSL/TLS
–Parameter Group: require_ssl = true
–Use Redshift cluster SSL certificate to verify cluster identity
•See http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-ssl-support.html for details
21. •All Redshift load files staged in S3 are AES-256 encrypted (client side, not S3 SSE)
–Key is provided to Redshift in the S3 COPY command:
copy nbbo from 's3://my_ingest/2014-09-17/nbbo.manifest'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>;master_symmetric_key=<master_key>'
manifest encrypted gzip;
•Enable cluster encryption on Redshift
–Only specified during cluster creation, cannot be changed
–Applies to backups/snapshots as well
–Performance penalty, but not optional for Nasdaq
22. •Redshift will store the cluster key in a single customer-premises HSM (or CloudHSM)
–SafeNet Luna SA HSM, firmware version should match CloudHSM
–Requires certificate exchange between cluster and HSM
–Requires the cluster to have an EIP
•On our side, required static 1-to-1 NAT of HSM private IP
•VPC Security Groups still apply; can still isolate cluster from others
–Encrypted database key decrypted in HSM, passed over encrypted channel to cluster on startup, stored in memory to decrypt data encryption (block) keys
–If running an HSM HA group, must synchronize keys after creation
23. •HSM integration was critical to Nasdaq adoption
•Monitor cluster access, react to any unauthorized connections
–STL_CONNECTION_LOG
•Query system table on a timed basis, alert to any unexpected access
–CloudTrail to Splunk Redshift connection & user logs
•Captures all API calls, not activity inside Redshift
–STL_DDLTEXT
•Audits all schema changes in the cluster
•In response to an alert, Redshift/HSM connectivity is severed, and cluster is immediately shut down
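The timed check against STL_CONNECTION_LOG described above might look roughly like this. It is a sketch only: the allowlist, the service names, the IP addresses, and the `unexpected_connections` helper are hypothetical, and fetching the rows from the cluster is left out so the example runs without a database.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Query a scheduled job could run; the columns are from STL_CONNECTION_LOG.
# The %s placeholder would be bound to the time of the previous check.
CONNECTION_QUERY = """
SELECT recordtime, username, remotehost, dbname
FROM stl_connection_log
WHERE recordtime > %s
"""

# Hypothetical allowlist of (username, remotehost) pairs expected to connect.
ALLOWED: Set[Tuple[str, str]] = {
    ("ingest_svc", "10.0.1.15"),
    ("report_svc", "10.0.2.20"),
}


def unexpected_connections(rows: Iterable[Dict[str, str]],
                           allowed: Set[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Return the log rows whose (username, remotehost) pair is not allowed."""
    alerts = []
    for row in rows:
        # Strip in case the fixed-width character columns are space padded.
        key = (row["username"].strip(), row["remotehost"].strip())
        if key not in allowed:
            alerts.append(row)
    return alerts


# Sample rows standing in for the query result.
sample_rows = [
    {"recordtime": "2014-09-17 06:00:01", "username": "ingest_svc  ",
     "remotehost": "10.0.1.15   ", "dbname": "warehouse"},
    {"recordtime": "2014-09-17 06:05:42", "username": "unknown_user",
     "remotehost": "203.0.113.9 ", "dbname": "warehouse"},
]

alerts = unexpected_connections(sample_rows, ALLOWED)
```

Any non-empty `alerts` list would then feed whatever response the operations team has defined for unauthorized access.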
24. •With validation, data integrity, and security requirements met, the challenge remains to optimize ingest
•Why?
–Concurrency is a huge performance factor; can’t afford to be loading yesterday’s data when clients are running queries