
Big Data for Startups: How to Build Big Data Applications the Serverless Way

The variety and volume of data created every day keep growing at an ever faster pace, offering a unique opportunity to innovate and launch new startups.

However, managing large amounts of data can seem complex: building large-scale Big Data clusters looks like an investment that only established companies can afford. The elasticity of the Cloud and, in particular, Serverless services let us break through these limits.

Let's see how to develop Big Data applications quickly, without worrying about infrastructure, devoting all our resources to turning our ideas into innovative products.

  1. Big Data for Startups: How to Build Big Data Applications the Serverless Way. Fausto Palma, AWS Solution Architect. © 2020, Amazon Web Services, Inc. or its Affiliates.
  2. Use cases for Big Data tools
     Fits in standard DBs: structured data; no excessive load spikes (disk space, RAM, or CPU over time).
     Calls for Big Data tools:
       Variety: different data formats (tabular, nested, images, video).
       Velocity: streaming, real-time analysis.
       Volume: large amounts of data that do not fit the available resources.
  3. Use cases for Big Data tools
     Data lake: open formats; central catalog; data collected when available, even in raw format.
     Examples: recommendation systems, text mining, supply chain flow optimization, social network analysis, anomaly detection, sentiment analysis, customer churn prevention, …
  4. Analytics overall architecture (Data Lake)
     Stages: Data movement, Storage, Analytics, Data value.
     Cross-cutting: Catalog; Management | Security.
  5. AWS services, by architecture layer
     Management | Security: AWS Lake Formation, AWS Key Management Service, AWS Identity & Access Management, Amazon Macie, …
     Data movement: Managed Streaming for Apache Kafka, Amazon Kinesis Video Streams, Kinesis Data Streams, Kinesis Data Firehose.
     Storage: S3, Glacier.
     Analytics: Redshift, EMR (Spark & Hadoop), Athena, Elasticsearch Service, Kinesis Data Analytics, AWS Glue (Spark & Python).
     Data value: QuickSight, SageMaker, Comprehend, Rekognition, Translate, Pinpoint, …
     Catalog: AWS Glue Data Catalog.
  6. General pattern to scale
     Inputs: data, messages, streams, …
     Steps: sharding, mapping (parallel processing), shuffling, reduction (aggregation).
     Result: outputs.
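
The pattern on this slide can be sketched in plain Python as a word count. This is only an illustration under stated assumptions: the function names (`shard`, `map_shard`, `shuffle`, `reduce_bucket`, `word_count`) and the shard and reducer counts are hypothetical, and everything runs in a single process, whereas a real engine would run each shard and each reducer on a separate worker.

```python
from collections import defaultdict


def shard(records, num_shards):
    """Sharding: split the input into roughly equal chunks."""
    return [records[i::num_shards] for i in range(num_shards)]


def map_shard(chunk):
    """Mapping: turn each record of one shard into (key, value) pairs."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_shards, num_reducers):
    """Shuffling: route every pair with the same key to the same reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_shards:
        for key, value in pairs:
            buckets[hash(key) % num_reducers][key].append(value)
    return buckets


def reduce_bucket(bucket):
    """Reduction: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in bucket.items()}


def word_count(lines, num_shards=3, num_reducers=2):
    mapped = [map_shard(chunk) for chunk in shard(lines, num_shards)]
    totals = {}
    for bucket in shuffle(mapped, num_reducers):
        # Keys never overlap between buckets, so a plain update is safe.
        totals.update(reduce_bucket(bucket))
    return totals
```

Calling `word_count(["big data", "serverless big data", "data lake"])` returns a dictionary of per-word totals. The only step that needs data from more than one shard is `shuffle`, which is why shuffling tends to be the expensive part of the pattern.
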
  7. Most secure infrastructure: certifications (https://aws.amazon.com/compliance/programs/)
     Global: CSA (Cloud Security Alliance Controls); ISO 9001 (Global Quality Standard); ISO 27001 (Security Management Controls); ISO 27017 (Cloud Specific Controls); ISO 27018 (Personal Data Protection); PCI DSS Level 1 (Payment Card Standards); SOC 1 (Audit Controls Report); SOC 2 (Security, Availability, & Confidentiality Report); SOC 3 (General Controls Report).
     United States: CJIS (Criminal Justice Information Services); DoD SRG (DoD Data Processing); FedRAMP (Government Data Standards); FERPA (Educational Privacy Act); FIPS (Government Security Standards); FISMA (Federal Information Security Management); GxP (Quality Guidelines and Regulations); FFIEC (Financial Institutions Regulation); HIPAA (Protected Health Information); ITAR (International Arms Regulations); MPAA (Protected Media Content); NIST (National Institute of Standards and Technology); SEC Rule 17a-4(f) (Financial Data Standards); VPAT/Section 508 (Accountability Standards).
     Asia Pacific: FISC [Japan] (Financial Industry Information Systems); IRAP [Australia] (Australian Security Standards); K-ISMS [Korea] (Korean Information Security); MTCS Tier 3 [Singapore] (Multi-Tier Cloud Security Standard); My Number Act [Japan] (Personal Information Protection).
     Europe: C5 [Germany] (Operational Security Attestation); Cyber Essentials Plus [UK] (Cyber Threat Protection); G-Cloud [UK] (UK Government Standards); IT-Grundschutz [Germany] (Baseline Protection Methodology).
  9. Amazon Simple Storage Service (S3)
     § Built to store any amount of data
     § Runs on the world's largest global cloud infrastructure
     § Designed to deliver 99.999999999% durability
     § Geographic redundancy & automatic replication
     § Tiered storage to optimize price/performance
     Diagram labels: S3, AZ, AZ, AZ, Transit, Transit.
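
To get a feel for what eleven nines means, here is a small worked calculation. It assumes the durability figure is read as the probability that a single object survives one year, and that losses are independent; the function names are hypothetical. It is back-of-the-envelope arithmetic, not a statement about how the service is engineered.

```python
from fractions import Fraction

# 99.999999999% expressed exactly, to avoid floating-point noise.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count):
    """Expected number of objects lost per year (assumption: independent losses)."""
    return object_count * (1 - DURABILITY)


def years_per_expected_loss(object_count):
    """Average number of years before one object is expected to be lost."""
    return 1 / expected_annual_losses(object_count)
```

Under these assumptions, storing 10,000,000 objects gives an expected loss of one object roughly every 10,000 years (`years_per_expected_loss(10_000_000)`).
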
  10. Process data in place: data stays in Amazon S3 and is processed by Amazon Athena, Amazon Redshift Spectrum, Amazon SageMaker, and AWS Glue.
  11. Amazon S3 Select SQL
      Input format: delimited text (CSV, TSV), JSON, Parquet, …
      Compression: GZIP, BZIP2, …
      Output format: delimited text (CSV, TSV), JSON, …
      Clauses: Select, From, Where.
      Data types: String; Integer, Float, Decimal; Timestamp; Boolean.
      Operators: Conditional; Math; Logical; String (Like, ||).
      Functions: String; Cast; Math; Aggregate.
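
As a rough illustration of how these pieces fit together, the sketch below builds the parameters for an S3 Select request over a GZIP-compressed CSV file with a header row, returning JSON. The helper name `build_select_request`, the bucket, the object key, and the column names are all hypothetical; the parameter names follow boto3's `select_object_content` call, which is shown only in a comment so the block runs without AWS access.

```python
def build_select_request(bucket, key, status):
    """Build keyword arguments for an S3 Select query (hypothetical columns)."""
    # Double any single quote so the value stays inside the SQL string literal.
    safe_status = status.replace("'", "''")
    expression = (
        "SELECT s.sensorId, s.currentTemperature "
        "FROM S3Object s "
        f"WHERE s.status = '{safe_status}'"
    )
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Input: CSV with a header row, compressed with GZIP.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        # Output: one JSON document per matching row.
        "OutputSerialization": {"JSON": {}},
    }


# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   params = build_select_request("example-bucket", "readings.csv.gz", "FAIL")
#   response = boto3.client("s3").select_object_content(**params)
```

Only the Select, From, and Where clauses are used, matching the clause list on the slide.
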
  12. S3: how to access
      AWS S3 console: https://s3.console.aws.amazon.com/s3/
      AWS S3 API documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/API_Operations.html
      AWS S3 CLI: https://docs.aws.amazon.com/cli/latest/reference/s3/#available-commands
  14. Kinesis Data Firehose: how it works
      Ingest from: AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Managed Streaming for Apache Kafka.
      Transform: Lambda function.
      Deliver to: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service.
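
The transform step can be sketched as a Lambda handler. The record envelope (`recordId`, base64-encoded `data`, and a `result` of `Ok`, `Dropped`, or `ProcessingFailed`) follows the Firehose data-transformation contract; the added `processedBy` field and the trailing newline delimiter are assumptions made for illustration, not something shown in the deck.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform a batch of Firehose records (illustrative sketch)."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Hand the record back untouched and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue

        # Hypothetical enrichment: tag the record with the transformer's name.
        payload["processedBy"] = "transform-lambda"
        # A newline delimiter keeps one JSON document per line in the output.
        encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": encoded.decode("ascii"),
        })
    return {"records": output}
```

Every incoming `recordId` appears exactly once in the response, which is what lets the delivery stream match transformed records back to the originals.
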
  15. Kinesis Data Firehose: how to access
      AWS Kinesis Data Firehose console: https://eu-west-1.console.aws.amazon.com/kinesis/
      AWS Kinesis Data Firehose API documentation: https://docs.aws.amazon.com/firehose/latest/APIReference/API_Operations.html
      AWS Kinesis Data Firehose CLI: https://docs.aws.amazon.com/cli/latest/reference/firehose/index.html#available-commands
  16. Simple demo: an App sends data to Amazon Kinesis Data Firehose (data movement), which delivers it to Amazon Simple Storage Service (S3) (storage).
  17. Amazon Kinesis Data Generator (KDG)
      https://awslabs.github.io/amazon-kinesis-data-generator/web/help.html
      Record template:
        {
          "sensorId": {{random.number(50)}},
          "currentTemperature": {{random.number( { "min":15, "max":38 } )}},
          "status": "{{random.weightedArrayElement( { "weights": [0.9,0.03,0.07], "data": ["OK","FAIL","WARN"] } )}}"
        }
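
For local testing without the KDG web tool, records with the same shape can be approximated in plain Python. This is a sketch, not the KDG implementation: the function names are hypothetical, and it assumes `random.number(50)` yields an integer from 0 to 50 inclusive.

```python
import json
import random


def make_record(rng):
    """Build one sensor reading shaped like the KDG template."""
    return {
        # Assumption: the template's upper bound of 50 is inclusive.
        "sensorId": rng.randint(0, 50),
        "currentTemperature": rng.randint(15, 38),
        # Weighted pick: about 90% OK, 3% FAIL, 7% WARN.
        "status": rng.choices(
            ["OK", "FAIL", "WARN"], weights=[0.9, 0.03, 0.07], k=1
        )[0],
    }


def make_batch(count, seed=None):
    """Return `count` records serialized as JSON strings, one per record."""
    rng = random.Random(seed)
    return [json.dumps(make_record(rng)) for _ in range(count)]
```

Passing a fixed `seed` makes a batch reproducible, which helps when comparing query results across runs.
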
  19. Glue Catalog and Crawlers: the AWS Glue Crawler scans the data lake on S3 and populates the AWS Glue Data Catalog, which serves as a Hive metastore service for EMR, Athena, and AWS Glue jobs.
  20. Glue Catalog console
  21. Glue Crawlers console
  23. Athena console (https://eu-west-1.console.aws.amazon.com/athena/): select the catalog, select the database, then write the query against data in S3.
  24. Presto architecture
      Client: submits the query.
      Coordinator: parsing, planning, scheduling; consults the metastore.
      Workers: execute the scheduled work and reach the data locations through connectors.
      Example query:
        SELECT sport,
               count(distinct location) as locations,
               count(distinct event_id) as events,
               count(*) as tickets,
               avg(ticket_price) as avg_ticket_price
        FROM sporting_event_ticket_info
        GROUP BY 1
        ORDER BY 1;
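
To try the example query without an AWS account, it can be run locally against an in-memory SQLite table. SQLite is only a stand-in here, not Presto or Athena, and the table schema and sample rows are invented for illustration; only the query text comes from the slide.

```python
import sqlite3

QUERY = """
SELECT sport,
       count(DISTINCT location) AS locations,
       count(DISTINCT event_id) AS events,
       count(*) AS tickets,
       avg(ticket_price) AS avg_ticket_price
FROM sporting_event_ticket_info
GROUP BY 1
ORDER BY 1;
"""

# Hypothetical sample data: (sport, location, event_id, ticket_price).
SAMPLE_ROWS = [
    ("baseball", "Boston", 1, 40.0),
    ("baseball", "Boston", 1, 60.0),
    ("baseball", "Chicago", 2, 50.0),
    ("football", "Dallas", 3, 100.0),
]


def run_demo(rows=SAMPLE_ROWS):
    """Load the rows into an in-memory table and run the slide's query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sporting_event_ticket_info "
            "(sport TEXT, location TEXT, event_id INTEGER, ticket_price REAL)"
        )
        conn.executemany(
            "INSERT INTO sporting_event_ticket_info VALUES (?, ?, ?, ?)", rows
        )
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()
```

`GROUP BY 1` and `ORDER BY 1` refer to the first selected column, `sport`, so the result has one row per sport in alphabetical order.
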
  25. Row vs columnar file orientation: tabular data, laid out as a file in storage or as a stream.
  27. Row vs columnar file orientation: nested data, laid out as a file in storage or as a stream.
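
The difference between the two orientations can be sketched with plain Python structures. The sample records and function names are hypothetical, and real columnar formats add encoding, statistics, and row groups on top of this basic idea.

```python
from itertools import groupby

# Row orientation: each record keeps all of its fields together.
ROWS = [
    {"sensorId": 1, "currentTemperature": 21, "status": "OK"},
    {"sensorId": 2, "currentTemperature": 30, "status": "OK"},
    {"sensorId": 3, "currentTemperature": 27, "status": "OK"},
    {"sensorId": 4, "currentTemperature": 38, "status": "FAIL"},
]


def to_columnar(rows):
    """Column orientation: all values of one field are stored together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def average_from_rows(rows, field):
    """Row layout: every whole record is visited to read a single field."""
    return sum(row[field] for row in rows) / len(rows)


def average_from_columns(columns, field):
    """Column layout: only the requested column is read."""
    values = columns[field]
    return sum(values) / len(values)


def run_length_encode(values):
    """Similar neighbouring values compress well once they sit side by side."""
    return [(value, len(list(group))) for value, group in groupby(values)]
```

Both averaging functions return the same number; the point is how much data each one has to touch. Applying `run_length_encode` to the `status` column shows why columnar files often compress better than row files.
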
  28. Different file formats: Avro, Parquet, ORC (Optimized Row Columnar)
      Criteria rated with stars: compression, schema evolution, splittability, nested fields support.
      Row vs column: Avro is row-oriented; Parquet and ORC are column-oriented.
      Best for: schema evolution (Avro); compression and nested fields (the two columnar formats).
  30. Glue jobs console: https://aws.amazon.com/blogs/big-data/making-etl-easier-with-aws-glue-studio/
  31. RDD data structure
      RDD: Resilient Distributed Datasets.
      § Datasets: a collection of objects that may be organized in key-object pairs (Key 1/Object 1, Key 2/Object 2, …, Key n/Object n).
      § Distributed: spread over multiple nodes to take advantage of parallel processing.
      § Resilient: in case of failures, the DAG (directed acyclic graph) execution is run again.
  32. Narrow transformations (no shuffling among partitions across worker nodes): map(), flatMap(), mapPartitions(), filter(), sample(), union()
  33. Wide transformations (shuffling among partitions across worker nodes): intersection(), distinct(), reduceByKey(), groupByKey(), join(), cartesian(), repartition(), coalesce()
  34. Spark operations
      Narrow transformations: map(), flatMap(), mapPartitions(), filter(), sample(), union()
      Wide transformations: intersection(), distinct(), reduceByKey(), groupByKey(), join(), cartesian(), repartition(), coalesce()
      Actions: count(), collect(), take(), top(), countByValue(), reduce(), fold(), aggregate(), foreach()
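
The split between transformations and actions can be illustrated with a toy class: transformations only record a step, and nothing is computed until an action runs. `LazyDataset` is a hypothetical, single-process stand-in that covers only two narrow transformations; it is not how Spark is implemented.

```python
from itertools import islice


class LazyDataset:
    """Toy dataset: transformations are recorded, actions execute them."""

    def __init__(self, source, steps=()):
        self._source = source      # any re-iterable collection
        self._steps = tuple(steps)  # recorded (kind, function) pairs

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        """Apply the recorded steps to each element, one at a time."""
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger execution and return a concrete result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))
```

For example, `LazyDataset(range(10)).map(lambda x: x * 2).filter(lambda x: x % 4 == 0)` does no work by itself; calling `collect()`, `count()`, or `take(2)` on it does.
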
  35. Spark basic job execution process
      Submit: spark-submit mycode.py ...
      Driver: runs the application code:
        spark = SparkSession...
        spark.sparkContext
        rdd_1 = spark.read...
        rdd_2 = spark.read...
        rdd_3 = rdd_1.filter(...)
        rdd_4 = rdd_2.filter(...)
        rdd_5 = rdd_3.join(rdd_4)
        rdd_6 = rdd_5.filter(...)
        output = rdd_6.count(...)
      DAG Scheduler: builds the DAG, splits it into stages and tasks, and signals the Task Scheduler.
      Cluster Manager: allocates worker nodes and starts executors.
      Task Scheduler: places tasks on executors.
      In the diagram, the job is split into stage_1, stage_2, and stage_3, and each task runs on an executor against one piece of an RDD (rdd_x).
  36. Additional features in Glue jobs (focus on PySpark)
      PySpark transforms: GlueTransform, ApplyMapping, DropFields, DropNullFields, ErrorsAsDynamicFrame, Filter, FlatMap, Join, Map, MapToCollection, Relationalize, RenameField, ResolveChoice, SelectFields, SelectFromCollection, Spigot, SplitFields, SplitRows, Unbox, UnnestFrame
        https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-transforms.html
      AWS Glue PySpark extensions: getResolvedOptions, Types, DynamicFrame, DynamicFrameCollection, DynamicFrameWriter, DynamicFrameReader, GlueContext
        https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-extensions.html
      Data abstractions: RDD, DataFrame, and DataSet in Spark; DynamicFrame in Glue.
  37. Demo custom script
  38. Movie similarity, step 1: join the ratings with themselves on user
      Ratings (columns: movie, user, rating):
        A: user 1 ★★★★, user 2 ★, user 3 ★★★
        B: user 1 ★, user 2 ★★★★, user 3 ★
        C: user 1 ★★★, user 2 ★, user 3 ★★★★
        D: user 1 ★★, user 2 ★★★★, user 3 ★★
      movie_pairs = movies.join(movies, on=user)
      Joined columns: movieX, userX, ratingX, movieY, userY, ratingY.
      Movie pairs produced: A-B, A-C, A-D, B-C, B-D, C-D.
  39. Movie similarity, step 2: group by movie pair
      movie_pairs = movie_pairs.groupBy((movieX,movieY))
      movieX movieY  ratingX (users 1, 2, 3)  ratingY (users 1, 2, 3)
      A      B       ★★★★, ★, ★★★             ★, ★★★★, ★
      A      C       ★★★★, ★, ★★★             ★★★, ★, ★★★★
      A      D       ★★★★, ★, ★★★             ★★, ★★★★, ★★
      B      C       ★, ★★★★, ★               ★★★, ★, ★★★★
      B      D       ★, ★★★★, ★               ★★, ★★★★, ★★
      C      D       ★★★, ★, ★★★★             ★★, ★★★★, ★★
  40. Movie similarity, step 3: compute the similarity of each pair
      movie_pairs = movie_pairs.groupBy((movieX,movieY))
      similarity = movie_pairs.mapValues(cosine_similarity)
      movieX movieY  similarity
      A      B       ≠
      A      C       =
      A      D       ≠
      B      C       ≠
      B      D       =
      C      D       ≠
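
The three steps on these slides (self-join on user, group by movie pair, cosine similarity) can be reproduced in plain Python using the star ratings from the example. This is a single-machine sketch, not the demo's Glue script: the function names are hypothetical, and the 0.9 cut-off used to call two movies similar is an assumption.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# (movie, user, rating) triples taken from the star ratings on the slides.
RATINGS = [
    ("A", 1, 4), ("A", 2, 1), ("A", 3, 3),
    ("B", 1, 1), ("B", 2, 4), ("B", 3, 1),
    ("C", 1, 3), ("C", 2, 1), ("C", 3, 4),
    ("D", 1, 2), ("D", 2, 4), ("D", 3, 2),
]


def pair_ratings(ratings):
    """Self-join on user, then group the rating pairs by (movieX, movieY)."""
    by_user = defaultdict(list)
    for movie, user, rating in ratings:
        by_user[user].append((movie, rating))
    pairs = defaultdict(list)
    for rated in by_user.values():
        # Sorting keeps each pair in a single orientation, e.g. (A, B) not (B, A).
        for (movie_x, r_x), (movie_y, r_y) in combinations(sorted(rated), 2):
            pairs[(movie_x, movie_y)].append((r_x, r_y))
    return pairs


def cosine_similarity(rating_pairs):
    """Cosine of the angle between the two movies' rating vectors."""
    dot = sum(x * y for x, y in rating_pairs)
    norm_x = sqrt(sum(x * x for x, _ in rating_pairs))
    norm_y = sqrt(sum(y * y for _, y in rating_pairs))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def movie_similarities(ratings):
    return {pair: cosine_similarity(v) for pair, v in pair_ratings(ratings).items()}


def similar_pairs(ratings, threshold=0.9):
    """Pairs whose similarity exceeds the (assumed) threshold."""
    return {p for p, s in movie_similarities(ratings).items() if s > threshold}
```

With these ratings, `similar_pairs(RATINGS)` selects A-C and B-D, the two pairs the slide marks with "=".
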
  42. QuickSight console
  43. AWS Training & Certification
      https://www.aws.training: free on-demand courses to help you build new cloud skills
      E-Learning: Data Analytics Fundamentals: https://www.aws.training/Details/eLearning?id=35364
      E-Learning: AWS Hadoop Fundamentals: https://www.aws.training/Details/eLearning?id=40337
      Learning Path: Internet of Things Foundation Series: https://www.aws.training/Details/Curriculum?id=27289
      Video: Serverless Analytics: https://www.aws.training/Details/Video?id=26848
      Available AWS Certifications
  44. Thanks!
