AWS Lambda

  • Runs small pieces of stateless code, often used to glue different services together
    • Real-time file / stream processing
    • ETL
    • Cron jobs (using time as the trigger)
    • Processing AWS events (a minimal handler sketch follows this list)
  • Supported languages
    • Node.js
    • Python
    • Java
    • C#
    • Go
    • PowerShell
    • Ruby
  • Many AWS services can trigger Lambda
    • S3
    • Kinesis (Lambda pulls the data from Kinesis in batches)
    • DynamoDB
    • SNS
    • SQS
    • IoT
    • etc.
  • Many AWS services can act as the destination
    • Elasticsearch
    • Data Pipeline
    • Redshift
    • etc.
  • Can use DynamoDB or S3 to store stateful data
  • Effectively unlimited scalability (soft limit of 1,000 concurrent executions)
  • Maximum configurable timeout is 900 seconds
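
As a concrete example of the glue pattern above, here is a minimal sketch of a Python handler for an S3 trigger. The event shape is the standard S3 notification; the print-based "processing" is illustrative.

    # Minimal sketch: Lambda handler invoked by an S3 event notification
    import urllib.parse

    def lambda_handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            # Object keys in S3 event notifications are URL-encoded
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            print(f"Processing s3://{bucket}/{key}")
        return {"status": "ok"}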

Kinesis Data Streams + Lambda

  • Lambda can receive an event with a batch of stream records (see the handler sketch after this list)
    • Up to 10,000 records in a batch
    • Up to 6 MB per invocation (larger batches are split automatically)
  • Lambda has a 900-second timeout limit
    • Lambda keeps retrying until it succeeds or the data expires
    • Retries still consume shard throughput, so:
      • Avoid errors
      • Use a smaller batch size or a longer timeout
      • Provision more shards
  • Lambda processes shard data synchronously
    • Shards are not released while their data is being processed by Lambda
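
A minimal sketch of a handler for a Kinesis batch, assuming JSON payloads; record data arrives base64-encoded, and raising an exception would make Lambda retry the whole batch until it succeeds or the data expires.

    # Minimal sketch: Lambda handler receiving a batch of Kinesis records
    import base64
    import json

    def lambda_handler(event, context):
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"])
            item = json.loads(payload)  # assumes JSON payloads
            print(item)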

Anti-patterns

  • Long-running applications
    • Max run time is 900 s; use other services like Elastic Beanstalk instead
  • Dynamic websites
    • Too many queries; use servers like EC2 instead
  • Stateful applications
    • Lambda is stateless; use DynamoDB to store stateful data, or use stateful services

AWS Glue

  • Discovers table definitions and schemas from
    • Redshift
    • S3
    • RDS
    • Other RDBs
  • Runs ETL jobs
    • Trigger-driven
    • On a schedule
    • On demand (see the sketch after this list)
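
For the on-demand case, a job can be started from code with boto3; "my-etl-job" is a hypothetical job name.

    # Sketch: starting a Glue ETL job on demand
    import boto3

    glue = boto3.client("glue")
    run = glue.start_job_run(JobName="my-etl-job")  # hypothetical job name
    print(run["JobRunId"])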

Glue Crawler

  • Scans data in S3 and populates the Glue Data Catalog
    • Infers the schema automatically

Glue Data Catalog

  • Central metadata repository
    • Stores only the table definitions; the data itself is not moved
    • Can be used by Athena, Redshift, EMR, QuickSight, etc.
  • S3 Partitions
    • The Glue crawler extracts partitions based on how the S3 data is organized (see the sketch after this list)
    • Organize your S3 key structure around how you will most often query the data
  • EMR Hive
    • Glue Data Catalog can provide metadata information to Hive
    • You can import a Hive metastore into Glue
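
A sketch of inspecting crawler-extracted partitions with boto3; the database ("my_db"), table ("logs"), and the partitioned S3 layout in the comment are hypothetical.

    # Sketch: listing partitions the crawler derived from a layout like
    #   s3://my-bucket/logs/year=2024/month=01/day=15/...
    import boto3

    glue = boto3.client("glue")
    resp = glue.get_partitions(DatabaseName="my_db", TableName="logs")
    for part in resp["Partitions"]:
        print(part["Values"], part["StorageDescriptor"]["Location"])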

Glue ETL

  • Automatic code generation
    • Scala or Python (a Python skeleton follows this list)
  • Encryption at rest and in transit
  • Can be event-driven
  • Can provision additional Data Processing Units (DPU) to increase performance
  • Monitor through CloudWatch
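
A skeleton in the shape of the code Glue generates, assuming it runs inside a Glue job (the awsglue library is only available there); the database, table, and output path are hypothetical.

    # Sketch: read from the Data Catalog, write Parquet back to S3
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_db", table_name="logs")      # hypothetical names
    glue_context.write_dynamic_frame.from_options(
        frame=dyf, connection_type="s3",
        connection_options={"path": "s3://my-bucket/output/"},
        format="parquet")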

Anti-patterns

  • Streaming data
    • Glue is batch-oriented, with a minimum 5-minute interval between jobs
    • Use Kinesis instead, staging the data temporarily in S3 or Redshift
  • Using ETL engines other than Spark
    • Glue ETL is implemented in Spark
    • Use Data Pipeline or EMR
  • NoSQL databases
    • Imposing a schema makes no sense for NoSQL databases

Amazon EMR

  • Elastic MapReduce
  • Managed Hadoop framework on EC2 instances
  • Tools
    • Spark
    • HBase
    • Presto
    • Flink
    • Hive
    • etc.
  • EMR Notebooks
    • Query data interactively using Python
  • Integration with other AWS services

EMR Cluster

  • Master node (leader node)
    • One master node in a cluster
    • Coordinates the distribution of data and tasks
  • Core node
    • Hosts HDFS data and runs tasks
    • Can be scaled up & down, but scaling down may lose data
      • Because core nodes store HDFS data themselves
  • Task node
    • Runs tasks but does not host data
    • A good fit for spot instances

EMR Usage

  • Cluster types
    • Transient cluster
    • Long-running cluster
  • Run jobs
    • Connect directly to the master node to run jobs
    • Submit ordered steps via the console (a programmatic sketch follows this list)
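
Steps can also be submitted programmatically; a sketch with boto3, where the cluster ID and script location are hypothetical.

    # Sketch: submitting an ordered step to a running EMR cluster
    import boto3

    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXX",               # hypothetical cluster ID
        Steps=[{
            "Name": "my-spark-step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/job.py"],
            },
        }])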

AWS Integration

  • EC2 (different types)
  • VPC
  • S3
  • CloudWatch
  • IAM
  • CloudTrail
  • Data Pipeline (schedule and start clusters)

EMR Storage

  • HDFS
    • EBS
      • Can only be attached when launching the cluster
      • Detaching an EBS volume from a running cluster is treated as a failure
    • Ephemeral
    • Resilient and durable (data spans multiple instances)
    • Data block (chunk) size is 128 MB
  • EMRFS
    • S3
    • Stores input and output data
    • Consistency
      • EMRFS Consistent View
      • Uses DynamoDB to track consistency

EMR Promises

  • Provisions new nodes if core nodes fail
  • Can add and remove task nodes on the fly
  • Can resize running core nodes

Hadoop

  • The ecosystem as a stack, top layer to bottom:
    • Zeppelin / EMR Notebook
    • Hive / Pig (old) / Presto
    • MapReduce (old) / Spark / Tez
    • YARN
    • HDFS (HBase runs on top of HDFS)

Spark

  • Distributed processing system
  • Uses in-memory caching
  • Uses directed acyclic graphs (DAGs) to optimize query execution
  • Example use cases
    • Stream processing through Spark Streaming (from Amazon Kinesis / Apache Kafka / etc.)
    • Streaming analytics, writing the results to HDFS or EMRFS (S3)
    • Machine learning using MLlib
    • Interactive SQL using Spark SQL (see the sketch after this list)
  • Cannot do real-time work like OLTP
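
A minimal sketch of the interactive-SQL use case in PySpark; the input path and column names are hypothetical.

    # Sketch: register a DataFrame as a view and query it with Spark SQL
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").getOrCreate()
    df = spark.read.json("s3://my-bucket/events/")  # hypothetical path
    df.createOrReplaceTempView("events")
    spark.sql("SELECT user, COUNT(*) AS n FROM events GROUP BY user").show()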

Spark structure

  • Driver program / driver script
    • Spark Context
    • Job code
  • Cluster Manager
    • Spark / YARN
    • AWS EMR uses YARN
  • Executors
    • Cache
    • Tasks

Spark Components

  • Spark Streaming
  • Spark SQL
  • MLlib
  • GraphX

Spark Structured Streaming

  • Structured Streaming models the stream as an unbounded table, appending new data to its tail (see the sketch after this list)
  • Can use a library built on the Kinesis Client Library (KCL) to handle data from Kinesis Data Streams
  • Can do large-scale ETL against Redshift using Spark
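
A minimal Structured Streaming sketch using a socket source to stay self-contained (reading from Kinesis would need a separate connector); new records are appended to the unbounded "lines" table as they arrive.

    # Sketch: streaming word count over an ever-growing input table
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = (counts.writeStream.outputMode("complete")
             .format("console").start())
    query.awaitTermination()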

Hive

  • Runs SQL code (HiveQL) on unstructured data that lives in Hadoop (HDFS), YARN, or S3
  • Easier and faster than writing MapReduce or even Spark code for simple queries
  • Highly optimized
  • Highly extensible
    • External applications can communicate with Hive using JDBC or ODBC

Hive Metastore

  • Structure definition on the unstructured data that is stored on HDFS, EMRFS, etc.
  • Metastore is stored in MySQL on the master node by default
    • Ephemeral
  • External Hive Metastore
    • AWS Glue Data Catalog
      • Can be used by Redshift, Athena, etc.
    • Amazon RDS

Hive - AWS Integration

  • Load data from S3 with table partitions (see the sketch after this list)
  • Directly write data (tables) to S3
  • Load scripts from S3
  • Use DynamoDB as an external table
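
A sketch of the S3-partition case, issued from Python via PyHive (one client option; JDBC/ODBC tools work the same way). The host, table, and S3 path are hypothetical.

    # Sketch: external Hive table over partitioned S3 data
    from pyhive import hive

    conn = hive.connect(host="emr-master-dns", port=10000)  # hypothetical host
    cur = conn.cursor()
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS logs (msg STRING)
        PARTITIONED BY (dt STRING)
        LOCATION 's3://my-bucket/logs/'
    """)
    cur.execute("MSCK REPAIR TABLE logs")  # discover dt=... partitions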

Apache Pig

Pig Latin

  • A scripting language
  • Use SQL-like syntax to define your map & reduce steps
  • Highly extensible with user-defined functions (UDF)

Pig - AWS Integration

  • Query data in EMRFS (S3)
  • Load JARs or scripts from S3

HBase

  • Non-relational, petabyte-scale database
  • Sits on top of HDFS
  • Operates largely in memory across the entire cluster
  • Integration with Hive

HBase vs. DynamoDB

  • HBase
    • Efficient with sparse data
    • Appropriate for highly consistent reads & writes
    • Higher write & update throughput
    • More integration with Hadoop
  • DynamoDB
    • Fully managed, auto-scaling
    • More integration with other AWS services (like Glue)

HBase - AWS Integration

  • Can be built on top of EMRFS (S3)
  • Can back up to S3

Presto

  • Issues SQL-style queries
  • Can connect to different big data databases / data stores at once
  • Can issue interactive queries at petabyte scale
  • Optimized for OLAP (not for OLTP)
  • Amazon Athena uses Presto under the hood
  • Can be used through JDBC, CLI, or Tableau interfaces (see the sketch after this list)
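
A sketch of an interactive query from Python using PyHive's Presto client (the CLI, JDBC, or Tableau would work similarly); the host and table names are hypothetical, and 8889 is EMR's usual Presto port.

    # Sketch: issuing an interactive Presto query
    from pyhive import presto

    conn = presto.connect(host="emr-master-dns", port=8889, catalog="hive")
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM my_db.logs")
    print(cur.fetchone())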

Presto Data Sources

  • HDFS
  • S3 / EMRFS
  • Cassandra
  • MongoDB
  • HBase
  • RDBs
  • Redshift
  • Teradata

Zeppelin

  • Notebook
    • Can interactively run code against data
    • Can interleave code with nicely formatted notes
    • Can share notebooks with others on the cluster
  • Integration
    • Spark
    • Python
    • JDBC
    • HBase
    • Elasticsearch
    • etc.

Zeppelin with Spark

  • Zeppelin can run Spark code interactively
    • Speeds up your development cycle
    • Makes experimentation and exploration of your big data easier
  • Can execute SQL queries using SparkSQL
  • Query results can be visualized in charts and graphs

EMR Notebook

  • Notebook
    • More AWS integration than Zeppelin
  • Automatically backed up to S3
  • Can provision or shut down clusters directly from the notebook
  • Hosted inside a VPC and can be accessed via AWS console

Hue

  • Hadoop User Experience
  • Graphical front-end for applications on your EMR cluster
  • AWS integration
    • IAM: Hue users inherit IAM roles
    • S3: can use Hue to browse & move data between HDFS and S3

Splunk

  • An operational tool that visualizes EMR and S3 data using your EMR Hadoop cluster

Flume

  • A tool to stream data into your cluster

MXNet

  • A framework for building and accelerating neural networks

S3DistCP

  • Tool for copying large amounts of data between S3 and HDFS
  • Copies in a distributed manner (see the sketch after this list)
    • Suitable for parallel copying of large numbers of objects
    • Can copy across buckets and across accounts
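
S3DistCp ships with EMR and is typically run as a step; a boto3 sketch with a hypothetical cluster ID and paths.

    # Sketch: copy objects from S3 into HDFS in a distributed manner
    import boto3

    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXX",               # hypothetical cluster ID
        Steps=[{
            "Name": "copy-to-hdfs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp",
                         "--src", "s3://my-bucket/input/",
                         "--dest", "hdfs:///input/"],
            },
        }])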

Other Hadoop Tools

  • Ganglia (monitoring)
    • CloudWatch
  • Mahout (machine learning library)
    • MXNet, TensorFlow
  • Accumulo (NoSQL database)
    • HBase, DynamoDB
  • Sqoop (relational database connector)
    • Can parallelize the copying of data between an RDB and your cluster
  • HCatalog (table and storage management for Hive metastore)
  • Kinesis Connector (access Kinesis streams in your scripts)
  • Tachyon (accelerator for Spark)
  • Derby (RDB)
    • Implemented in Java
  • Ranger (data security manager for Hadoop)

EMR Security

  • IAM
  • Kerberos
    • Provide strong authentication through secret key cryptography
  • SSH

Choosing EMR Instance Types - 1

  • Master node:
    • m4.large if < 50 nodes
    • m4.xlarge if > 50 nodes
  • Core & task nodes:
    • m4.large is usually good
    • t2.medium if the job waits on external dependencies (lots of idle time)
    • m4.xlarge for better performance
    • Other considerations
      • High CPU instances
      • High memory instances
      • Cluster compute instances
  • Spot instances
    • Good for task nodes

Choosing EMR Instance Types - 2

  • General Purpose: T2, T3, M4, M5
  • Compute Optimized: C4, C5
    • Batch, Analytics, Machine Learning
  • Memory Optimized: R4, R5, X1, Z1d
    • In-memory databases, real-time analytics
  • Storage Optimized: H1, I3, D2
    • HDFS, NFS, Map Reduce, Apache Kafka, Redshift
  • Accelerated Computing: P2, P3, G3, F1
    • GPU instances: Deep Learning

Amazon ML

  • Amazon Machine Learning
    • Provides visualization tools & wizards to make creating a model easy
    • Can train on data from S3, Redshift, or RDS
    • Can serve predictions in batches or through a low-latency API
    • Can evaluate your model

Models in Amazon ML

  • Regression
  • Classification
    • Multiclass
    • Binary

Confusion Matrix

  • Visualizes the accuracy of multiclass classification models (worked example after this list)
    • A true-label vs. predicted-label matrix
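
A worked example of the idea using scikit-learn (illustrating the concept, not an Amazon ML API); the labels are made up.

    # Sketch: rows are true labels, columns are predicted labels
    from sklearn.metrics import confusion_matrix

    y_true = ["cat", "dog", "cat", "bird", "dog"]
    y_pred = ["cat", "dog", "dog", "bird", "dog"]
    print(confusion_matrix(y_true, y_pred, labels=["bird", "cat", "dog"]))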

Hyperparameters in Amazon ML

  • Learning rate
  • Model size
  • Number of passes
  • Data shuffling
  • Regularization

Promises & Limitations

  • No downtime
  • Soft limit up to 100 GB training data
  • Soft limit up to 5 simultaneous jobs

Anti-Patterns

  • Terabyte-scale data
  • Unsupported learning tasks
    • Sequence prediction
    • Unsupervised clustering
    • Deep learning
  • Using EMR / Spark to build your own algorithms and models is an alternative

Amazon SageMaker

  • Use notebooks hosted on AWS to train large-scale models and serve predictions
    • Jupyter notebooks using Python
  • Can do GPU accelerated deep learning
  • Can scale effectively without limit
  • Makes hyperparameter tuning jobs easy (a training sketch follows this list)
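
A training sketch with the SageMaker Python SDK (v2-style parameter names); the container image, IAM role, and S3 paths are hypothetical.

    # Sketch: launch a training job on a GPU instance
    from sagemaker.estimator import Estimator

    est = Estimator(
        image_uri="<training-image-uri>",          # hypothetical image
        role="arn:aws:iam::123456789012:role/MySageMakerRole",
        instance_count=1,
        instance_type="ml.p3.2xlarge",             # GPU-accelerated training
        output_path="s3://my-bucket/models/",
    )
    est.fit({"train": "s3://my-bucket/train/"})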

Vs. Amazon ML

  • Scales better
  • More flexibility
  • More advanced and modern algorithms
  • But you need to write code

SageMaker Modules

  • Build
  • Train
    • SageMaker Search: search your training jobs and models
  • Deploy
    • SageMaker Neo: deploy trained models to edge devices

SageMaker Security

  • Code stored in ML storage volumes
    • Controlled by security groups
    • Optionally encrypted at rest
  • Artifacts encrypted in transit and at rest
  • API & console secured by SSL
  • IAM roles
  • KMS integration for notebooks, training jobs, and endpoints

AWS Data Pipeline

  • Task scheduling framework
    • Processes and moves data between different AWS resources (a boto3 sketch closes this section)

Data Pipeline Features

  • Destinations
    • S3, RDS, DynamoDB, Redshift, EMR
  • Retries and notifies on failures
    • Retries 3 times by default, configurable up to 10
  • Can run jobs cross-region
  • Can do precondition checks
  • On-premises sources can also use Data Pipeline
    • Install the Task Runner agent
    • You can run multiple Task Runners for a job for high availability

Data Pipeline Activities

  • EMR jobs
  • Hive
  • Copy
  • SQL
  • Scripts
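
A boto3 sketch of driving a pipeline from code, assuming its definition (activities, schedule, preconditions) has already been uploaded with put_pipeline_definition; the names and IDs are hypothetical.

    # Sketch: create and activate a pipeline
    import boto3

    dp = boto3.client("datapipeline")
    pipeline = dp.create_pipeline(name="my-pipeline", uniqueId="my-pipeline-1")
    # ... upload the definition with put_pipeline_definition(...) here ...
    dp.activate_pipeline(pipelineId=pipeline["pipelineId"])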