AWS Glue Workflow: Getting started
Create a basic Glue workflow using an AWS CloudFormation template. A Glue workflow replaces the use of Step Functions, which were previously used to maintain Glue flow states. If you plan to automate your build and deployment, this blog post [1] will help you.
In this post, I deliberately leave out the AWS build pipeline, which is the recommended CI/CD approach explained in the post above.
AWS CloudFormation for workflow
As shown in the diagram above, the trigger starts the Glue crawler. The CFN template is as follows:
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a Glue workflow that starts a crawler
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  GlueWorkflowName:
    Type: String
    Description: Workflow name for the flights data
    Default: flights-workflow
  # The name of the crawler to be created
  CFNCrawlerName:
    Type: String
    Default: cfn-crawler-flights-1
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTablePrefixName:
    Type: String
    Default: cfn_sample_1_
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  # Glue Workflow
  FlightWorkflow:
    Type: AWS::Glue::Workflow
    Properties:
      Description: Glue workflow that tracks specified triggers, jobs, and crawlers as a single entity
      Name: !Ref GlueWorkflowName
  # Glue Triggers
  TriggerFlightWorkflowStart:
    Type: AWS::Glue::Trigger
    Properties:
      Name: t_Start
      Type: SCHEDULED
      Schedule: cron(0 8 * * ? *) # Runs once a day at 8 AM UTC
      StartOnCreation: true
      WorkflowName: !Ref FlightWorkflow # Ref to the resource so the workflow is created before the trigger
      Actions:
        - CrawlerName: !Ref CFNCrawlerFlights
  # Create IAM Role assumed by the Glue crawler. For demonstration, this role is given all permissions.
  CFNRoleFlights:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        -
          PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: "Allow"
                Action: "*"
                Resource: "*"
  # Create a database to contain tables created by the crawler
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: "AWS Glue container to hold metadata tables for the flights crawler"
  # Create a crawler to crawl the flights data on a public S3 bucket
  CFNCrawlerFlights:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleFlights.Arn
      # Classifiers: none, use the default classifier
      Description: AWS Glue crawler to crawl flights data
      # Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        S3Targets:
          # Public S3 bucket with the flights data
          - Path: "s3://crawler-public-us-east-1/flight/2016/csv"
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
The AWS article [2] motivated me to write this post, and the Glue crawler source code above is taken from that article.
Run the workflow
Run the following command to create the stack (use update-stack instead if the stack already exists):
aws cloudformation create-stack --stack-name oj-test --template-body file:///Users/ojitha/aws/cfexamples/example1/my-app.yaml --capabilities CAPABILITY_NAMED_IAM
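The command above returns as soon as CloudFormation accepts the request, not when the resources exist. A minimal sketch for waiting on the result, assuming a hypothetical helper named stack_status and an assumed POLL_SECONDS variable for the polling interval (neither is part of the AWS CLI):

```shell
# Hypothetical helper: poll a CloudFormation stack until no operation is in
# progress, then print its final status (for example CREATE_COMPLETE).
# POLL_SECONDS is an assumed knob for this sketch; it defaults to 10 seconds.
stack_status() {
  stack_name="$1"
  while :; do
    status=$(aws cloudformation describe-stacks --stack-name "$stack_name" \
      --query 'Stacks[0].StackStatus' --output text) || return 1
    case "$status" in
      *_IN_PROGRESS) sleep "${POLL_SECONDS:-10}" ;;  # still working, keep polling
      *) echo "$status"; return 0 ;;                 # terminal state reached
    esac
  done
}
```

Calling stack_status oj-test after the deployment command blocks until the stack settles. A status that contains ROLLBACK means the deployment failed and the stack events are worth checking.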
The stack above deploys a very simple, working Glue workflow:
The trigger starts the crawler only at its scheduled time (the Schedule property of the t_Start trigger, 8 AM UTC daily), so to see results right away you have to run the workflow manually. To do that from the CLI instead of the console:
aws glue start-workflow-run --name flights-workflow
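The start command only returns a run ID, so you still need to check when the run ends. Here is a hedged sketch that starts the run and polls its status with aws glue get-workflow-run. The helper name run_workflow and the POLL_SECONDS variable are assumptions made for this example.

```shell
# Hypothetical helper: start a Glue workflow run, poll until it is no longer
# running, and print "<run id> <final status>".
# POLL_SECONDS is an assumed knob for this sketch; it defaults to 30 seconds.
run_workflow() {
  workflow_name="$1"
  run_id=$(aws glue start-workflow-run --name "$workflow_name" \
    --query 'RunId' --output text) || return 1
  while :; do
    status=$(aws glue get-workflow-run --name "$workflow_name" \
      --run-id "$run_id" --query 'Run.Status' --output text) || return 1
    case "$status" in
      RUNNING|STOPPING) sleep "${POLL_SECONDS:-30}" ;;  # transitional states
      *) echo "$run_id $status"; return 0 ;;            # COMPLETED, STOPPED or ERROR
    esac
  done
}
```

With the template defaults, the call would be run_workflow flights-workflow. A COMPLETED run can still contain failed actions, so it is worth looking at the Statistics field of the run as well.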
When the workflow finishes, it should look similar to the screenshot above.
Query in Athena
After the crawler finishes, you can query the table it created with Amazon Athena:
SELECT * FROM "cfn-database-flights-1"."cfn_sample_1_csv" limit 10;
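The same query can be submitted from the CLI instead of the Athena console. This is a sketch only: athena_query is a hypothetical helper name, and the S3 results location is a placeholder that must point to a bucket you own.

```shell
# Hypothetical helper: submit a SQL statement to Athena and print the
# query execution ID. Arguments: database, SQL text, S3 results location.
athena_query() {
  database="$1"
  sql="$2"
  output_location="$3"
  aws athena start-query-execution \
    --query-string "$sql" \
    --query-execution-context "Database=$database" \
    --result-configuration "OutputLocation=$output_location" \
    --query 'QueryExecutionId' --output text
}

# Example call (the bucket name is a placeholder, not a real location):
# athena_query cfn-database-flights-1 \
#   'SELECT * FROM "cfn_sample_1_csv" LIMIT 10' \
#   s3://<your-results-bucket>/athena/
```

Once the query has finished, passing the printed ID to aws athena get-query-results --query-execution-id returns the rows.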
Cleanup
To delete the stack, run:
aws cloudformation delete-stack --stack-name oj-test
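Deletion is asynchronous too. If you want the shell to block until the stack is really gone, a small sketch using the CLI's built-in waiter could look like this (delete_stack_and_wait is a hypothetical helper name):

```shell
# Hypothetical helper: request stack deletion, then block until CloudFormation
# reports that the deletion has completed.
delete_stack_and_wait() {
  aws cloudformation delete-stack --stack-name "$1" &&
    aws cloudformation wait stack-delete-complete --stack-name "$1"
}
```

For the stack in this post the call would be delete_stack_and_wait oj-test.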
References: