AWS Glue Workflow: Getting started

Create a basic Glue workflow using an AWS CloudFormation template. A Glue workflow can replace Step Functions, which have been used to maintain Glue flow states. If you plan to automate your build and deployment, the blog post [1] will help you.

In this post, I completely ignore the AWS BuildPipeline, which is the recommended CI/CD pipeline explained in the above post.



AWS CloudFormation for workflow

CFN stack with the workflow

As shown in the above diagram, the trigger's action starts the Glue Crawler. The CFN template is as follows:

AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a crawler
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:      
  GlueWorkflowName:
    Type: String
    Description: workflow name for the flights
    Default: flights-workflow
# The name of the crawler to be created
  CFNCrawlerName:  
    Type: String
    Default: cfn-crawler-flights-1
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTablePrefixName:
    Type: String
    Default: cfn_sample_1_  
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  # Glue Workflow
  FlightWorkflow:
    Type: AWS::Glue::Workflow
    Properties: 
      Description: Glue workflow that tracks specified triggers, jobs, and crawlers as a single entity
      Name: !Ref GlueWorkflowName

  # Glue Triggers
  TriggerFlightWorkflowStart:
    Type: AWS::Glue::Trigger
    Properties:
      Name: t_Start
      Type: SCHEDULED
      Schedule: cron(0 8 * * ? *) # Runs once a day at 8 AM UTC
      StartOnCreation: true
      WorkflowName: !Ref FlightWorkflow # Ref returns the workflow name and makes the trigger depend on the workflow
      Actions:
        - CrawlerName: !Ref CFNCrawlerFlights

  # Create the IAM role assumed by the Glue crawler. For demonstration, this role is given all permissions.
  CFNRoleFlights:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        -
          PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              -
                Effect: "Allow"
                Action: "*"
                Resource: "*"
  # Create a database to contain tables created by the crawler
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: "AWS Glue container to hold metadata tables for the flights crawler"

  # Create a crawler to crawl the flights data on a public S3 bucket
  CFNCrawlerFlights:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleFlights.Arn
      #Classifiers: none, use the default classifier
      Description: AWS Glue crawler to crawl flights data
      #Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        S3Targets:
          # Public S3 bucket with the flights data
          - Path: "s3://crawler-public-us-east-1/flight/2016/csv"
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"

Reading the AWS article [2] motivated me to write this post, and the Glue Crawler source code above is taken from that article.
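The CFNRoleFlights role in the template allows every action on every resource, which is acceptable only for a demonstration. A narrower role could look like the following sketch. It is not from the original article: it assumes that the AWS managed policy AWSGlueServiceRole plus read access to the public sample bucket is enough for this crawler, and the policy name read-flights-sample-data is made up for illustration.

```yaml
  # Hypothetical replacement for CFNRoleFlights (assumption, not verified against the sample data)
  CFNRoleFlights:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      # AWS managed policy with the baseline permissions a Glue crawler or job needs
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
      Policies:
        - PolicyName: "read-flights-sample-data" # illustrative name
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              # Read the objects under the crawled prefix
              - Effect: "Allow"
                Action:
                  - "s3:GetObject"
                Resource: "arn:aws:s3:::crawler-public-us-east-1/flight/2016/csv/*"
              # List the bucket so the crawler can discover the objects
              - Effect: "Allow"
                Action:
                  - "s3:ListBucket"
                Resource: "arn:aws:s3:::crawler-public-us-east-1"
```

Because the logical ID stays CFNRoleFlights, the crawler's Role: !GetAtt CFNRoleFlights.Arn line would not need to change.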

Original CFN template (without a workflow)

Run the workflow

Run the following command to create the stack:

aws cloudformation create-stack --stack-name oj-test --template-body file:///Users/ojitha/aws/cfexamples/example1/my-app.yaml --capabilities CAPABILITY_NAMED_IAM

The above stack deploys a very simple, working Glue workflow:

Glue Workflow

Left alone, the Crawler is triggered only at the scheduled time defined by the Schedule property of the t_Start trigger (once a day at 8 AM UTC), so for now you have to run the workflow manually. To do this from the CLI instead of the console:

aws glue start-workflow-run --name flights-workflow
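If the workflow should only ever be started by hand, as with the command above, the start trigger could be declared as on-demand instead of scheduled. The following is a minimal sketch, not part of the original template; it assumes the trigger replaces the scheduled TriggerFlightWorkflowStart when the stack is first created, since a workflow has a single start trigger.

```yaml
  # Hypothetical alternative to the scheduled t_Start trigger
  TriggerFlightWorkflowStart:
    Type: AWS::Glue::Trigger
    Properties:
      Name: t_Start
      Type: ON_DEMAND # fires only when a workflow run is started explicitly
      # Schedule and StartOnCreation are omitted: they do not apply to ON_DEMAND triggers
      WorkflowName: !Ref FlightWorkflow # Ref returns the workflow name
      Actions:
        - CrawlerName: !Ref CFNCrawlerFlights
```

With this variant nothing runs at 8 AM UTC; each run has to be started from the console or with aws glue start-workflow-run.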

Successful workflow

When the workflow finishes, it should look similar to the above screenshot.
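The workflow, crawler, and database names used in these commands come from the parameter defaults in the template. One way to look them up after deployment, rather than hard-coding them, is an Outputs section. The sketch below is an assumption and is not part of the original template; the output names are chosen only for illustration.

```yaml
# Hypothetical Outputs section, appended at the top level of the template
Outputs:
  WorkflowName:
    Description: Name to pass to "aws glue start-workflow-run --name"
    Value: !Ref FlightWorkflow
  CrawlerName:
    Description: Name of the crawler started by the workflow
    Value: !Ref CFNCrawlerFlights
  DatabaseName:
    Description: Data Catalog database to query from Athena
    Value: !Ref CFNDatabaseFlights
```

The values then appear in the Outputs tab of the stack in the console, or in the result of aws cloudformation describe-stacks --stack-name oj-test.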

Query in Athena

After the Crawler finishes, you can query the database in Amazon Athena:

SELECT * FROM "cfn-database-flights-1"."cfn_sample_1_csv" limit 10;

Cleanup

To delete this stack:

aws cloudformation delete-stack --stack-name oj-test 

References:
