GCP Virus Scanning Pipeline - Deployment Instructions

This pipeline serves as an example for integrating the ClamAV API server into a virus scanning pipeline and automating the process of scanning new files uploaded to GCS buckets. The Pub/Sub topics and subscriptions for clean and infected files can be used to trigger downstream processing based on scan results.

If this example doesn't fit your specific requirements, please let us know your needs. If possible, we can develop a customized solution tailored to your use case.

Overview

This GCP pipeline automatically scans files uploaded to a Google Cloud Storage (GCS) bucket for viruses using an external ClamAV scanning API. When files are uploaded to the configured bucket, the pipeline:

  1. Detects the upload via GCS bucket notifications
  2. Triggers a Cloud Function via Pub/Sub
  3. Scans the file using the configured virus scanning API
  4. Publishes scan results to separate Pub/Sub topics for clean and infected files
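
Steps 2 to 4 can be sketched in Python as follows. This is a minimal illustration, not the contents of function/main.py. It assumes the function receives the usual Pub/Sub envelope (a `message` object whose `data` field is base64-encoded) and that the GCS notification payload is the JSON object resource with `bucket` and `name` fields. The helper names `extract_object` and `choose_topic` are hypothetical.

```python
import base64
import json


def extract_object(event_data: dict) -> tuple[str, str]:
    """Return (bucket, object_name) from a Pub/Sub-wrapped GCS notification."""
    # Pub/Sub delivers the notification body as base64-encoded bytes.
    raw = base64.b64decode(event_data["message"]["data"])
    payload = json.loads(raw.decode("utf-8"))
    return payload["bucket"], payload["name"]


def choose_topic(infected: bool, clean_topic: str, infected_topic: str) -> str:
    """Pick the results topic that matches the scan verdict."""
    return infected_topic if infected else clean_topic


# Example: a hand-built event for gs://my-uploads-bucket/test.txt
sample = {
    "message": {
        "data": base64.b64encode(
            json.dumps({"bucket": "my-uploads-bucket", "name": "test.txt"}).encode()
        ).decode()
    }
}
bucket, name = extract_object(sample)
print(bucket, name)
```

The scan call itself is left out on purpose, because the request format of the scanning API is not described here.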

Architecture

The pipeline consists of the following components:

  • GCS Bucket Notifications: Automatically publish events to Pub/Sub when files are uploaded
  • Pub/Sub Topic & Subscription: Receive GCS upload notifications
  • Cloud Function (2nd Gen): Processes upload events and scans files for viruses
  • Pub/Sub Topics for Results: Separate topics for clean and infected file notifications
  • Service Accounts: Manage permissions for GCS-to-Pub/Sub notifications

Prerequisites

Before deploying this pipeline, ensure you have:

  1. GCP Account & Project

    • A GCP project with billing enabled
    • Owner or Editor role on the project
    • APIs enabled (see below)
  2. Required GCP APIs Enabled

    gcloud services enable cloudfunctions.googleapis.com
    gcloud services enable pubsub.googleapis.com
    gcloud services enable storage.googleapis.com
    gcloud services enable storage-component.googleapis.com
    gcloud services enable run.googleapis.com
    gcloud services enable cloudbuild.googleapis.com
    gcloud services enable artifactregistry.googleapis.com
    
  3. Terraform Installed

  4. gcloud CLI Installed & Configured

  5. Existing GCS Bucket

    • A GCS bucket where files will be uploaded (must already exist)
    • The bucket name will be used in the deployment configuration
  6. Virus Scanning API

    • A running ClamAV virus scanning API service
    • The API endpoint URL (IP address or hostname)
    • The API should be accessible from your Cloud Function's VPC network

Download Files

Download the required Terraform and Python files for the pipeline:

Quick Setup:

  1. Create a directory for your pipeline:

    mkdir gcp-pipeline
    cd gcp-pipeline
    
  2. Download all files using one of the following methods:

    Method 1: Direct download from documentation site

    # Set the base URL (adjust if your documentation is hosted elsewhere)
    BASE_URL="https://docs.elmcomputing.io"
    
    # Download Terraform files
    curl -O "${BASE_URL}/gcp/virus-scanning-pipeline/files/main.tf"
    curl -O "${BASE_URL}/gcp/virus-scanning-pipeline/files/variables.tf"
    
    # Create function directory and download Python files
    mkdir -p function
    curl -o function/main.py "${BASE_URL}/gcp/virus-scanning-pipeline/files/function/main.py"
    curl -o function/requirements.txt "${BASE_URL}/gcp/virus-scanning-pipeline/files/function/requirements.txt"
    

    Method 2: Manual download

    • Click on each file link above to download
    • Organize them in the following directory structure:
      gcp-pipeline/
      ├── main.tf
      ├── variables.tf
      └── function/
          ├── main.py
          └── requirements.txt
      

    Method 3: Right-click and "Save link as"

    • Right-click on each file link above
    • Select "Save link as" or "Download linked file"
    • Save to the appropriate location in your gcp-pipeline directory

Deployment Steps

Step 1: Prepare Your Environment

  1. Download the pipeline files (see Download Files section above) and organize them in a directory structure.

  2. Navigate to the pipeline directory:

    cd gcp-pipeline
    

Step 2: Configure Variables

Edit the variables in variables.tf or create a terraform.tfvars file with your specific values:

Required Variables:

project_id = "your-gcp-project-id"
region = "us-central1"  # Your preferred GCP region
zone = "us-central1-a"   # Your preferred GCP zone
uploads_bucket_name = "your-existing-bucket-name"  # Must already exist
virus_scan_api_url = "10.128.0.47"  # IP or hostname of your virus scanning API

Optional Variables (with defaults):

  • prefix: Prefix to add to all resource names for easy identification (default: "dev")
  • function_runtime: Python runtime version (default: "python310")
  • function_entry_point: Function entry point (default: "process_gcs_event")
  • function_memory: Memory allocation (default: "256M")
  • function_timeout: Timeout in seconds (default: 300)
  • function_max_instances: Maximum instances (default: 1)
  • function_min_instances: Minimum instances (default: 0)
  • function_cpu: CPU allocation (default: "0.5")
  • function_max_concurrency: Max concurrent requests per instance (default: 1)
  • service_account_id: Service account ID (default: "gcs-pubsub-notifier")
  • pubsub_ack_deadline_seconds: Message acknowledgement deadline (default: 60)
  • pubsub_message_retention_duration: Message retention (default: "2592000s" - 30 days)
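
The retention default is a duration in seconds: 30 days x 24 hours x 3600 seconds = 2592000 seconds. If you want a different retention period, a small helper like the one below (a hypothetical convenience, not part of the pipeline) produces the string in the same format.

```python
def days_to_duration(days: int) -> str:
    """Format a whole number of days as a seconds-based duration string."""
    seconds = days * 24 * 60 * 60
    return f"{seconds}s"


print(days_to_duration(30))  # 2592000s
print(days_to_duration(7))   # 604800s
```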

Note on Resource Naming: All resource names will be prefixed with the prefix variable value (default: "dev"). For example, with the default prefix = "dev", resources will be named like:

  • dev-<bucket-name>-upload (Pub/Sub topic)
  • dev-gcs-virus-scan-<bucket-name> (Cloud Function)
  • dev-gcs-pubsub-notifier (Service account)

Example terraform.tfvars file:

project_id = "my-gcp-project"
region = "us-central1"
zone = "us-central1-a"
uploads_bucket_name = "my-uploads-bucket"
virus_scan_api_url = "10.128.0.47"
prefix = "dev"  # Optional: prefix for all resource names (default: "dev")
function_memory = "512M"
function_timeout = 600
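
Before running Terraform, it can help to confirm that every required variable is present. The sketch below is a deliberately naive check, assuming a flat terraform.tfvars with one `key = value` pair per line and no `#` characters inside quoted values. It is not a full HCL parser, and the function names are hypothetical.

```python
REQUIRED = ("project_id", "region", "zone", "uploads_bucket_name", "virus_scan_api_url")


def parse_tfvars(text: str) -> dict:
    """Parse flat `key = value` lines; a trailing `# comment` is dropped."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip().strip('"')
    return values


def missing_required(values: dict) -> list:
    """List required variables that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]


example = '''
project_id = "my-gcp-project"
region = "us-central1"
zone = "us-central1-a"
uploads_bucket_name = "my-uploads-bucket"
prefix = "dev"  # Optional
function_timeout = 600
'''
print(missing_required(parse_tfvars(example)))  # ['virus_scan_api_url']
```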

Step 3: Initialize Terraform

Initialize Terraform to download required providers:

terraform init

This will create a .terraform directory and download the Google provider.

Step 4: Review Deployment Plan

Review what Terraform will create:

terraform plan

This will show you:

  • Resources that will be created
  • Any potential issues or warnings
  • A summary count of resources to add, change, and destroy

Expected Resources:

  • Pub/Sub topics (upload notifications, clean files, infected files)
  • Pub/Sub subscriptions
  • Cloud Function (2nd gen)
  • GCS bucket for function source code
  • Service accounts and IAM bindings
  • GCS bucket notification configuration

Step 5: Deploy the Pipeline

Apply the Terraform configuration:

terraform apply

Terraform will prompt you to confirm. Type yes to proceed.

Deployment typically takes 5-10 minutes as it:

  1. Creates Pub/Sub topics and subscriptions
  2. Packages and uploads the Cloud Function source code
  3. Deploys the Cloud Function
  4. Configures GCS bucket notifications
  5. Sets up IAM permissions
  6. Configures VPC egress settings

Step 6: Verify Deployment

After deployment completes, verify the resources:

  1. Check Cloud Function:

    gcloud functions describe <prefix>-gcs-virus-scan-<your-bucket-name> --region=<your-region> --gen2
    

    (Replace <prefix> with your actual prefix value, or omit if prefix is empty)

  2. Check Pub/Sub Topics:

    gcloud pubsub topics list | grep <your-bucket-name>
    

    (Topics will be prefixed if the prefix variable is set)

  3. Check GCS Notifications:

    gcloud storage buckets notifications list gs://<your-bucket-name>
    
  4. Check Function Logs:

    gcloud functions logs read <prefix>-gcs-virus-scan-<your-bucket-name> --region=<your-region> --gen2 --limit=50
    

    (Replace <prefix> with your actual prefix value, or omit if prefix is empty)

Configuration Details

Cloud Function Environment Variables

The Cloud Function is configured with the following environment variables:

  • GCP_PROJECT: Automatically set to your project ID
  • VIRUS_SCAN_API_URL: URL of your virus scanning API (from virus_scan_api_url variable)
  • CLEAN_TOPIC_NAME: Pub/Sub topic for clean files
  • INFECTED_TOPIC_NAME: Pub/Sub topic for infected files
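
One way a function could read these four variables at startup is sketched below. The `Settings` class and `load_settings` function are hypothetical and may differ from what function/main.py does; the example values are illustrative only.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    project: str
    scan_api_url: str
    clean_topic: str
    infected_topic: str


def load_settings(env=os.environ) -> Settings:
    """Read the four variables; raises KeyError naming the first one missing."""
    return Settings(
        project=env["GCP_PROJECT"],
        scan_api_url=env["VIRUS_SCAN_API_URL"],
        clean_topic=env["CLEAN_TOPIC_NAME"],
        infected_topic=env["INFECTED_TOPIC_NAME"],
    )


# Passing a plain dict instead of os.environ keeps the example self-contained.
settings = load_settings({
    "GCP_PROJECT": "my-gcp-project",
    "VIRUS_SCAN_API_URL": "10.128.0.47",
    "CLEAN_TOPIC_NAME": "dev-my-uploads-bucket-clean",
    "INFECTED_TOPIC_NAME": "dev-my-uploads-bucket-infected",
})
print(settings.clean_topic)
```

Failing fast on a missing variable makes a misconfigured deployment show up in the first log line rather than at the first scan.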

VPC Network Configuration

The Cloud Function is configured with:

  • VPC Egress: private-ranges-only - Only traffic to private IP ranges is routed through the VPC network; traffic to other addresses leaves the function directly
  • Network: Default VPC network
  • Subnet: Default subnet

This ensures the function can access your virus scanning API if it's on a private network.
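
To check whether a given API address would be sent through the VPC under a private-ranges-only egress setting, you can test it against the private ranges. The sketch below assumes those ranges are the RFC 1918 blocks plus the RFC 6598 shared address space; confirm the exact list against the current GCP documentation. The function name is hypothetical, and a hostname must be resolved to an IP address before calling it.

```python
import ipaddress

# Assumed ranges: RFC 1918 plus RFC 6598 shared address space.
PRIVATE_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("100.64.0.0/10"),
]


def routed_through_vpc(address: str) -> bool:
    """True if the IPv4 address falls in one of the assumed private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in PRIVATE_RANGES)


print(routed_through_vpc("10.128.0.47"))  # True
print(routed_through_vpc("8.8.8.8"))      # False
```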

Pub/Sub Topics Created

  1. <prefix>-<bucket-name>-upload: Receives GCS upload notifications
  2. <prefix>-<bucket-name>-clean: Receives notifications for clean files
  3. <prefix>-<bucket-name>-infected: Receives notifications for infected files

Each topic has a corresponding subscription for message consumption.

Note: All resource names will include the prefix (default: "dev"). For example, with the default prefix, topics will be named dev-<bucket-name>-upload, etc.
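
The naming pattern can be captured in a small helper, which is handy when scripting the gcloud commands used later in this guide. This is a sketch under two assumptions: an empty prefix also drops the leading hyphen, and service_account_id keeps its default of "gcs-pubsub-notifier". The subscription names follow the `-sub` suffix used in the commands in this guide. The function name is hypothetical.

```python
def resource_names(bucket: str, prefix: str = "dev") -> dict:
    """Build the documented resource names for a bucket and prefix."""
    # Assumption: an empty prefix drops the leading hyphen as well.
    lead = f"{prefix}-" if prefix else ""
    names = {
        "function": f"{lead}gcs-virus-scan-{bucket}",
        "service_account": f"{lead}gcs-pubsub-notifier",
    }
    for kind in ("upload", "clean", "infected"):
        names[f"{kind}_topic"] = f"{lead}{bucket}-{kind}"
        names[f"{kind}_subscription"] = f"{lead}{bucket}-{kind}-sub"
    return names


for key, value in resource_names("my-uploads-bucket").items():
    print(f"{key}: {value}")
```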

Testing the Pipeline

Test with a Clean File

  1. Upload a test file to your GCS bucket:

    echo "This is a test file" > test.txt
    gsutil cp test.txt gs://<your-bucket-name>/test.txt
    
  2. Check Cloud Function logs:

    gcloud functions logs read <prefix>-gcs-virus-scan-<your-bucket-name> --region=<your-region> --gen2 --limit=20
    

    (Replace <prefix> with your actual prefix value, or omit if prefix is empty)

  3. Check the clean topic for messages:

    gcloud pubsub subscriptions pull <prefix>-<your-bucket-name>-clean-sub --limit=1
    

    (Replace <prefix> with your actual prefix value, or omit if prefix is empty)

Test with an Infected File (EICAR)

  1. Create EICAR test file (safe test virus signature):

    echo 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' > eicar.txt
    
  2. Upload to bucket:

    gsutil cp eicar.txt gs://<your-bucket-name>/eicar.txt
    
  3. Check logs and infected topic:

    gcloud functions logs read <prefix>-gcs-virus-scan-<your-bucket-name> --region=<your-region> --gen2 --limit=20
    gcloud pubsub subscriptions pull <prefix>-<your-bucket-name>-infected-sub --limit=1
    

    (Replace <prefix> with your actual prefix value, or omit if prefix is empty)
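
If shell quoting of the EICAR string is a concern (it contains a backslash, `$`, and `!`), the file can also be written from Python. This is an optional alternative to the echo command in step 1; the function name `write_eicar` is hypothetical. A raw string literal keeps the backslash intact, and the file is written without a trailing newline.

```python
from pathlib import Path

# The standard 68-character EICAR test string; the raw literal keeps the backslash.
EICAR = r"X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*"


def write_eicar(path: str = "eicar.txt") -> Path:
    """Write the test string as ASCII bytes and return the file path."""
    target = Path(path)
    target.write_bytes(EICAR.encode("ascii"))
    return target
```

Calling `write_eicar()` creates eicar.txt in the current directory, ready for the upload in step 2. Local antivirus software may quarantine the file as soon as it is written.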

Monitoring

View Function Logs

gcloud functions logs read <prefix>-gcs-virus-scan-<your-bucket-name> --region=<your-region> --gen2 --limit=50

(Replace <prefix> with your actual prefix value, or omit if prefix is empty. This command returns a snapshot of recent entries; re-run it to see newer ones.)

Monitor Pub/Sub Metrics

In the GCP Console:

  1. Navigate to Pub/Sub > Topics
  2. Select your topics to view metrics:
    • Message count
    • Publish rate
    • Subscription backlog

Monitor Cloud Function Metrics

In the GCP Console:

  1. Navigate to Cloud Functions
  2. Select your function to view:
    • Invocation count
    • Execution time
    • Error rate
    • Active instances

Troubleshooting

Function Not Triggering

  1. Verify GCS notifications are configured:

    gcloud storage buckets notifications list gs://<your-bucket-name>
    
  2. Check Pub/Sub topic has messages:

    gcloud pubsub subscriptions pull <prefix>-<your-bucket-name>-upload-sub --limit=5
    

    (Replace <prefix> with your actual prefix value, or omit if prefix is empty)

  3. Verify function is active:

    gcloud functions describe <prefix>-gcs-virus-scan-<your-bucket-name> --region=<your-region> --gen2
    

    (Replace <prefix> with your actual prefix value, or omit if prefix is empty)

Virus Scan API Connection Issues

  1. Check function logs for connection errors:

    gcloud functions logs read <prefix>-gcs-virus-scan-<your-bucket-name> --region=<your-region> --gen2 --limit=50
    

    (Replace <prefix> with your actual prefix value, or omit if prefix is empty)

  2. Verify VPC configuration:

    • Ensure the virus scanning API is accessible from the function's VPC
    • Check firewall rules allow traffic from Cloud Functions
  3. Test API connectivity:

    • The function uses http://<virus_scan_api_url>:8080/api/clamav/scan/gcs/object
    • Verify this endpoint is reachable from your VPC
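
A basic reachability check can be scripted as below and run from a VM in the same VPC as the function. It builds the endpoint URL from the pattern above and tests only whether the TCP port accepts connections; it does not send a scan request, since the request format is not described here. The function names are hypothetical.

```python
import socket

SCAN_PORT = 8080
SCAN_PATH = "/api/clamav/scan/gcs/object"


def scan_endpoint(api_host: str) -> str:
    """Build the scan URL from the virus_scan_api_url value."""
    return f"http://{api_host}:{SCAN_PORT}{SCAN_PATH}"


def port_is_open(api_host: str, port: int = SCAN_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((api_host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


print(scan_endpoint("10.128.0.47"))
# http://10.128.0.47:8080/api/clamav/scan/gcs/object
```

If `port_is_open("10.128.0.47")` returns False from inside the VPC, the likely causes are a firewall rule or the API service not listening on port 8080.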

Permission Errors

  1. Check service account permissions:

    gcloud projects get-iam-policy <your-project-id>
    
  2. Verify GCS bucket permissions:

    gsutil iam get gs://<your-bucket-name>
    

Function Timeout Issues

If scans of large files run longer than the function timeout:

  1. Increase function timeout in variables.tf:

    function_timeout = 600  # 10 minutes
    
  2. Increase function memory:

    function_memory = "512M"
    
  3. Re-apply Terraform:

    terraform apply
    

Updating the Pipeline

Update Function Code

  1. Modify code in function/main.py or function/requirements.txt

  2. Re-apply Terraform:

    terraform apply
    

    Terraform will detect changes and redeploy the function automatically.

Update Configuration Variables

  1. Modify variables in variables.tf or terraform.tfvars

  2. Review changes:

    terraform plan
    
  3. Apply changes:

    terraform apply
    

Cleanup / Uninstall

To remove all resources created by this pipeline:

terraform destroy

Warning: This will delete:

  • All Pub/Sub topics and subscriptions
  • The Cloud Function
  • GCS bucket notifications
  • Service accounts (if not used elsewhere)
  • Function source code bucket

The original uploads bucket will not be deleted.

Support

For issues or questions:

  1. Check the troubleshooting section above
  2. Review Cloud Function logs for error messages
  3. Verify all prerequisites are met
  4. Ensure your virus scanning API is running and accessible

Disclaimer

This pipeline and its associated components are provided on an "AS-IS" basis. To the fullest extent permitted by law, Elm Computing disclaims and excludes any implied or statutory warranty, including any warranty of title, non-infringement, merchantability or fitness for a particular purpose. Elm Computing does not warrant that the pipeline will operate uninterrupted or error-free, or that all errors will be corrected.

Users are solely responsible for:

  • Ensuring the pipeline is properly configured and deployed according to their specific requirements
  • Maintaining and updating the pipeline components as needed
  • Verifying that the virus scanning API is properly configured and accessible
  • Implementing appropriate security measures and access controls
  • Backing up data and implementing disaster recovery procedures
  • Complying with all applicable laws, regulations, and security requirements
  • Any loss, damage, or liability resulting from the use or inability to use this pipeline

The pipeline relies on ClamAV, an open-source antivirus solution. While ClamAV is widely used and maintained, no antivirus solution can guarantee 100% detection of all malware. Users should implement additional security measures as appropriate for their use case and risk tolerance.