AWS S3

This getting started guide demonstrates how to use Alien Giraffe to access and transform data stored in AWS S3. You’ll learn how to connect to S3 buckets, define schemas for your data, and work with files using a familiar pandas-like interface.

Before you begin, you'll need:

  • Python 3.8 or higher
  • An AWS S3 bucket with data files (Parquet, CSV, or JSON)
  • AWS credentials with read access to your S3 bucket
Install the Alien Giraffe package with pip:

pip install alien-giraffe
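
To confirm the installation, import the package and print the installed distribution version. This check uses only the Python standard library and does not assume anything about Alien Giraffe's own API:

import importlib.metadata

import alien_giraffe  # confirms the package imports cleanly

# Report the version pip installed (importlib.metadata ships with Python 3.8+)
print(importlib.metadata.version("alien-giraffe"))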

Create an a10e.toml configuration file to define your S3 connection:

[[datasources.s3]]
name = "analytics-data"
region = "us-west-2"
bucket = "company-analytics"
prefix = "data/" # Optional: filter to specific prefix
access_key = "${AWS_ACCESS_KEY_ID}" # Uses environment variable
secret_key = "${AWS_SECRET_ACCESS_KEY}"

If running on AWS infrastructure with IAM roles:

[[datasources.s3]]
name = "analytics-data"
region = "us-west-2"
bucket = "company-analytics"
use_iam_role = true # Uses instance/task role instead of credentials
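
Either way, the client only needs read access to the bucket. If you want to sanity-check that access before wiring up Alien Giraffe, a short boto3 listing works. boto3 is independent of Alien Giraffe and is used here purely as an optional check; the bucket, prefix, and region mirror the configuration above:

import boto3

# Uses the same credential chain as the config above: environment variables
# locally, or the instance/task role on AWS infrastructure
s3 = boto3.client("s3", region_name="us-west-2")

response = s3.list_objects_v2(
    Bucket="company-analytics",
    Prefix="data/",
    MaxKeys=5,
)

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])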

Alien Giraffe uses JSON Schema to define the structure of your data. This ensures type safety and enables powerful query optimization. You can define a schema directly in Python, generate one from a natural-language description, or declare it in your configuration file.

Option 1: Define Schema in Python

import alien_giraffe
import json

# Initialize client
a10e = alien_giraffe.Client()

# Define schema for customer data stored in S3
schema = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "integer", "pk": True},
        "name": {"type": "string"},
        "email": {"type": "string"},
        "age": {"type": "integer"},
        "total_purchases": {"type": "number"},
        "account_status": {
            "type": "string",
            "enum": ["active", "inactive", "suspended"]
        },
        "created_date": {"type": "string", "format": "date"},
        "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["customer_id", "name", "email"]
}

# Add the schema
a10e.add_schema("customers", json.dumps(schema))

Option 2: Define Schema from Natural Language

# Let Alien Giraffe generate the schema from a description
schema = a10e.nl_schema(
    "Customer data with ID, name, email, age, purchase history, "
    "account status (active/inactive/suspended), and tags"
)
# Add the generated schema
a10e.add_schema("customers", schema)
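
It is worth inspecting the generated schema, since it comes from a free-text description rather than an explicit definition. Assuming nl_schema returns the schema as a JSON string (which the add_schema call above suggests), you can pretty-print it:

import json

# Pretty-print the generated schema to verify field names and types
# (assumes nl_schema returns a JSON string, as in Option 1)
print(json.dumps(json.loads(schema), indent=2))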

Option 3: Define Schema in Configuration

Add to your a10e.toml:

[[schemas.customers]]
schema = """
{
  "type": "object",
  "properties": {
    "customer_id": {"type": "integer", "pk": true},
    "name": {"type": "string"},
    "email": {"type": "string"},
    "age": {"type": "integer"},
    "total_purchases": {"type": "number"},
    "account_status": {
      "type": "string",
      "enum": ["active", "inactive", "suspended"]
    },
    "created_date": {"type": "string", "format": "date"},
    "tags": {"type": "array", "items": {"type": "string"}}
  },
  "required": ["customer_id", "name", "email"]
}
"""

Once your schema is defined, Alien Giraffe will automatically discover matching files in your S3 bucket:

# Load the schema - Alien Giraffe finds matching files in S3
a10e.load("customers")
# Or explicitly specify the S3 datasource
a10e.load("customers", datasources=["analytics-data"])

Alien Giraffe automatically:

  • Scans your S3 bucket for files matching the schema
  • Supports Parquet, CSV, and JSON formats
  • Handles partitioned datasets
  • Optimizes query performance with metadata caching

Use the pandas-like API to work with your S3 data without downloading it:

# Get a dataframe handle (data not loaded yet)
df = a10e.df("customers")
# View first few rows
df.head()
# Check data types
df.dtypes
# Basic statistics
df.describe()
# Filter active customers
active_customers = df[df["account_status"] == "active"]
# Complex filtering
high_value_customers = df[
    (df["total_purchases"] > 1000) &
    (df["account_status"] == "active") &
    (df["age"] >= 25)
]
# Aggregations
avg_purchases_by_status = df.groupby("account_status")["total_purchases"].mean()
# Sort by purchase amount
top_customers = df.sort_values("total_purchases", ascending=False).head(100)
# Run SQL directly on S3 data
results = a10e.sql("""
    SELECT
        account_status,
        COUNT(*) AS customer_count,
        AVG(total_purchases) AS avg_purchases
    FROM customers
    WHERE age > 18
    GROUP BY account_status
    ORDER BY avg_purchases DESC
""")

When you’re ready to materialize the results:

# Export to pandas DataFrame
pandas_df = df.to_pandas()
# Export filtered results to CSV
active_customers.to_csv("active_customers.csv")
# Export to Parquet for better performance
high_value_customers.to_parquet("high_value_customers.parquet")
# Export as JSON
df.head(100).to_json("sample_customers.json")
# Process in chunks to manage memory
for chunk in df.iter_chunks(chunk_size=10000):
    # Process each chunk
    processed = chunk[chunk["total_purchases"] > 0]
    processed.to_csv("processed_customers.csv", mode='a')
# Use column selection to reduce data transfer
essential_cols = ["customer_id", "name", "total_purchases"]
df[essential_cols].to_pandas()

For optimal performance with Alien Giraffe:

  1. Use Parquet format - Provides the best query performance
  2. Partition by date or category - Enables query pruning (see the sketch after this list)
  3. Compress files - Reduces S3 data transfer costs
  4. Use consistent naming - Helps with schema discovery
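
As an illustration of the first three points, here is one way to produce partitioned, compressed Parquet with pandas and pyarrow before uploading to S3. This is a sketch that sits outside Alien Giraffe itself, and the rows are made up to match the customers schema above:

import pandas as pd

# A couple of illustrative rows shaped like the customers schema,
# plus year/month columns to partition on
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Ada Lovelace", "Grace Hopper"],
    "email": ["ada@example.com", "grace@example.com"],
    "total_purchases": [1200.50, 89.99],
    "year": [2024, 2024],
    "month": ["01", "02"],
})

# Writes year=.../month=... directories of snappy-compressed Parquet files,
# matching the layout shown below (requires pyarrow)
customers.to_parquet(
    "customers/",
    engine="pyarrow",
    partition_cols=["year", "month"],
    compression="snappy",
)

With s3fs installed, pandas can write the same layout directly to an s3:// path; otherwise, upload the resulting directory to your bucket.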

Example S3 structure:

s3://company-analytics/
├── customers/
│   ├── year=2024/
│   │   ├── month=01/
│   │   │   ├── customers_20240101.parquet
│   │   │   └── customers_20240102.parquet
│   │   └── month=02/
│   │       └── customers_20240201.parquet

Configure column-level security in your a10e.toml:

[[security.table_rules]]
table = "customers"
blocked_columns = ["ssn", "credit_card"] # Never accessible
masked_columns = ["email", "phone"] # Returns masked values

Ensure your S3 bucket policy allows only read access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/alien-giraffe-reader"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::company-analytics/*",
        "arn:aws:s3:::company-analytics"
      ]
    }
  ]
}
If you run into problems, check these common issues:

  1. “No datasets found” error

    • Verify S3 credentials are correct
    • Check bucket and prefix configuration
    • Ensure files match the defined schema
  2. Slow query performance

    • Use Parquet format instead of CSV/JSON
    • Add partitioning to your S3 data
    • Limit the data scanned with filters
  3. Memory errors

    • Use iter_chunks() for large datasets
    • Select only needed columns
    • Apply filters before exporting
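
Putting the memory tips together: filter rows, select only the columns you need, and stream the result in chunks so nothing large is materialized at once. The sketch below reuses only calls shown earlier in this guide; it assumes filtered handles expose the same iter_chunks() method, and the chunk size is illustrative:

df = a10e.df("customers")

# Filter rows and project columns first so less data is scanned in S3
active = df[df["account_status"] == "active"]
slim = active[["customer_id", "name", "total_purchases"]]

# Stream the result in manageable pieces, appending each chunk to the output
for chunk in slim.iter_chunks(chunk_size=5000):
    chunk.to_csv("active_customers_slim.csv", mode='a')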