AWS S3

This getting started guide demonstrates how to use Alien Giraffe to access and transform data stored in AWS S3. You’ll learn how to connect to S3 buckets, define schemas for your data, and work with files using a familiar pandas-like interface.

Before you begin, you'll need:

  • Python 3.8 or higher
  • An AWS S3 bucket with data files (Parquet, CSV, or JSON)
  • AWS credentials with read access to your S3 bucket
Install the Alien Giraffe package with pip:

pip install alien-giraffe
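
To confirm the installation, import the package and print the installed distribution version. This check uses only the Python standard library and does not assume anything about Alien Giraffe's own API:

import importlib.metadata

import alien_giraffe  # confirms the package imports cleanly

# Report the version pip installed (importlib.metadata ships with Python 3.8+)
print(importlib.metadata.version("alien-giraffe"))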

Create an a10e.toml configuration file to define your S3 connection:

[[datasources.s3]]
name = "analytics-data"
region = "us-west-2"
bucket = "company-analytics"
prefix = "data/" # Optional: filter to specific prefix
access_key = "${AWS_ACCESS_KEY_ID}" # Uses environment variable
secret_key = "${AWS_SECRET_ACCESS_KEY}"

If running on AWS infrastructure with IAM roles:

[[datasources.s3]]
name = "analytics-data"
region = "us-west-2"
bucket = "company-analytics"
use_iam_role = true # Uses instance/task role instead of credentials
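
Either way, the client only needs read access to the bucket. If you want to sanity-check that access before wiring up Alien Giraffe, a short boto3 listing works. boto3 is independent of Alien Giraffe and is used here purely as an optional check; the bucket, prefix, and region mirror the configuration above:

import boto3

# Uses the same credential chain as the config above: environment variables
# locally, or the instance/task role on AWS infrastructure
s3 = boto3.client("s3", region_name="us-west-2")

response = s3.list_objects_v2(
    Bucket="company-analytics",
    Prefix="data/",
    MaxKeys=5,
)

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])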

Alien Giraffe uses JSON Schema to define the structure of your data. This ensures type safety and enables powerful query optimization. You can define a schema directly in Python, generate one from a natural-language description, or declare it in your configuration file.

Option 1: Define Schema in Python

import alien_giraffe
import json

# Initialize client
a10e = alien_giraffe.Client()

# Define schema for customer data stored in S3
schema = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "integer", "pk": True},
        "name": {"type": "string"},
        "email": {"type": "string"},
        "age": {"type": "integer"},
        "total_purchases": {"type": "number"},
        "account_status": {
            "type": "string",
            "enum": ["active", "inactive", "suspended"]
        },
        "created_date": {"type": "string", "format": "date"},
        "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["customer_id", "name", "email"]
}

# Add the schema
a10e.add_schema("customers", json.dumps(schema))

Option 2: Define Schema from Natural Language

# Let Alien Giraffe generate the schema from a description
schema = a10e.nl_schema(
    "Customer data with ID, name, email, age, purchase history, "
    "account status (active/inactive/suspended), and tags"
)
# Add the generated schema
a10e.add_schema("customers", schema)
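
It is worth inspecting the generated schema, since it comes from a free-text description rather than an explicit definition. Assuming nl_schema returns the schema as a JSON string (which the add_schema call above suggests), you can pretty-print it:

import json

# Pretty-print the generated schema to verify field names and types
# (assumes nl_schema returns a JSON string, as in Option 1)
print(json.dumps(json.loads(schema), indent=2))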

Option 3: Define Schema in Configuration

Add to your a10e.toml:

[[schemas.customers]]
schema = """
{
  "type": "object",
  "properties": {
    "customer_id": {"type": "integer", "pk": true},
    "name": {"type": "string"},
    "email": {"type": "string"},
    "age": {"type": "integer"},
    "total_purchases": {"type": "number"},
    "account_status": {
      "type": "string",
      "enum": ["active", "inactive", "suspended"]
    },
    "created_date": {"type": "string", "format": "date"},
    "tags": {"type": "array", "items": {"type": "string"}}
  },
  "required": ["customer_id", "name", "email"]
}
"""

Once your schema is defined, Alien Giraffe will automatically discover matching files in your S3 bucket:

# Load the schema - Alien Giraffe finds matching files in S3
a10e.load("customers")
# Or explicitly specify the S3 datasource
a10e.load("customers", datasources=["analytics-data"])

Alien Giraffe automatically:

  • Scans your S3 bucket for files matching the schema
  • Supports Parquet, CSV, and JSON formats
  • Handles partitioned datasets
  • Optimizes query performance with metadata caching

Use the pandas-like API to work with your S3 data without downloading it:

# Get a dataframe handle (data not loaded yet)
df = a10e.df("customers")
# View first few rows
df.head()
# Check data types
df.dtypes
# Basic statistics
df.describe()
# Filter active customers
active_customers = df[df["account_status"] == "active"]
# Complex filtering
high_value_customers = df[
    (df["total_purchases"] > 1000) &
    (df["account_status"] == "active") &
    (df["age"] >= 25)
]
# Aggregations
avg_purchases_by_status = df.groupby("account_status")["total_purchases"].mean()
# Sort by purchase amount
top_customers = df.sort_values("total_purchases", ascending=False).head(100)
# Run SQL directly on S3 data
results = a10e.sql("""
    SELECT
        account_status,
        COUNT(*) AS customer_count,
        AVG(total_purchases) AS avg_purchases
    FROM customers
    WHERE age > 18
    GROUP BY account_status
    ORDER BY avg_purchases DESC
""")

When you’re ready to materialize the results:

# Export to pandas DataFrame
pandas_df = df.to_pandas()
# Export filtered results to CSV
active_customers.to_csv("active_customers.csv")
# Export to Parquet for better performance
high_value_customers.to_parquet("high_value_customers.parquet")
# Export as JSON
df.head(100).to_json("sample_customers.json")
# Process in chunks to manage memory
for chunk in df.iter_chunks(chunk_size=10000):
    # Process each chunk
    processed = chunk[chunk["total_purchases"] > 0]
    processed.to_csv("processed_customers.csv", mode='a')
# Use column selection to reduce data transfer
essential_cols = ["customer_id", "name", "total_purchases"]
df[essential_cols].to_pandas()

For optimal performance with Alien Giraffe:

  1. Use Parquet format - Provides the best query performance
  2. Partition by date or category - Enables query pruning (see the sketch after this list)
  3. Compress files - Reduces S3 data transfer costs
  4. Use consistent naming - Helps with schema discovery
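
As an illustration of the first three points, here is one way to produce partitioned, compressed Parquet with pandas and pyarrow before uploading to S3. This is a sketch that sits outside Alien Giraffe itself, and the rows are made up to match the customers schema above:

import pandas as pd

# A couple of illustrative rows shaped like the customers schema,
# plus year/month columns to partition on
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Ada Lovelace", "Grace Hopper"],
    "email": ["ada@example.com", "grace@example.com"],
    "total_purchases": [1200.50, 89.99],
    "year": [2024, 2024],
    "month": ["01", "02"],
})

# Writes year=.../month=... directories of snappy-compressed Parquet files,
# matching the layout shown below (requires pyarrow)
customers.to_parquet(
    "customers/",
    engine="pyarrow",
    partition_cols=["year", "month"],
    compression="snappy",
)

With s3fs installed, pandas can write the same layout directly to an s3:// path; otherwise, upload the resulting directory to your bucket.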

Example S3 structure:

s3://company-analytics/
├── customers/
│   ├── year=2024/
│   │   ├── month=01/
│   │   │   ├── customers_20240101.parquet
│   │   │   └── customers_20240102.parquet
│   │   └── month=02/
│   │       └── customers_20240201.parquet

Configure column-level security in your a10e.toml:

[[security.table_rules]]
table = "customers"
blocked_columns = ["ssn", "credit_card"] # Never accessible
masked_columns = ["email", "phone"] # Returns masked values

Ensure your S3 bucket policy allows only read access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/alien-giraffe-reader"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::company-analytics/*",
        "arn:aws:s3:::company-analytics"
      ]
    }
  ]
}
If you run into problems, check these common issues:

  1. “No datasets found” error

    • Verify S3 credentials are correct
    • Check bucket and prefix configuration
    • Ensure files match the defined schema
  2. Slow query performance

    • Use Parquet format instead of CSV/JSON
    • Add partitioning to your S3 data
    • Limit the data scanned with filters
  3. Memory errors

    • Use iter_chunks() for large datasets
    • Select only needed columns
    • Apply filters before exporting
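
Putting the memory tips together: filter rows, select only the columns you need, and stream the result in chunks so nothing large is materialized at once. The sketch below reuses only calls shown earlier in this guide; it assumes filtered handles expose the same iter_chunks() method, and the chunk size is illustrative:

df = a10e.df("customers")

# Filter rows and project columns first so less data is scanned in S3
active = df[df["account_status"] == "active"]
slim = active[["customer_id", "name", "total_purchases"]]

# Stream the result in manageable pieces, appending each chunk to the output
for chunk in slim.iter_chunks(chunk_size=5000):
    chunk.to_csv("active_customers_slim.csv", mode='a')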