AWS S3
This getting started guide demonstrates how to use Alien Giraffe to access and transform data stored in AWS S3. You’ll learn how to connect to S3 buckets, define schemas for your data, and work with files using a familiar pandas-like interface.
Prerequisites
- Python 3.8 or higher
- An AWS S3 bucket with data files (Parquet, CSV, or JSON)
- AWS credentials with read access to your S3 bucket
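
Before configuring Alien Giraffe, you can optionally confirm that your credentials resolve and that the bucket is reachable. This is a minimal pre-flight sketch using boto3 directly (not part of Alien Giraffe); the region and bucket name are placeholders for your own values:

```python
import boto3

# Confirm which AWS identity the local credentials resolve to
print(boto3.client("sts").get_caller_identity()["Arn"])

# Confirm the bucket is reachable with those credentials
boto3.client("s3", region_name="us-west-2").head_bucket(Bucket="company-analytics")
```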
Install Alien Giraffe
```
pip install alien-giraffe
```

Configure S3 Data Source

Create an a10e.toml configuration file to define your S3 connection:
```toml
[[datasources.s3]]
name = "analytics-data"
region = "us-west-2"
bucket = "company-analytics"
prefix = "data/"                      # Optional: filter to specific prefix
access_key = "${AWS_ACCESS_KEY_ID}"   # Uses environment variable
secret_key = "${AWS_SECRET_ACCESS_KEY}"
```

Alternative: Configure S3 with IAM Role
If running on AWS infrastructure with IAM roles:
```toml
[[datasources.s3]]
name = "analytics-data"
region = "us-west-2"
bucket = "company-analytics"
use_iam_role = true  # Uses instance/task role instead of credentials
```

Define Your Data Schema
Alien Giraffe uses JSON Schema to define the structure of your data. This ensures type safety and enables powerful query optimization.
Option 1: Define Schema in Code
```python
import alien_giraffe
import json

# Initialize client
a10e = alien_giraffe.Client()

# Define schema for customer data stored in S3
schema = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "integer", "pk": True},
        "name": {"type": "string"},
        "email": {"type": "string"},
        "age": {"type": "integer"},
        "total_purchases": {"type": "number"},
        "account_status": {
            "type": "string",
            "enum": ["active", "inactive", "suspended"]
        },
        "created_date": {"type": "string", "format": "date"},
        "tags": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["customer_id", "name", "email"]
}

# Add the schema
a10e.add_schema("customers", json.dumps(schema))
```

Option 2: Define Schema from Natural Language
```python
# Let Alien Giraffe generate the schema from a description
schema = a10e.nl_schema(
    "Customer data with ID, name, email, age, purchase history, "
    "account status (active/inactive/suspended), and tags"
)

# Add the generated schema
a10e.add_schema("customers", schema)
```

Option 3: Define Schema in Configuration
Add to your a10e.toml:
```toml
[[schemas.customers]]
schema = """
{
  "type": "object",
  "properties": {
    "customer_id": {"type": "integer", "pk": true},
    "name": {"type": "string"},
    "email": {"type": "string"},
    "age": {"type": "integer"},
    "total_purchases": {"type": "number"},
    "account_status": {
      "type": "string",
      "enum": ["active", "inactive", "suspended"]
    },
    "created_date": {"type": "string", "format": "date"},
    "tags": {"type": "array", "items": {"type": "string"}}
  },
  "required": ["customer_id", "name", "email"]
}
"""
```

Load Data from S3
Once your schema is defined, Alien Giraffe will automatically discover matching files in your S3 bucket:
```python
# Load the schema - Alien Giraffe finds matching files in S3
a10e.load("customers")

# Or explicitly specify the S3 datasource
a10e.load("customers", datasources=["analytics-data"])
```

Alien Giraffe automatically:
- Scans your S3 bucket for files matching the schema
- Supports Parquet, CSV, and JSON formats
- Handles partitioned datasets
- Optimizes query performance with metadata caching
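
If discovery doesn't find what you expect, it can help to see roughly what the scan has to work with. The sketch below lists objects under the configured bucket and prefix with plain boto3, independently of Alien Giraffe's own discovery logic, using the values from the earlier a10e.toml example:

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
paginator = s3.get_paginator("list_objects_v2")

# List candidate files under the configured bucket/prefix
for page in paginator.paginate(Bucket="company-analytics", Prefix="data/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith((".parquet", ".csv", ".json")):
            print(f'{obj["Key"]} ({obj["Size"]} bytes)')
```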
Work with S3 Data
Use the pandas-like API to work with your S3 data without downloading it:
```python
# Get a dataframe handle (data not loaded yet)
df = a10e.df("customers")

# View first few rows
df.head()

# Check data types
df.dtypes

# Basic statistics
df.describe()

# Filter active customers
active_customers = df[df["account_status"] == "active"]

# Complex filtering
high_value_customers = df[
    (df["total_purchases"] > 1000)
    & (df["account_status"] == "active")
    & (df["age"] >= 25)
]

# Aggregations
avg_purchases_by_status = df.groupby("account_status")["total_purchases"].mean()

# Sort by purchase amount
top_customers = df.sort_values("total_purchases", ascending=False).head(100)
```

Run SQL Queries on S3 Data
```python
# Run SQL directly on S3 data
results = a10e.sql("""
    SELECT
        account_status,
        COUNT(*) AS customer_count,
        AVG(total_purchases) AS avg_purchases
    FROM customers
    WHERE age > 18
    GROUP BY account_status
    ORDER BY avg_purchases DESC
""")
```

Export Results
When you’re ready to materialize the results:
```python
# Export to pandas DataFrame
pandas_df = df.to_pandas()

# Export filtered results to CSV
active_customers.to_csv("active_customers.csv")

# Export to Parquet for better performance
high_value_customers.to_parquet("high_value_customers.parquet")

# Export as JSON
df.head(100).to_json("sample_customers.json")
```

Performance Optimization
Working with Large S3 Datasets
```python
# Process in chunks to manage memory
for chunk in df.iter_chunks(chunk_size=10000):
    # Process each chunk
    processed = chunk[chunk["total_purchases"] > 0]
    processed.to_csv("processed_customers.csv", mode="a")

# Use column selection to reduce data transfer
essential_cols = ["customer_id", "name", "total_purchases"]
df[essential_cols].to_pandas()
```

S3 File Organization Best Practices
For optimal performance with Alien Giraffe:
- Use Parquet format - Provides the best query performance
- Partition by date or category - Enables query pruning
- Compress files - Reduces S3 data transfer costs
- Use consistent naming - Helps with schema discovery
Example S3 structure:
```
s3://company-analytics/
├── customers/
│   ├── year=2024/
│   │   ├── month=01/
│   │   │   ├── customers_20240101.parquet
│   │   │   └── customers_20240102.parquet
│   │   └── month=02/
│   │       └── customers_20240201.parquet
```
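
As a rough sketch of how a layout like this might be produced upstream, the snippet below writes Hive-style year=/month= partitions with pandas and pyarrow. It is independent of Alien Giraffe; the column names follow the example schema, and the local output path is a placeholder you would then upload or sync to S3:

```python
import pandas as pd

# Example records matching the customer schema above
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Ada", "Grace"],
    "email": ["ada@example.com", "grace@example.com"],
    "total_purchases": [1250.0, 310.5],
    "account_status": ["active", "inactive"],
    "created_date": ["2024-01-15", "2024-02-01"],
})

# Derive partition columns from the date
created = pd.to_datetime(customers["created_date"])
customers["year"] = created.dt.year
customers["month"] = created.dt.strftime("%m")

# Write Hive-style partitions (year=.../month=...) matching the layout above.
# Requires pyarrow; writes locally here, or pass an s3:// URL if s3fs is installed.
customers.to_parquet("customers/", partition_cols=["year", "month"], index=False)
```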
Security Considerations

Column-Level Security for S3 Data
Configure column-level security in your a10e.toml:
```toml
[[security.table_rules]]
table = "customers"
blocked_columns = ["ssn", "credit_card"]  # Never accessible
masked_columns = ["email", "phone"]       # Returns masked values
```

S3 Bucket Policies
Ensure your S3 bucket policy allows only read access:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::123456789012:user/alien-giraffe-reader" }, "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::company-analytics/*", "arn:aws:s3:::company-analytics" ] } ]}Troubleshooting
Troubleshooting

Common Issues
- “No datasets found” error
  - Verify S3 credentials are correct
  - Check bucket and prefix configuration
  - Ensure files match the defined schema
- Slow query performance
  - Use Parquet format instead of CSV/JSON
  - Add partitioning to your S3 data
  - Limit the data scanned with filters
- Memory errors
  - Use iter_chunks() for large datasets
  - Select only needed columns
  - Apply filters before exporting
Next Steps
- Learn about PostgreSQL Integration
- Explore Multi-Source Data Access
- Configure Advanced Security Rules