Getting Started

This guide walks you through getting started with Alien Giraffe in a few minutes: installing the client, connecting to a data source, and running your first query.

Prerequisites

Python 3.8 or higher
Access to a data source (AWS S3, PostgreSQL, etc.)

Install Alien Giraffe

pip install alien-giraffe

Quick Example

Here’s a complete example using data stored in AWS S3:

1. Create Configuration File

Create an a10e.toml file in your project directory:

[[datasources.s3]]
name = "my-data"
region = "us-west-2"
bucket = "my-company-data"
access_key = "${AWS_ACCESS_KEY_ID}"     # Uses environment variable
secret_key = "${AWS_SECRET_ACCESS_KEY}"
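Because the configuration above reads both credentials from environment variables, it can help to confirm they are set before creating the client. The following is a minimal sketch using only the Python standard library. The helper `missing_env_vars` is hypothetical and not part of Alien Giraffe, and the sketch assumes the `${NAME}` placeholders are resolved from the process environment.

```python
import os

# Environment variables referenced by the ${...} placeholders in a10e.toml
REQUIRED_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment.

    Hypothetical helper for illustration; not an Alien Giraffe API.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Set these variables before starting:", ", ".join(missing))
    else:
        print("All credentials found in the environment.")
```

Running this check first turns a missing credential into a clear message instead of a connection error later on.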

2. Define Your Data Schema

import alien_giraffe
import json

# Initialize client
a10e = alien_giraffe.Client()

# Define schema for patient data (no PII fields such as names or emails are included)
schema = {
    "type": "object",
    "properties": {
        "patient_id": {
            "type": "integer",
            "description": "Unique anonymous identifier for the patient"
        },
        "diagnosis": {
            "type": "string",
            "description": "Patient's diagnosis code"
        },
        "medicine": {
            "type": "string",
            "description": "Prescribed medicine for the patient"
        },
        "income": {
            "type": "integer",
            "description": "Annual income (converted from string to int)"
        },
        "income_bracket": {
            "type": "string",
            "description": "Income range: low (<$70k), medium ($70k-$120k), high (>$120k)",
            "enum": ["low", "medium", "high"]
        }
    },
    "required": ["patient_id", "diagnosis"]
}

# Add the schema
a10e.add_schema("patient_data", json.dumps(schema))
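The `income_bracket` description implies a mapping from `income` to one of the three enum values. A minimal sketch of that mapping follows. The function `income_bracket` is a hypothetical helper, not an Alien Giraffe function, and it assumes that the boundary values $70,000 and $120,000 both fall in the medium bracket, as the ranges in the description suggest.

```python
def income_bracket(income: int) -> str:
    """Map an annual income in dollars to the schema's enum values.

    Hypothetical helper; the boundaries follow the schema description:
    low (<$70k), medium ($70k-$120k), high (>$120k).
    """
    if income < 70_000:
        return "low"
    if income <= 120_000:
        return "medium"
    return "high"


if __name__ == "__main__":
    for sample in (45_000, 70_000, 120_000, 250_000):
        print(sample, income_bracket(sample))
```

Writing the boundaries down this way makes it explicit which bracket an income of exactly $70,000 or $120,000 belongs to, which the description alone leaves to interpretation.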

3. Query Your Data

# Load data matching your schema
a10e.load("patient_data")

# Get a DataFrame handle
df = a10e.df("patient_data")

# Run queries using pandas-like syntax
high_income_patients = df[df["income_bracket"] == "high"]
print(high_income_patients.head())

# Or use SQL
results = a10e.sql("""
    SELECT diagnosis, COUNT(*) AS patient_count, income_bracket
    FROM patient_data
    WHERE income > 50000
    GROUP BY diagnosis, income_bracket
    ORDER BY patient_count DESC
    LIMIT 10
""")
print(results)
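To see what the SQL query above computes, here is a self-contained sketch that runs the same statement against a few made-up rows. It uses Python's built-in `sqlite3` module rather than Alien Giraffe, and the sample rows and diagnosis codes are invented purely for illustration.

```python
import sqlite3

# Hypothetical sample rows: (patient_id, diagnosis, income, income_bracket)
rows = [
    (1, "E11", 150_000, "high"),
    (2, "E11", 130_000, "high"),
    (3, "I10", 80_000, "medium"),
    (4, "I10", 40_000, "low"),  # excluded by the income > 50000 filter
]

# Build an in-memory table with the same columns the query refers to
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient_data ("
    "patient_id INTEGER, diagnosis TEXT, income INTEGER, income_bracket TEXT)"
)
conn.executemany("INSERT INTO patient_data VALUES (?, ?, ?, ?)", rows)

# Same query as in the guide: count patients per diagnosis and bracket
results = conn.execute("""
    SELECT diagnosis, COUNT(*) AS patient_count, income_bracket
    FROM patient_data
    WHERE income > 50000
    GROUP BY diagnosis, income_bracket
    ORDER BY patient_count DESC
    LIMIT 10
""").fetchall()
conn.close()

for diagnosis, patient_count, bracket in results:
    print(diagnosis, patient_count, bracket)
```

With these rows, the filter drops the one patient earning $40,000, and the remaining three fall into two groups: two high-income patients with diagnosis E11 and one medium-income patient with diagnosis I10.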

4. Export Results

# Export to pandas DataFrame
pandas_df = high_income_patients.to_pandas()

# Save to CSV
high_income_patients.to_csv("high_income_patients.csv")

Natural Language Schema Generation

Don’t want to write JSON schemas? Let Alien Giraffe generate them from a plain-English description:

# Describe your data in plain English
schema = a10e.nl_schema(
    "Patient data with diagnosis and income information, excluding PII like names, emails, addresses"
)

# Use the generated schema
a10e.add_schema("patient_data", schema)
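Since the description asks for PII to be excluded, you may want to review a generated schema before relying on it. The sketch below is one coarse way to do that. The helper `pii_like_fields` and the `PII_HINTS` list are hypothetical, and the code assumes the generated schema is a JSON Schema with a top-level `properties` object, supplied either as a dict or as a JSON string; the actual return type of `nl_schema` is not documented here.

```python
import json

# Substrings that suggest a field may hold PII; adjust for your own data
PII_HINTS = ("name", "email", "address", "phone", "ssn")


def pii_like_fields(schema):
    """Return property names in a JSON Schema that look like PII.

    Hypothetical helper; accepts a dict or a JSON string and matches
    field names case-insensitively against PII_HINTS.
    """
    if isinstance(schema, str):
        schema = json.loads(schema)
    properties = schema.get("properties", {})
    return sorted(
        field for field in properties
        if any(hint in field.lower() for hint in PII_HINTS)
    )


if __name__ == "__main__":
    example = {"properties": {"patient_id": {}, "diagnosis": {}, "email": {}}}
    print(pii_like_fields(example))
```

This is a name-based check only, so it can flag harmless fields (for example, one called `medicine_name`) and cannot detect PII stored under a neutral name. Treat it as a prompt for manual review rather than a guarantee.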

What’s Next?

Learn More About Data Sources

PostgreSQL Configuration - Set up secure database connections
AWS S3 Advanced Guide - Optimize S3 performance and costs
More Data Sources - Connect to Databricks, Snowflake, and more

Deploy in Production

Kubernetes Deployment - Deploy with Helm charts
Security Best Practices - Configure access controls and data masking

Advanced Features

Multi-Source Queries - Join data across S3, databases, and data warehouses
Column-Level Security - Mask or block sensitive data automatically
Performance Optimization - Handle datasets of any size efficiently

Getting Help

📖 Check the Data Sources section for detailed configuration guides
📧 Contact Support for assistance

Ready to dive deeper? Explore our Data Sources documentation for comprehensive guides on connecting to your specific data infrastructure.