Configure Data Sources
This guide walks you through connecting your first data source to Alien Giraffe. We’ll cover the two most common sources, PostgreSQL and Amazon S3, then show you where to find configuration options for other source types.
Before You Begin
Make sure you have:
- Alien Giraffe installed and running (Installation Guide)
- Access credentials for your data source
- Network connectivity from Alien Giraffe to your data source (see the quick check below)
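Before going further, it can be worth sanity-checking that last item from the machine running Alien Giraffe. A minimal sketch using standard networking tools, assuming a PostgreSQL source on the default port:

```bash
# Check that the database host answers on its port
nc -zv db.company.com 5432

# For PostgreSQL specifically, pg_isready ships with the client tools
pg_isready -h db.company.com -p 5432
```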
Quick Start: PostgreSQL
PostgreSQL is one of the most common data sources. Let’s connect one step by step.
Step 1: Gather Connection Information
You’ll need:
- Host: Database server hostname or IP (e.g., `db.company.com`)
- Port: Usually `5432`
- Database: Database name (e.g., `production_db`)
- Username: Read-only user (recommended)
- Password: Database password
Step 2: Create a Read-Only User (Recommended)
For security, create a dedicated read-only user:
```bash
# Connect to PostgreSQL as an admin
psql -h db.company.com -U admin -d production_db
```

```sql
-- Create read-only user
CREATE USER alien_giraffe_reader WITH PASSWORD 'secure-password-here';

-- Grant connection privileges
GRANT CONNECT ON DATABASE production_db TO alien_giraffe_reader;

-- Grant schema access
GRANT USAGE ON SCHEMA public TO alien_giraffe_reader;

-- Grant SELECT on all tables
GRANT SELECT ON ALL TABLES IN SCHEMA public TO alien_giraffe_reader;

-- Ensure future tables are accessible
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO alien_giraffe_reader;
```
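Before moving on, you can confirm the new user actually works. A quick smoke test reusing the host and credentials from above; the query is purely illustrative:

```bash
# Connect as the read-only user and list a few visible tables
PGPASSWORD='secure-password-here' psql -h db.company.com -U alien_giraffe_reader -d production_db \
  -c "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' LIMIT 5;"
```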
Step 3: Create Source Configuration
Create a source configuration file `sources/production-postgres.yaml`:
```yaml
apiVersion: v1
kind: Source
metadata:
  name: production-postgres
  namespace: production
  description: Production PostgreSQL database
  owner: backend-team
spec:
  type: postgresql
  version: "15.3"
  connection:
    host: db.company.com
    port: 5432
    database: production_db
    ssl: true
    sslMode: require
  credentials:
    # Option 1: Reference to secret manager
    secretRef: postgres-production-creds
    # Option 2: Environment variables
    # username: ${POSTGRES_USER}
    # password: ${POSTGRES_PASSWORD}
  classification:
    criticality: high
    dataTypes: [customer-data, transaction-data]
    compliance: [gdpr, sox]
    retention: 7y
  discovery:
    enabled: true
    includeSchemas: [public]
    excludeTables: [temp_*, migrations]
```
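Optionally, verify the file parses as valid YAML before registering it. One way, assuming Python 3 with PyYAML is installed:

```bash
# Fails loudly if the file is not well-formed YAML
python3 -c "import sys, yaml; yaml.safe_load(open(sys.argv[1]))" sources/production-postgres.yaml
```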
Step 4: Create Credentials Secret
Store credentials securely:
```bash
# Create secret file
cat > postgres-production-creds.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: postgres-production-creds
type: credentials
data:
  username: alien_giraffe_reader
  password: secure-password-here
EOF

# Apply the secret
a10e secret apply -f postgres-production-creds.yaml

# Verify it was created
a10e secret list
```
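Keep in mind that the secret file above contains a plaintext password. A little local hygiene is worth considering; this is illustrative, not an a10e requirement:

```bash
# Restrict permissions while the file exists...
chmod 600 postgres-production-creds.yaml

# ...and remove it once the secret has been applied
rm postgres-production-creds.yaml
```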
Step 5: Register the Source
```bash
# Apply the source configuration
a10e source apply -f sources/production-postgres.yaml

# Verify the source was registered
a10e source list

# Test connectivity
a10e source test production-postgres
```
Expected output:
```
✓ Connection successful
✓ Authentication successful
✓ Schema discovery: 12 tables found
✓ Source ready
```
Quick Start: Amazon S3
Now let’s connect an S3 bucket for object storage access.
Step 1: Gather S3 Information
You’ll need:
- Bucket name: e.g., `company-data-lake`
- Region: e.g., `us-west-2`
- Access credentials: AWS Access Key ID and Secret Access Key (or IAM role); a quick identity check follows this list
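If you’re unsure which AWS identity your credentials resolve to, the AWS CLI can tell you:

```bash
# Confirms the active AWS credentials and account
aws sts get-caller-identity
```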
Step 2: Create IAM Policy (Recommended)
Create a least-privilege IAM policy:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::company-data-lake", "arn:aws:s3:::company-data-lake/*" ] } ]}Create an IAM user or role with this policy attached.
Step 3: Create Source Configuration
Create `sources/data-lake-s3.yaml`:
```yaml
apiVersion: v1
kind: Source
metadata:
  name: data-lake
  namespace: production
  description: Production data lake on S3
  owner: data-engineering
spec:
  type: s3
  connection:
    bucket: company-data-lake
    region: us-west-2
  credentials:
    # Option 1: IAM Role (recommended for AWS deployments)
    assumeRole: arn:aws:iam::123456789012:role/AlienGiraffeS3Access
    # Option 2: Access keys (use secret reference)
    # secretRef: s3-data-lake-creds
  classification:
    criticality: high
    dataTypes: [analytics, logs, customer-data]
    compliance: [gdpr]
    retention: 2y
  discovery:
    enabled: true
    includePrefix: [/analytics/, /processed/]
    excludePrefix: [/temp/, /_scratch/]
```
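Before registering, you may want to confirm the bucket is reachable with the role or keys you intend to use. A quick check with the AWS CLI:

```bash
# List one of the prefixes the discovery config will scan
aws s3 ls s3://company-data-lake/analytics/ --region us-west-2
```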
Step 4: Create Credentials (if using access keys)
```bash
# Create secret for S3 credentials
cat > s3-data-lake-creds.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: s3-data-lake-creds
type: credentials
data:
  accessKeyId: AKIAIOSFODNN7EXAMPLE
  secretAccessKey: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
EOF

# Apply the secret
a10e secret apply -f s3-data-lake-creds.yaml
```
Step 5: Register the Source
```bash
# Apply the source configuration
a10e source apply -f sources/data-lake-s3.yaml

# Verify and test
a10e source list
a10e source test data-lake
```
Configuration Options by Source Type
Databases
PostgreSQL / MySQL / MariaDB:
```yaml
spec:
  type: postgresql # or mysql, mariadb
  connection:
    host: db.example.com
    port: 5432
    database: mydb
    ssl: true
    sslMode: require # require, verify-ca, verify-full
```
MongoDB:
```yaml
spec:
  type: mongodb
  connection:
    connectionString: mongodb+srv://cluster.mongodb.net
    database: mydb
    replicaSet: rs0
    readPreference: secondaryPreferred
```
Redis:
```yaml
spec:
  type: redis
  connection:
    host: redis.example.com
    port: 6379
    database: 0
    ssl: true
```
Snowflake:
```yaml
spec:
  type: snowflake
  connection:
    account: xy12345.us-east-1
    warehouse: COMPUTE_WH
    database: ANALYTICS
    schema: PUBLIC
```
Object Storage
Amazon S3:
```yaml
spec:
  type: s3
  connection:
    bucket: my-bucket
    region: us-west-2
    endpoint: s3.us-west-2.amazonaws.com # Optional
```
Google Cloud Storage:
```yaml
spec:
  type: gcs
  connection:
    bucket: my-bucket
    project: my-project
```
Azure Blob Storage:
```yaml
spec:
  type: azure-blob
  connection:
    storageAccount: mystorageaccount
    container: mycontainer
```
Data Warehouses
BigQuery:
```yaml
spec:
  type: bigquery
  connection:
    project: my-project
    dataset: analytics
```
Redshift:
```yaml
spec:
  type: redshift
  connection:
    host: cluster.region.redshift.amazonaws.com
    port: 5439
    database: analytics
```
Advanced Configuration
Connection Pooling
For high-throughput sources, configure connection pooling:
```yaml
spec:
  connection:
    # ... connection details ...
  pooling:
    enabled: true
    minConnections: 5
    maxConnections: 20
    connectionTimeout: 30s
    idleTimeout: 10m
```
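Keep `maxConnections` comfortably below the server’s own connection limit, which other applications also draw from. For PostgreSQL you can check that limit directly; a quick illustrative query:

```bash
# Show the server-side connection cap
psql -h db.company.com -U alien_giraffe_reader -d production_db -c "SHOW max_connections;"
```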
Read Replicas
Use read replicas for analytics workloads:
```yaml
spec:
  connection:
    host: db-replica.company.com # Replica host
    readOnly: true # Enforce read-only
  primarySource: production-postgres # Link to primary
```
SSL/TLS Configuration
Configure custom SSL certificates:
```yaml
spec:
  connection:
    ssl: true
    sslMode: verify-full
    sslCert: /path/to/client-cert.pem
    sslKey: /path/to/client-key.pem
    sslRootCert: /path/to/ca-cert.pem
```
Discovery Configuration
Fine-tune automatic schema discovery:
```yaml
spec:
  discovery:
    enabled: true
    schedule: "0 2 * * *" # Daily at 2 AM

    # For databases
    includeSchemas: [public, analytics]
    excludeSchemas: [temp, staging]
    excludeTables: [temp_*, test_*, _*]

    # For object storage
    includePrefix: [/data/, /processed/]
    excludePrefix: [/temp/, /.git/]

    # Data profiling
    profiling:
      enabled: true
      sampleSize: 10000
      detectPII: true
```
Managing Sources
List All Sources
```bash
# List all configured sources
a10e source list

# Filter by namespace
a10e source list --namespace production

# Show detailed information
a10e source list --detailed
```
View Source Details
```bash
# Get source details
a10e source get production-postgres

# View in YAML format
a10e source get production-postgres -o yaml

# View discovered schemas/datasets
a10e source datasets production-postgres
```
Test Connectivity
```bash
# Test source connection
a10e source test production-postgres

# Run diagnostics
a10e source diagnose production-postgres
```
Update Source Configuration
```bash
# Edit the source YAML file
vim sources/production-postgres.yaml

# Apply changes
a10e source apply -f sources/production-postgres.yaml

# Or edit interactively
a10e source edit production-postgres
```
Delete a Source
```bash
# Delete a source (requires confirmation)
a10e source delete production-postgres

# Force delete without confirmation
a10e source delete production-postgres --force
```
Troubleshooting
Connection Failures
Problem: “Connection refused”
```bash
# Check network connectivity
ping db.company.com
telnet db.company.com 5432

# Verify firewall rules allow traffic
# Check if Alien Giraffe IP is whitelisted
```
Problem: “Authentication failed”
```bash
# Verify credentials
a10e secret get postgres-production-creds

# Test credentials manually
psql -h db.company.com -U alien_giraffe_reader -d production_db

# Check user permissions
# User might need CONNECT, USAGE, and SELECT grants
```
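To see exactly which grants the user holds, PostgreSQL’s information schema can be queried directly (illustrative; adjust host and user to your setup):

```bash
# List the table privileges granted to the reader user
psql -h db.company.com -U alien_giraffe_reader -d production_db \
  -c "SELECT table_schema, table_name, privilege_type FROM information_schema.role_table_grants WHERE grantee = 'alien_giraffe_reader';"
```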
Problem: “SSL connection failed”
```yaml
# Try different SSL modes
spec:
  connection:
    ssl: true
    sslMode: require # Try: disable, require, verify-ca, verify-full
```
Discovery Issues
Problem: “No tables discovered”
```bash
# Check schema permissions
a10e source diagnose production-postgres

# Verify includeSchemas configuration
# Ensure user has USAGE on schemas and SELECT on tables
```
Problem: “Discovery too slow”
```yaml
# Optimize discovery configuration
spec:
  discovery:
    # Exclude unnecessary schemas/tables
    excludeSchemas: [information_schema, pg_catalog]
    excludeTables: [temp_*, _*]

    # Disable profiling if not needed
    profiling:
      enabled: false
```
Performance Issues
Problem: “Slow query performance”
```yaml
# Enable connection pooling
spec:
  pooling:
    enabled: true
    maxConnections: 50
```
```yaml
# Use read replica for analytics
spec:
  connection:
    host: db-replica.company.com
```
Problem: “Too many connections”
```yaml
# Reduce connection limits
spec:
  pooling:
    maxConnections: 10
    connectionTimeout: 30s
```
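If you need to confirm how many connections Alien Giraffe actually holds, PostgreSQL exposes this in `pg_stat_activity`. An illustrative query, run as an admin:

```bash
# Count live connections opened by the reader user
psql -h db.company.com -U admin -d production_db \
  -c "SELECT count(*) FROM pg_stat_activity WHERE usename = 'alien_giraffe_reader';"
```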
Best Practices
Security
- Use Read-Only Users: Always create dedicated read-only database users
- Rotate Credentials: Implement regular credential rotation (30-90 days); a minimal rotation sketch follows this list
- Enable SSL/TLS: Always use encrypted connections for production
- Least Privilege: Grant access only to necessary schemas and tables
- Secret Management: Store credentials in secret managers, not configuration files
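As referenced above, rotating the PostgreSQL reader credential can be as simple as changing the password and re-applying the secret. A minimal sketch, assuming the names from the quick start; the new password is a placeholder:

```bash
# Change the password on the database side
psql -h db.company.com -U admin -d production_db \
  -c "ALTER USER alien_giraffe_reader WITH PASSWORD 'new-password-here';"

# Update the secret file with the new password, then re-apply it
a10e secret apply -f postgres-production-creds.yaml
```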
Performance
- Use Read Replicas: Separate analytics workloads from production
- Enable Connection Pooling: Reduce connection overhead
- Optimize Discovery: Exclude unnecessary schemas and tables
- Schedule Discovery: Run discovery during low-traffic periods
Organization
- Use Namespaces: Organize sources by environment (production, staging, dev)
- Consistent Naming: Use clear, descriptive source names
- Document Ownership: Assign owners and contacts to each source
- Tag Sources: Use classification metadata for compliance and discovery
Next Steps
Now that you have a data source configured:
- Create Your First Policy - Define who can access this source
- Sources Component Reference - Learn about advanced source features
- Security Best Practices - Secure your data sources
- Monitoring Guide - Monitor source health and performance
For source-specific guides: