
Configure Data Sources

This guide walks you through connecting your first data source to Alien Giraffe. We'll cover the two most common sources, PostgreSQL and Amazon S3, then provide quick-reference configuration for the other supported sources.

Make sure you have:

  • Alien Giraffe installed and running (Installation Guide)
  • Access credentials for your data source
  • Network connectivity from Alien Giraffe to your data source

PostgreSQL is one of the most common data sources. Let’s connect one step by step.

You’ll need:

  • Host: Database server hostname or IP (e.g., db.company.com)
  • Port: Usually 5432
  • Database: Database name (e.g., production_db)
  • Username: Read-only user (recommended)
  • Password: Database password
Step 2: Create a Read-Only User (Recommended)

For security, create a dedicated read-only user:

# Connect to PostgreSQL as an admin
psql -h db.company.com -U admin -d production_db
-- Create read-only user
CREATE USER alien_giraffe_reader WITH PASSWORD 'secure-password-here';
-- Grant connection privileges
GRANT CONNECT ON DATABASE production_db TO alien_giraffe_reader;
-- Grant schema access
GRANT USAGE ON SCHEMA public TO alien_giraffe_reader;
-- Grant SELECT on all tables
GRANT SELECT ON ALL TABLES IN SCHEMA public TO alien_giraffe_reader;
-- Ensure future tables are accessible
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT ON TABLES TO alien_giraffe_reader;
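
Before wiring the new role into Alien Giraffe, it's worth a quick sanity check that it can read but not write. The table name orders below is only a placeholder; substitute any real table in your database:

# Should succeed: the role can read (replace "orders" with a real table)
psql -h db.company.com -U alien_giraffe_reader -d production_db -c "SELECT count(*) FROM orders;"
# Should fail with "permission denied": the role cannot write
psql -h db.company.com -U alien_giraffe_reader -d production_db -c "INSERT INTO orders DEFAULT VALUES;"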

Create a source configuration file sources/production-postgres.yaml:

apiVersion: v1
kind: Source
metadata:
  name: production-postgres
  namespace: production
  description: Production PostgreSQL database
  owner: backend-team
spec:
  type: postgresql
  version: "15.3"
  connection:
    host: db.company.com
    port: 5432
    database: production_db
    ssl: true
    sslMode: require
  credentials:
    # Option 1: Reference to secret manager
    secretRef: postgres-production-creds
    # Option 2: Environment variables
    # username: ${POSTGRES_USER}
    # password: ${POSTGRES_PASSWORD}
  classification:
    criticality: high
    dataTypes: [customer-data, transaction-data]
    compliance: [gdpr, sox]
    retention: 7y
  discovery:
    enabled: true
    includeSchemas: [public]
    excludeTables: [temp_*, migrations]

Store credentials securely:

# Create secret file
cat > postgres-production-creds.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: postgres-production-creds
type: credentials
data:
  username: alien_giraffe_reader
  password: secure-password-here
EOF
# Apply the secret
a10e secret apply -f postgres-production-creds.yaml
# Verify it was created
a10e secret list
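
If you prefer Option 2 (environment variables) over the secret reference, export the variables referenced by the commented-out lines in the shell that runs the a10e CLI before applying the source below. Whether the ${...} placeholders are expanded at apply time or at connection time depends on your deployment, so treat this as a sketch:

# Variables referenced by the commented Option 2 in the source configuration
export POSTGRES_USER=alien_giraffe_reader
export POSTGRES_PASSWORD='secure-password-here'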
# Apply the source configuration
a10e source apply -f sources/production-postgres.yaml
# Verify the source was registered
a10e source list
# Test connectivity
a10e source test production-postgres

Expected output:

✓ Connection successful
✓ Authentication successful
✓ Schema discovery: 12 tables found
✓ Source ready

Now let’s connect an S3 bucket for object storage access.

You’ll need:

  • Bucket name: e.g., company-data-lake
  • Region: e.g., us-west-2
  • Access credentials: AWS Access Key ID and Secret Access Key (or IAM role)
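
As a quick sanity check of the bucket name and region before configuring the source (assuming the AWS CLI is installed and already has credentials with list access), you can try:

# Should list the top-level prefixes of the bucket
aws s3 ls s3://company-data-lake --region us-west-2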

Create a least-privilege IAM policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::company-data-lake",
        "arn:aws:s3:::company-data-lake/*"
      ]
    }
  ]
}

Create an IAM user or role with this policy attached.
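
If you manage IAM from the CLI rather than the console, the following is one way to do it. The policy file name, policy name, and user name are placeholders; adjust them to your account's conventions:

# Create the policy from the JSON document above (saved here as s3-readonly-policy.json)
aws iam create-policy --policy-name AlienGiraffeS3ReadOnly --policy-document file://s3-readonly-policy.json
# Attach it to the IAM user that Alien Giraffe will use
aws iam attach-user-policy --user-name alien-giraffe --policy-arn arn:aws:iam::123456789012:policy/AlienGiraffeS3ReadOnly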

Create sources/data-lake-s3.yaml:

apiVersion: v1
kind: Source
metadata:
  name: data-lake
  namespace: production
  description: Production data lake on S3
  owner: data-engineering
spec:
  type: s3
  connection:
    bucket: company-data-lake
    region: us-west-2
  credentials:
    # Option 1: IAM Role (recommended for AWS deployments)
    assumeRole: arn:aws:iam::123456789012:role/AlienGiraffeS3Access
    # Option 2: Access keys (use secret reference)
    # secretRef: s3-data-lake-creds
  classification:
    criticality: high
    dataTypes: [analytics, logs, customer-data]
    compliance: [gdpr]
    retention: 2y
  discovery:
    enabled: true
    includePrefix: [/analytics/, /processed/]
    excludePrefix: [/temp/, /_scratch/]

Step 4: Create Credentials (if using access keys)

# Create secret for S3 credentials
cat > s3-data-lake-creds.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: s3-data-lake-creds
type: credentials
data:
  accessKeyId: AKIAIOSFODNN7EXAMPLE
  secretAccessKey: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
EOF
# Apply the secret
a10e secret apply -f s3-data-lake-creds.yaml
# Apply the source configuration
a10e source apply -f sources/data-lake-s3.yaml
# Verify and test
a10e source list
a10e source test data-lake

The quick-reference connection blocks below cover the other supported source types.

PostgreSQL / MySQL / MariaDB:

spec:
  type: postgresql # or mysql, mariadb
  connection:
    host: db.example.com
    port: 5432
    database: mydb
    ssl: true
    sslMode: require # require, verify-ca, verify-full

MongoDB:

spec:
  type: mongodb
  connection:
    connectionString: mongodb+srv://cluster.mongodb.net
    database: mydb
    replicaSet: rs0
    readPreference: secondaryPreferred

Redis:

spec:
  type: redis
  connection:
    host: redis.example.com
    port: 6379
    database: 0
    ssl: true

Snowflake:

spec:
  type: snowflake
  connection:
    account: xy12345.us-east-1
    warehouse: COMPUTE_WH
    database: ANALYTICS
    schema: PUBLIC

Amazon S3:

spec:
  type: s3
  connection:
    bucket: my-bucket
    region: us-west-2
    endpoint: s3.us-west-2.amazonaws.com # Optional

Google Cloud Storage:

spec:
  type: gcs
  connection:
    bucket: my-bucket
    project: my-project

Azure Blob Storage:

spec:
  type: azure-blob
  connection:
    storageAccount: mystorageaccount
    container: mycontainer

BigQuery:

spec:
  type: bigquery
  connection:
    project: my-project
    dataset: analytics

Redshift:

spec:
  type: redshift
  connection:
    host: cluster.region.redshift.amazonaws.com
    port: 5439
    database: analytics

For high-throughput sources, configure connection pooling:

spec:
  connection:
    # ... connection details ...
  pooling:
    enabled: true
    minConnections: 5
    maxConnections: 20
    connectionTimeout: 30s
    idleTimeout: 10m

Use read replicas for analytics workloads:

spec:
  connection:
    host: db-replica.company.com # Replica host
    readOnly: true # Enforce read-only
    primarySource: production-postgres # Link to primary

Configure custom SSL certificates:

spec:
  connection:
    ssl: true
    sslMode: verify-full
    sslCert: /path/to/client-cert.pem
    sslKey: /path/to/client-key.pem
    sslRootCert: /path/to/ca-cert.pem
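
To confirm the server presents a certificate that your CA file actually verifies, you can inspect the TLS handshake directly. This uses OpenSSL's PostgreSQL STARTTLS support (OpenSSL 1.1.1 or newer) together with the paths from the example above:

# Look for "Verify return code: 0 (ok)" in the output
openssl s_client -starttls postgres -connect db.company.com:5432 -CAfile /path/to/ca-cert.pem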

Fine-tune automatic schema discovery:

spec:
  discovery:
    enabled: true
    schedule: "0 2 * * *" # Daily at 2 AM
    # For databases
    includeSchemas: [public, analytics]
    excludeSchemas: [temp, staging]
    excludeTables: [temp_*, test_*, _*]
    # For object storage
    includePrefix: [/data/, /processed/]
    excludePrefix: [/temp/, /.git/]
    # Data profiling
    profiling:
      enabled: true
      sampleSize: 10000
      detectPII: true
# List all configured sources
a10e source list
# Filter by namespace
a10e source list --namespace production
# Show detailed information
a10e source list --detailed
# Get source details
a10e source get production-postgres
# View in YAML format
a10e source get production-postgres -o yaml
# View discovered schemas/datasets
a10e source datasets production-postgres
# Test source connection
a10e source test production-postgres
# Run diagnostics
a10e source diagnose production-postgres
# Edit the source YAML file
vim sources/production-postgres.yaml
# Apply changes
a10e source apply -f sources/production-postgres.yaml
# Or edit interactively
a10e source edit production-postgres
# Delete a source (requires confirmation)
a10e source delete production-postgres
# Force delete without confirmation
a10e source delete production-postgres --force

Problem: “Connection refused”

# Check network connectivity
ping db.company.com
telnet db.company.com 5432
# Verify firewall rules allow traffic
# Check if Alien Giraffe IP is whitelisted
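
If telnet isn't installed, netcat performs the same port check:

# Succeeds only if the port is reachable from this machine
nc -zv db.company.com 5432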

Problem: “Authentication failed”

# Verify credentials
a10e secret get postgres-production-creds
# Test credentials manually
psql -h db.company.com -U alien_giraffe_reader -d production_db
# Check user permissions
# User might need CONNECT, USAGE, and SELECT grants

Problem: “SSL connection failed”

# Try different SSL modes
spec:
  connection:
    ssl: true
    sslMode: require # Try: disable, require, verify-ca, verify-full

Problem: “No tables discovered”

# Check schema permissions
a10e source diagnose production-postgres
# Verify includeSchemas configuration
# Ensure user has USAGE on schemas and SELECT on tables
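
You can also list the tables the read-only role can actually see. PostgreSQL's information_schema only shows objects the current user has privileges on, so an empty result usually points to missing grants:

# Run as the reader user; an empty result suggests missing USAGE/SELECT grants
psql -h db.company.com -U alien_giraffe_reader -d production_db -c \
  "SELECT table_schema, table_name FROM information_schema.tables WHERE table_schema = 'public' ORDER BY 1, 2;"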

Problem: “Discovery too slow”

# Optimize discovery configuration
spec:
  discovery:
    # Exclude unnecessary schemas/tables
    excludeSchemas: [information_schema, pg_catalog]
    excludeTables: [temp_*, _*]
    # Disable profiling if not needed
    profiling:
      enabled: false

Problem: “Slow query performance”

# Enable connection pooling
spec:
  pooling:
    enabled: true
    maxConnections: 50

# Use read replica for analytics
spec:
  connection:
    host: db-replica.company.com

Problem: “Too many connections”

# Reduce connection limits
spec:
  pooling:
    maxConnections: 10
    connectionTimeout: 30s
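
To see how many connections Alien Giraffe's user is currently holding open on the database side, you can query pg_stat_activity as an admin:

# Count open connections for the read-only user
psql -h db.company.com -U admin -d production_db -c \
  "SELECT count(*) FROM pg_stat_activity WHERE usename = 'alien_giraffe_reader';"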
Security:
  1. Use Read-Only Users: Always create dedicated read-only database users
  2. Rotate Credentials: Implement regular credential rotation (30-90 days); see the rotation sketch after this list
  3. Enable SSL/TLS: Always use encrypted connections for production
  4. Least Privilege: Grant access only to necessary schemas and tables
  5. Secret Management: Store credentials in secret managers, not configuration files

Performance:
  1. Use Read Replicas: Separate analytics workloads from production
  2. Enable Connection Pooling: Reduce connection overhead
  3. Optimize Discovery: Exclude unnecessary schemas and tables
  4. Schedule Discovery: Run discovery during low-traffic periods

Organization:
  1. Use Namespaces: Organize sources by environment (production, staging, dev)
  2. Consistent Naming: Use clear, descriptive source names
  3. Document Ownership: Assign owners and contacts to each source
  4. Tag Sources: Use classification metadata for compliance and discovery
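
A minimal rotation sketch for the PostgreSQL source configured earlier, assuming Alien Giraffe picks up the updated secret the next time it connects:

# 1. Set a new password on the database role
psql -h db.company.com -U admin -d production_db -c \
  "ALTER USER alien_giraffe_reader WITH PASSWORD 'new-password-here';"
# 2. Update the password in postgres-production-creds.yaml, then re-apply the secret
a10e secret apply -f postgres-production-creds.yaml
# 3. Confirm the source still connects
a10e source test production-postgres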

Now that you have a data source configured:

  1. Create Your First Policy - Define who can access this source
  2. Sources Component Reference - Learn about advanced source features
  3. Security Best Practices - Secure your data sources
  4. Monitoring Guide - Monitor source health and performance

For source-specific guides: