Configure Data Sources
This guide walks you through connecting your first data source to Alien Giraffe. We’ll cover the two most common sources, PostgreSQL and Amazon S3, then show you where to find configuration options for other source types.
Before You Begin
Make sure you have:
- Alien Giraffe installed and running (Installation Guide)
- Access credentials for your data source
- Network connectivity from Alien Giraffe to your data source (see the quick check below)
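Before going further, it can be worth sanity-checking that last item from the machine running Alien Giraffe. A minimal sketch using standard networking tools, assuming a PostgreSQL source on the default port:

```bash
# Check that the database host answers on its port
nc -zv db.company.com 5432

# For PostgreSQL specifically, pg_isready ships with the client tools
pg_isready -h db.company.com -p 5432
```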
Quick Start: PostgreSQL
PostgreSQL is one of the most common data sources. Let’s connect one step by step.
Step 1: Gather Connection Information
You’ll need:
- Host: Database server hostname or IP (e.g., `db.company.com`)
- Port: Usually `5432`
- Database: Database name (e.g., `production_db`)
- Username: Read-only user (recommended)
- Password: Database password
Step 2: Create a Read-Only User (Recommended)
For security, create a dedicated read-only user:
```bash
# Connect to PostgreSQL as an admin
psql -h db.company.com -U admin -d production_db
```

```sql
-- Create read-only user
CREATE USER alien_giraffe_reader WITH PASSWORD 'secure-password-here';

-- Grant connection privileges
GRANT CONNECT ON DATABASE production_db TO alien_giraffe_reader;

-- Grant schema access
GRANT USAGE ON SCHEMA public TO alien_giraffe_reader;

-- Grant SELECT on all tables
GRANT SELECT ON ALL TABLES IN SCHEMA public TO alien_giraffe_reader;

-- Ensure future tables are accessible
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO alien_giraffe_reader;
```
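Before moving on, you can confirm the new user actually works. A quick smoke test reusing the host and credentials from above; the query is purely illustrative:

```bash
# Connect as the read-only user and list a few visible tables
PGPASSWORD='secure-password-here' psql -h db.company.com -U alien_giraffe_reader -d production_db \
  -c "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' LIMIT 5;"
```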
Step 3: Create Source Configuration
Create a source configuration file `sources/production-postgres.yaml`:
```yaml
apiVersion: v1
kind: Source
metadata:
  name: production-postgres
  namespace: production
  description: Production PostgreSQL database
  owner: backend-team
spec:
  type: postgresql
  version: "15.3"
  connection:
    host: db.company.com
    port: 5432
    database: production_db
    ssl: true
    sslMode: require
  credentials:
    # Option 1: Reference to secret manager
    secretRef: postgres-production-creds
    # Option 2: Environment variables
    # username: ${POSTGRES_USER}
    # password: ${POSTGRES_PASSWORD}
  classification:
    criticality: high
    dataTypes: [customer-data, transaction-data]
    compliance: [gdpr, sox]
    retention: 7y
  discovery:
    enabled: true
    includeSchemas: [public]
    excludeTables: [temp_*, migrations]
```
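Optionally, verify the file parses as valid YAML before registering it. One way, assuming Python 3 with PyYAML is installed:

```bash
# Fails loudly if the file is not well-formed YAML
python3 -c "import sys, yaml; yaml.safe_load(open(sys.argv[1]))" sources/production-postgres.yaml
```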
Step 4: Create Credentials Secret
Store credentials securely:
```bash
# Create secret file
cat > postgres-production-creds.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: postgres-production-creds
type: credentials
data:
  username: alien_giraffe_reader
  password: secure-password-here
EOF

# Apply the secret
a10e secret apply -f postgres-production-creds.yaml

# Verify it was created
a10e secret list
```
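Keep in mind that the secret file above contains a plaintext password. A little local hygiene is worth considering; this is illustrative, not an a10e requirement:

```bash
# Restrict permissions while the file exists...
chmod 600 postgres-production-creds.yaml

# ...and remove it once the secret has been applied
rm postgres-production-creds.yaml
```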
Step 5: Register the Source
```bash
# Apply the source configuration
a10e source apply -f sources/production-postgres.yaml

# Verify the source was registered
a10e source list

# Test connectivity
a10e source test production-postgres
```
Expected output:
```
✓ Connection successful
✓ Authentication successful
✓ Schema discovery: 12 tables found
✓ Source ready
```
Quick Start: Amazon S3
Now let’s connect an S3 bucket for object storage access.
Step 1: Gather S3 Information
You’ll need:
- Bucket name: e.g., `company-data-lake`
- Region: e.g., `us-west-2`
- Access credentials: AWS Access Key ID and Secret Access Key (or IAM role); a quick identity check follows this list
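If you’re unsure which AWS identity your credentials resolve to, the AWS CLI can tell you:

```bash
# Confirms the active AWS credentials and account
aws sts get-caller-identity
```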
Step 2: Create IAM Policy (Recommended)
Create a least-privilege IAM policy:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::company-data-lake", "arn:aws:s3:::company-data-lake/*" ] } ]}Create an IAM user or role with this policy attached.
Step 3: Create Source Configuration
Create `sources/data-lake-s3.yaml`:
```yaml
apiVersion: v1
kind: Source
metadata:
  name: data-lake
  namespace: production
  description: Production data lake on S3
  owner: data-engineering
spec:
  type: s3
  connection:
    bucket: company-data-lake
    region: us-west-2
  credentials:
    # Option 1: IAM Role (recommended for AWS deployments)
    assumeRole: arn:aws:iam::123456789012:role/AlienGiraffeS3Access
    # Option 2: Access keys (use secret reference)
    # secretRef: s3-data-lake-creds
  classification:
    criticality: high
    dataTypes: [analytics, logs, customer-data]
    compliance: [gdpr]
    retention: 2y
  discovery:
    enabled: true
    includePrefix: [/analytics/, /processed/]
    excludePrefix: [/temp/, /_scratch/]
```
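Before registering, you may want to confirm the bucket is reachable with the role or keys you intend to use. A quick check with the AWS CLI:

```bash
# List one of the prefixes the discovery config will scan
aws s3 ls s3://company-data-lake/analytics/ --region us-west-2
```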
Step 4: Create Credentials (if using access keys)
```bash
# Create secret for S3 credentials
cat > s3-data-lake-creds.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: s3-data-lake-creds
type: credentials
data:
  accessKeyId: AKIAIOSFODNN7EXAMPLE
  secretAccessKey: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
EOF

# Apply the secret
a10e secret apply -f s3-data-lake-creds.yaml
```
Step 5: Register the Source
```bash
# Apply the source configuration
a10e source apply -f sources/data-lake-s3.yaml

# Verify and test
a10e source list
a10e source test data-lake
```
Configuration Options by Source Type
Databases
PostgreSQL / MySQL / MariaDB:
```yaml
spec:
  type: postgresql # or mysql, mariadb
  connection:
    host: db.example.com
    port: 5432
    database: mydb
    ssl: true
    sslMode: require # require, verify-ca, verify-full
```
MongoDB:
```yaml
spec:
  type: mongodb
  connection:
    connectionString: mongodb+srv://cluster.mongodb.net
    database: mydb
    replicaSet: rs0
    readPreference: secondaryPreferred
```
Redis:
```yaml
spec:
  type: redis
  connection:
    host: redis.example.com
    port: 6379
    database: 0
    ssl: true
```
Snowflake:
```yaml
spec:
  type: snowflake
  connection:
    account: xy12345.us-east-1
    warehouse: COMPUTE_WH
    database: ANALYTICS
    schema: PUBLIC
```
Object Storage
Amazon S3:
```yaml
spec:
  type: s3
  connection:
    bucket: my-bucket
    region: us-west-2
    endpoint: s3.us-west-2.amazonaws.com # Optional
```
Google Cloud Storage:
```yaml
spec:
  type: gcs
  connection:
    bucket: my-bucket
    project: my-project
```
Azure Blob Storage:
```yaml
spec:
  type: azure-blob
  connection:
    storageAccount: mystorageaccount
    container: mycontainer
```
Data Warehouses
BigQuery:
```yaml
spec:
  type: bigquery
  connection:
    project: my-project
    dataset: analytics
```
Redshift:
```yaml
spec:
  type: redshift
  connection:
    host: cluster.region.redshift.amazonaws.com
    port: 5439
    database: analytics
```
Advanced Configuration
Connection Pooling
For high-throughput sources, configure connection pooling:
```yaml
spec:
  connection:
    # ... connection details ...
  pooling:
    enabled: true
    minConnections: 5
    maxConnections: 20
    connectionTimeout: 30s
    idleTimeout: 10m
```
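Keep `maxConnections` comfortably below the server’s own connection limit, which other applications also draw from. For PostgreSQL you can check that limit directly; a quick illustrative query:

```bash
# Show the server-side connection cap
psql -h db.company.com -U alien_giraffe_reader -d production_db -c "SHOW max_connections;"
```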
Read Replicas
Use read replicas for analytics workloads:
```yaml
spec:
  connection:
    host: db-replica.company.com # Replica host
    readOnly: true # Enforce read-only
  primarySource: production-postgres # Link to primary
```
SSL/TLS Configuration
Configure custom SSL certificates:
```yaml
spec:
  connection:
    ssl: true
    sslMode: verify-full
    sslCert: /path/to/client-cert.pem
    sslKey: /path/to/client-key.pem
    sslRootCert: /path/to/ca-cert.pem
```
Discovery Configuration
Fine-tune automatic schema discovery:
```yaml
spec:
  discovery:
    enabled: true
    schedule: "0 2 * * *" # Daily at 2 AM

    # For databases
    includeSchemas: [public, analytics]
    excludeSchemas: [temp, staging]
    excludeTables: [temp_*, test_*, _*]

    # For object storage
    includePrefix: [/data/, /processed/]
    excludePrefix: [/temp/, /.git/]

    # Data profiling
    profiling:
      enabled: true
      sampleSize: 10000
      detectPII: true
```
Managing Sources
List All Sources
```bash
# List all configured sources
a10e source list

# Filter by namespace
a10e source list --namespace production

# Show detailed information
a10e source list --detailed
```
View Source Details
```bash
# Get source details
a10e source get production-postgres

# View in YAML format
a10e source get production-postgres -o yaml

# View discovered schemas/datasets
a10e source datasets production-postgres
```
Test Connectivity
```bash
# Test source connection
a10e source test production-postgres

# Run diagnostics
a10e source diagnose production-postgres
```
Update Source Configuration
```bash
# Edit the source YAML file
vim sources/production-postgres.yaml

# Apply changes
a10e source apply -f sources/production-postgres.yaml

# Or edit interactively
a10e source edit production-postgres
```
Delete a Source
```bash
# Delete a source (requires confirmation)
a10e source delete production-postgres

# Force delete without confirmation
a10e source delete production-postgres --force
```
Troubleshooting
Connection Failures
Problem: “Connection refused”
```bash
# Check network connectivity
ping db.company.com
telnet db.company.com 5432

# Verify firewall rules allow traffic
# Check if Alien Giraffe IP is whitelisted
```
Problem: “Authentication failed”
```bash
# Verify credentials
a10e secret get postgres-production-creds

# Test credentials manually
psql -h db.company.com -U alien_giraffe_reader -d production_db

# Check user permissions
# User might need CONNECT, USAGE, and SELECT grants
```
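To see exactly which grants the user holds, PostgreSQL’s information schema can be queried directly (illustrative; adjust host and user to your setup):

```bash
# List the table privileges granted to the reader user
psql -h db.company.com -U alien_giraffe_reader -d production_db \
  -c "SELECT table_schema, table_name, privilege_type FROM information_schema.role_table_grants WHERE grantee = 'alien_giraffe_reader';"
```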
Problem: “SSL connection failed”
```yaml
# Try different SSL modes
spec:
  connection:
    ssl: true
    sslMode: require # Try: disable, require, verify-ca, verify-full
```
Discovery Issues
Problem: “No tables discovered”
```bash
# Check schema permissions
a10e source diagnose production-postgres

# Verify includeSchemas configuration
# Ensure user has USAGE on schemas and SELECT on tables
```
Problem: “Discovery too slow”
```yaml
# Optimize discovery configuration
spec:
  discovery:
    # Exclude unnecessary schemas/tables
    excludeSchemas: [information_schema, pg_catalog]
    excludeTables: [temp_*, _*]

    # Disable profiling if not needed
    profiling:
      enabled: false
```
Performance Issues
Problem: “Slow query performance”
```yaml
# Enable connection pooling
spec:
  pooling:
    enabled: true
    maxConnections: 50
```
```yaml
# Use read replica for analytics
spec:
  connection:
    host: db-replica.company.com
```
Problem: “Too many connections”
```yaml
# Reduce connection limits
spec:
  pooling:
    maxConnections: 10
    connectionTimeout: 30s
```
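If you need to confirm how many connections Alien Giraffe actually holds, PostgreSQL exposes this in `pg_stat_activity`. An illustrative query, run as an admin:

```bash
# Count live connections opened by the reader user
psql -h db.company.com -U admin -d production_db \
  -c "SELECT count(*) FROM pg_stat_activity WHERE usename = 'alien_giraffe_reader';"
```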
Best Practices
Security
- Use Read-Only Users: Always create dedicated read-only database users
- Rotate Credentials: Implement regular credential rotation (30-90 days); a minimal rotation sketch follows this list
- Enable SSL/TLS: Always use encrypted connections for production
- Least Privilege: Grant access only to necessary schemas and tables
- Secret Management: Store credentials in secret managers, not configuration files
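As referenced above, rotating the PostgreSQL reader credential can be as simple as changing the password and re-applying the secret. A minimal sketch, assuming the names from the quick start; the new password is a placeholder:

```bash
# Change the password on the database side
psql -h db.company.com -U admin -d production_db \
  -c "ALTER USER alien_giraffe_reader WITH PASSWORD 'new-password-here';"

# Update the secret file with the new password, then re-apply it
a10e secret apply -f postgres-production-creds.yaml
```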
Performance
- Use Read Replicas: Separate analytics workloads from production
- Enable Connection Pooling: Reduce connection overhead
- Optimize Discovery: Exclude unnecessary schemas and tables
- Schedule Discovery: Run discovery during low-traffic periods
Organization
- Use Namespaces: Organize sources by environment (production, staging, dev)
- Consistent Naming: Use clear, descriptive source names
- Document Ownership: Assign owners and contacts to each source
- Tag Sources: Use classification metadata for compliance and discovery
Next Steps
Now that you have a data source configured:
- Create Your First Policy - Define who can access this source
- Sources Component Reference - Learn about advanced source features
- Security Best Practices - Secure your data sources
- Monitoring Guide - Monitor source health and performance
For source-specific guides: