Resources

Resources represent what data can be accessed in the Alien Giraffe access control model. This component manages the catalog of databases, object storage, data warehouses, and datasets. Policies reference resources in their resources: field to specify which data a subject is allowed to access.

Resources are one of the five core components that policies coordinate. When you define a policy, the resources: field specifies what data is being protected. This component provides the infrastructure for cataloging and classifying those data systems—registering databases, discovering datasets, and tracking data sensitivity.

Instead of managing access separately for each database, object store, or data warehouse, Alien Giraffe provides a centralized resource registry. Each resource is configured once with connection details, credentials, and metadata, then referenced in policies and access requests.
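
As a sketch of how that reference works, a minimal policy might point at a registered resource like this; only the resources: field is described on this page, so the Policy shape and the other field names are assumptions:

# Minimal policy sketch; everything except resources: is assumed.
apiVersion: v1
kind: Policy
metadata:
  name: analytics-read
  namespace: production
spec:
  subjects: [data-team]             # who is granted access (assumed field)
  resources: [production-postgres]  # registered resource names
  channels: [sql]                   # how access occurs (assumed field)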

Why Catalog Resources?

  • Centralized Management - One place to manage all data systems
  • Data Discovery - Understand what data exists and where
  • Access Control - Unified policies across heterogeneous systems
  • Data Classification - Track data criticality and sensitivity
  • Credential Rotation - Automate credential management

Supported Resource Types

Relational Databases:

  • PostgreSQL - Version 10+
  • MySQL - Version 5.7+
  • MariaDB - Version 10.3+
  • Microsoft SQL Server - 2017+
  • Oracle - 12c+

NoSQL Databases:

  • MongoDB - Version 4.0+
  • Redis - Version 5.0+
  • Cassandra - Version 3.0+
  • DynamoDB - AWS managed
  • Elasticsearch - Version 7.0+

Graph Databases:

  • Neo4j - Version 4.0+ (Coming Soon)
  • ArangoDB - Version 3.7+ (Coming Soon)

Cloud Object Storage:

  • Amazon S3 - Including S3-compatible (MinIO, Ceph)
  • Google Cloud Storage - Standard and regional buckets
  • Azure Blob Storage - All tiers
  • HDFS - Hadoop Distributed File System

File System Storage:

  • NFS - Network File System shares
  • SFTP - SSH File Transfer Protocol
  • SSH - Direct SSH access to filesystem paths

Data Warehouses:

  • Snowflake - All editions
  • Google BigQuery - Standard and enterprise
  • Amazon Redshift - Provisioned and Serverless
  • Databricks - SQL warehouses
  • Azure Synapse - Dedicated and serverless
  • Apache Druid - Real-time analytics (Coming Soon)
  • ClickHouse - OLAP database (Coming Soon)
  • Presto/Trino - Distributed SQL queries (Coming Soon)
  • Apache Spark - Via JDBC/Thrift (Coming Soon)

Every resource includes descriptive metadata:

Identification:

  • name - Unique identifier for the resource
  • namespace - Organizational grouping (production, staging, finance)
  • description - Human-readable description
  • owner - Team or individual responsible

Classification:

  • criticality - Business impact (critical, high, medium, low, auto)
  • dataTypes - Categories of data (pii, financial, logs, analytics)
  • retention - Data retention period (30d, 1y, 7y, indefinite)

When classification is set to auto, Alien Giraffe automatically classifies the resource by scanning data patterns, column names, and content to identify sensitive data types and determine appropriate criticality levels.

Technical:

  • type - Database engine or storage system
  • version - Software version
  • region - Geographic location or cloud region
  • environment - Environment type (production, staging, development)
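
Putting the three groups together, a resource definition might carry these fields as follows; values are illustrative, and the placement of region and environment is assumed, since the examples below only show region for object storage:

metadata:
  name: orders-db                  # identification
  namespace: production
  description: Order management database
  owner: commerce-team
spec:
  type: postgresql                 # technical
  version: "15.3"
  region: us-west-2                # placement assumed
  environment: production          # placement assumed
  classification:                  # classification
    criticality: high
    dataTypes: [pii, financial]
    retention: 7y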

Resources contain datasets (tables, collections, buckets):

For Databases:

  • Schema/Database - Logical grouping (e.g., public, sales, analytics)
  • Tables/Collections - Individual data containers
  • Views - Derived datasets
  • Schemas - Structural definitions

For Object Storage:

  • Buckets/Containers - Top-level organization
  • Prefixes/Paths - Hierarchical organization (e.g., /logs/2025/)
  • Objects/Files - Individual data files
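
This page does not define a syntax for referencing an individual dataset inside a resource; a path-style form is one plausible shape, shown here purely as a hypothetical:

# Hypothetical dataset-path references; not a documented format.
resources:
  - production-postgres                # entire resource
  - production-postgres/sales/orders   # one table (assumed form)
  - production-data-lake/logs/2025/*   # one object prefix (assumed form)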

Connection Methods

Direct Connection:

  • Alien Giraffe connects directly to the data source
  • Requires network connectivity and credentials
  • Best for cloud databases and managed services

Proxy Connection:

  • Alien Giraffe connects via a proxy/bastion host
  • Useful for on-premises databases or private networks
  • Supports SSH tunneling and jump hosts
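
A sketch of what a proxied connection could look like; the proxy block and its field names are assumptions for illustration, not documented configuration:

spec:
  type: postgresql
  connection:
    host: db.onprem.internal
    port: 5432
    proxy:                       # assumed field, shown for illustration
      type: ssh-tunnel
      host: bastion.corp.internal
      port: 22
      credentials:
        privateKeyRef: bastion-ssh-key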

Agent-Based:

  • Lightweight agent runs in the same network as data source
  • Agent communicates with Alien Giraffe control plane
  • Best for air-gapped or highly restricted environments

Example Configurations

PostgreSQL Database:

apiVersion: v1
kind: Source
metadata:
  name: production-postgres
  namespace: production
  description: Main application database
  owner: backend-team
spec:
  type: postgresql
  version: "15.3"
  connection:
    host: db.production.internal
    port: 5432
    database: app_production
    ssl: true
    sslMode: require
  credentials:
    secretRef: postgres-credentials # Reference to secret store
    rotation:
      enabled: true
      period: 30d
  classification:
    criticality: critical
    dataTypes: [pii, financial, customer-data]
    retention: 7y
  discovery:
    enabled: true # Automatically discover schemas/tables
    schedule: "0 2 * * *" # Daily at 2 AM
    includeSchemas: [public, sales, support]
    excludeTables: [temp_*, migration_*]

Amazon S3 Data Lake:

apiVersion: v1
kind: Source
metadata:
  name: production-data-lake
  namespace: production
  description: Analytics data lake
  owner: data-engineering
spec:
  type: s3
  connection:
    bucket: company-data-lake-prod
    region: us-west-2
    endpoint: s3.us-west-2.amazonaws.com # Optional, for S3-compatible
  credentials:
    awsCredentials:
      assumeRole: arn:aws:iam::123456789012:role/AlienGiraffeAccess
      externalId: unique-external-id
  classification:
    criticality: high
    dataTypes: [analytics, logs, customer-data]
    retention: 2y
  discovery:
    enabled: true
    includePrefix: [/analytics/, /logs/]
    excludePrefix: [/temp/, /_scratch/]

NFS File Storage:

apiVersion: v1
kind: Source
metadata:
  name: shared-file-storage
  namespace: production
  description: Shared NFS storage for data exports
  owner: data-engineering
spec:
  type: nfs
  connection:
    host: nfs.production.internal
    port: 2049
    exportPath: /exports/data
    version: nfs4 # NFS protocol version
  paths:
    - /exports/data/analytics
    - /exports/data/reports
    - /exports/data/backups
  credentials:
    mountOptions:
      - ro # Read-only mount
      - noexec
      - nosuid
  classification:
    criticality: medium
    dataTypes: [analytics, reports]
    retention: 90d
  discovery:
    enabled: true
    includePattern: ["*.csv", "*.parquet", "*.json"]
    excludePattern: ["*.tmp", "*.lock"]

SFTP Server:

apiVersion: v1
kind: Source
metadata:
  name: sftp-data-transfer
  namespace: production
  description: SFTP server for secure file transfers
  owner: data-ops
spec:
  type: sftp
  connection:
    host: sftp.company.com
    port: 22
    username: alien-giraffe
  paths:
    - /data/incoming
    - /data/processed
    - /data/archive
  credentials:
    authMethod: ssh-key
    privateKeyRef: sftp-private-key # Reference to SSH private key
    passphrase: sftp-key-passphrase # Optional passphrase for key
  classification:
    criticality: high
    dataTypes: [customer-uploads, file-transfers]
    retention: 30d
  discovery:
    enabled: true
    schedule: "0 */6 * * *" # Every 6 hours
    followSymlinks: false

SSH Filesystem Access:

apiVersion: v1
kind: Source
metadata:
  name: remote-file-server
  namespace: production
  description: Direct SSH access to remote filesystem
  owner: platform-team
spec:
  type: ssh-fs
  connection:
    host: files.production.internal
    port: 22
    username: data-access
  paths:
    - /var/data/logs
    - /var/data/exports
    - /mnt/backup/datasets
  credentials:
    authMethod: ssh-key
    privateKeyRef: ssh-access-key
  classification:
    criticality: medium
    dataTypes: [logs, system-data]
    retention: 60d
  discovery:
    enabled: true
    maxDepth: 3 # Limit directory traversal depth
    excludePattern: ["/var/data/logs/debug/*"]

Snowflake Data Warehouse:

apiVersion: v1
kind: Source
metadata:
  name: analytics-warehouse
  namespace: production
  description: Enterprise data warehouse
  owner: data-team
spec:
  type: snowflake
  version: enterprise
  connection:
    account: xy12345.us-east-1
    warehouse: COMPUTE_WH
    database: ANALYTICS
    schema: PUBLIC
  credentials:
    secretRef: snowflake-credentials
    rotation:
      enabled: true
      period: 90d
  classification:
    criticality: high
    dataTypes: [analytics, aggregated-metrics]
    retention: indefinite
  resourceManagement:
    warehouseSize: MEDIUM
    autoSuspend: 300 # Auto-suspend after 5 minutes idle
    autoResume: true

MongoDB Database:

apiVersion: v1
kind: Source
metadata:
  name: session-store
  namespace: production
  description: User session database
  owner: backend-team
spec:
  type: mongodb
  version: "6.0"
  connection:
    connectionString: mongodb+srv://cluster.mongodb.net
    database: sessions
    replicaSet: rs0
    readPreference: secondaryPreferred # Read from replicas
  credentials:
    secretRef: mongodb-credentials
  classification:
    criticality: high
    dataTypes: [session-data, user-preferences]
    retention: 90d
  discovery:
    enabled: true
    includeCollections: [user_sessions, api_tokens]

Separate read/write access using replicas:

apiVersion: v1
kind: Source
metadata:
  name: production-db-replica
  namespace: production
  description: Read-only replica for analytics
  owner: data-team
spec:
  type: postgresql
  version: "15.3"
  connection:
    host: db-replica.production.internal
    port: 5432
    database: app_production
    readOnly: true # Enforce read-only at connection level
  credentials:
    secretRef: postgres-readonly-credentials
  primarySource: production-postgres # Link to primary source
  classification:
    criticality: medium
    dataTypes: [analytics, reporting]
    retention: 7y

Let Alien Giraffe automatically classify data sensitivity:

apiVersion: v1
kind: Source
metadata:
  name: new-database
  namespace: production
  description: Database with unknown data sensitivity
  owner: data-team
spec:
  type: postgresql
  version: "15.3"
  connection:
    host: db.production.internal
    port: 5432
    database: new_app
  credentials:
    secretRef: postgres-credentials
  classification: auto # Enable automatic classification
  discovery:
    enabled: true # Required for auto-classification
    schedule: "0 2 * * *"
    classification:
      enabled: true
      rules:
        - pattern: ".*email.*|.*e_mail.*"
          type: pii
          confidence: high
        - pattern: ".*ssn.*|.*social_security.*"
          type: pii-sensitive
          confidence: high
        - pattern: ".*credit_card.*|.*payment.*|.*card_number.*"
          type: financial
          confidence: high
        - pattern: ".*password.*|.*secret.*|.*api_key.*"
          type: credentials
          confidence: high
    profiling:
      enabled: true
      sampleSize: 10000
      pii_detection:
        enabled: true
        methods: [pattern-matching, statistical-analysis]

When classification: auto is set:

  • Discovery scans all tables and columns
  • Pattern matching identifies sensitive data types
  • Statistical analysis detects PII patterns
  • Criticality is calculated based on findings
  • Classification is updated automatically
  • Changes trigger policy re-evaluation

Auto-Classification Results:

# After auto-classification completes, the resource is updated:
classification:
  criticality: high # Automatically determined
  dataTypes: [pii, financial] # Detected from data patterns
  retention: 7y # Based on detected data types
  lastClassified: 2025-11-19T02:00:00Z
  confidence: high

Alien Giraffe can automatically discover and catalog datasets:

Discovery Process:

  1. Connect to data source
  2. Query metadata tables/APIs
  3. Enumerate schemas, tables, columns
  4. Extract statistics (row counts, sizes)
  5. Classify data based on patterns
  6. Update catalog

What’s Discovered:

For Databases:

  • Database schemas and tables
  • Column names and types
  • Primary/foreign keys
  • Indexes and constraints
  • Row counts and sizes
  • Last modified timestamps

For Object Storage:

  • Bucket/container names
  • Directory structure
  • File types and sizes
  • Object counts
  • Last modified times
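
As an illustration of what lands in the catalog, a discovered table entry might look roughly like this; the Dataset kind and its fields are assumptions:

# Hypothetical catalog entry; kind and fields are illustrative only.
kind: Dataset
metadata:
  name: production-postgres/public/users
spec:
  columns:
    - { name: id, type: bigint, primaryKey: true }
    - { name: email, type: text }
  rowCount: 1204331
  sizeBytes: 524288000
  lastModified: 2025-11-18T23:41:00Z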

Automatically classify data sensitivity:

spec:
  discovery:
    classification:
      enabled: true
      rules:
        - pattern: ".*email.*"
          type: pii
          confidence: high
        - pattern: ".*ssn.*|.*social_security.*"
          type: pii-sensitive
          confidence: high
        - pattern: ".*credit_card.*|.*payment.*"
          type: financial
          confidence: high
        - pattern: ".*password.*|.*secret.*"
          type: credentials
          confidence: high

Profile data to understand contents:

spec:
  discovery:
    profiling:
      enabled: true
      sampleSize: 10000 # Sample 10k rows
      schedule: "0 3 * * 0" # Weekly on Sunday at 3 AM
      metrics:
        - uniqueValues: true
        - nullPercentage: true
        - dataDistribution: true
        - valueRanges: true
      pii_detection:
        enabled: true
        methods: [pattern-matching, statistical-analysis]

Use namespaces to group related sources:

  • production - Production data sources
  • staging - Staging/QA environments
  • development - Development databases
  • finance - Finance-specific sources
  • analytics - Analytics and reporting sources

Separate read and write access:

  • Configure read replicas for analytics workloads
  • Prevent analytics queries from impacting production
  • Enable longer-running queries without locks
  • Provide better availability for reporting

Regularly rotate database credentials:

  • Enable automatic rotation (30-90 day cycles)
  • Store credentials in secret managers (AWS Secrets Manager, HashiCorp Vault)
  • Use short-lived credentials when possible
  • Audit credential access and usage

Tag sources and datasets by sensitivity:

  • pii - Personally identifiable information
  • pii-sensitive - SSN, passport numbers, biometrics
  • financial - Payment data, bank accounts
  • confidential - Trade secrets, strategic plans
  • public - Publicly available data

This enables risk-based policies and appropriate access controls.
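
For example, a policy could then target everything tagged pii instead of enumerating individual sources; the selector and constraint fields shown here are hypothetical:

# Hypothetical tag-based selector; field names are assumed.
spec:
  resources:
    matchDataTypes: [pii, pii-sensitive]  # assumed selector field
  constraints:
    maxDuration: 4h                       # assumed constraint field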

Balance automation with performance:

  • Schedule discovery during low-traffic periods
  • Exclude temporary tables and scratch spaces
  • Limit scope to relevant schemas/databases
  • Monitor discovery job performance
  • Cache results to reduce repeated scans

Assign clear ownership:

  • Team responsible for the data source
  • Contact for access requests
  • Escalation path for incidents
  • Data steward for governance

Track data source availability and performance:

  • Connection health checks
  • Query performance metrics
  • Credential validity
  • Discovery job success rates
  • Access patterns and anomalies
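
This page does not show a monitoring schema; as a sketch, such configuration could look like the following, with every field name assumed:

# Hypothetical monitoring block; all field names are assumptions.
spec:
  monitoring:
    healthCheck:
      enabled: true
      interval: 60s        # probe connectivity every minute
    alertOn:
      - credential-expiry
      - discovery-failure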

Configure sources across geographic regions:

---
apiVersion: v1
kind: Source
metadata:
  name: production-db-us
  namespace: production
spec:
  type: postgresql
  connection:
    host: db.us-west-2.internal
    region: us-west-2
---
apiVersion: v1
kind: Source
metadata:
  name: production-db-eu
  namespace: production
spec:
  type: postgresql
  connection:
    host: db.eu-west-1.internal
    region: eu-west-1

Separate configurations for different environments:

# Development - more permissive
metadata:
  name: dev-db
  namespace: development
spec:
  classification:
    criticality: low
    dataTypes: [test-data]
---
# Production - strict controls
metadata:
  name: prod-db
  namespace: production
spec:
  classification:
    criticality: critical
    dataTypes: [pii, financial]

Organize object storage by purpose:

spec:
  type: s3
  connection:
    bucket: data-lake
  discovery:
    includePrefix:
      - /raw/        # Raw ingestion
      - /processed/  # Cleaned data
      - /analytics/  # Analytics-ready
      - /archive/    # Long-term storage
    excludePrefix:
      - /temp/
      - /_spark/

Related Components:

  • Policies - Coordinate resources with the other access control components
  • Subjects - Define who can access resources
  • Constraints - Set temporal limits on resource access
  • Channels - Specify how resources are accessed
  • Context - Provide organizational context for resource classification