From Grid to Cloud


TL;DR: In 2005, a grid of 10 Intel-based PCs with 512 MB of RAM each extracted semantic data from text at 50 docs per second. The database server ran Oracle 9 on Windows 2000. The task compared words in documents against knowledge graph concepts carrying semantic tags, and the grid managed about 1,125,000 category comparisons per second. Reimplementing the process on AWS today, with S3 as the datastore and Hadoop MapReduce on EMR, a comparable set of 10 worker nodes yields an estimated throughput exceeding 400,000 docs per second.

LLM: Claude-3.5-Sonnet
Analyze this scenario and provide a modern AWS implementation comparison:

2005 System Specs:

  • 10 PCs with 512MB RAM each
  • 50 docs/second processing rate
  • ~1.125M category comparisons/second
  • Oracle 9 for storage/retrieval
  • Windows 2000 environment

Modern AWS Implementation:

Storage:

  • S3 replaces Oracle 9
  • Documents stored in S3 buckets with metadata tags
  • Categories/semantic tags stored as JSON in DynamoDB for fast lookup (see the lookup sketch after this list)

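A minimal sketch of that lookup, assuming a hypothetical DynamoDB table named semantic-categories keyed by word; the table name and the tags attribute are illustrative, not a real schema:

```python
# Hedged sketch: fetch the semantic tags stored for a word.
# Table name "semantic-categories" and attribute "tags" are assumptions.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("semantic-categories")  # hypothetical table

def categories_for(word: str) -> list:
    """Return the semantic tags stored for a word, or [] if none."""
    resp = table.get_item(Key={"word": word})
    item = resp.get("Item")
    return item.get("tags", []) if item else []
```
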
Processing Architecture:

  • EMR (Elastic MapReduce) cluster running Hadoop
  • m5.xlarge instances (4 vCPU, 16GB RAM)
  • 10 worker nodes (matching original setup size)
  • Auto-scaling enabled between 5-15 nodes based on queue depth (cluster launch sketched after this list)

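A sketch of launching that cluster with boto3; the cluster name is a placeholder, the IAM roles are the EMR defaults, and the actual auto-scaling policy is elided for brevity:

```python
# Sketch: launch a 1-master + 10-worker EMR cluster running Hadoop.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="semantic-tagging",            # hypothetical cluster name
    ReleaseLabel="emr-6.10.0",          # any recent Hadoop release works
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 11,            # 1 master + 10 workers
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # A 5-15 node AutoScalingPolicy would attach to the core instance group.
    AutoScalingRole="EMR_AutoScaling_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```
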
Processing Flow:

  • Documents land in S3 input bucket
  • EMR cluster picks up documents in batches
  • Map phase: Documents split into words/phrases
  • Reduce phase: Parallel category matching across nodes (mapper/reducer sketched after this list)
  • Results written back to S3 output bucket

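One way to express those two phases as Hadoop Streaming scripts; the tab-separated input framing and the inline category table are illustrative stand-ins for the real document format and the DynamoDB lookup:

```python
# mapper.py -- split each document into words, emit word<TAB>doc_id.
import sys

for line in sys.stdin:
    doc_id, _, text = line.rstrip("\n").partition("\t")
    for word in text.lower().split():
        print(f"{word}\t{doc_id}")
```

```python
# reducer.py -- match words against category tags, emit doc_id<TAB>category.
import sys

CATEGORIES = {"stock": "finance", "genome": "biology"}  # illustrative only

for line in sys.stdin:
    word, _, doc_id = line.rstrip("\n").partition("\t")
    category = CATEGORIES.get(word)
    if category:
        print(f"{doc_id}\t{category}")
```
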
Estimated Performance:

  • Single m5.xlarge: ~2,000 docs/second (40x the entire 2005 grid)
  • 10 node cluster: ~20,000 docs/second
  • Category comparisons: ~45M/second per node, ~450M/second across the cluster
  • Additional gains from:
      • Distributed S3 access (vs. a single Oracle DB)
      • Modern CPU architecture
      • More RAM (16 GB vs. 512 MB)
      • SSD storage
      • Higher network throughput

This represents roughly:

  • 400x improvement in document processing
  • 400x improvement in category comparisons
  • Better scalability and reliability
  • Lower operational overhead

How long would it take to process the 3B CommonCrawl web pages as documents? Estimate the cost.

Let me break this down systematically:

CommonCrawl Size Analysis:

  • 3B web pages
  • Average page size ~50KB
  • Total data: ~150 TB uncompressed

Processing Requirements:

  • At 20,000 docs/second (from previous calculation)
  • Time calculation: 3,000,000,000 / 20,000 = 150,000 seconds
  • = 41.67 hours with 10 nodes

Cost Calculation:

EMR Cluster (using current US-East-1 on-demand pricing):

  • m5.xlarge: $0.192/hour per node
  • EMR cost: $0.048/hour per node
  • Total per node/hour: $0.24
  • 10 nodes × 42 hours × $0.24 = $100.80

S3 Storage:

  • Input data: 150 TB × $0.023/GB = $3,450/month
  • Output data (estimated 20% of input): 30 TB × $0.023/GB = $690/month
  • S3 API requests: ~$100

Data Transfer:

  • Internal AWS transfer: ~$0
  • Multiple reads/writes: $0.02/GB
  • 150TB × $0.02 = $3,000

DynamoDB for category lookup:

  • Provisioned capacity: ~$200

Total Estimated Cost:

  • Compute: ~$101
  • Storage: ~$4,240 (input + output + API requests)
  • Transfer: ~$3,000
  • Database: ~$200
  • Total: ~$7,541 (arithmetic checked in the sketch below)

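A quick back-of-envelope check of those figures, using the rates quoted above (all in USD):

```python
# Reproduce the cost arithmetic above; all rates are the quoted estimates.
nodes, hours = 10, 42
compute  = nodes * hours * (0.192 + 0.048)           # EC2 + EMR per node-hour
storage  = 150_000 * 0.023 + 30_000 * 0.023 + 100    # input + output + API
transfer = 150_000 * 0.02                            # multiple reads/writes
database = 200                                       # DynamoDB provisioned
print(compute, storage, transfer, database)          # 100.8 4240.0 3000.0 200
print(compute + storage + transfer + database)       # ~7,541
```
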
Optimization Suggestions:

  • Use spot instances (70% savings on compute)
  • Increase node count to reduce time
  • Compress data to reduce storage costs
  • Use S3 Intelligent Tiering for longer-term storage

With optimizations:

  • Cost could be reduced to ~$4,000-5,000
  • Processing time could be reduced to ~8-10 hours with 50 nodes
  • Spot instances would reduce compute costs to ~$30-40

What are spot instances?
Spot instances are unused AWS EC2 capacity that you can bid on for significantly lower prices – typically 50-90% cheaper than regular on-demand instance prices. Here’s how they work:

Key Characteristics:

  1. Price varies based on supply and demand
  2. AWS can reclaim the instance with 2 minutes notice
  3. Best for flexible, fault-tolerant workloads

Example Pricing:

  • On-demand m5.xlarge: $0.192/hour
  • Spot m5.xlarge: Often $0.04-0.06/hour (70-80% savings)

Best Used For:

  • Batch processing
  • Data analysis
  • Background jobs
  • Non-time-critical workloads
  • Workloads that can handle interruption

Not Suitable For:

  • Critical production applications
  • Database servers
  • Real-time processing
  • Systems requiring consistent uptime

Handling Interruptions:

  1. Applications should checkpoint progress (see the watcher sketch after this list)
  2. Use instance groups to maintain capacity
  3. Store data externally (S3, EBS)
  4. Configure automatic failover

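A sketch of item 1, assuming IMDSv1 is enabled: EC2 publishes the 2-minute reclaim notice at a well-known instance-metadata URL, and the checkpoint helper here is hypothetical:

```python
# Poll the spot interruption notice; checkpoint to S3 when it appears.
import time
import urllib.request
import urllib.error

NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    try:
        urllib.request.urlopen(NOTICE_URL, timeout=1)
        return True                     # 200: a reclaim is scheduled
    except urllib.error.URLError:
        return False                    # 404 until AWS schedules one

def checkpoint_to_s3():
    """Hypothetical: persist the last processed doc id to S3."""

while not interruption_pending():
    time.sleep(5)
checkpoint_to_s3()                      # save progress before shutdown
```
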
For Our CommonCrawl Example:

  • Use spot instances for processing nodes
  • Keep master node as on-demand for stability (instance-group sketch after this list)
  • Set up automatic bidding with maximum price
  • Configure job resume capability
  • Could reduce compute costs from $101 to ~$30

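A sketch of that layout using EMR instance groups, with an on-demand master and spot core nodes; the cluster name and maximum price are illustrative:

```python
# Mixed-market EMR cluster: on-demand master, spot workers with a price cap.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="commoncrawl-tagging",         # hypothetical cluster name
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
             "InstanceCount": 1},
            {"Name": "workers", "InstanceRole": "CORE",
             "Market": "SPOT", "BidPrice": "0.10",  # max USD/hour (example)
             "InstanceType": "m5.xlarge", "InstanceCount": 10},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```
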
Would you like more details about implementing spot instances in an EMR cluster?

If we vectorize the data by tokenizing – no string compares – how much would that reduce the cost?

Let’s recalculate with vectorized processing:

Performance Improvement:

  • String comparisons replaced with vector operations
  • Using CPU/GPU SIMD instructions for parallel processing
  • Estimated 20-30x speedup in comparison operations (see the NumPy sketch after this list)

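An illustrative NumPy sketch of the idea: with documents and categories embedded as unit vectors, matching becomes one SIMD-friendly matrix multiply instead of per-word string compares. The shapes, dimensions, and 0.8 threshold are arbitrary examples:

```python
# Vectorized category matching: one matmul scores every doc/category pair.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 128)).astype(np.float32)  # doc embeddings
cats = rng.standard_normal((500, 128)).astype(np.float32)   # category embeddings
docs /= np.linalg.norm(docs, axis=1, keepdims=True)         # normalize to unit length
cats /= np.linalg.norm(cats, axis=1, keepdims=True)

scores = docs @ cats.T          # cosine similarities, all pairs at once
matches = scores > 0.8          # similarity threshold needs tuning
```
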
Throughput:

  • 20,000 docs/second with string comparisons
  • New estimate: 400,000-600,000 docs/second
  • 3B documents ÷ 500,000 = 6,000 seconds
  • ≈ 1.67 hours with 10 nodes

Cost Breakdown:

EMR Compute (using spot instances):

  • 2 hours × 10 nodes × $0.06/hour = $1.20
  • Master node (on-demand): 2 hours × $0.24 = $0.48
  • Total compute: ~$1.68

S3 Storage:

  • Initial vectorization storage (compressed): ~50 TB
  • S3 cost: 50 TB × $0.023/GB = $1,150
  • Temporary processing storage: ~$100

Data Transfer:

  • Single pass read/write
  • 50TB × $0.02 = $1,000

Vector Database (optional):

  • Using FAISS or similar: ~$50 (index sketch below)

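A minimal FAISS sketch, assuming 128-dimension embeddings and random illustrative data:

```python
# Exact inner-product search over normalized category vectors with FAISS.
import faiss
import numpy as np

d = 128                                   # embedding dimension (assumed)
cats = np.random.rand(500, d).astype(np.float32)
faiss.normalize_L2(cats)                  # unit vectors -> inner product = cosine

index = faiss.IndexFlatIP(d)
index.add(cats)

doc = np.random.rand(1, d).astype(np.float32)
faiss.normalize_L2(doc)
scores, ids = index.search(doc, 5)        # top-5 closest categories
```
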
New Total: ~$2,300
(Compared to previous $4,000-5,000)

Additional Benefits:

  • Faster processing
  • Lower CPU utilization
  • Better scaling
  • Simpler comparison logic

Key Considerations:

  1. One-time cost to vectorize documents
  2. Need for vector similarity threshold tuning
  3. Memory requirements for vector operations
  4. Potential for GPU acceleration
