TL;DR: In 2005, a grid of 10 Intel-based PCs with 512 MB of memory each extracted semantic data from text at 50 docs per second. The database server ran Windows 2000 with Oracle 9 as the engine. The task was comparing words in documents against knowledge-graph concepts carrying semantic tags; the grid managed about 1,125,000 category comparisons per second. Implementing the same process today on AWS, with S3 as the data store and Hadoop MapReduce on EMR, the estimated throughput with a comparable 10 worker nodes exceeds 400,000 docs per second.
LLM: Claude-3.5-Sonnet
Analyze this scenario and provide a modern AWS implementation comparison:
2005 System Specs:
- 10 PCs with 512MB RAM each
- 50 docs/second processing rate
- ~1.125M category comparisons/second
- Oracle 9 for storage/retrieval
- Windows 2000 environment
Modern AWS Implementation:
Storage:
- S3 replaces Oracle 9
- Documents stored in S3 buckets with metadata tags
- Categories/semantic tags stored as JSON in DynamoDB for fast lookup
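For illustration, a minimal boto3 sketch of that lookup; the `semantic-tags` table name and `word` key schema are assumptions, not part of the original design:

```python
# Hypothetical DynamoDB lookup: word -> list of semantic categories.
# Table name ("semantic-tags") and key ("word") are illustrative.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("semantic-tags")

def categories_for(word: str) -> list:
    # get_item returns no "Item" key when the word has no tags.
    response = table.get_item(Key={"word": word.lower()})
    return response.get("Item", {}).get("categories", [])

print(categories_for("Oracle"))  # e.g. ["database", "vendor"]
```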
Processing Architecture:
- EMR (Elastic MapReduce) cluster running Hadoop
- m5.xlarge instances (4 vCPU, 16GB RAM)
- 10 worker nodes (matching original setup size)
- Auto-scaling enabled between 5-15 nodes based on queue depth
Processing Flow:
- Documents land in S3 input bucket
- EMR cluster picks up documents in batches
- Map phase: Documents split into words/phrases and matched against category tags
- Reduce phase: Match counts aggregated in parallel across nodes
- Results written back to S3 output bucket
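For concreteness, here is a minimal sketch of that flow using the mrjob library (my choice of framework, not specified above); the in-memory tag table stands in for the DynamoDB lookup:

```python
# Map phase: tokenize documents and match words against category tags.
# Reduce phase: aggregate match counts per category across nodes.
import re
from mrjob.job import MRJob

WORD_RE = re.compile(r"[\w']+")

# Hypothetical tag table standing in for the DynamoDB lookup.
CATEGORY_TAGS = {"oracle": "database", "hadoop": "framework"}

class CategoryMatch(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line.lower()):
            category = CATEGORY_TAGS.get(word)
            if category:
                yield category, 1

    def reducer(self, category, counts):
        yield category, sum(counts)

if __name__ == "__main__":
    CategoryMatch.run()
```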
Estimated Performance:
- Single m5.xlarge: ~2,000 docs/second (40x the entire original grid)
- 10 node cluster: ~20,000 docs/second
- Category comparisons: ~45M/second
- Additional gains from:
- Distributed S3 access (vs single Oracle DB)
- Modern CPU architecture
- More RAM (16GB vs 512MB)
- SSD storage
- Network throughput
This represents roughly:
- 400x improvement in document processing
- 40x improvement in category comparisons
- Better scalability and reliability
- Lower operational overhead
How long would it take to process the 3B CommonCrawl web pages as documents? Estimate the cost.
Let me break this down systematically:
CommonCrawl Size Analysis:
- 3B web pages
- Average page size ~50KB
- Total data approximately 150TB uncompressed
Processing Requirements:
- At 20,000 docs/second (from previous calculation)
- Time calculation: 3,000,000,000 / 20,000 = 150,000 seconds
- = 41.67 hours with 10 nodes
Cost Calculation:
EMR Cluster (using current US-East-1 on-demand pricing):
- m5.xlarge: $0.192/hour per node
- EMR cost: $0.048/hour per node
- Total per node/hour: $0.24
- 10 nodes × 42 hours × $0.24 = $100.80
S3 Storage:
- Input data: 150TB × $0.023/GB = $3,450/month
- Output data (estimated 20% of input): 30TB × $0.023/GB = $690/month
- S3 API requests: ~$100
Data Transfer:
- Internal AWS transfer: ~$0
- Multiple reads/writes: $0.02/GB
- 150TB × $0.02 = $3,000
DynamoDB for category lookup:
- Provisioned capacity: ~$200
Total Estimated Cost:
- Computing: ~$101
- Storage (incl. API requests): ~$4,240
- Transfer: ~$3,000
- Database: ~$200
- Total: ~$7,541
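For readability, the same arithmetic as a short Python back-of-envelope, with all rates copied from the figures above:

```python
# Reproduce the time and cost estimates line by line.
DOCS = 3_000_000_000
DOCS_PER_SEC = 20_000

hours = DOCS / DOCS_PER_SEC / 3600            # ≈ 41.7 hours
compute = 10 * 42 * (0.192 + 0.048)           # ≈ $100.80
storage = (150_000 + 30_000) * 0.023 + 100    # S3 + API requests ≈ $4,240
transfer = 150_000 * 0.02                     # ≈ $3,000
total = compute + storage + transfer + 200    # +$200 DynamoDB
print(f"{hours:.1f} h, ≈ ${total:,.0f}")      # 41.7 h, ≈ $7,541
```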
Optimization Suggestions:
- Use spot instances (70% savings on compute)
- Increase node count to reduce time
- Compress data to reduce storage costs
- Use S3 Intelligent Tiering for longer-term storage
With optimizations:
- Cost could be reduced to ~$4,000-5,000
- Processing time could be reduced to ~8-10 hours with 50 nodes
- Spot instances would reduce compute costs to ~$40-55 (the EMR per-node fee still applies)
What are spot instances?
Spot instances are unused AWS EC2 capacity offered at significantly lower prices – typically 50-90% cheaper than regular on-demand instance prices. (AWS retired the old bidding model; you pay the current spot price, optionally capped by a maximum you set.) Here's how they work:
Key Characteristics:
- Price varies based on supply and demand
- AWS can reclaim the instance with two minutes' notice
- Best for flexible, fault-tolerant workloads
Example Pricing:
- On-demand m5.xlarge: $0.192/hour
- Spot m5.xlarge: Often $0.04-0.06/hour (70-80% savings)
Best Used For:
- Batch processing
- Data analysis
- Background jobs
- Non-time-critical workloads
- Workloads that can handle interruption
Not Suitable For:
- Critical production applications
- Database servers
- Real-time processing
- Systems requiring consistent uptime
Handling Interruptions:
- Applications should checkpoint progress
- Use instance groups to maintain capacity
- Store data externally (S3, EBS)
- Configure automatic failover
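A minimal sketch of detecting that two-minute notice by polling the EC2 instance metadata service (IMDSv1 shown for brevity; the checkpoint handler is a hypothetical placeholder):

```python
import time
import urllib.request

# This endpoint 404s until AWS schedules a reclaim, then returns JSON
# describing the action and time.
NOTICE = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    try:
        urllib.request.urlopen(NOTICE, timeout=1)
        return True
    except OSError:  # 404 or timeout: no interruption notice yet
        return False

def checkpoint_and_drain():
    # Hypothetical: flush partial results to S3 and stop accepting work.
    pass

while not interruption_pending():
    time.sleep(5)
checkpoint_and_drain()
```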
For Our CommonCrawl Example:
- Use spot instances for processing nodes
- Keep master node as on-demand for stability
- Set a maximum price to cap spot charges
- Configure job resume capability
- Could reduce compute costs from ~$101 to ~$55
Would you like more details about implementing spot instances in an EMR cluster?
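As a starting point, here is a hedged boto3 sketch of such a cluster, with spot core nodes and an on-demand master; the name, release label, counts, and log bucket are placeholders:

```python
# Launch an EMR cluster: on-demand master, spot workers. Omitting
# BidPrice caps the spot price at the on-demand rate by default.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="commoncrawl-tagging",          # placeholder
    ReleaseLabel="emr-6.15.0",           # assumed release
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 10},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-log-bucket/emr/",    # placeholder bucket
)
print(response["JobFlowId"])
```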
If we vectorize the data by tokenizing – no string compares – how much would that reduce cost?
Let’s recalculate with vectorized processing:
Performance Improvement:
- String comparisons replaced with vector operations
- Using CPU/GPU SIMD instructions for parallel processing
- Estimated 20-30x speedup in comparison operations
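As a toy illustration of the idea, assuming documents and categories are already embedded as fixed-length vectors (dimensions are illustrative), one numpy matrix multiply scores every document against every category; that dense operation is where the SIMD speedup comes from:

```python
# Vectorized category matching: no per-word string compares, just one
# dense similarity computation over the whole batch.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((10_000, 128), dtype=np.float32)  # document vectors
cats = rng.random((1_000, 128), dtype=np.float32)   # category vectors

# Normalize so the dot product is cosine similarity.
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
cats /= np.linalg.norm(cats, axis=1, keepdims=True)

scores = docs @ cats.T              # (10k x 1k) similarities, one BLAS call
best = scores.argmax(axis=1)        # top category per document
matched = scores.max(axis=1) > 0.8  # similarity threshold to tune
```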
Original Processing:
- 20,000 docs/second with string comparisons
- New estimate: 400,000-600,000 docs/second
- 3B documents ÷ 500,000 = 6,000 seconds
- ≈ 1.67 hours with 10 nodes
Cost Breakdown:
- EMR Compute (using spot instances):
- Workers: 2 hours × 10 nodes × ($0.06 spot + $0.048 EMR fee) ≈ $2.16
- Master node (on-demand): 2 hours × $0.24 = $0.48
- Total compute: ~$2.64
- Initial vectorization storage (compressed): ~50TB
- S3 cost: 50TB × $0.023/GB = $1,150
- Temporary processing storage: ~$100
Data Transfer:
- Single pass read/write
- 50TB × $0.02 = $1,000
Vector Database (optional):
- Using FAISS or similar: ~$50
New Total: ~$2,300
(Compared to previous $4,000-5,000)
Additional Benefits:
- Faster processing
- Lower CPU utilization
- Better scaling
- Simpler comparison logic
Key Considerations:
- One-time cost to vectorize documents
- Need for vector similarity threshold tuning
- Memory requirements for vector operations
- Potential for GPU acceleration
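If the optional vector database route is taken, a minimal FAISS sketch of the category search might look like this (dimensions and data are illustrative):

```python
# Exact inner-product search over category vectors with FAISS.
import faiss
import numpy as np

d = 128
cats = np.random.random((1_000, d)).astype("float32")   # category vectors
docs = np.random.random((10_000, d)).astype("float32")  # query vectors

index = faiss.IndexFlatIP(d)  # flat (exact) inner-product index
index.add(cats)

scores, ids = index.search(docs, 5)  # top-5 categories per document
```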
