As stated on the Amazon AWS front page, Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
In this independent review I will set up an Amazon Redshift data warehouse and run some basic tests to check, without going too deep, what can be achieved with a simple Redshift installation.
Installing the data warehouse is extremely easy. The user just needs to open the Redshift administration console; with a few clicks, a new cluster is ready in a few minutes.
The user can choose the machine type, the disk type (SSD or spinning disk) and the number of nodes. For this configuration I chose two dc1.large nodes, each with the following characteristics:
- 7 EC2 Compute Units (2 virtual cores)
- 15 GB of memory
- 160 GB of SSD storage
During cluster creation the user can also choose to encrypt the database, providing a strong level of security for sensitive data.
After the cluster initialization the user can administer the data warehouse via a clear, easy-to-use web interface.
Amazon Redshift is based on the popular PostgreSQL database and provides a specific JDBC connector as well as an ODBC driver. This means that the SQL implementation is similar to PostgreSQL and that any tool which connects via JDBC or ODBC can be used, which covers the vast majority of data analysis tools.
Amazon Redshift is designed with a strong focus on scalability, and the most common configuration is a clustered environment with a leader node. The leader node, which is provided for free, receives the queries and optimizes the fetching strategy.
Amazon Redshift follows an MPP (massively parallel processing) architecture, which means that all operations are executed with as much parallelism as possible. Both data loading and data querying are executed in parallel across the nodes, and this allows the cluster to scale in a near-linear way by adding new nodes.
Another interesting characteristic of Redshift is that it is a columnar database, which means that each record is not saved as a single contiguous block of data but is stored in independent columns. Query performance can greatly improve when selecting a limited subset of columns rather than the full record. In addition, a fine-grained compression strategy can be assigned to each individual column.
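As a sketch of per-column compression, a table definition can assign a different encoding to each column. The table and column names below are hypothetical; the encodings shown are standard Redshift options:

```sql
-- Each column gets a compression encoding suited to its data shape.
CREATE TABLE events (
    event_id   BIGINT       ENCODE delta,     -- increasing ids: store small deltas
    country    CHAR(2)      ENCODE bytedict,  -- low-cardinality values: dictionary
    payload    VARCHAR(500) ENCODE lzo,       -- generic text compression
    created_at TIMESTAMP    ENCODE runlength  -- long runs of equal values
);
```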
Being based on the popular PostgreSQL, Redshift offers plenty of options to store, optimize, query and partition the data.
A typical data warehouse should be optimized for batch loading and select queries, while insert and update queries are less relevant. This is because data should be periodically batch loaded and kept as immutable as possible. The way suggested by Amazon to load data into Redshift is the COPY command.
In fact, INSERT performance is very disappointing when a single record is committed at a time. In a test with Apache JMeter, the maximum I could achieve was a modest 9 insertions per second (yes, 9 inserts per second). However, in a data warehouse inserting a single record is a rare operation, and if the INSERT is performed as a batch (for example, aggregating 200 records per statement and committing every 1,000), performance can reach 1,000-1,200 inserts per second.
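The difference between the two approaches can be sketched as follows. The `usertable` schema follows the YCSB convention used later in this review and is purely illustrative:

```sql
-- Single-row insert: one record per statement, very slow when committed
-- one at a time.
INSERT INTO usertable (ycsb_key, field0) VALUES ('user1', 'value1');

-- Multi-row insert: many records per statement amortize the overhead;
-- commit only every few batches.
INSERT INTO usertable (ycsb_key, field0) VALUES
    ('user2', 'value2'),
    ('user3', 'value3'),
    ('user4', 'value4');
COMMIT;
```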
In any case, the suggested method to load data is from an S3 bucket via the COPY command. I generated some YCSB-compliant data (see the Yahoo Cloud Serving Benchmark Tutorial) to be loaded with the COPY command. The files were partitioned to contain no more than 10M records each (10 GB of data) to take advantage of Redshift's parallel loading. The SQL commands VACUUM and ANALYZE were run after each session. Here are the results:
| Records loaded | Time elapsed | Insertions / second | Total database size |
|---|---|---|---|
| 1M (single file) | 41.8 seconds | 23,923 | 1 GB |
| 50M (5 x 10M files) | 17 minutes 7 seconds | 48,685 | 51 GB |
| 100M (10 x 10M files) | 33 minutes 22 seconds | 49,677 | 151 GB |
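A load of this kind can be sketched as follows; the bucket name and the credentials are placeholders, and the common key prefix matches all the part files so Redshift can load them in parallel:

```sql
-- Load the pre-split, gzipped CSV part files from S3 in parallel.
COPY usertable
FROM 's3://my-bucket/ycsb/part'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER ',' GZIP;

-- Reclaim space and refresh the planner statistics after the load.
VACUUM usertable;
ANALYZE usertable;
```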
The results can vary when additional features like a distribution key, sort key or column compression are set. Most of the time these features are a must-have to make queries perform well. For example, loading 150M records (150 GB) into the same table with a distkey and sortkey set took 1 hour, 10 minutes and 33 seconds, which means 35,435 insertions per second.
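A minimal sketch of such a table definition, again with hypothetical YCSB-style column names:

```sql
-- DISTKEY spreads rows across the nodes by hash of ycsb_key;
-- SORTKEY keeps rows ordered on disk so range scans read fewer blocks.
CREATE TABLE usertable (
    ycsb_key VARCHAR(100) NOT NULL DISTKEY SORTKEY,
    field0   VARCHAR(100),
    field1   VARCHAR(100)
);
```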
In any case, insert performance with the COPY command is definitely better than with the SQL INSERT command. When the files are split into parts, Amazon Redshift loads the data in parallel across the nodes, greatly improving the loading process.
In a separate blog post, Amazon Redshift and Yahoo Cloud Serving Benchmark, I used the Yahoo Cloud Serving Benchmark (YCSB) to test select query performance.
YCSB is not a light test for a columnar database, nor is it the best test for a data warehouse. In any case, after some tuning, Amazon Redshift reacted quite well.
For example, with a simple cluster (3 nodes and 150 GB of data in a single table) it was possible to extract a record with 2 columns by key in 200 milliseconds, and a heavyweight record with 11 columns in less than 3 seconds. With a cluster of 2 nodes it was possible to sum 150 million decimal numbers in 1 second.
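Queries of this kind can be sketched as follows; the table and column names are illustrative:

```sql
-- Point lookup by key, fetching only the needed columns
-- (columnar storage means the other columns are never read).
SELECT field0, field1 FROM usertable WHERE ycsb_key = 'user12345';

-- Full-table aggregation over a single decimal column.
SELECT SUM(field_dec) FROM usertable;
```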
For a data warehouse, a few seconds for simple queries and a few minutes for complex queries are reasonable response times, and Redshift performed well.
More details at Amazon Redshift and Yahoo Cloud Serving Benchmark.
I must admit that one of the coolest features is cluster resizing. In a typical environment, adding or removing a SQL node can be a complex, long and expensive procedure. In Redshift it is extremely easy and transparent.
The user just needs to click the cluster resize button and select the number of nodes and the machine type; the cluster resizes in the background, with no further action required. For a small cluster of 150 GB the resizing process took about 20 minutes.
Resizing works in both directions: upward by adding new nodes or downward by removing them. The machines can also be upgraded or downgraded.
For example, the cluster can be upgraded before an important event to handle the exceptional load, then reduced when the workload returns to normal.
By default, Amazon Redshift enables automated backups with a 1-day retention period, which can be extended up to 35 days.
Additional snapshots can be taken at any time, and the data can also be replicated to other regions to protect it even against catastrophic events. Big companies can spend millions to replicate data in other countries; in Amazon Redshift this can be achieved with a few clicks.
Cluster and query monitoring are extremely important to optimize and balance the data warehouse. Detailed load metrics such as CPU, disk and I/O usage can be shown in the control panel.
In addition, for fine-grained monitoring and query optimization, all queries are recorded and detailed execution metrics can be retrieved.
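For example, recent queries and their timings can be retrieved from the STL_QUERY system table (a standard Redshift system table; the exact set of columns may vary by version):

```sql
-- List the ten most recent queries with their start and end times.
SELECT query, starttime, endtime, trim(querytxt) AS sql_text
FROM stl_query
ORDER BY starttime DESC
LIMIT 10;
```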
Cost estimation is not an easy task: it depends on the machines used, the data stored, additional backups, etc. Most of the costs are related to the amount of data stored and the disk type used. For example, machines with SSD disks are more expensive but perform better than a spinning-disk solution. In addition to storage costs, the user pays for the hourly usage of the machines, which depends on the machine size.
Very approximately, this is an estimate of the costs a user could expect. The contract is optional and allows the user to reduce the cost of the cluster.
| Contract length | Disk type | Cost per terabyte per year |
|---|---|---|
Compared to other options, Redshift is definitely a cost-effective solution. The ease of use, simplicity of administration and built-in disaster recovery allow the user to quickly set up a complete data warehouse and greatly cut implementation costs.
The user can also experiment with Redshift before investing in it: Amazon usually provides a free trial period in which the user can set up a free cluster.
I was positively impressed by the ease of use and scalability of Redshift. Complex operations like cluster resizing, backup and monitoring are so easy to set up and so effective that costs and time to market are greatly reduced.
The functionality of the data warehouse is comparable to that of high-end databases, with advanced features like column compression and data encryption. In addition, the SQL dialect is similar to that of PostgreSQL, from which Redshift derives.
As long as it is used as a data warehouse and not, for example, as a key/value data store or a store for unstructured data, Redshift is also very performant.
Performance and scalability with large quantities of data depend greatly on how well the data is partitioned and distributed, and Redshift, with its MPP (massively parallel processing) architecture and rich SQL functionality, is definitely a winning product.
- Ease of use and administration
- Performance when taking advantage of parallelism
- Rich SQL language derived from PostgreSQL
- Compatibility with existing SQL tools
- Monitoring, advanced backup, encryption, column compression
- Ease of cluster resizing
- Short time to market
- Low running and maintenance costs
- Integration with the Amazon AWS suite
- Proprietary solution not installable locally
- Creates some dependency on the Amazon AWS suite, for example for data loading
- Not exactly a disadvantage, but it is a data warehouse and not a general-purpose database