Frequently Asked Questions | Matillion ETL for Redshift

FAQ

 

General

What is Matillion ETL for Redshift?


Matillion ETL for Redshift is a push-down ETL/ELT tool for Amazon Redshift. It allows you to create powerful data transformation jobs, and orchestrate ELT processes, in a graphical tool. It includes all the functional and nonfunctional features you need to build and maintain enterprise data warehouses and analytical databases on Amazon Redshift, at scale. Delivered via the AWS Marketplace as an AMI, the product is browser-based and can be setup in under 5 minutes.

Who is Matillion ETL for Redshift for? Should I consider using this product?


Matillion ETL for Redshift is designed for people building, or planning to build, data warehouses and analytical databases on Amazon Redshift.  It suits advanced, power users who have previously used ETL tools like Informatica, Talend, IBM Datastage or Microsoft SSIS.  It also suits those who have been hand-coding data integrations in SQL.  Finally, because the tool, whilst powerful, is very easy to use, it also suits more business focussed users – data scientists and business analysts for example.

What are the benefits of Matillion compared to traditional ETL tools e.g. Talend, Pentaho, Informatica?


For Redshift users, Matillion delivers some significant benefits when compared to traditional ETL tools. It’s much faster. It’s simpler to use, install and manage. It’s linearly scalable (pretty much). As well as these benefits it’s usually cheaper – to purchase and to own.

Performance: Matillion is actually an ELT tool meaning it pushes down the hard work of data transformation to the Redshift cluster. This is much more efficient than a traditional ETL approach when using a massively parallel columnar data store like Redshift.

Scalability: Because Matillion pushes-down the hard work of data transformation to the Redshift cluster, it’s as scalable as Redshift is. If you need to go faster, or handle more data, simply increase the power of your Redshift cluster.

Simplicity: Matillion was specifically designed with the modern ETL developer in mind. The product is simple yet capable and includes all the features you need to develop data warehouses and analytical databases at scale. Because Matillion ETL for Redshift uses the Redshift cluster to do the hard work of data transformation, there’s no complex infrastructure required either. You don’t need to add Matillion servers to make your jobs go faster or handle more data. Finally the product is delivered as an AMI via AWS Marketplace, so install into your AWS environment takes less than 5 minutes.

Cost: Matillion saves you money in 3 ways. It’s cheaper than many competitive solutions. No extra infrastructure is required to run it. And finally, because it’s so fast, it saves you time in the development and maintenance of ETL/ELT jobs (we think normally about 50%).

How does Matillion ETL for Redshift work?


Matillion is actually an ELT tool meaning it pushes down the hard work of data transformation to the Redshift cluster. This is much more efficient than a traditional ETL approach when using a massively parallel columnar data store like Redshift.

As you create data transformation jobs in Matillion ETL for Redshift, the tool issues instructions (in the form of Redshift SQL) to the Redshift cluster. Think of Matillion as the conductor standing in front of the orchestra that is your Redshift cluster. This is great as it’s very high performance (depending on the power of your Redshift cluster, expect x100 performance improvement compared to traditional ETL). It also means the tool has some neat features, like being able to view the data live as you develop your job (no build – run – wait while it crunches, horribleness).

On top of the core data transformation capabilities, Matillion ETL for Redshift also adds orchestration, allowing you to choreograph transformations, data ingestion, integration with other systems, database maintenance, error handling, iteration and logging; scheduling; version control; monitoring; collaboration; and environment support.

How is Matillion ETL for delivered? Is it a SaaS product, a client install etc?


Matillion ETL for Redshift is delivered as an AMI via the AWS Marketplace. This means you launch your own instance of the product into your own AWS environment. It is not a SaaS product – you do not connect to Matillion to run Matillion ETL for Redshift.

Once your instance is spun-up and configured, you and your team of ETL developers/data professionals access the tool using your web browser. There is no client install.

Under the hood the product is simply a normal Amazon Linux instance running Tomcat, a small MongoDb (used to hold config information and saved jobs) and the Matillion software (a Java application). You launch it just like you would launch any other AMI from your EC2 console or from the AWS Marketplace.

The AWS Marketplace is a trusted storefront of applications, each designed for AWS and tested, scanned and authorised by Amazon. The only difference between a Marketplace AMI and a “normal” AMI is that the Marketplace AMI is tagged for billing. As such, billing for Matillion ETL for Redshift is via your existing AWS account (no need to give us your credit card details).

Who are Matillion?


Matillion is a cloud software company based in Knutsford, UK (just outside Manchester). We are an AWS Advanced Technology Partner, Redshift partner, Marketplace partner and are one of only 17 ISVs worldwide to hold the AWS Big Data Competency. We were founded in 2011 and support customers globally.

 

Supported Databases/Sources

What databases/sources can I load data to and from using Matillion ETL for Redshift?


Matillion ETL for Redshift provides a comprehensive set of integrations to a range of databases, services and APIs. These include to AWS core services, including S3, RDS and DynamoDb; to on premise and Cloud databases including MySQL, MS SQL, Oracle, Sybase, PostgreSQL, DB2, DB2 for I and Teradata; to FTP and HTTP; to REST, JSON and XML APIs; to Google BigQuery and MongoDb; to social media platforms including Facebook and Twitter; to marketing applications including Google Analytics, Google Adwords and Marketo; to CRM platforms including Salesforce.com, Microsoft Dynamics CRM and Netsuite CRM and to internet standards including LDAP.

You can also export/unload data from Redshift to a number of destinations: S3, MySQL, Aurora and Mariadb.

Find a complete list of integrations here.

Can I export data from Redshift to other databases using Matillion ETL for Redshift?


Using the S3 Unload component or the RDS Bulk Output component in Matillion ETL for Redshift you can push Redshift tables into S3, MySQL, Amazon Aurora and Mariadb. In this way Redshift can be the “engine” of your ETL process, with RDS or S3 as the destination for some or all of the finished data.

 

Integration

Can I call/invoke Matillion ETL for Redshift jobs from external applications?


Yes. You can invoke Matillion jobs and pass in parameters, from external applications, using Amazon SQS. SQS represents the architectural best practice way of integrating applications on AWS and you can write an SQS message from most any programming language in under 10 lines of code. Matillion ETL for Redshift can also provide return messages (e.g. success/fail/statistics/conditional) via SQS.

Can I call/invoke external applications/scripts from Matillion ETL for Redshift jobs?


Yes. Matillion ETL provides components to use and call SQS, SNS, SQL and Python.

Can I use Matillion ETL for Redshift to run existing/new SQL I have developed separately?


Yes. There are SQL components available in Orchestration and Transformation jobs and you can call programs containing embedded SQL using SQS, SNS, Python or SQL.

Which core AWS services does Matillion integrate with/support?


Matillion ETL for Redshift integrates with and supports S3, RDS, SQS and SNS. Using the Python component, Matillion ETL for Redshift can also interact with the AWS API.

 

Can I use Matillion with Amazon Lambda?


Yes. Lambda is useful in the context of Matillion ETL for Redshift as you can use Lambda to fire an SQS message onto a queue on the arrival of a file into an S3 bucket. In this way, you can integrate data into Redshift in real-time, from arrival in S3, through a load into Redshift then transform into your Redshift data warehouse. If you do not want to use Lambda, you can also write code tied to the S3 bucket create event and manage it yourself, to achieve the same effect.

 

Functionality

Can I use Matillion ETL for Redshift to build an Enterprise Data Warehouse?


Yes. Matillion ETL for Redshift was designed for this purpose and provides all the tools you need to build and maintain an EDW at scale – on its own, or in concert with other technologies.

Can I use Matillion ETL for Redshift to build a star schema?


Yes. Matillion ETL for Redshift contains all the components required to build and maintain star-schemas and other data warehouse forms.

Does Matillion ETL for Redshift support slowly changing dimensions?


Yes. You can see an example of building a Type 6 slowly changing dimension in this support site article

What data transformation features does Matillion ETL for Redshift include?


Matillion ETL for Redshift includes a comprehensive array of data transformation components, each of which leverage underlying Redshift SQL. In short, everything you are likely to need to build and maintain data warehouses and analytical databases in Redshift.

You can see a full list of available transformation components here.

Does Matillion ETL for Redshift include a scheduler?


Yes. Matillion ETL for Redshift includes a scheduler which can run Orchestration jobs (which in turn, can run one or more other Orchestration jobs and Transformation jobs). The scheduler can also run external applications (via SQS, SNS or Python). With the Matillion scheduler and Orchestration jobs, you can fully automate your data warehouse ETL/ELT process, as well as choreograph other upstream and downstream activities.

If you have an existing scheduler, you can integrate Matillion ETL for Redshift into it using SQS.

The scheduler includes comprehensive logging, error handling and alerting.

Does Matillion ETL for Redshift support version control?


Yes. Matillion ETL for Redshift includes comprehensive version control.

Note: Matillion does not at present support external version control systems (e.g. Subversion, git or Mercurial) but you can export jobs/projects into JSON files which can then be versioned in these systems.

Can Matillion ETL for Redshift work across multiple Redshift schemas and clusters?


Yes. As of version 1.17.2, Matillion ETL for Redshift supports multiple schemas in individual jobs. So you can read from schema a, perform transformation on schema b and output to schema c, in a single job.

Matillion ETL for Redshift provides a concept called ‘Environments’ where you can define one or more Redshift clusters, then choose which cluster your jobs run against e.g. Dev, Test, Live. Environments also provide variables that can be reference in your jobs to allow for soft-coding of, for instance, table names, database connection details, S3 bucket/file names, parameters in Python or SQL, SQS queue names, SNS topic names (or anything else).

Can I deliver a real-time data warehouse/BI solution using Matillion ETL for Redshift?


Yes. In conjunction with a suitable mechanism for delivering your data either into S3 or into Redshift, you can use Matillion ETL for Redshift to deliver a near real-time data warehouse.

Does Matillion ETL for Redshift support the use of variables and soft-coding?


Yes. You can define variables and use them in any parameter or script in your jobs. Variables can be populated either from Orchestration components (e.g. read from a database table as part of a Table Iteration), from an external integration (an SQS message with variable key-value pairs) or manually configured into the Environment configuration.

Can Matillion ETL for Redshift co-exist with/make use of my existing SQL/Python scripts etc?


Yes. There are SQL components available in Orchestration and Transformation jobs and you can call programs containing embedded SQL using SQS, SNS, Python or SQL.

You can call Python (or other) programs using SQS or SNS and you can run or call Python from the Python component.

Which Redshift SQL features does Matillion ETL for Redshift support?


The goal and usual status is that Matillion ETL for Redshift supports all Redshift SQL features through the components it provides. Sometimes these are a direct componentization of a Redshift feature (e.g. Filter). Sometimes, more complex components leverage multiple Redshift SQL statements to deliver a particular function (e.g. Rank, Lead/Lag).

When new Redshift features are introduced by AWS, we add these into the product as fast as possible (usually, a few weeks). Where a component isn’t available for a Redshift command you want to use, the SQL component allows you direct access to use Redshift SQL.

You can view the SQL any Matillion component creates in the SQL tab for that component.

 

Performance

I have a lot of data. Can Matillion ETL for Redshift scale to handle billions of rows?


Yes. Matillion ETL for Redshift scales to very large data volumes and delivers incredible speed when compared to traditional ETL tools. The “hard work” of transforming data is actually done by Redshift itself, as Matillion is actually an ELT tool. As such, it’s a just as scalable as Redshift – so near linearly and to petabytes of data.

How fast is Matillion ETL for Redshift?


Very fast.

The data transform speed of the product is governed by the power of your Redshift cluster, but assuming a fairly normal Redshift cluster size for your given data volume, expect in excess of x100 performance when compared with a well optimised, well resourced, traditional ETL technology.

Speed is important operationally (reducing ETL run times and allowing faster data refreshes). It’s also important at development time. In traditional ETL tools there’s a well established process of: build some of your job – run it to see if it works – wait for it to grind – find an error – repeat. This process is slow, so makes your development/maintenance slower than it needs to be. In Matillion ETL for Redshift, you can see data being processed in-job, in real time, so there’s no build-test-”wait to grind” cycle. We find this reduces ETL development and maintenance effort by 50%.

How does Matillion ETL for Redshift scale? Do I need a bigger Matillion instance to process more data?


You do not need a bigger Matillion ETL instance to Transform more data, faster.  The actual work of transforming data is done by the Redshift cluster, so if you need to process more data faster in your transformation, you make your Redshift cluster bigger.

The Matillion instance size governs the maximum number of ETL developers/administrators that can concurrently use the tool and also the maximum number of saved projects/environments. Also, larger instance sizes can Load more data faster.  You can read about choosing the right instance size on this page.

 

Installation

How do I install Matillion ETL for Redshift?


To install Matillion ETL for Redshift, go to the AWS Marketplace (via aws.amazon.com/marketplace or via your EC2 console), find ‘Matillion ETL for Redshift’ and follow the instructions to launch.

The product is delivered as an AMI, straight into your AWS environment. It’s your instance running in your VPC.

You can go straight to our Marketplace page here and launch the product immediately. There is a free 14 day trial.

Note: Follow the instructions provided in ‘Launching the Product’ to ensure your instance is correctly setup. It’s a simple process which takes less than 5 minutes.

Note: If, for some reason, you cannot use the AWS Marketplace (e.g. you are in a locked-down environment) you can contact support@matillion.com and we can arrange a direct install.

Do I require dedicated hardware or AWS infrastructure for Matillion ETL for Redshift?


No. The product runs in a self-contained AMI which includes all the functionality you need to run the product. The only other thing you need is Redshift (obviously).

Which AWS services do I require to support Matillion ETL for Redshift?


There are no services required to run Matillion ETL for Redshift apart from the AMI (and consequently no Cloud Formation template or setup prerequisites are required either).

Of course it’s highly likely that you’ll want a Redshift cluster setup. This/these can be setup before or after you install Matillion.

 

Security/Backups

Is Matillion ETL for Redshift secure?


Yes. It runs inside your VPC, just like any other AMI. It’s scanned and audited by AWS before listing on the AWS Marketplace, to ensure it complies with AWS rules and is free from viruses and malware. It supports SSL connections to Redshift and source databases and you can configure user security for the product using an authentication mechanism of your choice.

Do I have to connect my systems to Matillion or share data with you, to use Matillion ETL for Redshift?


No. Matillion ETL for Redshift is delivered via AMI. You get your own dedicated instance running in your own EC2 account/VPC. You have root access to the instance that Matillion runs on and no-one (including Matillion and AWS) has any access to that instance.

How can I be sure your software is free of viruses/malware or malicious code, or code that allows you to see my data?


AMIs undergo both automated and manual scans, conducted by AWS, before being made available on the AWS Marketplace, to ensure compliance with AWS policies. The relevant policy includes:

AMIs MUST NOT contain any known vulnerabilities, malware or viruses. 2. AMIs MUST NOT contain default passwords, auth keys, key pairs, security keys or other credentials for any reason. All instance authentication must use key pair access rather than password based auth, even if the password is generated, reset or defined by the user at launch. 3. Marketplace AMIs must not allow password authentication. Disable password authentication via your sshd_config file by setting the PasswordAuthentication to no. 6.2. Accessibility 1. Linux-based AMIs MUST lock/disable root login and allow only sudo access through a user account (not “root”). Sudo allows you to control which users are allowed to perform root functions and logs the activity so that there is an audit trail. 2. AMIs MUST allow OS-level administration capabilities to allow for compliance requirements, vulnerability updates and log file access. For Linux-based AMIs this is through SSH, and for Windows based AMIs this is normally through RDP. 3. Linux-based AMIs MUST NOT have blank or null root passwords. 4. AMIs MUST NOT contain Authorized Passwords or Authorized Keys 5. AMIs MUST NOT use default passwords for user interface access. It is recommended to use a randomization process such as using the instance_id from the AWS EC2 Metadata Service. 6. Windows-based AMIs MUST a. Use the most recent version of Ec2ConfigService b. ENABLE “Ec2SetPassword”, “Ec2WindowsActivate” and “Ec2HandleUserData” c. Remove Guest Accounts or Remote Desktop Users (none are allowed) 7. The seller MUST NOT maintain access to the customer’s running instances. The customer has to explicitly enable any outside access, and any accessibility built into the AMI must be off by default.

 

Pricing/Commercial

How much does Matillion ETL for Redshift cost?

Matillion ETL for Redshift is cost effective and easy to pay for, hourly or annually, via your existing AWS account.  There are 3 different instance sizes to choose from, each of which supports a maximum number of concurrent users, a maximum number of saved environments and offers different performance.  Larger instances deliver higher performance, particularly data ingestion performance.

You can pay for Matillion ETL for Redshift by the hour, or annually.  Hourly billing charges whenever the Matillion ETL for Redshift instance is started in your EC2 console and is ideal for proof of concepts, trials and projects where you do not need to use the instance all the time.  Annual customers pay a one-off annual subscription for unlimited use of the Matillion ETL for Redshift instance and enjoy direct access to Matillion’s support engineers via phone and e-mail.  Annual subscriptions are ideal for production workloads, as it works out cheaper than running an hourly instance 24 x 7. Also where a direct relationship is required or valuable.

Hourly prices start from $1.37 p/hour.  For comprehensive information on pricing, visit the pricing page.

 

Where do I buy Matillion ETL for Redshift?


You can launch and buy Matillion ETL for Redshift directly from the AWS Marketplace and billing is via your existing AWS account – no need to re-enter credit card details.  If you are unable to transact via AWS MArketplace, contact sales@matillion.com.

How do I pay for Matillion ETL for Redshift?


Billing for AWS Marketplace products, like Matillion ETL for Redshift, is via your existing AWS account. You do not need to re-enter credit card details. Matillion ETL for Redshift will appear as a line item on your AWS invoices, as for any other AWS service.

What instance size do I need?


You can choose an instance size based on your requirements. You can read about choosing the right instance size on this page, but in shorthand: m3.medium supports teams of 1-2 ETL developers (ideal for POCs). m3.large supports teams of up to 5 ETL developers and m3.xLarge is for teams of up to 12 ETL developers. Larger instances also load data at faster rates.

How many instances do I need?


You only need one instance of Matillion ETL for Redshift (and one instance can connect to multiple Redshift clusters), however you may choose to use more than one instance, for example to support two separate projects, or an instance for development and another for production. Purchase extra instances as required via the AWS Marketplace, either hourly or annually.

If you’re unsure, contact support@matillion.com for advice.

 

Support

What support is available with the product?


Matillion provide world class support to our Free Trial, Hourly and Annual Customers including monitored forums, detailed documentation and component guides, quick start guides, FAQs, and instructional videos. Free Trial and Annual customers may also get in touch via email and phone.

Access the support portal here or email support@matillion.com.  Read more about services and support here.

I need assistance using the product. Do you offer services?


Yes. We have a services team available for training, setup, proof-of-concepts and full builds.  Read more about services and support here.

Is there a service partner network for Matillion ETL for Redshift?


Yes, if you would like to be introduced to a Matillion services partner, or are interested in becoming a partner, e-mail us at sales@matillion.com