Improving Python Efficiency in Matillion: Passing through Matillion Variables
In our blog, “Improving Python Efficiency in Matillion: Offload Large Python Scripts”, we talked about Python scripts that are better run remotely using SSH, AWS Lambda or AWS Glue. Often, part of the Python script needs to be dynamic. Matillion variables can achieve this with any of the approaches in that blog. Here we look at how to apply them to the SSH example.
In the SSH example discussed in the earlier blog, the S3 bucket and file names were hard-coded, but it is unlikely that these will always stay the same. It is also impractical to edit the Python script every time the file or bucket name changes. Instead, job variables can be set up in the Matillion job and referenced in the ‘heredoc’, which pushes the variable values to the remote server.
These variables are then referenced in the script using the `$varname` syntax.
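The effect of the `$varname` substitution can be sketched in plain Python. This is only an analogy using the standard library's `string.Template`, which happens to share the `$varname` syntax; it is not how Matillion or the shell performs the substitution. The variable names `bucket_name` and `file_name` and the script body are hypothetical.

```python
from string import Template

# Hypothetical script body as it might appear inside the heredoc.
# bucket_name and file_name are assumed job variable names.
HEREDOC_BODY = Template(
    'bucket = "$bucket_name"\n'
    'key = "$file_name"\n'
)


def render(job_variables: dict) -> str:
    """Return the script text with every $varname placeholder filled in."""
    # substitute() raises KeyError when a referenced variable is missing,
    # which surfaces a misnamed variable before anything runs remotely.
    return HEREDOC_BODY.substitute(job_variables)


if __name__ == "__main__":
    print(render({"bucket_name": "example-bucket", "file_name": "example.csv"}))
```

The remote server only ever sees the rendered text, so the Python script itself contains no reference to Matillion.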
The script has also been edited to return the bucket and filename created.
The output from the script run can also be written to an output file, which another Matillion component can then read and use to set another variable's value if required.
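One way to make that output easy to read back is to write it as a small JSON file. The sketch below is an assumption about how this could be done, not the approach used in the original job; the function names and the `result.json` file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_result(path: Path, bucket: str, filename: str) -> None:
    """Remote side: record what the script created as one JSON object."""
    path.write_text(json.dumps({"bucket": bucket, "filename": filename}))


def read_result(path: Path) -> dict:
    """Reading side: load the values back so they can be assigned to variables."""
    return json.loads(path.read_text())


if __name__ == "__main__":
    # A temporary directory stands in for wherever the output file is stored.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "result.json"
        write_result(out, "example-bucket", "example.csv")
        result = read_result(out)
        print(result["bucket"], result["filename"])
```

If the reading component is a Matillion Python Script component, the loaded values could then be assigned with `context.updateVariable`; check the Matillion documentation for your version before relying on it.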
In general, we recommend avoiding writing Python scripts in Matillion where possible. Instead, try to use components to push the processing down to the target data warehouse using an ELT approach. However, there may be circumstances where this isn’t possible because of the data used or the nature of the transformation.
In those circumstances, the Python script can be run remotely. We recommend writing one generic script that can be called with different variables, so that the script does not need to be modified continually. In our blog, “Improving Python Efficiency in Matillion: Generating Python File”, we demonstrate how to make the script dynamic.
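As a minimal sketch of what a generic script could look like, the example below takes the bucket and file name as command-line arguments instead of hard-coding them. The argument names `--bucket` and `--filename` are assumptions, and the processing step is left as a placeholder.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Read the values that change between runs from the command line."""
    parser = argparse.ArgumentParser(description="Generic file-processing script")
    parser.add_argument("--bucket", required=True, help="S3 bucket name")
    parser.add_argument("--filename", required=True, help="Name of the file to process")
    return parser.parse_args(argv)


def main(argv=None) -> str:
    args = parse_args(argv)
    # Real processing would go here; this sketch only builds and reports
    # the location so the caller can capture it.
    location = f"s3://{args.bucket}/{args.filename}"
    print(location)
    return location


if __name__ == "__main__":
    main()
```

Assuming the script is saved on the remote server under a hypothetical name such as `generic_script.py`, the SSH command could call it as `python3 generic_script.py --bucket "$bucket_name" --filename "$file_name"`, so only the variable values change between runs.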