In this article we will discuss Paging in the context of a REST API and how Matillion can help you download data from your REST APIs as manageable chunks, making the most of resources available at your disposal.
It is almost never a good idea to return all available data at once. Returning large datasets over HTTP requires a stable connection, large compute resources to service this request as well as sufficient memory and processing power on the client to process this data. Also, most REST APIs are designed to support 3rd party applications which display a pageful of information and not for bulk download of data.
Consequently, REST APIs (not all) put limits on the amount of data that can be downloaded with each request. They provide some sort of pagination mechanism, using which we can download the data in chunks. Please read your API’s documentation to understand how paging (if supported) is implemented.
This is also beneficial from Matillion’s point of view. Typically, Matillion has access to only a portion of the Memory available on the instance. The data downloaded via a REST call is held in memory, processed and then streamed to Amazon S3 to be subsequently loaded into Amazon Redshift. Large datasets returned from REST calls may eat-up available memory, starving other components in the job and causing all sorts of issues. Implementing paging and downloading smaller datasets helps Matillion stream data to Amazon S3 as soon as it’s available, making efficient use of available memory.
Most APIs support pagination by allowing you to specify certain parameters in the URL (GET) or in the request body (POST) to suggest the portion of data to download. It is common to use the parameters offset and limits to control the amount of data that is to be returned.
The first HTTP requests are usually simple. Subsequent requests expect you to specify some additional parameters which are –
1. Derived from the output of the first (or subsequent) request
2. Calculated based on some criteria
API Query component and Paging
The API Query component in Matillion is a popular choice when dealing with REST APIs (the other choice being the Python Script component). In this article we will see examples of paging based on the two scenarios mentioned above.
The API Query component is versatile in that it provides mechanisms to build complex REST requests. This can be achieved in the context of a single RSD definition where we specify how the component should build a HTTP request to get/download the next set of data as well as when to stop. The component can issue multiple HTTP requests to download all available data and then load it into Amazon Redshift.
To support automatic paging, introduce pseudo column rows@next in your list of columns and set the rows@next attribute to any information needed to make the request for the next page of data. When this value is set in the output, the driver will automatically call the RSD script again with the rows@next value in the input after it is finished returning results for this page. You can use the value of this input to modify the request on the next pass to get the next page of data.
Below is a basic workflow of the API Query component. To support paging, we introduce a pseudo-column rows@next and assign a value to it if there is more data available.
Step 3 is crucial in that it allows us to inspect incoming data and assign a value to rows@next. At Step 4, we decide if another iteration is necessary.
Some common Paging Patterns
Let’s look into two paging patterns and see how to address these scenarios to download available data.
The First example (Desk.com), we work with an API call where paging is controlled by the server and the address to the next page is part of the data returned by a call.
The next example (Netsuite RESTlets) works with a HTTP call where we control the amount of data returned with each request and quit when there is no more data to download.
Desk.com is an online customer service software and support ticket help desk application for small businesses and fast-growing companies. Their REST API (here) by default returns 50 results per request which can be increased to 100 results per page.
There is already some documentation on implementing paging with this API. Please read more about it at the following link – Paging with the API Query Component
The URL for the initial request is hard-coded, but that of the subsequent requests is built using the value in rows@next, which in-turn was set by parsing the JSON returned in the previous request.
Next, we review a Netsuite RESTlet powered by a custom function which always returns new/updated records from the previous day.
RESTlet is a script type where it is used for read/create/update records in NetSuite. It takes JSON string as input data and processes the required action in the NetSuite system.
The RESTlet was designed to allow control over the amount of data returned with each request. Each request specifies offsets – min, max – which control the amount of data to be returned. Assuming we have 50,000 records to be returned, we issue multiple HTTP requests with the following min/max offset values in the body, downloading 5000 rows per request.
The last request (#11) will not return any data and the driver stops paging.
Here’s the RSD to deal with this RESTlet – Download RSD file
Let’s look into the aspects of the RSD-script that support paging.
- Start by declaring the pseudo column rows@next – line 12
- Declare and initialise attributes startpoint and endpoint – lines 33 to 40. The block is executed on the first-request as rows@next is null. Subsequent requests will initialise startpoint and endpoint based on the value in rows@next.
<rsb:set item=”userns” attr=”startpoint” value=”[_input.Rows@Next]”/>
<rsb:set item=”userns” attr=”endpoint” value=”[userns.startpoint | add(5000)]”/>
<rsb:set item=”userns” attr=”startpoint” value=”0″/>
<rsb:set item=”userns” attr=”endpoint” value=”5000″/>
- Lines 42-48 set the data attribute which is used to pass information with a POST request.
- The script in the block is executed for each row derived from the downloaded JSON/XML. This is usually the best place to manage rows@next as we have access to the data returned. In our example we do not depend on the data itself but use this block to increment the count for each row and arrive at an appropriate value in rows@next to help with next iteration. The following lines increment startpoint for each row in the returned JSON and subsequently assign it to rows@next. By the time we finish executing the block, rows@next will have the appropriate value
<!– Increment the total row counter –>
<rsb:set item=”userns” attr=”startpoint” value=”[userns.startpoint | add(1)]”/>
<!– Set Rows@Next to invoke paging –>
Other points of interest –
- The URL attribute (line 16) is hard-coded and is not dynamic – unlike the previous example for desk.com.
- We are issuing a HTTP POST (line 30) by setting the “method” attribute which usually defaults to GET. A HTTP POST typically passes data via the body rather than the URL.
Hopefully this article would have given you a general idea of how to implement paging with the API Query component.
Paging is a common design-pattern with modern REST API’s. The semantics generally differ and your use-case may require a completely different approach.
Please get in touch (email@example.com) if you need help implementing paging for your API.
Download our free eBook below for loads of useful advice and best practices on how to optimize your Amazon Redshift setup