
Test network connectivity using Python and Ncat

In TCP/IP networking, data exchange happens using IP addresses and ports. An IP address is a unique 32-bit (IPv4) or 128-bit (IPv6) identifier that gets assigned to every device in a network. A port is a number between 0 and 65535 that allows servers to run more than one service at a single IP address.

Not all paths of communication are allowed! Network firewalls exist to enforce security by constantly monitoring and controlling incoming and outgoing data. Firewalls are a fundamental way to prevent unauthorized access while allowing legitimate communication.

In a large TCP/IP network, such as your company's data plane, or the internet itself, there are several further considerations:

  • Network routes - deciding the paths along which data should travel from source to destination. Routers consult routing tables, which routing protocols keep up to date, to choose paths based on metrics such as distance, speed, and cost.
  • Public vs private IP addresses - especially relevant whenever data transitions between a virtual private cloud (VPC) and the public internet.
  • The Domain Name System (DNS) - which translates human-friendly hostnames into IP addresses. DNS allows you to type "google.com" into your web browser rather than having to remember one of its IP addresses, such as 74.125.193.139.
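The last two points can be checked from Python using only the standard library. The sketch below resolves a hostname with socket.getaddrinfo and uses the ipaddress module to report whether each resulting address is private. The function name resolve_and_classify is made up for this illustration, and the answer for a real hostname depends on the DNS resolver of the machine where the code runs.

```python
import ipaddress
import socket


def resolve_and_classify(host):
    """Resolve host and label each distinct IP address as private or public.

    Hypothetical helper for illustration; not part of any library.
    """
    results = {}
    # getaddrinfo performs the DNS lookup (or simply parses a literal IP address)
    for _family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        host, None, type=socket.SOCK_STREAM
    ):
        # Drop any IPv6 scope suffix such as "%eth0" before parsing
        ip_text = sockaddr[0].split("%")[0]
        ip = ipaddress.ip_address(ip_text)
        # Note: is_private also covers loopback and other reserved ranges
        results[ip_text] = "private" if ip.is_private else "public"
    return results


# A literal address needs no DNS lookup, so this runs anywhere
print(resolve_and_classify("10.1.2.3"))
```

Passing a real hostname instead of a literal address shows which kind of address your current network location resolves it to.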

The first step in any data engineering process (ETL or ELT) is getting hold of the source data. This usually requires connecting to a source system over a TCP/IP network. Large networks are in a state of constant change, so for DataOps, it's very important to be able to test network connectivity simply and reliably.

Network Connectivity for Data Engineering

Here are some of the most common network data sources that every data practitioner has to deal with regularly:

  • Database Servers - for example, Cloud Data Platforms, Cloud Data Warehouses, MySQL, PostgreSQL, Microsoft SQL Server, Oracle, and MongoDB. Network connectivity is often implicitly tied in with a database access layer, such as JDBC. Hostnames and port numbers may be buried inside long connection strings, or left out entirely and implied by driver defaults.
  • SaaS Applications - hosting and running business applications, often connected to backend databases, providing APIs for automated data access.
  • Cloud Services - a huge variety, including cloud storage (Amazon S3, Azure Blob Storage, Google Cloud Storage), web pages, serverless functions (AWS Lambda, Azure Functions, and Google Cloud Functions), through to generative AI large language models such as those offered by OpenAI.
  • File/FTP/SFTP Servers - to exchange files from centralized storage using File Transfer Protocol (FTP) or SSH File Transfer Protocol (SFTP).

With all these sources, any problems with network connectivity usually manifest as intermittent, cryptic "timeout" errors. It can often appear as though the data pipeline logic is wrong or the credentials are incorrect - when, in fact, the underlying problem is the network.

Furthermore, the default timeout for network operations is often an unwieldy 60 seconds or more. Time-sensitive pipelines may simply hang (perhaps inside a loop) because of a connection problem that could have been quickly diagnosed in advance.
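One way to make these failures less cryptic is to probe the endpoint with a short, explicit timeout and report which kind of failure occurred. The following is a minimal sketch using the standard socket module. The function name probe and its status labels are assumptions chosen for this illustration.

```python
import socket


def probe(host, port, timeout=3.0):
    """Attempt one TCP connection and return a short status string.

    Hypothetical helper for illustration; the status labels are arbitrary.
    """
    try:
        # create_connection resolves the host and applies the timeout to connect()
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timeout"        # often a firewall silently dropping packets
    except socket.gaierror:
        return "dns-failure"    # the hostname could not be resolved
    except ConnectionRefusedError:
        return "refused"        # host reachable, but nothing listening on the port
    except OSError:
        return "unreachable"    # any other network-level failure
```

Calling something like this at the start of a pipeline takes at most a few seconds, and separates a network problem from a problem with credentials or pipeline logic.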

How to Test Network Connectivity Using Python

Python's built-in "socket" module is perfect for the simple, foundational task of verifying network connectivity. It supports a huge range of functionality: both client- and server-side applications, several protocols, plus connection management and bidirectional data transmission. This test needs only a very small part of that.

I'll use Amazon Redshift as an example. You might find from your AWS console that the JDBC URL of your provisioned cluster looks something like this:

jdbc:redshift://some-name.abcdef123456.eu-west-1.redshift.amazonaws.com:5439/dbname

Both the hostname and the port are present in the string:

  • Hostname: some-name.abcdef123456.eu-west-1.redshift.amazonaws.com
  • Port: 5439
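If you would rather not pick these values out by hand, they can be extracted with Python's urllib.parse. This is a sketch that assumes the URL-style layout shown above (jdbc:, then a driver name, then //host:port/...). Other JDBC drivers use different layouts that it will not handle, and the function name host_and_port_from_jdbc is made up for this example.

```python
from urllib.parse import urlsplit


def host_and_port_from_jdbc(jdbc_url):
    """Extract (hostname, port) from a URL-style JDBC string.

    Hypothetical helper; assumes the 'jdbc:<driver>://host:port/...' layout.
    """
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError("Expected a string starting with 'jdbc:'")
    # Drop the 'jdbc:' prefix so that what remains is an ordinary URL
    parts = urlsplit(jdbc_url[len("jdbc:"):])
    # parts.port is None when the URL does not state a port explicitly
    return parts.hostname, parts.port


url = "jdbc:redshift://some-name.abcdef123456.eu-west-1.redshift.amazonaws.com:5439/dbname"
print(host_and_port_from_jdbc(url))
```

Either way, the hostname and the port are the only two values the connectivity test needs.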

This is enough information to write the Python code:

import socket

jv_host = "some-name.abcdef123456.eu-west-1.redshift.amazonaws.com"
jv_port = 5439

print(f"Host: {jv_host}")
print(f"Port: {jv_port}")

if jv_host is None:
    raise Exception("Host must be supplied")

if jv_port is None:
    raise Exception("Port must be supplied")

# Validate the port before touching the network
try:
    port = int(jv_port)
except (TypeError, ValueError) as px:
    raise Exception("Port must be an integer") from px

# Try a TCP connection to an IPv4 address, with 5 second timeout
sk = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sk.settimeout(5)

try:
    sk.connect((jv_host, port))
except socket.timeout as tx:
    raise Exception("Timed out") from tx
except socket.gaierror:
    # The hostname could not be resolved to an IP address
    print("Host address error")
    raise
finally:
    # Always release the socket, whether or not the connection succeeded
    sk.close()

print("Connection test successful")

The script creates a socket object with two important parameters:

  • The AF_INET address family - which selects IPv4 and will accept either a hostname (as above) or a dotted IP address. If you supply a hostname, DNS will translate it into either a public or a private IP address, depending on where your script is running. This is important for security: when you connect from a VPC network that has a route to this database, and the hostname resolves to the private address, no data will be transferred over the public internet.
  • SOCK_STREAM - which selects TCP, a connection-oriented, two-way byte stream (you submit SQL, and get a response), even though this script doesn't actually transfer any data.

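If the Python script runs to the end without raising an exception and prints "Connection test successful", the network connectivity test has passed. Any other failure, such as a refused connection, surfaces as an uncaught exception.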
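Because AF_INET restricts the test to IPv4, a variation is to ask the resolver for every candidate address and create each socket with the family it reports. The sketch below does that with socket.getaddrinfo and AF_UNSPEC. The function name can_connect is an assumption for this illustration, and a hostname that cannot be resolved still raises socket.gaierror.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection succeeds over any resolved address.

    Hypothetical helper for illustration; tries IPv4 and IPv6 candidates.
    """
    # AF_UNSPEC asks the resolver for both IPv4 and IPv6 results
    candidates = socket.getaddrinfo(
        host, port, family=socket.AF_UNSPEC, type=socket.SOCK_STREAM
    )
    for family, socktype, proto, _canonname, sockaddr in candidates:
        try:
            # The context manager closes the socket even if connect() fails
            with socket.socket(family, socktype, proto) as sk:
                sk.settimeout(timeout)
                sk.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next candidate address
    return False
```

Returning a boolean rather than raising makes the check easy to use as a guard at the start of a pipeline.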

Testing Network Connectivity Using Ncat

If you're using a Linux server as part of your data orchestration, and have shell access to it, you will find a very useful alternative to Python scripting in the form of the Ncat utility.

Ncat is a great tool for network debugging and can be used to test connectivity to any server over specified ports. Ncat is part of the Nmap suite and is extremely versatile.

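Here's the Ncat command to run the same test as shown in Python in the previous section. Note that on some Linux distributions the nc command is installed as a symbolic link to ncat, in which case you can use either. On others, nc is a different netcat implementation whose options may differ.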

ncat -z -w 5 -v "some-name.abcdef123456.eu-west-1.redshift.amazonaws.com" 5439

The option flags set in this command are:

  • -z meaning zero I/O mode, which just reports the connection status without sending any data
  • -w sets the connection timeout in seconds (5)
  • -v sets a medium verbosity level

You can optionally increase the verbosity level by using -vv instead. This will also report the IP address of the database server, which allows you to check if it's using the public or the private one.

By the usual Linux convention, a zero exit status means the command succeeded, so here it means the network connectivity test has passed.
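If your orchestration is written in Python but you still want to rely on Ncat, the exit status can be checked with the subprocess module. This is a sketch that assumes the ncat executable is on the PATH. The function names build_ncat_command and ncat_check are made up for this example.

```python
import shutil
import subprocess


def build_ncat_command(host, port, timeout_seconds=5):
    """Build the argument list for the zero-I/O Ncat check shown above."""
    return ["ncat", "-z", "-w", str(timeout_seconds), "-v", host, str(port)]


def ncat_check(host, port, timeout_seconds=5):
    """Run Ncat and return True when its exit status is zero.

    Hypothetical wrapper for illustration; requires ncat on the PATH.
    """
    if shutil.which("ncat") is None:
        raise FileNotFoundError("ncat is not installed or not on the PATH")
    completed = subprocess.run(
        build_ncat_command(host, port, timeout_seconds),
        capture_output=True,  # keep Ncat's verbose messages out of the console
        text=True,
    )
    return completed.returncode == 0
```

Passing the arguments as a list, rather than as a single shell string, avoids any quoting problems with unusual hostnames.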

Summary

By running the scripts and commands shown in this article, data engineers can quickly determine the reachability of target servers.

Python and Ncat both offer a robust way to verify network connectivity, which helps with the runtime validation and swift debugging of many vital data engineering tasks.

Want something simpler? Matillion users have access to low-code versions of these network utilities, implemented via a graphical drag-and-drop interface. For a preview, take a look at the downloadable Check Network Access for Matillion Designer and similarly for Matillion ETL.

You can try out the Matillion Data Productivity Cloud at zero cost for 14 days and build your own graphical data integration pipelines on your own data.

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell
