Python UDFs in Amazon Redshift
February 01, 2016 | Elise Breda (Yhat)
Elise Breda is the Growth Strategist at Yhat, a tech company that builds software for deploying predictive analytics to mobile and web apps. When she isn’t at their awesome new office in DUMBO, she can be found exploring the bike paths and Thai restaurants of NYC.
In the Beginning…
In the early 2000’s, the behemoth collection of cloud computing services we now know as Amazon Web Services (AWS) was little more than sparks firing in Chris Pinkham and Benjamin Black’s neurons. In 2003, the two presented a paper (blog post here) outlining a radical vision for a retail computing infrastructure that would be standardized, automated and built upon web services. The next year, Simple Queue Service, the first AWS service for public usage, was launched.
Fast forward almost a decade, and AWS is now the most commonly used cloud platform among enterprise software developers. AWS products span the gamut of web services, from computation (e.g. EC2) to networking (e.g. VPC) and storage (e.g. S3). In this post we’ll explore a small fraction of a fraction of the AWS ecosystem–a database that’s generating all kinds of groundswell right now: Amazon Redshift.
The rest of this post will talk about Redshift at a high level and then dive into a mini overview of User Defined Functions (UDFs), how they work, why they’re great, and how to start using them.
Amazon Redshift is a hosted data warehouse that’s accessible and easy to set up, built for speed, and suitable for a variety of tasks that combine and store data or involve compute-heavy analytics.
Two things make Redshift particularly attractive. First, Redshift can handle insane amounts of data–it is a petabyte-scale warehouse. A petabyte is a lot (10^15 bytes) of data. As a point of reference, the entire master catalog of Netflix video in 2013 amounted to about 3.14 petabytes of storage space (interesting read on Quora). Second, unlike Amazon’s other hosted database product, Amazon RDS, Redshift stores data in a columnar format. Column orientation is good for tables containing columns with lots of repeated values (i.e. Credit Card Names, County/State, Product Type, etc.), like CRM data. The benefit of columnar data is that because it’s uniform, there are opportunities for storage size optimization via compression. You can read more about how to maximize compression here.
Redshift handles large-scale, column-oriented datasets using massively parallel processing (MPP), coordinating computations across a large number of processors in parallel, which makes it a fast and powerful data warehouse option.
Data Warehouse Setup in the Data Age
Even just a few years ago, getting a data warehouse and proper ETL processes in place was a long, painful and probably very expensive ordeal. But we’ve arrived in the data age where easy-to-use, affordable data solutions are bountiful.
At Yhat, we use Redshift to warehouse everything–CRM data (we use SFDC), product data, site metrics from Google Analytics, and data from a bunch of other sources. It took us about 20 mins to set up the database on AWS, and it took us…wait for it…another 20 mins or so to set up all of our ETL using Fivetran, which we couldn’t be more impressed with.
SQL IDE Done Right
Most SQL IDEs of yesteryear leave something to be desired in terms of UX. The majority are clunky and have super old school frankenstein UIs. Why they all focus on exploring the DB schema rather than on making it easy to write queries, view results, and think critically about your data has always been a mystery.
Well, those days are also over. Wagon is the query-focused SQL app I’ve been looking for for years. Wagon boasts a clean UX designed for analysts. Features are carefully chosen with a keen eye for usability for people writing tens or hundreds of queries per day. Wagon gets it in spades.
Overview of Python UDFs in Redshift
UDF stands for user-defined function, meaning that you can add functions to an environment (in this case, Redshift) in addition to those that come built in. Python UDFs allow you to combine the power of Redshift with what you know and love about the Python programming language without switching between IDEs or systems.
The great thing about UDFs in Redshift is that Redshift will automatically execute them using its MPP architecture. One caveat to keep in mind is that your Python code still won’t execute as quickly as native SQL functions (MAX, etc.) that are baked into the database.
How to Use UDFs
You can certainly work with text in pure SQL, but some tasks are just easier to do in a scripting language like Python instead. Here’s a toy example to illustrate how to use Python functionality within Redshift using a UDF.
Suppose a column in one of our tables contains huge chunks of text or html, and we’re interested in finding any email addresses within any one record. Let’s write a function that will take in raw text and return a pipe (|) separated string containing any email addresses found within the input text document. Define the function like so:
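A sketch of such a UDF (the function name and the email regex are my own choices, not a canonical pattern; adapt them as needed):

```sql
CREATE OR REPLACE FUNCTION f_extract_emails (raw_text VARCHAR(MAX))
RETURNS VARCHAR(MAX)
STABLE
AS $$
    import re
    if raw_text is None:
        return None
    # naive email pattern; tighten it for production use
    pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
    return '|'.join(re.findall(pattern, raw_text))
$$ LANGUAGE plpythonu;
```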
Once defined, you can use it like this:
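For instance, against a hypothetical documents table with a body column (table and column names are placeholders):

```sql
SELECT id,
       f_extract_emails(body) AS emails
FROM documents;
```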
This is a scalar function, so it’ll return one record for each input row (i.e. not an aggregate function). One thing to remember is that your UDFs are per-database, meaning that if you have multiple in your Redshift cluster, you’ll need to define your functions in each database.
Redshift Python UDFs are based on Python 2.7 and come preloaded with a lot of our favorite libraries, including NumPy, SciPy and Pandas. You can also import custom modules from S3 and the web.
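Importing a custom module is done with CREATE LIBRARY before referencing it in a UDF. A sketch, with a placeholder bucket, module name, and IAM role:

```sql
-- my_module.zip and the bucket/role ARN below are hypothetical placeholders
CREATE OR REPLACE LIBRARY my_module
LANGUAGE plpythonu
FROM 's3://my-bucket/my_module.zip'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftS3Read';
```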
Here’s the template published on the AWS blog that you can use to start creating your own scalar functions:
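The template looks roughly like this (function name, arguments, and types are placeholders to fill in):

```sql
CREATE OR REPLACE FUNCTION f_function_name (arg1 data_type, arg2 data_type)
RETURNS data_type
STABLE
AS $$
    # Python program: compute a value from arg1 and arg2
    return result
$$ LANGUAGE plpythonu;
```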
The scalar UDFs that you create will return a single result value for each input value. Once you’ve defined a UDF, you can use it in any SQL statement.
To learn more about Python UDFs in Redshift, check out Amazon’s documentation, which is super helpful and covers everything from constraints to security and Python support. You can also check out the initial release blogpost and a more extensive post that uses UDFs to analyze the CMS Open Payments data set.
Yhat’s flagship product, ScienceOps, empowers teams of data scientists to deploy their models into web and mobile applications. These models are embedded into production applications via REST APIs without any recoding from their native statistical language.